Chapter 11: Estimating Models with Unknown Dispersion Parameters

library(glmbayes)

1. Introduction

Earlier chapters developed Bayesian linear and generalized linear models under the assumption that the dispersion parameter, denoted by \(\phi\) (or equivalently the residual variance \(\sigma^{2}\)), was known. For Gaussian models this corresponds to fixing the variance of the error term, and for quasi‑families it corresponds to fixing the scale parameter.

This chapter extends the framework to unknown dispersion parameters—a standard topic in Bayesian linear and generalized linear models (O’Hagan and Forster 2004; Gelman et al. 2013)—showing how glmbayes supports:

  1. Priors on the dispersion only
  2. Joint conjugate Normal–Gamma priors
  3. Independent Normal–Gamma priors
    • iid sampling via truncated Gamma priors
    • two‑block Gibbs sampling

Full derivations for Gaussian dispersion, shape, and rate calibration from Prior_Setup() / compute_gaussian_prior()—and how those objects connect to conjugate and independent Normal–Gamma priors—are given in Chapter A12 (https://knygren.r-universe.dev/articles/glmbayes/Chapter-A12.html).

All Gaussian examples use the Dobson plant‑weight data (Dobson 1990), first introduced in Chapter 02.

## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

2. Exponential-Family Background and Log-Concavity

Many likelihoods used in generalized linear models belong to the exponential family.
With observation weights \(w_1,\dots,w_n \ge 0\), the weighted exponential-family density is

\[ f(y \mid \theta,\phi,w) = \exp\left\{ \sum_{i=1}^{n} w_i \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right] \right\}. \]

Here \(\theta_i\) is the canonical parameter for observation \(i\), \(b(\cdot)\) is the cumulant function, \(a(\phi)\) is the dispersion function, and \(c(y_i,\phi)\) is a normalizing term that depends on the data and the dispersion but not on \(\theta_i\).

The weighted log-likelihood is therefore

\[ \ell(\beta,\phi) = \sum_{i=1}^{n} w_i \left[ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right]. \]

This is the form used by both glm() and the Bayesian functions
glmb(), lmb(), rglmb(), and rlmb().


2.2 Special Case: Weighted Gaussian Linear Regression

Consider the weighted Gaussian model

\[ y_i \mid \beta,\phi \sim N(x_i^\top\beta,\;\phi/w_i), \qquad i=1,\dots,n. \]

Let \(W = \mathrm{diag}(w_1,\dots,w_n)\) and define the precision

\[ \tau = \frac{1}{\phi}. \]

Weighted log-likelihood

The log-likelihood (up to constants) is

\[ \ell(\beta,\phi) = -\frac{1}{2\phi} \sum_{i=1}^{n} w_i (y_i - x_i^\top\beta)^2 \;-\; \frac{1}{2} \sum_{i=1}^{n} w_i \log(2\pi\phi). \]

Equivalently, in precision form:

\[ \ell(\beta,\tau) = -\frac{\tau}{2} \sum_{i=1}^{n} w_i (y_i - x_i^\top\beta)^2 \;-\; \frac{1}{2} \sum_{i=1}^{n} w_i \log\left(\frac{2\pi}{\tau}\right). \]
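
As a quick sanity check, the two parameterizations can be evaluated side by side. The helper functions below are illustrative throwaway code (not part of glmbayes), using local names Xm, yy, ww so they do not clash with objects defined elsewhere in the chapter.

## Illustrative check: the phi-form and tau-form of the weighted Gaussian
## log-likelihood agree when tau = 1/phi.
loglik_phi <- function(beta, phi, y, X, w) {
  r <- y - drop(X %*% beta)
  -sum(w * r^2) / (2 * phi) - 0.5 * sum(w) * log(2 * pi * phi)
}
loglik_tau <- function(beta, tau, y, X, w) {
  r <- y - drop(X %*% beta)
  -tau * sum(w * r^2) / 2 - 0.5 * sum(w) * log(2 * pi / tau)
}
set.seed(1)
Xm <- cbind(1, rnorm(8)); yy <- rnorm(8); ww <- runif(8, 0.5, 2)
all.equal(loglik_phi(c(0.3, -0.1), 0.7, yy, Xm, ww),
          loglik_tau(c(0.3, -0.1), 1 / 0.7, yy, Xm, ww))
#> [1] TRUE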

Exponential-family identification

This matches the weighted exponential-family form with

\[ \theta_i = \mu_i = x_i^\top\beta, \qquad b(\theta_i) = \tfrac{1}{2}\theta_i^2, \qquad a(\phi) = \phi, \]

and

\[ c(y_i,\phi) = -\frac{1}{2\phi}y_i^2 -\frac{1}{2}\log(2\pi\phi). \]

Weighted least squares representation

Base R implements weighted least squares via

\[ y^\ast = W^{1/2} y, \qquad X^\ast = W^{1/2} X. \]

The glmbayes functions use the same weighted likelihood, so all theoretical expressions above apply directly.
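
A one-line check of this equivalence with lm(), using illustrative weights ww (the Dobson data are otherwise unweighted):

## Illustrative check: a weighted lm() fit equals an unweighted fit on the
## W^(1/2)-transformed response and design matrix.
set.seed(2)
ww    <- runif(length(weight), 0.5, 2)
fit_w <- lm(weight ~ group, weights = ww)
ys    <- sqrt(ww) * weight
Xs    <- sqrt(ww) * model.matrix(~ group)
fit_s <- lm(ys ~ Xs - 1)
all.equal(unname(coef(fit_w)), unname(coef(fit_s)))
#> [1] TRUE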

Log-Concavity Properties in the Weighted Gaussian Case

  1. Concavity in \(\beta\)
    For fixed \(\phi\), the negative weighted log-likelihood is
    \[ -\ell(\beta \mid \phi) = \frac{1}{2\phi} (y - X\beta)^{T} W (y - X\beta) + \text{const}, \] which is strictly convex in \(\beta\) whenever the \(w_i\) are positive and \(X\) has full column rank.
    Therefore the weighted log-likelihood is strictly concave in \(\beta\).

  2. Concavity in precision \(\tau = 1/\phi\)
    Rewriting the weighted likelihood in terms of \(\tau\): \[ \ell(\beta,\tau) = -\frac{\tau}{2} (y - X\beta)^{T} W (y - X\beta) + \frac{1}{2} \left( \sum_{i=1}^{n} w_i \right) \log \tau + \text{const}, \] which is concave in \(\tau\) because the first term is linear in \(\tau\) and the second is concave.

These properties guarantee a unique mode in each conditional and support the log-concave envelope construction used by the samplers in this chapter.

The weighted Gaussian model thus provides the cleanest illustration of how weights enter the exponential-family structure and how they preserve log-concavity.

Implications for Bayesian GLMs

The Gamma log–link model (Section 4) thus shares the same structural advantages as the Gaussian and Poisson cases: the log-likelihood is concave in \(\beta\) for fixed dispersion, and the conditional posterior for the precision remains log-concave.

These properties make the Gamma log–link model an especially clean example of how exponential–family structure supports fast, reliable posterior simulation in glmbayes.

3. Gaussian Model With Unknown Dispersion

For the Gaussian model

\[ y = X\beta + \varepsilon,\qquad \varepsilon \sim N(0,\phi I_{n}), \]

the likelihood is

\[ p(y \mid \beta,\phi) \propto \phi^{-n/2}\exp\!\left(-\frac{S(\beta)}{2\phi}\right), \]

where

\[ S(\beta) = (y - X\beta)^{T}(y - X\beta) \]

is the residual sum of squares.

To support the different prior specifications below, we use the Prior_Setup() function to set and extract default settings for the needed parameters.

ps <- Prior_Setup(weight ~ group)
x  <- ps$x
mu <- ps$mu
V  <- ps$Sigma
y <- ps$y
shape    <- ps$shape
rate     <- ps$rate

Priors for Gaussian Models With Unknown Dispersion

We now specify priors for \((\beta,\tau)\) and derive the corresponding log-priors and log-posteriors.

Throughout, the weighted residual sum of squares is

\[ \mathrm{RSS}(\beta) = \sum_{i=1}^{n} w_i (y_i - x_i^\top\beta)^2. \]


3.1 Prior on the Dispersion Only (Gamma Prior on Precision)

We place a Gamma prior on the precision:

\[ \tau \sim \mathrm{Gamma}(a_0,b_0), \]

with log-prior

\[ \log p(\tau) = (a_0 - 1)\log\tau - b_0\tau + \text{const}. \]

Given fixed \(\beta\), the conditional posterior is

\[ \tau \mid \beta,y \sim \mathrm{Gamma}\left( a_0 + \frac{n_{w}}{2}, \; b_0 + \frac{\mathrm{RSS}(\beta)}{2} \right), \]

where \(n_{w} = \sum_i w_i\).


To illustrate this, we fix \(\beta\) at the full-model coefficient vector from Prior_Setup() (ps$coefficients, the internal GLM maximum-likelihood estimate). We can then generate conditional Bayesian draws for the dispersion as below. This is essentially a Bayesian analogue of the classical gamma.dispersion() function provided by the MASS package.

out_rlmb <- rlmb(
    n = 1000,
    y = y, x = x,
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = ps$coefficients)
  )

The posterior mean of the dispersion is then:

  mean(out_rlmb$dispersion)
#> [1] 0.495487
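
Because the Dobson data are unweighted (\(w_i = 1\), so \(n_w = n\)), the same conditional posterior can also be sampled directly with rgamma(). The sketch below assumes ps$shape and ps$rate play the roles of \(a_0\) and \(b_0\) above:

## Direct draws from tau | beta, y ~ Gamma(a0 + n_w/2, b0 + RSS(beta)/2),
## with beta fixed at ps$coefficients (sketch mirroring the rlmb call above).
set.seed(123)
beta_fix  <- ps$coefficients
rss       <- sum((y - drop(x %*% beta_fix))^2)
n_w       <- length(y)
tau_draws <- rgamma(1000, shape = shape + n_w / 2, rate = rate + rss / 2)
mean(1 / tau_draws)  # dispersion; should be close to mean(out_rlmb$dispersion)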

3.2 Joint Conjugate Normal–Gamma Prior

The conjugate Normal–Gamma prior couples \(\beta\) and \(\tau\) (Lindley and Smith 1972; O’Hagan and Forster 2004):

\[ \beta \mid \tau \sim N\!\left(\mu_0,\;\tau^{-1}\Sigma_0\right), \qquad \tau \sim \mathrm{Gamma}(a_0,b_0). \]

Log-prior

\[ \log p(\beta,\tau) = (a_0 - 1)\log\tau - b_0\tau -\frac{\tau}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0) +\text{const}. \]

Log-posterior

Combining with the Gaussian log-likelihood:

\[ \log p(\beta,\tau \mid y) = -\frac{\tau}{2}\mathrm{RSS}(\beta) -\frac{n_{w}}{2}\log\left(\frac{2\pi}{\tau}\right) +(a_0 - 1)\log\tau - b_0\tau -\frac{\tau}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0) +\text{const}. \]

This yields closed-form posteriors: a Gamma marginal for \(\tau\) and a Normal conditional for \(\beta \mid \tau\) (see Appendix A1), allowing pure i.i.d. Monte Carlo sampling.


The defining property of conjugacy is that the posterior has the same functional form as the prior. In glmbayes, this model can be estimated as below.

## Conjugate Normal_Gamma Prior

lmb.D9_v2 <- lmb(
  weight ~ group,
  pfamily = dNormal_Gamma(
    ps$mu,
    Sigma_0 = ps$Sigma_0,
    shape = ps$shape,
    rate  = ps$rate
  )
)

summary(lmb.D9_v2)$dispersion
#> [1] 0.4837559

3.3 Independent Normal–Gamma Prior

Here the prior factorizes:

\[ \beta \sim N(\mu_0,\Sigma_0), \qquad \tau \sim \mathrm{Gamma}(a_0,b_0), \]

so \(\beta\) and \(\tau\) are a priori independent.

Log-prior

\[ \log p(\beta,\tau) = -\frac{1}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0) + (a_0 - 1)\log\tau - b_0\tau +\text{const}. \]

Log-posterior

\[ \log p(\beta,\tau \mid y) = -\frac{\tau}{2}\mathrm{RSS}(\beta) -\frac{n_{w}}{2}\log\left(\frac{2\pi}{\tau}\right) -\frac{1}{2}(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0) + (a_0 - 1)\log\tau - b_0\tau +\text{const}. \]

A variation on this prior specification—using a truncated Gamma prior for the dispersion—is the model developed in Chapter A07, where the Gamma prior on \(\tau\) is restricted to a finite interval and joint draws for \((\beta,\tau)\) are produced by an envelope accept–reject scheme.

This truncated‑Gamma, non‑conjugate Normal–Gamma model is therefore a direct extension of the independent Normal–Gamma prior described above, but requires the more advanced envelope‑based sampling strategy documented in Chapter A07.
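
For intuition, a truncated Gamma can always be sampled by inverse-CDF inversion. The helper below is a generic illustration only, not the package's internal sampler (which uses the envelope method of Chapter A07):

## Generic truncated-Gamma draws via the inverse CDF (illustrative only).
rtrunc_gamma <- function(n, shape, rate, lo = 0, hi = Inf) {
  p_lo <- pgamma(lo, shape, rate)
  p_hi <- pgamma(hi, shape, rate)
  qgamma(p_lo + runif(n) * (p_hi - p_lo), shape, rate)
}
## Example: keep the lower 99% of a Gamma(2, 4), in the spirit of
## max_disp_perc = 0.99 in the call below.
set.seed(3)
summary(rtrunc_gamma(1000, shape = 2, rate = 4, hi = qgamma(0.99, 2, 4)))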

3.3.1 Independent Normal–Gamma Prior in glmbayes

This prior is implemented in glmbayes using the specification below (with a truncated Gamma replacing the full Gamma prior).

The chunk below is not evaluated when the vignette is built. Precomputed results are instead read from the inst/extdata/ file Chapter11_Dobson_two_sampler.rds. To regenerate that file, run make_Chapter11_Dobson_two_sampler_output.R under data-raw/, from the package source root.

ch11_path <- system.file(
  "extdata", "Chapter11_Dobson_two_sampler.rds",
  package = "glmbayes"
)
stopifnot(nzchar(ch11_path), file.exists(ch11_path))
ch11_saved <- readRDS(ch11_path)
stopifnot(
  ncol(ch11_saved$gibbs_two_block$beta_out) == ncol(ps$x),
  nrow(ch11_saved$indep_norm_gamma$coefficients) == ch11_saved$indep_norm_gamma$n_draws
)
lmb.D9_v3 <- lmb(n = 10000,
  weight ~ group,
  dIndependent_Normal_Gamma(
    ps$mu,
    ps$Sigma,
    shape = ps$shape_ING,
    rate  = ps$rate,
    max_disp_perc = 0.99,
    disp_lower = NULL,
    disp_upper = NULL
  )
)
summary(lmb.D9_v3)$coefficients
summary(lmb.D9_v3)$dispersion
sd(lmb.D9_v3$dispersion)
inc <- ch11_saved$indep_norm_gamma
coef_summ_iid <- data.frame(
  posterior_mean = colMeans(inc$coefficients),
  posterior_SD = apply(inc$coefficients, 2, sd),
  row.names = inc$coef_colnames
)
coef_summ_iid
#>             posterior_mean posterior_SD
#> (Intercept)      5.0342294    0.2191827
#> groupTrt        -0.3731156    0.3080974
cat("Dispersion: posterior mean =", mean(inc$dispersion),
    "  SD =", sd(inc$dispersion), "\n")
#> Dispersion: posterior mean = 0.4843865   SD = 0.1469935

3.3.2 Two-Block Gibbs Sampling using glmbayes

If iid sampling is not desired, or if the user wants to inspect mixing behavior, a simple two-block Gibbs sampler (Robert and Casella 2004) may be used:

  1. Sample \(\beta \mid \tau, y\) from a Normal distribution
  2. Sample \(\tau \mid \beta, y\) from a Gamma distribution

The following chunk is not evaluated during the vignette build (eval = FALSE). Stored draws are loaded from the same RDS as above.

set.seed(180)
dispersion2 <- ps$dispersion

## Burn-in
for (i in 1:1000) {
  out1 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2)
  )
  out2 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = out1$coefficients[1, ])
  )
  dispersion2 <- out2$dispersion
}

## Sampling
beta_out <- matrix(0, nrow = 10000, ncol = 2)
disp_out <- rep(0, 10000)

for (i in 1:10000) {
  out1 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2)
  )
  out2 <- rlmb(
    n = 1, y = y, x = x,
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = out1$coefficients[1, ])
  )
  beta_out[i, ] <- out1$coefficients[1, 1:2]
  disp_out[i]   <- out2$dispersion
}

colMeans(beta_out)
sd(beta_out[, 1])
sd(beta_out[, 2])
mean(disp_out)
sd(disp_out)
gb <- ch11_saved$gibbs_two_block
beta_out <- gb$beta_out
disp_out <- gb$disp_out
colMeans(beta_out)
#> (Intercept)    groupTrt 
#>   5.0309618  -0.3660935
sd(beta_out[, 1])
#> [1] 0.2285704
sd(beta_out[, 2])
#> [1] 0.325093
mean(disp_out)
#> [1] 0.5364701
sd(disp_out)
#> [1] 0.190905

3.3.3 Comparison of the Two Samplers

Under default settings, both the truncated‑Gamma iid sampler and the two‑block Gibbs sampler produce nearly identical posterior summaries for the regression coefficients and the dispersion parameter.

To make this explicit, the table below reports posterior means and standard deviations from the truncated-Gamma iid sampler (Section 3.3.1) and from the two-block Gibbs sampler (Section 3.3.2).

These values illustrate that both samplers target essentially the same posterior distribution (although the former uses a truncated prior), with only minor Monte Carlo variation.

Dobson plant weight: independent Normal–Gamma (iid) vs two-block Gibbs

  Parameter     iid_Mean   iid_SD   Gibbs_Mean   Gibbs_SD
  (Intercept)     5.0342   0.2192       5.0310     0.2286
  groupTrt       -0.3731   0.3081      -0.3661     0.3251
  Dispersion      0.4844   0.1470       0.5365     0.1909

The difference in the iid and Gibbs SD for the dispersion is largely due to the difference between a truncated and an unrestricted gamma prior.

However, the iid sampler has two clear advantages: its draws are independent (no autocorrelation), and it requires no burn-in or convergence diagnostics.

The Gibbs sampler mixes well for this model class, but it inevitably shows mild autocorrelation, and convergence diagnostics are beyond the scope of this vignette.

For most users, the iid sampler is therefore easier to work with and safer, as no convergence diagnostics are needed.

For readers who want the full mathematical details of the iid sampler, see Chapter A07, which documents the complete envelope construction and accept‑reject algorithm for the independent Normal–Gamma prior.

4. Gamma Regression Models With Unknown Dispersion

We now specify priors for \((\beta,\tau)\) and derive the corresponding log‑priors and log‑posteriors, using the carinsca data (Bailey and Simon 1960; Canadian Automobile Insurance Claims (Carinsca) 2024; McCullagh and Nelder 1989).

Throughout, \(\ell(\beta,\tau)\) denotes the weighted Gamma log-likelihood of Section 2.

data(carinsca)
carinsca$Merit <- ordered(carinsca$Merit)
carinsca$Class <- factor(carinsca$Class)
oldopt <- options(contrasts = c("contr.treatment", "contr.treatment"))
Claims  <- carinsca$Claims
Insured <- carinsca$Insured
Merit   <- carinsca$Merit
Class   <- carinsca$Class
Cost    <- carinsca$Cost
Claims_Adj <- Claims / 1000

glm.carinsca <- glm(Cost / Claims ~ Merit + Class,
                    family = Gamma(link = "log"),
                    weights = Claims_Adj, x = TRUE)
ps <- Prior_Setup(
  Cost / Claims ~ Merit + Class,
  family = Gamma(link = "log"),
  weights = Claims_Adj
)
mu <- ps$mu
V  <- ps$Sigma
shape    <- ps$shape
rate     <- ps$rate
x  <- ps$x
y  <- ps$y

m <- nrow(x)
p <- ncol(x)

## Starting dispersion for beta | tau in the two-block Gibbs sampler: same quasi-likelihood
## estimate as used elsewhere for this Carinsca Gamma GLM (not the Dobson Gaussian ps).
dispersion2 <- MASS::gamma.dispersion(glm.carinsca)

options(oldopt)

4.1 Prior on the Dispersion Only (Gamma Prior on Precision)

We place a Gamma prior on the precision:

\[ \tau \sim \text{Gamma}(a_0,b_0), \]

with log‑prior

\[ \log p(\tau) = (a_0 - 1)\log\tau - b_0\tau + \text{const}. \]

Given fixed \(\beta\), the conditional posterior is

\[ \log p(\tau \mid \beta,y) = \ell(\beta,\tau) + (a_0 - 1)\log\tau - b_0\tau + \text{const}. \]

This posterior is log‑concave in \(\tau\), enabling envelope‑based accept–reject sampling in the Gamma regression case.

## Carinsca Gamma GLM (already fitted in the Carinsca chunk for gamma.dispersion etc.)
MASS::gamma.dispersion(glm.carinsca)
#> [1] 0.007840919

out2 <- rglmb(
  n = 1000, y = y, x = x,
  family  = Gamma(link = "log"),
  pfamily = dGamma(shape = shape, rate = rate,
                   beta = ps$coefficients),
  weights = Claims_Adj
)

mean(out2$dispersion)
#> [1] 0.00835056

4.2 Independent Normal–Gamma Prior

Here the priors factorize:

\[ \beta \sim N(\mu_0,\Sigma_0), \qquad \tau \sim \text{Gamma}(a_0,b_0), \]

so \(\beta\) and \(\tau\) are a priori independent.


Log‑prior

\[ \log p(\beta,\tau) = -\tfrac12(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0) + (a_0 - 1)\log\tau - b_0\tau + \text{const}. \]


Log‑posterior

\[ \log p(\beta,\tau \mid y) = \ell(\beta,\tau) -\tfrac12(\beta-\mu_0)^\top\Sigma_0^{-1}(\beta-\mu_0) + (a_0 - 1)\log\tau - b_0\tau + \text{const}. \]

In Gamma regression:

  • the conditional posterior in \(\beta\) is not conjugate
  • the conditional posterior in \(\tau\) is not conjugate
  • the joint posterior remains log‑concave

This in theory enables two sampling strategies for the Gamma regression model:

  1. Two‑block Gibbs sampler
    • \(\beta \mid \tau, y\): sampled using a Normal approximation (via rglmb)
    • \(\tau \mid \beta, y\): sampled using a Gamma‑based quasi‑likelihood update
      This is the only fully implemented method for Gamma regression in the current version of glmbayes.
  2. Accept–reject envelope sampler (Chapter A07) Chapter A07 develops an accept–reject sampler using a truncated Gamma prior for \(\tau\) and a joint envelope over \((\beta,\tau)\). However, this method is implemented only for the Gaussian family.
    Extending the envelope‑based sampler to the Gamma regression model would require additional mathematical development and corresponding modifications to the code.

The truncated‑Gamma independent Normal–Gamma model in Chapter A07 therefore applies exclusively to the Gaussian case, while Gamma regression currently relies on the two‑block Gibbs sampler described above.

Implementation details: alternate rglmb draws for \(\beta \mid \tau, y\) using dNormal(..., dispersion = dispersion2) with draws for \(\tau \mid \beta, y\) using dGamma(...) on the Gamma regression likelihood, updating dispersion2 after each \(\tau\) step. A 1000‑iteration burn‑in is followed by 1000 stored iterations (coefficient matrix beta_out, dispersion vector disp_out, and accept–reject counts iters_out). The same logic is written to inst/extdata/Chapter11_Carinsca_gamma_gibbs.rds by data-raw/make_Chapter11_Carinsca_gamma_gibbs_output.R (set.seed(190)).

The chunk Block_Gibbs_gamma_Regression below is not evaluated when the vignette is built (eval = FALSE). Stored draws are loaded from inst/extdata/Chapter11_Carinsca_gamma_gibbs.rds in chapter11-load-carinsca-gamma, and summaries are produced in Block_Gibbs_gamma_loaded.

ch11_cg_path <- system.file(
  "extdata", "Chapter11_Carinsca_gamma_gibbs.rds",
  package = "glmbayes"
)
stopifnot(nzchar(ch11_cg_path), file.exists(ch11_cg_path))
ch11_cg <- readRDS(ch11_cg_path)
stopifnot(
  ncol(ch11_cg$gibbs_gamma$beta_out) == ncol(ps$x),
  length(ch11_cg$gibbs_gamma$disp_out) == nrow(ch11_cg$gibbs_gamma$beta_out)
)
set.seed(190)
dispersion2 <- MASS::gamma.dispersion(glm.carinsca)

suppressWarnings(
  suppressMessages(
for(i in 1:1000){

  ## --- Block 1: Regression coefficients ---
  out1 <- rglmb(
    n = 1, y = y, x = x,
    family  = Gamma(link="log"),
    pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2),
    weights = Claims_Adj
  )

  ## --- Block 2: Dispersion (quasi-likelihood sampler) ---
  out2 <- rglmb(
    n = 1, y = y, x = x,
    family  = Gamma(link="log"),
    pfamily = dGamma(shape = shape, rate = rate,
                     beta = out1$coefficients[1,]),
    weights = Claims_Adj
  )

  ## --- Update the dispersion for the next beta step ---
  ## (The commented-out (m - p)/m factor would rescale the draw from a
  ##  quasi-likelihood-style to an MLE-style dispersion estimate.)
  dispersion2 <- out2$dispersion ##* ((m - p)/m)

}

)
)

## Run 1000 additional iterations and store output
beta_out <- matrix(0, nrow = 1000, ncol = ncol(x))
disp_out <- rep(0, 1000)
iters_out <- rep(0, 1000)


suppressWarnings(
  suppressMessages(
    for (i in 1:1000) {
      out1 <- rglmb(
        n = 1, y = y, x = x,
        family = Gamma(link="log"),
        pfamily = dNormal(mu = mu, Sigma = V, dispersion = dispersion2),
        weights = Claims_Adj
      )
      out2 <- rglmb(
        n = 1, y = y, x = x,
        family = Gamma(link="log"),
        pfamily = dGamma(shape = shape, rate = rate,
                         beta = out1$coefficients[1,]),
        weights = Claims_Adj
      )
      dispersion2 <- out2$dispersion ##* ((m - p) / m)
      beta_out[i, ] <- out1$coefficients[1, seq_len(ncol(x))]
      disp_out[i] <- out2$dispersion ##* ((m - p) / m)
      iters_out[i]<-out2$iters
    }
  )
)

## Review output


beta_mean  <- colMeans(beta_out)
beta_sd    <- apply(beta_out, 2, sd)
beta_tlike <- beta_mean / beta_sd   # analogous to GLM t-values

bayes_coef_table <- data.frame(
  Estimate = beta_mean,
  Std.Error = beta_sd,
  t.like = beta_tlike
)

bayes_coef_table

mean(disp_out)
mean(iters_out)
cg <- ch11_cg$gibbs_gamma
beta_out <- cg$beta_out
disp_out <- cg$disp_out
iters_out <- cg$iters_out

beta_mean  <- colMeans(beta_out)
beta_sd    <- apply(beta_out, 2, sd)
beta_tlike <- beta_mean / beta_sd

bayes_coef_table <- data.frame(
  Estimate = beta_mean,
  Std.Error = beta_sd,
  t.like = beta_tlike
)

bayes_coef_table
#>                Estimate  Std.Error      t.like
#> (Intercept) -1.17499555 0.01579067 -74.4107461
#> Merit1      -0.06920914 0.02625167  -2.6363703
#> Merit2      -0.06996381 0.03049230  -2.2944751
#> Merit3      -0.05602611 0.01693771  -3.3077727
#> Class2       0.08391885 0.02693784   3.1152773
#> Class3       0.01667647 0.01863376   0.8949601
#> Class4       0.15751604 0.01996256   7.8905720
#> Class5      -0.07937177 0.03994824  -1.9868650

mean(disp_out)
#> [1] 0.0139002
mean(iters_out)
#> [1] 36.805

4.3 Conditional Sampling for the Dispersion Parameter

Chapter A06 provides a detailed derivation of the conditional posterior for the precision parameter \(v = 1/\phi\) in Gamma regression. The posterior can be written in the form

\[ \ell(v) = ( \text{shape2} - 1 ) \log v - \text{rate1} \, v + \text{testfunc}(v) + \text{const}, \]

where \(\text{testfunc}(v)\) is a concave remainder term capturing the curvature contributed by the Gamma likelihood. This decomposition enables a tangent–envelope accept–reject sampler, in which:

  1. A Gamma proposal distribution is constructed using the linear part
    \[ ( \text{shape2} - 1 ) \log v - \text{rate2} \, v, \] where \(\text{rate2} = \text{rate1} - g^{*}\) and \(g^{*}\) is the derivative of \(\text{testfunc}(v)\) at the anchor point \(v^{*}\).

  2. A proposed value \(v_{\text{prop}}\) is accepted with probability
    \[ \exp\!\left[ \text{testfunc}(v_{\text{prop}}) - \text{testfunc}(v^{*}) - g^{*}(v_{\text{prop}} - v^{*}) \right]. \]

Because \(\text{testfunc}(v)\) is concave, the tangent line at \(v^{*}\) always lies above it, ensuring a valid rejection sampler.
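
A compact sketch of this accept–reject step is given below; testfunc, dtestfunc, and the anchor v_star stand in for the quantities derived in Chapter A06, and the sketch assumes \(\text{rate2} = \text{rate1} - g^{*} > 0\):

## Tangent-envelope accept-reject draw for v = 1/phi (sketch).
sample_v <- function(shape2, rate1, testfunc, dtestfunc, v_star) {
  g_star <- dtestfunc(v_star)  # slope of the concave remainder at v_star
  rate2  <- rate1 - g_star     # Gamma proposal rate from the linear part
  stopifnot(rate2 > 0)
  repeat {
    v_prop  <- rgamma(1, shape = shape2, rate = rate2)
    log_acc <- testfunc(v_prop) - testfunc(v_star) - g_star * (v_prop - v_star)
    if (log(runif(1)) <= log_acc) return(v_prop)  # log_acc <= 0 by concavity
  }
}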

5. Summary

This chapter introduced Bayesian estimation of Gaussian and Gamma regression models with unknown dispersion (Gelman et al. 2013; O’Hagan and Forster 2004), including:

• Gamma priors on the precision \(\tau\)
• Joint Normal–Gamma priors
• Independent Normal–Gamma priors
• iid sampling via envelope methods
• two‑block Gibbs sampling

The log‑concavity of the Gaussian likelihood in both \(\beta\) and \(\tau\) ensures that all samplers are stable, efficient, and theoretically well‑behaved.

A1: Posterior Distribution Details for the Conjugate Normal–Gamma Prior

A1.1 Marginal Posterior for \(\tau\)

Integrating out \(\beta\) yields a Gamma posterior independent of \(\beta\):

\[ \tau \mid y \sim \mathrm{Gamma}\!\left( a_{0} + \frac{n}{2},\; b_{0} + \frac{1}{2} S_{\text{marg}} \right), \]

where

\[ S_{\text{marg}} = (y - X\mu_{0})^{T} \left(I + X\Sigma_{0}X^{T}\right)^{-1} (y - X\mu_{0}). \]

Here \(\Sigma_{0}\) is the same matrix as in Section 3.2 and in A1.2 (so that \(\Sigma_{0}^{-1}\) is the prior precision factor in \(-\frac{\tau}{2}(\beta-\mu_{0})^{\mathsf T}\Sigma_{0}^{-1}(\beta-\mu_{0})\)). Let \(\hat\beta = (X^{\mathsf T}X)^{-1} X^{\mathsf T} y\) be the ordinary least squares estimator and \(\mathrm{RSS} = (y - X\hat\beta)^{\mathsf T}(y - X\hat\beta)\). Write \(G = X^{\mathsf T}X\) (invertible in the usual full-rank setup). The same scalar \(S_{\text{marg}}\) can be written as

\[ S_{\text{marg}} = \mathrm{RSS} + (\hat\beta - \mu_{0})^{\mathsf T} \left(\Sigma_{0} + G^{-1}\right)^{-1} (\hat\beta - \mu_{0}). \]

Equivalently, expanding \((I + X\Sigma_{0}X^{\mathsf T})^{-1}\) with the Sherman–Morrison–Woodbury formula gives the same middle matrix in the precision form \[ G - G\left(\Sigma_{0}^{-1} + G\right)^{-1}G = \left(\Sigma_{0} + G^{-1}\right)^{-1}. \] In particular, if \(\mu_{0} = \hat\beta\) then the second term vanishes and \(S_{\text{marg}} = \mathrm{RSS}\).

Step-by-step (Woodbury + least-squares split). Set \(e = y - X\hat\beta\), \(\delta = \hat\beta - \mu_{0}\), so \(y - X\mu_{0} = e + X\delta\). Normal equations give \(X^{\mathsf T}e = 0\), hence \(\|e + X\delta\|^{2} = \mathrm{RSS} + \delta^{\mathsf T}G\delta\) and \(X^{\mathsf T}(e + X\delta) = G\delta\). Woodbury says \((I + X\Sigma_{0}X^{\mathsf T})^{-1} = I - X(\Sigma_{0}^{-1}+G)^{-1}X^{\mathsf T}\). Substitute \(y-X\mu_{0}=e+X\delta\) into the quadratic form for \(S_{\text{marg}}\); cross terms vanish because \(e^{\mathsf T}X=0\), and the algebra reduces to \(\mathrm{RSS} + \delta^{\mathsf T}\bigl(G - G(\Sigma_{0}^{-1}+G)^{-1}G\bigr)\delta\), which equals the display above.

For observation weights \(w_{i}\) (Section 2.2), define \(G = X^{\mathsf T}WX\), \(\hat\beta = G^{-1} X^{\mathsf T}Wy\), and \(\mathrm{RSS} = (y - X\hat\beta)^{\mathsf T}W(y - X\hat\beta)\). The RSS + \((\hat\beta-\mu_{0})^{\mathsf T}(\Sigma_{0}+G^{-1})^{-1}(\hat\beta-\mu_{0})\) identity is unchanged (same proof in whitened coordinates \(\tilde y = W^{1/2}y\), \(\tilde X = W^{1/2}X\)). The first display becomes \((y-X\mu_{0})^{\mathsf T}W^{1/2}(I + W^{1/2}X\Sigma_{0}X^{\mathsf T}W^{1/2})^{-1}W^{1/2}(y-X\mu_{0})\) for any factor \(W^{1/2}\) with \(W^{1/2}(W^{1/2})^{\mathsf T}=W\) (e.g. diagonal \(\sqrt{w_i}\)).
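
The identity is easy to confirm numerically; the chunk below uses throwaway quantities (Xm, yy) in the unweighted case:

## Numerical check of the two forms of S_marg (unweighted case).
set.seed(4)
n  <- 12; p <- 3
Xm <- cbind(1, matrix(rnorm(n * (p - 1)), n))
yy <- rnorm(n)
Sigma0 <- diag(p); mu0 <- rep(0, p)
G    <- crossprod(Xm)
bhat <- drop(solve(G, crossprod(Xm, yy)))
S1 <- drop(t(yy - Xm %*% mu0) %*%
           solve(diag(n) + Xm %*% Sigma0 %*% t(Xm), yy - Xm %*% mu0))
S2 <- sum((yy - drop(Xm %*% bhat))^2) +
      drop(t(bhat - mu0) %*% solve(Sigma0 + solve(G), bhat - mu0))
all.equal(S1, S2)
#> [1] TRUE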

This closed‑form marginal is what makes the Normal–Gamma prior so computationally attractive: sampling \(\tau\) requires only a single Gamma draw.

A1.2 Conditional Posterior for \(\beta\)

Given \(\tau\), the posterior for \(\beta\) is Gaussian:

\[ \beta \mid \tau,y \sim N(m_{\text{post}}, V_{\text{post}}), \]

with

\[ V_{\text{post}} = \frac{1}{\tau}\left(\Sigma_{0}^{-1} + X^{T}X\right)^{-1}, \qquad m_{\text{post}} = \left(\Sigma_{0}^{-1} + X^{T}X\right)^{-1} \left(\Sigma_{0}^{-1}\mu_{0} + X^{T}y\right). \]

A1.3 Sampling the Posterior

Because the Normal–Gamma prior is conjugate to the Gaussian likelihood (O’Hagan and Forster 2004), the joint posterior factorizes into two standard distributions:

\[ p(\tau,\beta \mid y) = p(\tau \mid y)\, p(\beta \mid \tau, y). \]

This structure means no MCMC is required.
Posterior draws are obtained by independent Monte Carlo sampling, not by a Gibbs chain.


Step 1: Draw the marginal posterior for \(\tau\)

The dispersion (precision) parameter has a closed‑form marginal posterior:

\[ \tau \mid y \sim \mathrm{Gamma}\!\left( a_{0} + \frac{n}{2},\; b_{0} + \frac{1}{2} S_{\text{marg}} \right), \]

where \(S_{\text{marg}}\) is the marginal sum of squares from Section A1.1 (equivalently, the augmented residual sum of squares at the generalized least-squares estimator computed under the prior precision).

Each draw of \(\tau\) is i.i.d. from this Gamma distribution.


Step 2: Draw \(\beta\) conditionally on each sampled \(\tau\)

Given a sampled value \(\tau^{(s)}\), the regression coefficients have a Normal posterior:

\[ \beta \mid \tau^{(s)}, y \sim N\!\left(m_{\text{post}}^{(s)},\, V_{\text{post}}^{(s)}\right), \]

with

\[ V_{\text{post}}^{(s)} = \frac{1}{\tau^{(s)}}\left(\Sigma_{0}^{-1} + X^{T}X\right)^{-1}, \]

\[ m_{\text{post}}^{(s)} = \left(\Sigma_{0}^{-1} + X^{T}X\right)^{-1} \left(\Sigma_{0}^{-1}\mu_{0} + X^{T}y\right), \]

which does not depend on \(\tau^{(s)}\); only the covariance scales with the sampled precision.

Each draw of \(\beta\) is conditionally independent given its corresponding \(\tau^{(s)}\).


Result: Pure i.i.d. Monte Carlo Sampling

The conjugate posterior allows \(\tau\) to be drawn i.i.d. from its Gamma marginal, and \(\beta\) to be drawn from its Normal conditional given each \(\tau^{(s)}\).

No Markov chain is constructed, and no Gibbs or Metropolis steps are involved.
All posterior samples are independent, making the conjugate Normal–Gamma model one of the simplest Bayesian regression frameworks for simulation.
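
A self-contained sketch of the two-step draw in the unweighted case is shown below; all inputs are function arguments rather than Prior_Setup() output:

## i.i.d. draws from the conjugate Normal-Gamma posterior (sketch).
r_conjugate_ng <- function(S, y, X, mu0, Sigma0, a0, b0) {
  G     <- crossprod(X)
  bhat  <- drop(solve(G, crossprod(X, y)))
  d     <- bhat - mu0
  Smarg <- sum((y - drop(X %*% bhat))^2) +
           drop(t(d) %*% solve(Sigma0 + solve(G), d))
  P <- solve(Sigma0) + G  # posterior precision of beta is tau * P
  m <- drop(solve(P, solve(Sigma0) %*% mu0 + crossprod(X, y)))
  tau  <- rgamma(S, shape = a0 + length(y) / 2, rate = b0 + Smarg / 2)  # Step 1
  beta <- t(vapply(tau, function(tt)                                    # Step 2
              m + drop(solve(chol(tt * P), rnorm(length(m)))),
            numeric(length(m))))
  list(beta = beta, dispersion = 1 / tau)
}

Each row of beta pairs with its own tau draw, so the output is an exact i.i.d. sample from the joint posterior.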

A2. Detailed Sampling Procedure for the Independent Normal-Gamma Prior

A2.1 iid Sampling Under Truncated Gamma Priors

When iid sampling is requested, glmbayes uses a truncated Gamma prior for the precision parameter \(\tau\). The truncation is introduced for numerical stability and does not change the basic Normal-Gamma structure.

The iid sampler for this prior family is based on the envelope accept-reject method described in:

Chapter A07 — Accept-Reject Sampling for Gaussian Regression Models with Independent Normal-Gamma Priors

That vignette explains the full simulation procedure, including the construction of the log-concave envelope, the handling of the truncation bounds, and the accept-reject step.

The key point for users is that the posterior remains log-concave, so the accept-reject sampler produces independent draws for \((\beta, \tau)\). There is no Markov chain, no burn-in, and no autocorrelation.

A3. Posterior mean of \(\beta\) and marginal covariance (Zellner-type prior)

This subsection records closed-form posterior moments for \(\beta\) in the conjugate Normal-Gamma Gaussian regression model, in the same weighted likelihood notation as Section 2.2 and in the prior weight / effective prior sample size notation used by Prior_Setup().

A3.1 Weighted likelihood and prior precision

Assume

\[ y \mid \beta,\phi \sim N\!\left(X\beta,\;\phi W^{-1}\right), \qquad \tau = 1/\phi, \]

with \(W = \mathrm{diag}(w_{1},\dots,w_{n})\) and \(w_i \ge 0\). Define the effective sample size

\[ n_{w} = \sum_{i=1}^{n} w_i \]

(the quantity sum(wt) in rNormalGammaReg()). Write the weighted Gram matrix as

\[ G = X^{\mathsf T} W X. \]

Let \(\hat\beta\) denote the weighted least squares estimator

\[ \hat\beta = G^{-1} X^{\mathsf T} W y \]

and the weighted residual sum of squares

\[ \mathrm{RSS} = (y - X\hat\beta)^{\mathsf T} W (y - X\hat\beta). \]

Take a Zellner-type conditional Normal prior on \(\beta\) given \(\tau\) that is proportional to the data precision:

\[ \beta \mid \tau \sim N\!\left(\mu_{0},\; \tau^{-1}\left(\frac{pwt}{1-pwt}G\right)^{-1}\right), \qquad 0 < pwt < 1, \]

equivalently prior precision \(\tau\, (pwt/(1-pwt))\, G\). This is the structure implied by Prior_Setup() together with dNormal_Gamma(..., Sigma_0, ...) when \(\Sigma_0 = ((1-pwt)/pwt)\, G^{-1}\) (Zellner \(g\)-prior relative to \(G^{-1}\); see Section 3.2).

Define the effective prior sample size \(n_{\mathrm{prior}} > 0\) by

\[ pwt = \frac{n_{\mathrm{prior}}}{n_{\mathrm{prior}} + n_{w}}, \qquad 1 - pwt = \frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}. \]

This is the relationship implemented when Prior_Setup() is called with scalar n_prior (n_prior and n_effective in the function).

A3.2 Posterior mean of \(\beta\)

Combining likelihood precision \(\tau G\) with prior precision \(\tau (pwt/(1-pwt)) G\) gives posterior precision given \(\tau\):

\[ \tau G + \tau \frac{pwt}{1-pwt} G = \frac{\tau}{1-pwt}\, G. \]

Hence

\[ \mathrm{Cov}(\beta \mid \tau, y) = \frac{1-pwt}{\tau}\, G^{-1} = \frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\,\frac{1}{\tau}\, G^{-1}. \]

The posterior mean \(m_{\mathrm{post}} = E(\beta \mid \tau,y)\) solves the same normal equations as the precision-weighted average of \(\hat\beta\) and \(\mu_{0}\), and simplifies to

\[ \boxed{ E(\beta \mid y) = (1-pwt)\,\hat\beta + pwt\,\mu_{0} = \frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\,\hat\beta + \frac{n_{\mathrm{prior}}}{n_{\mathrm{prior}} + n_{w}}\,\mu_{0}. } \]

The right-hand side does not depend on \(\tau\) (for this prior), so \(E(\beta \mid y) = E(\beta \mid \tau,y)\).
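
The shrinkage form is easy to confirm numerically with throwaway data (Xm, yy, ww), constructing the weighted Gram matrix and pwt directly:

## Check: posterior mean equals (1 - pwt) * bhat + pwt * mu0 (sketch).
set.seed(5)
n  <- 50
Xm <- cbind(1, rnorm(n)); ww <- runif(n, 0.5, 2)
yy <- drop(Xm %*% c(1, -0.5)) + rnorm(n, sd = sqrt(0.4 / ww))
G    <- crossprod(Xm * ww, Xm)  # X' W X
bhat <- drop(solve(G, crossprod(Xm * ww, yy)))
mu0  <- c(0, 0); n_prior <- 10; n_w <- sum(ww)
pwt  <- n_prior / (n_prior + n_w)
## Posterior precision given tau is (tau / (1 - pwt)) G; tau cancels in the mean.
m_post <- drop(solve(G / (1 - pwt),
                     (pwt / (1 - pwt)) * (G %*% mu0) + crossprod(Xm * ww, yy)))
all.equal(m_post, (1 - pwt) * bhat + pwt * mu0)
#> [1] TRUE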

A3.3 Marginal covariance of \(\beta\) and the Gamma parameters

Marginalize \(\tau\) with

\[ \tau \mid y \sim \mathrm{Gamma}(a_{n}, b_{n}) \]

in the shape-rate parameterization \(p(\tau) \propto \tau^{a_{n}-1} e^{-b_{n}\tau}\) used in rNormalGammaReg() (R::rgamma(shape, 1/rate)), so \(E(\tau^{-1} \mid y) = b_{n}/(a_{n}-1)\) for \(a_{n} > 1\).

Let

\[ V_{n} = (1-pwt)\, G^{-1} = \frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\, G^{-1}, \]

so \(\mathrm{Cov}(\beta \mid \tau,y) = \tau^{-1} V_{n}\). Because \(E(\beta \mid \tau,y)\) is constant in \(\tau\), the law of total covariance gives

\[ \boxed{ \mathrm{Cov}(\beta \mid y) = E(\tau^{-1} \mid y)\, V_{n} = \frac{b_{n}}{a_{n}-1}\, \frac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\, G^{-1}, \qquad a_{n} > 1. } \]
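
A quick Monte Carlo illustration of the law-of-total-covariance step, with made-up values of \(a_n\), \(b_n\), and \(V_n\):

## Cov(beta | y) = E(1/tau | y) * V_n when E(beta | tau, y) is constant in tau.
set.seed(6)
a_n <- 6; b_n <- 2
V_n <- matrix(c(0.5, 0.1, 0.1, 0.3), 2)
tau  <- rgamma(2e4, shape = a_n, rate = b_n)
m    <- c(1, -1)  # conditional mean, free of tau
beta <- t(sapply(tau, function(tt) m + drop(rnorm(2) %*% chol(V_n / tt))))
round(cov(beta), 3)
round(b_n / (a_n - 1) * V_n, 3)  # should agree up to Monte Carlo error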

In the sampler,

\[ a_{n} = a_{0} + \frac{n_{w}}{2}, \qquad b_{n} = b_{0} + \frac{1}{2} S_{\mathrm{marg}}, \]

with \(S_{\mathrm{marg}}\) the marginal sum of squares from Section A1.1. In the notation of this section (\(G = X^{\mathsf T}WX\), weighted least squares \(\hat\beta = G^{-1} X^{\mathsf T}Wy\), weighted \(\mathrm{RSS} = (y - X\hat\beta)^{\mathsf T}W(y - X\hat\beta)\), and \(\Sigma_{0}\) as in Section 3.2),

\[ S_{\mathrm{marg}} = \mathrm{RSS} + (\hat\beta - \mu_{0})^{\mathsf T} \left(\Sigma_{0} + G^{-1}\right)^{-1} (\hat\beta - \mu_{0}). \]

This is the same scalar as the augmented residual sum of squares S produced by rNormal_reg.wfit() (Section A1). Under the Zellner \(g\)-prior of A3.1, the prior covariance is \(\Sigma_{0} = ((1-pwt)/pwt)\,G^{-1}\), so \(\Sigma_{0} + G^{-1} = G^{-1}/pwt\) and \((\Sigma_{0} + G^{-1})^{-1} = pwt\, G\). Equivalently, \[ S_{\mathrm{marg}} = \mathrm{RSS} + pwt\, (\hat\beta - \mu_{0})^{\mathsf T} G (\hat\beta - \mu_{0}). \] The structural point is that the prior–data mismatch enters only through the scalar factor \(pwt\) multiplying a nonnegative quadratic in \(\hat\beta-\mu_{0}\). Hence, holding the data and \(\mu_{0}\) fixed so that \((\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0})\) stays finite, \[ pwt \to 0^{+} \quad\Longrightarrow\quad pwt\,(\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0}) \to 0, \] and \(S_{\mathrm{marg}} \to \mathrm{RSS}\): weak prior weight washes out the gap even when \(\mu_{0} \neq \hat\beta\). Equivalently, \(pwt\to 0\) whenever \(n_{\mathrm{prior}}/(n_{\mathrm{prior}}+n_{w})\to 0\)—for example \(n_{\mathrm{prior}}\to 0\) with \(n_{w}\) held fixed, or \(n_{w}\to\infty\) with \(n_{\mathrm{prior}}\) held fixed. The exact cancellation \(S_{\mathrm{marg}} = \mathrm{RSS}\) also holds for any \(pwt\) when \(\mu_{0} = \hat\beta\) (for example Prior_Setup(..., intercept_source = "full_model", effects_source = "full_model")). For fixed \(pwt>0\) and \(\mu_{0} \neq \hat\beta\), the second term is strictly positive unless \(\hat\beta-\mu_{0}\) lies in the null space of \(G\). Under standard regularity, with fixed \(n_{\mathrm{prior}}\) and \(n_{w}\to\infty\), one has \(pwt = O(1/n_{w})\) while \((\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0})\) is typically \(O_{p}(n_{w})\), so the mismatch term is often \(O_{p}(1)\): the additive difference \(S_{\mathrm{marg}}-\mathrm{RSS}\) need not vanish at finite \(n_{w}\), but \(S_{\mathrm{marg}}/n_{w}\) and \(\mathrm{RSS}/n_{w}\) share the same leading order.

Limit \(pwt\to 0\) and coefficient covariance. Under the Zellner decomposition above, \(pwt\to 0^{+}\) forces \(S_{\mathrm{marg}}\to \mathrm{RSS}\) (for fixed data and \(\mu_{0}\), with a finite mismatch quadratic), not only when \(\mu_{0}=\hat\beta\). Since \(b_{n} = b_{0} + \tfrac{1}{2} S_{\mathrm{marg}}\), the marginal Gamma rate then limits to \(b_{0} + \tfrac{1}{2}\mathrm{RSS}\) whenever \(b_{0}\) is held fixed along the path; \(E(\tau^{-1}\mid y)=b_{n}/(a_{n}-1)\) therefore ties to RSS through \(b_{n}\) in the same limit. At the same time, \((1-pwt)\,G^{-1}\to G^{-1}\) from A3.2–A3.3. Thus \(\mathrm{Cov}(\beta \mid \tau,y) = \tau^{-1}(1-pwt)\,G^{-1}\) tends to \(\tau^{-1}G^{-1}\) (likelihood-only WLS covariance given \(\tau\)), and \[ \mathrm{Cov}(\beta \mid y) = \frac{b_{n}}{a_{n}-1}\,(1-pwt)\,G^{-1}, \qquad a_{n}>1, \] has both \(b_{n}\) driven by \(S_{\mathrm{marg}}\to \mathrm{RSS}\) and the matrix prefactor \((1-pwt)\,G^{-1}\to G^{-1}\): weak \(pwt\) removes prior shrinkage of precision and drives the residual sum of squares in the \(\tau\) update back to the data \(\mathrm{RSS}\). The usual \(\sigma^{2}(X^{\mathsf T}WX)^{-1}\) limit is recovered when, in addition, \(a_{n}\) and \(b_{0}\) follow the weak-prior path discussed below (e.g. formal \(n_{\mathrm{prior}}\to 0\) with \(n_{w}\) fixed, holding \(a_{n}>1\)). If \(pwt\) is driven to zero by \(n_{w}\to\infty\) at fixed \(n_{\mathrm{prior}}\), then \((1-pwt)\to 1\) and \(G^{-1}\) typically scales as \(O(1/n_{w})\), so \(\mathrm{Cov}(\beta\mid y)\) shrinks with more weighted data even though the prior contributes negligibly to precision.

With Prior_Setup() defaults for the Gamma on \(\tau\), \(a_{0} = n_{\mathrm{prior}}/2\) in the pre-calibration Step 9 rate scaling, so \(a_{n} = (n_{\mathrm{prior}} + n_{w})/2\); the prior rate is \(b_{0} = (n_{\mathrm{prior}}/2)\, d\) with dispersion \(d\) returned by Prior_Setup() (see Section 3.2 and ?Prior_Setup).

The marginal posterior of \(\beta\) is multivariate \(t\); the matrix above is its covariance (not the scale matrix in every textbook parameterization of the \(t\) density).

Prior_Setup() dispersion and the Gamma rates \(b_{0}\), \(b_{n}\)

For Gaussian models with scalar n_prior, Prior_Setup() sets the prior Gamma rate to \(\mathrm{rate} = (n_{\mathrm{prior}}/2)\, d\), where \(d\) is the scalar dispersion returned by the function. In the notation above, \(b_{0} = \mathrm{rate}\).

The posterior Gamma rate is \(b_{n} = b_{0} + \frac{1}{2} S_{\mathrm{marg}}\). Thus a larger returned \(d\) increases \(b_{0}\) and, holding \(S_{\mathrm{marg}}\) fixed, would increase \(b_{n}\). In practice, for dNormal_Gamma the matrix passed to the sampler is Sigma_0 (dispersion-free prior covariance from Prior_Setup()), while the returned dispersion calibrates the Gamma rate and the coefficient-scale Sigma; changing \(d\) rescales the Normal–Gamma coupling of \(\beta\) given \(\tau\); the augmented least-squares step then produces a different \(S_{\mathrm{marg}}\) and posterior mean \(m_{\mathrm{post}}\) as well.

Special case: \(b_{0} = \frac{1}{2}(n_{\mathrm{prior}}/n_{w})\,S_{\mathrm{marg}}\)

With \(a_{0} = n_{\mathrm{prior}}/2\) (as assumed below), this choice calibrates the prior so that the implied prior point estimate of the dispersion, \(b_{0}/a_{0}\), equals \[ \frac{S_{\mathrm{marg}}}{n_{w}}. \]

This choice is not the Prior_Setup() default; it is a hypothetical reparameterization that sets the prior Gamma rate proportional to the same marginal sum of squares that enters the likelihood fragment, with weight \(n_{\mathrm{prior}}/n_{w}\) on the prior piece relative to the data piece (see the discussion in the main text). Because \(S_{\mathrm{marg}}\) depends on \(y\) (and on \(\mu_{0}\) versus \(\hat\beta\)), such a \(b_{0}\) is data-dependent unless used only in stylized derivations.

Assume \(a_{0} = n_{\mathrm{prior}}/2\) (nominal rate scaling). Then \(a_{n} = (n_{\mathrm{prior}} + n_{w})/2\) and, provided \(a_{n} > 1\) (equivalently \(n_{\mathrm{prior}} + n_{w} > 2\)),

\[ a_{n} - 1 = \frac{n_{\mathrm{prior}} + n_{w} - 2}{2}. \]

Take

\[ b_{0} = \frac{1}{2}\,\frac{n_{\mathrm{prior}}}{n_{w}}\, S_{\mathrm{marg}}, \qquad b_{n} = b_{0} + \frac{1}{2} S_{\mathrm{marg}} = \frac{1}{2}\, S_{\mathrm{marg}}\left(1 + \frac{n_{\mathrm{prior}}}{n_{w}}\right) = \frac{S_{\mathrm{marg}}}{2}\,\frac{n_{w} + n_{\mathrm{prior}}}{n_{w}}. \]

Then the posterior expectation of \(\tau^{-1} = \phi = \sigma^{2}\) is

\[ \frac{b_{n}}{a_{n}-1} = \frac{S_{\mathrm{marg}}(n_{w} + n_{\mathrm{prior}})/n_{w}} {(n_{\mathrm{prior}} + n_{w} - 2)}. \]

Under the Zellner prior of A3.1, \(V_{n} = (1-pwt)\, G^{-1} = \dfrac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\, G^{-1}\). One may write \(\mathrm{Cov}(\beta \mid y)\) as the product of \(b_{n}/(a_{n}-1)\) and \(V_{n}\) without combining the \((n_{\mathrm{prior}} + n_{w})\) terms:

\[ \boxed{ \mathrm{Cov}(\beta \mid y) = \frac{b_{n}}{a_{n}-1}\, V_{n} = \dfrac{S_{\mathrm{marg}}\,(n_{\mathrm{prior}} + n_{w})/n_{w}} {n_{\mathrm{prior}} + n_{w} - 2} \cdot \dfrac{n_{w}}{n_{\mathrm{prior}} + n_{w}}\, G^{-1}. } \]

The two fractions cancel to \(S_{\mathrm{marg}} / (n_{\mathrm{prior}} + n_{w} - 2)\) times \(G^{-1}\). So with this \(b_{0}\), the marginal coefficient covariance matches that heuristic pattern—analogous in structure to \(\mathrm{RSS}/(n-p)\) times \((X^{\mathsf T}WX)^{-1}\) when \(S_{\mathrm{marg}} = \mathrm{RSS}\) (e.g. \(\mu_{0} = \hat\beta\)) and one reads \(n_{\mathrm{prior}} + n_{w} - 2\) as an effective residual degrees of freedom for the joint Normal–Gamma fragment. The factor \(a_{n}-1\) in the denominator of \(b_{n}/(a_{n}-1)\) is exactly half that count, and the factor \((1-pwt) = n_{w}/(n_{\mathrm{prior}}+n_{w})\) from \(V_{n}\) supplies the remaining alignment.

Limit \(n_{\mathrm{prior}} \to 0\) (formal comparison to classical). Holding \(n_{w} > 2\) and the data fixed, \(pwt = n_{\mathrm{prior}}/(n_{\mathrm{prior}}+n_{w}) \to 0\), \(b_{0} = \frac{1}{2}(n_{\mathrm{prior}}/n_{w})S_{\mathrm{marg}} \to 0\), \(a_{n} = (n_{\mathrm{prior}}+n_{w})/2 \to n_{w}/2\), and \(a_{n}-1 \to (n_{w}-2)/2\). Along the same path, \(S_{\mathrm{marg}} = \mathrm{RSS} + pwt\,(\hat\beta - \mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0}) \to \mathrm{RSS}\) because \(pwt\to 0\) (Section A3.3): the prior–data mismatch vanishes even when \(\mu_{0} \neq \hat\beta\). Also \(b_{n} = b_{0} + \frac{1}{2}S_{\mathrm{marg}} \to \frac{1}{2}\mathrm{RSS}\), so

\[ \frac{b_{n}}{a_{n}-1} \to \frac{\mathrm{RSS}}{n_{w}-2}, \qquad V_{n} = (1-pwt)\,G^{-1} \to G^{-1}. \]

Hence \[ \mathrm{Cov}(\beta \mid y) \to \frac{\mathrm{RSS}}{n_{w}-2}\,(X^{\mathsf T}WX)^{-1}, \] the usual weighted least squares covariance under the Gaussian glm.fit dispersion convention of Prior_Setup() (see ?Prior_Setup, Details). Before taking \(n_{\mathrm{prior}}\to 0\), a fixed \(n_{\mathrm{prior}}>0\) with \(\mu_{0} \neq \hat\beta\) typically has \(S_{\mathrm{marg}} > \mathrm{RSS}\) and hence a larger \(b_{n}\) and posterior spread than this classical limit suggests; only in the limit does \(S_{\mathrm{marg}}\) coincide with \(\mathrm{RSS}\) for every \(\mu_{0}\). At \(n_{\mathrm{prior}} = 0\) the Zellner \(g\)-prior weight and the Gamma shape \(a_{0}\) are degenerate, so this limit is best read as asymptotic intuition for weak prior weight, not as a literal prior specification.

Growth of \(n_{w}\) at fixed \(n_{\mathrm{prior}}\). The posterior mean of \(\tau^{-1}=\sigma^{2}\) in the special case can be written with the uncancelled factors as \[ \frac{b_{n}}{a_{n}-1} = \frac{S_{\mathrm{marg}}\,(n_{\mathrm{prior}} + n_{w})/n_{w}} {n_{\mathrm{prior}} + n_{w} - 2} = \frac{S_{\mathrm{marg}}}{n_{w}} \cdot \frac{n_{\mathrm{prior}} + n_{w}}{n_{\mathrm{prior}} + n_{w} - 2}. \] As \(n_{w} \to \infty\) with \(n_{\mathrm{prior}}\) fixed, \((n_{\mathrm{prior}} + n_{w})/(n_{\mathrm{prior}} + n_{w} - 2) \to 1\) and \((n_{\mathrm{prior}} + n_{w})/n_{w} \to 1\), so asymptotically \((b_{n}/(a_{n}-1)) \sim S_{\mathrm{marg}}/n_{w}\): the leading behavior is mean marginal residual energy per effective observation. Moreover \(pwt = n_{\mathrm{prior}}/(n_{\mathrm{prior}}+n_{w}) \to 0\), so in the Zellner decomposition \(S_{\mathrm{marg}} = \mathrm{RSS} + pwt\,(\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0})\) the mismatch term is multiplied by a coefficient tending to zero; under typical regularity \(pwt\,(\hat\beta-\mu_{0})^{\mathsf T}G(\hat\beta-\mu_{0}) = O_{p}(1)\) while \(\mathrm{RSS}\) grows with \(n_{w}\), so \(S_{\mathrm{marg}}/n_{w}\) and \(\mathrm{RSS}/n_{w}\) have the same leading order. The factor \((n_{\mathrm{prior}}+n_{w})/n_{w}\) is the only explicit \(O(1/n_{w})\) correction linking total sample size to effective data count before cancellation with \(V_{n}\).

dGamma posterior with fixed \(\beta\): general form and special choices

For a precision-only prior \[ \tau \sim \mathrm{Gamma}(a_0,b_0), \] and Gaussian weighted likelihood with \(\beta\) treated as fixed, define \[ \mathrm{RSS}_w(\beta)=\sum_i w_i(y_i-x_i^\top\beta)^2. \] Then \[ \tau\mid y,\beta \sim \mathrm{Gamma}\!\left(a_0+\frac{n_w}{2},\; b_0+\frac12\,\mathrm{RSS}_w(\beta)\right). \]

Using the Chapter 11 calibration \[ a_0=\frac{n_{\mathrm{prior}}}{2}, \qquad b_0=\frac12\frac{n_{\mathrm{prior}}}{n_w}\,\mathrm{RSS}_w(\beta), \] we get \[ \tau\mid y,\beta \sim \mathrm{Gamma}\!\left( \frac{n_{\mathrm{prior}}+n_w}{2}, \frac12\left(1+\frac{n_{\mathrm{prior}}}{n_w}\right)\mathrm{RSS}_w(\beta) \right). \]

Hence, for \(n_{\mathrm{prior}}+n_w>2\), \[ E(\tau^{-1}\mid y,\beta) = \frac{b_n}{a_n-1} = \frac{\left(\frac{n_{\mathrm{prior}}+n_w}{n_w}\right)\mathrm{RSS}_w(\beta)} {n_{\mathrm{prior}}+n_w-2}. \]

This expression holds for any fixed \(\beta\).
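
As a small helper, this expectation is a one-liner; the function name and the example numbers below are purely illustrative:

## E(phi | y, beta) under the Chapter 11 calibration (sketch).
posterior_mean_phi <- function(rss_w, n_w, n_prior) {
  stopifnot(n_prior + n_w > 2)
  ((n_prior + n_w) / n_w) * rss_w / (n_prior + n_w - 2)
}
posterior_mean_phi(rss_w = 4.8, n_w = 20, n_prior = 2)     # 1.1 * 4.8 / 20 = 0.264
posterior_mean_phi(rss_w = 4.8, n_w = 20, n_prior = 1e-6)  # near 4.8 / 18 = 0.2667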

Choice \(\beta=\beta_\star\) (posterior mode from dNormal_Gamma) and weak-prior limit

If \(\beta=\beta_\star\), where \(\beta_\star\) is the posterior mode from the dNormal_Gamma fit, the same formula applies with \(\mathrm{RSS}_w(\beta_\star)\).

Using the weighted least-squares decomposition: \[ \mathrm{RSS}_w(\beta) = \mathrm{RSS}_w(\hat\beta) + (\beta-\hat\beta)^\top G(\beta-\hat\beta),\qquad G=X^\top W X, \] so \[ \mathrm{RSS}_w(\beta_\star) = \mathrm{RSS}_w(\hat\beta) + (\beta_\star-\hat\beta)^\top G(\beta_\star-\hat\beta). \]

Under scalar Zellner weighting, \(\beta_\star=(1-pwt)\hat\beta+pwt\,\mu_0\), giving \[ \mathrm{RSS}_w(\beta_\star) = \mathrm{RSS}_w(\hat\beta) + pwt^2(\hat\beta-\mu_0)^\top G(\hat\beta-\mu_0). \]

As \(n_{\mathrm{prior}}\to 0\) with \(n_w\) fixed, \(pwt\to 0\), so \[ \mathrm{RSS}_w(\beta_\star)\to \mathrm{RSS}_w(\hat\beta), \] and therefore \[ E(\tau^{-1}\mid y,\beta_\star) \to \frac{\mathrm{RSS}_w(\hat\beta)}{n_w-2}, \] the classical weighted residual-variance limit.

References

Bailey, R. A., and LeRoy J. Simon. 1960. “Two Studies in Automobile Insurance Ratemaking.” ASTIN Bulletin, 192–217.
Canadian Automobile Insurance Claims (Carinsca). 2024. Dataset page http://www.statsci.org/data/general/carinsca.html.
Dobson, A. J. 1990. An Introduction to Generalized Linear Models. Chapman & Hall.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. CRC Press.
Lindley, D. V., and A. F. M. Smith. 1972. “Bayes Estimates for the Linear Model.” Journal of the Royal Statistical Society. Series B (Methodological) 34 (1): 1–41. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. Chapman & Hall.
O’Hagan, Anthony, and Jonathan Forster. 2004. Kendall’s Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold.
Robert, Christian P., and George Casella. 2004. Monte Carlo Statistical Methods. 2nd ed. Springer.