---
title: "Chapter A04: Directional Tail Diagnostics for Prior-Posterior Disagreement"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: REFERENCES.bib
reference-section-title: References
vignette: >
  %\VignetteIndexEntry{Chapter A04: Directional Tail Diagnostics for Prior-Posterior Disagreement}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, echo = FALSE}
library(glmbayes)
```

## Directional tail diagnostic (current implementation)

This appendix documents the current `directional_tail()` diagnostic, how to call it directly,
and how it is used inside `summary.glmb()`.

The diagnostic measures prior--posterior disagreement in a multivariate direction [@EvansMoshonov2006] rather than
coefficient-by-coefficient only. It reports:

- `mahalanobis_shift`: the standardized prior-posterior shift magnitude in whitened space
- `p_directional`: directional tail probability, `P(delta^T Z <= 0)`
- `delta`: posterior mean shift vector in whitened coordinates

### Theoretical interpretation and relation to t and F

Let `mu0` denote the reference vector (prior mean by default, or a null/reference mode), and let
\(\beta^{(m)}\) be posterior draws. `directional_tail()` computes a whitening map from posterior precision
so that

\[
Z^{(m)} = W\left(\beta^{(m)} - \mu_0\right),
\]

with \(W^\top W = \text{Prec}_{post}\). In this scale, Euclidean distance corresponds to
Mahalanobis distance in the original coefficient space.

Define

\[
\delta = E[Z \mid y], \qquad d = \|\delta\|_2.
\]

The function returns \(d\) as `mahalanobis_shift`, and the directional-tail probability

\[
p_{\text{dir}} = P\!\left(\delta^\top Z \le 0 \mid y\right),
\]

estimated by posterior draws.

Intuition: \(\delta\) is the posterior mean shift away from the reference in whitened space.
The event \(\delta^\top Z \le 0\) is the half-space opposite the direction of disagreement, so
\(p_{\text{dir}}\) quantifies how much posterior mass lies "against" that shift.

Under a Gaussian approximation in whitened space,
\[
Z \mid y \approx N(\delta, I_p),
\]
so that, with \(u = \delta/\|\delta\|\),
\[
u^\top Z \sim N(d, 1), \qquad
p_{\text{dir}} = P(u^\top Z \le 0) = \Phi(-d).
\]
So the directional tail is a one-dimensional tail probability driven entirely by the standardized
distance \(d\).
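
The computation can be sketched directly from posterior draws. The sketch below uses illustrative
stand-in objects (`beta_draws`, `prec_post`, `mu0`), not the package internals, and checks the
Monte Carlo directional tail against its Gaussian-approximation analogue \(\Phi(-d)\):

```{r directional-tail-whitening-sketch, eval=FALSE}
# Illustrative stand-ins: draws x p matrix, posterior precision, reference vector
set.seed(1)
prec_post  <- matrix(c(4, 1, 1, 3), 2, 2)
beta_draws <- MASS::mvrnorm(5000, mu = c(0.9, -0.4), Sigma = solve(prec_post))
mu0        <- c(0, 0)

W <- chol(prec_post)                         # so that t(W) %*% W = prec_post
Z <- t(W %*% t(sweep(beta_draws, 2, mu0)))   # whitened, centered draws
delta <- colMeans(Z)                         # posterior mean shift in whitened space
d <- sqrt(sum(delta^2))                      # mahalanobis_shift
mean(Z %*% delta <= 0)                       # Monte Carlo p_directional
pnorm(-d)                                    # Gaussian-approximation analogue
```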

#### Relation to univariate t statistics

For \(p=1\), whitening reduces to scalar standardization and
\[
d = \frac{\mu_{post} - \mu_0}{\sigma_{post}},
\]
so, up to sign, \(d\) is the posterior z/t-type contrast against \(\mu_0\). In this case:

- two-sided t/z-style tail: \(2\Phi(-|d|)\) (or Student-t analogue with df)
- directional tail used here: \(\Phi(-|d|)\) when the direction is set by \(\delta\)

Thus `directional_tail()` plays the same role as a one-sided standardized test in the univariate case,
while still extending naturally to multivariate models.
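
As a small numeric illustration (hypothetical values, not package output), the \(p = 1\) case reduces
to a single standardized contrast:

```{r directional-tail-scalar-reduction, eval=FALSE}
mu_post <- 1.2; sigma_post <- 0.5; mu0 <- 0   # illustrative scalar posterior summary
d <- (mu_post - mu0) / sigma_post             # standardized posterior contrast
pnorm(-abs(d))                                # directional tail
2 * pnorm(-abs(d))                            # two-sided z-style analogue
```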

#### Relation to multivariate Wald/F-style tests

The squared shift
\[
d^2 = \delta^\top \delta
\]
is the whitened quadratic distance from the reference. This is the same geometric core as
multivariate Wald statistics (and, after scaling, Hotelling/F forms): larger \(d^2\) means stronger
global departure from the reference.

Difference in emphasis:

- Wald/Hotelling/F use **radial** evidence (any direction).
- `directional_tail()` uses **directional half-space** evidence along \(\delta\).

Under the Gaussian approximation, both are monotone in \(d\): as \(d\) increases,
quadratic-test evidence strengthens and \(p_{\text{dir}}=\Phi(-d)\) decreases.

In this sense, `directional_tail()` is a directional companion to multivariate Wald/F-style testing:
it retains the same Mahalanobis geometry but reports evidence on an interpretable one-sided directional
probability scale.
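
A quick sketch of the two readings of the same shift, using an illustrative whitened shift vector
(`delta` is a stand-in; the chi-square is the large-sample reference for the radial form):

```{r directional-tail-wald-sketch, eval=FALSE}
delta <- c(1.4, -0.6)                                # illustrative whitened posterior shift
d2 <- sum(delta^2)                                   # whitened quadratic distance (Wald core)
pchisq(d2, df = length(delta), lower.tail = FALSE)   # radial (any-direction) evidence
pnorm(-sqrt(d2))                                     # directional evidence along delta
```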

### Interpretation: what probability is being assessed?

This is the key conceptual difference from classical p-values [@BernardoSmith1994].

In a classical t-test, the p-value is a **sampling probability under the null**:
\[
P_{H_0}\!\left(T(Y^{rep}) \ge T(y_{obs})\right),
\]
that is, how extreme the observed statistic would be in repeated samples if \(H_0\) were true.

By contrast, the directional-tail quantity here is a **posterior probability conditional on the observed data**:
\[
p_{\text{dir}} = P\!\left(\delta^\top Z \le 0 \mid y,\ \text{model},\ \text{prior},\ \mu_0\right).
\]

So the probability being assessed is not a prior probability and not a frequentist repeated-sampling
probability. It is posterior mass, after updating by the data.

Practical interpretation:

- Small `p_directional`: posterior mass is concentrated in the direction away from the reference \(\mu_0\),
  indicating stronger directional disagreement.
- Large `p_directional` (near 0.5): weak directional separation from the reference.
- Very large values (close to 1) can occur when the sign of \(\delta\) is reversed by construction or when
  the chosen reference is on the opposite side of most posterior mass.

For `summary.glmb()`, this means:

- `dir_tail` (vs prior): posterior directional disagreement with the prior mean.
- `dir_tail_null` (vs null): posterior directional disagreement with the null/intercept-only reference used in
  the summary routine.

#### Univariate connection to known Bayesian tail/sign summaries

In the one-parameter case, directional tail reduces to a one-sided posterior tail area relative to \(\mu_0\):

- if posterior shift is positive: \(p_{\text{dir}} = P(\beta \le \mu_0 \mid y)\)
- if posterior shift is negative: \(p_{\text{dir}} = P(\beta \ge \mu_0 \mid y)\)

This is closely related to well-known Bayesian sign/tail summaries (including posterior sign probability /
"probability of direction" style summaries in applied Bayesian reporting [@Gelman2013]). In that language, for a scalar
parameter, `p_directional` is essentially the opposite-side posterior mass, and the sign probability is
approximately \(1 - p_{\text{dir}}\).
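
A scalar sketch of that correspondence, with illustrative draws standing in for a single posterior
coefficient:

```{r directional-tail-sign-sketch, eval=FALSE}
beta1 <- rnorm(4000, mean = 0.8, sd = 0.4)    # illustrative posterior draws for one coefficient
mu0   <- 0                                    # scalar reference
shift <- mean(beta1) - mu0
p_dir <- if (shift > 0) mean(beta1 <= mu0) else mean(beta1 >= mu0)  # opposite-side posterior mass
1 - p_dir                                     # "probability of direction" style summary
```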

The main contribution of `directional_tail()` is the **multivariate extension**: it uses whitening plus
projection onto the posterior shift direction to provide a single coherent directional-tail probability even
when coefficients are correlated and inference is genuinely multivariate.

### How magnitude compares to classical tests

Because the targets differ (posterior mass vs repeated-sampling extremeness), `p_directional` is not
numerically identical to a classical p-value. Still, the two are often comparable through the standardized
shift magnitude \(d\).

In the scalar Gaussian approximation, the directional tail and its two-sided posterior analogue \(p_{\text{post,2s}}\) are
\[
p_{\text{dir}} \approx \Phi(-|d|), \qquad
p_{\text{post,2s}} \approx 2\Phi(-|d|),
\]
while a classical two-sided p-value is typically
\[
p_{\text{classical,2s}} \approx 2\{1-F_t(|t|;\nu)\}
\]
(or \(2\Phi(-|z|)\) asymptotically).
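
For a concrete sense of scale (hypothetical values), the three quantities at the same standardized
shift are:

```{r directional-tail-magnitude-compare, eval=FALSE}
d <- 2.1; nu <- 18              # illustrative standardized shift and residual df
pnorm(-abs(d))                  # directional posterior tail
2 * pnorm(-abs(d))              # two-sided posterior analogue
2 * pt(-abs(d), df = nu)        # classical two-sided t tail
```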

#### Strong-prior regime

When the prior is strong relative to the likelihood, two effects dominate:

1. **Centering effect**: posterior mean is pulled toward the prior/reference.
2. **Precision effect**: posterior covariance contracts along prior-informed directions.

Consequences for `p_directional` vs classical p-values:

- If prior and null/reference are aligned with each other (and with little data contradiction), the posterior
  shift from \(\mu_0\) is usually smaller, so \(d\) is smaller and `p_directional` tends to be larger
  (weaker apparent evidence against the reference) than classical p-values.
- If the prior is strong but centered away from the classical null in the same direction as the data, \(d\)
  can increase and `p_directional` can be smaller (stronger evidence in that direction) than classical tails.
- In multivariate settings, this divergence is often anisotropic: prior strength can increase/decrease evidence
  differently across directions, and `directional_tail()` summarizes the net directional result along \(\delta\).

#### Weak-prior (data-dominant) limit

As prior precision becomes small relative to information in the data:

- posterior mean approaches the MLE (or quasi-MLE),
- posterior covariance approaches the inverse observed information,
- \(d\) approaches the usual standardized likelihood-based contrast.

Hence in the univariate case, `p_directional` approaches a one-sided likelihood-based tail measure, and
\(2\,p_{\text{dir}}\) approaches the familiar two-sided z/t-style tail (asymptotically).

In the multivariate case, \(d^2\) approaches the corresponding Wald quadratic form (and its Hotelling/F
scaling in finite samples), so Bayesian directional and classical quadratic tests tend to agree more closely
in large-sample/weak-prior regimes.

#### Scalar decomposition (normal-gamma heuristic)

For one coefficient, the posterior-vs-classical comparison can be decomposed into a center effect and a
scale effect. Write
\[
t_{\text{post}} = \frac{\hat\beta_{\text{post}}-\mu_0}{\text{SE}_{\text{post}}}, \qquad
t_{\text{class}} = \frac{\hat\beta_{\text{MLE}}-\mu_0}{\text{SE}_{\text{class}}}.
\]
Then
\[
t_{\text{post}}
= t_{\text{class}}
\times
\underbrace{\frac{\hat\beta_{\text{post}}-\mu_0}{\hat\beta_{\text{MLE}}-\mu_0}}_{\text{center/shrinkage factor } \kappa}
\times
\underbrace{\frac{\text{SE}_{\text{class}}}{\text{SE}_{\text{post}}}}_{\text{scale factor}}.
\]

In conjugate normal-gamma style parameterizations, a common heuristic is
\[
\text{SE}_{\text{post}}
\approx
\text{SE}_{\text{class}}
\sqrt{\frac{n-p}{n+n_{\text{prior}}}},
\]
so
\[
\frac{\text{SE}_{\text{class}}}{\text{SE}_{\text{post}}}
\approx
\sqrt{\frac{n+n_{\text{prior}}}{n-p}}.
\]
If one also uses a simple shrinkage approximation \(\kappa \approx \frac{n}{n+n_{\text{prior}}}\), then
\[
t_{\text{post}}
\approx
t_{\text{class}}
\cdot
\frac{n}{n+n_{\text{prior}}}
\cdot
\sqrt{\frac{n+n_{\text{prior}}}{n-p}}.
\]

Using \(n_{\text{prior}} = n\,\frac{\text{pwt}}{1-\text{pwt}}\), this can be rewritten as
\[
t_{\text{post}}
\approx
t_{\text{class}}
\cdot
\sqrt{1-\text{pwt}}
\cdot
\sqrt{\frac{n}{n-p}}.
\]

This expression shows why posterior directional tails can be either larger or smaller than classical tails:

- **centering/shrinkage (\(\kappa\))** tends to pull \(t_{\text{post}}\) toward 0,
- **scale correction** can push \(|t_{\text{post}}|\) upward,
- the net result depends on their product.

A useful threshold from this heuristic is:
\[
\frac{n}{n+n_{\text{prior}}}\sqrt{\frac{n+n_{\text{prior}}}{n-p}} > 1
\;\;\Longleftrightarrow\;\;
n_{\text{prior}} < p\,\frac{n}{n-p},
\]
so the combined center-and-scale factor inflates the posterior t only while the prior pseudo-sample size
stays below roughly \(p\) (since \(p\,n/(n-p) \approx p\) for \(n \gg p\)); once \(n_{\text{prior}}\)
exceeds that threshold, center shrinkage dominates the scale correction and the posterior tail is larger
than the classical one.
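
A quick numeric check of the heuristic (illustrative values, not package output) confirms that the two
forms above agree and shows where the inflation threshold sits:

```{r directional-tail-heuristic-check, eval=FALSE}
n <- 50; p <- 2; pwt <- 0.2
n_prior  <- n * pwt / (1 - pwt)               # prior pseudo-sample size implied by pwt
t_class  <- 2.4                               # illustrative classical t
t_post_a <- t_class * (n / (n + n_prior)) * sqrt((n + n_prior) / (n - p))
t_post_b <- t_class * sqrt(1 - pwt) * sqrt(n / (n - p))
c(t_post_a, t_post_b)                         # the two forms agree
p * n / (n - p)                               # prior pseudo-sample sizes below this inflate t_post
```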

This is the scalar analogue of the multivariate directional-tail story: center and scale are both active,
and the final probability is driven by the standardized shift magnitude after whitening.

### Incorporating the package example (`Ex_directional_tail.R`)

The package example already demonstrates the intended workflow and interpretation. In particular, it shows:

- fitting a model (`lmb`/`glmb`),
- calling `directional_tail(fit)`,
- extracting `mahalanobis_shift`, `p_directional`, and the draw-level objects,
- visualizing tail vs non-tail draws in whitened and raw coefficient spaces.

A compact usage sketch from that example is:

```{r directional-tail-example-sketched, eval=FALSE}
example("directional_tail", package = "glmbayes")

# Core objects used in interpretation:
dt <- directional_tail(fit)
dt$mahalanobis_shift
dt$p_directional

# Same quantities are also surfaced in summary output:
s <- summary(fit)
s$dir_tail
s$dir_tail_null
```

#### Illustrative plots from `Ex_directional_tail.R`

The example script also includes two useful diagnostic visualizations:

1. **Whitened space plot**: tail vs non-tail draws, directional boundary, and shift radius.
2. **Raw coefficient space plot**: tail vs non-tail draws with prior and posterior centers.

```{r directional-tail-plot-setup, include=FALSE}
set.seed(360)

ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
dat2 <- data.frame(weight, group)

ps <- Prior_Setup(
  weight ~ group,
  family = gaussian(),
  pwt = 0.01,
  intercept_source = "null_model",
  effects_source = "null_effects",
  data = dat2
)

fit <- lmb(
  weight ~ group,
  dNormal(ps$mu, ps$Sigma, dispersion = ps$dispersion),
  data = dat2,
  n = 3000
)
```

```{r directional-tail-plot-whitened, eval=TRUE, fig.width=7, fig.height=5}
dt <- directional_tail(fit)

Z     <- dt$draws$Z
flag  <- dt$draws$is_tail
delta <- dt$delta
w     <- delta

plot(
  Z,
  col = ifelse(flag, "red", "blue"),
  pch = 19,
  xlab = "Z1",
  ylab = "Z2",
  main = "Directional Tail Diagnostic (Whitened Space)"
)

# Decision boundary orthogonal to direction vector
abline(a = 0, b = -w[1] / w[2], col = "darkgreen", lty = 2)

# Radius corresponding to Mahalanobis shift
r <- sqrt(sum(delta^2))
symbols(
  delta[1], delta[2],
  circles = r,
  inches  = FALSE,
  add     = TRUE,
  lwd     = 2,
  fg      = "gray"
)

points(0, 0, pch = 4, col = "black", lwd = 2)          # reference center in whitened space
points(delta[1], delta[2], pch = 3, col = "purple", lwd = 2) # posterior shift
```

```{r directional-tail-plot-raw, eval=TRUE, fig.width=7, fig.height=5}
B       <- dt$draws$B
flag    <- dt$draws$is_tail
mu0     <- as.numeric(fit$Prior$mean)
mu_post <- colMeans(B)

plot(
  B,
  col  = ifelse(flag, "red", "blue"),
  pch  = 19,
  xlab = "Coefficient 1",
  ylab = "Coefficient 2",
  main = "Directional Tail Diagnostic (Raw Coefficient Space)"
)

points(mu0[1], mu0[2], pch = 4, col = "black", cex = 1.5)       # reference center
points(mu_post[1], mu_post[2], pch = 3, col = "darkgreen", cex = 1.5)  # posterior center

legend(
  "topright",
  legend = c("Tail draws", "Non-tail draws", "Reference", "Posterior"),
  col    = c("red", "blue", "black", "darkgreen"),
  pch    = c(19, 19, 4, 3),
  bty    = "n"
)
```

### How `directional_tail()` is called

The function is exported and can be called directly on a fitted object (`glmb` or `lmb`):

```{r directional-tail-direct, eval=FALSE}
fit <- glmb(
  counts ~ outcome + treatment,
  family = poisson(),
  pfamily = dNormal(mu = mu, Sigma = V)
)

dt_prior <- directional_tail(fit)      # reference = prior mean
dt_prior
print(dt_prior)
```

To compare against a user-specified reference (for example a null/intercept-only mode), pass
`mu0` explicitly:

```{r directional-tail-null, eval=FALSE}
mu0 <- rep(0, length(fit$Prior$mean))
names(mu0) <- names(fit$Prior$mean)
mu0["(Intercept)"] <- coef(glm(counts ~ 1, family = poisson(), data = fit$data))[1]

dt_null <- directional_tail(fit, mu0 = mu0)
dt_null
```

### How `summary.glmb()` uses directional tail

`summary.glmb()` computes two directional diagnostics internally:

1. `directional_tail(object)` : posterior vs prior mean
2. `directional_tail(object, mu0 = null_est_full)` : posterior vs null reference

These are returned in the summary object:

```{r directional-tail-summary, eval=FALSE}
s <- summary(fit)

s$dir_tail        # vs prior
s$dir_tail_null   # vs null
```

The print method displays a compact "Directional Tail Summaries" table with:

- Mahalanobis Distance (`vs Null`, `vs Prior`)
- Tail Probability (`vs Null`, `vs Prior`)

### Minimal reproducible workflow

The package example `inst/examples/Ex_directional_tail.R` provides a full reproducible script.
A minimal call pattern is:

```{r directional-tail-minimal, eval=FALSE}
ps <- Prior_Setup(weight ~ group, family = gaussian(), data = dat2)
fit <- lmb(weight ~ group, dNormal(ps$mu, ps$Sigma, dispersion = ps$dispersion), data = dat2, n = 10000)

dt <- directional_tail(fit)
print(dt)
```

---

## Supplementary notes: why the Bayesian tail can be smaller


There are three distinct, concrete effects that determine whether the Bayesian posterior tail probability for a coefficient is smaller than the classical t tail.

### 1.1 Different effective degrees of freedom (\(\nu\))

In the conjugate normal-gamma parameterization, the posterior degrees of freedom are \(\nu = 2\alpha_{0} + n\).
If the prior has a positive \(\alpha_{0}\) (an informative prior on the variance),
the posterior \(\nu\) will be larger than the classical \(n - p\).
Increasing \(\nu\) makes the Student-t closer to normal (lighter tails), so the tail probability
at the same \(t\) value is smaller.

- Even for vaguely informative priors, the prior contributes "pseudo-observations" to the posterior degrees of freedom. This is often the main cause of smaller Bayesian tails.

### 1.2 Different standard error (scale) used in the denominator

The Bayesian marginal variance for \(\beta_{j}\) is
\(\frac{\beta_{n}}{\alpha_{n}} V_{n}[j,j]\).

Two effects happen here:

- \(V_{n}\) generally shrinks relative to \((X^{\top}X)^{-1}\) because of the prior precision \(V_{0}^{-1}\), and
- the posterior scale factor \(\frac{\beta_{n}}{\alpha_{n}}\) can be smaller than the classical estimate \(\hat{s}^{2}\), depending on the prior hyperparameters and the data (e.g., when the prior favors a smaller variance).

Either effect reduces the denominator and can increase the raw \(t\) value, but combined with a larger \(\nu\) the net tail probability can still be smaller.


So one must compare both the numerator (posterior mean \(m_{n,j}\) vs. the OLS estimate \(\hat{\beta}_{j}\))
and the denominator (posterior scale vs. \(\hat{s}\sqrt{(X^{\top}X)^{-1}_{jj}}\)).
Prior shrinkage often makes \(m_{n,j}\) closer to zero and \(V_{n}[j,j]\) smaller,
both of which reduce evidence against \(H_{0}\).

### 1.3 Different centering (posterior mean vs. OLS) and prior shrinkage

- If the prior pulls \(m_{n,j}\) toward \(\beta_{0}\) (e.g., 0), the numerator of the Bayesian \(t\) is
  smaller in magnitude than \(\hat{\beta}_{j}\), which by itself weakens the evidence against \(H_{0}\)
  (a larger tail probability). Whether the net posterior tail is smaller than the classical one depends on
  how this centering effect trades off against (1.1) and (1.2); when the degrees-of-freedom and scale
  effects dominate, the posterior tail probability can still be noticeably smaller than the classical one.
  Griffin & Brown and other shrinkage-prior literature discuss this tradeoff explicitly for normal-gamma
  shrinkage priors.


For Bayesian linear models with normal-gamma priors, a useful heuristic is that the standard errors
used in t-statistic calculations are adjusted relative to the classical estimates. The adjusted
standard error is

\[
\text{StdErr}_{t} = \text{StdErr} \cdot \sqrt{ \frac{n - p}{n + n_{\text{prior}}} },
\]

which leads to

\[
t_{\text{adj},i}
= \frac{n}{n + n_{\text{prior}}}
  \, t_{\text{unadjusted},i}
  \, \sqrt{\frac{n + n_{\text{prior}}}{n - p}},
\]

where the first ratio is the shrinkage effect (which lowers the t-value) and the square-root factor is
the impact of the standard-error adjustment (which boosts the t-value).


Equivalently,

\[
t_{\text{adj},i} = \sqrt{\frac{n}{n + n_{\text{prior}}}} \cdot \sqrt{\frac{n}{n - p}} \cdot t_{\text{unadjusted},i},
\]

or, written in terms of the prior weight,

\[
t_{\text{adj},i} = \sqrt{1-\text{pwt}} \cdot \sqrt{\frac{n}{n - p}} \cdot t_{\text{unadjusted},i}.
\]


It can be seen from the above that the t-statistic in the Bayesian context is larger than the
unadjusted t-statistic (i.e., inflated) whenever

\[
n_{\text{prior}} < p \cdot \frac{n}{n - p},
\]

which implies tail probabilities smaller than their classical counterparts.

The \(\sqrt{1-\text{pwt}}\) form follows because, with \(n_{\text{prior}} = n\,\frac{\text{pwt}}{1 - \text{pwt}}\),

\[
n + n_{\text{prior}}
= n + n\left(\frac{\text{pwt}}{1 - \text{pwt}}\right)
= \frac{n(1 - \text{pwt}) + n\,\text{pwt}}{1 - \text{pwt}}
= \frac{n}{1 - \text{pwt}}.
\]



