---
title: "HTDV — Empirical and Simulation Validation"
author: "José Mauricio Gómez Julián"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{HTDV — Empirical and Simulation Validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  purl = FALSE
)
```

# Purpose

`HTDV` is a hypothesis-testing framework for dependent and unbalanced data
under strong-mixing conditions. Three inferential layers run in parallel:
hierarchical Bayesian inference (HMC via Stan), HAR-Wald frequentist
inference (Andrews-Kiefer-Vogelsang), and stationary block bootstrap
(Patton-Politis-White). The framework exists in order to *triangulate* an
inferential answer when no single layer is universally valid in the
finite-sample regime.

This vignette summarizes the empirical evidence that the framework's
calibration claims are not stylistic but operationally testable. It
collects two pieces of validation:

1. A pre-registered factorial Monte Carlo study covering 1024 cells of
   the (sample size × dependence × tail × imbalance × shift) design.
2. Three external benchmarks against published literature on public
   data: post-1984 US CPI inflation against Stock & Watson (2007), long-
   run log-CAPE against Campbell & Shiller (1998), and the US-Canada
   10-year yield differential against the iid Welch baseline.

Both validations are reproducible end-to-end from scripts shipped in the
companion paper repository. The summary tables are loaded into the
package as datasets so users can interrogate them without re-running.

```{r packages, eval = TRUE}
library(HTDV)
data(htdv_sim_summary)
data(htdv_empirical_benchmarks)
```

# 1. Factorial simulation study

The full factorial design of Section 12-bis of the companion paper crosses
five factors at four levels each:

- **Sample size** $n \in \{40, 80, 200, 500\}$
- **AR(1) coefficient** $\phi \in \{0, 0.3, 0.6, 0.85\}$
- **Innovation tail** $\eta_t \sim \{\mathcal{N}(0,1),\,t_5,\,t_3,\,t_{2.1}\}$
- **Imbalance ratio** $n_1/n_2 \in \{1, 1.5, 3, 6\}$
- **Location shift** $\Delta \in \{0, 0.1, 0.25, 0.5\} \cdot \sigma_\infty$

This yields $4^5 = 1024$ cells, each replicated $R = 500$ times and evaluated
by the three inferential layers (HAR, bootstrap, Bayes). The full run took
31.1 hours on a 16-core Linux workstation.
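
As a sanity check on the cell count, the grid can be reconstructed with base
R (the column names below are illustrative labels for this snippet, not
necessarily those of the shipped summary table):

```{r design-grid, eval = TRUE}
## Rebuild the five-factor design and confirm it has 4^5 = 1024 cells.
design <- expand.grid(
  n     = c(40, 80, 200, 500),
  phi   = c(0, 0.3, 0.6, 0.85),
  tail  = c("normal", "t5", "t3", "t2_1"),
  imb   = c(1, 1.5, 3, 6),
  delta = c(0, 0.1, 0.25, 0.5)
)
nrow(design)
```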

The summary table `htdv_sim_summary` carries one row per
(cell × layer) combination, with the following columns:

```{r summary-cols, eval = TRUE}
str(htdv_sim_summary)
```

## 1.1. Empirical size: the calibration story

```{r size-table, eval = TRUE}
size_cells <- htdv_sim_summary[htdv_sim_summary$delta == 0, ]
agg_phi <- aggregate(reject_rate ~ n + phi + layer, data = size_cells,
                     FUN = mean)
agg_phi$reject_rate <- round(agg_phi$reject_rate, 3)
reshape(agg_phi, idvar = c("n", "phi"), timevar = "layer",
        direction = "wide")
```

The Bayesian envelope holds nominal size (~0.05) across the entire grid.
HAR and bootstrap inflate dramatically under high persistence: at
$\phi=0.85$ and $n=40$, HAR rejects 60% and bootstrap rejects 47% of the
time when nominal is 5%.
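
To isolate that worst corner, filter the aggregated size table from the chunk
above; the rejection rates should come out near 0.60 (HAR), 0.47 (bootstrap),
and 0.05 (Bayes):

```{r size-worst-corner, eval = TRUE}
## Empirical size in the hardest null cell: n = 40, phi = 0.85.
agg_phi[agg_phi$n == 40 & agg_phi$phi == 0.85, ]
```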

## 1.2. Coverage (sign-corrected)

```{r coverage-by-layer, eval = TRUE}
agg_cov <- aggregate(coverage ~ layer, data = htdv_sim_summary,
                     FUN = function(x) c(mean = round(mean(x), 4),
                                          min = round(min(x), 3),
                                          max = round(max(x), 3)))
agg_cov
```

Bayesian coverage averages 94.4% (nominal 95%, observed minimum 88.4%).
HAR averages 82.0%; bootstrap 82.4%. Both classical layers fall short of
nominal under finite-sample dependence and—in the worst cells—lose
roughly a third of their nominal coverage.
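
The cells behind those minima can be listed directly from the shipped table,
using only the columns already shown above:

```{r coverage-worst, eval = TRUE}
## Classical-layer cells with the lowest sign-corrected coverage.
cl <- htdv_sim_summary[htdv_sim_summary$layer != "bayes", ]
head(cl[order(cl$coverage), c("n", "phi", "layer", "coverage")], 5)
```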

## 1.3. HMC diagnostics

```{r diag-by-region, eval = TRUE}
b <- htdv_sim_summary[htdv_sim_summary$layer == "bayes", ]
agg_diag <- aggregate(diag_pass_rate ~ n + phi, data = b, FUN = mean)
agg_diag$diag_pass_rate <- round(agg_diag$diag_pass_rate, 3)
reshape(agg_diag, idvar = "n", timevar = "phi", direction = "wide")
```

Diagnostic-pass rate is below 0.7 only in the corner where
$\phi=0.85$ and $n \le 80$, the limit-of-identification zone for the
AR(1) likelihood. Use `htdv_simstudy_warnings()` on a fresh simulation
output to flag those cells automatically.
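
For the shipped summary table (as opposed to a fresh simulation object), the
same cells can be flagged with a couple of lines of base R:

```{r diag-low-cells, eval = TRUE}
## Bayes cells whose HMC diagnostic-pass rate falls below 0.7.
low <- b[b$diag_pass_rate < 0.7, c("n", "phi", "diag_pass_rate")]
head(low[order(low$phi, low$n), ], 10)
```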

## 1.4. Pre-registered benchmark verdicts

| Benchmark | Pass rate (cells evaluated) | Reading |
|-----------|-----------------------------|---------|
| **B1** size control across layers | 0.32 (255 cells) | Failed *as a conjunction*; passes for Bayes alone in 98% of cells |
| **B2** power monotonicity in $\Delta$ | 0.99 (768 cell-layers) | Effectively passes (residual is MC noise) |
| **B3** Bayes coverage $\ge 0.93$ | 0.89 (1023 cells) | Passes; minimum 0.884 in the hardest corner |

The conjunctive failure of B1 is the operational evidence that motivates
the framework: HAR and bootstrap, calibrated against asymptotic
critical values, are not safe inferential anchors under strong
dependence at finite sample size. The Bayesian envelope is.
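
For orientation only (the pre-registered acceptance bands are defined in the
companion paper, not here), a crude tabulation of how often each layer's
empirical size stays under an arbitrary 0.075 cut-off should reproduce the
qualitative ranking, with Bayes close to 1 and the classical layers much lower:

```{r b1-rough-check, eval = TRUE}
## Illustrative cut-off only; NOT the pre-registered B1 criterion.
null_cells <- htdv_sim_summary[htdv_sim_summary$delta == 0, ]
round(tapply(null_cells$reject_rate <= 0.075, null_cells$layer, mean), 3)
```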

## 1.5. Reproducing the simulation

The simulation can be reproduced with the `run_simstudy.R` script in the
companion paper repository; its core is the following call:

```{r repro-sim, eval = FALSE}
res <- htdv_simstudy(
  n_grid    = c(40, 80, 200, 500),
  phi_grid  = c(0, 0.3, 0.6, 0.85),
  tail_grid = c("normal", "t5", "t3", "t2_1"),
  imb_grid  = c(1, 1.5, 3, 6),
  delta_grid = c(0, 0.1, 0.25, 0.5),
  R         = 500,
  seed      = 20260422,
  out_dir   = "simstudy_cells_dir"
)
summ <- htdv_simstudy_summary(res)
```

The per-cell results are written atomically to one RDS per cell in
`out_dir`; partial runs are fully resumable.
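
Because every cell is persisted as its own RDS file, the progress of a
partial run can be checked by counting files in `out_dir`:

```{r count-cells, eval = FALSE}
## One RDS per completed cell; a full run produces up to 1024 files.
cell_files <- list.files("simstudy_cells_dir", pattern = "\\.rds$",
                         full.names = TRUE)
length(cell_files)
```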

# 2. Empirical benchmarks

Three datasets are analyzed, each compared against a pre-existing reference
value from the published literature on the same data vintage. The summary
table is shipped as `htdv_empirical_benchmarks`:

```{r empirical-table, eval = TRUE}
htdv_empirical_benchmarks[, c("dataset", "n", "reference",
                              "reference_value", "har_ci_lo", "har_ci_hi",
                              "bayes_ci_lo", "bayes_ci_hi",
                              "boot_ci_lo", "boot_ci_hi",
                              "agreement_har", "agreement_bayes",
                              "agreement_boot")]
```

All three layers reproduce all three references: the `agreement_*` flags are
`TRUE` for every dataset-layer pair. The substantively interesting finding is
the *width* of the agreeing intervals, which scales with the persistence of
the underlying series.
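
The half-widths tabulated in Section 2.1 can be recomputed directly from the
interval columns shown above:

```{r halfwidths, eval = TRUE}
## Bayes-to-HAR half-width ratio, one row per benchmark dataset.
hw <- with(htdv_empirical_benchmarks, data.frame(
  dataset  = dataset,
  har_hw   = (har_ci_hi - har_ci_lo) / 2,
  bayes_hw = (bayes_ci_hi - bayes_ci_lo) / 2
))
hw$ratio <- round(hw$bayes_hw / hw$har_hw, 2)
hw
```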

## 2.1. The persistence ladder

| Dataset | $\widehat\phi$ | HAR half-width | Bayes half-width | Bayes/HAR |
|---------|----------------|----------------|------------------|------------------|
| FRED-MD CPI inflation, post-1984 | $\approx 0.45$ | 0.57 | 0.46 | 0.81 |
| Shiller log-CAPE, 1881-latest | $\approx 0.97$ | 0.25 | 0.70 | 2.80 |
| US-CA 10y yield differential | $\approx 0.99$ | 0.49 | 7.35 | 15.0 |

At moderate persistence, HAR and Bayes agree to within 20%. At
near-unit-root, the Bayesian envelope is 15× wider than HAR. Both layers
are technically asymptotically valid; only Bayes accounts honestly for
the finite-sample uncertainty inflation that occurs when $\phi \to 1$.

## 2.2. Practical guidance

If you are running HTDV on a new dataset, the following workflow gives
the most reliable inference:

1. Run the three layers (`htdv_lrv()` + Wald construction; `htdv_boot()`;
   `htdv_fit()`).
2. Compare the widths of the three intervals (see the sketch after this
   list). Disagreement at the scale of a third of an order of magnitude,
   roughly a factor of two, is a real signal: the underlying series is in a
   regime where HAR's asymptotic critical values are losing reliability.
3. In that regime, prefer the Bayesian envelope as the primary
   inferential answer and use HAR or bootstrap as cross-checks rather
   than the other way around.
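
A minimal sketch of step 2, with placeholder endpoints standing in for
whatever intervals the three layers actually return on your data:

```{r width-check, eval = TRUE}
## Placeholder intervals: replace with the endpoints from step 1.
ci_har   <- c(-0.12, 1.02)
ci_boot  <- c(-0.15, 1.08)
ci_bayes <- c(-0.90, 1.85)
widths <- c(har = diff(ci_har), boot = diff(ci_boot), bayes = diff(ci_bayes))
round(widths / min(widths), 2)  # ratios well above ~2 flag the regime of Section 2.1
```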

This is exactly what the framework is designed for: making the
calibration disagreement *visible* at the moment of inference.

# 3. Reproducing the empirical benchmarks

The required input files are all publicly downloadable and are named as
expected by the script `run_benchmarks.R`:

| Data file | Source URL | Provenance |
|-----------|------------|------------|
| `2026-03-MD.csv` | `stlouisfed.org/research/economists/mccracken/fred-databases` | McCracken-Ng FRED-MD vintage 2026-03 |
| `ie_data.xls` | `shillerdata.com` | Robert Shiller's online dataset |
| `us_canada_10y.csv` | `fred.stlouisfed.org/graph/fredgraph.csv?id=GS10,IRLTLT01CAM156N` | FRED multi-series download |
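
As a sketch, the yield-differential input can be fetched with base R,
assuming the multi-series `fredgraph.csv` endpoint from the table is still
live; the FRED-MD and Shiller files are downloaded manually from their
landing pages:

```{r download-yields, eval = FALSE}
## Saves both 10-year series into the file name expected by run_benchmarks.R.
download.file(
  "https://fred.stlouisfed.org/graph/fredgraph.csv?id=GS10,IRLTLT01CAM156N",
  destfile = "us_canada_10y.csv", mode = "wb"
)
```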

Total wall-clock time for the three benchmarks combined is approximately
10-20 minutes on a single core; the heavy step is the two-sample HMC fit
for the US-Canada yield comparison.

# References

McCracken, M. W., & Ng, S. (2016). FRED-MD: A monthly database for
macroeconomic research. *Journal of Business & Economic Statistics*,
34(4), 574-589.

Stock, J. H., & Watson, M. W. (2007). Why has US inflation become harder
to forecast? *Journal of Money, Credit and Banking*, 39, 3-33.

Campbell, J. Y., & Shiller, R. J. (1998). Valuation ratios and the
long-run stock market outlook. *Journal of Portfolio Management*,
24(2), 11-26.
