---
title: "Statistical Methodology"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Statistical Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Model overview

`seroreconstruct` implements a Bayesian framework for inferring influenza virus
infection from serial hemagglutination inhibition (HAI) antibody titers.
The model accounts for:

- Antibody boosting after infection
- Antibody waning over time
- Measurement error in the HAI assay
- Heterogeneity in baseline titers and infection risk

The full statistical details are described in
[Tsang et al. (2022) *Nature Communications* 13:1557](https://doi.org/10.1038/s41467-022-29310-8).

## Notation

| Symbol | Description |
|--------|-------------|
| $Y_{i,t}$ | Observed (log2-transformed) HAI titer for individual $i$ at time $t$ |
| $Z_i \in \{0,1\}$ | Latent infection status for individual $i$ |
| $T_i$ | Latent infection time (if infected) |
| $b$ | Antibody boosting (log2 fold-rise after infection) |
| $w$ | Antibody waning rate (log2 fold-decrease per year) |
| $\sigma_r$ | Random measurement error |
| $p_{2f}$ | Two-fold (one-dilution) measurement error probability |
| $\lambda_a$ | Age-specific infection hazard scaling |
| $\beta$ | HAI titer protection coefficient |

## Antibody dynamics model

For an individual $i$ with baseline titer $B_i$ (log2 scale), the true
antibody trajectory is:

**If not infected** ($Z_i = 0$):
$$\tilde{Y}_i(t) = B_i \cdot e^{-w \cdot t / 365}$$

**If infected** ($Z_i = 1$ at time $T_i$):

- Pre-infection: exponential decay from baseline
- Infection: linear 14-day boosting ramp of magnitude $b$ (log2 scale)
- Post-infection: exponential decay from the boosted level

## Measurement error model

The observed titer $Y_{i,t}$ is related to the true titer $\tilde{Y}_i(t)$ by:

1. **Random error**: Gaussian noise with standard deviation $\sigma_r$
2. **Two-fold error**: with probability $p_{2f}$, the observed titer is shifted
   by one dilution step (the HAI assay measures in 2-fold dilutions:
   $<$10, 10, 20, 40, 80, ...)

These parameters are properties of the laboratory assay and are shared across
all individuals and groups.

## Infection risk model

The probability of infection for individual $i$ in age group $a$ depends on
influenza activity and baseline HAI titer:

$$P(Z_i = 1) = 1 - \exp\left(-\lambda_a \cdot \sum_t \text{ILI}(t) \cdot \exp(\beta \cdot B_i / 10)\right)$$

where $\text{ILI}(t)$ is the daily influenza-like illness rate, $\lambda_a$ is
the age-specific infection hazard scaling, and $\beta$ is the HAI titer
protection coefficient (negative values indicate protective effect of higher
baseline titers).

## Prior distributions

The model uses weakly informative priors. The MCMC uses Metropolis-Hastings
updates with adaptive proposal tuning during burn-in. Key features:

- Infection times $T_i$ are sampled uniformly within the follow-up period
- Baseline titers are drawn from the empirical distribution of observed
  pre-season titers
- Parameter proposals use truncated normal distributions to respect
  natural constraints (e.g., boosting $> 0$, waning $> 0$)

## MCMC implementation

The posterior is sampled using a custom C++ MCMC sampler (via Rcpp/RcppArmadillo)
with the following structure:

1. **Gibbs updates** for discrete parameters (infection status, infection time)
2. **Metropolis-Hastings updates** for continuous parameters (boosting, waning,
   error, infection hazard, HAI coefficient)
3. **Parallel chains** via RcppParallel (optional)

### Convergence diagnostics

After fitting, check convergence using:

- **Trace plots**: `plot_diagnostics(fit)` — look for stationarity (no trends)
  and good mixing (no long excursions)
- **Effective sample size**: internal ESS calculation uses the initial positive
  sequence estimator (Geyer, 1992)
- **Multiple chains**: fit the model multiple times with different random seeds
  and compare posterior distributions

### Recommendations for production analyses

| Setting | Minimum | Recommended |
|---------|---------|-------------|
| `n_iteration` | 50,000 | 200,000 |
| `burnin` | 25,000 | 100,000 |
| `thinning` | 5 | 10 |
| Independent chains | 2 | 3–4 |

## Parameter layout

The posterior samples are stored in `fit$posterior_model_parameter`, a matrix
with rows = posterior samples and columns = parameters:

| Columns | Parameters | Count |
|---------|------------|-------|
| 1–2 | Random error, two-fold error | 2 |
| 3–6 | Boosting and waning (2 groups × 2 params) | 4 |
| 7 to 6+$G$ | Infection hazard per group ($G$ groups) | $G$ |
| 6+$G$+1 | HAI titer protection coefficient | 1 |

Total: $7 + G$ parameters per season.

For multi-season fits, infection hazard and HAI coefficient are replicated
per season: $6 + G \times S + S$ total parameters.

## Shared parameter models

When comparing groups (e.g., vaccinated vs unvaccinated), measurement error
is always shared because it reflects the laboratory assay, not the biology.
Boosting and waning can optionally be shared or group-specific:

- `shared = c("error", "boosting_waning")`: all groups share error and
  antibody dynamics; only infection risk differs
- `shared = "error"`: error is shared but each group has its own boosting
  and waning rates

## References

Tsang TK, Perera RAPM, Fang VJ, Wong JY, Shiu EY, So HC, Ip DKM,
Malik Peiris JS, Leung GM, Cowling BJ, Cauchemez S. (2022).
Reconstructing antibody dynamics to estimate the risk of influenza virus
infection. *Nature Communications* 13:1557.
doi: [10.1038/s41467-022-29310-8](https://doi.org/10.1038/s41467-022-29310-8)

Geyer CJ. (1992). Practical Markov chain Monte Carlo. *Statistical Science*
7(4):473–483.
