---
title: "Generating and validating synthetic clinical data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Generating and validating synthetic clinical data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Motivation

Sharing individual-level clinical data across institutions is often
restricted by privacy regulations and institutional review boards.
Synthetic data preserves the statistical properties of real data while
reducing re-identification risk, enabling multi-site collaboration
without data transfer.

## Example: synthesizing patient records

```{r setup}
library(syntheticdata)
```

```{r real-data}
set.seed(42)
real <- data.frame(
  age     = rnorm(500, mean = 65, sd = 12),
  sbp     = rnorm(500, mean = 135, sd = 22),
  sex     = sample(c("Male", "Female"), 500, replace = TRUE),
  smoking = sample(c("Never", "Former", "Current"), 500,
                   replace = TRUE, prob = c(0.4, 0.35, 0.25)),
  outcome = rbinom(500, 1, 0.28)
)
head(real)
```

## Parametric synthesis (Gaussian copula)

The default method estimates marginal distributions empirically and
captures the joint dependence structure via a Gaussian copula on
normal scores. This preserves both marginal shapes and pairwise
correlations.
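For intuition, the normal-scores construction can be sketched in base R on the numeric columns alone (the package also handles categorical variables; this sketch does not):

```{r copula-sketch}
# Minimal base-R illustration of the normal-scores copula idea,
# numeric columns only.
num <- real[, c("age", "sbp")]
n   <- nrow(num)

# 1. Transform each margin to normal scores via its empirical ranks.
z <- apply(num, 2, function(x) qnorm(rank(x) / (n + 1)))

# 2. Estimate the dependence structure on the normal scale.
R <- cor(z)

# 3. Draw correlated normals and map back through the empirical quantiles.
set.seed(1)
draws <- matrix(rnorm(n * ncol(num)), n) %*% chol(R)
syn_num <- as.data.frame(mapply(
  function(x, zc) quantile(x, pnorm(zc), names = FALSE),
  num, as.data.frame(draws)
))

# Pairwise correlation is approximately preserved by construction.
round(cor(syn_num) - cor(num), 2)
```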

```{r synthesize}
syn <- synthesize(real, method = "parametric", n = 500, seed = 1)
syn
```

## Validation

`validate_synthetic()` computes four classes of metrics:

```{r validate}
val <- validate_synthetic(syn)
val
```

- **KS statistic**: Kolmogorov-Smirnov distance between each real and
  synthetic marginal distribution (lower is better).
- **Correlation difference**: how well pairwise associations between
  variables are preserved (lower is better).
- **Discriminative AUC**: can a classifier tell real records from
  synthetic ones? Values near 0.5 mean the two are indistinguishable.
- **NN distance ratio**: privacy metric based on nearest-neighbour
  distances. Values above 1 indicate that synthetic records are not
  near-copies of real individuals.
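The discriminative AUC can be reproduced by hand to see what it measures. This is a sketch with a plain logistic regression; the package's classifier and exact procedure may differ:

```{r auc-sketch}
# Pool real and synthetic rows with a label, fit a classifier, and
# score how well it separates the two tables.
pool <- rbind(
  transform(real, is_real = 1),
  transform(syn,  is_real = 0)
)
fit <- glm(is_real ~ ., data = pool, family = binomial)
p   <- predict(fit, type = "response")

# AUC computed directly as the rank (Mann-Whitney) statistic.
pos <- p[pool$is_real == 1]
neg <- p[pool$is_real == 0]
mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "=="))
```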

## Comparing methods

`compare_methods()` runs all three synthesis methods on the same
data and returns a single comparison table:

```{r compare}
comp <- compare_methods(real, seed = 1)
comp
```

## Privacy risk assessment

`privacy_risk()` provides a deeper privacy audit with three metrics:
nearest-neighbor distance ratio, membership inference accuracy, and
(optionally) attribute disclosure risk for sensitive columns.

```{r privacy}
pr <- privacy_risk(syn, sensitive_cols = "age")
pr
```

## Downstream model fidelity

`model_fidelity()` trains a predictive model on synthetic data and
evaluates it on real data. The real-trained baseline is evaluated
in-sample, so it serves as an optimistic upper bound.

```{r fidelity}
mf <- model_fidelity(syn, outcome = "outcome")
mf
```

A synthetic-trained model with AUC close to the real-trained baseline
indicates that the synthetic data preserves the predictive signal.
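The train-on-synthetic, test-on-real idea can be sketched with a plain logistic regression. This assumes the synthesized `outcome` remains 0/1 and that factor levels match between the two tables; the package may use a different learner:

```{r fidelity-sketch}
fit_syn  <- glm(outcome ~ ., data = syn,  family = binomial)
fit_real <- glm(outcome ~ ., data = real, family = binomial)

# AUC as the rank (Mann-Whitney) statistic.
auc <- function(p, y) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "=="))
}

# Both models are scored on the real data.
c(
  synthetic_trained = auc(predict(fit_syn,  real, type = "response"),
                          real$outcome),
  real_trained      = auc(predict(fit_real, real, type = "response"),
                          real$outcome)
)
```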

## Privacy-utility trade-off

Higher `noise_level` improves privacy but reduces utility:

```{r tradeoff}
noise_levels <- c(0.05, 0.1, 0.2, 0.5)
results <- lapply(noise_levels, function(nl) {
  s <- synthesize(real, method = "noise", noise_level = nl, seed = 1)
  v <- validate_synthetic(s)
  data.frame(
    noise_level = nl,
    ks      = v$value[v$metric == "ks_statistic_mean"],
    privacy = v$value[v$metric == "nn_distance_ratio"]
  )
})
do.call(rbind, results)
```
