---
title: "Risk Taxonomy"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Risk Taxonomy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(BORG)
```

This document catalogs all evaluation risks that BORG detects, organized by severity and mechanism.

# Risk Classification

BORG classifies risks into two categories based on their impact on evaluation validity:

| Category | Impact | BORG Response |
|----------|--------|---------------|
| **Hard Violation** | Results are invalid | Blocks evaluation, requires fix |
| **Soft Inflation** | Results are biased | Warns, allows with caution |

# Hard Violations

These risks invalidate your evaluation results. Any metric computed while one of them is present cannot be trusted.

## 1. Index Overlap

**What**: Same row indices appear in both training and test sets.

**Why it matters**: The model has seen the exact data it's being tested on. This is the most basic form of leakage.

**Detection**: Set intersection of `train_idx` and `test_idx`.

```{r index-overlap}
data <- data.frame(x = 1:100, y = rnorm(100))

# Accidental overlap
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
result
```

**Fix**: Ensure indices are mutually exclusive. Use `setdiff()` to create non-overlapping sets.
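
A minimal sketch of such a split in base R follows. The 70/30 proportion and the variable names are illustrative assumptions, not BORG defaults.

```r
set.seed(1)
n <- 100

# Draw the training indices first, then take every remaining row as test
train_idx <- sample(n, size = 70)
test_idx <- setdiff(seq_len(n), train_idx)

# Sanity checks: no shared indices, and every row is used exactly once
length(intersect(train_idx, test_idx)) == 0
length(train_idx) + length(test_idx) == n
```

Deriving `test_idx` from `train_idx` with `setdiff()` makes overlap impossible by construction, which is safer than writing two index ranges by hand.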

## 2. Duplicate Rows

**What**: Test set contains rows identical to training rows.

**Why it matters**: Model may have memorized these exact patterns. Even without index overlap, identical feature values constitute leakage.

**Detection**: Row hashing and comparison (C++ backend for numeric data).

```{r duplicate-rows}
# Data with duplicate rows
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)  # Duplicates
)

result <- borg_inspect(dup_data, train_idx = 1:5, test_idx = 6:10)
result
```

**Fix**: Remove duplicate rows before splitting, or make the split respect duplicates by keeping all copies of a row in the same set.
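
Both options can be sketched in base R. This is an illustration only: the `row_key` and `row_id` helpers and the choice of `train_ids` are hypothetical, and building keys with `paste()` is a simplification that may not distinguish numeric values differing beyond printed precision.

```r
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)
)

# Option 1: drop exact duplicate rows before splitting
dedup <- dup_data[!duplicated(dup_data), , drop = FALSE]

# Option 2: give identical rows a shared ID and split on that ID,
# so all copies of a row land in the same set
row_key <- do.call(paste, c(dup_data, sep = "\r"))
row_id <- match(row_key, unique(row_key))

train_ids <- 1:3  # hypothetical choice of row groups for training
train_idx <- which(row_id %in% train_ids)
test_idx <- which(!row_id %in% train_ids)
```

Option 2 keeps the duplicated observations, which matters when duplicates carry real weight in the data rather than being recording errors.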

## 3. Preprocessing Leakage

**What**: Normalization, imputation, or dimensionality reduction fitted on full data before splitting.

**Why it matters**: The preprocessing parameters (means, standard deviations, principal components) were estimated using test rows. Those parameters are then applied to the training data, so information flows backwards from test to train.

**Detection**: Recompute statistics on train-only data and compare to stored parameters. Discrepancy indicates leakage.

**Supported objects**:

| Object Type | Parameters Checked |
|-------------|-------------------|
| `caret::preProcess` | `$mean`, `$std` |
| `recipes::recipe` | Step parameters after `prep()` |
| `prcomp` | `$center`, `$scale`, rotation matrix |
| `scale()` attributes | `center`, `scale` |

```{r preprocessing-leak, eval=FALSE}
# BAD: Scale fitted on all data
scaled_data <- scale(data)  # Uses all rows!
train <- scaled_data[1:70, ]
test <- scaled_data[71:100, ]

# BORG detects this
borg_inspect(scaled_data, train_idx = 1:70, test_idx = 71:100)
```
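
The detection idea described above (recompute on train only, then compare) can be sketched for `scale()` attributes in base R. This is an illustration of the principle, not BORG's internal implementation; the use of `all.equal()` with its default tolerance is an assumption.

```r
set.seed(1)
data <- data.frame(x = rnorm(100, mean = 5), y = rnorm(100))
train_idx <- 1:70

# Scaling fitted on all rows stores its parameters as attributes
scaled_all <- scale(data)
stored_center <- attr(scaled_all, "scaled:center")

# Recompute the same statistic on the training rows only
train_center <- colMeans(data[train_idx, ])

# A non-trivial difference suggests the scaler saw rows outside the training set
leak_suspected <- !isTRUE(all.equal(stored_center, train_center))

# Control: a scaler fitted on the training rows matches the recomputed values
scaled_train <- scale(data[train_idx, ])
clean <- isTRUE(all.equal(attr(scaled_train, "scaled:center"), train_center))
```

Here `leak_suspected` is `TRUE` for the full-data scaler and `clean` is `TRUE` for the train-only scaler.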

**Fix**: Fit preprocessing on training data only, then apply to test:

```r
train_data <- data[1:70, ]
test_data <- data[71:100, ]

# Fit on train
means <- colMeans(train_data)
sds <- apply(train_data, 2, sd)

# Apply to both
train_scaled <- scale(train_data, center = means, scale = sds)
test_scaled <- scale(test_data, center = means, scale = sds)
```

## 4. Target Leakage (Direct)

**What**: Feature has absolute correlation > 0.99 with target.

**Why it matters**: Feature is almost certainly derived from the outcome. Examples:

- `days_since_diagnosis` when predicting `has_disease`

- `total_spent` when predicting `is_customer`

- Aggregated future values leaked into current features

**Detection**: Compute Pearson correlation of each numeric feature with target on training data.

```{r target-leakage}
# Simulate target leakage
leaky <- data.frame(
  x = rnorm(100),
  outcome = rnorm(100)
)
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01)  # Near-perfect correlation

result <- borg_inspect(leaky, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
```
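
The correlation screen itself is easy to reproduce by hand. The following is a minimal sketch of the check described above, using the 0.99 threshold from this section; the variable names are illustrative and the code is not BORG's implementation.

```r
set.seed(1)
leaky <- data.frame(x = rnorm(100), outcome = rnorm(100))
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01)

# Use training rows only, and every numeric column except the target
train <- leaky[1:70, ]
is_num <- vapply(train, is.numeric, logical(1))
features <- setdiff(names(train)[is_num], "outcome")

# Absolute Pearson correlation of each feature with the target
abs_cor <- vapply(
  features,
  function(f) abs(cor(train[[f]], train$outcome)),
  numeric(1)
)

flagged <- names(abs_cor)[abs_cor > 0.99]
flagged
```

Only `leaked` exceeds the threshold here, because it is the outcome plus a very small amount of noise.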

**Fix**: Remove or investigate the leaky feature. If it's a legitimate predictor, document why correlation > 0.99 is expected.

## 5. Group Leakage

**What**: Same group (patient, site, species) appears in both train and test.

**Why it matters**: Observations within a group tend to be similar. If the same patient appears in train and test, the model can exploit patient-specific patterns that won't exist for new patients.

**Detection**: Set intersection of group membership values.

```{r group-leakage}
# Clinical data with patient IDs
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]

result <- borg_inspect(clinical, train_idx = train_idx, test_idx = test_idx,
                       groups = "patient_id")
result
```

**Fix**: Use group-aware splitting:

```r
# Split at the patient level
train_patients <- sample(unique(clinical$patient_id), 7)
train_idx <- which(clinical$patient_id %in% train_patients)
test_idx <- which(!clinical$patient_id %in% train_patients)
```
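
The set-intersection rule from the detection step also works as a quick self-check after splitting. In the sketch below, `shared_groups()` is a hypothetical helper written for this illustration, not a BORG function.

```r
set.seed(123)
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Groups that occur on both sides of a split
shared_groups <- function(groups, train_idx, test_idx) {
  intersect(groups[train_idx], groups[test_idx])
}

# Row-level random split: patients end up on both sides
all_idx <- sample(100)
leaky <- shared_groups(clinical$patient_id, all_idx[1:70], all_idx[71:100])

# Patient-level split: no patient is shared
train_patients <- sample(unique(clinical$patient_id), 7)
train_idx <- which(clinical$patient_id %in% train_patients)
test_idx <- which(!clinical$patient_id %in% train_patients)
clean <- shared_groups(clinical$patient_id, train_idx, test_idx)

length(leaky) > 0
length(clean) == 0
```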

## 6. Temporal Ordering Violation

**What**: Test observations predate training observations.

**Why it matters**: Model uses future information to predict the past. In deployment, future data won't be available.

**Detection**: Compare max training timestamp to min test timestamp.

```{r temporal-leak}
# Time series data
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# Wrong: random split ignores time
set.seed(42)
random_idx <- sample(100)
train_idx <- random_idx[1:70]
test_idx <- random_idx[71:100]

result <- borg_inspect(ts_data, train_idx = train_idx, test_idx = test_idx,
                       time = "date")
result
```

**Fix**: Use chronological splits where all test data comes after training:

```r
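# Order rows by date so the split is chronological even if the data is unsorted
ord <- order(ts_data$date)
train_idx <- ord[1:70]
test_idx <- ord[71:100]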
```
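
The timestamp comparison from the detection step can be written as a one-line check. `is_chronological()` below is a hypothetical helper for illustration, and the strict inequality is an assumption about how ties should be treated.

```r
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = seq_len(100)
)

# TRUE when every training timestamp is strictly earlier than every test timestamp
is_chronological <- function(time, train_idx, test_idx) {
  max(time[train_idx]) < min(time[test_idx])
}

# Chronological split passes
ok_split <- is_chronological(ts_data$date, 1:70, 71:100)

# Random split fails
set.seed(42)
shuffled <- sample(100)
bad_split <- is_chronological(ts_data$date, shuffled[1:70], shuffled[71:100])
```

Here `ok_split` is `TRUE` and `bad_split` is `FALSE`.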

## 7. CV Fold Contamination

**What**: Cross-validation folds contain test indices, or folds overlap incorrectly.

**Why it matters**: Nested CV requires the outer test set to be completely held out from all inner training.

**Detection**: Check if any fold's training indices intersect with held-out test set.

**Supported objects**:

- `caret::trainControl` - checks `$index` and `$indexOut`

- `rsample::vfold_cv` and other `rset` objects

- `rsample::rsplit` objects
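
The check can be illustrated with plain index vectors, independent of any resampling framework. This is a sketch of the principle only: the fold construction and the `contaminated()` helper are hypothetical, and BORG's handling of `caret` and `rsample` objects is not shown.

```r
set.seed(1)
n <- 100
test_idx <- 81:100                         # outer hold-out set
dev_idx <- setdiff(seq_len(n), test_idx)   # rows available for inner CV

# Correct: inner folds are drawn from the development rows only
fold_id <- sample(rep(1:5, length.out = length(dev_idx)))
clean_folds <- split(dev_idx, fold_id)

# Incorrect: inner folds are drawn from all rows, including held-out ones
leaky_folds <- split(seq_len(n), sample(rep(1:5, length.out = n)))

# A fold is contaminated if any of its indices belong to the outer test set
contaminated <- function(folds, test_idx) {
  vapply(folds, function(idx) length(intersect(idx, test_idx)) > 0, logical(1))
}

any(contaminated(clean_folds, test_idx))
any(contaminated(leaky_folds, test_idx))
```

The first call returns `FALSE` and the second returns `TRUE`, since the leaky folds partition all 100 rows and therefore must contain the held-out ones.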

## 8. Model Scope

**What**: Model was trained on more rows than the claimed training set contains.

**Why it matters**: Model saw test data during training, even if indirectly (e.g., through hyperparameter tuning on full data).

**Detection**: Compare `nrow(trainingData)` or `length(fitted.values)` to `length(train_idx)`.

**Supported objects**: `lm`, `glm`, `ranger`, `caret::train`, parsnip models, workflows.
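
For an `lm` fit, the row-count comparison looks roughly like this. The sketch uses `fitted()` to count the rows the model was fitted on; it illustrates the comparison and is not BORG's implementation.

```r
set.seed(1)
data <- data.frame(x = rnorm(100))
data$y <- 2 * data$x + rnorm(100)
train_idx <- 1:70

fit_all <- lm(y ~ x, data = data)                 # fitted on every row
fit_train <- lm(y ~ x, data = data[train_idx, ])  # fitted on training rows only

# More fitted values than training indices means the model saw extra rows
scope_all <- length(fitted(fit_all)) > length(train_idx)
scope_train <- length(fitted(fit_train)) > length(train_idx)
```

Here `scope_all` is `TRUE` (100 fitted values against 70 training indices) and `scope_train` is `FALSE`.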

# Soft Inflation Risks

These bias results but may not completely invalidate them. Model ranking might be preserved even if absolute metrics are optimistic.

## 1. Target Leakage (Proxy)

**What**: Feature has absolute correlation between 0.95 and 0.99 with target.

**Why a warning, not an error**: May be a legitimate strong predictor. Requires domain knowledge to judge.

**Detection**: Same as direct leakage, different threshold.

```{r proxy-leakage}
# Strong but not extreme correlation
proxy <- data.frame(
  x = rnorm(100),
  outcome = rnorm(100)
)
proxy$strong_predictor <- proxy$outcome + rnorm(100, sd = 0.3)  # r ~ 0.96

result <- borg_inspect(proxy, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
```

**Action**: Review whether the feature should be available at prediction time in production.

## 2. Spatial Proximity

**What**: Test points are very close to training points in geographic space.

**Why it matters**: Spatial autocorrelation means nearby points share variance. Model learns local patterns that don't generalize to distant locations.

**Detection**: Compute the distance from each test point to its nearest training point. Flag test points where that distance is less than 1% of the spatial spread.

```{r spatial-proximity}
set.seed(42)
spatial <- data.frame(
  lon = runif(100, 0, 100),
  lat = runif(100, 0, 100),
  value = rnorm(100)
)

# Random split intermixes nearby points
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)

result <- borg_inspect(spatial, train_idx = train_idx, test_idx = test_idx,
                       coords = c("lon", "lat"))
result
```
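
A nearest-neighbour distance check of this kind can be sketched in base R. Two assumptions are made here that the text does not specify: "spatial spread" is taken to be the diagonal of the bounding box, and distances are planar Euclidean, which suits the synthetic 0-100 coordinates but not real longitude/latitude.

```r
set.seed(42)
spatial <- data.frame(lon = runif(100, 0, 100), lat = runif(100, 0, 100))
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)

xy <- as.matrix(spatial[, c("lon", "lat")])

# Pairwise Euclidean distances; keep the test-by-train block
d <- as.matrix(dist(xy))[test_idx, train_idx, drop = FALSE]
nn_dist <- apply(d, 1, min)

# Assumption: spread = bounding-box diagonal of all points
spread <- sqrt(diff(range(xy[, 1]))^2 + diff(range(xy[, 2]))^2)

# Test points closer than 1% of the spread to some training point
too_close <- nn_dist < 0.01 * spread
sum(too_close)
```

The full distance matrix is quadratic in the number of points, so this approach is only practical for small data sets.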

**Fix**: Use spatial blocking:

```r
# Geographic split
train_idx <- which(spatial$lon < 50)  # West
test_idx <- which(spatial$lon >= 50)  # East
```

## 3. Spatial Overlap

**What**: Test region falls inside training region's convex hull.

**Why it matters**: Interpolation is easier than extrapolation. Model performance on "surrounded" test points overestimates performance on truly new regions.

**Detection**: Compute convex hull of training points, count test points inside.

**Threshold**: Warning if > 50% of test points fall inside training hull.
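
One way to count test points inside the training hull with base R alone is sketched below. It relies on the fact that a point strictly inside the hull does not become a hull vertex when it is added to the training points. `inside_train_hull()` is a hypothetical helper, not part of BORG, and the treatment of points lying exactly on the hull boundary is left unspecified.

```r
# TRUE for each test point that is not a vertex of the hull of (train + that point)
inside_train_hull <- function(train_xy, test_xy) {
  train_xy <- as.matrix(train_xy)
  test_xy <- as.matrix(test_xy)
  n <- nrow(train_xy)
  vapply(seq_len(nrow(test_xy)), function(i) {
    hull <- grDevices::chull(rbind(train_xy, test_xy[i, , drop = FALSE]))
    !((n + 1L) %in% hull)
  }, logical(1))
}

# Training points form a square; one test point is inside, one is outside
train_xy <- cbind(lon = c(0, 10, 10, 0), lat = c(0, 0, 10, 10))
test_xy <- cbind(lon = c(5, 20), lat = c(5, 20))

inside <- inside_train_hull(train_xy, test_xy)
warn <- mean(inside) > 0.5  # the 50% threshold stated above
```

In this toy case exactly half of the test points fall inside the hull, so `warn` is `FALSE`.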

## 4. Random CV on Dependent Data

**What**: Using random k-fold CV when data has spatial, temporal, or group structure.

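**Why it matters**: Random folds place dependent observations (neighbouring locations, adjacent time points, members of the same group) in both the training and validation folds. The folds are then not independent, which leads to optimistic error estimates.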

```{r random-cv-inflation}
# Diagnose data dependencies
spatial <- data.frame(
  lon = runif(200, 0, 100),
  lat = runif(200, 0, 100),
  response = rnorm(200)
)

diagnosis <- borg_diagnose(spatial, coords = c("lon", "lat"), target = "response",
                           verbose = FALSE)
diagnosis@recommended_cv
```

**Fix**: Use `borg()` to generate appropriate blocked CV folds.

# Quick Reference

| Risk Type | Severity | Detection Method | Fix |
|-----------|----------|------------------|-----|
| `index_overlap` | Hard | Index intersection | Use `setdiff()` |
| `duplicate_rows` | Hard | Row hashing | Deduplicate or group |
| `preprocessing_leak` | Hard | Parameter comparison | Fit on train only |
| `target_leakage` | Hard | Correlation > 0.99 | Remove feature |
| `group_leakage` | Hard | Group intersection | Group-aware split |
| `temporal_leak` | Hard | Timestamp comparison | Chronological split |
| `cv_contamination` | Hard | Fold index check | Rebuild folds |
| `model_scope` | Hard | Row count | Refit on train only |
| `proxy_leakage` | Soft | Correlation 0.95-0.99 | Domain review |
| `spatial_proximity` | Soft | Distance check | Spatial blocking |
| `spatial_overlap` | Soft | Convex hull | Geographic split |

# Accessing Risk Details

```{r risk-access}
# Create result with violations
result <- borg_inspect(
  data.frame(x = 1:100, y = rnorm(100)),
  train_idx = 1:60,
  test_idx = 51:100
)

# Summary
cat("Valid:", result@is_valid, "\n")
cat("Hard violations:", result@n_hard, "\n")
cat("Soft warnings:", result@n_soft, "\n")

# Individual risks
for (risk in result@risks) {
  cat("\n", risk$type, "(", risk$severity, "):\n", sep = "")
  cat("  ", risk$description, "\n")
  if (!is.null(risk$affected)) {
    cat("  Affected:", head(risk$affected, 5), "...\n")
  }
}

# Tabular format
as.data.frame(result)
```

# See Also

- `vignette("quickstart")` - Basic usage

- `vignette("frameworks")` - Framework integration
