---
title: "Introduction to RobustFlow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to RobustFlow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  fig.width  = 6.5,
  fig.height = 4,
  out.width  = "100%"
)
library(RobustFlow)
library(ggplot2)
```

## Overview

**RobustFlow** is an R package for auditing temporal drift, subgroup
disparities, and robustness in longitudinal decision systems.

It is built around three methodological contributions:

| Metric | Definition | Key function |
|--------|-----------|--------------|
| **Drift Intensity Index (DII)** | Frobenius norm of consecutive transition matrix differences | `compute_drift()` |
| **Bias Amplification Index (BAI)** | Change in disparity gap from first to last time point | `compute_bai()` |
| **Temporal Fragility Index (TFI)** | Minimum hidden-bias perturbation to nullify a trend conclusion | `compute_tfi_simple()` |

An interactive Shiny application (`run_app()`) provides a seven-tab
workflow with point-and-click access to all analyses plus downloadable
HTML reports, CSV bundles, and reproducible R scripts.

---

## Simulated dataset

We construct a panel that mimics a study of mathematics risk status across
five elementary school waves, with three SES groups.

```{r make-data}
set.seed(2024)
n_children <- 200
n_waves    <- 5

# Simulate SES group with unequal sizes
ses <- sample(
  c("Low SES", "Mid SES", "High SES"),
  size    = n_children,
  replace = TRUE,
  prob    = c(0.35, 0.40, 0.25)
)

# Risk probability increases over waves for Low SES,
# decreases slightly for High SES
risk_prob <- function(ses_grp, wave) {
  base <- switch(ses_grp,
    "Low SES"  = 0.55,
    "Mid SES"  = 0.35,
    "High SES" = 0.20
  )
  trend <- switch(ses_grp,
    "Low SES"  =  0.04,
    "Mid SES"  =  0.01,
    "High SES" = -0.02
  )
  pmin(pmax(base + trend * (wave - 1), 0.02), 0.98)
}

rows <- lapply(seq_len(n_children), function(i) {
  data.frame(
    child_id  = i,
    wave      = seq_len(n_waves),
    ses_group = ses[i],
    school_id = sample(seq_len(20), 1),
    risk_math = rbinom(n_waves, 1,
                       vapply(seq_len(n_waves),
                              function(w) risk_prob(ses[i], w),
                              numeric(1)))
  )
})
ecls_demo <- do.call(rbind, rows)
head(ecls_demo, 8)
```

---

## Step 1: Validate panel data

`validate_panel_data()` checks variable existence, sorts the data, and
reports panel balance and missingness.

```{r validate}
validated <- validate_panel_data(
  data     = ecls_demo,
  id       = "child_id",
  time     = "wave",
  decision = "risk_math",
  group    = "ses_group",
  cluster  = "school_id"
)

cat(sprintf(
  "Individuals: %d\nWaves:       %d\nBalanced:    %s\n",
  validated$n_ids,
  validated$n_times,
  validated$balanced
))
```

---

## Step 2: Build decision paths

`build_paths()` constructs each child's complete risk trajectory and
returns aggregate path frequencies, the pooled transition matrix, and
Shannon entropy.

```{r paths}
paths <- build_paths(
  data     = validated$data,
  id       = "child_id",
  time     = "wave",
  decision = "risk_math"
)

# Top 10 paths
head(paths$path_counts, 10)
```

Shannon entropy measures trajectory diversity — higher values mean more
heterogeneous individual histories.

```{r entropy}
cat(sprintf("Path entropy: %.3f bits\n", paths$path_entropy))
```

### Aggregate transition matrix

```{r trans-mat}
paths$transition_matrix
```

```{r path-plot, fig.cap="Top 10 most common five-wave risk trajectories."}
df_paths <- head(paths$path_counts, 10)
df_paths$path <- factor(df_paths$path, levels = rev(df_paths$path))

ggplot(df_paths, aes(x = path, y = n)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  labs(x = "Path", y = "Count",
       title = "Top 10 Decision Paths") +
  theme_minimal()
```

---

## Step 3: Drift Intensity Index (DII)

`compute_drift()` estimates a period-specific transition matrix for each
consecutive wave pair and computes the Frobenius norm of their difference:

$$DII_t = \| P_t - P_{t-1} \|_F$$

A larger DII signals greater structural instability between two periods.

```{r dii}
drift <- compute_drift(
  data      = validated$data,
  id        = "child_id",
  time      = "wave",
  decision  = "risk_math",
  normalize = TRUE
)

drift$summary
cat(sprintf("Mean DII:             %.4f\n", drift$mean_dii))
cat(sprintf("Period of max drift:  wave %s\n", drift$max_dii_period))
```

```{r dii-plot, fig.cap="DII over time. Higher values indicate greater structural change in risk transitions."}
ggplot(drift$summary, aes(x = time, y = DII)) +
  geom_line(color = "#1a9641", linewidth = 1) +
  geom_point(size = 2.5) +
  labs(x = "Wave", y = "DII",
       title = "Drift Intensity Index Over Time") +
  theme_minimal()
```

---

## Step 4: Group trajectories and Bias Amplification Index (BAI)

`compute_group_gaps()` computes the at-risk rate for each SES group at
each wave and the pairwise gap (alphabetically first group minus second).

`compute_bai()` then summarizes whether that gap widens or narrows:

$$BAI = Gap_T - Gap_1$$

```{r gaps}
gaps <- compute_group_gaps(
  data        = validated$data,
  time        = "wave",
  decision    = "risk_math",
  group       = "ses_group",
  focal_value = 1
)

gaps$gap_df
```

```{r traj-plot, fig.cap="At-risk rates by SES group across five waves."}
ggplot(gaps$long_format,
       aes(x = time, y = rate, color = group, group = group)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1)) +
  labs(x = "Wave", y = "At-Risk Rate", color = "SES Group",
       title = "Math Risk Trajectories by SES Group") +
  theme_minimal() +
  theme(legend.position = "bottom")
```

```{r gap-plot, fig.cap="Disparity gap (first two alphabetic SES groups) over time."}
ggplot(gaps$gap_df, aes(x = time, y = gap)) +
  geom_line(color = "#762a83", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey40") +
  geom_point(size = 2.5) +
  labs(x = "Wave", y = "Gap",
       title = "Disparity Gap Over Time") +
  theme_minimal()
```

```{r bai}
bai <- compute_bai(gaps$gap)
cat(sprintf("BAI:       %.4f\n", bai$bai))
cat(sprintf("Direction: %s\n",   bai$direction))
cat(sprintf("Gap at t=1: %.4f  |  Gap at t=T: %.4f\n",
            bai$gap_start, bai$gap_end))
```

The standardized version divides by the SD of the gap series:

```{r bai-std}
bai_std <- compute_bai(gaps$gap, standardize = TRUE)
cat(sprintf("Standardized BAI: %.4f\n", bai_std$bai))
```

---

## Step 5: Temporal Fragility Index (TFI)

`compute_tfi_simple()` estimates how much hidden bias — modelled as a
scalar attenuation $u$ — would be required to nullify the observed
longitudinal trend in DII. The adjusted slope under perturbation $u$ is:

$$\hat{\beta}(u) = \hat{\beta}_{obs} \times (1 - u)$$

The TFI is the minimum $u$ at which the trend reverses or reaches zero.

```{r tfi}
dii_vals <- drift$summary$DII[!is.na(drift$summary$DII)]

tfi <- compute_tfi_simple(dii_vals)
tfi$summary_table
```

```{r tfi-plot, fig.cap="TFI sensitivity curve. The red dashed line marks the null effect threshold."}
sc <- tfi$sensitivity_curve

ggplot(sc, aes(x = perturbation, y = adjusted_effect)) +
  geom_line(color = "#4575b4", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#d73027",
             linewidth = 0.8) +
  labs(x = "Hidden Bias Parameter (u)",
       y = "Adjusted Slope",
       title = "TFI Sensitivity Curve") +
  theme_minimal()
```

### Interpretation guide

| TFI value | Interpretation |
|-----------|---------------|
| `Inf` | Conclusion cannot be nullified by scalar attenuation — highly robust |
| `> 0.5` | Moderately robust |
| `0.1 – 0.5` | Requires scrutiny — modest hidden bias could reverse conclusions |
| `< 0.1` | Fragile — small bias sufficient to nullify the trend |

---

## Step 6: Exporting a reproducible R script

```{r rscript, eval=FALSE}
generate_r_script(
  id_var       = "child_id",
  time_var     = "wave",
  decision_var = "risk_math",
  group_var    = "ses_group",
  cluster_var  = "school_id",
  focal_value  = 1,
  output_file  = "my_robustflow_analysis.R"
)
```

The generated script is a self-contained `.R` file that a collaborator
can run to reproduce the full analysis without the Shiny app.

---

## Step 7: Launch the interactive app

```{r app, eval=FALSE}
run_app()
```

The app loads in your default browser and provides:

| Tab | Content |
|-----|---------|
| **Data** | Upload, variable mapping, balance diagnostics, missingness |
| **Paths** | Path frequencies, transition matrix, entropy |
| **Drift** | Prevalence over time, DII trend, period-specific matrices |
| **Disparities** | Group trajectories, gap evolution, BAI |
| **Robustness** | TFI, sensitivity curve |
| **Intervention** | Highest-risk transitions, disparity-generating steps |
| **Report** | HTML report, CSV bundle, reproducible R script |

---

## Citation

If you use RobustFlow in published research, please cite:

```
Hait, S. (2025). RobustFlow: Robustness and drift auditing for
longitudinal decision systems. R package version 0.1.0.
https://github.com/subirhait/RobustFlow
```

---

## Session info

```{r session}
sessionInfo()
```