---
title: "Advanced Causal Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced Causal Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

This vignette demonstrates advanced usage of the `causaldef` package, focusing on diagnostics for unobserved confounding and on sensitivity analysis. We use real datasets that ship with the package to illustrate these features.

```{r setup}
library(causaldef)
library(ggplot2)

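# DAG helper: draws nodes from `coords` (columns name, x, y)
# and arrows from `edges` (columns from, to)
plot_dag <- function(coords, edges, title = NULL) {
  # Attach start coordinates to each edge (columns 3-4 are x, y of `from`)
  edges_df <- merge(edges, coords, by.x = "from", by.y = "name")
  colnames(edges_df)[c(3,4)] <- c("x_start", "y_start")
  # Attach end coordinates to each edge (columns 5-6 are x, y of `to`)
  edges_df <- merge(edges_df, coords, by.x = "to", by.y = "name")
  colnames(edges_df)[c(5,6)] <- c("x_end", "y_end")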
  
  ggplot2::ggplot(coords, ggplot2::aes(x = x, y = y)) +
    # `linewidth` replaces the deprecated `size` argument for lines (ggplot2 >= 3.4.0)
    ggplot2::geom_segment(data = edges_df, ggplot2::aes(x = x_start, y = y_start, xend = x_end, yend = y_end),
                 arrow = ggplot2::arrow(length = ggplot2::unit(0.3, "cm"), type = "closed"), 
                 color = "gray40", linewidth = 1, alpha = 0.8) +
    ggplot2::geom_point(size = 14, color = "white", fill = "#E7B800", shape = 21, stroke = 1.5) + # yellow nodes, white outline
    ggplot2::geom_text(ggplot2::aes(label = name), fontface = "bold", size = 3, color = "black") +
    ggplot2::ggtitle(title) + 
    ggplot2::theme_void(base_size = 14) +
    ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold", margin = ggplot2::margin(b = 10))) +
    ggplot2::coord_fixed()
}
```

# Unobserved Confounding and Negative Controls

Unobserved confounding is a major threat to causal inference. The `causaldef` package implements "Negative Control Outcomes" to detect residual confounding. A negative control outcome is a variable that is affected by unobserved confounders but is known not to be causally affected by the treatment.
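
To make the idea concrete, here is a minimal base-R sketch on simulated data. It does not use `causaldef`; the variable names and effect sizes are assumptions chosen for illustration. The treatment has no effect on the simulated negative control, yet a naive comparison shows a clear association because both depend on the unobserved `u`.

```{r nc_sketch}
# Toy simulation (illustrative only, not package code)
set.seed(1)
n <- 5000
u <- rnorm(n)                          # unobserved confounder
a <- rbinom(n, 1, plogis(1.5 * u))     # treatment depends on u
y_nc <- 2 * u + rnorm(n)               # negative control: depends on u, not on a

# The true effect of `a` on `y_nc` is zero, but the naive estimate is not
naive_est <- unname(coef(lm(y_nc ~ a))["a"])

# Adjusting for u (impossible in practice) removes the spurious association
oracle_est <- unname(coef(lm(y_nc ~ a + u))["a"])

round(c(naive = naive_est, oracle = oracle_est), 3)
```

A nonzero association with the negative control is therefore a symptom of confounding that the analysis has not removed.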

We will use the `gene_perturbation` dataset provided in the package. This dataset comes from a CRISPR knockout experiment.

*   **Treatment (`knockout_status`)**: Whether a gene was knocked out (Treatment) or not (Control).
*   **Outcome (`target_expression`)**: Expression level of the target gene.
*   **Confounders**: Technical factors like `library_size` and `batch`.
*   **Negative Control (`housekeeping_gene`)**: A housekeeping gene that should *not* be affected by the specific knockout but shares technical confounders.

## Loading the Data

```{r load_data}
data("gene_perturbation")
head(gene_perturbation)
```

## Running the Negative Control Diagnostic

We first specify the causal model, explicitly identifying the negative control outcome.

```{r nc_spec}
spec_nc <- causal_spec(
  data = gene_perturbation,
  treatment = "knockout_status",
  outcome = "target_expression",
  covariates = c("library_size", "batch"),
  negative_control = "housekeeping_gene"
)

# Visualize the Structural Assumption
coords <- data.frame(
  name = c("Unobserved", "Treatment", "Target", "Housekp"),
  x = c(0, -1.5, 1.5, 0),
  y = c(1, 0, 0, -1)
)
edges <- data.frame(
  from = c("Unobserved", "Unobserved", "Unobserved", "Treatment"),
  to =   c("Treatment", "Target", "Housekp", "Target")
)
plot_dag(coords, edges, title = "Why Negative Controls Work:\nShared Unobserved Confounding")
```

Now we run the diagnostic with `nc_diagnostic()`, testing whether Inverse Probability of Treatment Weighting (IPTW) successfully removes confounding bias.

```{r nc_run}
set.seed(123)
# Check if IPTW approach effectively removes confounding using the negative control
nc_res <- nc_diagnostic(spec_nc, method = "iptw", n_boot = 100)
print(nc_res)
```
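
The logic behind an IPTW-based negative control check can be sketched in base R. This is a simplified illustration on simulated data, not the implementation used by `nc_diagnostic()`; all names and coefficients are assumptions. Weights built from the observed covariate alone leave a residual association with the negative control, while weights that also use the (normally unavailable) confounder `u` remove it.

```{r iptw_sketch}
set.seed(2)
n <- 5000
x <- rnorm(n)                                  # observed covariate
u <- rnorm(n)                                  # unobserved confounder
a <- rbinom(n, 1, plogis(0.8 * x + 0.8 * u))   # treatment assignment
y_nc <- x + u + rnorm(n)                       # negative control; no effect of a

# Weighted difference in means of the negative control between arms
iptw_difference <- function(ps) {
  w <- ifelse(a == 1, 1 / ps, 1 / (1 - ps))    # inverse probability weights
  weighted.mean(y_nc[a == 1], w[a == 1]) -
    weighted.mean(y_nc[a == 0], w[a == 0])
}

# Propensity scores from the observed covariate only
ps_obs <- fitted(glm(a ~ x, family = binomial()))
# Propensity scores that also use u (an oracle, for comparison)
ps_full <- fitted(glm(a ~ x + u, family = binomial()))

iptw_diff <- iptw_difference(ps_obs)
iptw_diff_full <- iptw_difference(ps_full)

round(c(observed_only = iptw_diff, with_u = iptw_diff_full), 3)
```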

**Interpreting Negative Control Diagnostics:**

This diagnostic tests a crucial assumption: *"Did my adjustment remove ALL confounding?"*

**Key Results to Check:**

1. **p_value**: 
   - $p > 0.05$: [PASS] No evidence that treatment is associated with the negative control after adjustment (this is not proof that confounding is absent)
   - $p \leq 0.05$: [FAIL] Treatment is still associated with the negative control, so residual confounding is detected

2. **falsified**: 
   - FALSE: Assumption not falsified; the adjustment appears adequate
   - TRUE: Assumption falsified; unobserved confounding or model misspecification is likely

3. **delta_nc**: 
   - Magnitude of the residual association between treatment and the negative control
   - Smaller values indicate less residual confounding
   - Should be close to zero if adjustment is successful

4. **delta_bound**: 
   - Upper bound on your true deficiency: $\delta \leq \kappa \times \delta_{nc}$
   - Provides a worst-case estimate even when you can't directly measure $\delta$
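
The bound is plain arithmetic once $\kappa$ and $\delta_{nc}$ are available. The numbers below are hypothetical and do not come from the output above; how $\kappa$ is defined and obtained is governed by the package and its reference, not by this sketch.

```{r delta_bound_sketch}
# Hypothetical values for illustration only
delta_nc <- 0.04   # assumed residual association with the negative control
kappa <- 1.5       # assumed constant in the bound

# Worst-case deficiency implied by the bound: delta <= kappa * delta_nc
delta_bound <- kappa * delta_nc
delta_bound
```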

**Scientific Implications:**

The negative control is powerful because:

- It doesn't rely on assumptions about what confounders exist
- It provides empirical evidence (not just theoretical arguments)
- Falsification is strong evidence – if it fails, your conclusions are suspect

**If falsified:**

- Do NOT ignore this result
- Investigate: batch effects, technical artifacts, unmeasured biology
- Consider: additional covariates, different methods, or acknowledging the limitation

## Addressing the Falsification

If the diagnostic fails (as it might when batch effects are not fully captured by the measured covariates), your current adjustment set is insufficient to block all back-door paths to the negative control, and likely to the outcome as well.

**Recommended Actions:**

1.  **Enrich the Confounder Set**: If possible, measure and include additional potential confounders (e.g., more granular technical variables).
2.  **Sensitivity Analysis**: Since you have evidence of unobserved confounding, use `confounding_frontier()` to determine how strong this confounding would need to be to invalidate your results.
3.  **Alternative Designs**: Consider methods that rely on different assumptions, such as Instrumental Variables (if available) or Difference-in-Differences.
4.  **Transparent Reporting**: If you cannot fix the issue, report the failure, for example: "We detected residual confounding ($\delta_{nc} = ...$) using negative controls. Consequently, the causal interpretation of our estimates is limited."

# Sensitivity Analysis: Confounding Frontiers

When we suspect unobserved confounding, we can quantify how sensitive our causal conclusions are to potential unmeasured confounders $U$. The `confounding_frontier()` function maps the "deficiency" (information loss) as a function of the strength of confounding paths $U \to A$ ($\alpha$) and $U \to Y$ ($\gamma$).
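
As intuition for why both paths matter, consider a linear Gaussian toy model in which the harm from omitting $U$ has a closed form. The sketch below uses classical omitted-variable bias as a stand-in quantity; it is not the deficiency $\delta$ computed by the package, and the names `alpha_ua`, `gamma_uy`, and `beta_ay` are assumptions made for this example. With unit variances, the bias of the unadjusted regression slope is $\alpha\gamma / (\alpha^2 + 1)$, which vanishes if either path is absent.

```{r ovb_point_sketch}
set.seed(3)
n <- 20000
alpha_ua <- 0.5   # strength of U -> A
gamma_uy <- 0.8   # strength of U -> Y
beta_ay <- 1      # true effect of A on Y

u <- rnorm(n)
a <- alpha_ua * u + rnorm(n)
y <- beta_ay * a + gamma_uy * u + rnorm(n)

# Regression that omits the confounder U
est <- unname(coef(lm(y ~ a))["a"])

# Closed-form omitted-variable bias when Var(U) = Var(noise in A) = 1
bias_theory <- alpha_ua * gamma_uy / (alpha_ua^2 + 1)

round(c(estimated_bias = est - beta_ay, theoretical_bias = bias_theory), 3)
```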

We demonstrate `confounding_frontier()` using the `hct_outcomes` dataset (Hematopoietic Cell Transplantation).

```{r sensitivity_setup}
data("hct_outcomes")

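# Convert treatment to a numeric code starting at 0 for the Gaussian model approximation
# (`as.factor()` keeps this working whether the column is a factor or a character vector)
hct_outcomes$conditioning_numeric <- as.numeric(as.factor(hct_outcomes$conditioning_intensity)) - 1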

# Simplified specification focusing on conditioning intensity
spec_sens <- causal_spec(
  data = hct_outcomes,
  treatment = "conditioning_numeric",
  outcome = "time_to_event", # Simplified for this example
  covariates = c("age", "disease_status")
)
```

We now compute the confounding frontier. This does not require observing $U$; instead, it explores a grid of hypothetical confounding strengths.

```{r run_frontier}
frontier <- confounding_frontier(
  spec_sens,
  alpha_range = c(-1, 1),
  gamma_range = c(-1, 1),
  grid_size = 40
)

# The result contains a grid we can visualize
head(frontier$grid)
```
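
The same "scan a grid of hypothetical strengths" idea can be illustrated without the package. The sketch below is an analogy only: it fills a small grid with the absolute omitted-variable bias $|\alpha\gamma| / (\alpha^2 + 1)$ of a unit-variance linear Gaussian model, which is a different quantity from the package's `delta`. The object `toy_grid` and its `abs_bias` column are hypothetical names.

```{r ovb_grid_sketch}
# Toy grid over hypothetical confounding strengths (not the package's delta)
toy_grid <- expand.grid(
  alpha = seq(-1, 1, length.out = 5),
  gamma = seq(-1, 1, length.out = 5)
)

# Absolute omitted-variable bias in a unit-variance linear Gaussian model
toy_grid$abs_bias <- abs(toy_grid$alpha * toy_grid$gamma) / (toy_grid$alpha^2 + 1)

# Values along the axes, where one of the two paths is absent
on_axes <- toy_grid[toy_grid$alpha == 0 | toy_grid$gamma == 0, "abs_bias"]
range(on_axes)

# Largest value on this grid, reached at the corners
max(toy_grid$abs_bias)
```

In this toy model the harm is zero along both axes and grows only when both paths are present, which is the qualitative pattern a frontier plot is meant to reveal.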

## Visualizing the Frontier

We can visualize the regions where deficiency is high (problematic) vs low.

```{r plot_frontier, fig.width=6, fig.height=5}
ggplot(frontier$grid, aes(x = alpha, y = gamma, fill = delta)) +
  geom_tile() +
  scale_fill_viridis_c(option = "magma", direction = -1) +
  labs(
    title = "Confounding Frontier",
    x = "Confounding Strength U -> A (alpha)",
    y = "Confounding Strength U -> Y (gamma)",
    fill = "Deficiency\n(Delta)"
  ) +
  theme_minimal() +
  geom_contour(aes(z = delta), color = "white", breaks = c(0.01, 0.05, 0.1))
```

**Interpreting the Confounding Frontier:**

This visualization maps the "danger zones" of unobserved confounding:

**The Axes:**

- **x-axis ($\alpha$)**: Strength of the $U \to A$ path (unobserved confounder affects treatment)
- **y-axis ($\gamma$)**: Strength of the $U \to Y$ path (unobserved confounder affects outcome)
- **Color/fill ($\delta$)**: Resulting deficiency (information loss)

**Reading the Plot:**

1. **Light regions (low $\delta \approx 0$)**: 

   - "Safe zones" where causal identification is robust
   - Small amounts of unobserved confounding don't destroy identification
   - You can proceed with confidence in these scenarios

2. **Dark regions (high $\delta > 0.15$)**:

   - "Danger zones" where unobserved confounding severely compromises identification
   - Large distributional distance from the ideal experiment (TV distance > 0.15)
   - Causal conclusions are unreliable without additional assumptions

   Note that the plot reverses the magma palette (`direction = -1`), so low values are drawn light and high values dark.

3. **Contour lines** (white lines at $\delta = 0.01, 0.05, 0.1$):

   - Boundaries between different risk levels
   - The "frontier" where identification breaks down

**Scientific Decision-Making:**

Use domain knowledge to assess plausibility:

- **If you believe** $|\alpha| < 0.3$ and $|\gamma| < 0.3$ (weak confounding), and the plot shows low $\delta$ in that region: proceed with confidence
- **If plausible scenarios** put you in high-$\delta$ regions: your conclusions are fragile, and you should acknowledge this limitation
- **If uncertain** about confounding strength: report results across the frontier to show sensitivity
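
The first check can be made numeric rather than visual. The helper below, `max_delta_in_region()`, is a hypothetical function written for this vignette, not part of `causaldef`. It assumes a data frame with numeric columns `alpha`, `gamma`, and `delta`, and it is exercised here on a toy grid whose `delta` values are made up.

```{r plausible_region_sketch}
# Hypothetical helper: largest deficiency inside a box of plausible strengths.
# `grid` is assumed to have numeric columns alpha, gamma, and delta.
max_delta_in_region <- function(grid, alpha_max, gamma_max) {
  inside <- abs(grid$alpha) < alpha_max & abs(grid$gamma) < gamma_max
  if (!any(inside)) return(NA_real_)   # no grid points fall inside the box
  max(grid$delta[inside])
}

# Toy grid with made-up delta values, used only to exercise the helper
toy <- expand.grid(alpha = seq(-1, 1, by = 0.25), gamma = seq(-1, 1, by = 0.25))
toy$delta <- abs(toy$alpha * toy$gamma) / 4

worst_case <- max_delta_in_region(toy, alpha_max = 0.3, gamma_max = 0.3)
worst_case
```

If `frontier$grid` carries the `alpha`, `gamma`, and `delta` columns used in the plotting code above, the same call could be applied to it as `max_delta_in_region(frontier$grid, 0.3, 0.3)`.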

**Example Interpretation:**  
*"Our sensitivity analysis shows that if the unobserved confounding paths $U \to A$ and $U \to Y$ are both weak ($|\alpha|, |\gamma| < 0.2$), the deficiency remains below 0.05, suggesting robust identification. However, if either confounding strength exceeds 0.5, identification breaks down ($\delta > 0.1$). Given the biological plausibility of weak confounding in this system, we conclude our estimates are reasonably robust."*

The confounding frontier shows the boundary: if we are outside the "safe" zone defined by the frontier, we cannot claim causal identification without further assumptions (see Akdemir 2026, DOI: 10.5281/zenodo.18367347 for theoretical details).

# Conclusion

The `causaldef` package provides tools not just for estimation, but for **falsification** and **sensitivity analysis**.

1.  **Negative Controls** allow us to empirically test our "no unobserved confounding" assumption using auxiliary data.
2.  **Confounding Frontiers** allow us to theoretically explore how robust our study is to potential unobserved confounders.
