---
title: "Checking GLM Model Fit"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Checking GLM Model Fit}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6
)
```

```{r setup, message = FALSE}
library(pepdiff)
library(dplyr)
```

## Why Check Model Fit?

pepdiff uses a **Gamma GLM with log link** for differential abundance analysis. This model assumes:

- Abundances are positive (no zeros)
- Variance increases with the mean (common in proteomics)
- The relationship between predictors and response is multiplicative (log-linear)

When these assumptions hold, GLM gives reliable p-values and accurate fold change estimates. When they don't, results may be misleading.

**The good news:** You don't need a statistics degree to check model fit. `plot_fit_diagnostics()` gives you visual checks that tell you whether to trust your GLM results or try ART instead.

## Quick Check with plot_fit_diagnostics()

Let's start with simulated data that fits the GLM assumptions well.

```{r generate-good-data}
set.seed(123)

n_peptides <- 40
n_reps <- 5

peptides <- paste0("PEP_", sprintf("%03d", 1:n_peptides))
genes <- paste0("GENE_", LETTERS[((1:n_peptides - 1) %% 26) + 1])

# Simulate well-behaved Gamma data
design <- expand.grid(
  peptide = peptides,
  treatment = c("ctrl", "trt"),
  bio_rep = 1:n_reps,
  stringsAsFactors = FALSE
)

sim_data <- design %>%
  mutate(
    gene_id = genes[match(peptide, peptides)],
    pep_num = as.numeric(gsub("PEP_", "", peptide)),
    base = 10 + pep_num * 0.5,
    # First 20 peptides have 2-fold treatment effect
    effect = ifelse(pep_num <= 20 & treatment == "trt", 2, 1),
    # Gamma noise (well-behaved)
    value = rgamma(n(), shape = 15, rate = 15 / (base * effect))
  ) %>%
  select(peptide, gene_id, treatment, bio_rep, value)

# Import
temp_file <- tempfile(fileext = ".csv")
write.csv(sim_data, temp_file, row.names = FALSE)

dat <- read_pepdiff(
  temp_file,
  id = "peptide",
  gene = "gene_id",
  value = "value",
  factors = "treatment",
  replicate = "bio_rep"
)

unlink(temp_file)
```

Run GLM analysis and check diagnostics:

```{r run-glm-good, message = FALSE}
results <- compare(dat, compare = "treatment", ref = "ctrl", method = "glm")

# Check fit diagnostics
diag <- plot_fit_diagnostics(results)
```

```{r show-good-plot, fig.height = 8}
diag$plot
```

## Reading the Diagnostic Plots

### Panel A: Deviance Distribution

This histogram shows how well each peptide's GLM model fits its data. **Lower deviance = better fit.**

**What to look for:**
- Most peptides should cluster at low deviance values
- A few peptides above the red threshold line is normal (by default, 5%)
- A long right tail means some peptides fit poorly

**In our example:** The distribution is compact with only a few peptides above the threshold. This looks healthy.

### Panel B: Deviance vs Effect Size

This scatter plot reveals whether poorly-fitting peptides cluster at certain effect sizes.

**What to look for:**
- Random scatter = no systematic problem
- High deviance at extreme fold changes = outlier-driven "significant" results (red flag)
- Curved pattern = systematic misfit

**In our example:** Points are scattered without clear pattern. No obvious relationship between fold change and fit quality.

### Panel C: Sample Residual Plots

These show residuals (observed - predicted) vs fitted values for individual peptides. The function selects peptides with high, median, and low deviance to give you a representative sample.

**What good fit looks like:**
- Points scattered randomly around zero (the dashed line)
- No funnel shape (variance shouldn't change with fitted value)
- Flat red loess line (no trend)

**What poor fit looks like:**
- Curved pattern in points
- Funnel shape (variance increasing or decreasing)
- Curved red loess line

**In our example:** The residual plots show random scatter around zero with mostly flat loess lines.

### Panel D: Pooled QQ Plot

This combines residuals from all peptides to check if they follow the expected distribution.

**What good fit looks like:**
- Points fall along the diagonal line
- Minor wiggles at the tails are OK

**What poor fit looks like:**
- S-shaped curve = heavy tails (consider ART)
- Systematic bow = wrong distributional assumption
- Points far from line = outliers

**In our example:** Points follow the line reasonably well, suggesting the Gamma distribution is appropriate.

## Investigating Flagged Peptides

The function identifies peptides with potential fit issues:

```{r view-flagged}
diag$flagged
```

```{r summary-stats}
# Summary statistics
diag$summary
```

If peptides are flagged, investigate them:

1. **Check the data** - Do these peptides have outliers, zeros, or unusual patterns?
2. **Look at sample size** - Low replication can cause fitting issues
3. **Consider biology** - Are these peptides biologically different?

## Example: Data with Poor Fit

Let's simulate data that violates GLM assumptions - heavy tails with outliers:

```{r generate-bad-data}
set.seed(456)

# Same design but with outliers and heavy tails
sim_bad <- design %>%
  mutate(
    gene_id = genes[match(peptide, peptides)],
    pep_num = as.numeric(gsub("PEP_", "", peptide)),
    base = 10 + pep_num * 0.5,
    effect = ifelse(pep_num <= 20 & treatment == "trt", 2, 1),
    # Add outliers (10% of observations)
    outlier = ifelse(runif(n()) < 0.1, runif(n(), 3, 8), 1),
    # Heavier-tailed noise
    value = rgamma(n(), shape = 3, rate = 3 / (base * effect)) * outlier
  ) %>%
  select(peptide, gene_id, treatment, bio_rep, value)

temp_file <- tempfile(fileext = ".csv")
write.csv(sim_bad, temp_file, row.names = FALSE)

dat_bad <- read_pepdiff(
  temp_file,
  id = "peptide",
  gene = "gene_id",
  value = "value",
  factors = "treatment",
  replicate = "bio_rep"
)

unlink(temp_file)
```

```{r run-glm-bad, message = FALSE}
results_bad <- compare(dat_bad, compare = "treatment", ref = "ctrl", method = "glm")
diag_bad <- plot_fit_diagnostics(results_bad)
```

```{r show-bad-plot, fig.height = 8}
diag_bad$plot
```

**Notice the differences:**

- **Deviance distribution** has a longer right tail with more flagged peptides
- **Residual plots** may show non-random patterns
- **QQ plot** deviates from the line, especially in the tails

When diagnostics look like this, ART may be more appropriate.

## Decision: GLM or ART?

### Use GLM when:

- Deviance distribution looks reasonable (few flagged, <10-15%)
- No systematic patterns in residual plots
- QQ plot is reasonably linear
- You want interpretable fold changes

### Consider ART when:

- Many peptides (>15%) have high deviance
- Residual plots show systematic curves or funnels
- QQ plot shows heavy tails (S-curve)
- You have known outliers or heavy-tailed data
- You're uncomfortable with distributional assumptions

### Try ART on the problematic data:

```{r try-art, message = FALSE}
results_art <- compare(dat_bad, compare = "treatment", ref = "ctrl", method = "art")

# Compare significant calls
comparison <- tibble(
  peptide = results_bad$results$peptide,
  glm_sig = results_bad$results$significant,
  art_sig = results_art$results$significant
)

cat("Both significant:", sum(comparison$glm_sig & comparison$art_sig), "\n")
cat("GLM only:", sum(comparison$glm_sig & !comparison$art_sig), "\n")
cat("ART only:", sum(!comparison$glm_sig & comparison$art_sig), "\n")
```

## Summary

1. **Always run `plot_fit_diagnostics()` after GLM analysis** - it only takes seconds
2. **Check all four panels** for warning signs
3. **Investigate flagged peptides** if needed - check for data quality issues
4. **Switch to ART only if diagnostics suggest systematic problems** - ART trades power for robustness
5. **When in doubt, run both methods** and compare results

GLM is the default because it's more powerful when assumptions hold. Use diagnostics to verify those assumptions, and switch to ART when they don't.

See `vignette("art_analysis")` for details on using the ART method.
