---
title: "Factorial Designs with GLM"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Factorial Designs with GLM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

```{r setup, message = FALSE}
library(pepdiff)
library(dplyr)
```

## When you need GLM

Pairwise tests work well for simple two-group comparisons. But real experiments often have multiple factors:

- Treatment × Timepoint (does treatment effect change over time?)
- Genotype × Treatment (does treatment work differently in wildtype vs knockout?)
- Condition × Dose × Timepoint (complex factorial)

GLM (Generalised Linear Model) handles these designs naturally. It models all factors simultaneously, accounting for their individual and combined effects.

## Why Gamma GLM?

Proteomics abundances have particular properties:

1. **Always positive**: You can't have negative abundance
2. **Right-skewed**: Many low values, few high values
3. **Variance scales with mean**: More abundant peptides have more variable measurements

The Gamma distribution models data with exactly these properties. The log link function means model coefficients are interpretable as fold changes - which is what you want.

**Important:** Zeros cause Gamma GLM to fail. This is correct behaviour - a zero abundance is either a missing value (below detection limit) or a data error. pepdiff treats zeros as missing data.

## Example: 2×2 factorial design

A **2×2 factorial design** means two factors, each with two levels. Here: treatment (ctrl, trt) × timepoint (0h, 24h). This gives four experimental conditions:

|           | 0h  | 24h |
|-----------|-----|-----|
| **ctrl**  | ✓   | ✓   |
| **trt**   | ✓   | ✓   |

Each peptide is measured in all four conditions (with replicates), letting us ask: does treatment have an effect? Does that effect depend on timepoint?

We'll use 5 replicates per condition (20 total observations per peptide) and simulate different effect patterns to see how GLM handles them.

### The simulated peptide groups

To illustrate how GLM handles different effect patterns, we simulate 40 peptides in four groups:

| Group | Peptides | Effect Pattern | What GLM should find |
|-------|----------|----------------|----------------------|
| **A** | PEP_001–010 | Treatment effect only | Main effect significant, same at both timepoints |
| **B** | PEP_011–020 | Timepoint effect only | No treatment effect (treatment comparison) |
| **C** | PEP_021–030 | Interaction | Treatment works at 24h but not 0h |
| **D** | PEP_031–040 | No effect | Nothing significant |

### What these patterns look like in raw data

Here's what the mean abundances look like for each group across the four conditions (using a baseline of ~10 for illustration):

| Group | ctrl @ 0h | ctrl @ 24h | trt @ 0h | trt @ 24h | Pattern |
|-------|-----------|------------|----------|-----------|---------|
| **A** | 10 | 10 | **30** | **30** | trt 3× higher at both timepoints |
| **B** | 10 | **30** | 10 | **30** | 24h 3× higher, same for ctrl and trt |
| **C** | 10 | 10 | 10 | **40** | trt effect only at 24h (interaction) |
| **D** | 10 | 10 | 10 | 10 | no differences |

Notice how Group C is the tricky one: treatment has no effect at 0h (10 vs 10) but a 4-fold effect at 24h (10 vs 40). The main effect would average these, giving ~2-fold - which doesn't represent either timepoint accurately.

We'll refer back to these groups as we analyse the results.

```{r generate-data}
set.seed(456)

n_peptides <- 40
n_reps <- 5  # 5 reps per condition = 20 observations per peptide

peptides <- paste0("PEP_", sprintf("%03d", 1:n_peptides))
genes <- paste0("GENE_", LETTERS[((1:n_peptides - 1) %% 26) + 1])

# Step 1: Create base abundance for each peptide
# Each peptide gets ONE base value shared across all its observations
peptide_info <- tibble(
  peptide = peptides,
  gene_id = genes,
  pep_num = 1:n_peptides,
  base = rgamma(n_peptides, shape = 6, rate = 0.6)  # Random baseline abundance
)

# Step 2: Create the experimental design (all combinations)
# 40 peptides × 2 treatments × 2 timepoints × 5 reps = 800 observations
design <- expand.grid(
  peptide = peptides,
  treatment = c("ctrl", "trt"),
  timepoint = c("0h", "24h"),
  bio_rep = 1:n_reps,
  stringsAsFactors = FALSE
)

# Step 3: Join design with peptide info and apply effects
sim_data <- design %>%
  left_join(peptide_info, by = "peptide") %>%
  mutate(
    # Group A (peptides 1-10): Treatment effect only
    # trt is 3-fold higher than ctrl, same at both timepoints
    trt_effect = ifelse(pep_num <= 10 & treatment == "trt", 3, 1),

    # Group B (peptides 11-20): Timepoint effect only
    # 24h is 3-fold higher than 0h, same for both treatments
    # NO treatment effect - ctrl and trt should be identical
    time_effect = ifelse(pep_num > 10 & pep_num <= 20 & timepoint == "24h", 3, 1),

    # Group C (peptides 21-30): Interaction
    # Treatment works at 24h (4-fold) but NOT at 0h
    int_effect = case_when(
      pep_num > 20 & pep_num <= 30 & treatment == "trt" & timepoint == "24h" ~ 4,
      TRUE ~ 1
    ),

    # Group D (peptides 31-40): No effect
    # All effect multipliers are 1 (no change)

    # Final abundance = base × all effects, with Gamma noise
    value = rgamma(n(), shape = 15, rate = 15 / (base * trt_effect * time_effect * int_effect))
  ) %>%
  select(peptide, gene_id, treatment, timepoint, bio_rep, value)

# Import
temp_file <- tempfile(fileext = ".csv")
write.csv(sim_data, temp_file, row.names = FALSE)

dat <- read_pepdiff(
  temp_file,
  id = "peptide",
  gene = "gene_id",
  value = "value",
  factors = c("treatment", "timepoint"),
  replicate = "bio_rep"
)

dat
```

## Basic GLM analysis: main effects

```{r basic-glm}
results <- compare(
  dat,
  compare = "treatment",
  ref = "ctrl",
  method = "glm"
)

results
```

This fits a Gamma GLM for each peptide with both treatment and timepoint as predictors, then extracts the treatment contrast (trt vs ctrl) **averaged across timepoints**. This averaged effect is called a "marginal" or "main" effect.

### Did GLM find the treatment effects?

Since we simulated this data, we know the truth:

- **Group A**: 3-fold treatment effect at both timepoints
- **Group B**: No treatment effect (the timepoint change affects ctrl and trt equally)
- **Group C**: 4-fold treatment effect at 24h, but no effect at 0h
- **Group D**: No effect

Let's see what the GLM main effect analysis finds:

```{r view-results}
results$results %>%
  mutate(
    pep_num = as.numeric(gsub("PEP_", "", peptide)),
    group = case_when(
      pep_num <= 10 ~ "A: Treatment only",
      pep_num <= 20 ~ "B: Timepoint only",
      pep_num <= 30 ~ "C: Interaction",
      TRUE ~ "D: No effect"
    )
  ) %>%
  group_by(group) %>%
  summarise(
    n_significant = sum(significant),
    n_peptides = n(),
    median_fc = round(median(fold_change, na.rm = TRUE), 2)
  )
```

- **Group A**: All 10 detected ✓ — the consistent 3-fold treatment effect is correctly found
- **Group B**: 0 detected ✓ — correctly identifies no treatment effect (the timepoint change affects both ctrl and trt equally, so trt/ctrl ≈ 1)
- **Group C**: All 10 detected — but note the diluted fold change (~2 instead of 4). The main effect averages the 24h effect (4-fold) with the 0h effect (1-fold), giving ~2-fold
- **Group D**: 0-1 detected ✓ — correctly negative (1 false positive at α=0.05 is expected by chance)

```{r plot-glm-results, fig.height = 6}
plot(results)
```

## Stratified comparisons with `within`

The main effect averages across timepoints. But Group C has an effect at 24h only - the main effect diluted a 4-fold change to ~2-fold. Can we recover the true effect?

Use the `within` argument to get **conditional effects** - the treatment effect within each timepoint separately:

```{r stratified}
results_stratified <- compare(
  dat,
  compare = "treatment",
  ref = "ctrl",
  within = "timepoint",
  method = "glm"
)

results_stratified
```

Now we get separate comparisons: treatment effect at 0h, and treatment effect at 24h.

### Stratified results by group

Let's see how each group looks when we analyse timepoints separately:

```{r view-stratified}
results_stratified$results %>%
  mutate(
    pep_num = as.numeric(gsub("PEP_", "", peptide)),
    group = case_when(
      pep_num <= 10 ~ "A: Treatment only",
      pep_num <= 20 ~ "B: Timepoint only",
      pep_num <= 30 ~ "C: Interaction",
      TRUE ~ "D: No effect"
    )
  ) %>%
  group_by(group, timepoint) %>%
  summarise(
    n_significant = sum(significant),
    median_fc = round(median(fold_change, na.rm = TRUE), 2),
    .groups = "drop"
  ) %>%
  tidyr::pivot_wider(
    names_from = timepoint,
    values_from = c(n_significant, median_fc)
  )
```

**What stratified analysis reveals:**

- **Group A**: FC ≈ 3 at both timepoints — the main effect was accurate
- **Group B**: FC ≈ 1 at both timepoints, 0 significant — correctly negative at both
- **Group C**: FC ≈ 1 at 0h, FC ≈ 4 at 24h — now we see the true 4-fold effect!
- **Group D**: FC ≈ 1 at both, ~0 significant — correctly negative

For Group C, stratified analysis recovers the true 4-fold effect at 24h that was diluted to ~2-fold in the main effect.

## Handling interactions

Group C illustrates an **interaction**: the treatment effect depends on timepoint. The main effect (FC ≈ 2) doesn't represent either condition accurately - it's the average of a 4-fold effect and no effect.

**The emmeans warning:** If you see "Results may be misleading due to involvement in interactions", it means exactly this - the averaged main effect may not represent any real biological condition.

### When to use main effects vs stratified

```{r check-interaction}
# Compare main vs stratified for all groups
main_summary <- results$results %>%
  mutate(
    pep_num = as.numeric(gsub("PEP_", "", peptide)),
    group = case_when(
      pep_num <= 10 ~ "A", pep_num <= 20 ~ "B",
      pep_num <= 30 ~ "C", TRUE ~ "D"
    )
  ) %>%
  group_by(group) %>%
  summarise(main_fc = round(median(fold_change, na.rm = TRUE), 2))

strat_summary <- results_stratified$results %>%
  mutate(
    pep_num = as.numeric(gsub("PEP_", "", peptide)),
    group = case_when(
      pep_num <= 10 ~ "A", pep_num <= 20 ~ "B",
      pep_num <= 30 ~ "C", TRUE ~ "D"
    )
  ) %>%
  group_by(group, timepoint) %>%
  summarise(fc = round(median(fold_change, na.rm = TRUE), 2), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = timepoint, values_from = fc)

left_join(main_summary, strat_summary, by = "group")
```

**Use main effects when:** the effect is consistent across conditions (Group A: ~3 at both timepoints, main ≈ 3)

**Use stratified (`within`) when:** the effect differs across conditions (Group C: ~1 at 0h, ~4 at 24h, main ≈ 2 is misleading)

## Checking convergence

Not every peptide's model will converge. This happens when:
- Too few observations for that peptide
- Extreme values or near-zero variance
- Data patterns that the model can't fit

```{r check-diagnostics}
# Check diagnostics
n_converged <- sum(results$diagnostics$converged)
n_failed <- sum(!results$diagnostics$converged)

cat("Converged:", n_converged, "\n")
cat("Failed:", n_failed, "\n")
```

Failed peptides are excluded from results. This is correct behaviour - if the model can't fit, you shouldn't trust its estimates.

```{r view-failed, eval = FALSE}
# View failed peptides (if any)
results$diagnostics %>%
  filter(!converged) %>%
  select(peptide, converged)
```

**Common causes of convergence failure:**
- Peptide measured in too few samples
- All values identical (zero variance)
- Extreme outliers
- Too many missing values

**What to do:**
- Check if failed peptides have data quality issues
- Consider whether they have enough observations for the model complexity
- Don't force convergence - if it fails, you need more data or a simpler model
- See peppwR for power analysis and sample size planning

## Model checking

The diagnostics slot contains model information:

```{r model-diag}
# Deviance: lower is better fit (relative to other peptides)
results$diagnostics %>%
  arrange(desc(deviance)) %>%
  head()
```

The `error` column contains diagnostic messages only when models fail to converge - `NA` means no error occurred (success). The `deviance` column measures model fit: high deviance relative to other peptides may indicate poor fit. Investigate these peptides if they're in your significant results.

## Exporting results

The results are stored in a tidy tibble, ready for export:

```{r results-structure}
# Main effect results
glimpse(results$results)
```

Stratified results include the `within` factor as an additional column:

```{r stratified-structure}
# Stratified results - note the timepoint column
glimpse(results_stratified$results)
```

Save to CSV for downstream analysis:

```{r export-results, eval = FALSE}
# All results
write.csv(results$results, "glm_results.csv", row.names = FALSE)

# Significant hits only
results$results %>%
  filter(significant) %>%
  write.csv("significant_hits.csv", row.names = FALSE)

# Stratified results (one row per peptide per timepoint)
write.csv(results_stratified$results, "stratified_results.csv", row.names = FALSE)
```

## Practical recommendations

1. **Start simple**: Run without `within` first to get main effects
2. **Check for interactions**: If effects differ across factor levels, use `within` (see below)
3. **Inspect convergence**: Many failures suggest the model is too complex for your data
4. **Look at diagnostics**: Volcano and p-value histogram should look reasonable
5. **Don't over-interpret**: These are statistical estimates, not ground truth

### Quick interaction check

To check for interactions in your own data, compare fold changes across strata:

```r
compare(dat, compare = "treatment", ref = "ctrl",
        within = "timepoint", method = "glm")$results %>%
  group_by(timepoint) %>%
  summarise(median_fc = median(fold_change), .groups = "drop")
```

If these FCs differ substantially (e.g., 1 vs 4), you have an interaction - use stratified results.

## When GLM struggles

- Very small sample sizes (< 3 per condition)
- Heavy-tailed or highly non-normal data
- Many zeros (consider whether these are true zeros or missing)

If GLM diagnostics look poor, try the ART method - see `vignette("art_analysis")`.

```{r cleanup, include = FALSE}
unlink(temp_file)
```