---
title: "Backend Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Backend Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ReproStat)
set.seed(20260324)
```

## Overview

ReproStat supports multiple model-fitting backends through the same
high-level API. That means you can often keep the same reproducibility
workflow while changing only the modeling engine.

Supported backends are:

- `"lm"` for ordinary least squares
- `"glm"` for generalized linear models
- `"rlm"` for robust regression via `MASS`
- `"glmnet"` for penalized regression via `glmnet`

This article explains when to use each one and what changes in the returned
diagnostics.

## Common interface

The same entry point is used across backends:

```r
run_diagnostics(
  formula,
  data,
  B = 200,
  method = "bootstrap",
  backend = "lm"
)
```

The key differences are in:

- how the model is fit
- which quantities are available
- how to interpret selection-related outputs

## Backend: lm

`"lm"` is the default backend and is the best place to start for standard
linear regression.

```{r lm-example}
diag_lm <- run_diagnostics(
  mpg ~ wt + hp + disp,
  data = mtcars,
  B = 100,
  backend = "lm"
)

reproducibility_index(diag_lm)
```

Use `"lm"` when:

- the response is continuous
- ordinary least squares is the intended analysis
- you want the simplest interpretation of all components

## Backend: glm

Use `"glm"` when you need a generalized linear model, such as logistic or
Poisson regression.

```{r glm-example}
diag_glm <- run_diagnostics(
  am ~ wt + hp + qsec,
  data = mtcars,
  B = 100,
  backend = "glm",
  family = stats::binomial()
)

reproducibility_index(diag_glm)
```

Notes:

- if you provide `family = ...` while leaving `backend = "lm"`, the function
  promotes the fit to `"glm"`
- prediction stability for GLMs uses response-scale predictions (for example,
  fitted probabilities for a binomial family)
- p-value and selection summaries remain available
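
To make "response scale" concrete, the sketch below uses only base R (`stats::glm()` and `stats::predict()`), not ReproStat internals. It compares link-scale and response-scale predictions for a small logistic model. The formula `am ~ wt + hp` is chosen only for illustration.

```r
# Logistic regression on mtcars using base R only
fit <- stats::glm(am ~ wt + hp, data = mtcars, family = stats::binomial())

# Link scale: log-odds, unbounded
eta <- stats::predict(fit, type = "link")

# Response scale: probabilities between 0 and 1
p <- stats::predict(fit, type = "response")

# The two are related by the inverse link (here the logistic function)
all.equal(unname(p), unname(stats::plogis(eta)))
range(p)
```

Stability measured on the response scale is therefore expressed in the units of the outcome (probabilities here) rather than in log-odds.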

## Backend: rlm

Use `"rlm"` when you want robustness against outliers or heavy-tailed
errors.

```{r rlm-example, eval = requireNamespace("MASS", quietly = TRUE)}
if (requireNamespace("MASS", quietly = TRUE)) {
  diag_rlm <- run_diagnostics(
    mpg ~ wt + hp + disp,
    data = mtcars,
    B = 100,
    backend = "rlm"
  )

  reproducibility_index(diag_rlm)
}
```

Use `"rlm"` when:

- a few influential observations may distort OLS results
- you want a more robust regression baseline
- you still want coefficient, selection, prediction, and RI summaries in a
  familiar regression framework
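
The motivation for a robust baseline can be seen without ReproStat. The following sketch calls `stats::lm()` and `MASS::rlm()` directly on a copy of `mtcars` in which one response value has been deliberately corrupted. The corrupted value and the `drift_*` names are illustrative assumptions, not part of the package.

```r
if (requireNamespace("MASS", quietly = TRUE)) {
  # Copy mtcars and corrupt one response value to mimic a gross outlier
  dat <- mtcars
  dat$mpg[1] <- 100  # artificial value, for illustration only

  clean  <- stats::lm(mpg ~ wt + hp, data = mtcars)  # reference fit
  ols    <- stats::lm(mpg ~ wt + hp, data = dat)     # OLS on corrupted data
  robust <- MASS::rlm(mpg ~ wt + hp, data = dat)     # Huber M-estimator

  # Mean absolute change in fitted values relative to the clean OLS fit
  drift_ols <- mean(abs(stats::fitted(ols) - stats::fitted(clean)))
  drift_rlm <- mean(abs(stats::fitted(robust) - stats::fitted(clean)))

  print(c(ols = drift_ols, rlm = drift_rlm))
}
```

A smaller drift for the robust fit indicates that the single corrupted observation pulled it less far from the clean-data result.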

## Backend: glmnet

Use `"glmnet"` when you want penalized regression such as LASSO, ridge, or
elastic net.

```{r glmnet-example, eval = requireNamespace("glmnet", quietly = TRUE)}
if (requireNamespace("glmnet", quietly = TRUE)) {
  diag_glmnet <- run_diagnostics(
    mpg ~ wt + hp + disp + qsec,
    data = mtcars,
    B = 100,
    backend = "glmnet",
    en_alpha = 1
  )

  reproducibility_index(diag_glmnet)
}
```

The `en_alpha` argument controls the penalty mix:

- `1` gives LASSO
- `0` gives ridge
- values in between give elastic net
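
As a rough illustration of the penalty mix, the sketch below calls `glmnet::glmnet()` directly. It assumes that `en_alpha` corresponds to glmnet's own `alpha` argument, which matches the description above but is not confirmed here. The fixed `lambda = 1` is an arbitrary choice for the example.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  x <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])
  y <- mtcars$mpg

  # Fit each penalty type at the same fixed penalty strength
  lambda <- 1
  alphas <- c(lasso = 1, elastic_net = 0.5, ridge = 0)

  coefs <- sapply(alphas, function(a) {
    fit <- glmnet::glmnet(x, y, alpha = a, lambda = lambda)
    as.matrix(stats::coef(fit))[-1, 1]  # drop the intercept
  })

  # Rows are predictors, columns are penalty types
  print(round(coefs, 4))
}
```

Ridge shrinks coefficients without setting them exactly to zero, whereas the LASSO can remove predictors entirely, which is why the selection output has a different meaning for this backend.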

Important differences for `"glmnet"`:

- p-values are not defined, so the `pvalue` component is `NA`
- selection stability measures non-zero selection frequency
- RI values are therefore based on a different component set than for the
  non-penalized backends
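
One way to picture a non-zero selection frequency is the bootstrap sketch below. It is not the package's implementation: it calls `glmnet::glmnet()` directly, fixes `lambda` at an arbitrary value, and uses hypothetical names such as `nonzero_freq`.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  x <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])
  y <- mtcars$mpg
  B <- 50
  lambda <- 1  # fixed penalty strength, chosen arbitrarily for the sketch

  # For each bootstrap resample, record which coefficients are non-zero
  selected <- replicate(B, {
    idx <- sample(nrow(x), replace = TRUE)
    fit <- glmnet::glmnet(x[idx, ], y[idx], alpha = 1, lambda = lambda)
    as.matrix(stats::coef(fit))[-1, 1] != 0
  })

  # Share of resamples in which each predictor was kept by the LASSO
  nonzero_freq <- rowMeans(selected)
  print(nonzero_freq)
}
```

A frequency near 1 means the predictor survives the penalty in almost every resample; a frequency near 0 means it is almost always shrunk to zero.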

## Backend comparison summary

| Backend | Best for | P-values available? | Selection meaning |
|---------|----------|---------------------|-------------------|
| `"lm"` | standard linear regression | yes | sign consistency |
| `"glm"` | logistic / GLM use cases | yes | sign consistency |
| `"rlm"` | robust regression | yes | sign consistency |
| `"glmnet"` | penalized regression | no | non-zero frequency |
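
This vignette does not define "sign consistency" formally. One plausible reading, sketched below in base R, is the share of bootstrap refits in which a coefficient keeps the sign it has in the full-data fit. The name `sign_consistency` and the choice of `B = 200` are assumptions made for the example.

```r
set.seed(1)
B <- 200
f <- mpg ~ wt + hp + disp

# Reference signs from the fit on the full data (intercept dropped)
ref_sign <- sign(stats::coef(stats::lm(f, data = mtcars))[-1])

# Refit on bootstrap resamples and check whether each sign matches
agree <- replicate(B, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  boot_coef <- stats::coef(stats::lm(f, data = mtcars[idx, ]))[-1]
  sign(boot_coef) == ref_sign
})

# Share of resamples in which each coefficient kept its reference sign
sign_consistency <- rowMeans(agree)
print(sign_consistency)
```

Under this reading, the non-penalized backends ask "does the direction of the effect hold up?", while `"glmnet"` asks "is the predictor kept at all?".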

## Choosing a backend in practice

A simple decision pattern is:

1. Start with `"lm"` if a standard linear model is appropriate.
2. Move to `"glm"` when the response distribution requires it.
3. Use `"rlm"` when outlier resistance matters.
4. Use `"glmnet"` when shrinkage, regularization, or sparse selection is the
   main modeling goal.

## Comparing RI values across backends

Be careful when comparing RI values between penalized and non-penalized
backends.

For `"glmnet"`, the p-value component is unavailable, so the composite score
is formed from a different set of ingredients. That makes cross-backend RI
comparisons descriptive at best, not strictly apples-to-apples.
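
A toy calculation shows why the missing component matters. The component values below are invented, and the aggregation rule (an unweighted mean of available components) is an assumption for illustration only. Neither is taken from ReproStat.

```r
# Hypothetical component scores on a 0-1 scale; these numbers are invented
# for illustration and are not output from ReproStat
components_lm <- c(
  coefficient = 0.80, pvalue = 0.60, selection = 0.90, prediction = 0.85
)
components_glmnet <- c(
  coefficient = 0.80, pvalue = NA, selection = 0.90, prediction = 0.85
)

# Assumed aggregation: unweighted mean of the components that are available
composite <- function(x) mean(x, na.rm = TRUE)

score_lm     <- composite(components_lm)
score_glmnet <- composite(components_glmnet)
print(c(lm = score_lm, glmnet = score_glmnet))
```

Even though the three shared components are identical, the two composites differ, purely because one of them averages over fewer ingredients.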

## Model comparison with repeated CV

All backends can also be used in `cv_ranking_stability()`:

```{r cv-example}
models <- list(
  compact = mpg ~ wt + hp,
  fuller  = mpg ~ wt + hp + disp
)

cv_obj <- cv_ranking_stability(
  models,
  mtcars,
  v = 5,
  R = 20,
  backend = "lm"
)

cv_obj$summary
```

This is especially valuable when you are choosing between competing formulas
and want to know not just which model is best on average, but which one is
consistently best.
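
The difference between "best on average" and "consistently best" can be sketched in base R. The code below is not how `cv_ranking_stability()` is implemented. It assumes RMSE as the error metric and uses hypothetical names (`cv_rmse_once`, `top_rate`) to repeat a 5-fold cross-validation 20 times and count how often each formula ranks first.

```r
set.seed(1)
models <- list(
  compact = mpg ~ wt + hp,
  fuller  = mpg ~ wt + hp + disp
)
v <- 5
R <- 20

# One repetition: assign folds at random, return the CV RMSE of each model
cv_rmse_once <- function() {
  fold <- sample(rep(seq_len(v), length.out = nrow(mtcars)))
  sapply(models, function(f) {
    sq_err <- unlist(lapply(seq_len(v), function(k) {
      fit  <- stats::lm(f, data = mtcars[fold != k, ])
      pred <- stats::predict(fit, newdata = mtcars[fold == k, ])
      (mtcars$mpg[fold == k] - pred)^2
    }))
    sqrt(mean(sq_err))
  })
}

rmse <- replicate(R, cv_rmse_once())  # rows: models, columns: repetitions

# Average error versus how often each model ranks first
mean_rmse <- rowMeans(rmse)
winner    <- rownames(rmse)[apply(rmse, 2, which.min)]
top_rate  <- table(factor(winner, levels = names(models))) / R

print(mean_rmse)
print(top_rate)
```

Two models with similar mean error can still have very different top-rank rates, and the top-rank rate is the quantity a ranking-stability summary is meant to expose.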

## Next steps

For a broader conceptual explanation, read the interpretation article.
For a complete first analysis, start with `vignette("ReproStat-intro")`.
