---
title: "flexhaz: Hazard-First Survival Modeling"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{flexhaz: Hazard-First Survival Modeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
```

## The Problem

Standard parametric survival models force you to choose from a short list of
hazard shapes: constant (exponential), monotone power-law (Weibull), exponential
growth (Gompertz). But real failure patterns rarely cooperate:

- **Ceramic capacitors** exhibit wear-out that accelerates with age — the failure
  rate climbs faster than any power law.
- **Software systems** follow a bathtub curve: early crashes from latent bugs,
  a stable period, then increasing failures as dependencies rot.
- **Post-surgical patients** face high initial risk that drops sharply, then
  slowly rises again as long-term complications emerge.

Each of these demands a hazard function that standard families cannot express.
You can reach for semi-parametric methods (Cox regression), but then you lose the
ability to extrapolate, simulate, or compute closed-form quantities like mean
time to failure.

## The Idea

`flexhaz` takes a different approach: **you write the hazard function, and the
package derives everything else**.

Given a hazard (failure rate) function $h(t)$, the full distribution follows:

$$
h(t) \;\xrightarrow{\;\int\;}\; H(t) = \int_0^t h(u)\,du
      \;\xrightarrow{\;\exp\;}\; S(t) = e^{-H(t)}
      \;\xrightarrow{\;\times h\;}\; f(t) = h(t)\,S(t)
$$

If you can express $h(t)$ as an R function of time and parameters, `flexhaz`
gives you survival curves, CDFs, densities, quantiles, random sampling,
log-likelihoods, MLE fitting, and residual diagnostics — automatically.

## Installation

```{r eval=FALSE}
# From r-universe
install.packages("flexhaz", repos = "https://queelius.r-universe.dev")
```

## Your First Distribution

```{r setup}
library(flexhaz)
```

Let's start with the simplest case: the exponential distribution with constant
failure rate.

```{r}
# Exponential with failure rate lambda = 0.1 (MTTF = 10 time units)
exp_dist <- dfr_exponential(lambda = 0.1)
print(exp_dist)
```

### Get Distribution Functions

All distribution functions follow a two-step pattern — the method returns a
**closure** that you then call with time values:

```{r}
# Get closures
S <- surv(exp_dist)
h <- hazard(exp_dist)
f <- density(exp_dist)

# Evaluate at specific times
S(10)   # P(survive past t=10) = exp(-0.1 * 10) ~ 0.37
h(10)   # Hazard at t=10 = 0.1 (constant for exponential)
f(10)   # PDF at t=10
```

You can verify these analytically: $S(10) = e^{-0.1 \times 10} = e^{-1} \approx 0.368$,
the hazard is constant at $\lambda = 0.1$, and
$f(10) = \lambda \, S(10) = 0.1 \times 0.368 \approx 0.0368$.

### Generate Samples

```{r}
samp <- sampler(exp_dist)
set.seed(42)
failure_times <- samp(20)

# Sample mean should be close to MTTF = 1/lambda = 10
mean(failure_times)
```

With only 20 samples the mean can deviate noticeably from the theoretical MTTF
of 10. Larger samples converge closer.

## Fitting to Data

Now let's fit a model to survival data using maximum likelihood estimation.

```{r}
set.seed(123)
n <- 50
df <- data.frame(
  t = rexp(n, rate = 0.08),   # True lambda = 0.08
  delta = rep(1, n)            # All exact observations
)
head(df)
```

### Maximum Likelihood Estimation

```{r, warning=FALSE}
# Template with no parameters (to be estimated)
exp_template <- dfr_exponential()

solver <- fit(exp_template)
result <- solver(df, par = c(0.1))  # Initial guess
coef(result)

# Compare to analytical MLE: lambda_hat = n / sum(t)
n / sum(df$t)
```

The numerical MLE matches the closed-form solution $\hat\lambda = n / \sum t_i$
to five decimal places. Both estimate $\lambda \approx 0.071$, which is
consistent with the true value of 0.08 given the sampling variability at $n = 50$.

### Model Diagnostics

Check model fit using Cox-Snell residuals (should follow the Exp(1) diagonal):

```{r, fig.alt="Cox-Snell residuals Q-Q plot showing points close to diagonal line, indicating good fit."}
fitted_dist <- dfr_exponential(lambda = coef(result))
qqplot_residuals(fitted_dist, df)
```

## Working with Censored Data

Real survival data often includes **censoring** — observations where the exact
failure time is not directly observed. The `flexhaz` package uses a `delta`
column to encode three types of observation:

| `delta` | Meaning | Log-likelihood contribution |
|---------|---------|----------------------------|
| `1`  | Exact failure | $\log h(t) - H(t)$ |
| `0`  | Right-censored (survived past $t$) | $-H(t)$ |
| `-1` | Left-censored (failed before $t$) | $\log(1 - e^{-H(t)})$ |

### Right-Censoring

Right-censoring is the most common type — the subject was still alive (or the
device was still running) when observation ended.

```{r, warning=FALSE}
df_censored <- data.frame(
  t = c(5, 8, 12, 15, 20, 25, 30, 30, 30, 30),
  delta = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)  # Last 5 right-censored at t=30
)

solver <- fit(dfr_exponential())
result <- solver(df_censored, par = c(0.1))
coef(result)
```

The censored MLE properly accounts for the five units that survived past $t = 30$.
You can verify: the analytical MLE for censored exponential data is
$\hat\lambda = d / \sum t_i$ where $d$ is the number of failures, giving
$5 / 205 \approx 0.024$ — exactly what the optimizer returns.

### Left-Censoring

Left-censoring arises when we know failure occurred *before* an inspection time,
but not exactly when. This is common in periodic-inspection studies: you check a
device at time $t$ and discover it has already failed.

```{r, warning=FALSE}
df_left <- data.frame(
  t = c(5, 10, 15, 20),
  delta = c(-1, -1, 1, 0)   # left-censored, left-censored, exact, right-censored
)

solver <- fit(dfr_exponential())
result <- solver(df_left, par = c(0.1))
coef(result)
```

This dataset mixes all three observation types: two left-censored, one exact
failure at $t = 15$, and one right-censored at $t = 20$. The estimate
$\hat\lambda \approx 0.072$ is higher than a right-censored-only estimate
because the left-censored observations provide evidence of earlier failures.
All three types can coexist in the same dataset — the log-likelihood sums the
appropriate contribution for each observation.

## Custom Column Names

By default, `flexhaz` looks for columns named `t` (observation time) and `delta`
(event indicator). When your data uses different column names — as is common with
clinical datasets — use the `ob_col` and `delta_col` arguments:

```{r, warning=FALSE}
clinical_data <- data.frame(
  time = c(5, 8, 12, 15, 20),
  status = c(1, 1, 1, 0, 0)
)

dist <- dfr_exponential()
dist$ob_col <- "time"
dist$delta_col <- "status"

solver <- fit(dist)
result <- solver(clinical_data, par = c(0.1))
coef(result)
```

The estimate $\hat\lambda = 0.05$ equals $3 / 60 = d / \sum t_i$, confirming
that the column mapping works correctly.

You can also set custom columns at construction time via `dfr_dist()`:

```{r eval=FALSE}
dist <- dfr_dist(
  rate = function(t, par, ...) rep(par[1], length(t)),
  cum_haz_rate = function(t, par, ...) par[1] * t,
  ob_col = "time",
  delta_col = "status"
)
```

## Beyond Exponential: Weibull

The exponential assumes constant failure rate. For **increasing** or **decreasing**
failure rates, use Weibull:

```{r, fig.alt="Two hazard curves: exponential (flat line) and Weibull shape=2 (increasing curve)."}
# Weibull with increasing failure rate (wear-out)
weib_dist <- dfr_weibull(shape = 2, scale = 20)

# Compare hazard functions
plot(weib_dist, what = "hazard", xlim = c(0, 30), col = "blue",
     main = "Hazard Comparison")
plot(dfr_exponential(lambda = 0.05), what = "hazard", add = TRUE, col = "red")
legend("topleft", c("Weibull (shape=2)", "Exponential"),
       col = c("blue", "red"), lwd = 2)
```

Weibull shape parameter interpretation:

- **shape < 1**: Decreasing hazard (infant mortality, burn-in failures)
- **shape = 1**: Constant hazard (exponential)
- **shape > 1**: Increasing hazard (wear-out, aging)

## The Real Power: Custom Hazards

Where `flexhaz` truly shines is modeling **non-standard hazard patterns** that no
built-in distribution can express.

### Define a bathtub hazard

$$h(t) = a\,e^{-bt} + c + d\,t^k$$

Three additive components: exponential decay (infant mortality), constant
baseline (useful life), and power-law growth (wear-out).

```{r define-dist}
bathtub <- dfr_dist(
  rate = function(t, par, ...) {
    a <- par[[1]]  # infant mortality magnitude
    b <- par[[2]]  # infant mortality decay
    c <- par[[3]]  # baseline hazard
    d <- par[[4]]  # wear-out coefficient
    k <- par[[5]]  # wear-out exponent
    a * exp(-b * t) + c + d * t^k
  },
  par = c(a = 0.5, b = 0.5, c = 0.01, d = 5e-5, k = 2.5)
)
```

### Plot the hazard curve

```{r plot-hazard, fig.alt="Bathtub hazard curve showing high infant mortality at left, flat useful life in the middle, and rising wear-out on the right."}
h <- hazard(bathtub)
t_seq <- seq(0.01, 30, length.out = 300)
plot(t_seq, sapply(t_seq, h), type = "l", lwd = 2, col = "darkred",
     xlab = "Time", ylab = "Hazard rate h(t)",
     main = "Bathtub Hazard Function")
```

### Plot the survival curve

```{r plot-surv, fig.alt="Survival curve derived from the bathtub hazard, showing rapid early decline that stabilizes then drops again."}
plot(bathtub, what = "survival", xlim = c(0, 30),
     col = "darkblue", lwd = 2,
     main = "Survival Function S(t)")
```

### Simulate data and fit via MLE

```{r simulate-fit, warning=FALSE}
# Generate failure times, right-censored at t = 25
set.seed(42)
samp <- sampler(bathtub)
raw_times <- samp(80)
censor_time <- 25

df <- data.frame(
  t = pmin(raw_times, censor_time),
  delta = as.integer(raw_times <= censor_time)
)

cat("Observed failures:", sum(df$delta), "out of", nrow(df), "\n")
cat("Right-censored:", sum(1 - df$delta), "\n")

# Fit the model
solver <- fit(bathtub)
result <- solver(df,
  par = c(0.3, 0.3, 0.02, 1e-4, 2),
  method = "Nelder-Mead"
)
```

### Inspect the fit

```{r inspect}
cat("Estimated parameters:\n")
print(coef(result))

cat("\nTrue parameters:\n")
print(c(a = 0.5, b = 0.5, c = 0.01, d = 5e-5, k = 2.5))
```

With five parameters and only 80 observations, the optimizer recovers the
general shape but not every parameter exactly. In particular, $d$ and $k$ trade
off: a smaller exponent with a larger coefficient can produce a similar wear-out
curve. The Q-Q plot below is the real test of fit quality.

### Residual diagnostics

```{r qq-plot, fig.alt="Cox-Snell residuals Q-Q plot for the bathtub model."}
fitted_bathtub <- dfr_dist(
  rate = function(t, par, ...) {
    par[[1]] * exp(-par[[2]] * t) + par[[3]] + par[[4]] * t^par[[5]]
  },
  par = coef(result)
)
qqplot_residuals(fitted_bathtub, df)
```

Points near the diagonal indicate the model captures the failure pattern well.

## Built-in Distributions

For common parametric families, the package provides optimized constructors
with analytical cumulative hazard and score functions:

| Constructor | Hazard Shape | Parameters | Use Case |
|-------------|-------------|------------|----------|
| `dfr_exponential(lambda)` | Constant | failure rate | Random failures, useful life |
| `dfr_weibull(shape, scale)` | Power-law | shape $k$, scale $\sigma$ | Wear-out, infant mortality |
| `dfr_gompertz(a, b)` | Exponential growth | initial rate, growth rate | Biological aging |
| `dfr_loglogistic(alpha, beta)` | Non-monotonic (hump) | scale, shape | Initial risk that diminishes |

Use these when a standard family fits your problem — they include analytical
derivatives for faster, more accurate MLE.

## Key Functions

| Function | Purpose |
|----------|---------|
| `hazard()` | Get hazard (failure rate) function |
| `surv()` | Get survival function S(t) |
| `density()` | Get PDF f(t) |
| `cdf()` | Get CDF F(t) |
| `inv_cdf()` | Get quantile function |
| `sampler()` | Generate random samples |
| `fit()` | Maximum likelihood estimation |
| `loglik()` | Get log-likelihood function |
| `residuals()` | Model diagnostics (Cox-Snell, Martingale) |
| `plot()` | Visualize distribution functions |
| `qqplot_residuals()` | Q-Q plot for goodness-of-fit |

## Where to Go Next

1. **`vignette("reliability_engineering")`** — Five real-world case studies
2. **`vignette("failure_rate")`** — Mathematical foundations of hazard-based modeling
3. **`vignette("custom_distributions")`** — The three-level optimization paradigm
4. **`vignette("custom_derivatives")`** — Analytical score and Hessian functions
