---
title: "Getting Started with compositional.mle"
author: "Alexander Towell"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Getting Started with compositional.mle}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
```

## Introduction

The `compositional.mle` package provides **composable** optimization strategies for maximum likelihood estimation (MLE). Following SICP principles, it offers:

1. **Primitive solvers** - `gradient_ascent()`, `newton_raphson()`, `bfgs()`, `nelder_mead()`, etc.
2. **Composition operators** - `%>>%` (sequential), `%|%` (race), `with_restarts()`
3. **Closure property** - Combining solvers yields a solver

## Installation

```{r install, eval=FALSE}
devtools::install_github("queelius/compositional.mle")
```

```{r load}
library(compositional.mle)
```

## Quick Start: Normal Distribution MLE

```{r quickstart}
# Generate sample data
set.seed(123)
data <- rnorm(100, mean = 5, sd = 2)

# Define the problem (separate from solver strategy)
problem <- mle_problem(
  loglike = function(theta) {
    if (theta[2] <= 0) return(-Inf)
    sum(dnorm(data, theta[1], theta[2], log = TRUE))
  },
  score = function(theta) {
    mu <- theta[1]; sigma <- theta[2]; n <- length(data)
    c(sum(data - mu) / sigma^2,
      -n / sigma + sum((data - mu)^2) / sigma^3)
  },
  constraint = mle_constraint(
    support = function(theta) theta[2] > 0,
    project = function(theta) c(theta[1], max(theta[2], 1e-6))
  ),
  theta_names = c("mu", "sigma")
)

# Solve with gradient ascent
result <- gradient_ascent()(problem, theta0 = c(0, 1))

cat("Estimated mean:", result$theta.hat[1], "(true: 5)\n")
cat("Estimated sd:", result$theta.hat[2], "(true: 2)\n")
```

## The Problem-Solver Separation

The key design principle is separating **what** you're estimating from **how** you estimate it:

```{r problem-solver}
# The problem encapsulates the statistical model
print(problem)

# Solvers are independent strategies
solver1 <- gradient_ascent(max_iter = 100)
solver2 <- newton_raphson(max_iter = 50)
solver3 <- bfgs()

# Same problem, different solvers
result1 <- solver1(problem, c(0, 1))
result2 <- solver2(problem, c(0, 1))
result3 <- solver3(problem, c(0, 1))

cat("Gradient ascent:", result1$theta.hat, "\n")
cat("Newton-Raphson:", result2$theta.hat, "\n")
cat("BFGS:", result3$theta.hat, "\n")
```

## Composing Solvers

### Sequential Chaining (`%>>%`)

Chain solvers for coarse-to-fine optimization:

```{r sequential}
# Grid search finds a good region, then gradient ascent refines
strategy <- grid_search(lower = c(-10, 0.5), upper = c(10, 5), n = 5) %>>%
  gradient_ascent(max_iter = 50)

result <- strategy(problem, theta0 = c(0, 1))
cat("Result:", result$theta.hat, "\n")
```

### Three-Stage Refinement

```{r three-stage}
# Coarse grid -> gradient ascent -> Newton-Raphson polish
strategy <- grid_search(lower = c(-10, 0.5), upper = c(10, 5), n = 5) %>>%
  gradient_ascent(max_iter = 30) %>>%
  newton_raphson(max_iter = 10)

result <- strategy(problem, theta0 = c(0, 1))
cat("Result:", result$theta.hat, "\n")
cat("Converged:", result$converged, "\n")
```

### Parallel Racing (`%|%`)

Race multiple methods and keep the best result:

```{r race}
# Try gradient-based and derivative-free methods
strategy <- gradient_ascent() %|% bfgs() %|% nelder_mead()

result <- strategy(problem, theta0 = c(0, 1))
cat("Winner:", result$solver_info$name, "\n")
cat("Result:", result$theta.hat, "\n")
```

### Random Restarts

Escape local optima with multiple starting points:

```{r restarts}
strategy <- with_restarts(
  gradient_ascent(),
  n = 10,
  sampler = uniform_sampler(c(-10, 0.5), c(10, 5))
)

result <- strategy(problem, theta0 = c(0, 1))
cat("Best restart:", result$solver_info$composition$best_restart, "of", result$solver_info$composition$n_restarts, "\n")
cat("Result:", result$theta.hat, "\n")
```

### Conditional Refinement

Only refine if the first solver didn't converge:

```{r conditional}
strategy <- unless_converged(
  gradient_ascent(max_iter = 10),  # Quick attempt

  newton_raphson(max_iter = 50)     # Refine if needed
)

result <- strategy(problem, theta0 = c(0, 1))
cat("Result:", result$theta.hat, "\n")
```

## Available Solvers

| Factory | Method | Requires |
|---------|--------|----------|
| `gradient_ascent()` | Steepest ascent with line search | score (or numerical) |
| `newton_raphson()` | Second-order Newton | score, fisher (or numerical) |
| `bfgs()` | Quasi-Newton BFGS | score (or numerical) |
| `lbfgsb()` | L-BFGS-B with box constraints | score (or numerical) |
| `nelder_mead()` | Simplex (derivative-free) | - |
| `grid_search()` | Exhaustive grid | - |
| `random_search()` | Random sampling | - |

## Constraints

Define domain constraints with support checking and projection:
```{r constraints}
# Positive parameters
pos_constraint <- mle_constraint(
  support = function(theta) all(theta > 0),
  project = function(theta) pmax(theta, 1e-8)
)

# Box constraints [0, 10]
box_constraint <- mle_constraint(
  support = function(theta) all(theta >= 0 & theta <= 10),
  project = function(theta) pmax(0, pmin(10, theta))
)

# Use in problem definition
problem_constrained <- mle_problem(
  loglike = function(theta) -sum((theta - 5)^2),
  constraint = pos_constraint
)
```

## Function Transformers

### Stochastic Gradient (Mini-batching)

For large datasets, subsample observations:

```{r subsampling, eval=FALSE}
# Original log-likelihood uses all data
loglike_full <- function(theta, obs = large_data) {
  sum(dnorm(obs, theta[1], theta[2], log = TRUE))
}

# Stochastic version uses random subsets
loglike_sgd <- with_subsampling(loglike_full, data = large_data, subsample_size = 100)
```

### Regularization

Add penalty terms for regularization:

```{r penalties}
loglike <- function(theta) -sum(theta^2)

# L1 (Lasso), L2 (Ridge), Elastic Net
loglike_l1 <- with_penalty(loglike, penalty_l1(), lambda = 0.1)
loglike_l2 <- with_penalty(loglike, penalty_l2(), lambda = 0.1)
loglike_enet <- with_penalty(loglike, penalty_elastic_net(alpha = 0.5), lambda = 0.1)

theta <- c(1, 2, 3)
cat("Original:", loglike(theta), "\n")
cat("With L1:", loglike_l1(theta), "\n")
cat("With L2:", loglike_l2(theta), "\n")
```

## Tracing Optimization

Track the optimization path for diagnostics:

```{r trace}
trace_config <- mle_trace(values = TRUE, path = TRUE, gradients = TRUE)

result <- gradient_ascent(max_iter = 20)(
  problem,
  theta0 = c(0, 1),
  trace = trace_config
)

if (!is.null(result$solver_info$trace_data)) {
  cat("Iterations:", result$solver_info$trace_data$total_iterations, "\n")
  cat("Final log-likelihood:", tail(result$solver_info$trace_data$values, 1), "\n")
}
```

## API Summary

**Problem Specification:**
- `mle_problem()` - Define the estimation problem
- `mle_constraint()` - Domain constraints

**Solver Factories:**
- `gradient_ascent()`, `newton_raphson()`, `bfgs()`, `lbfgsb()`, `nelder_mead()`
- `grid_search()`, `random_search()`

**Composition:**
- `%>>%` - Sequential chaining
- `%|%` - Parallel racing
- `with_restarts()` - Multiple starting points
- `unless_converged()` - Conditional refinement
- `chain()` - Compose multiple solvers sequentially

**Samplers:**
- `uniform_sampler()`, `normal_sampler()`

**Transformers:**
- `with_subsampling()`, `with_penalty()`
- `penalty_l1()`, `penalty_l2()`, `penalty_elastic_net()`
