---
title: "Introduction to nabla"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to nabla}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```{r setup}
library(nabla)
```

## What is nabla?

`nabla` differentiates any R function exactly — no symbolic algebra, no
finite-difference grid, no loss of precision. Pass a function and a point
to `D()`, and you get the exact derivative:

```{r}
f <- function(x) x[1]^2 + sin(x[1])
D(f, pi / 4)   # exact first derivative at pi/4
```

Functions passed to `D()` use `x[1]`, `x[2]`, ... indexing — the same
convention as R's `optim()`. For multi-parameter functions, `gradient()`,
`hessian()`, and `jacobian()` provide the standard objects of calculus:

```{r}
g <- function(x) x[1]^2 * exp(x[2])
gradient(g, c(2, 0))   # c(4, 4)
hessian(g, c(2, 0))    # 2x2 Hessian matrix
```

## Quick start

The `D` operator is the core of `nabla`. Applied once, it gives the first
derivative. Applied twice (via `order = 2`), it gives the second derivative:

```{r}
f <- function(x) x[1]^2 + sin(x[1])

# First derivative: f'(x) = 2x + cos(x)
D(f, pi / 4)

# Verify against the analytical formula
2 * (pi / 4) + cos(pi / 4)

# Second derivative: f''(x) = 2 - sin(x)
D(f, pi / 4, order = 2)
```

`gradient()` is equivalent to `D(f, x)` for scalar-valued functions:

```{r}
gradient(f, pi / 4)
```

The function and its AD-computed derivative over $[0, 2\pi]$, with the
evaluation point $x = \pi/4$ marked:

```{r fig-function-derivative, fig.width=6, fig.height=4}
f <- function(x) x[1]^2 + sin(x[1])
xs <- seq(0, 2 * pi, length.out = 200)

# Compute f(x) and f'(x) at each grid point using D()
fx <- sapply(xs, function(xi) f(xi))
fpx <- sapply(xs, function(xi) D(f, xi))

# Mark the evaluation point x = pi/4
x_mark <- pi / 4

oldpar <- par(mar = c(4, 4, 2, 1))
plot(xs, fx, type = "l", col = "steelblue", lwd = 2,
     xlab = "x", ylab = "y",
     main = expression(f(x) == x^2 + sin(x) ~ "and its derivative"),
     ylim = range(c(fx, fpx)))
lines(xs, fpx, col = "firebrick", lwd = 2, lty = 2)
points(x_mark, f(x_mark), pch = 19, col = "steelblue", cex = 1.3)
points(x_mark, D(f, x_mark), pch = 19, col = "firebrick", cex = 1.3)
abline(v = x_mark, col = "grey60", lty = 3)
legend("topleft", legend = c("f(x)", "f'(x) via AD"),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n")
par(oldpar)
```

## Works through any R code

`D()` handles control flow, loops, and higher-order functions transparently.
No special annotation is needed — just write ordinary R:

```{r}
# if/else branching
safe_log <- function(x) {
  if (x[1] > 0) log(x[1]) else -Inf
}
D(safe_log, 2)    # 1/2 = 0.5
D(safe_log, -1)   # 0 (constant branch)

# for loop
poly <- function(x) {
  result <- 0
  for (i in 1:5) result <- result + x[1]^i
  result
}
D(poly, 2)   # 1 + 2*2 + 3*4 + 4*8 + 5*16 = 129

# Reduce
f_reduce <- function(x) Reduce("+", lapply(1:4, function(i) x[1]^i))
D(f_reduce, 2)   # 1 + 2*2 + 3*4 + 4*8 = 49
```

More complex compositions also propagate correctly:

```{r}
g <- function(x) exp(-x[1]^2 / 2) / sqrt(2 * pi)  # standard normal PDF

# D(g, 1) should equal -x * dnorm(x) at x = 1
D(g, 1)
-1 * dnorm(1)
D(g, 1) - (-1 * dnorm(1))  # ~0
```

## Multi-parameter functions

For functions of several variables, `gradient()` returns the gradient vector,
`hessian()` returns the Hessian matrix, and `jacobian()` handles
vector-valued functions:

```{r}
# Gradient of a 2-parameter function
f2 <- function(x) x[1]^2 * x[2]
gradient(f2, c(3, 4))  # c(2*3*4, 3^2) = c(24, 9)

# Hessian
hessian(f2, c(3, 4))

# Jacobian of a vector-valued function
f_vec <- function(x) list(x[1] * x[2], x[1]^2 + x[2])
jacobian(f_vec, c(3, 4))
```

The `D` operator composes for higher-order derivatives. `D(f, x, order = 2)`
returns the Hessian; `D(f, x, order = 3)` returns a third-order tensor:

```{r}
f2 <- function(x) x[1]^2 * x[2]
D(f2, c(3, 4), order = 2)           # Hessian
D(f2, c(3, 4), order = 3)           # 2x2x2 third-order tensor
```

## Quick comparison: three ways to differentiate

Three methods for computing the derivative of $f(x) = x^3 \sin(x)$
at $x = 2$:

```{r}
f <- function(x) x[1]^3 * sin(x[1])
f_num <- function(x) x^3 * sin(x)  # plain numeric version

x0 <- 2

# 1. Analytical: f'(x) = 3x^2 sin(x) + x^3 cos(x)
analytical <- 3 * x0^2 * sin(x0) + x0^3 * cos(x0)

# 2. Finite differences
h <- 1e-8
finite_diff <- (f_num(x0 + h) - f_num(x0 - h)) / (2 * h)

# 3. Automatic differentiation
ad_result <- D(f, x0)

# Compare
data.frame(
  method = c("Analytical", "Finite Diff", "AD (nabla)"),
  derivative = c(analytical, finite_diff, ad_result),
  error_vs_analytical = c(0, finite_diff - analytical, ad_result - analytical)
)
```

The AD result matches the analytical derivative to machine precision (zero
error), because dual-number arithmetic propagates exact derivative rules
through every operation. Finite differences, by contrast, balance two error
sources: central differences carry truncation error of order $O(h^2)$, while
floating-point cancellation contributes error of roughly
$\epsilon_{\mathrm{mach}} / h$. With $h = 10^{-8}$ the cancellation term
dominates, which is why the error here is about $10^{-7}$; even at the
optimal step size of roughly $\epsilon_{\mathrm{mach}}^{1/3} \approx
6 \times 10^{-6}$, the best achievable error remains many orders of magnitude
above machine precision. A bar chart makes this difference vivid:

```{r fig-error-comparison, fig.width=5, fig.height=3.5}
errors <- abs(c(finite_diff - analytical, ad_result - analytical))
# Use log10 scale; clamp AD error to .Machine$double.eps if exactly zero
errors[errors == 0] <- .Machine$double.eps

oldpar <- par(mar = c(4, 5, 2, 1))
bp <- barplot(log10(errors),
              names.arg = c("Finite Diff", "AD (nabla)"),
              col = c("coral", "steelblue"),
              ylab = expression(log[10] ~ "|error|"),
              main = "Absolute error vs analytical derivative",
              ylim = c(-16, 0), border = NA)
abline(h = log10(.Machine$double.eps), lty = 2, col = "grey40")
text(mean(bp), log10(.Machine$double.eps) + 0.8,
     "machine epsilon", cex = 0.8, col = "grey40")
par(oldpar)
```
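
The step size $h$ is itself a tuning burden that AD removes: shrink $h$ and
cancellation error grows, enlarge it and truncation error grows. A quick
sweep (reusing `f_num`, `x0`, and `analytical` from the chunk above) shows
how the finite-difference error moves with $h$, while the AD result involves
no such choice:

```{r}
# Finite-difference error as a function of step size h (the AD error stays 0)
hs <- 10^(-(1:12))
fd_err <- sapply(hs, function(h) {
  abs((f_num(x0 + h) - f_num(x0 - h)) / (2 * h) - analytical)
})
data.frame(h = hs, abs_error = signif(fd_err, 3))
```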

This advantage becomes critical for optimization algorithms that depend on
accurate gradients.

## Why exact derivatives matter

Any optimizer can find the MLE — the point estimate is the easy part. What
`nabla` gives you is the **exact Hessian** at the MLE, and that's where the
real statistical payoff lies.

The observed information matrix $J(\hat\theta) = -H(\hat\theta)$ is the
negative Hessian of the log-likelihood at the MLE. Its inverse gives the
asymptotic variance-covariance matrix:

$$\text{Var}(\hat\theta) \approx J(\hat\theta)^{-1}$$

Inaccurate Hessians mean inaccurate standard errors and confidence intervals.
Finite-difference Hessians carry the truncation and round-off errors discussed
above, and that noise propagates directly into your CIs. With `nabla`, the
Hessian is exact to machine precision, so your standard errors and confidence
intervals are as accurate as the asymptotic theory allows.
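
As a minimal sketch of this workflow, consider a Poisson model with simulated
data. The log-likelihood below drops the `lgamma(y + 1)` constant and is
written with scalar operations so dual numbers propagate through it; it
assumes `hessian()` accepts a one-parameter function at a scalar point, just
as `D()` does above:

```{r}
set.seed(42)
y <- rpois(50, lambda = 3)

# Poisson log-likelihood in theta = lambda, up to an additive constant
loglik <- function(theta) sum(y) * log(theta[1]) - length(y) * theta[1]

lambda_hat <- mean(y)              # closed-form MLE
J <- -hessian(loglik, lambda_hat)  # observed information
se <- sqrt(diag(solve(J)))         # asymptotic standard error
c(lambda_hat = lambda_hat, se = se)
```

For the Poisson model this reproduces the textbook standard error
$\sqrt{\hat\lambda / n}$; the same pattern applies to likelihoods with no
closed-form information matrix.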

Higher-order derivatives push further: `D(f, x, order = 3)` yields the
third-order derivative tensor, which quantifies the **asymptotic skewness**
of the MLE's sampling distribution — a correction that standard Wald
intervals miss entirely.
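
Continuing the Poisson sketch, the third-order term is one call away; for
this log-likelihood it should equal $2 \sum_i y_i / \hat\lambda^3$:

```{r}
D(loglik, lambda_hat, order = 3)
2 * sum(y) / lambda_hat^3   # analytical third derivative, for comparison
```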

## How it works: dual numbers

Under the hood, `nabla` uses **dual numbers** — values of the form
$a + b\varepsilon$, where $\varepsilon^2 = 0$. Evaluating a function
$f$ on a dual number $x + \varepsilon$ yields:

$$f(x + \varepsilon) = f(x) + f'(x)\,\varepsilon$$

The "value" part gives $f(x)$ and the "epsilon" part gives $f'(x)$, computed
automatically by propagating $\varepsilon$ through every arithmetic operation.
This is **forward-mode automatic differentiation (AD)**.
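
For instance, squaring a dual number reproduces the power rule using nothing
but ordinary algebra and the rule $\varepsilon^2 = 0$:

$$(x + \varepsilon)^2 = x^2 + 2x\,\varepsilon + \varepsilon^2 = x^2 + 2x\,\varepsilon,$$

so the coefficient of $\varepsilon$ is exactly $2x$, the derivative of $x^2$.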

The low-level constructors expose dual numbers directly. In this style,
functions use bare `x` (not `x[1]`) because they receive a single `dualr`
object rather than a vector:

```{r}
# A dual variable: value = 3, derivative seed = 1
x <- dual_variable(3)
value(x)
deriv(x)

# A dual constant: value = 5, derivative seed = 0
k <- dual_constant(5)
value(k)
deriv(k)

# Explicit constructor
y <- dual(2, 1)
value(y)
deriv(y)
```

Setting `deriv = 1` means "I'm differentiating with respect to this variable."
Setting `deriv = 0` means "this is a constant."

All standard arithmetic operators propagate derivatives:

```{r}
x <- dual_variable(3)

# Addition: d/dx(x + 2) = 1
r_add <- x + 2
value(r_add)
deriv(r_add)

# Subtraction: d/dx(5 - x) = -1
r_sub <- 5 - x
value(r_sub)
deriv(r_sub)

# Multiplication: d/dx(x * 4) = 4
r_mul <- x * 4
value(r_mul)
deriv(r_mul)

# Division: d/dx(1/x) = -1/x^2 = -1/9
r_div <- 1 / x
value(r_div)
deriv(r_div)

# Power: d/dx(x^3) = 3*x^2 = 27
r_pow <- x^3
value(r_pow)
deriv(r_pow)
```

All standard mathematical functions are supported, with derivatives computed
via the chain rule:

```{r}
x <- dual_variable(1)

# exp: d/dx exp(x) = exp(x)
r_exp <- exp(x)
value(r_exp)
deriv(r_exp)

# log: d/dx log(x) = 1/x
r_log <- log(x)
value(r_log)
deriv(r_log)

# sin: d/dx sin(x) = cos(x)
x2 <- dual_variable(pi / 4)
r_sin <- sin(x2)
value(r_sin)
deriv(r_sin)  # cos(pi/4)

# sqrt: d/dx sqrt(x) = 1/(2*sqrt(x))
x3 <- dual_variable(4)
r_sqrt <- sqrt(x3)
value(r_sqrt)
deriv(r_sqrt)  # 1/(2*2) = 0.25

# Gamma-related
x4 <- dual_variable(3)
r_lgamma <- lgamma(x4)
value(r_lgamma)     # log(2!) = log(2)
deriv(r_lgamma)     # digamma(3)
```

Dual numbers integrate with standard R patterns:

```{r}
# sum() works
a <- dual_variable(2)
b <- dual_constant(3)
total <- sum(a, b, dual_constant(1))
value(total)  # 6
deriv(total)  # 1 (only a has deriv = 1)

# prod() works
p <- prod(a, dual_constant(3))
value(p)  # 6
deriv(p)  # 3 (product rule: 3*1 + 2*0)

# c() creates a dual_vector
v <- c(a, b)
length(v)

# is.numeric returns TRUE (for compatibility)
is.numeric(dual_variable(1))
```

In most cases you won't need these constructors directly — `D()`,
`gradient()`, `hessian()`, and `jacobian()` handle dual number creation
internally. But understanding the mechanism helps when debugging or
building custom AD workflows.
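
One such workflow: seed the derivative parts with a direction vector to get a
directional derivative in a single forward pass. The sketch below uses only
the constructors shown above; the direction $v = (1, 2)$ is arbitrary:

```{r}
# Directional derivative of g(x, y) = x^2 * exp(y) at (2, 0) along v = (1, 2):
# seed each variable's derivative part with the matching component of v.
x <- dual(2, 1)   # value 2, seed 1
y <- dual(0, 2)   # value 0, seed 2
r <- x^2 * exp(y)
value(r)   # g(2, 0) = 4
deriv(r)   # grad(g) . v = 4 * 1 + 4 * 2 = 12
```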

## What's next?

This vignette introduced `D()` and showed that AD computes exact derivatives
where finite differences can only approximate them. For practical applications:

- **`vignette("mle-workflow")`** applies `gradient()`, `hessian()`, and the
  observed information matrix to five increasingly complex MLE problems, from
  Normal to logistic regression, and builds a Newton-Raphson optimizer
  powered entirely by AD.
- **`vignette("higher-order")`** demonstrates `D(f, x, order = 2)` for
  second derivatives, with applications to curvature analysis, Taylor
  expansion, and exact Hessians for MLE standard errors.
- **`vignette("optimizer-integration")`** shows how to plug `nabla`
  gradients into R's built-in optimizers (`optim`, `nlminb`) for
  production-quality MLE fitting.
- **`vignette("mle-skewness")`** uses `D(f, x, order = 3)` to compute
  the asymptotic skewness of MLEs — a third-order analysis that goes
  beyond standard Wald confidence intervals.
