---
title: "Higher-Order Derivatives"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Higher-Order Derivatives}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```{r setup}
library(nabla)
```

## The `D` operator

The `D` operator composes to arbitrary order. `D(f, x)` gives the first
derivative; `D(f, x, order = 2)` gives the second; `D(f, x, order = 3)`
gives the third; and so on:

```{r}
f <- function(x) x[1]^3

D(f, 2)                # f'(2) = 3*4 = 12
D(f, 2, order = 2)     # f''(2) = 6*2 = 12
D(f, 2, order = 3)     # f'''(2) = 6
```

For multi-parameter functions, each application of `D` appends one dimension
to the output. A scalar function $f: \mathbb{R}^n \to \mathbb{R}$ produces
a gradient vector ($n$), then a Hessian matrix ($n \times n$), then a
third-order tensor ($n \times n \times n$):

```{r}
# Gradient of a 2-parameter function
f <- function(x) x[1]^2 * x[2]
D(f, c(3, 4))  # gradient: c(24, 9)

# Hessian via D^2
D(f, c(3, 4), order = 2)

# Composition works identically
Df <- D(f)
DDf <- D(Df)
DDf(c(3, 4))
```
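
The third order follows the same pattern. For this $f$, the only nonzero
third partial is $\partial^3 f / \partial x_1^2 \partial x_2 = 2$, so the
$2 \times 2 \times 2$ array should hold 2 in the three slots that permute
$(1, 1, 2)$ and zeros elsewhere:

```{r}
D(f, c(3, 4), order = 3)  # 2x2x2 array of third partials
```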

For vector-valued functions, `D` produces the Jacobian:

```{r}
g <- function(x) list(x[1] * x[2], x[1]^2 + x[2])
D(g, c(3, 4))  # 2x2 Jacobian: [[4, 3], [6, 1]]
```

The convenience functions `gradient()`, `hessian()`, and `jacobian()` are
thin wrappers around `D`:

```{r}
gradient(f, c(3, 4))           # == D(f, c(3, 4))
hessian(f, c(3, 4))            # == D(f, c(3, 4), order = 2)
jacobian(g, c(3, 4))           # == D(g, c(3, 4))
```

## Curvature analysis

The curvature of a curve $y = f(x)$ is:

$$\kappa = \frac{|f''(x)|}{(1 + f'(x)^2)^{3/2}}$$

With `D()`, we can compute this directly:

```{r}
curvature <- function(f, x) {
  d1 <- D(f, x)
  d2 <- D(f, x, order = 2)
  abs(d2) / (1 + d1^2)^(3/2)
}

# Curvature of sin(x) at various points
# Wrap sin for D()'s x[1] convention
sin_f <- function(x) sin(x[1])
xs <- seq(0, 2 * pi, length.out = 9)  # includes pi/2 and 3*pi/2
kappas <- sapply(xs, function(x) curvature(sin_f, x))
data.frame(x = round(xs, 3), curvature = round(kappas, 6))
```

The table reveals the geometry of $\sin(x)$: curvature is zero at
inflection points (integer multiples of $\pi$, where $\sin''(x) = 0$) and
peaks at $\pi/2$ and $3\pi/2$ where $\sin$ reaches its extrema. The
maximum curvature of 1.0 reflects the unit amplitude — for $A\sin(x)$ the
maximum curvature would be $A$.
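
A quick check of that amplitude claim, reusing the `curvature` helper
defined above (`sin2_f` is just a local wrapper for $2\sin(x)$):

```{r}
sin2_f <- function(x) 2 * sin(x[1])
curvature(sin2_f, pi / 2)  # amplitude 2 => curvature 2
```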

At $x = \pi/2$, $\sin$ has maximum curvature because $|\sin''(\pi/2)| = 1$
and $\sin'(\pi/2) = 0$:

```{r}
curvature(sin_f, pi / 2)  # should be 1.0
```

### Visualizing curvature along sin(x)

Curvature peaks where the second derivative is largest in magnitude and the
first derivative is small. For $\sin(x)$, this occurs at $\pi/2$ and $3\pi/2$:

```{r fig-sin-curvature, fig.width=6, fig.height=4}
xs_curve <- seq(0, 2 * pi, length.out = 200)
sin_vals <- sin(xs_curve)
kappa_vals <- sapply(xs_curve, function(x) curvature(sin_f, x))

oldpar <- par(mar = c(4, 4.5, 2, 4.5))
plot(xs_curve, sin_vals, type = "l", col = "steelblue", lwd = 2,
     xlab = "x", ylab = "sin(x)",
     main = "sin(x) and its curvature")
par(new = TRUE)
plot(xs_curve, kappa_vals, type = "l", col = "firebrick", lwd = 2, lty = 2,
     axes = FALSE, xlab = "", ylab = "")
axis(4, col.axis = "firebrick")
mtext(expression(kappa(x)), side = 4, line = 2.5, col = "firebrick")
abline(v = c(pi / 2, 3 * pi / 2), col = "grey60", lty = 3)
legend("bottom",
       legend = c("sin(x)", expression(kappa(x))),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n", horiz = TRUE)
par(oldpar)
```

The curvature reaches its maximum of 1.0 at $x = \pi/2$ (where $\sin'(x) = 0$
and $|\sin''(x)| = 1$) and returns to zero at integer multiples of $\pi$
(where $\sin''(x) = 0$).

## Taylor expansion

A second-order Taylor approximation of $f$ around $x_0$:

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2$$

`D()` computes this directly:

```{r}
taylor2 <- function(f, x0, x) {
  f0 <- f(x0)
  f1 <- D(f, x0)
  f2 <- D(f, x0, order = 2)
  f0 + f1 * (x - x0) + 0.5 * f2 * (x - x0)^2
}

# Approximate exp(x) near x = 0
f_exp <- function(x) exp(x[1])
xs <- c(-0.1, -0.01, 0, 0.01, 0.1)
data.frame(
  x = xs,
  exact = exp(xs),
  taylor2 = sapply(xs, function(x) taylor2(f_exp, 0, x)),
  error = exp(xs) - sapply(xs, function(x) taylor2(f_exp, 0, x))
)
```

Near the expansion point, the approximation is accurate. The error grows
as $O(|x - x_0|^3)$ because the Taylor remainder is
$\frac{f'''(\xi)}{6}(x - x_0)^3$. For $\exp(x)$ around $x_0 = 0$,
$f'''(\xi) = \exp(\xi) \approx 1$, so the error at $x = 0.1$ is
approximately $0.1^3/6 \approx 1.7 \times 10^{-4}$, and at $x = 0.01$ it
drops to $\approx 1.7 \times 10^{-7}$: a factor-of-1000 reduction for a
factor-of-10 decrease in distance, confirming cubic convergence.
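
We can check that scaling numerically with the `taylor2` helper defined
above (higher-order remainder terms shift the ratio slightly, so expect
roughly, not exactly, 1000):

```{r}
err <- function(x) exp(x) - taylor2(f_exp, 0, x)
c(err_0.1 = err(0.1), err_0.01 = err(0.01), ratio = err(0.1) / err(0.01))
```

Plotting the approximation over a wider range shows where it breaks down: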

```{r fig-taylor-vs-exact, fig.width=6, fig.height=4}
xs_plot <- seq(-2, 3, length.out = 300)
exact_vals <- exp(xs_plot)
taylor_vals <- sapply(xs_plot, function(x) taylor2(f_exp, 0, x))

oldpar <- par(mar = c(4, 4, 2, 1))
plot(xs_plot, exact_vals, type = "l", col = "steelblue", lwd = 2,
     xlab = "x", ylab = "y",
     main = expression("exp(x) vs 2nd-order Taylor at " * x[0] == 0),
     ylim = c(-1, 12))
lines(xs_plot, taylor_vals, col = "firebrick", lwd = 2, lty = 2)
abline(v = 0, col = "grey60", lty = 3)
points(0, 1, pch = 19, col = "grey40", cex = 1.2)
legend("topleft",
       legend = c("exp(x)", expression("Taylor " * T[2](x))),
       col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2,
       bty = "n")
par(oldpar)
```

The Taylor polynomial matches `exp(x)` closely near $x_0 = 0$ but diverges
at larger distances, illustrating the local nature of polynomial approximation.

## Connection to MLE: exact Hessians for standard errors

`hessian()` computes the exact second-derivative matrix at any point. For
maximum likelihood estimation, this gives the **observed information matrix**
$J(\hat\theta) = -H(\hat\theta)$, whose inverse is the asymptotic
variance-covariance matrix. Because the Hessian is exact rather than a
finite-difference approximation, the resulting standard errors and
confidence intervals carry no differentiation error, which is a primary
reason to use AD in statistical inference.

Consider a Poisson log-likelihood for $\lambda$:

```{r}
set.seed(123)
data_pois <- rpois(50, lambda = 3)
n <- length(data_pois)
sum_x <- sum(data_pois)
sum_lfact <- sum(lfactorial(data_pois))

ll_poisson <- function(theta) {
  lambda <- theta[1]
  sum_x * log(lambda) - n * lambda - sum_lfact
}

lambda0 <- 2.5
```

**Primary approach**: Using `hessian()`:

```{r}
hess_helper <- hessian(ll_poisson, lambda0)
hess_helper
```

The observed information is simply `-hess_helper`, and the standard error
is `1 / sqrt(-hess_helper[1, 1])`:

```{r}
se <- 1 / sqrt(-hess_helper[1, 1])
cat("SE(lambda) at lambda =", lambda0, ":", se, "\n")
```
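
In practice the information is evaluated at the MLE itself, which for this
Poisson model is the sample mean. A quick sketch with the same helper
(`lambda_hat` and `se_hat` are local names; $\sqrt{\hat\lambda/n}$ is the
known closed-form standard error, included for comparison):

```{r}
lambda_hat <- mean(data_pois)
se_hat <- 1 / sqrt(-hessian(ll_poisson, lambda_hat)[1, 1])
c(lambda_hat = lambda_hat, se = se_hat, se_closed_form = sqrt(lambda_hat / n))
```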

**Verification**: Manual construction of a second-order dual number — the
same arithmetic that `hessian()` performs internally:

```{r}
# Build a dual_variable_n wrapped in a dual_vector
manual_theta <- dual_vector(list(dual_variable_n(lambda0, 2)))
result_manual <- ll_poisson(manual_theta)
manual_hess <- deriv(deriv(result_manual))
manual_hess
```

Both approaches yield the same result:

```{r}
hess_helper[1, 1] - manual_hess  # ~0
```

## How it works: nested duals

Under the hood, `D(f, x, order = 2)` uses **nested duals** — a dual number
whose components are themselves dual numbers. The structure
`dual(dual(x, 1), dual(1, 0))` simultaneously tracks:

- $f(x)$: the value of the value component
- $f'(x)$: the derivative of the value component (equivalently, the value of the derivative component)
- $f''(x)$: the derivative of the derivative component
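
To see this concretely, here is a minimal standalone sketch (deliberately
independent of `nabla`'s actual classes): a dual is a plain list with a
value `v` and a derivative `d`, and the sum and product rules recurse, so
nesting one dual inside another propagates second derivatives for free:

```{r}
# Illustration only -- not nabla's internal representation.
mk <- function(v, d) list(v = v, d = d)

add <- function(a, b) {
  if (is.numeric(a) && is.numeric(b)) return(a + b)
  if (is.numeric(a)) a <- mk(a, 0)   # promote constants to duals
  if (is.numeric(b)) b <- mk(b, 0)
  mk(add(a$v, b$v), add(a$d, b$d))
}

mul <- function(a, b) {
  if (is.numeric(a) && is.numeric(b)) return(a * b)
  if (is.numeric(a)) a <- mk(a, 0)
  if (is.numeric(b)) b <- mk(b, 0)
  # product rule: (ab)' = a'b + ab'
  mk(mul(a$v, b$v), add(mul(a$d, b$v), mul(a$v, b$d)))
}

x <- mk(mk(2, 1), mk(1, 0))  # nested dual seeded at x = 2
r <- mul(x, mul(x, x))       # evaluate x^3

c(f = r$v$v, f1 = r$v$d, f2 = r$d$d)  # 8, 12, 12
```

The three extracted values (8, 12, 12) match the `x^3` example below,
computed with `nabla`'s own nested duals.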

Nested duals can be constructed directly with `dual_variable_n()`:

```{r}
x <- dual_variable_n(2, order = 2)

# Evaluate x^3
result <- x^3

# Extract all three quantities
deriv_n(result, 0)  # f(2) = 8
deriv_n(result, 1)  # f'(2) = 3*4 = 12
deriv_n(result, 2)  # f''(2) = 6*2 = 12
```

For constants in higher-order computations, use `dual_constant_n()`:

```{r}
k <- dual_constant_n(5, order = 2)
deriv_n(k, 0)  # 5
deriv_n(k, 1)  # 0
deriv_n(k, 2)  # 0
```

The `differentiate_n()` helper wraps the construction and extraction:

```{r}
# Differentiate sin(x) at x = pi/4
result <- differentiate_n(sin, pi / 4, order = 2)
result$value   # sin(pi/4)
result$d1      # cos(pi/4) = f'
result$d2      # -sin(pi/4) = f''
```

Verify against known values:

```{r}
result$value - sin(pi / 4)     # ~0
result$d1 - cos(pi / 4)        # ~0
result$d2 - (-sin(pi / 4))     # ~0
```

More complex functions work too:

```{r}
f <- function(x) x * exp(-x^2)
d2 <- differentiate_n(f, 1, order = 2)

# Analytical: f'(x) = exp(-x^2)(1 - 2x^2)
# f''(x) = exp(-x^2)(-6x + 4x^3)
analytical_f1 <- exp(-1) * (1 - 2)
analytical_f2 <- exp(-1) * (-6 + 4)
d2$d1 - analytical_f1   # ~0
d2$d2 - analytical_f2   # ~0
```

These low-level functions are useful for understanding how `nabla` works
internally, but for most applications, `D()`, `gradient()`, and `hessian()`
are the recommended interface.

## Summary

The `D` operator provides exact higher-order derivatives through a single
composable interface — no symbolic differentiation, no finite-difference
grid, and no loss of precision:

- **Exact curvature and Taylor coefficients.** `D(f, x, order = 2)` gives
  the exact second derivative needed for curvature analysis, Taylor
  expansion, and Newton-Raphson optimization (see the sketch after this
  list).
- **Exact standard errors for MLE.** The Hessian from `hessian()` directly
  yields the observed information matrix. Its inverse gives the asymptotic
  variance-covariance matrix — the foundation of confidence intervals and
  hypothesis tests in likelihood-based inference.
- **Arbitrary order.** `D(f, x, order = 3)` yields third-order tensors for
  asymptotic skewness analysis (see `vignette("mle-skewness")`); higher
  orders work the same way.
- **Nested duals under the hood.** Internally, `D` constructs nested dual
  numbers at each order level. The same arithmetic that propagates first
  derivatives recursively propagates second, third, and higher derivatives.
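
As a closing illustration of the first point, a minimal Newton-Raphson
iteration built from `D()` alone. This is a sketch, not a `nabla` helper:
`newton_min` is a name invented here, and it assumes (as in the curvature
example) that `D` returns plain scalars for a one-parameter function:

```{r}
newton_min <- function(f, x, steps = 8) {
  for (i in seq_len(steps)) {
    x <- x - D(f, x) / D(f, x, order = 2)  # Newton step: x - f'(x) / f''(x)
  }
  x
}

# Minimize f(x) = exp(x) - 2x; the exact minimizer is log(2)
f_obj <- function(x) exp(x[1]) - 2 * x[1]
c(newton = newton_min(f_obj, 0), exact = log(2))
```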
