---
title: "Multivariate and Mixture Distributions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multivariate and Mixture Distributions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```{r library}
library(algebraic.dist)
```

This vignette covers the multivariate normal (MVN), affine transformations,
and mixture distribution facilities in `algebraic.dist`. These features were
introduced in version 0.3.0 and provide closed-form operations wherever
possible, falling back to Monte Carlo only when necessary.

## MVN Basics

Construct an MVN with `mvn()`. The `mu` argument is the mean vector and
`sigma` is the variance-covariance matrix (defaulting to the identity).

```{r mvn-construct}
M <- mvn(mu = c(1, 2, 3),
         sigma = matrix(c(1.0, 0.5, 0.2,
                          0.5, 2.0, 0.3,
                          0.2, 0.3, 1.5), nrow = 3, byrow = TRUE))
M
```

Standard accessors return the moments and dimensionality.

```{r mvn-accessors}
mean(M)
vcov(M)
dim(M)
```

The `params()` generic flattens all parameters into a single named vector
(useful for optimisation interfaces).

```{r mvn-params}
params(M)
```

### Sampling

`sampler()` returns a reusable sampling function. The output is an n-by-d
matrix where rows are observations and columns are dimensions.

```{r mvn-sampling}
set.seed(42)
samp <- sampler(M)
X <- samp(100)
dim(X)
head(X, 5)
```
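
With more draws, the sample moments should settle near the true moments. A
quick consistency check on the sampler output, using only base R:

```{r mvn-sampling-check}
X_big <- samp(10000)
colMeans(X_big)  # should be close to mean(M)
cov(X_big)       # should be close to vcov(M)
```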

### Marginals

Extracting a marginal over a subset of indices returns a new MVN (or a
univariate `normal` when a single index is selected).

```{r mvn-marginal}
M13 <- marginal(M, c(1, 3))
M13
mean(M13)
vcov(M13)
```

Selecting a single component yields a univariate normal.

```{r mvn-marginal-univariate}
M1 <- marginal(M, 1)
M1
mean(M1)
vcov(M1)
```

## Closed-Form MVN Conditioning

Conditioning a multivariate normal on observed values uses the Schur
complement formula. Given an MVN partitioned into free variables $X_1$
and observed variables $X_2 = a$, the conditional distribution is:

$$X_1 \mid X_2 = a \;\sim\; \mathcal{N}\!\bigl(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(a - \mu_2),\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\bigr)$$

This is exact -- no Monte Carlo is involved.

```{r mvn-conditional-closedform}
M2 <- mvn(mu = c(0, 0), sigma = matrix(c(1, 0.8, 0.8, 1), 2, 2))

# Condition on X2 = 1
M_cond <- conditional(M2, given_indices = 2, given_values = 1)
mean(M_cond)
vcov(M_cond)
```

With $\rho = 0.8$, the conditional mean is $\mu_1 + \rho(1 - \mu_2) = 0 + 0.8(1 - 0) = 0.8$,
and the conditional variance is $1 - \rho^2 = 1 - 0.64 = 0.36$.
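
The same numbers fall out of the Schur complement formula computed by hand
in base R (nothing below relies on the package):

```{r mvn-conditional-byhand}
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
a     <- 1  # observed value of X2

# with a single observed variable the Schur complement reduces to scalars
mu[1] + Sigma[1, 2] / Sigma[2, 2] * (a - mu[2])  # conditional mean
Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]        # conditional variance
```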

### Predicate-based MC fallback

When the conditioning event is not "observe variable $j$ at value $a$" but
an arbitrary predicate, the package falls back to Monte Carlo. This is
accessed through the `P` argument.

```{r mvn-conditional-mc}
set.seed(42)
M_cond_mc <- conditional(M2, P = function(x) abs(x[2] - 1) < 0.1)
mean(M_cond_mc)
```

The MC result is approximate and depends on the tolerance in the predicate;
with the window above, the estimated mean should land near the closed-form
value of $0.8$. Prefer the closed-form interface whenever possible.

## Affine Transformations

`affine_transform(x, A, b)` computes the distribution of $Y = AX + b$
where $X \sim \text{MVN}(\mu, \Sigma)$. The result is:

$$Y \sim \text{MVN}(A\mu + b,\;\; A \Sigma A^\top)$$

This is a closed-form operation that returns a new MVN (or `normal` if
the result is one-dimensional).

### Portfolio example

Consider two assets with expected returns and a covariance structure.
A portfolio is a linear combination of the returns.

```{r affine-portfolio}
returns <- mvn(
  mu = c(0.05, 0.08),
  sigma = matrix(c(0.04, 0.01,
                   0.01, 0.09), 2, 2)
)

# 60/40 portfolio weights
w <- matrix(c(0.6, 0.4), nrow = 1)
portfolio <- affine_transform(returns, A = w)
mean(portfolio)
vcov(portfolio)
```

The expected portfolio return is $0.6 \times 0.05 + 0.4 \times 0.08 = 0.062$.
The portfolio variance is $w \Sigma w^\top$, a scalar that reflects the
diversification benefit of combining the two assets.
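
Both quantities can be checked by hand with the weight vector defined above
(a base-R sketch, independent of `affine_transform()`):

```{r affine-portfolio-byhand}
mu_r    <- c(0.05, 0.08)
Sigma_r <- matrix(c(0.04, 0.01,
                    0.01, 0.09), 2, 2)

w %*% mu_r              # expected portfolio return, 0.062
w %*% Sigma_r %*% t(w)  # portfolio variance
```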

### Changing coordinates

Affine transforms need not reduce the dimension (as the portfolio example did);
they can also re-express the variables in a new coordinate system, for instance
mapping $(X_1, X_2)$ to the sum $X_1 + X_2$ and the difference $X_1 - X_2$.

```{r affine-sumdiff}
X <- mvn(mu = c(3, 7), sigma = matrix(c(4, 1, 1, 9), 2, 2))

# A maps (X1, X2) -> (X1 + X2, X1 - X2)
A <- matrix(c(1, 1,
              1, -1), nrow = 2, byrow = TRUE)
Y <- affine_transform(X, A = A)
mean(Y)
vcov(Y)
```
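
By hand, $\text{Var}(X_1 + X_2) = 4 + 9 + 2 \cdot 1 = 15$,
$\text{Var}(X_1 - X_2) = 4 + 9 - 2 \cdot 1 = 11$, and
$\text{Cov}(X_1 + X_2, X_1 - X_2) = \text{Var}(X_1) - \text{Var}(X_2) = 4 - 9 = -5$,
which is exactly the matrix $A \Sigma A^\top$ returned by `vcov(Y)`.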

## Mixture Distribution Basics

A mixture distribution is constructed from a list of component distributions
and a vector of non-negative weights that sum to one.

```{r mixture-construct}
m <- mixture(
  components = list(normal(-2, 1), normal(3, 0.5)),
  weights = c(0.4, 0.6)
)
m
```

### Mean and variance

The mixture mean is the weighted average of component means:
$E[X] = \sum_k w_k \mu_k$.

```{r mixture-mean}
mean(m)
```
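
By hand, $0.4 \times (-2) + 0.6 \times 3 = 1$.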

The mixture variance uses the law of total variance:
$\text{Var}(X) = \underbrace{\sum_k w_k \sigma_k^2}_{\text{within-component}} + \underbrace{\sum_k w_k(\mu_k - \bar\mu)^2}_{\text{between-component}}$.

```{r mixture-vcov}
vcov(m)
```

The between-component term captures the spread of the component means
around the overall mean. This is why mixtures can have much larger variance
than any individual component.
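
The decomposition can be sketched from the component moments (assuming
`mean()` and `vcov()` applied to a univariate `normal` return its mean and
variance, as the marginal example earlier suggests):

```{r mixture-vcov-byhand}
comps <- list(normal(-2, 1), normal(3, 0.5))
wts   <- c(0.4, 0.6)

mus    <- sapply(comps, mean)  # component means
vars   <- sapply(comps, vcov)  # component variances
mu_bar <- sum(wts * mus)

within  <- sum(wts * vars)
between <- sum(wts * (mus - mu_bar)^2)
within + between  # should agree with vcov(m)
```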

### Density and CDF

`density()` and `cdf()` return callable functions, following the same
pattern as other distributions in the package.

```{r mixture-density, fig.height = 4, fig.width = 6}
f <- density(m)

x_vals <- seq(-6, 8, length.out = 200)
y_vals <- sapply(x_vals, f)
plot(x_vals, y_vals, type = "l", lwd = 2,
     main = "Bimodal mixture density",
     xlab = "x", ylab = "f(x)")
```

The bimodal shape is clearly visible, with peaks near $-2$ and $3$.

```{r mixture-cdf, fig.height = 4, fig.width = 6}
Fm <- cdf(m)  # avoid masking the built-in F (FALSE) shorthand
F_vals <- sapply(x_vals, Fm)
plot(x_vals, F_vals, type = "l", lwd = 2,
     main = "Mixture CDF",
     xlab = "x", ylab = "F(x)")
```

### Sampling

```{r mixture-sampling, fig.height = 4, fig.width = 6}
set.seed(123)
s <- sampler(m)
samples <- s(2000)
hist(samples, breaks = 50, probability = TRUE, col = "lightblue",
     main = "Mixture samples with true density",
     xlab = "x")
lines(x_vals, y_vals, col = "red", lwd = 2)
```

## Mixture Operations

### Marginals of mixtures

The marginal of a mixture is a mixture of marginals with the same weights.
This follows directly from the definition:
$p(x_I) = \sum_k w_k\, p_k(x_I)$.

```{r mixture-marginal}
m2 <- mixture(
  list(mvn(c(0, 0), diag(2)), mvn(c(3, 3), diag(2))),
  c(0.5, 0.5)
)

m2_x1 <- marginal(m2, 1)
m2_x1
mean(m2_x1)
```

The marginal mean is $0.5 \times 0 + 0.5 \times 3 = 1.5$, the weighted
average of the component marginal means.
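
The identity can be spelled out by marginalising each component separately
and reusing the weights; the result should match `m2_x1` (a sketch relying
only on the `marginal()` and `mixture()` calls shown above):

```{r mixture-marginal-byhand}
m2_x1_manual <- mixture(
  components = list(marginal(mvn(c(0, 0), diag(2)), 1),
                    marginal(mvn(c(3, 3), diag(2)), 1)),
  weights = c(0.5, 0.5)
)
mean(m2_x1_manual)  # should agree with mean(m2_x1)
```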

### Conditional of mixtures (Bayesian weight update)

Conditioning a mixture on observed values updates the component weights
via Bayes' rule. Each component's weight is scaled by its marginal density
at the observed value:

$$w_k' \propto w_k \, f_k(x_{\text{obs}})$$

Components whose marginal density is higher at the observed value receive
more weight.

```{r mixture-conditional}
mc <- conditional(m2, given_indices = 2, given_values = 0)
mc
mean(mc)
```

Conditioning on $X_2 = 0$ shifts weight toward the first component
(whose mean for $X_2$ is 0) and away from the second component (whose
mean for $X_2$ is 3). The conditional mean of $X_1$ is therefore pulled
closer to 0 than the unconditional value of 1.5.
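
The weight update can be reproduced by hand: evaluate each component's $X_2$
marginal density at the observed value and renormalise. This sketch assumes
`density()` on a univariate marginal returns a callable, as it does for the
mixture itself:

```{r mixture-weights-byhand}
comps2 <- list(mvn(c(0, 0), diag(2)), mvn(c(3, 3), diag(2)))
w0     <- c(0.5, 0.5)

# marginal density of X2 for each component, evaluated at the observed value 0
lik    <- sapply(comps2, function(d) density(marginal(d, 2))(0))
w_post <- w0 * lik / sum(w0 * lik)
w_post  # updated weights, heavily favouring the first component
```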

We can also condition at a value near the second component to see the
opposite effect.

```{r mixture-conditional-high}
mc_high <- conditional(m2, given_indices = 2, given_values = 3)
mc_high
mean(mc_high)
```

Now the weight shifts toward the second component, and the conditional
mean of $X_1$ is pulled toward 3.

## Practical Example: Gaussian Mixture Classification

Gaussian mixture models (GMMs) are widely used for soft classification.
The idea: each component represents a class, and conditioning on observed
features updates the class probabilities.

### Setup: a two-class GMM

Consider two clusters in 2D, representing two classes with equal prior
probability.

```{r gmm-setup}
class_a <- mvn(c(-2, -1), matrix(c(1.0, 0.3, 0.3, 0.8), 2, 2))
class_b <- mvn(c(2, 1.5), matrix(c(0.6, -0.2, -0.2, 1.2), 2, 2))

gmm <- mixture(
  components = list(class_a, class_b),
  weights = c(0.5, 0.5)
)
gmm
```

### Visualise the clusters

```{r gmm-scatter, fig.height = 5, fig.width = 6}
set.seed(314)
samp_a <- sampler(class_a)(300)
samp_b <- sampler(class_b)(300)

plot(samp_a[, 1], samp_a[, 2], col = "steelblue", pch = 16, cex = 0.6,
     xlim = c(-6, 6), ylim = c(-5, 5),
     xlab = expression(X[1]), ylab = expression(X[2]),
     main = "Two-class GMM samples")
points(samp_b[, 1], samp_b[, 2], col = "tomato", pch = 16, cex = 0.6)
legend("topright", legend = c("Class A", "Class B"),
       col = c("steelblue", "tomato"), pch = 16)
```

### Soft classification by conditioning

Suppose we observe $X_1 = -1$. Conditioning updates the class weights
based on how likely each component is to produce that value.

```{r gmm-classify-neg}
post_neg <- conditional(gmm, given_indices = 1, given_values = -1)
post_neg
```

The updated weights give the posterior class probabilities. Since $X_1 = -1$
is much more likely under Class A (centred at $-2$) than Class B
(centred at $2$), Class A receives most of the weight.

The conditional distribution of $X_2$ given $X_1 = -1$ is itself a
mixture of the two conditional normals.

```{r gmm-classify-neg-density, fig.height = 4, fig.width = 6}
f_post <- density(post_neg)
x2_grid <- seq(-5, 5, length.out = 200)
y_post <- sapply(x2_grid, f_post)
plot(x2_grid, y_post, type = "l", lwd = 2,
     xlab = expression(X[2]),
     ylab = expression(f(X[2] ~ "|" ~ X[1] == -1)),
     main = "Posterior density of X2 given X1 = -1")
```

Now observe $X_1 = 2$, which sits near the centre of Class B.

```{r gmm-classify-pos}
post_pos <- conditional(gmm, given_indices = 1, given_values = 2)
post_pos
```

As expected, the weight shifts heavily toward Class B. The conditional
density of $X_2$ now concentrates around the Class B conditional mean.

```{r gmm-classify-pos-density, fig.height = 4, fig.width = 6}
f_post_pos <- density(post_pos)
y_post_pos <- sapply(x2_grid, f_post_pos)
plot(x2_grid, y_post_pos, type = "l", lwd = 2,
     xlab = expression(X[2]),
     ylab = expression(f(X[2] ~ "|" ~ X[1] == 2)),
     main = "Posterior density of X2 given X1 = 2")
```

### Ambiguous observations

When the observation falls between the two cluster means, both classes
retain meaningful probability and the posterior becomes bimodal.

```{r gmm-classify-mid}
post_mid <- conditional(gmm, given_indices = 1, given_values = 0)
post_mid
```

```{r gmm-classify-mid-density, fig.height = 4, fig.width = 6}
f_post_mid <- density(post_mid)
y_post_mid <- sapply(x2_grid, f_post_mid)
plot(x2_grid, y_post_mid, type = "l", lwd = 2,
     xlab = expression(X[2]),
     ylab = expression(f(X[2] ~ "|" ~ X[1] == 0)),
     main = "Posterior density of X2 given X1 = 0 (ambiguous)")
```

At $X_1 = 0$ the posterior for $X_2$ is bimodal -- one mode near the
Class A conditional mean, one near Class B. Although $X_1 = 0$ is
equidistant from the two class means ($-2$ and $2$), Class A receives
roughly 75% of the posterior weight because its broader $X_1$ variance
($\sigma^2 = 1$ vs $0.6$) gives it higher density at the midpoint.
Class B still retains about 25% of the weight, so neither class is
overwhelmingly dominant. This is the hallmark of soft classification:
rather than forcing a hard decision, the mixture posterior represents
genuine uncertainty about the class label.
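
The same weight-update rule traces out the full soft-classification curve:
the posterior probability of Class A as a function of the observed $X_1$.
This is a sketch built from the marginal densities, assuming `density()` on
the univariate marginals is callable as in the mixture examples above:

```{r gmm-posterior-curve, fig.height = 4, fig.width = 6}
f_a <- density(marginal(class_a, 1))  # X1 marginal density of Class A
f_b <- density(marginal(class_b, 1))  # X1 marginal density of Class B

x1_grid <- seq(-6, 6, length.out = 200)
p_a <- sapply(x1_grid, function(x) {
  la <- 0.5 * f_a(x)  # prior weight times Class A likelihood
  lb <- 0.5 * f_b(x)  # prior weight times Class B likelihood
  la / (la + lb)
})

plot(x1_grid, p_a, type = "l", lwd = 2,
     xlab = expression(X[1]), ylab = "P(Class A | X1)",
     main = "Posterior probability of Class A")
abline(h = 0.5, lty = 2)
```

The curve should cross $0.5$ a little to the right of $X_1 = 0$, consistent
with Class A's roughly 75% share at the midpoint noted above.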
