---
title: "Masked Data Likelihood Model: Components with Exponentially Distributed Lifetimes Arranged In Series Configuration"
output:
    rmarkdown::html_vignette:
        toc: true
vignette: >
  %\VignetteIndexEntry{Masked Data Likelihood Model: Components with Exponentially Distributed Lifetimes Arranged In Series Configuration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
header-includes:
- \renewcommand{\v}[1]{\boldsymbol{#1}}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  echo = TRUE,
  comment = "#>")

# Set to TRUE to regenerate long-running simulation results
run_long <- FALSE

# maskedcauses re-exports generics from likelihood.model
# (loglik, score, hess_loglik, fit, assumptions, fim)
library(maskedcauses)
# md_encode_matrix and md_boolean_matrix_to_charsets are provided by this package
old_opts <- options(digits = 4)
```


The R package `maskedcauses` is a framework for
estimating the parameters of the latent component lifetime distributions in a
series system from *masked data*.

Exponentially Distributed Component Lifetimes {#expo}
=====================================================
Consider a series system in which the components have exponentially distributed
lifetimes.
The $j$\textsuperscript{th} component of the $i$\textsuperscript{th} system
has a lifetime distribution given by
$$
    T_{i j} \sim \operatorname{EXP}(\lambda_j)
$$
for $j=1,\ldots,m$, so the model's parameter vector is
$\v\lambda = \bigl(\lambda_1,\ldots,\lambda_m\bigr)$.
The random variable $T_{i j}$ has a reliability, pdf, and hazard function
given respectively by
\begin{align}
    \label{eq:expo_reliability}
    R_j(t|\lambda_j)   &= \exp(-\lambda_j t),\\
    \label{eq:expo_pdf}
    f_j(t|\lambda_j)   &= \lambda_j \exp(-\lambda_j t),\\
    \label{eq:expo_haz}
    h_j(\cdot|\lambda_j) &= \lambda_j
\end{align}
where $t > 0$ is the lifetime and $\lambda_j > 0$ is the failure rate of the $j$-th component.
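These three functions are related by the identity
$f_j(t|\lambda_j) = h_j(t|\lambda_j) R_j(t|\lambda_j)$, which we can sanity-check
with a quick base-R sketch (the function names here are illustrative only, not
part of the package API):

```{r expo-component-fns}
# Component reliability, density, and hazard for EXP(lambda)
R_j <- function(t, lambda) exp(-lambda * t)
f_j <- function(t, lambda) lambda * exp(-lambda * t)
h_j <- function(t, lambda) rep(lambda, length(t))

# The density is the hazard times the reliability: f = h * R
t_vals <- c(0.1, 1, 2.5)
all.equal(f_j(t_vals, 1.1), h_j(t_vals, 1.1) * R_j(t_vals, 1.1))
```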

The lifetime of the series system composed of $m$ components with exponentially
distributed lifetimes has a reliability function given by
\begin{equation}
\label{eq:sys_expo_reliability_function}
R_{T_i}(t_i|\v\lambda) = \exp \biggl(-\sum_{j=1}^{m}{\lambda_j} t_i\biggr)
\end{equation}
where $t_i > 0$.
\begin{proof}
By the series system reliability theorem,
$$
    R_{T_i}(t_i;\v\lambda) = \prod_{j=1}^{m} R_j(t_i;\lambda_j).
$$
Plugging in the component reliability functions given by
Equation \eqref{eq:expo_reliability} obtains the result
\begin{align*}
R_{T_i}(t_i;\v\lambda) &= \prod_{j=1}^{m} \exp(-\lambda_j t_i)\\
    &= \exp \biggl(-\sum_{j=1}^{m}{\lambda_j} t_i\biggr).
\end{align*}
\end{proof}

A series system with exponentially distributed lifetimes is also 
exponentially distributed.
\begin{theorem}
\label{thm:expo_series_family}
The random lifetime $T_i$ of a series system composed of $m$ components with 
exponentially distributed lifetimes is exponentially distributed with a failure 
rate that is the sum of the component failure rates,
$$
    T_i \sim \operatorname{EXP} \biggl(\sum_{j=1}^{m} \lambda_j\biggr).
$$
\end{theorem}
\begin{proof}
By Equation \eqref{eq:sys_expo_reliability_function}, the series system has a
reliability function in the family of the exponential distribution with a
failure rate that is the sum of the component failure rates.
\end{proof}

The series system's failure rate function is given by
\begin{equation}
\label{eq:expo_sys_failure_rate}
h(\cdot|\v\lambda) = \sum_{j=1}^{m} \lambda_j
\end{equation}
whose proof follows from the series system failure rate theorem.

We see that the system failure rate $\lambda_{\text{sys}} = \sum_{j=1}^m \lambda_j$ is *constant*,
consistent with the exponential distribution being the only continuous
distribution that has a constant failure rate.
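A quick base-R simulation illustrates Theorem \ref{thm:expo_series_family}: the
minimum of independent exponential component lifetimes behaves like a single
exponential lifetime with the summed rate. (This chunk is purely illustrative.)

```{r series-expo-sim-check}
set.seed(1)
rates <- c(1, 1.1, 0.95)
# System lifetime = min of the component lifetimes; one draw per component
sys_t <- replicate(1e4, min(rexp(length(rates), rates)))
# Under the theorem, E[T] = 1 / sum(rates)
c(simulated_mean = mean(sys_t), theoretical_mean = 1 / sum(rates))
```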

The pdf of the series system is given by
\begin{equation}
\label{eq:expo_sys_pdf}
f_{T_i}(t_i|\v\lambda) = \biggl( \sum_{j=1}^{m} {\lambda_j} \biggr) \exp
    \biggl(-\sum_{j=1}^{m}{\lambda_j} t_i\biggr)
\end{equation}
where $t_i > 0$ is the lifetime of the system.
\begin{proof}
By definition,
$$
f_{T_i}(t_i|\v\lambda) = h_{T_i}(t_i|\v\lambda) R_{T_i}(t_i|\v\lambda).
$$
Plugging in the system failure rate and reliability functions given
respectively by Equations \eqref{eq:expo_sys_failure_rate} and
\eqref{eq:sys_expo_reliability_function} completes the proof.
\end{proof}

The conditional probability that component $k$ is the cause of a system failure
at time $t$ is given by
\begin{equation}
\label{eq:expo_prob_K_given_S}
f_{K_i|T_i}(k|t,\v\lambda) = f_{K_i}(k|\v\lambda) =
    \frac{\lambda_k}{\sum_{p=1}^{m} \lambda_p}
\end{equation}
where $k \in \{1,\ldots,m\}$ and $t > 0$.
\begin{proof}
By the conditional cause-of-failure theorem,
$$
f_{K_i|T_i}(k|t,\v\lambda) = \frac{h_k(t|\lambda_k)}{h_{T_i}(t;\v\lambda)}.
$$
Plugging in the failure rate of the component indexed by $k$ and the failure
rate of the system given respectively by Equations \eqref{eq:expo_haz}
and \eqref{eq:expo_sys_failure_rate} completes the proof.
\end{proof}

Due to the constant failure rates of the components, $K_i$ and $T_i$ are
mutually independent.
The joint pdf of $K_i$ and $T_i$ is given by
\begin{equation}
\label{eq:expo_joint_k_s}
f_{K_i,T_i}(k,t|\v\lambda) = \lambda_k \exp \biggl(-\sum_{j=1}^{m}{\lambda_j} t\biggr)
\end{equation}
where $k \in \{1,\ldots,m\}$ and $t > 0$.
\begin{proof}
By definition,
$$
f_{K_i,T_i}(k,t|\v\lambda) =
    f_{K_i|T_i}(k|t,\v\lambda) f_{T_i}(t|\v\lambda).
$$
Plugging in the conditional probability and the marginal probability given
respectively by Equations \eqref{eq:expo_prob_K_given_S} and
\eqref{eq:expo_sys_pdf} completes the proof.
\end{proof}
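Equation \eqref{eq:expo_prob_K_given_S} is straightforward to check empirically
in base R: the failed component (the index of the smallest component lifetime)
should occur with relative frequency $\lambda_k / \sum_p \lambda_p$, regardless
of when the system fails. An illustrative simulation:

```{r cause-prob-check}
set.seed(1)
rates <- c(1, 2, 3)
# Index of the failed (first-failing) component in each simulated system
cause <- replicate(1e4, which.min(rexp(length(rates), rates)))
rbind(empirical   = as.numeric(table(cause)) / 1e4,
      theoretical = rates / sum(rates))
```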

Likelihood Model {#likelihood-model}
====================================

In this study, the system is a series system with $m$
components. The true DGP for the system lifetime is in the
exponential series system family, i.e., the component lifetimes are
exponentially and independently distributed and we denote the
true parameter value by $\theta$.

The principal object of study is $\v\lambda$, which in the case of
the exponential series system family consists of $m$ rate parameters, one per
component lifetime, $\v\lambda = (\lambda_1, \ldots, \lambda_m)'$.

We are interested in estimating $\theta$ from masked data.
The masking comes in two independent forms:

- Censored system failure times, e.g., right-censoring. The system lifetime
is the minimum of the component lifetimes,
$$
    T_i = \min\{T_{i 1}, \ldots, T_{i m}\},
$$
and the $i$\textsuperscript{th} observation is right-censored if the system
does not fail during the observation period, i.e., we observe
$\min\{T_i, \tau_i\}$, where $\tau_i$ is the right-censoring time for the
$i$\textsuperscript{th} system.

- The cause of failure, the failed component, is masked. This masking
comes in the form of a candidate set $\mathcal{C}_i$ that, on average,
conveys information about the component cause of failure.

The candidate set $\mathcal{C}_i$ is a random variable that is a subset of
$\{1,\ldots,m\}$. The true DGP for the candidate set model has a general form
that may be denoted by
$$
    \Pr\{\mathcal{C}_i=c_i \mid T_{i 1}=t_{i 1},\ldots,T_{i m}=t_{i m},\theta,\text{other factors}\}.
$$

This is a complicated model, and we are not directly interested in the DGP
for candidate sets, except to the extent that it affects the sampling
distribution of the MLE for $\theta$.

In theory, given some candidate set model, we could construct a joint likelihood
function for the full model and jointly estimate the parameters of both the
candidate set model and $\theta$. In practice, however, this could be a very
challenging task unless we make some simplifying assumptions about the DGP for
candidate sets.

## Candidate set models
In every model we consider, we assume that the candidate set $\mathcal{C}_i$
is only a function of the component lifetimes $T_{i 1},\ldots,T_{i m}$, $\theta$,
and the right-censoring time $\tau_i$. That is, the candidate set $\mathcal{C}_i$
is independent of any other factors (or held constant for the duration of the
experiment), like ambient temperature, and these factors also have a negligible
effect on the series system lifetime and thus we can ignore them.

### Reduced likelihood model
In the reduced likelihood model, we make the following assumptions about
how candidate sets are generated:

  - $C_1$: The index of the failed component is in the candidate set,
  i.e., $\Pr\{K_i \in \mathcal{C}_i\} = 1$, where
  $K_i = \arg\min_j \{ T_{i j} : j = 1,\ldots,m\}$.

  - $C_2$: Given the failure time, the candidate set is equally probable
  regardless of which component in it failed, i.e.,
  $\Pr\{\mathcal{C}_i=c_i|K_i=j,T_i=t_i,\theta\} = \Pr\{\mathcal{C}_i=c_i|K_i=j',T_i=t_i,\theta\}$
  for any $j,j' \in c_i$.
    
  - $C_3$: The masking probabilities are conditionally independent of $\theta$
  given $K_i$ and $T_i$, i.e., $\Pr\{\mathcal{C}_i=c_i|K_i=j,T_i=t_i\}$ is not a
  function of $\theta$.

Using these simplifying assumptions, we can arrive at a reduced likelihood
function that only depends on $\theta$ and the observed data and as long as our
candidate set satisfies conditions $C_1$, $C_2$, and $C_3$, our reduced
likelihood function obtains the same MLEs as the full likelihood function.

We see that
$$
    \Pr\{\mathcal{C}_i=c_i|K_i=j,T_i=t_i\} = g(c_i,t_i)
$$
for every $j \in c_i$, since by condition $C_2$ the probability cannot depend
on $j$ and by condition $C_3$ it cannot depend on $\theta$.
Summing the joint density over the possible causes of failure (only
$j \in c_i$ contributes, by condition $C_1$), the likelihood factors as
$$
    L(\theta) = \prod_{i=1}^n g(c_i,t_i) \sum_{j \in c_i} f_j(t_i|\theta_j) \prod_{l=1, l \neq j}^m R_l(t_i|\theta_l).
$$
Since $g(c_i,t_i)$ does not depend on $\theta$, the reduced likelihood
$$
    L_r(\theta) = \prod_{i=1}^n \sum_{j \in c_i} f_j(t_i|\theta_j) \prod_{l=1, l \neq j}^m R_l(t_i|\theta_l)
$$
is proportional to the full likelihood and therefore yields the same MLEs.

Note, however, that different ways in which the conditions are met will yield
MLEs with different sampling distributions, e.g., more or less efficient estimators.

### Bernoulli candidate set model #1
This is a special case of the reduced likelihood model.
In this model, we satisfy conditions $C_1$, $C_2$, and $C_3$ by including each
of the non-failed components in the candidate set independently with a fixed
probability $p$, $0 < p < 1$.

In the simplest case, $p = 0.5$, and each admissible candidate set $c_i$ has
probability given by
$$
\Pr\{\mathcal{C}_i=c_i|K_i=j,T_i=t_i\} =
\begin{cases}
        (1/2)^{m-1} & \text{if $j \in c_i$ and $c_i \subseteq \{1,\ldots,m\}$} \\
        0 & \text{if $j \notin c_i$}.
\end{cases}
$$
\begin{proof}
Since there are $m-1$ non-failed components (the failed component $j$ is 
in $c_i$ with probability $1$), there are $2^{m-1}$ possible
candidate sets (the size of the power set of the non-failed component indexes).
Each of these candidate sets has equal probability of occurring, and thus
the probability of any particular candidate set is $1/2^{m-1}$.
\end{proof}
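We can check this in base R by simulating the masking mechanism directly,
always including the failed component and flipping a fair coin for each of the
others (illustrative code only, not the package's candidate set generator):

```{r bernoulli-cand-check}
set.seed(1)
m_demo <- 3
k_demo <- 1  # suppose component 1 failed
sets <- replicate(1e4, {
    inc <- runif(m_demo) < 0.5  # include each component w.p. 1/2 ...
    inc[k_demo] <- TRUE         # ... and the failed component always (C1)
    paste(which(inc), collapse = ",")
})
# All 2^(m_demo - 1) = 4 admissible candidate sets should be equally frequent
table(sets) / 1e4
```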

### Bernoulli candidate set model #2

Now, we remove condition $C_2$. We still assume conditions $C_1$ and $C_3$,
but we allow $C_i$ to depend on the failed component $K_i$, i.e.,
$$
    \Pr\{\mathcal{C}_i=c_i|K_i=j,T_i=t_i\} \neq \Pr\{\mathcal{C}_i=c_i|K_i=j',T_i=t_i\}
$$
in general, for $j,j' \in c_i$.

In this case, the masking model no longer factors out of the likelihood.
Writing $g(c_i,t_i,j) = \Pr\{\mathcal{C}_i=c_i|K_i=j,T_i=t_i\}$, we have
$$
    L(\theta) = \prod_{i=1}^n \sum_{j \in c_i} g(c_i,t_i,j) \, f_{K_i,T_i}(j,t_i|\theta).
$$


Simulation {#simulation}
========================

The most straightforward series system to estimate is the series system with
exponentially distributed component lifetimes.

Suppose an exponential series system with $m$ components is parameterized by
the following R code:

```{r}
theta <- c(1,     # component 1 failure rate
           1.1,   # component 2 failure rate
           0.95,  # component 3 failure rate
           1.15,  # component 4 failure rate
           1.1)   # component 5 failure rate

m <- length(theta)
```

So, in our study, $\theta = (`r theta`)'$.
The component assigned to index $j$ has an exponentially distributed
lifetime with a failure rate $\theta_j$, e.g., $\theta_2 = `r theta[2]`$ is the
failure rate of the component indexed by $2$.

Let's simulate generating the lifetimes of the $m = `r m`$ components for this series
system:
```{r}
set.seed(7231) # set seed for reproducibility
n <- 7500
comp_times <- matrix(nrow=n,ncol=m)
for (j in 1:m)
    comp_times[,j] <- rexp(n,theta[j])
comp_times <- md_encode_matrix(comp_times,"t")
head(comp_times, 4)
```

Next, we use the function `md_series_lifetime_right_censoring` to decorate the
masked data with a right-censoring time $\tau$ chosen so that
$\Pr\{T_i > \tau\} = q = 0.25$, i.e., about $25\%$ of the systems are
right-censored. Since $T_i \sim \operatorname{EXP}\bigl(\sum_j \theta_j\bigr)$,
solving $\exp\bigl(-\tau \sum_j \theta_j\bigr) = q$ gives
$\tau = -\log(q) / \sum_j \theta_j$:
```{r right-censoring}
q <- 0.25
tau <- rep(-(1/sum(theta))*log(q),n)
data <- comp_times |> md_series_lifetime_right_censoring(tau)
latent <- attr(data, "latent")
head(data[, !colnames(data) %in% latent], 4)
```

## Masked component cause of failure
We simulate candidate sets using the Bernoulli candidate model with an
appropriate set of parameters to satisfy conditions $C_1$, $C_2$, and $C_3$:
```{r bernoulli-cand, warning=FALSE, message=FALSE}
p <- .3
data <- data |> md_bernoulli_cand_c1_c2_c3(p)
head(data[, paste0("q", 1:m)], 4)
```

Now, to generate candidate sets, we sample from these probabilities:
```{r cand-sampler}
data <- data |> md_cand_sampler()
data$omega <- ifelse(data$delta, "exact", "right")
display <- md_boolean_matrix_to_charsets(data, drop_set = TRUE)
latent <- attr(display, "latent")
head(display[, !colnames(display) %in% latent], 6)
```

We see that after dropping latent (unobserved) columns, we only have the right
censoring time, right censoring indicator, and the candidate sets. (Note that
this time we showed the candidate sets in a more friendly way using
`md_boolean_matrix_to_charsets`.)

Constructing the Likelihood Model {#constructing-model}
=================================
The likelihood model is a statistical model that describes the distribution of
the observed data as a function of the parameters of interest.

We construct a likelihood model for the masked data model with exponentially
distributed component lifetimes with the following code:

```{r likelihood-model}
model <- exp_series_md_c1_c2_c3()
```


Maximum likelihood estimation {#mle}
====================================
The log-likelihood for our masked data model under masking conditions
C1 (failed component in candidate set), C2 (uniform candidate set
probability), and C3 (masking independent of parameters) is given by
\begin{equation}
\label{eq:loglik_masked}
\ell(\lambda) =
    \sum_{i=1}^{n} (1-\delta_i) \log \biggl(\sum_{j \in c_i} \lambda_j \biggr) -
    \biggl( \sum_{i=1}^{n} s_i \biggr)
    \biggl( \sum_{j=1}^{m} \lambda_j \biggr).
\end{equation}
\begin{proof}
By the general masked-data log-likelihood (summing over the survival and
candidate-set contributions for each observation),
$$
\ell(\lambda) = \sum_{i=1}^n \log R(s_i;\lambda) + \sum_{i=1}^n (1-\delta_i)
    \log \biggl\{ \sum_{k\in c_i} h_k(s_i;{\lambda_k}) \biggr\}.
$$
Plugging in the component failure rate and system reliability functions given
respectively by Equations \eqref{eq:expo_haz} and
\eqref{eq:sys_expo_reliability_function} and simplifying completes the proof.
\end{proof}

Any MLE must be a stationary point of the log-likelihood, i.e., a point at
which the score function, a map $\mathbb{R}^m \to \mathbb{R}^m$, is zero.
The $p$\textsuperscript{th} component of the score is given by
\begin{equation}
\label{eq:score_expo_j}
\frac{\partial \ell}{\partial \lambda_p} = 
  \sum_{i=1}^{n} \biggl( \sum_{j \in c_i} \lambda_j \biggr)^{-1}
  1_{\{p \in c_i \text{ and } \delta_i = 0\}} - \sum_{i=1}^{n} s_i.
\end{equation}

We may find an MLE by solving the score equation,
i.e., finding stationary points satisfying
$$
\frac{\partial \ell}{\partial \lambda_j}\Biggr|_{\lambda = \hat\lambda} = 0
$$
for $j=1,\ldots,m$.
We approximate a solution to this problem by using the iterative
Newton-Raphson method.

The Newton-Raphson method needs the observed information matrix
$J(\lambda) = -\nabla^2 \ell(\lambda)$, a map
$\mathbb{R}^m \to \mathbb{R}^{m \times m}$.
The $(j,k)$-th element of $J(\lambda)$ is given by
\begin{equation}
\label{eq:info_expo}
J_{j k}(\lambda) = -\frac{\partial^2 \ell}{\partial \lambda_j \partial \lambda_k} = 
  \sum_{i=1}^{n} \biggl( \sum_{l \in c_i} \lambda_l \biggr)^{-2}
  1_{\{j,k \in c_i \text{ and } \delta_i = 0\}}.
\end{equation}
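To make the procedure concrete, here is a self-contained base-R sketch of
Newton-Raphson on the score given by Equation \eqref{eq:score_expo_j}, using
synthetic masked data generated directly from the model. This is illustrative
code only, not the package's `fit()` implementation; all variable names are
local to this chunk.

```{r newton-sketch}
set.seed(42)
n_nr <- 2000
lam_true <- c(1, 1.5, 2)
m_nr <- length(lam_true)

# Synthetic masked data: exact failure times with Bernoulli candidate sets
comp <- matrix(rexp(n_nr * m_nr, rate = rep(lam_true, each = n_nr)), nrow = n_nr)
t_obs <- apply(comp, 1, min)                # system failure times (no censoring)
cause <- apply(comp, 1, which.min)          # latent cause of failure
cand <- matrix(runif(n_nr * m_nr) < 0.3, nrow = n_nr)
cand[cbind(1:n_nr, cause)] <- TRUE          # condition C1

# Closed-form score and observed information (delta_i = 0 for all i here)
score_fn <- function(l) {
    cs <- as.numeric(cand %*% l)            # candidate-set rate sums
    colSums(cand / cs) - sum(t_obs)
}
info_fn <- function(l) {
    A <- cand / as.numeric(cand %*% l)      # A[i, j] = 1{j in c_i} / cs_i
    t(A) %*% A                              # J_jk = sum_i 1{j,k in c_i} / cs_i^2
}

# Newton-Raphson: lambda <- lambda + J(lambda)^{-1} score(lambda)
lam <- rep(1, m_nr)
for (it in 1:25) lam <- lam + solve(info_fn(lam), score_fn(lam))
round(lam, 3)  # close to lam_true up to sampling error
```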

## Log-likelihood of $\theta$ given masked data

The reduced log-likelihood function (the log of the kernel of the likelihood
function) is given by
$$
\ell(\theta|\text{data}) =
    -\left(\sum_{i=1}^{n} t_i\right)
    \left(\sum_{j=1}^{m} \theta_j\right) +
    \sum_{i=1}^{n} (1-\delta_i)\log\left(\sum_{j \in c_i} \theta_j\right).
$$
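This reduced log-likelihood depends on the data only through $\sum_i t_i$ and
the per-observation candidate-set rate sums. A minimal base-R sketch of it
(illustrative only; the package's `loglik()` generic is the supported
interface):

```{r reduced-loglik-sketch}
# theta: rate vector; tvec: observed times; delta: 1 if right-censored;
# cand: n x m logical candidate-set matrix (rows of censored obs ignored)
loglik_reduced <- function(theta, tvec, delta, cand) {
    cand_sums <- as.numeric(cand %*% theta)
    sum(log(cand_sums[delta == 0])) - sum(tvec) * sum(theta)
}

# Two observations of a 2-component system: one exact, one right-censored
loglik_reduced(theta = c(1, 2), tvec = c(0.5, 1.2), delta = c(0, 1),
               cand = rbind(c(TRUE, TRUE), c(FALSE, FALSE)))
# equals log(3) - 1.7 * 3
```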

We compute the log-likelihood function using generic dispatch:
```{r loglike-function}
ll <- loglik(model)
```

The returned function `ll(df, par)` evaluates the log-likelihood at parameter
`par` given data `df`. For example, at the true parameter value:
```{r loglike-eval}
ll(data, theta)
```

Note that the implementation computes the log-likelihood from minimal
sufficient statistics, which improves computational efficiency.

Given a sample of masked data satisfying conditions $C_1$, $C_2$, and $C_3$,
the log-likelihood carries all of the information in `data` that is relevant
for inference about $\theta$.

Suppose we do not know that $\theta = (`r theta`)'$.
With the log-likelihood, we may estimate $\theta$ with $\hat\theta$ by solving
$$
\hat{\theta} = \operatorname{argmax}_{\theta \in \Omega} \ell(\theta),
$$
i.e., finding the point that *maximizes* the log-likelihood on
the observed sample `data`.
This is known as *maximum likelihood estimation* (MLE).
We typically solve for the MLE by solving
$$
\nabla \ell|_{\theta=\hat{\theta}} = 0.
$$

A popular choice is gradient ascent, an iterative method
based on the update rule
$$
\theta^{(n+1)} = \theta^{(n)} + \eta \, \nabla \ell(\theta^{(n)}),
$$
where $\eta$ is the learning rate.
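To illustrate on the simplest possible case, here is gradient ascent on the
log-likelihood of a single exponential rate,
$\ell(\lambda) = n \log \lambda - \lambda \sum_i t_i$, whose MLE
$n / \sum_i t_i$ is known in closed form (base R, illustrative only):

```{r gradient-ascent-toy}
set.seed(1)
x <- rexp(100, rate = 2)
grad_ll <- function(l) length(x) / l - sum(x)  # d/dl of n*log(l) - l*sum(x)

l <- 1       # initial guess
eta <- 1e-3  # learning rate
for (i in 1:5000) l <- l + eta * grad_ll(l)
c(gradient_ascent = l, closed_form = length(x) / sum(x))
```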

We can also obtain the score (gradient) function via generic dispatch:
```{r score-function, message=FALSE, warning=FALSE}
grad <- score(model)
```

Evaluated at the true parameter, the score should be close to zero, since its
expectation there is zero; at the MLE it is exactly zero:
```{r score-eval}
grad(data, theta)
```

The `likelihood.model` framework provides analytical score and Hessian
implementations when available, falling back to numerical differentiation
otherwise.

In what follows, we use [algebraic.mle](https://github.com/queelius/algebraic.mle)
to help solve the MLE equations and display various properties of the solution.

To solve the MLE equation, we use the generic `fit()` function, which dispatches
to `fit.likelihood_model` for any object with `"likelihood_model"` in its class.
The `fit()` function returns a solver that uses `optim` internally with the
BFGS method by default.

```{r fit-model, warning=FALSE}
# Get the solver from the model using generic dispatch
solver <- fit(model)

# Solve for MLE with initial guess
theta0 <- rep(1, m)
estimate <- solver(data, par = theta0, method = "Nelder-Mead")
```

The result is an `mle` object from the `algebraic.mle` package with rich
accessor methods:

```{r mle-results}
# Print summary with confidence intervals
print(estimate)
```

We can access specific components of the MLE:
```{r mle-accessors}
# Point estimate
theta.hat <- estimate$par
cat("MLE:", round(theta.hat, 4), "\n")

# Standard errors (sqrt of diagonal of variance-covariance matrix)
cat("SE:", round(sqrt(diag(estimate$vcov)), 4), "\n")

# Log-likelihood at MLE
cat("Log-likelihood:", round(estimate$loglik, 4), "\n")
```

Recall that the true parameter is $\theta = (`r theta`)'$.

Due to sampling variability, different runs of the experiment
will result in different outcomes, i.e., $\hat{\theta}$ has a
sampling distribution.
We see that $\hat{\theta} \neq \theta$, but it is reasonably
close.
We may measure this sampling variability using the variance-covariance
matrix, bias, mean squared error (MSE), and confidence intervals.


Observation Types and Censoring {#observation-types}
====================================================

The likelihood model supports four observation types. Each arises from a
different monitoring scheme:

- **Exact** ($\omega = \text{exact}$): The system failure time $t$ is observed
  directly, e.g., continuous monitoring.
- **Right-censored** ($\omega = \text{right}$): The system is known to have
  survived past $t$, e.g., study ends before failure.
- **Left-censored** ($\omega = \text{left}$): The system is known to have
  failed before $t$, e.g., a single inspection finds it failed.
- **Interval-censored** ($\omega = \text{interval}$): The failure occurred
  in an interval $(a, b)$, e.g., periodic inspections bracket the failure.

For the exponential series system, the individual-observation log-likelihood
contributions are:
\begin{align}
\text{Exact:} \quad & \log \lambda_c - \lambda_{\text{sys}} t \\
\text{Right:} \quad & -\lambda_{\text{sys}} t \\
\text{Left:} \quad & \log \lambda_c + \log(1 - e^{-\lambda_{\text{sys}} \tau})
  - \log \lambda_{\text{sys}} \\
\text{Interval:} \quad & \log \lambda_c - \lambda_{\text{sys}} a
  + \log(1 - e^{-\lambda_{\text{sys}}(b - a)}) - \log \lambda_{\text{sys}}
\end{align}
where $\lambda_c = \sum_{j \in c_i} \lambda_j$ and
$\lambda_{\text{sys}} = \sum_j \lambda_j$.

The left-censored contribution can be interpreted as
$\log w_c + \log F(\tau)$, where $w_c = \lambda_c / \lambda_{\text{sys}}$
is the candidate cause weight and $F(\tau)$ is the system CDF.
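The interval-censored contribution has the same structure: it equals
$\log\bigl(w_c \, \{F(b) - F(a)\}\bigr)$, the log of the candidate cause weight
times the probability mass in the inspection interval. A quick base-R check of
this identity (illustrative only; `Fsys` is a local helper, not a package
function):

```{r interval-contrib-check}
lam <- c(1, 1.1, 0.95)
lam_sys <- sum(lam)
lam_c <- sum(lam[c(1, 3)])  # candidate set {1, 3}
a <- 0.4; b <- 0.9          # inspection interval bracketing the failure
Fsys <- function(t) 1 - exp(-lam_sys * t)  # system CDF

contrib <- log(lam_c) - lam_sys * a +
    log(1 - exp(-lam_sys * (b - a))) - log(lam_sys)
all.equal(contrib, log((lam_c / lam_sys) * (Fsys(b) - Fsys(a))))
```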

**The exponential model is the only one where all four types have fully
analytical loglik, score, AND Hessian.** The Weibull models require
numerical integration or `numDeriv` for left/interval types.

## Observe functors

The package provides composable observation functors for generating data under
different monitoring schemes:

```{r observe-functors}
# Periodic inspection every 0.5 time units, study ends at tau = 5
obs_periodic <- observe_periodic(delta = 0.5, tau = 5)

# Single inspection at tau = 3
obs_left <- observe_left_censor(tau = 3)

# Mix of continuous and periodic monitoring
obs_mixed <- observe_mixture(
  observe_right_censor(tau = 5),
  observe_left_censor(tau = 3),
  weights = c(0.7, 0.3)
)
```

## Generating mixed-censoring data

```{r mixed-censoring-data}
gen <- rdata(model)

# Periodic inspections
set.seed(7231)
df_periodic <- gen(theta, n = 500, p = 0.3,
                   observe = observe_periodic(delta = 0.5, tau = 5))
cat("Periodic inspection observation types:\n")
print(table(df_periodic$omega))

# Mixed monitoring
set.seed(7231)
df_mixed <- gen(theta, n = 500, p = 0.3,
                observe = observe_mixture(
                  observe_right_censor(tau = 5),
                  observe_left_censor(tau = 3),
                  weights = c(0.7, 0.3)
                ))
cat("\nMixed monitoring observation types:\n")
print(table(df_mixed$omega))
```

## Likelihood evaluation on mixed-censoring data

All four observation types contribute analytically to the loglik, score, and
Hessian for the exponential model:

```{r mixed-censoring-eval}
ll_fn <- loglik(model)
scr_fn <- score(model)
hess_fn <- hess_loglik(model)

# Evaluate at true parameters
ll_val <- ll_fn(df_periodic, theta)
scr_val <- scr_fn(df_periodic, theta)
hess_val <- hess_fn(df_periodic, theta)

cat("Log-likelihood (periodic):", round(ll_val, 4), "\n")
cat("Score (periodic):", round(scr_val, 4), "\n")
cat("Hessian eigenvalues:", round(eigen(hess_val)$values, 4), "\n")
```

We verify that the analytical score is consistent with numerical
differentiation:

```{r score-verify}
scr_numerical <- numDeriv::grad(
  func = function(th) ll_fn(df_periodic, th),
  x = theta
)
cat("Max |analytical - numerical| score:",
    formatC(max(abs(scr_val - scr_numerical)), format = "e", digits = 2), "\n")
```


## Monte Carlo simulation study

To understand the sampling properties of the MLE, we conduct a Monte Carlo
simulation study. We repeatedly:

1. Generate masked data from the true model
2. Fit the likelihood model to obtain $\hat{\theta}$
3. Compute asymptotic confidence intervals

From the collection of estimates, we can compute:

- **Bias**: $\text{E}[\hat{\theta}] - \theta$
- **Variance**: $\text{Var}[\hat{\theta}]$
- **MSE**: $\text{E}[(\hat{\theta} - \theta)^2] = \text{Bias}^2 + \text{Var}$
- **Coverage probability**: Proportion of CIs containing the true value

```{r load-precomputed, include=FALSE, eval=!run_long}
# Load pre-computed simulation results when not re-running
list2env(readRDS("precomputed_results.rds"), envir = environment())
```

```{r monte-carlo-setup, cache=TRUE, eval=run_long}
set.seed(7231)

# Simulation parameters
B <- 200       # Number of Monte Carlo replications
alpha <- 0.05  # Significance level for CIs

# Storage for results
estimates <- matrix(NA, nrow = B, ncol = m)
se_estimates <- matrix(NA, nrow = B, ncol = m)
ci_lower <- matrix(NA, nrow = B, ncol = m)
ci_upper <- matrix(NA, nrow = B, ncol = m)
converged <- logical(B)
```

```{r monte-carlo-run, cache=TRUE, warning=FALSE, eval=run_long}
for (b in 1:B) {
    # Generate component lifetimes
    comp_times_b <- matrix(nrow = n, ncol = m)
    for (j in 1:m) {
        comp_times_b[, j] <- rexp(n, theta[j])
    }
    comp_times_b <- md_encode_matrix(comp_times_b, "t")

    # Apply masking pipeline
    data_b <- comp_times_b |>
        md_series_lifetime_right_censoring(tau) |>
        md_bernoulli_cand_c1_c2_c3(p) |>
        md_cand_sampler()
    data_b$omega <- ifelse(data_b$delta, "exact", "right")

    # Fit model
    tryCatch({
        result_b <- solver(data_b, par = theta0, method = "Nelder-Mead")
        estimates[b, ] <- result_b$par
        se_estimates[b, ] <- sqrt(diag(result_b$vcov))

        # Asymptotic Wald CIs
        z <- qnorm(1 - alpha/2)
        ci_lower[b, ] <- result_b$par - z * sqrt(diag(result_b$vcov))
        ci_upper[b, ] <- result_b$par + z * sqrt(diag(result_b$vcov))
        converged[b] <- result_b$converged
    }, error = function(e) {
        converged[b] <<- FALSE
    })
}

cat("Convergence rate:", mean(converged, na.rm = TRUE), "\n")
```

### Bias, Variance, and MSE

```{r monte-carlo-results}
# Compute summary statistics (only for converged runs)
valid <- converged & !is.na(estimates[, 1])
est_valid <- estimates[valid, , drop = FALSE]

bias <- colMeans(est_valid) - theta
variance <- apply(est_valid, 2, var)
mse <- bias^2 + variance
rmse <- sqrt(mse)

# Create results table
results_df <- data.frame(
    Component = 1:m,
    True = theta,
    Mean_Est = colMeans(est_valid),
    Bias = bias,
    Variance = variance,
    MSE = mse,
    RMSE = rmse,
    Rel_Bias_Pct = 100 * bias / theta
)

knitr::kable(results_df, digits = 4,
             caption = "Monte Carlo Results: Bias, Variance, and MSE",
             col.names = c("Component", "True θ", "Mean θ̂", "Bias",
                          "Variance", "MSE", "RMSE", "Rel. Bias %"))
```

### Confidence Interval Coverage

The asymptotic $100(1-\alpha)\%$ Wald confidence interval is:
$$
\hat{\theta}_j \pm z_{1-\alpha/2} \cdot \text{SE}(\hat{\theta}_j)
$$

We assess coverage probability: the proportion of intervals that contain
the true parameter value.

```{r ci-coverage}
# Compute coverage for each component
coverage <- numeric(m)
for (j in 1:m) {
    valid_j <- valid & !is.na(ci_lower[, j]) & !is.na(ci_upper[, j])
    covered <- (ci_lower[valid_j, j] <= theta[j]) & (theta[j] <= ci_upper[valid_j, j])
    coverage[j] <- mean(covered)
}

# Mean CI width
mean_width <- colMeans(ci_upper[valid, ] - ci_lower[valid, ], na.rm = TRUE)

coverage_df <- data.frame(
    Component = 1:m,
    True = theta,
    Coverage = coverage,
    Nominal = 1 - alpha,
    Mean_Width = mean_width
)

knitr::kable(coverage_df, digits = 4,
             caption = paste0("Coverage Probability of ", 100*(1-alpha), "% Confidence Intervals"),
             col.names = c("Component", "True θ", "Coverage", "Nominal", "Mean Width"))
```

### Sampling Distribution Visualization

```{r sampling-dist-plot, fig.width=8, fig.height=4}
# Plot sampling distributions
oldpar <- par(mfrow = c(1, min(m, 5)), mar = c(4, 4, 2, 1))
for (j in 1:min(m, 5)) {
    hist(est_valid[, j], breaks = 20, probability = TRUE,
         main = paste0("Component ", j),
         xlab = expression(hat(theta)[j]),
         col = "lightblue", border = "white")
    abline(v = theta[j], col = "red", lwd = 2, lty = 2)
    abline(v = mean(est_valid[, j]), col = "blue", lwd = 2)
    legend("topright", legend = c("True", "Mean Est."),
           col = c("red", "blue"), lty = c(2, 1), lwd = 2, cex = 0.7)
}
par(oldpar)  # restore graphics parameters
```

### Summary

The Monte Carlo simulation ($n = 7500$, $B = 200$, $p = 0.3$, $\sim 25\%$
censoring) demonstrates that the MLE for the exponential series system with
masked data satisfying conditions C1--C3 has desirable large-sample properties:

- **Essentially unbiased.** All relative biases are below 0.7%, with the largest
  being component 5 at 0.67%. The squared-bias contribution to MSE is negligible
  (e.g., component 1: $\text{bias}^2 \approx 2 \times 10^{-7}$ vs
  $\text{Var} = 0.0019$). The MLE is consistent and approximately unbiased at
  this sample size.

- **Uniform relative precision.** RMSE ranges from 0.042 to 0.046 across
  components --- roughly 4--5% of the true rates. This uniformity is consistent
  with the asymptotic efficiency of the MLE: components with higher failure rates
  (and thus higher variance) also have proportionally wider CIs.

- **Coverage near nominal.** Coverage ranges from 93.4% to 96.5% against a
  nominal 95%. Component 4 (93.4%) slightly undercovers, suggesting the Wald
  interval may be mildly liberal for some components at this sample size. This
  is a known finite-sample property of Wald intervals for rate parameters.

- **CI width as a design metric.** Mean CI widths of 0.170--0.180 (about
  $\pm 8.5\%$ of the true rate) provide a concrete basis for sample size
  planning. Since CI width scales as $1/\sqrt{n}$, halving the width requires
  roughly $4\times$ the sample size.

Sensitivity Analysis {#sensitivity}
===================================

The MLE properties depend on several factors: sample size, masking probability,
and right-censoring rate. In this section, we study how these factors affect
estimation accuracy.

## Effect of Masking Probability

The masking probability $p$ controls how much information about the failed
component is obscured. When $p = 0$, only the failed component is in the
candidate set (perfect information). As $p$ increases, more non-failed
components are included, making estimation more difficult.

```{r masking-sensitivity-setup, cache=TRUE, eval=run_long}
set.seed(7231)

# Use smaller sample for sensitivity study
n_sens <- 500
B_sens <- 100

# Masking probabilities to test
p_values <- seq(0, 0.5, by = 0.1)

# Fixed right-censoring (25% censored)
q_sens <- 0.25
tau_sens <- rep(-(1/sum(theta))*log(q_sens), n_sens)

# Storage
mask_results <- list()
```

```{r masking-sensitivity-run, cache=TRUE, warning=FALSE, message=FALSE, eval=run_long}
for (p_idx in seq_along(p_values)) {
    p_curr <- p_values[p_idx]
    est_p <- matrix(NA, nrow = B_sens, ncol = m)

    for (b in 1:B_sens) {
        # Generate data
        comp_b <- matrix(nrow = n_sens, ncol = m)
        for (j in 1:m) comp_b[, j] <- rexp(n_sens, theta[j])
        comp_b <- md_encode_matrix(comp_b, "t")

        data_b <- comp_b |>
            md_series_lifetime_right_censoring(tau_sens) |>
            md_bernoulli_cand_c1_c2_c3(p_curr) |>
            md_cand_sampler()
        data_b$omega <- ifelse(data_b$delta, "exact", "right")

        tryCatch({
            fit_b <- solver(data_b, par = theta0, method = "Nelder-Mead")
            if (fit_b$converged) est_p[b, ] <- fit_b$par
        }, error = function(e) NULL)
    }

    # Compute statistics
    valid_p <- !is.na(est_p[, 1])
    mask_results[[p_idx]] <- list(
        p = p_curr,
        bias = colMeans(est_p[valid_p, , drop = FALSE]) - theta,
        variance = apply(est_p[valid_p, , drop = FALSE], 2, var),
        mse = (colMeans(est_p[valid_p, , drop = FALSE]) - theta)^2 +
              apply(est_p[valid_p, , drop = FALSE], 2, var)
    )
}
```

```{r masking-sensitivity-plot, fig.width=8, fig.height=5}
# Extract bias and MSE for plotting
bias_mat <- sapply(mask_results, function(x) mean(abs(x$bias)))
mse_mat <- sapply(mask_results, function(x) mean(x$mse))

oldpar <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Mean absolute bias vs masking probability
plot(p_values, bias_mat, type = "b", pch = 19, col = "blue",
     xlab = "Masking Probability (p)",
     ylab = "Mean Absolute Bias",
     main = "Bias vs Masking Probability")
grid()

# Mean MSE vs masking probability
plot(p_values, mse_mat, type = "b", pch = 19, col = "red",
     xlab = "Masking Probability (p)",
     ylab = "Mean MSE",
     main = "MSE vs Masking Probability")
grid()

# Restore graphics parameters (on.exit has no effect at the
# top level of a knitr chunk, so restore explicitly)
par(oldpar)
```

```{r masking-sensitivity-table}
mask_df <- data.frame(
    p = p_values,
    Mean_Abs_Bias = bias_mat,
    Mean_MSE = mse_mat,
    Mean_RMSE = sqrt(mse_mat)
)

knitr::kable(mask_df, digits = 4,
             caption = "Effect of Masking Probability on Estimation Accuracy",
             col.names = c("Masking Prob.", "Mean |Bias|", "Mean MSE", "Mean RMSE"))
```

The simulation reveals several patterns in how masking degrades estimation
quality:

- **MSE increases monotonically** from 0.015 ($p = 0$) to 0.062 ($p = 0.5$)
  --- a $4.2\times$ degradation. RMSE roughly doubles from 0.12 to 0.25,
  meaning the typical estimation error doubles when moving from unmasked to
  maximally masked candidate sets.

- **Variance drives the degradation, not bias.** Mean $|\text{bias}|$
  fluctuates between 0.008 and 0.026 with no systematic trend, while variance
  grows steadily with $p$. This is consistent with the theoretical result that
  conditions C1--C3 preserve the unbiasedness of the MLE --- masking erodes
  *precision* without introducing systematic error.

- **The marginal cost of masking accelerates.** MSE roughly doubles from $p = 0$
  to $p = 0.3$ ($0.015 \to 0.033$), then roughly doubles again from $p = 0.3$
  to $p = 0.5$ ($0.033 \to 0.062$). For practitioners, this means moderate
  masking ($p \leq 0.3$) is far less costly than heavy masking.

- **Residual MSE at $p = 0$.** Even with no masking of non-failed components,
  MSE of 0.015 remains due to right-censoring alone --- a useful baseline for
  isolating the censoring contribution.

## Effect of Right-Censoring Rate

Right-censoring reduces the number of exact failure times observed. When a
system is right-censored, we know the system survived beyond the censoring
time, but we don't observe the actual failure time or failed component.
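
For a series system of exponential components, the system lifetime is
$\operatorname{EXP}\bigl(\sum_j \lambda_j\bigr)$, so a censoring time $\tau$
with survival probability $q$ past $\tau$ satisfies
$\exp\bigl(-\tau \sum_j \lambda_j\bigr) = q$, i.e.
$\tau = -\log(q)/\sum_j \lambda_j$. This is the formula used in the setup
chunks; a minimal sketch with illustrative failure rates:

```r
# tau such that a fraction q of systems survive past tau (i.e., are
# right-censored): T_sys ~ EXP(sum(rates)), so
# P(T_sys > tau) = exp(-tau * sum(rates)) = q  =>  tau = -log(q)/sum(rates)
censoring_time <- function(rates, q) -log(q) / sum(rates)

rates <- c(1.0, 1.1, 0.95)         # illustrative failure rates
tau <- censoring_time(rates, 0.25)
exp(-sum(rates) * tau)             # recovers q = 0.25
```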

```{r censoring-sensitivity-setup, cache=TRUE, eval=run_long}
set.seed(7231)

# Censoring quantiles (proportion surviving past tau)
# q = 0.1 means 10% survive past tau (light censoring, ~10% censored)
# q = 0.9 means 90% survive past tau (heavy censoring, ~90% censored)
q_values <- seq(0.1, 0.9, by = 0.1)  # Survival probabilities (matches simulation framework)

# Fixed masking probability
p_cens <- 0.2

# Storage
cens_results <- list()
```

```{r censoring-sensitivity-run, cache=TRUE, warning=FALSE, message=FALSE, eval=run_long}
for (q_idx in seq_along(q_values)) {
    q_curr <- q_values[q_idx]
    tau_curr <- rep(-(1/sum(theta))*log(q_curr), n_sens)
    est_q <- matrix(NA, nrow = B_sens, ncol = m)

    for (b in 1:B_sens) {
        # Generate data
        comp_b <- matrix(nrow = n_sens, ncol = m)
        for (j in 1:m) comp_b[, j] <- rexp(n_sens, theta[j])
        comp_b <- md_encode_matrix(comp_b, "t")

        data_b <- comp_b |>
            md_series_lifetime_right_censoring(tau_curr) |>
            md_bernoulli_cand_c1_c2_c3(p_cens) |>
            md_cand_sampler()
        data_b$omega <- ifelse(data_b$delta, "exact", "right")

        tryCatch({
            fit_b <- solver(data_b, par = theta0, method = "Nelder-Mead")
            if (fit_b$converged) est_q[b, ] <- fit_b$par
        }, error = function(e) NULL)
    }

    # Compute statistics
    valid_q <- !is.na(est_q[, 1])
    cens_rate <- mean(data_b$omega == "right")  # Empirical censoring rate (from the last replicate)
    cens_results[[q_idx]] <- list(
        q = q_curr,
        cens_rate = cens_rate,
        bias = colMeans(est_q[valid_q, , drop = FALSE]) - theta,
        variance = apply(est_q[valid_q, , drop = FALSE], 2, var),
        mse = (colMeans(est_q[valid_q, , drop = FALSE]) - theta)^2 +
              apply(est_q[valid_q, , drop = FALSE], 2, var)
    )
}
```

```{r censoring-sensitivity-plot, fig.width=8, fig.height=5}
# Extract statistics
cens_rates <- sapply(cens_results, function(x) x$cens_rate)
bias_cens <- sapply(cens_results, function(x) mean(abs(x$bias)))
mse_cens <- sapply(cens_results, function(x) mean(x$mse))

oldpar <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Mean absolute bias vs censoring rate
plot(cens_rates * 100, bias_cens, type = "b", pch = 19, col = "blue",
     xlab = "Censoring Rate (%)",
     ylab = "Mean Absolute Bias",
     main = "Bias vs Censoring Rate")
grid()

# Mean MSE vs censoring rate
plot(cens_rates * 100, mse_cens, type = "b", pch = 19, col = "red",
     xlab = "Censoring Rate (%)",
     ylab = "Mean MSE",
     main = "MSE vs Censoring Rate")
grid()

# Restore graphics parameters (on.exit has no effect at the
# top level of a knitr chunk, so restore explicitly)
par(oldpar)
```

```{r censoring-sensitivity-table}
cens_df <- data.frame(
    Cens_Rate_Pct = round(cens_rates * 100, 1),
    Mean_Abs_Bias = bias_cens,
    Mean_MSE = mse_cens,
    Mean_RMSE = sqrt(mse_cens)
)

knitr::kable(cens_df, digits = 4,
             caption = "Effect of Right-Censoring Rate on Estimation Accuracy",
             col.names = c("Censoring %", "Mean |Bias|", "Mean MSE", "Mean RMSE"))
```

The simulation reveals that censoring is the dominant source of information
loss in this model:

- **MSE grows from 0.019 (10% censored) to 0.183 (90% censored)** --- a
  $9.5\times$ degradation, far exceeding the $4.2\times$ range from masking.
  Censoring is more damaging because it eliminates *both* the failure time and
  the candidate set, whereas masking only dilutes the candidate set.

- **Convex MSE growth, accelerating sharply above 70%.** MSE grows gradually
  from 10% to 70% censoring ($0.019 \to 0.059$, roughly linear), then
  accelerates: $0.088$ at 80% and $0.183$ at 90%. The inflection point is
  near 70--80% censoring, beyond which each additional percentage point of
  censoring is dramatically more damaging.

- **Bias remains modest even under extreme censoring.** Mean $|\text{bias}|$
  grows from 0.005 (10% censoring) to 0.034 (90% censoring) but stays small
  relative to the true rates ($\sim 3\%$ at worst). The MLE remains
  approximately unbiased --- it is consistent but increasingly inefficient.

- **Robustness under extreme censoring.** At 90% censoring, only $\sim 50$ of
  500 observations are exact failures, yet the MLE still converges. The
  resulting estimates are imprecise (RMSE $\approx 0.43$, or $\sim 40\%$ of the
  true rates), but the likelihood-based approach remains functional.

```{r save-precomputed, include=FALSE, eval=run_long}
# Save simulation results for future fast vignette builds
saveRDS(list(
    estimates = estimates, se_estimates = se_estimates,
    ci_lower = ci_lower, ci_upper = ci_upper, converged = converged,
    B = B, alpha = alpha,
    mask_results = mask_results, p_values = p_values,
    cens_results = cens_results, q_values = q_values,
    theta = theta, m = m, n_sens = n_sens, B_sens = B_sens, p_cens = p_cens
), "precomputed_results.rds")
```

## Practical Recommendations

Based on these sensitivity analyses:

1. **Prioritize reducing censoring over masking.** Censoring degrades MSE by
   $9.5\times$ across its range versus $4.2\times$ for masking. In experimental
   design, extending the observation window to reduce censoring below 50% yields
   larger gains than improving diagnostic resolution to reduce masking.

2. **Moderate masking is tolerable.** MSE only doubles from $p = 0$ to
   $p = 0.3$. If reducing the masking probability below 0.3 requires expensive
   diagnostic equipment, the incremental benefit may not justify the cost.

3. **The 70% censoring threshold.** Below $\sim 70\%$ censoring, MSE growth is
   roughly linear and moderate. Above 70%, MSE accelerates sharply --- avoid
   experimental designs where more than two-thirds of systems survive past the
   observation window.

4. **CI widths for sample size planning.** At $n = 7500$ with $p = 0.3$ and
   25% censoring, 95% CI widths are $\sim 0.17$ for rates near 1.0. Width
   scales as $1/\sqrt{n}$, so $n = 1875$ gives widths of $\sim 0.34$ (double),
   and $n = 30000$ gives $\sim 0.085$ (half).
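
The $1/\sqrt{n}$ scaling in recommendation 4 can be inverted for sample size
planning: to shrink a CI width from $w_0$ (observed at $n_0$) to a target
$w$, take $n = n_0 (w_0/w)^2$. A minimal sketch using the figures quoted
above:

```r
# Sample size for a target CI width, assuming width ~ 1/sqrt(n):
# w = w0 * sqrt(n0 / n)  =>  n = n0 * (w0 / w)^2
required_n <- function(n0, w0, w_target) ceiling(n0 * (w0 / w_target)^2)

required_n(7500, 0.17, 0.085)  # halving the width quadruples n: 30000
required_n(7500, 0.17, 0.34)   # doubling the width quarters n: 1875
```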

```{r cleanup, include=FALSE}
options(old_opts)
```

