---
title: "Introduction to zmctp: Zero-Modified Complex Tri-parametric Pearson Distribution"
author: "Rasheedat Oladoja"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to zmctp: Zero-Modified Complex Tri-parametric Pearson Distribution}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
library(zmctp)

# Flag: skip slow chunks on CRAN (they time out during R CMD build)
NOT_CRAN <- !identical(Sys.getenv("NOT_CRAN"), "true")
```

# Abstract

The **zmctp** package extends the Complex Tri-parametric Pearson (CTP) distribution with zero-modified versions for overdispersed count data. This package addresses limitations of existing implementations when the parameter $b$ approaches zero, providing distribution functions, maximum likelihood estimation, and diagnostic tools based on Rodríguez-Avi et al. (2003).

# 1. Introduction

## 1.1 Background

In count data analysis, the Poisson regression model is the standard approach when successive events occur independently at the same rate. A unique feature of the Poisson distribution is **equi-dispersion**: the mean and variance are equal. However, real-world data often violates this assumption:

- **Overdispersion**: Variance > Mean (dispersion parameter > 1)
- **Underdispersion**: Variance < Mean (dispersion parameter < 1)

When overdispersion occurs, the Negative Binomial (NB) model is commonly recommended. For cases with excess zeros, Zero-Inflated models (ZIP, ZINB) are used. However, these models have limitations, particularly when handling complex dispersion patterns.

## 1.2 The Complex TriParametric Pearson (CTP) Distribution

The CTP distribution, introduced by Rodríguez-Avi et al. (2003), belongs to the Pearson discrete distribution family and is derived from the Gaussian Hypergeometric Probability Distribution (GHPD). Unlike Poisson-based mixtures, the CTP provides a fundamentally different approach to modeling count data.

### Mathematical Definition

The CTP distribution has probability mass function:

$$f(x|a,b,\gamma) = f_0 \frac{\Gamma(a+ib+x) \Gamma(a-ib+x)}{\Gamma(\gamma+x) \cdot x!}, \quad x = 0, 1, 2, ...$$

where:
- $a \in \mathbb{R}$ (real parameter, can be negative)
- $b \geq 0$ (imaginary parameter)
- $\gamma > 2a + 2$ (shape parameter, ensures variance exists)
- $i$ is the imaginary unit
- $f_0$ is the normalizing constant:

$$f_0 = \frac{\Gamma(\gamma-a-ib)\Gamma(\gamma-a+ib)}{\Gamma(\gamma)\Gamma(\gamma-2a)}$$

### Moments

The distribution has closed-form expressions for its moments:

**Mean:**
$$\mu = \frac{a^2 + b^2}{\gamma - 2a - 1}$$

**Variance:**
$$\sigma^2 = \frac{\mu(\mu + \gamma - 1)}{\gamma - 2a - 2}$$

**Dispersion:**
- Equidispersed if $a = -\frac{\mu + 1}{2}$
- Underdispersed if $a < -\frac{\mu + 1}{2}$
- Overdispersed if $a > -\frac{\mu + 1}{2}$

Notably, the CTP is overdispersed whenever $a \geq 0$.

## 1.3 Why zmctp?

### Problem with Existing Implementations

The `cpd` package (available on CRAN) implements the CTP distribution but has a significant limitation: **when fitting data with large sample sizes or zero-inflation, it often estimates $b \approx 0$**, which reduces the model to a simpler form and loses flexibility.

### The zmctp Solution

This package addresses these limitations through:

1. **Robust optimization** with reparameterization ($\gamma = 2a + 2 + e^\eta$)
2. **Zero-Modified CTP** for excess/deficit zeros
3. **Better handling** of small $b$ values
4. **Comprehensive diagnostics** and visualization tools

# 2. Basic Usage

## 2.1 Distribution Functions

The package provides standard distribution functions following R conventions:
```{r basic-dist}
# Probability mass function
dctp(0:5, a = 1, b = 0.5, gama = 5)

# Cumulative distribution function
pctp(3, a = 1, b = 0.5, gama = 5)

# Quantile function (inverse CDF)
qctp(c(0.25, 0.5, 0.75), a = 1, b = 0.5, gama = 5)

# Random generation
set.seed(123)
x <- rctp(30, a = 1, b = 0.5, gama = 5)
cat("Sample mean:", mean(x), "\nSample variance:", var(x), "\n")
```
## 2.2 Checking for Overdispersion
```{r check-dispersion}
# Generate overdispersed data
set.seed(456)
x_over <- rctp(20, a = 1.2, b = 0.3, gama = 6)
head(x_over)
# Calculate dispersion index
dispersion_index <- var(x_over) / mean(x_over)
cat("Dispersion Index:", dispersion_index, "\n")

if (dispersion_index > 1) {
  cat("Data is OVERDISPERSED\n")
} else if (dispersion_index < 1) {
  cat("Data is UNDERDISPERSED\n")
} else {
  cat("Data is EQUIDISPERSED\n")
}
```
## 2.3 Fitting the CTP Model
```{r fit-ctp, eval=FALSE}
# Fit CTP model to moderate-sized data
set.seed(456)
x_over <- rctp(200, a = 1.2, b = 0.3, gama = 6)
fit_ctp <- ctp.fit(x_over)
print(fit_ctp)
```
```{r fit-ctp-output, echo=FALSE, comment=""}
cat("CTP Distribution - Maximum Likelihood Estimates
===============================================

Parameter Estimates:
     Estimate Std.Error
a      1.4593    0.3585
b      0.0011    0.0000
gama   8.5912    3.1878

Goodness-of-Fit Statistics:
  Log-Likelihood: -178.1003
  AIC: 362.2006
  BIC: 372.0956
  Pearson Chi-sq: 19.6799
  Wald Chi-sq: 189.8432

Convergence: YES")
```
## 2.4 Diagnostic Plots
```{r plot-ctp, fig.height=6, eval=FALSE}
plot(fit_ctp, type = "frequency")
plot(fit_ctp, type = "cdf")
plot(fit_ctp, type = "qq")
```
# 3. Zero-Modified CTP Distribution

## 3.1 When to Use ZM-CTP

Use the Zero-Modified CTP when:
  
  - Data has **excess zeros** (zero-inflation)
- Data has **deficit of zeros** (zero-deflation)
- Standard CTP estimates $b \approx 0$
  - Over-dispersion is caused by zero-inflation

## 3.2 Mathematical Definition

The ZM-CTP probability mass function is:
  
  $$P(X=0) = \omega + (1-\omega) P_{CTP}(0)$$
  $$P(X=k) = (1-\omega) P_{CTP}(k), \quad k > 0$$
  
  where $0 < \omega < 1$ is the zero-modification parameter.

**Interpretation:**
  - $\omega > 0.5$: Zero-inflation
- $\omega < 0.5$: Zero-deflation
- $\omega \approx 0$: Standard CTP is sufficient

## 3.3 Example: Zero-Inflated Data
```{r zm-ctp-example, eval=FALSE}
# Generate zero-inflated data

x_zi <- rzictp(300, a = 1, b = 0.5, gama = 6, omega = 0.3)
cat("Proportion of zeros:", mean(x_zi == 0), "\n")
cat("Expected P(X=0) under CTP:", dctp(0, 1, 0.5, 6), "\n")
fit_ctp <- ctp.fit(x_zi)
fit_zm <- zictp.fit(x_zi)
summary(fit_zm)
cat("Standard CTP AIC:", fit_ctp$AIC, "\n")
cat("ZM-CTP AIC:", fit_zm$AIC, "\n")
cat("Omega estimate:", fit_zm$estimates["omega"], "\n")
```
```{r, echo=FALSE, comment=""}
cat("Proportion of zeros: 0.7966667 
Expected P(X=0) under CTP: 0.7570218 

=== Zero-Modified CTP Distribution Fit Summary ===


Zero-Modified CTP Distribution - Maximum Likelihood Estimates
=============================================================

Parameter Estimates:
      Estimate Std.Error
a       2.5646        NA
b       0.0017        NA
gama   16.8737        NA
omega   0.4656        NA

Goodness-of-Fit Statistics:
  Log-Likelihood: -216.5545
  AIC: 441.1090
  BIC: 455.9241
  Pearson Chi-sq: 2.7242
  Wald Chi-sq: 287.6122

Convergence: YES 

--- Moments ---
  Mean: 0.3271 (empirical: 0.3267)
  Variance: 0.6467 (empirical: 0.6220)
  P(X=0): 0.7967 (empirical: 0.7967)

--- Observed vs Expected Frequencies ---
  x Freq    expected
1 0  239 239.0001656
2 1   40  38.7105351
3 2   11  13.7597381
4 3    5   5.0633781
5 4    4   1.9722958
6 5    1   0.8143692

Standard CTP AIC: 440.03 
ZM-CTP AIC: 441.109 
Omega estimate: 0.4656324")
```
# 5. Comparison with cpd Package

## 5.1 The Problem with Large Sample Sizes

From Oladoja (2021):
  
> "Data sets previously utilized while fitting this model have sample sizes that are less than 2000. In this study, the CTP was fitted to different over-dispersed and under-dispersed count data sets with both small and large sample sizes."

> "The 'cpd' package in R... estimated the parameter b as 0 for cases where the sample size n is large."

## 5.2 Demonstration
```{r cpd-comparison, eval=FALSE}
# Install cpd if needed
# install.packages("cpd")

library(zmctp)

# Generate data where cpd struggles
set.seed(100)
x_problem <- rzictp(2000, a = 1, b = 0.001, gama = 8, omega = 0.2)

# Compare results
cpd_fit <- cpd::fitctp(x_problem, astart=1, bstart=0.001, gammastart=8)
zmctp_fit <- zictp.fit(x_problem)
cat("cpd b estimate:", cpd_fit$coefficients[2], "\n")
cat("zmctp b estimate:", zmctp_fit$estimates["b"], "\n")
cat("zmctp omega estimate:", zmctp_fit$estimates["omega"], "\n")
```
```{r cpd-comparison-result, echo=FALSE, comment=""}
cat("
cpd b estimate: 8.072796e-07 
zmctp b estimate: 0.0007942037 
zmctp omega estimate: 0.04831814")
```
# 6. Model Selection Guide

## 6.1 Decision Tree
```
Is your data count data?
  ├─ NO → Use appropriate model for your data type
└─ YES → Calculate Dispersion Index (DI = variance/mean)
├─ DI ≈ 1 → Poisson
├─ DI > 1 (Over-dispersed)
│   ├─ Excess zeros? → Try ZIP, ZINB, ZM-CTP
│   ├─ Excess non-zeros? → Try CTP, NB
│   └─ Large n & b→0 in cpd? → Use zmctp!
  └─ DI < 1 (Under-dispersed) → Try CTP
```
# 7. Properties and Theoretical Results

## 7.1 Skewness

The CTP distribution has **positive skewness**:
  
  $$\mu_3 = \frac{(a^2 + b^2)[4b^2 + (\gamma - 1)^2] + [b^2 + (\gamma - 1 - a)^2]}{(\gamma - 2a - 1)^3(\gamma - 2a - 2)(\gamma - 2a - 3)}$$
  
  Since both numerator and denominator are always positive, the distribution is right-skewed.

## 7.2 Mode

The distribution is uni-modal at:
  
  $$y = \left\lfloor \frac{(a-1)^2 + b^2}{\gamma - 2a + 1} \right\rfloor$$
  
  ## 7.3 Convergence
  
  The CTP distribution converges to:
  - **Poisson** when $\gamma$ and $a^2 + b^2 \rightarrow \infty$ at the same rate
- **Normal** when $\gamma$ and $\sqrt{a^2 + b^2}$ have the same order of convergence
# 9. Conclusion

The **zmctp** package provides a robust implementation of the CTP distribution that:
  
  ✅ Handles both over-dispersion and under-dispersion  
✅ Works with large sample sizes (n > 2000)  
✅ Addresses the b → 0 problem in existing implementations  
✅ Provides zero-modified variants for zero-inflated data  
✅ Includes comprehensive diagnostics and visualization  

## 9.1 Key Findings from Oladoja (2021)

> "The CTP produced similar fit to the NB model while correcting for over-dispersion caused by excess zeros but is a better fit than the NB when over-dispersion is caused by excessive non-zero values in a data set."

> "The CTP model behaved better with under-dispersed data sets than the Poisson model."

> "The package 'cpd' in R estimates b as 0 for zero-inflated data sets while the program utilized in this study works well."

# References

Böhning, D., Dietz, E., & Schlattmann, P. (1999). The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. *Journal of the Royal Statistical Society: Series A*, 162(2), 195-209.

Deb, P., & Trivedi, P. K. (1997). Demand for medical care by the elderly: a finite mixture approach. *Journal of Applied Econometrics*, 12(3), 313-336.

Oladoja, R. O. (2021). Complex Tri-Parametric Pearson Distribution and Its Applications. *M.Sc. Thesis*, Kwara State University, Malete, Nigeria.

Rodríguez-Avi, J., Conde-Sánchez, A., & Sáez-Castillo, A. J. (2003). A new class of discrete distributions with complex parameters. *Statistical Papers*, 44(1), 67-88. https://doi.org/10.1007/s00362-002-0134-7

# Session Info
```{r session-info}
sessionInfo()
```


