---
title: "polySegratioMM: An R library for Bayesian mixture models for marker dosage in autopolyploids"
author: "Peter Baker"
date: "`r format(Sys.time(), '%B %d, %Y')`"
bibliography: polySegratioMM-overview.bib
output: bookdown::html_document2
vignette: >
  %\VignetteIndexEntry{ polySegratioMM: An R library for Bayesian mixture models for marker dosage in autopolyploids }
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#:",
  fig.path = "man/figures/"
)
## output: BiocStyle::html_document
##version <- as.vector(read.dcf('DESCRIPTION')[, 'Version'])
##version <- gsub('-', '.', version)
version <- "0.6-5"
```

Version: `r version`

It is well known that the dosage level of markers in autopolyploids
and allopolyploids can be characterised by their observed segregation
ratios. The related package \texttt{polySegratio} provides functions
to allocate dosage by standard approaches and to simulate marker data
sets for differing ploidies and levels of overdispersion. Note that
these methods could equally well be applied to allopolyploids with
specified expected segregation ratios. For details see
\texttt{polySegratio}.

A Bayesian approach was proposed by @baker2010 where marker
dosage estimation was obtained by fitting a finite mixture
distribution.

This library calls the `JAGS` software for Bayesian
calculation. `JAGS 1.0` or higher must be installed following
instructions from https://mcmc-jags.sourceforge.io. Note that
only the most recent version is used for testing with `R`.  The
`JAGS` executable must be in your path. Currently, no checking
is carried out to ascertain whether or not `JAGS` is set up
appropriately.

To use the library, you need to attach it with
```{r}
library(polySegratioMM)
```

```{r, echo=FALSE}
##library(cacheSweave)  #  comment this out after development phase
## to use:    Sweave("polySegratioMM-overview.Rnw", driver=cacheSweaveDriver)
## or without cache: Sweave("polySegratioMM-overview.Rnw")
op <- options()
options(width=70, digits=4)
``` 

# Simulated data 

Library functions are demonstrated on a simulated data set generated
using the `sim.autoMarkers` function from the
`polySegratio` package.

```{r, echo=FALSE}
##<<simData, cache=true}
## simulate small autohexaploid data set of 500 markers for 200 individuals
## with %70 Single, %20 Double and %10 Triple Dose markers
## created with 
## hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200)
## save(hexmarkers, file="../../data/hexmarkers.RData")
data(hexmarkers)
``` 

The following `R` code can be used to generate 500 markers for
200 autohexaploid individuals exhibiting overdispersion with the
parameter `shape1 = 25`. The underlying percentages of single
double and triple dose markers are 70%, 20% and 10%, respectively.

```{r, eval=FALSE}
  hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200)
```

```{r}
print(hexmarkers)
``` 

Note that the segregation ratios for simulated or real data may be
extracted by using `segregationRatios` which sets up the
appropriate objects for testing marker dosage and plotting or
summarising the marker data.

```{r}
sr <-  segregationRatios(hexmarkers$markers)
```

```{r, sim1, echo=FALSE, fig.cap='Segregation ratios of 500 simulated markers from 200 autohexaploid individuals. Percentages of single, double, and triple dosemarkers are 70%, 20% and 10%, respectively. Data were generated assuming no overdispersion', out.width='50%'}
plotTheoretical(ploidy.level=6, seg.ratios=sr,
  expected.segratio=NULL, proportions=c(0.7,0.2,0.1),
  n.individuals=200)
```

For instance, as seen in Figure \@ref(fig:sim1), segregation ratios may be
plotted with
```{r, eval=FALSE}
plotTheoretical(ploidy.level=6, seg.ratios=sr,
  expected.segratio=NULL, proportions=c(0.7,0.2,0.1),
  n.individuals=200)
```

On the other hand, consider a similar data set that exhibits
overdispersion. This may be simulated as follows 

```{r, eval=FALSE}
hexmarkers.overdisp <- sim.autoMarkers(6, c(0.7,0.2,0.1),
                                       n.markers=500,n.individuals=200,
                                       overdispersion=TRUE, shape1=30)
```

Markers 

```{r}
sr.overdisp <- segregationRatios(hexmarkers.overdisp$markers)
```

```{r, sim2, echo=FALSE, fig.cap='Segregation ratios of 500 simulated markers from 200 autohexaploid individuals.  Percentages of single double and triple dose markers are 70%, 20% and 10%, respectively. Data were generated from the Beta-Binomial distribution assuming a shape parameter `shape1` of 30.', out.width='50%'}
print(plotTheoretical(ploidy.level=6, seg.ratios=sr.overdisp, main="",
  expected.segratio=NULL, proportions=c(0.7,0.2,0.1),
  n.individuals=200))
```

The histogram of marker segregation ratios, which is a useful
graphical method for identifying overdispersion or outliers, is seen
in Figure~\@ref(fig:sim2). Note that, due to overdispersion the
theoretical distribution is narrower than the observed data.

# A Bayesian mixture model approach

For the j^th^ marker $j=1 \dots n$, we assume the observed number
$r_j$ of dominant markers out of $N_j$ lines follows a binomial
distribution denoted $Bin(N_j, P_k)$.If we knew the dosage $k$
then, following @ripol99, the expected value of $P_k$ may be written
as

$$
P_k(k| m, x) = 1 - {{{m-k} \choose {mx}} \over {{m} \choose {mx}}}, k=0 \ldots m/2(\#eq:ripol1)
$$

here $m$ is the ploidy level or number of homologous chromosomes and
the monoploid number $x$ is the number of chromosomes in a basic
set. Note that for diploids $m=2$, tetraploids $m=4$ , octaploids then
$m=8$ and so on and also that if there are no marker data missing then
$N_j$ is simply the number of progeny.

Since the dosage of each marker is unknown, we rely on the missing
data representation of @dempster77 and @tannerandwong87
which is commonly adopted for MCMC computation in finite mixture
models.  An indicator variable $z_j$ corresponding to unknown marker
dosage class $k$ is introduced where $z_j = k$ if the marker has dose
$k$. For the $K$ components with $K \leq m/2$, consider the logit
transformation of the true segregation proportions $P_k$ for dose $k,
k=1 \dots K$. The the logit transformed segregation ratio $ \omega_k$
is then

$$
\omega_k = \log({ {P_k}\over{1-P_k} })(\#eq:logit)
$$
  
Let $z=(z_1 \ldots z_n)^T$ be a vector of unknown dosages (labelled
$1,2 \dots K$ corresponding to simplex, duplex, triplex markers and so
on), then $r_j$ is binomially distributed with known size parameter
$N_j$ and unknown proportion parameter $\omega_{Z_j}$ which is the
segregation ratio for marker dosage $z_j$. Hence, given marker dosage
$z_j$ then

$$
\begin{eqnarray}
  r_j|z_j & \sim & Bin \left( N_j, \omega_{Z_j} \right), \\
  \mbox{ where } && \nonumber\\
  \mbox{logit}( \omega_{Z_j} ) & = & \log ( { \omega_{Z_j} \over 1 -  \omega_{Z_
j} } )
  \sim N( \mu_{Z_j}, \tau_{Z_j}^{-1})
              \nonumber
\end{eqnarray} (\#eq:bin-mod1)
$$

where $\mu_k$ and $\tau_k$ are the mean and precision $(\tau_k = 1/\sigma^2_k)$
of marker dosage class $k$ on the logit scale.

Since the dosage is unknown, for the autohexaploid data generated here
then for the logit($\omega_{z_k}$) can be modelled as a finite mixture
of 3 normals

$$\mbox{logit}( \omega_{Z_j} ) \sim \pi_1 N(\mu_1,\tau_1^{-1}) + 
                                    \pi_2 N(\mu_2,\tau_2^{-1}) + \dots +
                              \pi_K N(\mu_K,\tau_K^{-1})(\#eq:3normals)$$

where $\mu_k$ is the mean and $\tau_k$ is the precision of component
$k$ on the logit scale, and $\pi_k$ are the mixing proportions of the
three components with $\sum_{k=1}^K\pi_k = 1$. The probability density
function $f(x)$ of logit($\omega_k$) is

$$f(x) =  \sum_{k=1}^K \pi_k \phi(x | \mu_k, \tau^{-1}_k)$$

where $\phi$ is the normal cumulative distribution function with
parameters mean $\mu_k$ and variance $\sigma^2_k =\tau_k^{-1}$.

Simulation studies suggested that incorporating strong prior
information, such as the expected distributions of @haldane30
provided the best method of allocating dosage. Further details may be
found in @baker2010.

# Specifying a model

A mixture model may be set up with `setModel`. By default, only
two parameters are required, namely the `ploidy.level` or the
number of homologous chromosomes set either as a numeric or as a
character string and also `n.components` or the number of
components for mixture model (less than or equal to maximum number of
possible dosages). By default, strong priors are set by using the
formulae of @haldane30 for the expected numbers and ratios of
offspring for various parental configurations of autopolyploids. 

For the autohexaploid data generated above, the models are set with
```{r}
x.mod1 <- setModel(3,6)  # autohexaploid model with 3 components
``` 

The `R` object `x.mod1` contains components describing
aspects of the model such as the number of components, ploidy,
expected segregation ratios and so on. Note that the `str`
command is useful for displaying the internal structure of any
`R` object.

# Fitting a mixture model

While various options are available for fine tuning the MCMC process,
the simplest way to fit a mixture model to allocate marker dosages is
with the wrapper function `runSegratioMM} as follows:

```{r, eval=FALSE}
mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1)
``` 
which automatically determines starting values, priors, length of
burn in, number of iterations, and other parameters as well as
producing summary statistics and diagnostic plots. 

```{r, echo=FALSE}
## produced using the following but loaded as data to avoid the run time on slow machines
##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, burn.in=200, sample=500, plots=FALSE)
##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, plots=FALSE)
## save(mcmcHexRun, file="../../data/mcmcHexRun.RData")
data(mcmcHexRun)
``` 

To run `JAGS` without producing plots then set the
`plots` option to `FALSE`. For the overdispersed data
running this command produced the following selected output. While
selected output is printed here the simple command
`print(mcmcHexRun)` whould produce the following output and
more.

The summary of processing times:
```{r}
print(mcmcHexRun$run.jags)
```

And summary statistics for the posterior distributions of selected
parameters: 
```{r}
print(mcmcHexRun$summary)
``` 

Note that MCMC convergence diagnostic output is produced
automatically. Assessing convergence is crucial in MCMC and poor
convergence may result in mis-allocated marker dosages. The diagnostic
statistics indicate that convergence was achieved.
```{r}
print(mcmcHexRun$diagnostics)
```

And finally, summaries of marker dosage allocations are produced:
```{r}
print(mcmcHexRun$doses)
``` 

Note that simply plotting `mcmcHexRun` will produce a histogram
of segregation proportions and the fitted model but that other plots
are easily produced.

<!-- NB: new library default - `sigma` rather than `sigma[1]`. -->

```{r, trace1, fig.cap='Trace and posterior density plots for the parameters parameters $p_1$, $\\mu_1$, $\\sigma_1$ and for the 140^th^ marker for the overdispersed data', fig.height=12, out.width='100%'}
plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")])
```

When the `plots` option of `runSegratioMM` is set to the
default value of `TRUE`, numerous plots are produced including
trace and density plots from the `CODA` package. These may also
be extracted manually but the process is somewhat more
complicated. For instance, to obtain trace and density plots for the
parameters $p_1$, $\mu_1$, $\sigma_1$ and for the 140^th^ marker,
as shown in Figure \@ref(fig:trace1), then `CODA` may be used
directly by following command:
```{r, eval=FALSE}
plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")])
```

```{r, fitted1, echo=FALSE, fig.cap='Fitted (blue) and theoretical (red) distributions for simulated segregation ratios with overdispersion for 500 markers from 200 individuals.', out.width='50%'}
plot(mcmcHexRun, theoretical=TRUE, main="")
``` 
The histogram of segregation proportions with fitted and theoretical
values shown in Figure \@ref(fig:fitted1) may be obtained by setting
the `theoretical` option to `TRUE` as follows.
```{r, eval=FALSE}
plot(mcmcHexRun, theoretical=TRUE)
```

# Assigning marker dosage

Marker dosages allocations may be obtained directly from the object
`mcmcHexRun`. The dosage with maximum posterior probability is
simply `mcmcHexRun\$doses\$max.post.dosage`. A more conservative
allocation is obtained by using `mcmcHexRun\$doses\$dosage[,"0.8"]`
whereby the dosage with posterior probability over 0.8 is
employed. For instance, to tabulate the number of markers (including
those not allocated a dosage which are labelled NA) the `table`
command can be employed.

```{r}
cat("Employing maximum posterior probability\n")
table(Dose=mcmcHexRun$doses$max.post.dosage, exclude=NULL)
cat("Employing posterior probability > 0.8\n")
table(Dose=mcmcHexRun$doses$dosage[,"0.8"], exclude=NULL)
```

And of course since the data were simulated we can compare the
estimated and true dosages obtained as
`hexmarkers.overdisp\$true.doses\$dosage` via cross tabulation. Doses can
also be obtained for the standard$\chi^2$ test by using the
`test.segRatio` command from the `polySegratio` library.

```{r}
cat("Employing theChi squared test\n")
dose.chi <- test.segRatio(sr.overdisp, ploidy.level = 6)
table(Chi2Dose=dose.chi$dosage,
  True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL)
cat("Employing maximum posterior probability\n")
table(MixtureDose=mcmcHexRun$doses$max.post.dosage,
  True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL)
cat("Employing posterior probability > 0.8\n")
table(MixtureDose=mcmcHexRun$doses$dosage[, "0.8"],
  True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL) 
```

These tables show that far fewer markers are allocated a dosage using
the standard $\chi^2$ test than by the mixture model. Fewer markers were
misclassified using a posterior probability threshold of 0.8 rather than
the maximum posterior probability as a basis for allocating dosage.

\bibliographystyle{apalike}

## Acknowledgments

Karen Aitken, given her experience in tetraploids and sugarcane marker
maps, has provided many valuable insights into marker dosage in
autopolyploids. Additionally, Ross Darnell, Andrew George and Kerrie
Mengersen provided useful comments and discussions.

## References