---
title: "Quick start with `radEmu`"
author: "Sarah Teichman and Amy Willis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{quickstart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Setup

First, we will install `radEmu`, if we haven't already. 

```{r, eval = FALSE}
# if (!require("remotes", quietly = TRUE))
#     install.packages("remotes")
# 
# remotes::install_github("statdivlab/radEmu")
```

Next, we can load `radEmu`, as well as `dplyr`.

```{r setup, message = FALSE}
library(radEmu)
library(dplyr)
```

In this vignette we'll explore a [dataset published by Wirbel et al. (2019)](https://www.nature.com/articles/s41591-019-0406-6). This is a meta-analysis of case-control studies of colorectal cancer, meaning that Wirbel et al. collected raw sequencing data from studies other researchers conducted and re-analyzed it.

Wirbel et al. published two pieces of data we'll focus on:

* metadata giving demographics and other information about participants
* a mOTU (metagenomic OTU) table

```{r}
data("wirbel_sample")
# encode control as the baseline level
wirbel_sample$Group <- factor(wirbel_sample$Group, levels = c("CTR","CRC"))
dim(wirbel_sample)
data("wirbel_otu")
dim(wirbel_otu)
```

We can see that we have $566$ samples and $845$ mOTUs. 

While in general we would fit a model to all mOTUs, we are going to subset to some specific genera for the purposes of this tutorial. Let's look at *Eubacterium*, *Porphyromonas*, *Faecalibacteria*, and *Fusobacterium* for now.

```{r}
mOTU_names <- colnames(wirbel_otu)
chosen_genera <- c("Eubacterium", "Faecalibacterium", "Fusobacterium", "Porphyromonas")
mOTU_name_df <- data.frame(name = mOTU_names) %>% 
  mutate(base_name = stringr::str_remove(mOTU_names, "unknown ") %>%
                      stringr::str_remove("uncultured ")) %>%
  mutate(genus_name = stringr::word(base_name, 1))
restricted_mOTU_names <- mOTU_name_df %>%
  filter(genus_name %in% chosen_genera) %>%
  pull(name)
```

Again, while we would generally fit a model using all of our samples, for this tutorial we are only going to consider data from a case-control study from China. 

```{r}
ch_study_obs <- which(wirbel_sample$Country %in% c("CHI"))
```

Next, we want to confirm that all samples have at least one non-zero count across the categories we've chosen and that all categories have at least one non-zero count across the samples we've chosen.

```{r}
small_Y <- wirbel_otu[ch_study_obs, restricted_mOTU_names]
sum(rowSums(small_Y) == 0) # no samples have a count sum of 0 
sum(colSums(small_Y) == 0) # one category has a count sum of 0

category_to_rm <- which(colSums(small_Y) == 0)
small_Y <- small_Y[, -category_to_rm]
sum(colSums(small_Y) == 0)
```

## Differential abundance analysis with radEmu

We now are ready to fit a `radEmu` model with the `emuFit()` function! This will take 1-2 minutes to run, so we suggest you start running it and then read below about the arguments!

```{r}
mod <- emuFit(data = wirbel_sample[ch_study_obs, ],
              Y = small_Y,
              formula = ~ Group + Gender, 
              test_kj = data.frame(k = 2, j = 1))
```

Some of the important arguments for `emuFit()` are the following: 

  - `formula`: This is a formula telling radEmu what predictors to use in its model. We are using Group, which is an indicator for case (CRC) vs control (CTR), and Gender. 
  - `data`: A data frame containing covariate information about our samples. 
  - `Y`: A matrix containing our observed abundance data (e.g., counts or depth measurements). The rows give the observations (samples), and the columns give the categories (taxa/mOTUs). Note that `Y` doesn't have to be integer-valued (counts)! 
  - `test_kj`: a data frame describing which robust score tests you want to run. `k` corresponds with the predictor(s) that you want to test (if you don't know what order your predictors appear in your design matrix, use the function `make_design_matrix()` with your `formula` and `data` to check!) and `j` corresponds with the taxa that you want to test (check `colnames(Y)` to see the order of the taxa). If you wanted to test the Group covariate for all taxa then you could set `test_kj = data.frame(k = 2, j = 1:ncol(Y))`. 
  - `verbose`: do you want the code to give you updates as it goes along? 
  
In the example above we only test the first taxon, because each test for a data set this size takes approximately 30 seconds. Typically you'll want to test all taxa that you are interested in, so you may need to leave this running for a few hours, or check out the "parallel_radEmu" vignette to see how you could parallelize these score tests if you have access to additional computing resources. 
  
Now that we've fit the model and run a robust score test for the first taxon, we can look at the results! The parameter estimates and any test results are in the `coef` part of the `emuFit` object. 

```{r}
head(mod$coef)
```

The first taxon in our model is *Fusobacterium nucleatum s. vincentii*, which has a log fold-difference estimate of $1.49$. We will interpret this to mean that we expect that *Fusobacterium nucleatum s. vincentii* is $exp(1.49) \approx 4.44$ times more abundant in cases of colorectal cancer compared to controls of the same gender, when compared to the typical fold-differences in abundances of the taxa in this analysis. 

We could similarly interpret the log fold-differences for each taxon in our analysis. 

For *Fusobacterium nucleatum s. vincentii* we have a robust score test p-value of $0.04$. This means that we do have enough evidence to reject the null hypothesis that the fold-difference in abundance of *Fusobacterium nucleatum s. vincentii* between cases and controls is the same as the typical fold-difference in abundance between cases and controls across all taxa in this analysis. When we say ``typical" here we mean approximately the median fold-difference across taxa.  

Now you are ready to start using `radEmu`! We recommend our other vignettes for a deeper look using `radEmu`, including for `phyloseq` or `TreeSummarizedExperiment` objects, with clustered data, and in parallel. 