---
title: "Introduction to taxodist"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to taxodist}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
library(taxodist)
```

## What is taxodist?

`taxodist` answers a simple question: *how related are any two living things?*

Given any two taxon names (a pair of dinosaurs, a dinosaur and a fungus, two
species of fly, or an oak tree and a human), `taxodist` retrieves their full
hierarchical lineages from [The Taxonomicon](http://taxonomicon.taxonomy.nl) and
computes a dissimilarity index between them.

The Taxonomicon is based on *Systema Naturae 2000* (Brands, 1989 onwards) and
provides exceptionally deep lineage resolution, substantially deeper than most
other programmatic sources.

Searches work at any taxonomic level: genus, species, family, order, or
any clade. Both `"Tyrannosaurus"` and `"Tyrannosaurus rex"` are valid inputs, as
are `"Drosophila melanogaster"`, `"Homo sapiens"`, or `"Araucaria angustifolia"`.

---

## The distance metric

`taxodist` measures how related two taxa are by asking a single question:
*how deep is their most recent common ancestor?*

$$d(A, B) = \frac{1}{\text{depth}(\text{MRCA}(A,B))}$$

The deeper the shared ancestor, the smaller the distance, meaning the more
related the two taxa are. A shallow MRCA (close to the root) means the two
taxa diverged early and are distantly related; a deep MRCA means they share
a long common history and are closely related.
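
As a minimal sketch of the formula, assuming lineages are root-to-tip character vectors like those returned by `get_lineage()`, the distance can be computed in base R. The helper names `toy_mrca_depth()` and `toy_distance()` and the shortened lineages are illustrative assumptions; they are not part of `taxodist` and need not match its internals.

```{r sketch-metric}
# Hypothetical helper: the MRCA depth is the length of the shared
# root-to-tip prefix of the two lineages.
toy_mrca_depth <- function(lin_a, lin_b) {
  n <- min(length(lin_a), length(lin_b))
  same <- lin_a[seq_len(n)] == lin_b[seq_len(n)]
  # Position of the first mismatch minus one, or n if there is no mismatch.
  if (all(same)) n else which(!same)[1] - 1L
}

# d(A, B) = 1 / depth(MRCA(A, B))
toy_distance <- function(lin_a, lin_b) 1 / toy_mrca_depth(lin_a, lin_b)

# Shortened, made-up lineages (real ones have dozens of nodes).
trex  <- c("Biota", "Animalia", "Dinosauria", "Theropoda", "Tyrannosaurus")
velo  <- c("Biota", "Animalia", "Dinosauria", "Theropoda", "Velociraptor")
human <- c("Biota", "Animalia", "Mammalia", "Homo")

toy_distance(trex, velo)   # MRCA at depth 4, so 1/4
toy_distance(trex, human)  # MRCA at depth 2, so 1/2
```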

This has a key property: all taxa that split from a given taxon at the same
node are equidistant from it, regardless of how many nodes each has in its
lineage below the split. For example:

- *Tyrannosaurus* and *Velociraptor* are both Tetanurae: they diverged from
  *Carnotaurus* (Ceratosauria) at the same node (Averostra), so both have
  exactly the same distance to *Carnotaurus*;

- All dinosaurs diverged from the mammal lineage at the same
  node (Amniota), so *Homo sapiens* is equally distant from *Tyrannosaurus*, *Triceratops*, *Carnotaurus* and *Cyanocorax*.
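
The equidistance property can be checked with a small base R sketch. The lineages below are shortened and made up for illustration, and `toy_distance()` is a hypothetical helper, not a `taxodist` function.

```{r sketch-equidistant}
# Hypothetical helper: 1 / length of the shared root-to-tip prefix.
toy_distance <- function(a, b) {
  n <- min(length(a), length(b))
  # cumprod() drops to 0 at the first mismatch, so the sum counts leading matches.
  1 / sum(cumprod(a[seq_len(n)] == b[seq_len(n)]))
}

# The two tetanurans have different numbers of nodes below the split.
carno <- c("Biota", "Dinosauria", "Theropoda", "Averostra",
           "Ceratosauria", "Carnotaurus")
trex  <- c("Biota", "Dinosauria", "Theropoda", "Averostra",
           "Tetanurae", "Tyrannosaurus")
velo  <- c("Biota", "Dinosauria", "Theropoda", "Averostra",
           "Tetanurae", "Maniraptora", "Dromaeosauridae", "Velociraptor")

toy_distance(carno, trex)  # 1/4: split at Averostra
toy_distance(carno, velo)  # 1/4: same split, despite the longer lineage
```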

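For two distinct taxa the distance lies in $(0, 1]$: it equals 1 when the only
shared node is the root (depth 1) and shrinks toward 0 as the MRCA gets deeper.
Its value therefore depends on how finely The Taxonomicon's classification
resolves the relevant part of the tree. Deeper, more finely resolved
clades will have smaller distances between their members.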

---

## Basic usage

### Getting a lineage

```{r lineage}
lin <- get_lineage("Tyrannosaurus")
tail(lin, 8)
#> [1] "Avetheropoda"     "Coelurosauria"    "Tyrannoraptora"   "Tyrannosauroidea"
#> [5] "Tyrannosauridae"  "Tyrannosaurinae"  "Tyrannosaurini"   "Tyrannosaurus" 
```

Species-level searches also work:

```{r lineage-species}
lin <- get_lineage("Drosophila melanogaster")
tail(lin, 4)
#> [1] "Ephydroidea"             "Drosophilidae"           "Drosophilinae"          
#> [4] "Drosophila melanogaster"
```

### Computing distance between two taxa

```{r distance}
result <- taxo_distance("Tyrannosaurus", "Velociraptor")
print(result)
#> -- Taxonomic Distance --
#> 
#> * Tyrannosaurus vs Velociraptor
#>   Distance : 0.0153846153846154
#>   MRCA : Tyrannoraptora (depth 65)
#>   Depth A : 70
#>   Depth B : 73
```

The distance grows as the MRCA gets shallower, for example between a dinosaur and a mammal, or between a bacterium and a human:

```{r distance-far}
taxo_distance("Tyrannosaurus", "Homo")$distance        # 0.02777778
taxo_distance("Tyrannosaurus", "Drosophila")$distance  # 0.06666667
taxo_distance("Tyrannosaurus", "Quercus")$distance     # 0.25
taxo_distance("Escherichia", "Homo")$distance          # 1
```

### Finding the most recent common ancestor

```{r mrca}
mrca("Tyrannosaurus", "Velociraptor")  # "Tyrannoraptora"
mrca("Tyrannosaurus", "Triceratops")   # "Dinosauria"
mrca("Tyrannosaurus", "Homo")          # "Amniota"
mrca("Tyrannosaurus", "Drosophila")    # "Nephrozoa"
mrca("Tyrannosaurus", "Quercus")       # "discaria"
```

---

## Working with multiple taxa

### Pairwise distance matrix

```{r matrix}
taxa <- c(
  "Tyrannosaurus", "Carnotaurus", "Triceratops",
  "Parasaurolophus", "Stegosaurus", "Brachiosaurus",
  "Homo sapiens", "Homo neanderthalensis", "Pan troglodytes",
  "Panthera leo", "Canis lupus",
  "Ornithorhynchus anatinus",
  "Loxodonta africana",       
  "Struthio camelus",           
  "Aptenodytes forsteri",       
  "Ara ararauna",            
  "Crocodylus niloticus",
  "Chelonia mydas",
  "Ambystoma mexicanum",
  "Octopus vulgaris",
  "Carcharodon carcharias",    
  "Balaenoptera musculus",   
  "Drosophila melanogaster",
  "Apis mellifera",
  "Arabidopsis thaliana",
  "Quercus robur",
  "Ginkgo biloba",
  "Welwitschia mirabilis",
  "Saccharomyces cerevisiae",
  "Escherichia coli",
  "Bacillus subtilis",
  "Plasmodium falciparum"
)
mat <- distance_matrix(taxa)
print(mat)
```

The matrix is symmetric with zeros on the diagonal. 
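
These two properties are what base R's `as.dist()` expects, so the result can be handed to standard tools. The sketch below uses a small hand-written matrix as a stand-in, on the assumption that `distance_matrix()` returns a plain numeric matrix with taxon names as dimnames; the values are illustrative.

```{r sketch-matrix-props}
# Toy stand-in for the output of distance_matrix(); values are made up.
taxa_toy <- c("A", "B", "C")
mat_toy <- matrix(c(0,    0.25, 0.5,
                    0.25, 0,    0.5,
                    0.5,  0.5,  0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(taxa_toy, taxa_toy))

isSymmetric(mat_toy)     # TRUE
all(diag(mat_toy) == 0)  # TRUE
d <- as.dist(mat_toy)    # "dist" object accepted by hclust(), cmdscale(), ...
```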

### Built-in clustering and ordination

`taxodist` provides wrapper functions for basic multivariate exploration of these matrices, so you do not have to write the boilerplate yourself.

#### Hierarchical clustering

```{r clustering}
cl <- taxo_cluster(taxa)
plot(cl)
```

#### PCoA

```{r PCoA}
ord <- taxo_ordinate(taxa)
summary(ord)
plot(ord)
```

#### Heatmap visualization

```{r heatmap}
taxo_heatmap(taxa)
```
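
For comparison, the kind of boilerplate these wrappers replace can be sketched with base R's `hclust()`, `cmdscale()` and `heatmap()`. This is an illustration on a toy matrix, not a description of how `taxo_cluster()`, `taxo_ordinate()` or `taxo_heatmap()` are implemented; the average-linkage method and the two PCoA axes are choices made for this example only.

```{r sketch-manual-multivariate}
# Toy symmetric distance matrix standing in for distance_matrix(taxa).
labs <- c("A", "B", "C", "D")
m <- matrix(c(0,   0.1, 0.5, 0.5,
              0.1, 0,   0.5, 0.5,
              0.5, 0.5, 0,   0.2,
              0.5, 0.5, 0.2, 0),
            nrow = 4, byrow = TRUE, dimnames = list(labs, labs))
d <- as.dist(m)

hc  <- hclust(d, method = "average")   # hierarchical clustering (UPGMA)
pco <- cmdscale(d, k = 2, eig = TRUE)  # classical PCoA on two axes

plot(hc)                                             # dendrogram
plot(pco$points, xlab = "PCoA 1", ylab = "PCoA 2")   # ordination scatter
heatmap(m, Rowv = as.dendrogram(hc), symm = TRUE)    # heatmap ordered by hc
```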

### Finding the closest relative

```{r closest}
closest_relative(
  "Carnotaurus",
  c("Aucasaurus", "Velociraptor", "Triceratops",
    "Brachiosaurus", "Homo sapiens", "Apis mellifera")
)
#>            taxon   distance
#> 1     Aucasaurus 0.01515152
#> 2   Velociraptor 0.01666667
#> 4  Brachiosaurus 0.01754386
#> 3    Triceratops 0.01818182
#> 5   Homo sapiens 0.02777778
#> 6 Apis mellifera 0.06666667
```
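
The row names in this output (1, 2, 4, 3, ...) are the positions of the candidates in the input vector, re-ordered by distance. As a small base R sketch, the same ranking can be reproduced from the distances printed above, and inverting $d = 1/\text{depth}$ recovers the depth of each MRCA. The column name `mrca_depth` is added here for illustration and is not part of the `closest_relative()` output.

```{r sketch-closest}
# Distances copied from the output above, in the original candidate order.
d <- c(Aucasaurus = 0.01515152, Velociraptor = 0.01666667,
       Triceratops = 0.01818182, Brachiosaurus = 0.01754386,
       "Homo sapiens" = 0.02777778, "Apis mellifera" = 0.06666667)

ranked <- data.frame(taxon = names(d), distance = unname(d))
ranked <- ranked[order(ranked$distance), ]       # closest first
ranked$mrca_depth <- round(1 / ranked$distance)  # invert d = 1 / depth
ranked$mrca_depth                                # 66 60 57 55 36 15
```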

---

## Lineage utilities

### Comparing lineages side by side

```{r compare}
compare_lineages("Carnotaurus", "Tyrannosaurus")
#> -- Lineage Comparison --
#> MRCA: Averostra at depth 60
#>
#> Shared lineage (60 nodes):
#>   Biota ... Averostra
#>
#> Carnotaurus only (7 nodes):
#> Ceratosauria
#> Neoceratosauria
#> Abelisauroidea
#> Abelisauria
#> Abelisauridae
#> Carnotaurinae
#> Carnotaurus
#>
#> Tyrannosaurus only (10 nodes):
#> Tetanurae
#> Orionides
#> ...
```

### Listing shared clades

```{r shared}
# what do a fly and a beetle have in common?
shared_clades("Drosophila melanogaster", "Tribolium castaneum")
# returns their shared lineage from Biota down to their MRCA

# what do T. rex and a rose share?
shared_clades("Tyrannosaurus rex", "Rosa agrestis")
```
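
Conceptually, the shared clades are the common root-to-tip prefix of the two lineages, and everything after that prefix is unique to each taxon. A minimal base R sketch, using shortened, made-up lineages rather than real Taxonomicon data:

```{r sketch-shared}
# Made-up, shortened lineages for illustration.
fly    <- c("Biota", "Animalia", "Arthropoda", "Insecta",
            "Diptera", "Drosophila melanogaster")
beetle <- c("Biota", "Animalia", "Arthropoda", "Insecta",
            "Coleoptera", "Tribolium castaneum")

n     <- min(length(fly), length(beetle))
depth <- sum(cumprod(fly[seq_len(n)] == beetle[seq_len(n)]))  # prefix length

shared      <- fly[seq_len(depth)]               # root down to the MRCA
fly_only    <- fly[seq_along(fly) > depth]       # nodes below the split
beetle_only <- beetle[seq_along(beetle) > depth]

shared    # "Biota" "Animalia" "Arthropoda" "Insecta"
fly_only  # "Diptera" "Drosophila melanogaster"
```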

### Testing clade membership

```{r membership}
is_member("Tyrannosaurus", "Theropoda")          # TRUE
is_member("Carnotaurus", "Abelisauridae")        # TRUE
is_member("Triceratops", "Theropoda")            # FALSE
is_member("Homo sapiens", "Amniota")             # TRUE
is_member("Drosophila melanogaster", "Insecta")  # TRUE
is_member("Quercus robur", "Animalia")           # FALSE
```

### Filtering a list of taxa by clade

```{r filter}
taxa <- c("Tyrannosaurus", "Carnotaurus", "Triceratops",
          "Velociraptor", "Homo sapiens", "Drosophila melanogaster",
          "Quercus robur", "Saccharomyces cerevisiae")

filter_clade(taxa, "Dinosauria")
#> [1] "Tyrannosaurus" "Carnotaurus"   "Triceratops"   "Velociraptor"

filter_clade(taxa, "Theropoda")
#> [1] "Tyrannosaurus" "Carnotaurus"   "Velociraptor"

filter_clade(taxa, "Animalia")
#> [1] "Tyrannosaurus"          "Carnotaurus"
#> [3] "Triceratops"            "Velociraptor"
#> [5] "Homo sapiens"           "Drosophila melanogaster"
```
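
Membership and filtering both reduce to asking whether a clade name appears in a taxon's lineage. The sketch below illustrates that idea in base R; `toy_lineages`, `toy_is_member()` and `toy_filter()` are hypothetical names with made-up, shortened lineages, and are not the `taxodist` implementation.

```{r sketch-membership}
# Made-up, shortened lineages keyed by taxon name.
toy_lineages <- list(
  Tyrannosaurus   = c("Biota", "Animalia", "Dinosauria", "Theropoda",
                      "Tyrannosaurus"),
  Triceratops     = c("Biota", "Animalia", "Dinosauria", "Ornithischia",
                      "Triceratops"),
  "Quercus robur" = c("Biota", "Plantae", "Quercus robur")
)

# A taxon belongs to a clade if the clade name occurs in its lineage.
toy_is_member <- function(taxon, clade) clade %in% toy_lineages[[taxon]]

# Keep only the taxa whose lineage contains the clade.
toy_filter <- function(taxa, clade) {
  taxa[vapply(taxa, toy_is_member, logical(1), clade = clade)]
}

toy_is_member("Triceratops", "Theropoda")      # FALSE
toy_filter(names(toy_lineages), "Dinosauria")  # "Tyrannosaurus" "Triceratops"
```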

---

## Coverage and caching

### Checking coverage before a large run

```{r coverage}
taxa <- c("Tyrannosaurus", "Velociraptor", "Apis mellifera", "Fakeosaurus")
check_coverage(taxa)
#>  Tyrannosaurus  Velociraptor  Apis mellifera    Fakeosaurus
#>           TRUE          TRUE            TRUE          FALSE
```

Use `check_coverage()` to pre-screen a list before running `distance_matrix()`
on a large dataset; taxa that return `FALSE` will produce `NA` distances.
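
A pre-screening step might look like the sketch below. It assumes, based on the printed output above, that `check_coverage()` returns a named logical vector; the vector is written out by hand here so that the snippet runs without network access.

```{r sketch-prescreen}
# Named logical vector in the shape printed by check_coverage() above.
covered <- c(Tyrannosaurus = TRUE, Velociraptor = TRUE,
             "Apis mellifera" = TRUE, Fakeosaurus = FALSE)

usable  <- names(covered)[covered]   # taxa with a lineage available
dropped <- names(covered)[!covered]  # taxa that would yield NA distances

if (length(dropped) > 0) {
  message("Not found, excluded: ", paste(dropped, collapse = ", "))
}
# `usable` can now be passed on, e.g. distance_matrix(usable)
```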

### Caching

Lineages are automatically cached in memory during an R session to avoid
redundant network requests, so a second call to `get_lineage()` for
the same taxon is served from the cache and returns almost instantly. Clear the cache with:

```{r cache}
clear_cache()
```
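
The general idea of an in-memory lineage cache can be sketched with a base R environment. This is an illustration of the technique only; `toy_cache`, `toy_fetch()` and `toy_get_lineage()` are hypothetical names, and the actual caching code in `taxodist` may differ.

```{r sketch-cache}
toy_cache <- new.env(parent = emptyenv())
n_fetches <- 0L

# Stand-in for a slow network lookup; returns a dummy lineage.
toy_fetch <- function(taxon) {
  n_fetches <<- n_fetches + 1L
  c("Biota", taxon)
}

# Fetch only when the taxon is not already stored in the cache.
toy_get_lineage <- function(taxon) {
  if (!exists(taxon, envir = toy_cache, inherits = FALSE)) {
    assign(taxon, toy_fetch(taxon), envir = toy_cache)
  }
  get(taxon, envir = toy_cache, inherits = FALSE)
}

first  <- toy_get_lineage("Tyrannosaurus")  # triggers a fetch
second <- toy_get_lineage("Tyrannosaurus")  # served from the cache
n_fetches                                   # 1

rm(list = ls(toy_cache), envir = toy_cache)  # empty the toy cache
```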

### Persisting the cache across sessions

By default the cache lives only in memory and is lost when you close R.
If you have retrieved lineages for a large set of taxa, you can save the
cache to disk and reload it in a future session to avoid hitting the network
again:

```{r save-cache}
save_cache("my_taxa_cache.rds") # at the end of a session

load_cache("my_taxa_cache.rds") # at the start of the next session, before any distance calls
```

This is especially useful before running `distance_matrix()` on a large
dataset.
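
The underlying pattern is a save and restore round trip. The sketch below shows it with base R's `saveRDS()` and `readRDS()` on a toy list; that `save_cache()` and `load_cache()` use these functions internally is an assumption suggested by the `.rds` extension, not something confirmed here.

```{r sketch-persist}
# Toy cache contents: taxon name -> lineage (made-up, shortened).
toy_cache <- list(Tyrannosaurus = c("Biota", "Dinosauria", "Tyrannosaurus"))

path <- tempfile(fileext = ".rds")
saveRDS(toy_cache, path)  # end of one session

# Start of the next session: reuse the file if it exists, else start empty.
restored <- if (file.exists(path)) readRDS(path) else list()
identical(restored, toy_cache)  # TRUE

unlink(path)  # clean up the temporary file
```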

---

## A note on lineage depth

The Taxonomicon provides substantially deeper lineage resolution than most
other programmatic sources. For example, *Tyrannosaurus* has 70 nodes
in its lineage, capturing intermediate clades at the level of superfamilies,
tribes, and named subclades that are absent from most sources. This depth is
what makes the distance metric meaningful: shallower sources would produce
coarser distances that conflate distantly related groups.
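
The effect of resolution can be illustrated with a toy comparison in base R. The lineages are made up and heavily shortened, the "shallow" source is simulated by keeping only the root, one major clade and the tip, and `toy_distance()` is a hypothetical helper rather than a `taxodist` function.

```{r sketch-depth-resolution}
# Hypothetical helper: 1 / length of the shared root-to-tip prefix.
toy_distance <- function(a, b) {
  n <- min(length(a), length(b))
  1 / sum(cumprod(a[seq_len(n)] == b[seq_len(n)]))
}

# Made-up "deep" lineages that record intermediate clades.
deep <- list(
  trex  = c("Biota", "Dinosauria", "Theropoda", "Tetanurae", "Tyrannosaurus"),
  velo  = c("Biota", "Dinosauria", "Theropoda", "Tetanurae", "Velociraptor"),
  carno = c("Biota", "Dinosauria", "Theropoda", "Ceratosauria", "Carnotaurus"),
  trike = c("Biota", "Dinosauria", "Ornithischia", "Triceratops")
)

# Simulated "shallow" source: root, one major clade, and the tip only.
shallow <- lapply(deep, function(lin) lin[c(1, 2, length(lin))])

# Distances from each other taxon to trex.
deep_d    <- sapply(deep[-1], toy_distance, b = deep$trex)        # 1/4, 1/3, 1/2
shallow_d <- sapply(shallow[-1], toy_distance, b = shallow$trex)  # all 1/2
```

With the deep lineages the three relatives are ranked by relatedness; with the shallow ones they become indistinguishable.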

---

## Data source and citation

All lineage data is sourced from The Taxonomicon (taxonomy.nl), based on
*Systema Naturae 2000*:

> Brands, S.J. (1989 onwards). *Systema Naturae 2000*. Amsterdam,
> The Netherlands. Retrieved from The Taxonomicon,
> http://taxonomicon.taxonomy.nl.

Please cite this resource in any published work using `taxodist`.
