---
title: "Exact Graph Construction from big.matrix Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Exact Graph Construction from big.matrix Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`bigKNN` can build exact neighbour graphs directly from
`bigmemory::big.matrix` references. These graph builders are useful when the
exact neighbourhood structure is the real output you care about, not only the
raw search result.

Typical uses include:

- clustering and community discovery
- manifold learning and neighborhood-preserving embeddings
- graph-based preprocessing for downstream models
- exact ground truth for evaluating approximate graph builders

This vignette introduces the graph families available in `bigKNN`, the
symmetrization rules they use, and the output formats that make them easy to
inspect or hand off downstream.

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

if (!requireNamespace("bigmemory", quietly = TRUE)) {
  cat("This vignette requires the 'bigmemory' package.\n")
  knitr::knit_exit()
}

library(bigKNN)
library(bigmemory)
```

```{r helpers, include=FALSE}
clean_graph <- function(x) {
  out <- as.data.frame(x)
  rownames(out) <- NULL
  out[order(out$from, out$to), , drop = FALSE]
}
```

# Build a Small Reference Set

We will work with six points arranged as two well-separated local groups. That
geometry keeps the graph outputs easy to read while still showing meaningful
differences between directed, mutual, shared-nearest-neighbour, and radius
graphs.

```{r create-reference}
reference_points <- data.frame(
  id = paste0("p", 1:6),
  x1 = c(0.0, 0.3, 1.2, 4.0, 4.3, 5.2),
  x2 = c(0.0, 0.0, 0.0, 4.0, 4.1, 4.0)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))

reference_points
```

The returned graph objects always refer to rows of the reference matrix. In
other words, vertices `1:6` correspond to `p1:p6`.

# Directed `k`-NN graphs with `knn_graph_bigmatrix()`

A directed `k`-nearest-neighbour graph stores one outgoing edge per retained
neighbour for every row. With `include_distance = TRUE`, each edge value is the
exact distance between the source row and the neighbour row.

```{r directed-knn-graph}
directed_knn <- knn_graph_bigmatrix(
  reference,
  k = 1,
  format = "edge_list",
  symmetrize = "none"
)

clean_graph(directed_knn)
```

Here:

- `from` is the source row
- `to` is the chosen neighbour row
- `distance` is the exact Euclidean distance

This is the most literal graph form: every query row keeps its own directed
outgoing edges. If one row points to another but not vice versa, both facts are
visible.

# Mutual `k`-NN graphs with `mutual_knn_graph_bigmatrix()`

Mutual `k`-NN graphs keep only reciprocal neighbour relationships. They are
often sparser and more conservative than directed `k`-NN graphs, because
one-sided edges are dropped.

```{r mutual-graph}
mutual_knn <- mutual_knn_graph_bigmatrix(
  reference,
  k = 1,
  format = "edge_list"
)

clean_graph(mutual_knn)
```

In this example, only the pairs `(1, 2)` and `(4, 5)` survive. Vertices `3`
and `6` each point toward their nearest local anchor, but that preference is
not returned in the other direction, so those edges disappear in the mutual
graph.

`mutual_knn_graph_bigmatrix()` is equivalent to calling
`knn_graph_bigmatrix(..., symmetrize = "mutual")`:

```{r mutual-wrapper}
identical(
  clean_graph(mutual_knn),
  clean_graph(knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual"))
)
```

# Shared-nearest-neighbour graphs with `snn_graph_bigmatrix()`

Shared-nearest-neighbour graphs connect pairs of rows that have overlapping
exact neighbour sets. Instead of storing distances, they store a similarity-like
weight derived from neighbourhood overlap.

`bigKNN` currently supports two weight definitions:

- `weight = "count"`: the number of shared neighbours
- `weight = "jaccard"`: shared neighbours divided by the union size

```{r snn-graph}
snn_count <- snn_graph_bigmatrix(
  reference,
  k = 2,
  weight = "count",
  format = "edge_list"
)

snn_jaccard <- snn_graph_bigmatrix(
  reference,
  k = 2,
  weight = "jaccard",
  format = "edge_list"
)

merge(
  clean_graph(snn_count),
  clean_graph(snn_jaccard),
  by = c("from", "to"),
  suffixes = c("_count", "_jaccard")
)
```

The edge set is the same for both weight choices in this example, but the
weights differ:

- `count` emphasizes absolute overlap
- `jaccard` normalizes overlap by neighbourhood size

That normalization can be helpful when you want weights that are easier to
compare across regions of the graph with different neighbour-set structure.

# Radius graphs with `radius_graph_bigmatrix()`

Radius graphs use a fixed distance threshold instead of a fixed `k`. That means
the number of neighbours can vary from row to row.

```{r radius-graph}
radius_directed <- radius_graph_bigmatrix(
  reference,
  radius = 1.1,
  format = "edge_list",
  symmetrize = "none"
)

radius_union <- radius_graph_bigmatrix(
  reference,
  radius = 1.1,
  format = "edge_list",
  symmetrize = "union"
)

clean_graph(radius_directed)
clean_graph(radius_union)
```

Compared with fixed-`k` graphs:

- radius graphs encode a distance scale directly
- dense regions can produce more edges than sparse regions
- isolated rows may have very few or no neighbours at the chosen radius

In this self-search setting with a symmetric metric, `symmetrize = "union"` and
`symmetrize = "mutual"` coincide because any radius edge that appears in one
direction also appears in the other:

```{r radius-symmetry}
identical(
  clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "union")),
  clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "mutual"))
)
```

# Symmetrization options and edge semantics

For `k`-NN and radius graph builders, `symmetrize` controls how directed edges
are collapsed:

- `"none"` keeps directed edges as-is
- `"union"` keeps a pair if either direction appears
- `"mutual"` keeps a pair only when both directions appear

The difference is easiest to see on `k = 1`:

```{r symmetrize-comparison}
graph_none <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "none")
graph_union <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "union")
graph_mutual <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual")

data.frame(
  symmetrize = c("none", "union", "mutual"),
  n_edges = c(nrow(clean_graph(graph_none)), nrow(clean_graph(graph_union)), nrow(clean_graph(graph_mutual))),
  row.names = NULL
)
```

When distances are stored, `bigKNN` keeps the minimum directed distance for the
collapsed undirected pair. When unit weights are stored instead
(`include_distance = FALSE`), the retained value is the maximum weight, which
remains `1` for unweighted graphs.

# Converting outputs with `as_edge_list()`, `as_triplet()`, and `as_sparse_matrix()`

Different downstream tasks prefer different graph formats:

- edge lists are easiest to inspect and export
- triplet form is a compact sparse representation in R lists
- sparse matrices are convenient for linear algebra and adjacency-style
  operations

```{r graph-coercion}
triplet_graph <- as_triplet(graph_mutual)
sparse_graph <- as_sparse_matrix(
  knn_graph_bigmatrix(
    reference,
    k = 1,
    format = "edge_list",
    symmetrize = "union",
    include_distance = FALSE
  )
)

triplet_graph
class(sparse_graph)
Matrix::summary(sparse_graph)
```

One subtle but important point: symmetric edge lists and triplets store each
undirected pair once, while the sparse matrix representation stores both matrix
entries `(i, j)` and `(j, i)`. That is why the sparse summary above shows twice
as many non-zero entries as the symmetrized edge list.

You can round-trip back to an edge list when needed:

```{r graph-roundtrip}
clean_graph(as_edge_list(sparse_graph))
```

# Using graph outputs downstream

Once a graph is in sparse-matrix form, ordinary matrix operations become useful
immediately. For an unweighted symmetric adjacency, row sums give vertex
degrees:

```{r downstream-usage}
degree <- Matrix::rowSums(sparse_graph)

data.frame(
  vertex = reference_points$id,
  degree = as.numeric(degree),
  row.names = NULL
)
```

The same graph could also be handed off to a separate graph package after
conversion:

- use `as_edge_list()` for packages that ingest edge tables
- use `as_sparse_matrix()` for adjacency-based workflows
- use `as_triplet()` when you want a lightweight sparse representation without
  constructing a matrix object yet

That keeps `bigKNN` focused on exact search and exact graph construction,
without forcing a particular downstream graph ecosystem on the user.
