---
title: "Working with Backends"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Backends}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library(joinspy)
```

joinspy works with base R data frames, tibbles, and data.tables. The join
wrappers (`left_join_spy()`, `join_strict()`, etc.) detect the input class and
dispatch to the right engine automatically. The diagnostic layer (`join_spy()`,
`key_check()`, `join_explain()`, and friends) is backend-agnostic: it runs
the same analysis regardless of input class.

We walk through detection, explicit overrides, and class preservation below.


## Auto-detection

When we call `left_join_spy()` or `join_strict()` without specifying a
backend, joinspy inspects the class of `x` and `y` and picks the backend
according to a fixed priority: data.table > tibble > base R.

data.table takes priority because its merge implementation depends on key
handling, indexing, and reference semantics that a dplyr join would discard.
dplyr, on the other hand, handles a coerced data.table without issues. Both
inputs are checked -- if one side is a tibble and the other a plain data frame,
dplyr is selected. If a mixed-class call selects a backend whose package is
not installed, joinspy falls back to base R with a warning.
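
The priority rule can be sketched in a few lines of base R. The helper
`pick_backend()` below is hypothetical and is not joinspy's actual
implementation; it only assumes that tibbles carry the class `tbl_df` and
data.tables the class `data.table`. The test objects fake those classes so the
sketch runs without either package installed.

```r
# Hypothetical helper (not part of joinspy): priority-based detection.
pick_backend <- function(x, y) {
  classes <- c(class(x), class(y))
  if ("data.table" %in% classes) {
    "data.table"          # highest priority: either input is a data.table
  } else if ("tbl_df" %in% classes) {
    "dplyr"               # next: either input is a tibble
  } else {
    "base"                # otherwise plain data frames
  }
}

# Stand-ins that only carry the class attribute of a tibble / data.table.
plain    <- data.frame(id = 1:3)
fake_tbl <- structure(data.frame(id = 1:3),
                      class = c("tbl_df", "tbl", "data.frame"))
fake_dt  <- structure(data.frame(id = 1:3),
                      class = c("data.table", "data.frame"))

pick_backend(plain, plain)      # both plain data frames
pick_backend(plain, fake_tbl)   # one tibble is enough to select dplyr
pick_backend(fake_tbl, fake_dt) # data.table outranks tibble
```

Checking both inputs in one pass is what makes the result independent of
argument order in this sketch.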

Here is the detection in action with each input type:

```{r}
# Base R data frames: auto-detects "base"
orders_df <- data.frame(
  id = c(1, 2, 3),
  amount = c(100, 250, 75),
  stringsAsFactors = FALSE
)

customers_df <- data.frame(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana"),
  stringsAsFactors = FALSE
)

result_base <- left_join_spy(orders_df, customers_df, by = "id", .quiet = TRUE)
class(result_base)
```

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
# Tibbles: auto-detects "dplyr"
orders_tbl <- dplyr::tibble(
  id = c(1, 2, 3),
  amount = c(100, 250, 75)
)

customers_tbl <- dplyr::tibble(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana")
)

result_dplyr <- left_join_spy(orders_tbl, customers_tbl, by = "id", .quiet = TRUE)
class(result_dplyr)
```

```{r eval = requireNamespace("data.table", quietly = TRUE)}
# data.tables: auto-detects "data.table"
orders_dt <- data.table::data.table(
  id = c(1, 2, 3),
  amount = c(100, 250, 75)
)

customers_dt <- data.table::data.table(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana")
)

result_dt <- left_join_spy(orders_dt, customers_dt, by = "id", .quiet = TRUE)
class(result_dt)
```

When the two inputs have different classes, the higher-priority class wins:

```{r eval = requireNamespace("data.table", quietly = TRUE) && requireNamespace("dplyr", quietly = TRUE)}
# data.table + tibble: data.table wins
mixed_result <- left_join_spy(orders_dt, customers_tbl, by = "id", .quiet = TRUE)
class(mixed_result)
```


## Explicit override

All join wrappers and `join_strict()` accept a `backend` argument that
overrides auto-detection. The three valid values are `"base"`, `"dplyr"`, and
`"data.table"`.

We can force dplyr on plain data frames to get tibble output:

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
result <- left_join_spy(orders_df, customers_df, by = "id",
                        backend = "dplyr", .quiet = TRUE)
class(result)
```

Or force base R to sidestep dplyr's many-to-many warning when we already know
the expansion is intentional:

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
# These have a legitimate many-to-many relationship
tags <- dplyr::tibble(
  item_id = c(1, 1, 2),
  tag = c("red", "large", "small")
)

prices <- dplyr::tibble(
  item_id = c(1, 2, 2),
  currency = c("USD", "USD", "EUR")
)

# Force base R to avoid dplyr's many-to-many warning
result <- left_join_spy(tags, prices, by = "item_id",
                        backend = "base", .quiet = TRUE)
nrow(result)
```
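
The size of that expansion can be worked out from the key vectors alone: in a
left join, each left row yields one output row per matching right row, or a
single row if nothing matches. The helper `expected_left_rows()` below is a
hypothetical illustration (not a joinspy function) and assumes keys without
missing values.

```r
# Hypothetical helper: expected row count of a left join from key vectors.
expected_left_rows <- function(kx, ky) {
  # Number of right-table matches for every left-table key
  matches <- vapply(kx, function(k) sum(ky == k), integer(1))
  # Unmatched left rows still appear once in a left join
  sum(pmax(matches, 1L))
}

tag_keys   <- c(1, 1, 2)  # item_id in the tags table
price_keys <- c(1, 2, 2)  # item_id in the prices table

expected_left_rows(tag_keys, price_keys)
```

For these keys the count is 1 + 1 + 2 = 4: each of the two rows for item 1
matches one price, and the single row for item 2 matches two.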

Or force data.table on plain data frames for speed on large inputs:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
result <- left_join_spy(orders_df, customers_df, by = "id",
                        backend = "data.table", .quiet = TRUE)
class(result)
```

An explicit backend must be installed. Requesting `backend = "dplyr"` without
dplyr will error, not silently fall back -- auto-detection is a convenience,
but an explicit override is a contract.
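
The difference between a convenience and a contract can be sketched as a guard
that fails loudly. `require_backend()` is a hypothetical helper written for
this illustration, not a joinspy function; it relies only on base R's
`match.arg()` and `requireNamespace()`.

```r
# Hypothetical guard: an explicit backend either works or errors.
require_backend <- function(backend = c("base", "dplyr", "data.table")) {
  backend <- match.arg(backend)  # rejects anything outside the three values
  if (backend != "base" && !requireNamespace(backend, quietly = TRUE)) {
    # No silent fallback: the caller asked for this engine specifically
    stop("backend = \"", backend, "\" requires the ", backend,
         " package to be installed", call. = FALSE)
  }
  backend
}

require_backend("base")

# The outcome here depends on whether dplyr is installed on this machine.
tryCatch(require_backend("dplyr"), error = function(e) conditionMessage(e))
```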

Setting `backend = "base"` also pins the join engine, so results do not depend
on whether dplyr or data.table happens to be installed in a given environment.


## Class preservation

joinspy preserves input class through the full diagnostic-repair-join cycle:

- **Diagnostics** (`join_spy()`, `key_check()`, etc.) accept any data frame
  subclass and return report objects without modifying the input.
- **Repair** (`join_repair()`) changes only the key columns and returns
  the same class it received.
- **Join wrappers** dispatch to the native join engine for the detected class.
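
The repair step's class preservation can be illustrated without joinspy. The
sketch below uses a hypothetical `trim_keys()` as a stand-in for one kind of
repair (whitespace trimming); it is an assumption made for illustration, and
`join_repair()` may handle more cases than this.

```r
# Hypothetical stand-in for a single repair: trim whitespace in key columns.
trim_keys <- function(x, by) {
  for (col in by) {
    if (is.character(x[[col]])) {
      x[[col]] <- trimws(x[[col]])  # column replacement keeps the class of x
    }
  }
  x
}

messy <- data.frame(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30)
)

fixed <- trim_keys(messy, by = "code")
fixed$code
identical(class(fixed), class(messy))  # class is unchanged
messy$code                             # the original is left untouched
```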

Here is a full cycle with base R data frames:

```{r}
messy_df <- data.frame(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

lookup_df <- data.frame(
  code = c("A-1", "B-2", "C-3"),
  label = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)

# 1. Diagnose
report <- join_spy(messy_df, lookup_df, by = "code")

# 2. Repair
repaired_df <- join_repair(messy_df, by = "code")
class(repaired_df)  # still data.frame

# 3. Join
joined_df <- left_join_spy(repaired_df, lookup_df, by = "code", .quiet = TRUE)
class(joined_df)  # still data.frame
joined_df
```

The same cycle with tibbles:

```{r eval = requireNamespace("dplyr", quietly = TRUE)}
messy_tbl <- dplyr::tibble(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30)
)

lookup_tbl <- dplyr::tibble(
  code = c("A-1", "B-2", "C-3"),
  label = c("Alpha", "Beta", "Gamma")
)

repaired_tbl <- join_repair(messy_tbl, by = "code")
class(repaired_tbl)  # still tbl_df

joined_tbl <- left_join_spy(repaired_tbl, lookup_tbl, by = "code", .quiet = TRUE)
class(joined_tbl)  # still tbl_df
joined_tbl
```

And with data.tables:

```{r eval = requireNamespace("data.table", quietly = TRUE)}
messy_dt <- data.table::data.table(
  code = c("A-1 ", "B-2", " C-3"),
  value = c(10, 20, 30)
)

lookup_dt <- data.table::data.table(
  code = c("A-1", "B-2", "C-3"),
  label = c("Alpha", "Beta", "Gamma")
)

repaired_dt <- join_repair(messy_dt, by = "code")
class(repaired_dt)  # still data.table

joined_dt <- left_join_spy(repaired_dt, lookup_dt, by = "code", .quiet = TRUE)
class(joined_dt)  # still data.table
joined_dt
```

When `join_repair()` receives both `x` and `y`, it returns a list with `$x`
and `$y`, each preserving the class of the corresponding input.

`join_strict()` also preserves class -- the cardinality check runs before the
join, so a satisfied constraint returns the native class and a violated one
errors before any output is produced.

The one exception is an explicit backend override that does not match the input
class. Passing `backend = "data.table"` on a tibble returns a data.table,
because that is what the data.table engine produces.


## Diagnostics are backend-agnostic

The diagnostic functions (`join_spy()`, `key_check()`, `key_duplicates()`,
`join_explain()`, `detect_cardinality()`, `check_cartesian()`) operate purely
on column values and never call a join engine. They produce identical results
regardless of input class.

This means we can diagnose on data.tables and join with dplyr, or diagnose in
a base-R script and pass the data to a Shiny app that uses dplyr internally.

```{r eval = requireNamespace("data.table", quietly = TRUE) && requireNamespace("dplyr", quietly = TRUE)}
# Diagnose on data.tables
orders_dt <- data.table::data.table(
  id = c(1, 2, 3),
  amount = c(100, 250, 75)
)

customers_dt <- data.table::data.table(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Diana")
)

report <- join_spy(orders_dt, customers_dt, by = "id")

# Join with dplyr (convert first)
orders_tbl <- dplyr::as_tibble(orders_dt)
customers_tbl <- dplyr::as_tibble(customers_dt)
result <- left_join_spy(orders_tbl, customers_tbl, by = "id", .quiet = TRUE)
class(result)
```

The report object is structurally identical across backends -- `$issues`,
`$expected_rows`, and `$match_analysis` contain the same values. This also
means we can write unit tests for key quality using plain data frames even when
production code uses data.table.
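
To see why class does not matter for this kind of analysis, consider a
diagnostic that reads nothing but the key columns. `key_summary()` below is a
hypothetical, much-reduced stand-in for a joinspy report (its field names are
invented for this sketch); it gives the same answer for a data frame and for a
bare list holding the same columns.

```r
# Hypothetical mini-diagnostic: uses only the values of the key column.
key_summary <- function(x, y, by) {
  kx <- x[[by]]
  ky <- y[[by]]
  list(
    matched_keys     = sum(unique(kx) %in% ky),  # distinct left keys found in y
    unmatched_left   = sum(!kx %in% ky),         # left rows without a partner
    duplicated_right = sum(duplicated(ky))       # repeated keys in y
  )
}

orders    <- data.frame(id = c(1, 2, 3), amount = c(100, 250, 75))
customers <- data.frame(id = c(1, 2, 4), name = c("Alice", "Bob", "Diana"))

from_df   <- key_summary(orders, customers, by = "id")
from_list <- key_summary(as.list(orders), as.list(customers), by = "id")

from_df
identical(from_df, from_list)
```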


## Backend differences at a glance

The three backends differ in a few ways worth noting:

- **Column name collisions.** Base R and dplyr append `.x`/`.y` suffixes;
  data.table prefixes right-table columns with `i.`.
- **Row ordering.** Base R sorts by key; dplyr preserves left-table order;
  data.table sorts by key if keyed, otherwise preserves insertion order.
- **Performance.** data.table is typically the fastest for large inputs. Base R
  and dplyr are comparable for small to medium datasets.

If we switch backends mid-project, it is worth checking that column references
and row-order assumptions still hold.
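
One defensive pattern is to make row order explicit rather than rely on the
engine. The sketch below uses only base R's `merge()` to show key sorting and
suffixing, then restores the left-table order through a helper column; the
column name `row_id` is an arbitrary choice for this example.

```r
left  <- data.frame(id = c(3, 1, 2), value = c("c", "a", "b"))
right <- data.frame(id = c(1, 2, 3), value = c("x", "y", "z"))

# Tag each left row with its position before joining.
left$row_id <- seq_len(nrow(left))

merged <- merge(left, right, by = "id", all.x = TRUE)
merged$id      # merge() sorts the result by key by default
names(merged)  # the colliding "value" columns get .x / .y suffixes

# Restore the left-table order explicitly, then drop the helper column.
restored <- merged[order(merged$row_id), ]
restored$row_id <- NULL
restored$id
```

With the order pinned this way, code that indexes rows by position behaves the
same whichever engine produced the join.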


## See Also

- `vignette("quickstart")` for a quick introduction to joinspy
- `vignette("common-issues")` for a catalogue of join problems and solutions
- `?left_join_spy`, `?join_strict` for backend parameter documentation
