---
title: "Audit Trail Walkthrough"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Audit Trail Walkthrough}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE}
library(tidyaudit)
library(dplyr)
```

## Tracking what happens in your pipeline

Every filter, join, and transformation in a data pipeline encodes an
assumption about the data. How many rows did that filter actually drop? Did the
join introduce NAs or duplicates? Which records did not match?
How much revenue was lost at each step?

These questions matter. Even when the answers are clear in the moment, they are
hard to recall during code review, debugging, or when someone inherits your
pipeline months later. An audit trail keeps that record for you.

tidyaudit captures **metadata-only snapshots** at each step of a pipe — row
counts, column counts, NA totals, numeric summaries — without storing the data
itself. The trail is a lightweight, structured record of your pipeline's
behavior that you can print, diff, export, and share.

## Your first trail

Create a trail object and insert `audit_tap()` calls into your pipeline. Each
tap records a snapshot and passes the data through unchanged — three taps,
three snapshots, one timeline.

```{r basic-trail}
# Sample data
orders <- data.frame(
  id       = 1:20,
  customer = rep(c("Alice", "Bob", "Carol", "Dan", "Eve"), 4),
  amount   = c(150, 200, 50, 300, 75, 120, 400, 90, 250, 60,
               180, 210, 45, 320, 85, 130, 380, 95, 270, 55),
  status   = rep(c("complete", "pending", "complete", "cancelled", "complete"), 4)
)

trail <- audit_trail("order_pipeline")

result <- orders |>
  audit_tap(trail, "raw") |>
  filter(status == "complete") |>
  audit_tap(trail, "complete_only") |>
  mutate(tax = amount * 0.1) |>
  audit_tap(trail, "with_tax")
```

Print the trail to see the full timeline:

```{r print-trail}
print(trail)
```

The timeline shows row counts, column counts, NA totals, and change summaries
between consecutive steps. Notice how the filter reduced 20 rows to 12, and the
mutate added a column — all captured automatically.

## Operation-aware taps

Plain `audit_tap()` records what the data looks like at a given point, but it
can't tell you *why* it changed. Operation-aware taps solve this — they
perform the dplyr operation AND record enriched diagnostics in a single step.

### Join taps

Joins are where data quality problems hide. A left join can silently introduce
NAs, inflate row counts, or produce unexpected many-to-many relationships.
Replace `dplyr::left_join()` with `left_join_tap()` to capture match rates,
relationship type, and duplicate key information automatically:

```{r join-tap}
customers <- data.frame(
  customer = c("Alice", "Bob", "Carol", "Dan"),
  region   = c("East", "West", "East", "North")
)

trail2 <- audit_trail("join_pipeline")

result2 <- orders |>
  audit_tap(trail2, "raw") |>
  left_join_tap(customers, by = "customer",
                .trail = trail2, .label = "with_region")

print(trail2)
```

The `Type` column now shows the join type, relationship, and match rate — all
without leaving the pipe. All six dplyr join types are supported:
`left_join_tap()`, `right_join_tap()`, `inner_join_tap()`, `full_join_tap()`,
`anti_join_tap()`, `semi_join_tap()`.
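Anti joins answer one of the opening questions directly: which records did not
match? As a sketch (assuming `anti_join_tap()` takes the same `.trail` and
`.label` arguments as `left_join_tap()` above), this keeps the orders whose
customer has no row in `customers`:

```{r anti-join-tap}
trail_anti <- audit_trail("unmatched_orders")

# Eve appears in orders but not in customers, so her orders are the
# unmatched records an anti join surfaces
unmatched <- orders |>
  anti_join_tap(customers, by = "customer",
                .trail = trail_anti, .label = "no_customer_record")

print(trail_anti)
```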

### Filter taps

Filters are invisible by default — you remove rows and never know what you
lost. `filter_tap()` keeps matching rows (like `dplyr::filter()`) while
recording exactly how many rows were dropped:

```{r filter-tap}
trail3 <- audit_trail("filter_pipeline")

result3 <- orders |>
  audit_tap(trail3, "raw") |>
  filter_tap(status == "complete",
             .trail = trail3, .label = "complete_only") |>
  filter_tap(amount > 100,
             .trail = trail3, .label = "high_value",
             .stat = amount)

print(trail3)
```

The `.stat` argument is what makes filter taps indispensable for financial and
business pipelines: it tracks how much of a numeric column was removed at each
step. You can see not just that the `amount > 100` filter dropped 8 of the 12
remaining rows, but how many dollars those rows represented.

`filter_out_tap()` works the same way but drops matching rows (the inverse).
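For instance, a sketch assuming `filter_out_tap()` accepts the same
`.trail`, `.label`, and `.stat` arguments as `filter_tap()`: dropping
cancelled orders while tracking the revenue removed with them.

```{r filter-out-tap}
trail_out <- audit_trail("filter_out_pipeline")

# Drops rows WHERE the condition is TRUE, unlike filter_tap(), which keeps them
kept <- orders |>
  filter_out_tap(status == "cancelled",
                 .trail = trail_out, .label = "drop_cancelled",
                 .stat = amount)

print(trail_out)
```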

## Comparing snapshots

`audit_diff()` gives you a detailed before/after comparison between any two
snapshots in the trail — not just adjacent ones:

```{r audit-diff}
audit_diff(trail3, "raw", "high_value")
```

This shows row/column/NA deltas, columns added or removed, and numeric
distribution shifts across the common columns.

## Full audit report

`audit_report()` prints the complete trail summary plus all consecutive diffs
in one call — the full story of what your pipeline did:

```{r audit-report}
audit_report(trail3)
```

## Domain-specific diagnostics

The built-in metrics cover structure and shape (rows, columns, NAs, numeric
summaries). But your domain has its own questions: how many valid records
remain? What's the total revenue at this stage? Is a business rule still
satisfied?

Pass a named list of functions via `.fns` to compute anything domain-specific
at any tap. Each function receives the data and its return value is stored in
the snapshot:

```{r custom-fns}
trail4 <- audit_trail("custom_example")

result4 <- orders |>
  audit_tap(trail4, "raw", .fns = list(
    n_complete   = ~sum(.$status == "complete"),          # scalar
    amount_stats = ~c(mean = mean(.$amount),              # named vector
                      max  = max(.$amount))
  )) |>
  filter(status == "complete") |>
  audit_tap(trail4, "complete_only", .fns = list(
    n_complete   = ~sum(.$status == "complete"),
    amount_stats = ~c(mean = mean(.$amount),
                      max  = max(.$amount))
  ))
```

Custom results appear as inline annotations directly below each snapshot row:

```{r print-custom}
print(trail4)
```

The rendering rules are:

- **Scalar** return value: `fn_name: value`
- **Named vector or named list of scalars**: `fn_name: key=val, key=val`
  (truncated at 60 characters)
- **Complex object** (data frame, unnamed vector, nested list):
  `fn_name: [complex -- see audit_report()]`
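To see the third rule in action, return something non-scalar from a custom
function, such as a frequency table coerced to a data frame. The inline
annotation collapses to the placeholder and the full object remains available
through `audit_report()`. This is a sketch; `as.data.frame(table(...))` is just
one convenient way to produce a complex result:

```{r complex-custom}
trail5 <- audit_trail("complex_example")

orders |>
  audit_tap(trail5, "raw", .fns = list(
    status_table = ~as.data.frame(table(.$status))  # data frame -> "complex"
  ))

print(trail5)
```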

To suppress annotations and display only the main table:

```{r show-custom-false}
print(trail4, show_custom = FALSE)
```

## Snapshot controls

On wide datasets, computing numeric summaries for every column is unnecessary
and slows the pipeline down. Three parameters on all tap functions let you
control what gets captured:

- `.numeric_summary = FALSE` — skip quantile computation entirely
- `.cols_include` — character vector of column names to include in the
  snapshot's schema (mutually exclusive with `.cols_exclude`)
- `.cols_exclude` — character vector of column names to exclude

Core invariants — `nrow`, `ncol`, and `total_nas` — are always recorded
regardless of these settings.

```{r snapshot-controls}
wide_data <- cbind(orders, matrix(rnorm(20 * 50), nrow = 20))

trail_ctrl <- audit_trail("snapshot_controls")

wide_data |>
  audit_tap(trail_ctrl, "full_snapshot") |>
  audit_tap(trail_ctrl, "minimal",
            .numeric_summary = FALSE,
            .cols_include = c("id", "amount", "status"))

print(trail_ctrl)
```

The "minimal" snapshot still knows the data has `r ncol(wide_data)` columns and
20 rows, but its schema only describes the three columns you asked for and
contains no numeric summaries.

## Tabulation in pipelines

`tab()` produces one-way frequency tables or two-way crosstabulations:

```{r tab-standalone}
tab(orders, status)
tab(orders, status, customer)
```

`tab_tap()` embeds a tabulation inside a pipeline as a custom diagnostic
annotation. It runs `tab()` on the data, stores the result in the snapshot, and
returns the data unchanged — useful for tracking how categorical distributions
shift across pipeline steps:

```{r tab-tap}
trail_tab <- audit_trail("tab_pipeline")

result_tab <- orders |>
  tab_tap(status, .trail = trail_tab, .label = "status_dist") |>
  filter(status == "complete") |>
  tab_tap(customer, .trail = trail_tab, .label = "customer_dist",
          .sort = "freq_desc")

print(trail_tab)
```

## Standalone mode

All tap functions work without a trail. When `.trail = NULL` (the default):

- **No diagnostic args**: behaves like the plain dplyr function
- **With `.stat` or `.warn_threshold`**: runs diagnostics and prints results
  without recording to a trail

This makes it easy to add quick diagnostics to any pipeline without setting up a
full trail:

```{r null-trail}
# Plain filter -- no diagnostics
orders |> filter_tap(amount > 100) |> nrow()

# Diagnostics without a trail
orders |> filter_tap(amount > 100, .stat = amount) |> invisible()
```
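The `.warn_threshold` path works the same way. Its exact semantics are
documented in `?filter_tap`; the assumption in this sketch is that it takes a
fraction of rows dropped above which a warning is emitted:

```{r warn-threshold}
# Assumption: warn when more than 25% of rows are removed by this filter
orders |> filter_tap(amount > 100, .warn_threshold = 0.25) |> nrow()
```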

## Exporting and sharing trails

Trails live in-memory as environment-based S3 objects. To share them with
colleagues, CI systems, or documentation, convert them to portable formats.

### Converting to R objects

```{r trail-to-objects}
# As a plain R list (suitable for jsonlite::toJSON())
trail_list <- trail_to_list(trail3)
str(trail_list, max.level = 2)

# As a data.frame (one row per snapshot)
trail_df <- trail_to_df(trail3)
print(trail_df)
```

### Saving and loading trails

RDS format preserves all R types and round-trips perfectly:

```{r trail-rds}
tmp_rds <- tempfile(fileext = ".rds")
write_trail(trail3, tmp_rds)
restored <- read_trail(tmp_rds)
print(restored)
```

JSON format is available for interoperability with other tools (requires
jsonlite):

```{r trail-json}
tmp_json <- tempfile(fileext = ".json")
write_trail(trail3, tmp_json, format = "json")
```
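The exported file is ordinary JSON, so it can be read back with standard
tooling. A sketch using jsonlite's own API rather than a tidyaudit function:

```{r read-json, eval = requireNamespace("jsonlite", quietly = TRUE)}
# Keep the original list structure instead of simplifying to vectors
parsed <- jsonlite::fromJSON(tmp_json, simplifyVector = FALSE)
names(parsed)
```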

### HTML visualization

`audit_export()` produces a self-contained HTML file — one file you can email,
embed in documentation, or drop into a compliance folder:

```{r trail-html}
audit_export(trail3, tempfile(fileext = ".html"))
```

The HTML page renders an interactive pipeline flow diagram with light/dark theme
toggle. Nodes are clickable and expand to show column schema and diagnostics.
Edges show the full diff between adjacent snapshots. No server or internet
connection required.
