---
title: "Diagnostic Functions Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Diagnostic Functions Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE}
library(tidyaudit)
library(dplyr)
```

Before you build a pipeline, you need to understand your data. Before you trust
a join, you need to verify the keys. Before you filter, you want to know what
you'll lose. tidyaudit's diagnostic functions help with the detective work that
comes before, during, and after every analysis.

These functions are designed for **interactive use** — the questions you ask in
the console while exploring a dataset or debugging a transformation. They
complement the [audit trail system](tidyaudit.html), which instruments
production pipelines.

## Before you join

You're about to left-join orders to customers. Will every order match? Will the
join create duplicates? How many rows will end up with NAs?

`validate_join()` analyzes a potential join **without performing it**, reporting
match rates, relationship type, duplicate keys, and unmatched rows:

```{r validate-join}
orders <- data.frame(
  id     = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
  id   = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

validate_join(orders, customers, by = "id")
```
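If the report flags unmatched rows and you want to see them yourself, a plain base R subset (no tidyaudit required) pulls them out of the `orders` and `customers` frames defined above:

```{r unmatched-orders}
# Orders whose id never appears in customers --
# these are the rows that would get NA names after a left join
orders[!orders$id %in% customers$id, ]
```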

### Different key names

When the key columns have different names, use a named vector:

```{r validate-join-named}
products <- data.frame(prod_id = 1:3, price = c(10, 20, 30))
sales    <- data.frame(item_id = c(1L, 1L, 2L), qty = c(5, 3, 7))

validate_join(products, sales, by = c("prod_id" = "item_id"))
```

### Tracking the impact on a metric

Want to know the revenue at risk from unmatched keys? Use `stat` (same column
name in both tables) or `stat_x`/`stat_y` (different column names):

```{r validate-join-stat}
x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

validate_join(x, y, by = "id", stat_x = "revenue", stat_y = "cost")
```

## Are your keys unique?

You assume `id` uniquely identifies every row. Does it?

`validate_primary_keys()` tests whether a set of columns forms a valid primary
key:

```{r validate-pk}
df <- data.frame(
  id    = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# Single column -- not unique
validate_primary_keys(df, "id")

# Composite key -- unique
validate_primary_keys(df, c("id", "group"))
```
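When `validate_primary_keys()` reports a failure, base R's `duplicated()` is a quick way to pull out the offending rows of `df` for inspection:

```{r pk-duplicates}
# All rows whose id occurs more than once
df[df$id %in% df$id[duplicated(df$id)], ]
```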

### Beyond uniqueness: what's the relationship?

`validate_var_relationship()` determines the functional relationship between two
columns — one-to-one, one-to-many, many-to-one, or many-to-many:

```{r validate-var-rel}
df2 <- data.frame(
  dept    = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)
validate_var_relationship(df2, "dept", "manager")
```
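To see for yourself why a relationship gets classified a given way, count the distinct partners on each side; in `df2`, every department maps to exactly one manager:

```{r rel-by-hand}
# Distinct managers per department -- all 1, so dept determines manager
tapply(df2$manager, df2$dept, function(m) length(unique(m)))
```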

## What changed between versions?

You have yesterday's extract and today's extract. Or the data before your
transformation and after. What's different?

`compare_tables()` compares two data frames by examining columns, row counts,
key overlap, and numeric discrepancies:

```{r compare-tables}
before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after  <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

compare_tables(before, after)
```
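To drill into which rows drove a numeric discrepancy, a direct base R comparison works when the two tables share keys and row order, as `before` and `after` do here:

```{r changed-rows}
# Side-by-side view of the rows whose value differs between extracts
changed <- before$value != after$value
data.frame(id = before$id, before = before$value, after = after$value)[changed, ]
```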

## Filter with visibility

`dplyr::filter()` silently removes rows. You apply the condition, the rows
vanish, and you move on. `filter_keep()` and `filter_drop()` do the same
filtering but tell you exactly what was removed — and optionally warn you when
too much is gone.

### filter_keep

Keeps rows where the condition is `TRUE` (same as `dplyr::filter()`), with
diagnostic output:

```{r filter-keep}
sales <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

result <- filter_keep(sales, amount > 100, .stat = amount)
```

### filter_drop

Drops rows where the condition is `TRUE` — the inverse of `filter_keep()`:

```{r filter-drop}
result2 <- filter_drop(sales, status == "suspect", .stat = amount)
```

### Warning thresholds

Set `.warn_threshold` to a proportion: if the filter drops more than that share
of rows, you get a warning — a safety net for production pipelines:

```{r filter-warn, warning=TRUE}
filter_keep(sales, amount > 1000, .stat = amount, .warn_threshold = 0.5)
```

## Diagnose data quality

You just received a dataset. Where are the gaps?

### Missing values

`diagnose_nas()` reports NA counts and percentages for every column:

```{r diagnose-nas}
messy <- data.frame(
  id    = 1:6,
  name  = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

diagnose_nas(messy)
```

### Column summaries

`summarize_column()` gives type-appropriate statistics for a single vector —
quantiles for numeric, value counts for character, balance for logical:

```{r summarize-column}
summarize_column(c(1, 2, 3, NA, 5, 10, 100))
summarize_column(c("apple", "banana", "apple", "cherry", NA))
```

`get_summary_table()` applies this across all columns (or a selection):

```{r get-summary-table}
get_summary_table(messy)
```

## Frequency tables

Base R's `table()` gives you raw counts. `tab()` gives you percentages,
cumulative totals, sorting, cutoffs, weighting, and two-way crosstabulations.

### One-way tables

```{r tab-oneway}
tab(mtcars, cyl)
```

### Sorting and cutoffs

```{r tab-sort}
# Sort by frequency
tab(mtcars, carb, .sort = "freq_desc")

# Keep only top-2 values, collapse rest into (Other)
tab(mtcars, carb, .cutoff = 2)
```

### Two-way crosstabulations

```{r tab-twoway}
tab(mtcars, cyl, gear)

# Show row percentages instead of counts
tab(mtcars, cyl, gear, .display = "row_pct")
```

### Weighted tabulation

```{r tab-weighted}
tab(mtcars, cyl, .wt = mpg)
```

## String quality

Duplicates hiding behind case differences, leading spaces, and encoding issues
are among the most common causes of failed joins and inflated group counts.

`diagnose_strings()` audits a character vector for these problems:

```{r diagnose-strings}
firms <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
diagnose_strings(firms)
```
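Once `diagnose_strings()` has surfaced the problems, the usual fix is to normalize before joining or grouping. A minimal base R cleanup of `firms` — trim, fold case, and treat empty strings as missing:

```{r clean-strings}
cleaned <- tolower(trimws(firms))
cleaned[!is.na(cleaned) & cleaned == ""] <- NA
unique(cleaned)
```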

## Audit your transformations

You're about to apply `trimws()` to a column, or `round()` to a price vector.
How many values will actually change? What will they look like afterward?

`audit_transform()` shows exactly what a transformation does. It automatically
detects the vector type and computes type-appropriate diagnostics:

- **Numeric**: mean/min/max shift, proportion changed
- **Date/POSIXct**: range change, proportion shifted
- **Factor**: level remapping (gained/lost levels)
- **Logical**: TRUE/FALSE/NA balance shift
- **Character**: string-level before/after comparison

### Character vectors

```{r audit-transform}
audit_transform(firms, trimws)
audit_transform(firms, tolower)
```

### Numeric vectors

```{r audit-transform-numeric}
prices <- c(10.456, 20.789, 30.123, NA, 50.999)
audit_transform(prices, round)
```

### Date vectors

```{r audit-transform-date}
dates <- as.Date(c("2024-01-15", "2024-06-30", "2024-12-01", NA))
audit_transform(dates, function(d) d + 30)
```

### Factor vectors

```{r audit-transform-factor}
sizes <- factor(c("S", "M", "L", "XL", "XXL", "S", "M"))
audit_transform(sizes, function(f) {
  levels(f)[levels(f) %in% c("XL", "XXL")] <- "XL+"
  f
})
```

### Logical vectors

```{r audit-transform-logical}
flags <- c(TRUE, FALSE, TRUE, NA, FALSE)
audit_transform(flags, function(x) !x)
```
