You receive monthly customer exports from a CRM system. The data should have unique customer_id values and complete email addresses. One month, someone upstream changes the export logic. Now customer_id has duplicates and some emails are missing.
Without explicit checks, you won’t notice until something breaks downstream—wrong row counts after a join, duplicated invoices, failed email campaigns.
# January export: clean data
january <- data.frame(
customer_id = c(101, 102, 103, 104, 105),
email = c("alice@example.com", "bob@example.com", "carol@example.com",
"dave@example.com", "eve@example.com"),
segment = c("premium", "basic", "premium", "basic", "premium")
)
# February export: corrupted upstream (duplicates + missing email)
february <- data.frame(
customer_id = c(101, 102, 102, 104, 105), # Note: 102 is duplicated
email = c("alice@example.com", "bob@example.com", NA,
"dave@example.com", "eve@example.com"),
segment = c("premium", "basic", "basic", "basic", "premium")
)
The February data looks fine at a glance:
head(february)
#> customer_id email segment
#> 1 101 alice@example.com premium
#> 2 102 bob@example.com basic
#> 3 102 <NA> basic
#> 4 104 dave@example.com basic
#> 5 105 eve@example.com premium
nrow(february) # Same row count
#> [1] 5
But it will silently corrupt your analysis.
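Where does it bite? A minimal sketch with base R; the invoices table here is invented for illustration:
# Customer 102 matches twice, so a two-row invoice table joins to three rows
invoices <- data.frame(customer_id = c(101, 102), amount = c(10, 20))
nrow(merge(invoices, february, by = "customer_id"))
#> [1] 3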
keyed catches these issues by making your assumptions explicit:
# Define what you expect: customer_id is unique
january_keyed <- january |>
key(customer_id) |>
lock_no_na(email)
# This works - January data is clean
january_keyed
#> # A keyed tibble: 5 x 3
#> # Key: customer_id
#> customer_id email segment
#> <dbl> <chr> <chr>
#> 1 101 alice@example.com premium
#> 2 102 bob@example.com basic
#> 3 103 carol@example.com premium
#> 4 104 dave@example.com basic
#> 5 105 eve@example.com premium
Now try the same with February’s corrupted data:
# This warns immediately - duplicates detected
february |>
key(customer_id)
#> Warning: Key is not unique.
#> ℹ 1 duplicate key value(s) found.
#> ℹ Key columns: customer_id
#> # A keyed tibble: 5 x 3
#> # Key: customer_id
#> customer_id email segment
#> <dbl> <chr> <chr>
#> 1 101 alice@example.com premium
#> 2 102 bob@example.com basic
#> 3 102 <NA> basic
#> 4 104 dave@example.com basic
#> 5 105 eve@example.com premium
The warning catches the problem at import time, not downstream when you’re debugging a mysterious row count mismatch.
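To see which rows are at fault, find_duplicates() (listed in the function reference below) is the likely tool; a sketch, assuming it accepts a keyed data frame directly:
# Hypothetical usage - the exact signature and output format may differ
february |>
  key(customer_id) |>
  find_duplicates()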
Goal: Validate each month’s export against expected constraints before processing.
Challenge: Data quality varies month-to-month. Silent corruption causes cascading errors.
Strategy: Define keys and assumptions once, apply consistently to each import.
validate_customer_export <- function(df) {
df |>
key(customer_id) |>
lock_no_na(email) |>
lock_nrow(min = 1)
}
# January: passes
january_clean <- validate_customer_export(january)
summary(january_clean)
#>
#> ── Keyed Data Frame Summary
#> Dimensions: 5 rows x 3 columns
#>
#> Key columns: customer_id
#> ✔ Key is unique
#>
#> Row IDs: none
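Running the same validator on February’s export surfaces both problems before any processing happens. A sketch, wrapped in try() on the assumption that lock_no_na() errors on the missing email:
# February: duplicate-key warning from key(), then an error from lock_no_na()
try(validate_customer_export(february))
Once defined, keys persist through dplyr operations: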
# Filter preserves key
premium_customers <- january_clean |>
filter(segment == "premium")
has_key(premium_customers)
#> [1] TRUE
get_key_cols(premium_customers)
#> [1] "customer_id"
# Mutate preserves key
enriched <- january_clean |>
mutate(domain = sub(".*@", "", email))
has_key(enriched)
#> [1] TRUE
If an operation breaks uniqueness, keyed errors and tells you to use
unkey() first:
# This creates duplicates - keyed stops you
january_clean |>
mutate(customer_id = 1)
#> Error in `mutate()`:
#> ! Key is no longer unique after transformation.
#> ℹ Use `unkey()` first if you intend to break uniqueness.
To proceed, you must explicitly acknowledge breaking the key:
january_clean |>
unkey() |>
mutate(customer_id = 1)
#> # A tibble: 5 × 3
#> customer_id email segment
#> <dbl> <chr> <chr>
#> 1 1 alice@example.com premium
#> 2 1 bob@example.com basic
#> 3 1 carol@example.com premium
#> 4 1 dave@example.com basic
#> 5 1 eve@example.com premium
Goal: Join customer data with orders without accidentally duplicating rows.
Challenge: Join cardinality mistakes are common and hard to debug. A “one-to-one” join that’s actually one-to-many silently inflates your data.
Strategy: Use diagnose_join() to understand cardinality before joining.
customers <- data.frame(
customer_id = 1:5,
name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
tier = c("gold", "silver", "gold", "bronze", "silver")
) |>
key(customer_id)
orders <- data.frame(
order_id = 1:8,
customer_id = c(1, 1, 2, 3, 3, 3, 4, 5),
amount = c(100, 150, 200, 50, 75, 125, 300, 80)
) |>
key(order_id)
diagnose_join(customers, orders, by = "customer_id", use_joinspy = FALSE)
#>
#> ── Join Diagnosis
#> Cardinality: one-to-many
#> x: 5 rows, unique
#> y: 8 rows, 3 duplicates
The diagnosis shows:
- Cardinality is one-to-many: each customer can have multiple orders
- Coverage: how many keys match vs. don’t match
Now you know what to expect. A left_join() will create 8 rows (one per order), not 5 (one per customer).
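A quick confirmation with dplyr, unkeying both tables first since customer_id is no longer unique in the joined result:
library(dplyr)
joined <- left_join(unkey(customers), unkey(orders), by = "customer_id")
nrow(joined)
#> [1] 8
For a key-level view before joining, compare_keys() summarizes the overlap: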
compare_keys(customers, orders)
#>
#> ── Key Comparison
#> Comparing on: customer_id
#>
#> x: 5 unique keys
#> y: 5 unique keys
#>
#> Common: 5 (100.0% of x)
#> Only in x: 0
#> Only in y: 0
This shows the join key exists in both tables but with different uniqueness properties—essential information before joining.
Goal: Track which original rows survive through a complex pipeline.
Challenge: After filtering, aggregating, and joining, you lose track of which source rows contributed to your final data.
Strategy: Use add_id() to attach stable identifiers that survive transformations.
# Add UUIDs to rows
customers_tracked <- customers |>
add_id()
customers_tracked
#> # A keyed tibble: 5 x 4
#> # Key: customer_id | .id
#> .id customer_id name tier
#> <chr> <int> <chr> <chr>
#> 1 e87304fc-09ed-4634-8caa-a9d9cf2352cc 1 Alice gold
#> 2 d4c8b392-666d-43e1-8178-ce8e01efd218 2 Bob silver
#> 3 149ca4bd-d304-46a8-822b-fe344600d006 3 Carol gold
#> 4 d6031d5d-90eb-44f7-96a8-db4a0ef6da72 4 Dave bronze
#> 5 2aad2788-779c-48fb-86de-8dfd71222a2c 5 Eve silver
# Filter: IDs persist
gold_customers <- customers_tracked |>
filter(tier == "gold")
get_id(gold_customers)
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"
# Compare with original
compare_ids(customers_tracked, gold_customers)
#> $lost
#> [1] "d4c8b392-666d-43e1-8178-ce8e01efd218"
#> [2] "d6031d5d-90eb-44f7-96a8-db4a0ef6da72"
#> [3] "2aad2788-779c-48fb-86de-8dfd71222a2c"
#>
#> $gained
#> character(0)
#>
#> $preserved
#> [1] "e87304fc-09ed-4634-8caa-a9d9cf2352cc"
#> [2] "149ca4bd-d304-46a8-822b-fe344600d006"The comparison shows exactly which rows were lost (filtered out) and which were preserved.
When appending new data, bind_id() handles ID conflicts:
batch1 <- data.frame(x = 1:3) |> add_id()
batch2 <- data.frame(x = 4:6) # No IDs yet
# bind_id assigns new IDs to batch2 and checks for conflicts
combined <- bind_id(batch1, batch2)
combined
#> .id x
#> 1 beb82b30-6d2b-4a1a-a952-9b710fcf7f62 1
#> 2 766c5fb1-2c63-4b61-9ce3-c46a80e92cfa 2
#> 3 7aac4ff0-a2f9-4965-abf1-7d1501bbf0b6 3
#> 4 c42f55c3-c5d8-4f76-a95e-9303876450fd 4
#> 5 a4ab668b-4e54-4dc1-bdc2-21686e4944d6 5
#> 6 30995197-d6a4-4938-8904-92358c8c7088 6
Goal: Detect when data changes unexpectedly between pipeline runs.
Challenge: Reference data (lookup tables, dimension tables) changes upstream without notice. Your pipeline silently uses stale assumptions.
Strategy: Commit snapshots with commit_keyed() and check for drift with check_drift().
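The example below assumes a small reference table was keyed and committed in an earlier run; the table contents and the exact commit_keyed() call are illustrative:
# Hypothetical setup: snapshot the reference table before the upstream change
reference_data <- data.frame(
  region = c("US", "EU", "APAC"),
  tax_rate = c(0.08, 0.20, 0.10)
) |>
  key(region)
commit_keyed(reference_data)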
# Simulate upstream change: EU tax rate changed
modified_data <- reference_data
modified_data$tax_rate[2] <- 0.21
# Drift detected!
check_drift(modified_data)
#>
#> ── Drift Report
#> ! Drift detected
#> Snapshot: 76a76466... (2026-02-03 22:34)
#> ℹ Key values changed
#> ℹ Cell values modified
The drift report shows exactly what changed, letting you decide whether to accept the new data or investigate.
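If the change is legitimate, re-committing makes the new values the baseline; a sketch, assuming commit_keyed() replaces the previous snapshot:
# Accept the new EU rate as the reference going forward (assumed behavior)
commit_keyed(modified_data)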
| Function | Purpose |
|---|---|
| `key()` | Define key columns (validates uniqueness) |
| `unkey()` | Remove key |
| `has_key()`, `get_key_cols()` | Query key status |
| Function | Validates |
|---|---|
| `lock_unique()` | No duplicate values |
| `lock_no_na()` | No missing values |
| `lock_complete()` | All expected values present |
| `lock_coverage()` | Reference values covered |
| `lock_nrow()` | Row count within bounds |
| Function | Purpose |
|---|---|
| `diagnose_join()` | Analyze join cardinality |
| `compare_keys()` | Compare key structures |
| `compare_ids()` | Compare row identities |
| `find_duplicates()` | Find duplicate key values |
| `key_status()` | Quick status summary |
| Function | Purpose |
|---|---|
| `add_id()` | Add UUID to rows |
| `get_id()` | Retrieve row IDs |
| `bind_id()` | Combine data with ID handling |
| `make_id()` | Create deterministic IDs from columns |
| `check_id()` | Validate ID integrity |
| Function | Purpose |
|---|---|
| `commit_keyed()` | Save reference snapshot |
| `check_drift()` | Compare against snapshot |
| `list_snapshots()` | View saved snapshots |
| `clear_snapshot()` | Remove specific snapshot |
keyed is designed for flat-file workflows without database infrastructure. If you need:
| Need | Better Alternative |
|---|---|
| Enforced schema | Database (SQLite, DuckDB) |
| Version history | Git, git2r |
| Full data validation | pointblank, validate |
| Production pipelines | targets |
keyed fills a specific gap: lightweight key tracking for exploratory and semi-structured workflows where heavier tools add friction.
- Design Philosophy - The reasoning behind keyed’s approach
- Function Reference - Complete API documentation