This vignette catalogs common join problems and shows how joinspy detects and resolves them. Each issue includes detection methods and recommended solutions.
Problem: When one or both tables have duplicate keys, joins multiply rows unexpectedly.
orders <- data.frame(
customer_id = c(1, 2, 2, 3),
amount = c(100, 50, 75, 200)
)
addresses <- data.frame(
customer_id = c(1, 2, 2, 3),
address = c("NYC", "LA", "SF", "Chicago")
)
join_spy(orders, addresses, by = "customer_id")
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: customer_id
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 4 Unique keys: 3 Duplicate keys: 1 NA keys: 0
#> Right table: Rows: 4 Unique keys: 3 Duplicate keys: 1 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 3
#> Keys only in left: 0
#> Keys only in right: 0
#> Match rate (left): "100%"
#>
#> ── Issues Detected ──
#>
#> ! Left table has 1 duplicate key(s) affecting 2 rows - may cause row multiplication
#> ! Right table has 1 duplicate key(s) affecting 2 rows - may cause row multiplication
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 6
#> left_join: 6
#> right_join: 6
#> full_join: 6Detection: join_spy() reports duplicate
counts and expected row multiplication.
Solution: Aggregate or filter duplicates before joining.
Problem: Invisible leading/trailing spaces prevent matches.
sales <- data.frame(
product = c("Widget", "Gadget ", " Gizmo"),
units = c(10, 20, 30),
stringsAsFactors = FALSE
)
inventory <- data.frame(
product = c("Widget", "Gadget", "Gizmo"),
stock = c(100, 200, 300),
stringsAsFactors = FALSE
)
join_spy(sales, inventory, by = "product")
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: product
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 1
#> Keys only in left: 2
#> Keys only in right: 2
#> Match rate (left): "33.3%"
#>
#> ── Issues Detected ──
#>
#> ! Left column 'product' has 2 value(s) with leading/trailing whitespace
#> ℹ 2 near-match(es) found (e.g., 'Gadget ' ~ 'Gadget', ' Gizmo' ~ 'Gizmo') - possible typos?
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 1
#> left_join: 3
#> right_join: 3
#> full_join: 5Detection: join_spy() flags whitespace
issues in the Issues section.
Solution: Use join_repair() or
trimws().
Problem: Keys differ only by case (“ABC” vs “abc”).
left <- data.frame(
code = c("ABC", "def", "GHI"),
value = 1:3,
stringsAsFactors = FALSE
)
right <- data.frame(
code = c("abc", "DEF", "ghi"),
label = c("A", "D", "G"),
stringsAsFactors = FALSE
)
join_spy(left, right, by = "code")
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: code
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 0
#> Keys only in left: 3
#> Keys only in right: 3
#> Match rate (left): "0%"
#>
#> ── Issues Detected ──
#>
#> ! 3 key(s) would match if case-insensitive (e.g., 'ABC' vs 'abc')
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 0
#> left_join: 3
#> right_join: 3
#> full_join: 6Detection: join_spy() detects case
mismatches when keys would match if case-insensitive.
Solution: Standardize case with
join_repair().
Problem: NA values in key columns never match (by design in R).
orders <- data.frame(
customer_id = c(1, NA, 3, NA),
amount = c(100, 200, 300, 400)
)
customers <- data.frame(
customer_id = c(1, 2, 3, NA),
name = c("Alice", "Bob", "Carol", "Unknown")
)
join_spy(orders, customers, by = "customer_id")
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: customer_id
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 4 Unique keys: 2 Duplicate keys: 0 NA keys: 2
#> Right table: Rows: 4 Unique keys: 3 Duplicate keys: 0 NA keys: 1
#>
#> ── Match Analysis ──
#>
#> Keys in both: 2
#> Keys only in left: 0
#> Keys only in right: 1
#> Match rate (left): "100%"
#>
#> ── Issues Detected ──
#>
#> ! Left table has 2 NA key(s) - these will not match
#> ! Right table has 1 NA key(s) - these will not match
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 2
#> left_join: 4
#> right_join: 4
#> full_join: 6Detection: join_spy() reports NA counts
in the Table Summary.
Solution: Handle NA values explicitly—remove them or replace with a placeholder.
Problem: Inner join returns zero rows when you expected matches.
system_a <- data.frame(
user_id = c("USR001", "USR002", "USR003"),
score = c(85, 90, 78),
stringsAsFactors = FALSE
)
system_b <- data.frame(
user_id = c("1", "2", "3"),
department = c("Sales", "Marketing", "Engineering"),
stringsAsFactors = FALSE
)
report <- join_spy(system_a, system_b, by = "user_id")Detection: Match analysis shows 0% match rate.
Solution: Create a mapping table or transform keys to a common format.
Problem: Both tables have duplicate keys, causing exponential row growth.
order_items <- data.frame(
order_id = c(1, 1, 2, 2, 2),
item = c("A", "B", "C", "D", "E")
)
order_payments <- data.frame(
order_id = c(1, 1, 2, 2),
payment = c("CC1", "CC2", "Cash", "Check")
)
report <- join_spy(order_items, order_payments, by = "order_id")Detection: Expected row counts show multiplication (inner join = 10 rows from 9 source rows).
Solution: Aggregate one table first, or use
check_cartesian().
Problem: Keys have different types (numeric vs character).
orders <- data.frame(
product_id = c(1, 2, 3),
quantity = c(10, 20, 30)
)
products <- data.frame(
product_id = c("1", "2", "3"),
name = c("Widget", "Gadget", "Gizmo"),
stringsAsFactors = FALSE
)
join_spy(orders, products, by = "product_id")
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: product_id
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 3
#> Keys only in left: 0
#> Keys only in right: 0
#> Match rate (left): "100%"
#>
#> ── Issues Detected ──
#>
#> ! Type mismatch: 'product_id' is numeric, 'product_id' is character - may cause unexpected results
#>
#> ── Expected Row Counts ──
#>
#> inner_join: 3
#> left_join: 3
#> right_join: 3
#> full_join: 3Detection: join_spy() flags type
coercion warnings.
Solution: Convert to matching types before joining.
Problem: Empty strings ("") and
NA behave differently in joins.
left <- data.frame(
id = c("A", "", "C"),
value = 1:3,
stringsAsFactors = FALSE
)
right <- data.frame(
id = c("A", "B", ""),
label = c("Alpha", "Beta", "Empty"),
stringsAsFactors = FALSE
)
join_spy(left, right, by = "id")
#>
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: id
#>
#> ── Table Summary ──
#>
#> Left table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 3 Unique keys: 3 Duplicate keys: 0 NA keys: 0
#>
#> ── Match Analysis ──
#>
#> Keys in both: 2
#> Keys only in left: 1
#> Keys only in right: 1
#> Match rate (left): "66.7%"
#>
#> ── Issues Detected ──
#>
#> ℹ Left column 'id' has 1 empty string(s) - these match other empty strings but not NA
#> ℹ Right column 'id' has 1 empty string(s) - these match other empty strings but not NA
#>
#> ── Expected Row Counts ──
#>
#> inner_join: NA
#> left_join: NA
#> right_join: NA
#> full_join: NADetection: join_spy() warns about empty
strings in keys.
Solution: Convert empty strings to NA with
join_repair().
When you know the expected cardinality, use
join_strict() to fail fast:
Let joinspy determine the actual relationship:
| Issue | Detection | Solution |
|---|---|---|
| Duplicates | join_spy(), key_duplicates() |
Aggregate or filter |
| Whitespace | join_spy(), key_check() |
join_repair(), trimws() |
| Case mismatch | join_spy() |
join_repair(standardize_case=) |
| NA keys | join_spy() Table Summary |
Remove or replace |
| No matches | join_spy() Match Analysis |
Check key format/mapping |
| M:M explosion | join_spy(), check_cartesian() |
Aggregate first |
| Type mismatch | join_spy() |
Convert types |
| Empty strings | join_spy() |
join_repair(empty_to_na=TRUE) |
join_spy(x, y, by) to get a comprehensive
diagnostickey_duplicates() to locate specific duplicate
rowsjoin_repair() to fix whitespace/case/encoding
issuesjoin_strict() to enforce expected cardinalityjoin_explain() to understand row
count changesvignette("introduction") - Getting started guide?join_spy, ?key_check,
?join_repair, ?join_strict?check_cartesian, ?detect_cardinalitysessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] joinspy_0.7.3
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0 xfun_0.55
#> [5] glue_1.8.0 cachem_1.1.0 knitr_1.51 htmltools_0.5.9
#> [9] rmarkdown_2.30 lifecycle_1.0.5 cli_3.6.5 vctrs_0.7.0
#> [13] sass_0.4.10 jquerylib_0.1.4 compiler_4.5.2 tools_4.5.2
#> [17] pillar_1.11.1 evaluate_1.0.5 bslib_0.9.0 yaml_2.3.12
#> [21] otel_0.2.0 rlang_1.1.7 jsonlite_2.0.0