joinspy


Diagnostic Tools for Data Frame Joins in R

The joinspy package helps you understand and debug join operations by analyzing key columns before and after joins, detecting common issues, and explaining unexpected row count changes. Catch problems early instead of discovering them when downstream analysis breaks.

Quick Start

library(joinspy)

# Pre-join diagnostics
report <- join_spy(orders, customers, by = "customer_id")
summary(report)

# Quick pass/fail check
key_check(orders, customers, by = "customer_id")

# Safe join with cardinality enforcement
result <- join_strict(orders, customers, by = "customer_id", expect = "1:1")

# Auto-repair common issues
orders_fixed <- join_repair(orders, by = "customer_id")

Statement of Need

Joins can silently produce unexpected results, for example when key columns contain duplicates or stray whitespace.

Such problems often surface only when downstream analysis breaks. joinspy catches them up front by analyzing keys before you join, explaining why a join misbehaves, and showing where the problems are.
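To make the failure mode concrete, here is a minimal sketch in base R only (it does not call joinspy). The `customers` table below is hypothetical illustration data with a duplicated key, and `orders` carries a trailing space in one key. `merge()` raises no warning, yet the row count changes and one order disappears.

```r
# Base R only; illustration data, not taken from a real dataset.
orders <- data.frame(
  customer_id = c("A", "B", "B", "C", "D "),  # note the trailing space in "D "
  amount = c(100, 200, 150, 300, 50),
  stringsAsFactors = FALSE
)
customers <- data.frame(
  customer_id = c("A", "B", "B", "C", "D"),   # hypothetical: "B" appears twice
  name = c("Alice", "Bob", "Robert", "Carol", "David"),
  stringsAsFactors = FALSE
)

# Inner join on the key column; no warning or error is raised.
inner <- merge(orders, customers, by = "customer_id")

nrow(orders)  # 5 rows going in
nrow(inner)   # 6 rows coming out: "B" matched 2 x 2 times, "D " matched nothing
```

The duplicated `"B"` inflates the result while the unmatched `"D "` shrinks it, so the two effects partly cancel and the final row count looks almost plausible.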

Features

Pre-Join Diagnostics

Post-Join Analysis

Safe Join Wrappers

Auto-Repair

Advanced Analysis

Visualization & Logging

Installation

# Install from CRAN (when available)
install.packages("joinspy")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/joinspy")

Usage Examples

Pre-Join Diagnostics

library(joinspy)

orders <- data.frame(
  customer_id = c("A", "B", "B", "C", "D "),
  amount = c(100, 200, 150, 300, 50),
  stringsAsFactors = FALSE
)

customers <- data.frame(
  customer_id = c("A", "B", "C", "D", "E"),
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  stringsAsFactors = FALSE
)

# Full diagnostic report
report <- join_spy(orders, customers, by = "customer_id")

# Compact summary
summary(report)
#>              metric value
#> 1         left_rows     5
#> 2        right_rows     5
#> 3   left_unique_keys    4
#> 4  right_unique_keys    5
#> ...
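For readers who want to see where numbers like these come from, the sketch below computes comparable key statistics in base R. `key_summary()` is a hypothetical helper written for this illustration, not a joinspy function, and joinspy's own report may define or name its metrics differently.

```r
orders <- data.frame(
  customer_id = c("A", "B", "B", "C", "D "),
  amount = c(100, 200, 150, 300, 50),
  stringsAsFactors = FALSE
)
customers <- data.frame(
  customer_id = c("A", "B", "C", "D", "E"),
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of joinspy): basic key statistics.
key_summary <- function(x, y, by) {
  list(
    left_rows         = nrow(x),
    right_rows        = nrow(y),
    left_unique_keys  = length(unique(x[[by]])),
    right_unique_keys = length(unique(y[[by]])),
    left_only_keys    = setdiff(x[[by]], y[[by]]),  # keys with no match on the right
    right_only_keys   = setdiff(y[[by]], x[[by]])   # keys with no match on the left
  )
}

s <- key_summary(orders, customers, by = "customer_id")
s$left_only_keys   # "D " (trailing space, so it does not match "D")
s$right_only_keys  # "D" "E"
```

Listing the unmatched keys side by side makes the whitespace problem visible: `"D "` on the left and `"D"` on the right are different strings.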

Cardinality Enforcement

# Succeeds - 1:1 relationship
products <- data.frame(id = 1:3, name = c("Widget", "Gadget", "Gizmo"))
prices <- data.frame(id = 1:3, price = c(10, 20, 30))

join_strict(products, prices, by = "id", expect = "1:1")

# Fails - duplicates violate 1:1
prices_dup <- data.frame(id = c(1, 1, 2, 3), price = c(10, 15, 20, 30))
join_strict(products, prices_dup, by = "id", expect = "1:1")
#> Error: Cardinality violation: expected '1:1' but found '1:m'
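The idea behind a cardinality check can be sketched in base R by testing each key column for duplicates. `key_cardinality()` below is a hypothetical helper for illustration only; it is not how `join_strict()` is necessarily implemented, and it inspects the whole key column rather than only the matched keys.

```r
products   <- data.frame(id = 1:3, name = c("Widget", "Gadget", "Gizmo"))
prices     <- data.frame(id = 1:3, price = c(10, 20, 30))
prices_dup <- data.frame(id = c(1, 1, 2, 3), price = c(10, 15, 20, 30))

# Hypothetical helper (not part of joinspy): classify the key relationship.
key_cardinality <- function(x, y, by) {
  left_dup  <- anyDuplicated(x[[by]]) > 0   # TRUE if any key repeats on the left
  right_dup <- anyDuplicated(y[[by]]) > 0   # TRUE if any key repeats on the right
  if (left_dup && right_dup) {
    "m:m"
  } else if (left_dup) {
    "m:1"
  } else if (right_dup) {
    "1:m"
  } else {
    "1:1"
  }
}

key_cardinality(products, prices, by = "id")      # "1:1"
key_cardinality(products, prices_dup, by = "id")  # "1:m"
```

A strict wrapper could compare this label with the expected one and stop with an error before the join runs, which is the behavior the example above demonstrates.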

Auto-Repair

messy <- data.frame(
  id = c(" A", "B ", "  C  "),
  value = 1:3,
  stringsAsFactors = FALSE
)

# Preview what would be fixed
join_repair(messy, by = "id", dry_run = TRUE)

# Apply fixes
fixed <- join_repair(messy, by = "id")
fixed$id
#> [1] "A" "B" "C"
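For comparison, the whitespace part of this repair can be done by hand with base R's `trimws()`. This is only a sketch of the underlying idea; `join_repair()` may handle further issues that this snippet does not cover.

```r
messy <- data.frame(
  id = c(" A", "B ", "  C  "),
  value = 1:3,
  stringsAsFactors = FALSE
)

# "Dry run": flag keys that would change if whitespace were trimmed.
needs_trim <- messy$id != trimws(messy$id)
sum(needs_trim)  # 3, every key carries leading or trailing whitespace

# Apply the fix on a copy so the original stays available for comparison.
fixed_manual <- messy
fixed_manual$id <- trimws(fixed_manual$id)
fixed_manual$id
#> [1] "A" "B" "C"
```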

Silent Pipeline Mode

# Silent join for pipelines
result <- left_join_spy(orders, customers, by = "customer_id", .quiet = TRUE)

# Access diagnostics afterward
last_report()$match_analysis$match_rate
#> [1] 0.8
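One plausible reading of a match rate is the share of left-hand rows whose key also appears on the right. The base R sketch below uses that assumed definition with the `orders` and `customers` data from the pre-join example; the way joinspy actually computes `match_rate` is not specified here.

```r
orders <- data.frame(
  customer_id = c("A", "B", "B", "C", "D "),
  amount = c(100, 200, 150, 300, 50),
  stringsAsFactors = FALSE
)
customers <- data.frame(
  customer_id = c("A", "B", "C", "D", "E"),
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  stringsAsFactors = FALSE
)

# Assumed definition: fraction of left rows with at least one match on the right.
matched <- orders$customer_id %in% customers$customer_id
rate <- mean(matched)
rate
#> [1] 0.8

# The unmatched row is the one whose key has a trailing space.
orders[!matched, ]
```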

Visualization

report <- join_spy(orders, customers, by = "customer_id")

# Venn diagram
plot(report)

# Save to file
plot(report, file = "overlap.png")

Documentation

joinspy fills the gap: it tells you why joins misbehave and where the problems are.

Support

“Software is like sex: it’s better when it’s free.” — Linus Torvalds

I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks.


License

MIT (see the LICENSE.md file)

Citation

@software{joinspy,
  author = {Colling, Gilles},
  title = {joinspy: Diagnostic Tools for Data Frame Joins},
  year = {2025},
  url = {https://github.com/gcol33/joinspy}
}