| Title: | Diagnostic Tools for Data Frame Joins |
| Version: | 0.7.3 |
| Language: | en-US |
| Description: | Provides diagnostic tools for understanding and debugging data frame joins. Analyzes key columns before joining to detect duplicates, mismatches, encoding issues, and other common problems. Explains unexpected row count changes and provides safe join wrappers with cardinality enforcement. Concepts and diagnostics build on tidy data principles as described in Wickham (2014) <doi:10.18637/jss.v059.i10>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | cli, rlang |
| Suggests: | dplyr, data.table, tibble, testthat (≥ 3.0.0), knitr, rmarkdown, shiny, miniUI |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| URL: | https://gillescolling.com/joinspy/, https://github.com/gcol33/joinspy |
| BugReports: | https://github.com/gcol33/joinspy/issues |
| Depends: | R (≥ 4.1) |
| NeedsCompilation: | no |
| Packaged: | 2026-01-19 20:31:30 UTC; Gilles Colling |
| Author: | Gilles Colling |
| Maintainer: | Gilles Colling <gilles.colling051@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-01-23 14:30:06 UTC |
joinspy: Diagnostic Tools for Data Frame Joins
Description
Provides diagnostic tools for understanding and debugging data frame joins. Analyzes key columns before joining to detect duplicates, mismatches, encoding issues, and other common problems. Explains unexpected row count changes and provides safe join wrappers with cardinality enforcement. Concepts and diagnostics build on tidy data principles as described in Wickham (2014) doi:10.18637/jss.v059.i10.
Author(s)
Maintainer: Gilles Colling gilles.colling051@gmail.com (ORCID) [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/gcol33/joinspy/issues
Analyze Match Between Two Key Sets
Description
Compares keys from two tables to determine overlap.
Usage
.analyze_match(x_keys, y_keys, x_n_rows)
Arguments
x_keys |
Vector of keys from left table (NA removed). |
y_keys |
Vector of keys from right table (NA removed). |
x_n_rows |
Total rows in left table. |
Value
A list with match analysis.
Analyze Multi-Column Key Breakdown
Description
For composite keys, determines which column(s) cause mismatches.
Usage
.analyze_multicolumn_keys(x, y, x_by, y_by)
Arguments
x |
Data frame (left table). |
y |
Data frame (right table). |
x_by |
Column names in x. |
y_by |
Column names in y. |
Value
A list with per-column match analysis.
Detect Case Mismatches Between Keys
Description
Finds keys that would match if case-insensitive but don't match case-sensitive.
Usage
.detect_case_mismatch(x, y)
Arguments
x |
A character vector (keys from left table). |
y |
A character vector (keys from right table). |
Value
A list with:
- has_issues
Logical indicating if case mismatches found
- mismatches
Data frame of key pairs that differ only by case
Detect Empty Strings in Keys
Description
Checks for empty strings which behave differently from NA in joins.
Usage
.detect_empty_strings(x)
Arguments
x |
A character vector to check. |
Value
A list with empty string detection results.
Detect Encoding Issues in Keys
Description
Checks for encoding inconsistencies and invisible Unicode characters.
Usage
.detect_encoding_issues(x)
Arguments
x |
A character vector to check. |
Value
A list with:
- has_issues
Logical indicating if encoding issues found
- mixed_encoding
Logical if multiple encodings detected
- invisible_chars
Indices of values with invisible Unicode
- affected_values
Values with encoding issues
Detect Factor Level Mismatches
Description
Checks if factor columns have mismatched levels.
Usage
.detect_factor_mismatch(x_col, y_col, x_name, y_name)
Arguments
x_col |
Factor column from left table. |
y_col |
Factor column from right table. |
x_name |
Name of left column. |
y_name |
Name of right column. |
Value
A list with factor mismatch information.
Detect Near-Matches Between Keys
Description
Finds keys that almost match (small edit distance).
Usage
.detect_near_matches(x_keys, y_keys, max_distance = 2, max_candidates = 10)
Arguments
x_keys |
Unique keys from left table. |
y_keys |
Unique keys from right table. |
max_distance |
Maximum 'Levenshtein' distance to consider a near-match. |
max_candidates |
Maximum number of near-match candidates to return. |
Value
A list with near-match information.
Detect Numeric Precision Issues
Description
Checks for floating-point keys that may not match exactly.
Usage
.detect_numeric_precision(x)
Arguments
x |
A numeric vector to check. |
Value
A list with precision issue detection results.
Detect Type Mismatches Between Key Columns
Description
Checks if key columns have compatible types for joining.
Usage
.detect_type_mismatch(x_col, y_col, x_name, y_name)
Arguments
x_col |
Column from left table. |
y_col |
Column from right table. |
x_name |
Name of left column. |
y_name |
Name of right column. |
Value
A list with type mismatch information.
Detect Whitespace Issues in Keys
Description
Checks for leading or trailing whitespace in character vectors.
Usage
.detect_whitespace(x)
Arguments
x |
A character vector to check. |
Value
A list with:
- has_issues
Logical indicating if whitespace issues found
- leading
Indices of values with leading whitespace
- trailing
Indices of values with trailing whitespace
- affected_values
Unique values with whitespace issues
Format Bytes as Human-Readable String
Description
Format Bytes as Human-Readable String
Usage
.format_bytes(bytes)
Arguments
bytes |
Numeric bytes. |
Value
Character string.
Internal join wrapper helper
Description
Internal join wrapper helper
Usage
.join_spy_impl(x, y, by, type, verbose, .quiet = FALSE, ...)
Arguments
x |
Left data frame. |
y |
Right data frame. |
by |
Column names. |
type |
Join type. |
verbose |
Print output. |
.quiet |
Suppress all output (overrides verbose). |
... |
Additional args to merge. |
Value
Joined data frame with report attribute.
Calculate 'Levenshtein' Distance
Description
Simple implementation for detecting near-matches.
Usage
.levenshtein(s1, s2)
Arguments
s1 |
First string. |
s2 |
Second string. |
Value
Integer distance.
Predict Row Counts for Different Join Types
Description
Estimates the number of rows that will result from different join types.
Usage
.predict_row_counts(x, y, by)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
Column names to join by. |
Value
A list with predicted row counts.
Convert JoinReport to list for serialization
Description
Convert JoinReport to list for serialization
Usage
.report_to_list(report, timestamp = TRUE)
Arguments
report |
A JoinReport object. |
timestamp |
Include timestamp. |
Value
List representation.
Convert JoinReport to text
Description
Convert JoinReport to text
Usage
.report_to_text(report, timestamp = TRUE)
Arguments
report |
A JoinReport object. |
timestamp |
Include timestamp. |
Value
Character string.
Summarize Key Column
Description
Creates a summary of a key column or composite key.
Usage
.summarize_keys(data, by)
Arguments
data |
A data frame. |
by |
Column name(s) to summarize. |
Value
A list with summary statistics.
Simple JSON serialization (no dependencies)
Description
Simple JSON serialization (no dependencies)
Usage
.to_json(x)
Arguments
x |
List to convert. |
Value
JSON string.
'RStudio' Addin: Join Inspector
Description
Opens an interactive dialog to explore join diagnostics between two data frames in the current R environment.
Usage
addin_join_inspector()
Value
Called for side effects (opens Shiny gadget).
Analyze Multi-Table Join Chain
Description
Analyzes a sequence of joins to identify potential issues in the chain. Useful for debugging complex multi-table joins.
Usage
analyze_join_chain(tables, joins)
Arguments
tables |
A named list of data frames to join. |
joins |
A list of join specifications, each with elements:
|
Value
A summary of the join chain analysis.
See Also
Examples
orders <- data.frame(order_id = 1:3, customer_id = c(1, 2, 2))
customers <- data.frame(customer_id = 1:3, region_id = c(1, 1, 2))
regions <- data.frame(region_id = 1:2, name = c("North", "South"))
analyze_join_chain(
tables = list(orders = orders, customers = customers, regions = regions),
joins = list(
list(left = "orders", right = "customers", by = "customer_id"),
list(left = "result", right = "regions", by = "region_id")
)
)
Detect Potential Cartesian Product
Description
Warns if a join will produce a very large result due to many-to-many relationships (Cartesian product explosion).
Usage
check_cartesian(x, y, by, threshold = 10)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
Column names to join by. |
threshold |
Warn if result will exceed this many times the larger input. Default 10. |
Value
A list with explosion analysis.
See Also
Examples
# Dangerous: both tables have duplicates
x <- data.frame(id = c(1, 1, 2, 2), val_x = 1:4)
y <- data.frame(id = c(1, 1, 2, 2), val_y = 1:4)
check_cartesian(x, y, by = "id")
Detect Join Relationship Type
Description
Determines the actual cardinality relationship between two tables.
Usage
detect_cardinality(x, y, by)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
Column names to join by. |
Value
Character string: "1:1", "1:m", "m:1", or "m:m".
See Also
Examples
# 1:1 relationship
x <- data.frame(id = 1:3, val = 1:3)
y <- data.frame(id = 1:3, name = c("A", "B", "C"))
detect_cardinality(x, y, "id")
# 1:m relationship
x <- data.frame(id = 1:3, val = 1:3)
y <- data.frame(id = c(1, 1, 2, 3), name = c("A1", "A2", "B", "C"))
detect_cardinality(x, y, "id")
Full Join with Diagnostics
Description
Performs a full join and automatically prints diagnostic information.
Usage
full_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, ...)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
... |
Additional arguments passed to the underlying join function. |
Value
The joined data frame with a "join_report" attribute.
See Also
left_join_spy(), join_spy(), last_report()
Get Current Log File
Description
Returns the current automatic log file path, if set.
Usage
get_log_file()
Value
The log file path, or NULL if not set.
See Also
Inner Join with Diagnostics
Description
Performs an inner join and automatically prints diagnostic information.
Usage
inner_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, ...)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
... |
Additional arguments passed to the underlying join function. |
Value
The joined data frame with a "join_report" attribute.
See Also
left_join_spy(), join_spy(), last_report()
Check if Object is a JoinReport
Description
Check if Object is a JoinReport
Usage
is_join_report(x)
Arguments
x |
An object to test. |
Value
TRUE if x is a JoinReport, FALSE otherwise.
Compare Data Frame Before and After Join
Description
Shows a side-by-side comparison of key statistics before and after a join operation.
Usage
join_diff(before, after, by = NULL)
Arguments
before |
The original data frame (before joining). |
after |
The result data frame (after joining). |
by |
Optional. Column names to analyze for key statistics. |
Value
Invisibly returns a comparison summary. Prints a formatted comparison.
See Also
Examples
before <- data.frame(id = 1:3, x = letters[1:3])
after <- data.frame(id = c(1, 2, 2, 3), x = c("a", "b", "b", "c"), y = 1:4
)
join_diff(before, after)
Explain Row Count Changes After a Join
Description
After performing a join, use this function to understand why the row count changed. It analyzes the original tables and the result to explain the difference.
Usage
join_explain(result, x, y, by, type = NULL)
Arguments
result |
The result of a join operation. |
x |
The original left data frame. |
y |
The original right data frame. |
by |
A character vector of column names used in the join. |
type |
Character. The type of join that was performed. One of
|
Value
Invisibly returns a list with explanation details. Prints a human-readable explanation.
See Also
Examples
orders <- data.frame(id = c(1, 2, 2, 3), value = 1:4)
customers <- data.frame(id = c(1, 2, 2, 4), name = c("A", "B1", "B2", "D"))
result <- merge(orders, customers, by = "id", all.x = TRUE)
# Explain why we got more rows than expected
join_explain(result, orders, customers, by = "id", type = "left")
Repair Common Key Issues
Description
Automatically fixes trivial join key issues like whitespace and case mismatches. Returns the repaired data frame(s) with a summary of changes.
Usage
join_repair(
x,
y = NULL,
by,
trim_whitespace = TRUE,
standardize_case = NULL,
remove_invisible = TRUE,
empty_to_na = FALSE,
dry_run = FALSE
)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). If NULL, only repairs x. |
by |
A character vector of column names to repair. |
trim_whitespace |
Logical. Trim leading/trailing whitespace. Default TRUE. |
standardize_case |
Character. Standardize case to "lower", "upper", or NULL (no change). Default NULL. |
remove_invisible |
Logical. Remove invisible Unicode characters. Default TRUE. |
empty_to_na |
Logical. Convert empty strings to NA. Default FALSE. |
dry_run |
Logical. If TRUE, only report what would be changed without modifying data. Default FALSE. |
Value
If y is NULL, returns the repaired x. If both are provided,
returns a list with x and y. In dry_run mode, returns a summary of
proposed changes.
See Also
Examples
# Data with whitespace issues
orders <- data.frame(
id = c(" A", "B ", "C"),
value = 1:3,
stringsAsFactors = FALSE
)
# Dry run to see what would change
join_repair(orders, by = "id", dry_run = TRUE)
# Actually repair
orders_fixed <- join_repair(orders, by = "id")
Comprehensive Pre-Join Diagnostic Report
Description
Analyzes two data frames before joining to detect potential issues and predict the outcome. Returns a detailed report of key quality, match rates, and detected problems.
Usage
join_spy(x, y, by, sample = NULL, ...)
Arguments
x |
A data frame (left table in the join). |
y |
A data frame (right table in the join). |
by |
A character vector of column names to join by, or a named character
vector for joins where column names differ (e.g., |
sample |
Integer or NULL. If provided, randomly sample this many rows from each table for faster diagnostics on large datasets. Default NULL (analyze all rows). |
... |
Reserved for future use. |
Details
This function detects the following common join issues:
-
Duplicate keys: Keys appearing multiple times, which cause row multiplication during joins
-
Whitespace: Leading or trailing spaces that prevent matches
-
Case mismatches: Keys that differ only by case (e.g., "ABC" vs "abc")
-
Encoding issues: Different character encodings or invisible Unicode characters
-
NA values: Missing values in key columns
Value
A JoinReport object with the following components:
- x_summary
Summary statistics for keys in the left table
- y_summary
Summary statistics for keys in the right table
- match_analysis
Details of which keys will/won't match
- issues
List of detected problems (duplicates, whitespace, etc.)
- expected_rows
Predicted row counts for each join type
See Also
key_check(), join_explain(), join_strict()
Examples
# Create sample data with issues
orders <- data.frame(
order_id = 1:5,
customer_id = c("A", "B", "B", "C", "D ")
)
customers <- data.frame(
customer_id = c("A", "B", "C", "E"),
name = c("Alice", "Bob", "Carol", "Eve")
)
# Get diagnostic report
join_spy(orders, customers, by = "customer_id")
Strict Join with Cardinality Enforcement
Description
Performs a join operation that fails if the specified cardinality constraint is violated. Use this to catch unexpected many-to-many relationships early.
Usage
join_strict(
x,
y,
by,
type = c("left", "right", "inner", "full"),
expect = c("1:1", "1:m", "1:many", "m:1", "many:1", "m:m", "many:many"),
...
)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
type |
Character. The type of join to perform. One of |
expect |
Character. The expected cardinality relationship. One of:
|
... |
Additional arguments passed to the underlying join function. |
Value
The joined data frame if the cardinality constraint is satisfied. Throws an error if the constraint is violated.
See Also
Examples
orders <- data.frame(id = 1:3, product = c("A", "B", "C"))
customers <- data.frame(id = 1:3, name = c("Alice", "Bob", "Carol"))
# This succeeds (1:1 relationship)
join_strict(orders, customers, by = "id", expect = "1:1")
# This fails if customers had duplicate ids (wrapped in try to show error)
customers_dup <- data.frame(id = c(1, 1, 2), name = c("A1", "A2", "B"))
try(join_strict(orders, customers_dup, by = "id", expect = "1:1"))
Quick Key Quality Check
Description
A fast check of join key quality that returns a simple pass/fail status
with a brief summary. Use this for quick validation; use join_spy() for
detailed diagnostics.
Usage
key_check(x, y, by, warn = TRUE)
Arguments
x |
A data frame (left table in the join). |
y |
A data frame (right table in the join). |
by |
A character vector of column names to join by. |
warn |
Logical. If |
Value
Invisibly returns a logical: TRUE if no issues detected, FALSE otherwise.
Also prints a brief status message unless warn = FALSE.
See Also
Examples
orders <- data.frame(id = c(1, 2, 2, 3), value = 1:4)
customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D"))
# Quick check
key_check(orders, customers, by = "id")
# Silent check
is_ok <- key_check(orders, customers, by = "id", warn = FALSE)
Find Duplicate Keys
Description
Identifies rows with duplicate values in the specified key columns. Returns a data frame containing only the rows with duplicated keys, along with a count of occurrences.
Usage
key_duplicates(data, by, keep = c("all", "first", "last"))
Arguments
data |
A data frame. |
by |
A character vector of column names to check for duplicates. |
keep |
Character. Which duplicates to return:
|
Value
A data frame containing the duplicated rows, with an additional
column .n_duplicates showing how many times each key appears.
Returns an empty data frame (0 rows) if no duplicates found.
See Also
Examples
df <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 4),
value = letters[1:7]
)
# Find all duplicates
key_duplicates(df, by = "id")
# Find first occurrence only
key_duplicates(df, by = "id", keep = "first")
Get the Last Join Report
Description
Retrieves the most recent JoinReport object from any *_join_spy() call.
Useful when using .quiet = TRUE in pipelines and wanting to inspect
the diagnostics afterward.
Usage
last_report()
Value
The last JoinReport object, or NULL if no join has been performed.
See Also
Examples
orders <- data.frame(id = 1:3, value = c(10, 20, 30))
customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D"))
# Silent join in a pipeline
result <- left_join_spy(orders, customers, by = "id", .quiet = TRUE)
# Inspect the report afterward
last_report()
Left Join with Diagnostics
Description
Performs a left join and automatically prints diagnostic information about the operation. The diagnostic report is also attached as an attribute.
Usage
left_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, ...)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
... |
Additional arguments passed to the underlying join function. |
Value
The joined data frame with a "join_report" attribute containing
the diagnostic information.
See Also
join_spy(), join_strict(), last_report()
Examples
orders <- data.frame(id = 1:3, value = c(10, 20, 30))
customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D"))
result <- left_join_spy(orders, customers, by = "id")
# Access the diagnostic report
attr(result, "join_report")
# Silent mode for pipelines
result2 <- left_join_spy(orders, customers, by = "id", .quiet = TRUE)
last_report() # Access diagnostics afterward
Log Join Report to File
Description
Writes a JoinReport object to a file for audit trails and reproducibility.
Supports plain text, JSON, and RDS formats.
Usage
log_report(report, file, append = FALSE, timestamp = TRUE)
Arguments
report |
A |
file |
File path to write to. Extension determines format:
|
append |
Logical. If |
timestamp |
Logical. If |
Value
Invisibly returns the file path.
See Also
Examples
orders <- data.frame(id = 1:3, value = c(10, 20, 30))
customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D"))
report <- join_spy(orders, customers, by = "id")
# Log to temporary file
tmp <- tempfile(fileext = ".log")
log_report(report, tmp, append = TRUE)
unlink(tmp)
Create a JoinReport Object
Description
Create a JoinReport Object
Usage
new_join_report(
x_summary,
y_summary,
match_analysis,
issues,
expected_rows,
by
)
Arguments
x_summary |
Summary statistics for keys in the left table. |
y_summary |
Summary statistics for keys in the right table. |
match_analysis |
Details of which keys will/won't match. |
issues |
List of detected problems. |
expected_rows |
Predicted row counts for each join type. |
by |
The columns used for joining. |
Value
A JoinReport S3 object.
Plot Method for JoinReport
Description
Creates a Venn diagram showing key overlap between tables.
Usage
## S3 method for class 'JoinReport'
plot(
x,
file = NULL,
width = 6,
height = 5,
colors = c("#4A90D9", "#D94A4A"),
...
)
Arguments
x |
A |
file |
Optional file path to save the plot (PNG, SVG, or PDF based on extension). If NULL (default), displays in the current graphics device. |
width |
Width in inches (default 6). |
height |
Height in inches (default 5). |
colors |
Character vector of length 2 for left and right circle colors. |
... |
Additional arguments (ignored). |
Value
Invisibly returns the plot data (left_only, both, right_only counts).
Examples
orders <- data.frame(id = 1:5, val = 1:5)
customers <- data.frame(id = 3:7, name = letters[3:7])
report <- join_spy(orders, customers, by = "id")
plot(report)
Print Method for JoinReport
Description
Print Method for JoinReport
Usage
## S3 method for class 'JoinReport'
print(x, ...)
Arguments
x |
A |
... |
Additional arguments (ignored). |
Value
Invisibly returns x.
Right Join with Diagnostics
Description
Performs a right join and automatically prints diagnostic information.
Usage
right_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, ...)
Arguments
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
... |
Additional arguments passed to the underlying join function. |
Value
The joined data frame with a "join_report" attribute.
See Also
left_join_spy(), join_spy(), last_report()
Configure Automatic Logging
Description
Sets up automatic logging of all join reports to a specified file.
When enabled, every *_join_spy() call will append its report to the log.
Usage
set_log_file(file, format = c("text", "json"))
Arguments
file |
File path for automatic logging. Set to |
format |
Log format: "text" (default) or "json". |
Value
Invisibly returns the previous log file setting.
See Also
Examples
# Enable automatic logging to temp file
tmp <- tempfile(fileext = ".log")
old <- set_log_file(tmp)
# Disable logging and clean up
set_log_file(NULL)
unlink(tmp)
Suggest Repair Code
Description
Analyzes join issues and returns R code snippets to fix them.
Usage
suggest_repairs(report)
Arguments
report |
A |
Value
Character vector of R code snippets to fix detected issues.
See Also
Examples
orders <- data.frame(id = c("A ", "B"), val = 1:2, stringsAsFactors = FALSE)
customers <- data.frame(id = c("a", "b"), name = c("Alice", "Bob"), stringsAsFactors = FALSE)
report <- join_spy(orders, customers, by = "id")
suggest_repairs(report)
Summary Method for JoinReport
Description
Returns a compact summary data frame of the join diagnostic report.
Usage
## S3 method for class 'JoinReport'
summary(object, format = c("data.frame", "text", "markdown"), ...)
Arguments
object |
A |
format |
Output format: "data.frame" (default), "text", or "markdown". |
... |
Additional arguments (ignored). |
Value
A data frame with key metrics (or printed output for text/markdown).
Examples
orders <- data.frame(id = 1:5, val = 1:5)
customers <- data.frame(id = 3:7, name = letters[3:7])
report <- join_spy(orders, customers, by = "id")
summary(report)
summary(report, format = "markdown")