---
title: "clinCompare: Dataset Comparison with CDISC Validation"
author: "Siddharth Lokineni"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{clinCompare: Dataset Comparison with CDISC Validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

clinCompare is an R package for comparing datasets at the dataset, variable,
and observation level. For clinical trial data, an optional CDISC validation
layer checks SDTM and ADaM conformance automatically. The package is designed
for statistical programmers, data managers, and regulatory professionals who
need to ensure data quality and compliance with industry standards.

### Key Features

- Compare dimensions, variable names, data types, and values in a single call
- Key-based row matching with auto-detected CDISC ID variables
- CDISC validation for 51 SDTM domains and 14 ADaM datasets
- Export results to HTML, plain text, or Excel
- Batch compare entire submissions across two directories
- Numeric tolerance for floating-point comparisons

## Getting Started

```{r setup}
library(clinCompare)
```

## Basic Dataset Comparison

### Comparing Two Data Frames

The `compare_datasets()` function gives a comprehensive overview: dimension
checks, variable comparison, type mismatches, and row-level value differences.

```{r compare-datasets}
baseline <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 52, 38),
  SEX     = c("M", "F", "M"),
  RACE    = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)

updated <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 53, 38),
  SEX     = c("M", "F", "F"),
  RACE    = c("WHITE", "WHITE", "ASIAN"),
  stringsAsFactors = FALSE
)

result <- compare_datasets(baseline, updated)
result
```

The result is a structured list you can drill into programmatically:

```{r drill-into-result}
# Per-column difference counts
result$observation_comparison$discrepancies

# Row-level details for a specific variable
result$observation_comparison$details$SEX
```

### Comparing Variables

Use `compare_variables()` to focus on structural differences between two
datasets: column names, data types, and variable ordering.

```{r compare-variables}
df_a <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE     = c(45, 52),
  SEX     = c("M", "F"),
  stringsAsFactors = FALSE
)

df_b <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE     = c(45L, 52L),
  WEIGHT  = c(75.5, 80.2),
  stringsAsFactors = FALSE
)

compare_variables(df_a, df_b)
```
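
To make the idea of a structural comparison concrete, here is a minimal base R
sketch of the same checks. It is not how `compare_variables()` is implemented;
the object names (`struct_a`, `struct_b`, and so on) are illustrative only.

```{r sketch-structure}
# Two small data frames that differ in columns and in the type of AGE
struct_a <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE     = c(45, 52),
  SEX     = c("M", "F"),
  stringsAsFactors = FALSE
)
struct_b <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02"),
  AGE     = c(45L, 52L),
  WEIGHT  = c(75.5, 80.2),
  stringsAsFactors = FALSE
)

# Columns present in only one of the two datasets
only_in_a <- setdiff(names(struct_a), names(struct_b))
only_in_b <- setdiff(names(struct_b), names(struct_a))

# For shared columns, compare the first class of each column
common_cols <- intersect(names(struct_a), names(struct_b))
types_a <- vapply(struct_a[common_cols], function(x) class(x)[1], character(1))
types_b <- vapply(struct_b[common_cols], function(x) class(x)[1], character(1))
type_mismatch <- common_cols[types_a != types_b]

only_in_a      # "SEX"
only_in_b      # "WEIGHT"
type_mismatch  # "AGE" (numeric vs. integer)
```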

### Comparing Observations

Use `compare_observations()` for row-by-row value comparison on common columns:

```{r compare-observations}
df1 <- data.frame(
  ID    = c(1, 2, 3),
  SCORE = c(80, 90, 70),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  ID    = c(1, 2, 3),
  SCORE = c(80, 95, 70),
  stringsAsFactors = FALSE
)

compare_observations(df1, df2)
```
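
Exact equality can flag differences that are only floating-point noise, which
is why a numeric tolerance matters for value comparisons. The base R sketch
below illustrates the idea; the threshold `tol` and the vector names are
assumptions for illustration, and it does not show how clinCompare exposes its
tolerance setting.

```{r sketch-tolerance}
stored   <- c(0.3, 1.0, 2.5)
computed <- c(0.1 + 0.2, 1.0, 2.6)

# Exact comparison: the first pair differs only by rounding error
exact_match <- stored == computed
exact_match   # FALSE TRUE FALSE

# Tolerance-based comparison: only the genuine change is reported
tol <- 1e-8
within_tol <- abs(stored - computed) <= tol
within_tol    # TRUE TRUE FALSE

# Positions that differ by more than the tolerance
which(!within_tol)   # 3
```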

## Data Preparation

### Cleaning Data

Remove duplicates and standardize text case before comparing:

```{r clean-dataset}
messy <- data.frame(
  NAME  = c("Alice", "alice", "Bob", "Bob"),
  SCORE = c(100, 100, 85, 85),
  stringsAsFactors = FALSE
)

clean_dataset(messy, remove_duplicates = TRUE, convert_to_case = "upper")
```
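
The order of the two cleaning steps matters: "Alice" and "alice" only become
duplicates once case is standardized. The base R sketch below shows that
sequence by hand. Whether `clean_dataset()` applies the steps in this order is
not stated here, so treat this as an illustration rather than a description of
its internals.

```{r sketch-clean}
raw_names <- data.frame(
  NAME  = c("Alice", "alice", "Bob", "Bob"),
  SCORE = c(100, 100, 85, 85),
  stringsAsFactors = FALSE
)

# Step 1: standardize case so "Alice" and "alice" become identical
raw_names$NAME <- toupper(raw_names$NAME)

# Step 2: drop rows that are exact duplicates of an earlier row
deduped <- raw_names[!duplicated(raw_names), ]
deduped$NAME   # "ALICE" "BOB"
```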

### Sorting and Filtering

Prepare two datasets identically before comparison:

```{r prepare-datasets}
df_unsorted1 <- data.frame(
  REGION = c("West", "East", "North"),
  SALES  = c(150, 200, 180)
)

df_unsorted2 <- data.frame(
  REGION = c("East", "North", "West"),
  SALES  = c(210, 185, 160)
)

prepped <- prepare_datasets(df_unsorted1, df_unsorted2, sort_columns = "REGION")
prepped$df1
prepped$df2
```
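
Sorting matters because a positional comparison assumes that row *i* in one
dataset describes the same record as row *i* in the other. A minimal base R
sketch of the alignment step, using illustrative object names:

```{r sketch-sort}
sales_a <- data.frame(
  REGION = c("West", "East", "North"),
  SALES  = c(150, 200, 180),
  stringsAsFactors = FALSE
)
sales_b <- data.frame(
  REGION = c("East", "North", "West"),
  SALES  = c(210, 185, 160),
  stringsAsFactors = FALSE
)

# Sort both data frames by the same column
sorted_a <- sales_a[order(sales_a$REGION), ]
sorted_b <- sales_b[order(sales_b$REGION), ]

# The key column now lines up, so row-wise differences are meaningful
identical(sorted_a$REGION, sorted_b$REGION)   # TRUE
sorted_b$SALES - sorted_a$SALES               # 10 5 10
```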

## Group-Wise Comparison

Compare datasets within specific subgroups, which is useful for multi-site or
multi-arm studies:

```{r compare-by-group}
site_data_v1 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE    = c(45, 52, 38, 61)
)

site_data_v2 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  SUBJID = c("S01", "S02", "S03", "S04"),
  AGE    = c(45, 53, 38, 62)
)

by_site <- compare_by_group(site_data_v1, site_data_v2, group_vars = "SITEID")
names(by_site)
```
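
As a rough picture of what a group-wise comparison computes, the base R sketch
below counts changed `AGE` values per site. It assumes both versions have the
same rows in the same order, and it is not the implementation of
`compare_by_group()`; the object names are illustrative.

```{r sketch-by-group}
grp_v1 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  AGE    = c(45, 52, 38, 61),
  stringsAsFactors = FALSE
)
grp_v2 <- data.frame(
  SITEID = c("SITE01", "SITE01", "SITE02", "SITE02"),
  AGE    = c(45, 53, 38, 62),
  stringsAsFactors = FALSE
)

# Row indices belonging to each site
rows_by_site <- split(seq_len(nrow(grp_v1)), grp_v1$SITEID)

# Number of changed AGE values within each site
changes_per_site <- vapply(
  rows_by_site,
  function(idx) sum(grp_v1$AGE[idx] != grp_v2$AGE[idx]),
  integer(1)
)
changes_per_site   # SITE01: 1, SITE02: 1
```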

## CDISC Comparison

### What is CDISC?

CDISC (Clinical Data Interchange Standards Consortium) provides standardized
formats for regulatory submissions:

- **SDTM** (Study Data Tabulation Model): Standardized tabulations of the data collected in a clinical trial
- **ADaM** (Analysis Data Model): Derived datasets used for statistical analysis

CDISC validation checks whether datasets conform to these standards and to the
regulatory requirements built on them. For official CDISC standards
documentation, see <https://www.cdisc.org/standards>.

### Auto-Detecting CDISC Domains

clinCompare auto-detects the CDISC domain of a dataset using column matching,
ADaM indicator columns, and filename hints:

```{r detect-domain}
dm_data <- data.frame(
  STUDYID  = rep("STUDY01", 3),
  USUBJID  = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE      = c(45, 62, 51),
  SEX      = c("M", "F", "M"),
  RACE     = c("WHITE", "BLACK", "ASIAN"),
  ARMCD    = c("TRT", "PBO", "TRT"),
  ARM      = c("Treatment", "Placebo", "Treatment"),
  stringsAsFactors = FALSE
)

detect_cdisc_domain(dm_data)
```
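
One simple way to think about column matching is to score each candidate
domain by the share of its expected variables found in the dataset. The sketch
below does that in base R. The `expected_vars` lists are short illustrative
subsets, not clinCompare's metadata, and the scoring rule is an assumption
rather than the algorithm `detect_cdisc_domain()` uses.

```{r sketch-detect}
# Abbreviated, illustrative variable lists for two SDTM domains
expected_vars <- list(
  DM = c("STUDYID", "USUBJID", "AGE", "SEX", "RACE", "ARMCD", "ARM"),
  AE = c("STUDYID", "USUBJID", "AESEQ", "AETERM", "AESTDTC")
)

# Column names of the dataset whose domain we want to guess
observed_cols <- c("STUDYID", "USUBJID", "AGE", "SEX", "RACE", "ARMCD", "ARM")

# Fraction of each domain's expected variables present in the dataset
match_scores <- vapply(
  expected_vars,
  function(vars) mean(vars %in% observed_cols),
  numeric(1)
)
match_scores                                   # DM = 1.0, AE = 0.4

best_guess <- names(match_scores)[which.max(match_scores)]
best_guess                                     # "DM"
```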

### Comparing with CDISC Validation

`cdisc_compare()` is the flagship function. In a single call, it compares two
datasets, auto-detects the CDISC domain and key variables, performs key-based
row matching, and validates against CDISC standards. The example passes
`domain` and `standard` explicitly rather than relying on auto-detection.

```{r cdisc-compare}
dm_v1 <- data.frame(
  STUDYID  = rep("STUDY01", 3),
  USUBJID  = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE      = c(45, 62, 51),
  SEX      = c("M", "F", "M"),
  RACE     = c("WHITE", "BLACK", "ASIAN"),
  ARMCD    = c("TRT", "PBO", "TRT"),
  ARM      = c("Treatment", "Placebo", "Treatment"),
  RFSTDTC  = c("2024-01-15", "2024-01-16", "2024-01-17"),
  stringsAsFactors = FALSE
)

dm_v2 <- data.frame(
  STUDYID  = rep("STUDY01", 3),
  USUBJID  = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE      = c(45, 62, 52),
  SEX      = c("M", "F", "M"),
  RACE     = c("WHITE", "BLACK", "ASIAN"),
  ARMCD    = c("TRT", "PBO", "TRT"),
  ARM      = c("Treatment", "Placebo", "Treatment"),
  RFSTDTC  = c("2024-01-15", "2024-01-16", "2024-01-17"),
  stringsAsFactors = FALSE
)

cdisc_result <- cdisc_compare(dm_v1, dm_v2, domain = "DM", standard = "SDTM")
cdisc_result
```
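
Key-based matching pairs rows by an identifier rather than by position, so a
reordered dataset does not produce spurious differences. A minimal base R
sketch of the idea using `merge()` follows. It is not the matching logic inside
`cdisc_compare()`, and the object names and suffixes are illustrative.

```{r sketch-key-match}
keyed_v1 <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 62, 51),
  stringsAsFactors = FALSE
)
# Same subjects in a different row order, with one changed AGE
keyed_v2 <- data.frame(
  USUBJID = c("SUBJ03", "SUBJ01", "SUBJ02"),
  AGE     = c(52, 45, 62),
  stringsAsFactors = FALSE
)

# Pair rows on the key; non-key columns get a suffix per source
paired <- merge(keyed_v1, keyed_v2, by = "USUBJID",
                suffixes = c("_base", "_compare"))

# Keep only subjects whose AGE changed
changed_rows <- paired[paired$AGE_base != paired$AGE_compare, ]
changed_rows$USUBJID   # "SUBJ03"
```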

### Validating a Single Dataset

Use `validate_cdisc()` to check a dataset against CDISC standards without
comparing it to another dataset:

```{r validate-cdisc}
validation <- validate_cdisc(dm_v1, domain = "DM", standard = "SDTM")
validation
```

### Extracting All Differences

`get_all_differences()` returns every value-level difference as a single
long-format data frame, making it easy to filter, count, or export:

```{r get-all-diffs}
diffs <- get_all_differences(cdisc_result)
diffs
```
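
For readers unfamiliar with the term, "long format" means one row per changed
value rather than one column per variable. The base R sketch below builds such
a table by hand. The column names (`variable`, `base_value`, `compare_value`)
are hypothetical and may not match what `get_all_differences()` returns. The
sketch also assumes aligned rows and no missing values.

```{r sketch-long-diffs}
long_base <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 52, 38),
  SEX     = c("M", "F", "M"),
  stringsAsFactors = FALSE
)
long_comp <- data.frame(
  USUBJID = c("SUBJ01", "SUBJ02", "SUBJ03"),
  AGE     = c(45, 53, 38),
  SEX     = c("M", "F", "F"),
  stringsAsFactors = FALSE
)

# One small data frame of changes per variable, stacked into one table
long_diffs <- do.call(rbind, lapply(c("AGE", "SEX"), function(v) {
  changed <- which(long_base[[v]] != long_comp[[v]])
  data.frame(
    USUBJID       = long_base$USUBJID[changed],
    variable      = rep(v, length(changed)),
    base_value    = as.character(long_base[[v]][changed]),
    compare_value = as.character(long_comp[[v]][changed]),
    stringsAsFactors = FALSE
  )
}))
long_diffs

# Long format makes per-variable counts a one-liner
table(long_diffs$variable)
```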

## Exporting Reports

`export_report()` auto-detects the output format from the file extension:

```{r export-report}
# HTML report
export_report(cdisc_result, file.path(tempdir(), "dm_report.html"))

# Text report
export_report(cdisc_result, file.path(tempdir(), "dm_report.txt"))
```

Excel export requires the `openxlsx` package:

```{r export-excel, eval=FALSE}
# Excel workbook with Summary, Variable Diffs, Value Diffs, and CDISC tabs
export_report(cdisc_result, file.path(tempdir(), "dm_report.xlsx"))
```
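
Choosing a format from the file extension can be pictured as a small lookup.
The helper below, `guess_report_format()`, is hypothetical and written only to
illustrate the idea; it is not part of clinCompare and does not reflect how
`export_report()` is implemented.

```{r sketch-format}
# Hypothetical helper: map a file extension to a report format label
guess_report_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    html = "html",
    txt  = "text",
    xlsx = "excel",
    stop("Unsupported file extension: ", ext)
  )
}

guess_report_format("dm_report.html")   # "html"
guess_report_format("dm_report.txt")    # "text"
guess_report_format("dm_report.XLSX")   # "excel"
```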

## Batch Comparing a Submission

`compare_submission()` scans two directories, matches files by name, and runs
`cdisc_compare()` on every matched pair. Domain, standard, and key variables
are all auto-detected per file.

```{r batch-compare, eval=FALSE}
results <- compare_submission(
  base_dir    = "submission_v1/",
  compare_dir = "submission_v2/",
  output_file = "submission_diff.xlsx"
)
```
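
Matching files by name across two directories amounts to a set intersection on
the file names. The base R sketch below demonstrates this with empty
placeholder files in a temporary directory. The `.csv` extension and directory
names are arbitrary choices for the illustration, and this is not the scanning
logic of `compare_submission()`.

```{r sketch-match-files}
# Create two throwaway directories with placeholder files
dir_v1 <- file.path(tempdir(), "sketch_submission_v1")
dir_v2 <- file.path(tempdir(), "sketch_submission_v2")
dir.create(dir_v1, showWarnings = FALSE)
dir.create(dir_v2, showWarnings = FALSE)
invisible(file.create(file.path(dir_v1, c("dm.csv", "ae.csv"))))
invisible(file.create(file.path(dir_v2, c("dm.csv", "lb.csv"))))

files_v1 <- list.files(dir_v1)
files_v2 <- list.files(dir_v2)

# Files present in both directories are candidates for comparison
matched_files   <- intersect(files_v1, files_v2)
only_in_v1      <- setdiff(files_v1, files_v2)
only_in_v2      <- setdiff(files_v2, files_v1)

matched_files   # "dm.csv"
only_in_v1      # "ae.csv"
only_in_v2      # "lb.csv"

# Remove the throwaway directories
unlink(c(dir_v1, dir_v2), recursive = TRUE)
```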

## CDISC Coverage

clinCompare ships with hand-curated metadata for **51 SDTM domains**
(IG 3.4, with 3.3 support) and **14 ADaM datasets** (IG 1.3, with
1.2/1.1 provenance tracking).

**SDTM domains:** AE, AG, BE, BS, CE, CM, CO, CP, DA, DD, DM, DS, DV, EC,
EG, EX, FA, GF, HO, IE, IS, LB, MB, MH, MI, ML, MS, PC, PE, PP, PR, QS,
RELREC, RS, SC, SE, SM, SS, SU, SUPPQUAL, SV, TA, TD, TE, TI, TM, TR, TS,
TU, TV, VS.

**ADaM datasets:** ADAE, ADCM, ADEG, ADEFF, ADEX, ADLB, ADMH, ADPC, ADPP,
ADRS, ADSL, ADTR, ADTTE, ADVS.

**Disclaimer:** clinCompare is a quality-assurance and exploratory analysis
tool. It is not a substitute for official CDISC compliance validation software
(e.g., Pinnacle 21). For regulatory submissions, always cross-reference with
your organization's validated tools.

## Summary

clinCompare provides a complete workflow for dataset comparison in clinical
trials: compare any two data frames with `compare_datasets()`, add CDISC
validation with `cdisc_compare()`, batch process entire submissions with
`compare_submission()`, and export results to HTML, text, or Excel with
`export_report()`.

For more information and additional examples, visit the
[GitHub repository](https://github.com/siddharthlokineni/clinCompare).