---
title: "Getting Started with TernTables"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with TernTables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  warning   = FALSE,
  message   = FALSE
)
library(TernTables)
options(tibble.width = Inf)  # show all columns in printed tibbles
# Output directory for exported .docx files.
# Override by setting options(TernTables.vignette_outdir = "/your/path") before rendering.
out_dir <- getOption("TernTables.vignette_outdir", default = tempdir())
```

```{css, echo = FALSE}
img { border: none !important; box-shadow: none !important; }
```

## Overview

**TernTables** is built for clinical researchers who need to go from raw data to a manuscript-ready Word table — with variable detection, statistical test selection, and formatting all handled automatically.

Given a data frame and an optional grouping variable, it automatically:

- Detects each variable's type (continuous, binary, categorical)
- Selects the appropriate statistical test
- Formats *P* values and summary statistics for publication-ready tables
- Exports directly to a styled `.docx` Word file and generates a boilerplate
  statistical methods paragraph
- Returns a tibble for inspection, Excel export, or further analysis in R

Three table types are supported: **descriptive summaries** (single cohort, no
comparisons), **two-group comparisons** (with optional odds ratios), and
**comparisons across three or more groups**.

The convenience is in the automation, not in any compromise to statistical
rigor. Test selection follows established published criteria throughout:
normality assessed per group (Shapiro-Wilk within the default four-gate
`"ROBUST"` routing), Fisher's exact triggered by the Cochran (1954)
expected-cell criterion, and odds ratios reported as unadjusted with the first
factor level of the grouping variable as the reference. The auto-generated
methods paragraph describes the statistical approach used and is suitable as a
starting draft for a manuscript methods section.

> **No R required?** TernTables is available as a free point-and-click web
> application at [tern-tables.com](https://tern-tables.com/). Upload a CSV
> or XLSX, configure your table, and download a formatted Word document —
> all without writing a line of code. The web app is powered by this package,
> so the statistical methods, normality routing, and Word output are identical.
> A built-in side panel shows the R commands running in the background and
> the full script can be downloaded at the end of your session, making every
> analysis fully transparent and reproducible. For scripted or reproducible
> workflows, the R package (this vignette) remains the canonical reference.

## Example Dataset

```{r load-data}
data(tern_colon)
```

`tern_colon` is bundled with TernTables. It is derived from `survival::colon`
and contains 929 patients from a landmark colon cancer adjuvant chemotherapy
trial (Moertel et al., 1990), filtered to the recurrence endpoint — one row
per patient. See `?tern_colon` for full details.

Key variables used in these examples:

| Column | Description |
|---|---|
| `Age_Years` | Age at registration (years) |
| `Sex` | Female / Male |
| `Colonic_Obstruction` | Colonic obstruction present — n (%) |
| `Bowel_Perforation` | Bowel perforation present — n (%) |
| `Positive_Lymph_Nodes_n` | Number of positive lymph nodes |
| `Over_4_Positive_Nodes` | More than 4 positive lymph nodes — n (%) |
| `Tumor_Adherence` | Tumour adherence to nearby organs — n (%) |
| `Tumor_Differentiation` | Well / Moderate / Poor |
| `Extent_of_Local_Spread` | Depth of tumour penetration (4 levels) |
| `Recurrence` | No Recurrence / Recurrence — **2-group** |
| `Treatment_Arm` | Levamisole + 5FU / Levamisole / Observation — **3-group** |

---

## Preprocessing Raw Data (`ternP`)

If your source is a raw CSV or XLSX file — rather than an already-clean R
object — use `ternP()` to standardize it before passing it to `ternG()` or
`ternD()`. It handles the messiness most commonly introduced by manual data
entry or spreadsheet workflows:

| Transformation | What it fixes |
|---|---|
| String NA conversion | `"NA"`, `"na"`, `"Na"`, `"unk"` → `NA` |
| Whitespace trimming | Leading/trailing spaces in character columns |
| Empty column removal | 100% `NA` columns silently dropped |
| Blank row removal | Rows where every cell is `NA` |
| Case normalization | `"fEMALE"` / `"Female"` unified to title case |

`ternP()` also applies two **hard stops** before any cleaning takes place:
it errors immediately if any column name matches a protected health information
(PHI) pattern (e.g. `MRN`, `DOB`, `FirstName`), or if any unnamed column
contains data.
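
The cleaning rules listed above can be illustrated in base R. This is a minimal sketch, not the `ternP()` implementation: `clean_sketch()` and `phi_patterns` are hypothetical names, the PHI patterns are limited to the three examples mentioned above, and the unnamed-column check is omitted.

```r
# Hypothetical PHI patterns for illustration only (assumption, not the package's list).
phi_patterns <- c("mrn", "dob", "firstname")

clean_sketch <- function(df) {
  # Hard stop: refuse to continue if any column name looks like PHI.
  is_phi <- grepl(paste(phi_patterns, collapse = "|"), names(df), ignore.case = TRUE)
  if (any(is_phi)) {
    stop("Possible PHI column(s): ", paste(names(df)[is_phi], collapse = ", "))
  }

  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], function(x) {
      x <- trimws(x)                                  # whitespace trimming
      x[x %in% c("NA", "na", "Na", "unk")] <- NA      # string NA conversion
      # Case normalization: lower-case everything, then capitalise each word.
      gsub("(^|\\s)([a-z])", "\\1\\U\\2", tolower(x), perl = TRUE)
    })
  }

  df <- df[, colSums(!is.na(df)) > 0, drop = FALSE]   # drop 100% NA columns
  df <- df[rowSums(!is.na(df)) > 0, , drop = FALSE]   # drop fully blank rows
  rownames(df) <- NULL
  df
}

# Toy input (made up for this example).
raw_toy <- data.frame(
  Sex   = c(" fEMALE", "Male ", "unk", NA),
  Age   = c(61, 54, NA, NA),
  Empty = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
out <- clean_sketch(raw_toy)
```

With this toy input, `out` keeps the `Sex` and `Age` columns and the first two rows, with `Sex` normalised to `"Female"` and `"Male"`.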

```{r ternP-run, eval = FALSE}
# Load a messy CSV shipped with the package
path <- system.file("extdata/csv", "tern_colon_messy.csv",
                    package = "TernTables")
raw    <- readr::read_csv(path, show_col_types = FALSE)
result <- ternP(raw)
# The print method fires automatically, summarising every transformation applied.
```

The printed summary identifies each transformation and shows the final
dimensions of the cleaned data. If the data was already clean, a single
"No transformations required" line appears.

Three items are returned in the result object:

```{r ternP-access, eval = FALSE}
result$clean_data    # Cleaned, analysis-ready tibble
result$sparse_rows   # Rows with >50% NA (retained, not removed — review these)
result$feedback      # Named list; NULL elements mean no action was taken
```
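
As an illustration of the `sparse_rows` rule, rows with more than 50% missing values can be flagged in base R as follows. The data frame is a made-up toy example, and the code is a sketch of the idea rather than the package internals.

```r
# Toy data: row 2 has 3 of 4 cells missing; row 1 has exactly half missing.
df_toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 2),
  c = c("x", NA, NA),
  d = c(NA, 5, 1)
)

na_share    <- rowMeans(is.na(df_toy))                # share of NA cells per row
sparse_toy  <- df_toy[na_share > 0.5, , drop = FALSE] # strictly more than 50% NA
```

Only row 2 is flagged here, because a row with exactly half of its cells missing does not exceed the 50% threshold.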

To write a Word document recording the cleaning steps, call
`write_cleaning_doc()`. It is fully dynamic — only paragraphs for triggered
transformations are written, so the document is concise for already-clean data.

```{r ternP-doc, eval = FALSE}
write_cleaning_doc(result,
                   filename = file.path(out_dir, "cleaning_summary.docx"))
```

Once preprocessing is complete, pass `result$clean_data` directly to `ternD()`
or `ternG()`:

```{r ternP-handoff, eval = FALSE}
tbl <- ternG(result$clean_data,
             exclude_vars = c("ID"),
             group_var    = "Recurrence")
```

---

## Descriptive Table (`ternD`)

Use `ternD()` for a single cohort with no group comparisons — the standard
"Table 1" in a cohort description. Pass `output_docx` to write a
publication-ready Word file in the same call; pass `output_xlsx` to also save
the tibble as an Excel file. Use `category_start` to insert bold section headers
grouping related variables; anchors can be either the raw column name or the
cleaned display label.

```{r ternD-example, results = "hide"}
tbl_descriptive <- ternD(
  data             = tern_colon,
  exclude_vars     = c("ID"),
  output_docx      = file.path(out_dir, "Tern_descriptive.docx"),
  methods_filename = file.path(out_dir, "TernTables_methods.docx"),
  category_start   = c(
    "Patient Demographics"  = "Age (yr)",
    "Surgical Findings"     = "Colonic Obstruction",
    "Tumor Characteristics" = "Positive Lymph Nodes (n)",
    "Outcomes"              = "Recurrence"
  )
)
tbl_descriptive
```

Continuous variables show mean ± SD or median [IQR] based on the four-gate
ROBUST normality algorithm (n < 3 fail-safe, skewness check, CLT at n ≥ 30,
Shapiro-Wilk for small samples). Columns whose values are exactly Y/N,
YES/NO, or numeric 0/1 are detected as binary and shown as a single n (%) row
(the positive/yes count). All other categorical variables — including two-level
variables like Male/Female — are shown with each level as an indented sub-row.
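
The binary-detection rule described above can be sketched in base R. `is_binary_sketch()` is a hypothetical helper written for this vignette, and it assumes exact, case-sensitive matching of the listed values; the package's handling of case and of factors may differ.

```r
is_binary_sketch <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) return(FALSE)              # nothing observed
  if (is.numeric(vals)) return(all(vals %in% c(0, 1)))
  vals <- as.character(vals)                        # also covers factors
  all(vals %in% c("Y", "N")) || all(vals %in% c("YES", "NO"))
}

is_binary_sketch(c("Y", "N", "Y"))        # Y/N column
is_binary_sketch(c(0, 1, 1, NA))          # numeric 0/1 column with a missing value
is_binary_sketch(c("Male", "Female"))     # two-level categorical, not binary
```

Under this rule, a Male/Female column is not treated as binary, which is why it is displayed with one sub-row per level.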

Variable names are automatically cleaned for display (`smart_rename = TRUE` by
default) — underscores replaced with spaces, capitalisation normalised, and
common medical abbreviations formatted (e.g. `Age_Years` → `Age (yr)`,
`Positive_Lymph_Nodes_n` → `Positive Lymph Nodes (n)`). Pass
`smart_rename = FALSE` to use column names exactly as they appear in the data.

Descriptive summary table exported to Word:

```{r ternD-figure, echo=FALSE, fig.align="center", out.width="45%"}
knitr::include_graphics("figures/tern_descriptive.png")
```

---

## Two-Group Comparison (`ternG` — 2 levels)

Use `ternG()` to compare variables between two groups. Set `OR_col = TRUE` to
add odds ratios with 95% CI for binary variables (Y/N, YES/NO, 0/1) and
two-level categorical variables such as Male/Female. For two-level categoricals
displayed with sub-rows, the reference level (factor level 1 or alphabetical
first) shows `1.00 (ref.)`; the non-reference level shows the computed OR with
95% CI. Fisher's exact or Wald is chosen automatically based on expected cell
counts. Pass `output_docx` to write the Word table directly; `output_xlsx`
exports the tibble to Excel.

```{r ternG-2group, results = "hide"}
tbl_2group <- ternG(
  data             = tern_colon,
  exclude_vars     = c("ID"),
  group_var        = "Recurrence",
  output_docx      = file.path(out_dir, "Tern_2_group.docx"),
  methods_filename = file.path(out_dir, "TernTables_methods.docx"),
  OR_col           = TRUE,
  insert_subheads  = TRUE,
  category_start   = c(
    "Patient Demographics"  = "Age (yr)",
    "Surgical Findings"     = "Colonic Obstruction",
    "Tumor Characteristics" = "Positive Lymph Nodes (n)",
    "Treatment Details"     = "Treatment Arm"
  )
)
tbl_2group
```

The Word table includes an OR column (odds ratio with 95% CI for binary and
two-level categorical variables) and a *P* value column (the test *P* value for
each variable).
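
For readers who want to see how an unadjusted odds ratio with a 95% CI can be obtained from a 2×2 table, here is a minimal base-R sketch. `or_sketch()` is a hypothetical helper, the counts are made up (not taken from `tern_colon`), and the switch to Fisher's exact when any expected count is below 5 is an assumption about the threshold.

```r
or_sketch <- function(tab) {
  # Expected counts under independence decide which method is used.
  expected <- suppressWarnings(chisq.test(tab)$expected)
  if (any(expected < 5)) {
    ft <- fisher.test(tab)                      # conditional MLE and exact CI
    return(c(OR = unname(ft$estimate),
             lower = ft$conf.int[1], upper = ft$conf.int[2]))
  }
  # Wald interval on the log-odds scale.
  or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
  se <- sqrt(sum(1 / tab))
  z  <- qnorm(0.975)
  c(OR = or, lower = exp(log(or) - z * se), upper = exp(log(or) + z * se))
}

# Rows: variable levels (reference first); columns: groups (reference first).
tab_toy <- matrix(c(30, 20,
                    10, 40), nrow = 2, byrow = TRUE,
                  dimnames = list(Sex = c("Female", "Male"),
                                  Recurrence = c("No Recurrence", "Recurrence")))
or_wald <- or_sketch(tab_toy)
```

In this toy table the odds of recurrence are 40/10 for the second row and 20/30 for the reference row, so the point estimate is 6.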

Two-group comparison table exported to Word, with odds ratios and category section headers:

![](figures/tern_2_group.png){width=100%}

---

## Three or More Groups (`ternG` — 3+ levels)

The same `ternG()` function handles three or more groups automatically,
switching from t-test/Wilcoxon to Welch ANOVA/Kruskal-Wallis as appropriate.
Odds ratios are not available for 3+ group comparisons. `consider_normality`
controls normality routing: the default (`"ROBUST"`) applies the four-gate
algorithm (n < 3 fail-safe → skewness → CLT → Shapiro-Wilk), `TRUE` uses
Shapiro-Wilk alone, `FALSE` forces parametric tests throughout, and `"FORCE"`
forces nonparametric tests throughout.
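
The switch between two-group and 3+ group tests maps onto standard functions in the `stats` package. The sketch below shows that mapping; `compare_groups_sketch()` is a hypothetical helper, and the `parametric` flag stands in for the outcome of the normality routing.

```r
compare_groups_sketch <- function(y, g, parametric) {
  g <- factor(g)
  k <- nlevels(g)
  if (k == 2 && parametric) {
    return(t.test(y ~ g)$p.value)                         # Welch's t-test (default)
  }
  if (k == 2) {
    return(wilcox.test(y ~ g, exact = FALSE)$p.value)     # Wilcoxon rank-sum
  }
  if (parametric) {
    return(oneway.test(y ~ g, var.equal = FALSE)$p.value) # Welch ANOVA
  }
  kruskal.test(y ~ g)$p.value                             # Kruskal-Wallis
}

# Deterministic toy data: three groups shifted by one unit each.
base_vals <- qnorm(ppoints(12))
y_toy <- c(base_vals, base_vals + 1, base_vals + 2)
g_toy <- rep(c("A", "B", "C"), each = 12)

p_welch_anova <- compare_groups_sketch(y_toy, g_toy, parametric = TRUE)
p_kruskal     <- compare_groups_sketch(y_toy, g_toy, parametric = FALSE)
```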

Set `post_hoc = TRUE` to run pairwise post-hoc tests automatically when the
omnibus *P* < 0.05. The test is matched to the omnibus test used: **Games-Howell**
follows Welch ANOVA (parametric path); **Dunn's test with Holm correction**
follows Kruskal-Wallis (non-parametric and ordinal path). Results are appended
to each cell as compact letter display (CLD) superscripts — groups sharing a
letter are not significantly different after correction. Categorical variables
never receive post-hoc testing. When `post_hoc = TRUE` and at least one test
fires, an explanatory footnote is added automatically to the Word output.
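
To make the non-parametric post-hoc path concrete, the following base-R sketch computes Dunn's pairwise z statistics (with the usual tie correction) and applies the Holm adjustment via `p.adjust()`. `dunn_holm_sketch()` is a hypothetical helper; it reports two-sided *P* values, which is an assumption, and it does not produce the CLD letters.

```r
dunn_holm_sketch <- function(y, g) {
  ok <- !is.na(y) & !is.na(g)
  y  <- y[ok]
  g  <- factor(g[ok])

  r         <- rank(y)                                  # pooled ranks, ties averaged
  by_group  <- split(r, g)
  n         <- lengths(by_group)
  mean_rank <- vapply(by_group, mean, numeric(1))

  N      <- length(y)
  ties   <- as.numeric(table(y))
  sigma2 <- N * (N + 1) / 12 - sum(ties^3 - ties) / (12 * (N - 1))

  pairs <- combn(levels(g), 2)                          # one column per pair
  z <- apply(pairs, 2, function(p) {
    (mean_rank[[p[1]]] - mean_rank[[p[2]]]) /
      sqrt(sigma2 * (1 / n[[p[1]]] + 1 / n[[p[2]]]))
  })
  p_raw <- 2 * pnorm(-abs(z))

  data.frame(group1 = pairs[1, ], group2 = pairs[2, ], z = z,
             p_raw = p_raw, p_holm = p.adjust(p_raw, method = "holm"))
}

# Toy data with three clearly separated groups and no ties.
y_toy <- c(1:5, 11:15, 21:25)
g_toy <- rep(c("A", "B", "C"), each = 5)
res   <- dunn_holm_sketch(y_toy, g_toy)
```

Holm-adjusted *P* values are never smaller than the raw ones, so a pair that is not significant before correction cannot become significant afterwards.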

```{r ternG-3group, results = "hide"}
tbl_3group <- ternG(
  data               = tern_colon,
  exclude_vars       = c("ID"),
  group_var          = "Treatment_Arm",
  group_order        = c("Observation", "Levamisole", "Levamisole + 5FU"),
  output_docx        = file.path(out_dir, "Tern_3_group.docx"),
  methods_filename   = file.path(out_dir, "TernTables_methods.docx"),
  consider_normality = "ROBUST",
  post_hoc           = TRUE,
  category_start     = c(
    "Patient Demographics"  = "Age (yr)",
    "Surgical Findings"     = "Colonic Obstruction",
    "Tumor Characteristics" = "Positive Lymph Nodes (n)",
    "Outcomes"              = "Recurrence"
  )
)
tbl_3group
```

Three-group comparison table exported to Word with category section headers:

![](figures/tern_3_group.png){width=100%}

---

## Word Output Formatting

Two optional parameters control text that appears outside the table body in the
exported Word document.

**`table_caption`** places a bold size-11 Arial caption above the table,
single-spaced with a small gap between the caption and the table:

```{r caption-example, eval = FALSE}
tbl_descriptive <- ternD(
  data          = tern_colon,
  exclude_vars  = c("ID"),
  output_docx   = file.path(out_dir, "Tern_descriptive.docx"),
  table_caption = "Table 1. Baseline patient characteristics."
)
```

**`table_footnote`** adds a merged footer row below the table in size-6 Arial
italic, bordered above and below by a double rule. Pass a single string or a
character vector for multiple lines (lines are joined with a line break inside
the same cell — no extra row spacing):

```{r footnote-example, eval = FALSE}
tbl_2group <- ternG(
  data           = tern_colon,
  exclude_vars   = c("ID"),
  group_var      = "Recurrence",
  OR_col         = TRUE,
  output_docx    = file.path(out_dir, "Tern_2_group.docx"),
  table_caption  = "Table 2. Characteristics by recurrence status.",
  table_footnote = c(
    "Abbreviations: OR, odds ratio; CI, confidence interval.",
    "\u2020 P values from chi-square or Wilcoxon rank-sum test.",
    "\u2021 ORs from unadjusted logistic regression."
  )
)
```

Both parameters are also stored in the table's metadata and reproduced
automatically when combining tables with `ternB()`.

---

## Statistical Test Logic

TernTables selects tests automatically based on variable type and normality:

| Variable type | Test (2 groups) | Test (3+ groups) | Post-hoc (3+ groups, `post_hoc = TRUE`, omnibus *P* < 0.05) |
|---|---|---|---|
| Continuous, normal | Welch's *t*-test | Welch ANOVA | Games-Howell |
| Continuous, non-normal | Wilcoxon rank-sum | Kruskal-Wallis | Dunn's + Holm |
| Binary / Categorical | Fisher's exact or Chi-squared\* | Fisher's exact or Chi-squared\* | — |
| Ordinal (`force_ordinal`) | Wilcoxon rank-sum | Kruskal-Wallis | Dunn's + Holm |

\*Fisher's exact is used when any expected cell count is < 5 (Cochran criterion). If the exact algorithm cannot complete (workspace limit exceeded for large tables), Fisher's exact with Monte Carlo simulation (B = 10,000; seed fixed via `getOption("TernTables.seed")`, default 42) is used automatically.
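
The categorical routing in the footnote above can be sketched with `chisq.test()` and `fisher.test()` from the `stats` package. `categorical_p_sketch()` is a hypothetical helper; the use of default `chisq.test()` settings (including the continuity correction for 2×2 tables) is an assumption, and the sketch sets the global RNG seed before the Monte Carlo fallback.

```r
categorical_p_sketch <- function(tab, B = 10000,
                                 seed = getOption("TernTables.seed", 42)) {
  ct <- suppressWarnings(chisq.test(tab))
  if (all(ct$expected >= 5)) {
    return(list(test = "Chi-squared", p = ct$p.value))
  }
  # Expected count below 5 somewhere: try the exact algorithm first.
  exact <- tryCatch(fisher.test(tab), error = function(e) NULL)
  if (!is.null(exact)) {
    return(list(test = "Fisher's exact", p = exact$p.value))
  }
  # Exact algorithm failed (e.g. workspace limit): simulate the P value instead.
  set.seed(seed)
  sim <- fisher.test(tab, simulate.p.value = TRUE, B = B)
  list(test = "Fisher's exact (Monte Carlo)", p = sim$p.value)
}

# Made-up counts: the first table has all expected counts >= 5, the second does not.
large_tab <- matrix(c(30, 20, 10, 40), nrow = 2)
small_tab <- matrix(c(3, 1, 1, 3), nrow = 2)

res_large <- categorical_p_sketch(large_tab)
res_small <- categorical_p_sketch(small_tab)
```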

Normality routing uses `consider_normality = "ROBUST"` (the default), a
four-gate decision applied per group: (1) any group n < 3 → non-parametric
(conservative fail-safe); (2) absolute skewness > 2 in any group → non-parametric
regardless of sample size; (3) all groups n ≥ 30 → parametric via the Central
Limit Theorem; (4) otherwise, Shapiro-Wilk *P* > 0.05 in all groups → parametric,
else non-parametric. Set `consider_normality = TRUE` to use Shapiro-Wilk alone
(original behaviour).

For 3+ group comparisons, omnibus *P* values are reported. When
`post_hoc = TRUE` (the default is `FALSE`), pairwise comparisons are performed
automatically for continuous and ordinal variables when omnibus *P* < 0.05,
using the test paired to the omnibus (Games-Howell or Dunn's + Holm). CLD
superscript letters are appended to cell values; groups sharing a letter are
not significantly different. Categorical variables never receive post-hoc
testing.
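
The four normality gates described in this section can be written as a short base-R function. This is a sketch under stated assumptions, not the package source: `use_parametric_sketch()` and `skewness_sketch()` are hypothetical names, skewness is the simple moment-based estimate, and groups on which Shapiro-Wilk cannot be computed are routed to the non-parametric path.

```r
skewness_sketch <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5     # moment-based skewness (assumption)
}

use_parametric_sketch <- function(y, g) {
  ok     <- !is.na(y) & !is.na(g)
  groups <- split(y[ok], factor(g[ok]))
  n      <- lengths(groups)

  # Gate 1: any group with n < 3 -> non-parametric.
  if (any(n < 3)) return(FALSE)

  # Gate 2: |skewness| > 2 in any group -> non-parametric.
  skew <- vapply(groups, skewness_sketch, numeric(1))
  if (any(abs(skew) > 2, na.rm = TRUE)) return(FALSE)

  # Gate 3: all groups n >= 30 -> parametric (Central Limit Theorem).
  if (all(n >= 30)) return(TRUE)

  # Gate 4: Shapiro-Wilk P > 0.05 in every group -> parametric.
  sw_p <- vapply(groups, function(x) {
    tryCatch(shapiro.test(x)$p.value, error = function(e) 0)
  }, numeric(1))
  all(sw_p > 0.05)
}

# Deterministic toy inputs, one per gate.
gate1 <- use_parametric_sketch(c(1, 2, 1:10), c("A", "A", rep("B", 10)))
gate2 <- use_parametric_sketch(rep(c(rep(0, 39), 100), 2), rep(c("A", "B"), each = 40))
gate3 <- use_parametric_sketch(c(qnorm(ppoints(40)), qnorm(ppoints(40)) + 1),
                               rep(c("A", "B"), each = 40))
gate4 <- use_parametric_sketch(c(qnorm(ppoints(10)), qnorm(ppoints(10)) + 1),
                               rep(c("A", "B"), each = 10))
```

The second input shows why the skewness gate sits before the sample-size gate: both groups have n = 40, yet a single extreme value in each is enough to route the variable to the non-parametric test.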

---

## Methods Document

A methods paragraph is written automatically with every `ternD()` and `ternG()`
call (`methods_doc = TRUE` by default), saved to `"TernTables_methods.docx"` in
the working directory unless overridden via `methods_filename`. Set
`methods_doc = FALSE` to suppress it.

`write_methods_doc()` can also be called directly on any saved tibble. Pass
`show_test = TRUE` to `ternG()` to populate the `test` column; when present,
the paragraph is tailored to only the test types that actually appeared (e.g.
omits the t-test sentence if all continuous variables were nonparametric).
Without it, standard boilerplate is used.

```{r methods-doc, eval = FALSE}
write_methods_doc(
  tbl      = tbl_2group,
  filename = file.path(out_dir, "Tern_methods.docx")
)
```

---

## Web Application

The full TernTables workflow — preprocessing, descriptive tables, two-group
and three-group comparisons, Word export, and methods paragraphs — is
available as a **free, no-code web application** at
[tern-tables.com](https://tern-tables.com/). No R or package installation is
required. The web app is powered by the same TernTables R package described
in this vignette; all statistical methods and outputs are identical.

The web app is transparent by design. A built-in side panel displays the
exact R commands being executed in the background as you work, and the full
script can be downloaded at the end of your session. The downloaded script
runs as-is in R and produces identical output, making every analysis fully
auditable and reproducible. The script is suitable for submission to
statistical reviewers, inclusion in supplemental materials, or IRB
documentation, and provides a natural learning path for researchers who want to
transition to scripted R workflows. The R package remains the canonical
reference for the underlying implementation.

---

## References

Moertel CG, Fleming TR, Macdonald JS, et al. (1990). Levamisole and fluorouracil
for adjuvant therapy of resected colon carcinoma. *New England Journal of Medicine*,
**322**(6), 352–358. <https://doi.org/10.1056/NEJM199002083220602>