---
title: "Getting started with spicy"
description: >
  Get started with spicy for descriptive statistics, variable
  inspection, frequency tables, cross-tabulations, association
  measures, categorical and continuous summary tables, and codebooks in
  R. A tidyverse-friendly alternative to SPSS and Stata for survey and
  labelled data workflows.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with spicy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

build_rich_tables <- identical(Sys.getenv("IN_PKGDOWN"), "true")
```

```{r setup}
library(spicy)
```

spicy is an R package for descriptive statistics and data analysis,
designed for data science and survey research workflows. It covers
variable inspection, frequency tables, cross-tabulations with
chi-squared tests and effect sizes, and publication-ready summary
tables, offering functionality similar to Stata or SPSS but within a
tidyverse-friendly R environment. This vignette walks through the core
workflow using the bundled [`sochealth`](../reference/sochealth.html)
dataset, a simulated social-health survey with 1200 respondents and
24 variables.

## Inspect your data

`varlist()` (or its shortcut `vl()`) gives a compact overview of every
variable in a data frame: name, label, representative values, class,
number of distinct values, valid observations, and missing values.
In RStudio or Positron, calling `varlist()` on a data frame without
`tbl = TRUE` opens an interactive viewer; this is the most common usage
in practice. Here we use `tbl = TRUE` to produce static output for the
vignette:

```{r varlist}
varlist(sochealth, tbl = TRUE)
```

You can also select specific columns with tidyselect syntax:

```{r varlist-select}
varlist(sochealth, starts_with("bmi"), income, weight, tbl = TRUE)
```

## Frequency tables

`freq()` produces frequency tables with counts, percentages, and
(optionally) valid and cumulative percentages.

```{r freq}
freq(sochealth, education)
```

Weighted frequencies use the `weights` argument. With `rescale = TRUE`,
the total weighted N matches the unweighted N:

```{r freq-weighted}
freq(sochealth, education, weights = weight, rescale = TRUE)
```
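
To make the rescaling idea concrete, here is a minimal base-R sketch on
toy data. It assumes that rescaling means multiplying each weight by
`n / sum(w)`, which is a common convention; whether `freq()` does
exactly this internally is not confirmed here. The vectors `grp` and
`w` are made up for illustration.

```{r sketch-rescale}
# Toy data: a grouping variable and hypothetical survey weights
grp <- c("a", "a", "b", "b", "b")
w   <- c(2, 1, 4, 1, 2)

# Rescale so the weights sum to the number of observations (assumed rule)
w_rescaled <- w * length(w) / sum(w)

# Weighted counts per group, before and after rescaling
raw_counts      <- tapply(w, grp, sum)
rescaled_counts <- tapply(w_rescaled, grp, sum)

sum(rescaled_counts)          # total weighted N now equals length(w)
prop.table(raw_counts)        # the group proportions are identical ...
prop.table(rescaled_counts)   # ... before and after rescaling
```

Under this assumption, rescaling changes only the reported totals, not
the percentages.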

## Cross-tabulations

`cross_tab()` crosses two categorical variables. By default it shows
counts, a chi-squared test, and Cramer's V:

```{r crosstab}
cross_tab(sochealth, smoking, education)
```
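
For readers who want to see where these two statistics come from, the
following base-R sketch computes a Pearson chi-squared test and
Cramer's V by hand on a made-up table. The object `toy_tab` and its
counts are illustrative only, and the sketch does not claim to
reproduce the internals of `cross_tab()`.

```{r sketch-cramer}
# Toy 2 x 3 contingency table (made-up counts)
toy_tab <- matrix(
  c(20, 30, 25, 25, 35, 15),
  nrow = 2,
  dimnames = list(smoker = c("no", "yes"), level = c("low", "mid", "high"))
)

# Pearson chi-squared test without continuity correction
chi <- chisq.test(toy_tab, correct = FALSE)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n_obs <- sum(toy_tab)
k_min <- min(dim(toy_tab))
v     <- sqrt(unname(chi$statistic) / (n_obs * (k_min - 1)))

chi$p.value
v
```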

Add percentages with `percent`:

```{r crosstab-pct}
cross_tab(sochealth, smoking, education, percent = "col")
```
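
As a point of reference, column percentages can be sketched in base R
with `prop.table()`. This assumes `percent = "col"` means that each
column sums to 100; the table `toy_counts` is made up for illustration.

```{r sketch-colpct}
# Toy 2 x 2 table of counts (made-up values)
toy_counts <- matrix(
  c(20, 30, 25, 25),
  nrow = 2,
  dimnames = list(smoker = c("no", "yes"), level = c("low", "high"))
)

# margin = 2 divides each cell by its column total
col_pct <- 100 * prop.table(toy_counts, margin = 2)

col_pct
colSums(col_pct)   # every column adds up to 100
```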

Group by a third variable with `by`:

```{r crosstab-by}
cross_tab(sochealth, smoking, education, by = sex)
```

When both variables are ordered factors, `cross_tab()` automatically
selects an ordinal measure (Kendall's Tau-b) instead of Cramer's V:

```{r crosstab-ordinal}
cross_tab(sochealth, self_rated_health, education)
```
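
The same kind of ordinal measure can be sketched in base R. With tied
ranks, `cor(..., method = "kendall")` returns the tau-b variant, so
applying it to the integer codes of two ordered factors gives a
comparable statistic. The factors `health` and `educ` below are made
up, and the sketch is not a description of how `cross_tab()` computes
its result.

```{r sketch-taub}
# Two made-up ordered factors with eight observations each
health <- factor(
  c("poor", "fair", "good", "good", "fair", "poor", "good", "fair"),
  levels = c("poor", "fair", "good"), ordered = TRUE
)
educ <- factor(
  c("low", "low", "high", "mid", "mid", "low", "high", "high"),
  levels = c("low", "mid", "high"), ordered = TRUE
)

# Kendall's tau on the level codes; ties are handled tau-b style
tau_b <- cor(as.integer(health), as.integer(educ), method = "kendall")
tau_b
```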

## Association measures

For a quick overview of all available association statistics, pass a
contingency table to `assoc_measures()`:

```{r assoc-measures}
tbl <- xtabs(~ smoking + education, data = sochealth)
assoc_measures(tbl)
```

Individual functions such as `cramer_v()`, `gamma_gk()`, or
`kendall_tau_b()` return a scalar by default. Pass `detail = TRUE` for
the confidence interval and p-value:

```{r cramer-detail}
cramer_v(tbl, detail = TRUE)
```

## Summary tables

`table_categorical()` covers grouped or one-way summary tables for
categorical variables:

```{r table-categorical-tt, eval = build_rich_tables}
table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "tinytable"
)
```

`table_continuous()` summarizes continuous variables, either overall or
by a categorical `by` variable, and can also add group-comparison
tests:

```{r table-continuous}
table_continuous(
  sochealth,
  select = c(bmi, life_sat_health),
  by = education
)
```
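
To show what a grouped continuous summary boils down to, here is a
minimal base-R sketch that computes the mean, standard deviation, and
valid N per group. The data frame `toy_scores` is made up, and the
choice of statistics is an assumption for illustration rather than the
exact set reported by `table_continuous()`.

```{r sketch-group-summary}
# Toy data: one continuous score and a two-level group (made-up values)
toy_scores <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  score = c(4.0, 5.5, NA, 6.0, 7.5, 6.5)
)

# Per-group statistics, skipping missing values
grp_mean <- tapply(toy_scores$score, toy_scores$group, mean, na.rm = TRUE)
grp_sd   <- tapply(toy_scores$score, toy_scores$group, sd, na.rm = TRUE)
grp_n    <- tapply(!is.na(toy_scores$score), toy_scores$group, sum)

data.frame(
  group = names(grp_mean),
  mean  = as.numeric(grp_mean),
  sd    = as.numeric(grp_sd),
  n     = as.numeric(grp_n)
)
```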

`table_continuous_lm()` covers the same reporting territory when you
want to stay in a linear-model framework, for example with robust
standard errors or case weights:

```{r table-continuous-lm}
table_continuous_lm(
  sochealth,
  select = c(wellbeing_score, bmi),
  by = sex,
  vcov = "HC3"
)
```
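
To illustrate what HC3 standard errors are, the following base-R sketch
fits a two-group linear model and builds the HC3 sandwich estimator by
hand. The data frame `toy_lm` is simulated, and nothing here is meant
to describe how `table_continuous_lm()` obtains its standard errors
internally.

```{r sketch-hc3}
# Simulated data: two groups with different spreads (made-up values)
set.seed(1)
toy_lm <- data.frame(
  grp = rep(c("a", "b"), each = 10),
  y   = c(rnorm(10, mean = 5, sd = 1), rnorm(10, mean = 6, sd = 3))
)

# The coefficient on grp is the difference in group means
fit <- lm(y ~ grp, data = toy_lm)

# HC3 sandwich: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
X     <- model.matrix(fit)
e     <- residuals(fit)
h     <- hatvalues(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / (1 - h)))   # scales row i by e_i / (1 - h_i)
vc    <- bread %*% meat %*% bread

robust_se <- sqrt(diag(vc))
cbind(estimate = coef(fit), robust_se)
```

In a model with a single two-level factor, the slope is the mean
difference between the groups, which is why a linear model can stand in
for a mean-comparison table.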

For detailed guidance, see the dedicated articles on
`table_categorical()`, `table_continuous()`, `table_continuous_lm()`,
and the reporting overview for APA-style summary tables.

## Row-wise summaries

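`mean_n()`, `sum_n()`, and `count_n()` compute row-wise statistics
across selected columns. Missing values are skipped automatically;
`min_valid` sets how many non-missing values a row needs before a
result is returned, and `count_n(special = "NA")` counts the missing
values in each row.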

```{r mean-n}
sochealth |>
  dplyr::mutate(
    mean_sat  = mean_n(select = starts_with("life_sat")),
    sum_sat   = sum_n(select = starts_with("life_sat"), min_valid = 2),
    n_missing = count_n(select = starts_with("life_sat"), special = "NA")
  ) |>
  dplyr::select(starts_with("life_sat"), mean_sat, sum_sat, n_missing) |>
  head() |>
  as.data.frame()
```
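
The logic behind a minimum number of valid values can be sketched in
base R. The example assumes that rows with fewer valid values than the
threshold receive `NA`; the data frame `toy_items` and the threshold of
2 are made up for illustration.

```{r sketch-rowwise}
# Toy item battery with missing values (made-up data)
toy_items <- data.frame(
  item1 = c(4, NA, 2),
  item2 = c(5, NA, NA),
  item3 = c(3, 1, 4)
)

# Number of valid and missing values per row
n_valid <- rowSums(!is.na(toy_items))
n_na    <- rowSums(is.na(toy_items))

# Row means over the available values
row_mean <- rowMeans(toy_items, na.rm = TRUE)

# Require at least 2 valid values, otherwise set the result to NA
row_mean[n_valid < 2] <- NA

cbind(toy_items, row_mean, n_na)
```

The second row has only one valid item, so its mean is set to `NA`
rather than being based on a single answer.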

## Learn more

- See `?varlist` for inspecting variables, labels, values, and missing
  data.
- See `?cross_tab` for the full list of arguments (weights, simulation,
  association measures).
- See `?table_categorical` for grouped or one-way categorical tables.
- See `?table_continuous` for continuous summaries and group
  comparisons.
- See `?table_continuous_lm` for model-based mean-comparison tables
  with robust standard errors or case weights.
- See `?assoc_measures` for the complete list of association
  statistics.
- See `?code_book` for generating an interactive HTML codebook.