---
title: "Explore variables and build codebooks in R"
description: >
  Explore variables, inspect labels, and build interactive codebooks in R
  with spicy. Learn how to use varlist(), vl(), code_book(), and
  label_from_names() for survey and labelled datasets.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Explore variables and build codebooks in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

build_rich_tables <- identical(Sys.getenv("IN_PKGDOWN"), "true")
```

```{r setup}
library(spicy)
```

Before you build frequency tables or cross-tabulations, it is often worth
checking how your variables are named, labelled, and coded.

spicy provides a simple workflow for variable exploration and
documentation in R. You can derive labels from imported column names,
inspect variables with `varlist()` or `vl()`, and build an interactive
codebook with `code_book()`.

This vignette focuses on three common tasks:

- clean imported column names and recover variable labels with
  `label_from_names()`
- inspect variables, labels, values, classes, and missing data with
  `varlist()` and `vl()`
- generate an interactive codebook for review or export with
  `code_book()`

These tools are especially useful for survey datasets, labelled data,
and imported files where variable names and labels need to be checked
before analysis.

## Why inspect variables before analysis?

Variable inspection helps catch common problems early: unclear names,
missing labels, unexpected coding, and variables with many missing
values. A quick review of your dataset also makes it easier to choose
which variables to tabulate, summarize, or report later.

## Recover labels from imported column names

Some imported files store both a variable name and a variable label in
the column header. `label_from_names()` splits names of the form
`name<sep>label`, renames the columns, and stores the label as a proper
variable label.

```{r label-from-names}
df <- tibble::tibble(
  "age. Age of respondent" = c(25, 30, 41),
  "edu. Highest education level" = c("Lower", "Upper", "Tertiary"),
  "smoke. Current smoker" = c("No", "Yes", "No")
)

out <- label_from_names(df)
labelled::var_label(out)
```
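
To make the splitting step concrete, here is a minimal base R sketch of the same idea. It is an approximation for illustration, not spicy's implementation: the object `raw` and the helper variables are hypothetical, and it assumes the first literal `". "` in each name separates the name from the label. The label is stored in the `"label"` attribute, which is the attribute `labelled::var_label()` reads.

```{r label-split-sketch}
# Illustrative only: a base R approximation, not spicy's implementation.
raw <- data.frame(
  "age. Age of respondent" = c(25, 30, 41),
  "smoke. Current smoker" = c("No", "Yes", "No"),
  "id" = 1:3,
  check.names = FALSE
)

sep <- ". "

# Position of the first literal separator in each name (-1 if absent)
pos <- as.integer(regexpr(sep, names(raw), fixed = TRUE))
has_sep <- pos > 0

# Text before the separator becomes the name, text after it the label
short_names <- ifelse(has_sep, substr(names(raw), 1, pos - 1), names(raw))
labels <- ifelse(
  has_sep,
  substr(names(raw), pos + nchar(sep), nchar(names(raw))),
  NA_character_
)

names(raw) <- short_names
for (i in which(has_sep)) {
  attr(raw[[i]], "label") <- labels[i]
}

names(raw)
attr(raw$age, "label")
```

Columns without the separator, such as `id` here, keep their name and receive no label.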

This is especially useful for LimeSurvey CSV exports created with
Export results -> Export format: CSV -> Headings: Question code &
question text. Column names in these files look like
`"code. question text"`, which the default separator `". "` matches.

## Inspect variables with varlist()

`varlist()` gives a compact summary of each variable, including its
name, label, representative values, class, number of distinct values,
number of valid observations, and number of missing values.

In RStudio or Positron, `varlist()` is mainly used interactively. By
default, it opens a searchable, sortable variable overview in the
Viewer. This makes it easy to scan labels, find specific variables,
filter what you want to inspect, and review the structure of a dataset
before analysis.

```{r varlist-interactive, eval = FALSE}
varlist(sochealth)
```

If you prefer a shorter call in interactive work, `vl()` is a shortcut
for `varlist()`:

```{r vl-interactive, eval = FALSE}
vl(sochealth)
```

To get the same summary back as a tibble, use `tbl = TRUE`:

```{r varlist-all}
varlist(sochealth, tbl = TRUE)
```
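
The count columns in this summary can be approximated with a few lines of base R, which may help when interpreting them. The sketch below is a rough illustration on a hypothetical data frame `toy`, not how `varlist()` is implemented. Apart from `N_distinct` and `NAs`, the column names are illustrative, and counting distinct values without `NA` is an assumption.

```{r varlist-columns-sketch}
# Illustrative only: toy data with a few missing values
toy <- data.frame(
  smoking = factor(c("No", "Yes", NA, "No")),
  age = c(25, 30, NA, 41),
  id = 1:4
)

summary_tbl <- data.frame(
  Variable = names(toy),
  # First element of the class vector, e.g. "factor" or "numeric"
  Class = vapply(toy, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
  # Distinct values, not counting NA (an assumption of this sketch)
  N_distinct = vapply(
    toy, function(x) length(unique(x[!is.na(x)])), integer(1),
    USE.NAMES = FALSE
  ),
  # Non-missing and missing observations
  N_valid = vapply(toy, function(x) sum(!is.na(x)), integer(1), USE.NAMES = FALSE),
  NAs = vapply(toy, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE)
)

summary_tbl
```

In this sketch, valid and missing counts always add up to the number of rows, which is a quick sanity check when reading such a summary.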

To include explicit missing values in the `Values` column, use
`include_na = TRUE`:

```{r varlist-include-na}
head(subset(varlist(sochealth, include_na = TRUE, tbl = TRUE), NAs > 0))
```

To display all unique non-missing values in the `Values` column, use
`values = TRUE`. This is especially useful for variables with a small
number of distinct values:

```{r varlist-values}
head(subset(varlist(sochealth, values = TRUE, tbl = TRUE), N_distinct <= 5))
```

For a focused inspection, select only the variables you want to review:

```{r varlist-selected}
varlist(sochealth, smoking, education, income_group, tbl = TRUE)
```

This is often enough to confirm that labels, factor levels, and missing
values look correct before moving on to tabulations.

## Select subsets of variables

`varlist()` supports tidyselect, which makes it easy to inspect a subset
of variables by name pattern or type.

```{r varlist-tidyselect}
varlist(sochealth, starts_with("life_sat"), tbl = TRUE)
```

```{r varlist-numeric}
varlist(sochealth, where(is.numeric), tbl = TRUE)
```

`vl()` also works with tidyselect in the same way:

```{r vl-example}
vl(sochealth, starts_with("bmi"), tbl = TRUE)
```
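
If tidyselect helpers are new to you, it can help to see what the selectors used in this section pick in plain base R. The sketch below uses a hypothetical data frame `toy_cols` and is only a rough equivalent. In particular, `starts_with()` ignores case by default, while base R's `startsWith()` is case-sensitive.

```{r select-sketch}
# Illustrative only: toy columns, not the sochealth data
toy_cols <- data.frame(
  bmi = c(22.1, 27.4),
  bmi_category = factor(c("Normal", "Overweight")),
  smoking = factor(c("No", "Yes")),
  age = c(25L, 41L)
)

# Roughly what starts_with("bmi") selects: names beginning with "bmi"
by_prefix <- names(toy_cols)[startsWith(names(toy_cols), "bmi")]

# Roughly what where(is.numeric) selects: columns passing the predicate
by_type <- names(toy_cols)[vapply(toy_cols, is.numeric, logical(1))]

by_prefix
by_type
```

Name-based helpers only look at column names, whereas `where()` applies a function to each column's contents, so the two kinds of selector can return overlapping but different sets.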

## Build an interactive codebook

When you want a searchable and exportable overview of the whole dataset or a
selected set of variables, `code_book()` builds an interactive codebook in the
Viewer.

```{r code-book-basic, eval = build_rich_tables}
if (requireNamespace("DT", quietly = TRUE)) {
  code_book(sochealth)
}
```

Use the same tidyselect-style selectors as `varlist()` to build a focused
codebook:

```{r code-book-select, eval = build_rich_tables}
if (requireNamespace("DT", quietly = TRUE)) {
  code_book(
    sochealth,
    starts_with("bmi"),
    values = TRUE,
    title = "BMI codebook",
    filename = "bmi_codebook"
  )
}
```

You can also request a fuller display of values or include missing
values explicitly in the summary:

```{r code-book-values, eval = build_rich_tables}
if (requireNamespace("DT", quietly = TRUE)) {
  code_book(sochealth, values = TRUE, include_na = TRUE)
}
```

This is useful when reviewing a dataset with collaborators or preparing
documentation before analysis.

## When to use varlist() and code_book()

Use `varlist()` when you want a quick summary in a script or a tibble
you can inspect directly.

Use `vl()` when you want the same summary with a shorter call in
interactive work.

Use `code_book()` when you want a searchable, interactive codebook for
review or export.
