---
title: "Decoding UKB Column Names and Values"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Decoding UKB Column Names and Values}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  eval     = FALSE
)
```

## Overview

Raw UK Biobank (UKB) phenotype data contains encoded column names and values that must be converted before analysis.

| Source | Column names | Column values |
|---|---|---|
| `extract_pheno()` | `participant.p31` | Raw integer codes; `decode_values()` required |
| `extract_batch()` | `p31`, `p53_i0` | Usually already decoded; `decode_values()` typically not needed |

Both outputs need `decode_names()` to convert field ID column names to human-readable snake_case.

> **Call order matters**: when using `extract_pheno()` output, always run `decode_values()` before `decode_names()`, because value decoding relies on the numeric field ID still being present in the column name.
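
To see why the order matters, consider how a field ID might be recovered from a column name. The sketch below is a base-R illustration only, not ukbflow's internal code; the helper `field_id_from_name()` and its regular expression are assumptions made for this example.

```{r field-id-sketch}
# Hypothetical helper (not part of ukbflow): pull the numeric field ID
# out of a raw column name such as "participant.p31" or "p53_i0".
field_id_from_name <- function(x) {
  pattern <- "^(participant\\.)?p([0-9]+)(_.*)?$"
  ifelse(grepl(pattern, x), sub(pattern, "\\2", x), NA_character_)
}

# Raw names still carry the field ID; a decoded name such as "sex" does not.
field_id_from_name(c("participant.p31", "p53_i0", "sex"))
#> [1] "31" "53" NA
```

Once a column has been renamed to `sex`, nothing in the name links it back to field 31, so a value lookup keyed on the field ID has nothing to work with.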

---

## Recommended Workflow

```{r workflow}
library(ukbflow)

df <- extract_pheno(c(31, 54, 20116, 21022))
df <- decode_values(df)   # 0/1 → "Female"/"Male", etc.
df <- decode_names(df)    # participant.p31 → sex
```

---

## Step 1: Decode Values

`decode_values()` converts raw integer codes to human-readable labels for categorical fields that have UKB encoding mappings. Continuous, date, text, and already-decoded fields are left unchanged.

```{r decode-values}
df <- decode_values(df)
#> ✔ Decoded 3 categorical columns; 2 non-categorical columns unchanged.
```

`decode_values()` requires two metadata files from the UKB Showcase. Download them once with:

```{r fetch-meta}
fetch_metadata(dest_dir = "data/metadata")
```

Then point `decode_values()` at the same directory. The default `metadata_dir` matches the `fetch_metadata()` default, so the argument is only needed for a custom location:

```{r decode-values-dir}
df <- decode_values(df, metadata_dir = "data/metadata")
```

### What gets decoded

| Column | Raw value | Decoded value |
|---|---|---|
| `p31` | `0` / `1` | `"Female"` / `"Male"` |
| `p54` | `11012` | `"Leeds"` |
| `p20116_i0` | `0` / `1` / `2` | `"Never"` / `"Previous"` / `"Current"` |

Codes absent from the encoding table (including UKB missing codes `-1`, `-3`, `-7`) are returned as `NA`.
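
This is the result a plain table lookup would give. The following base-R sketch is illustrative only and does not reproduce ukbflow's implementation; the `encoding` data frame is a toy stand-in built from the smoking-status row above.

```{r decode-lookup-sketch}
# Toy encoding table (assumption: mirrors the smoking-status row above).
encoding <- data.frame(
  code  = c(0, 1, 2),
  label = c("Never", "Previous", "Current"),
  stringsAsFactors = FALSE
)

raw <- c(0, 2, -3, 1)

# match() returns NA for any code missing from the table,
# so the -3 entry ends up with an NA label.
decoded <- encoding$label[match(raw, encoding$code)]
is.na(decoded)
#> [1] FALSE FALSE  TRUE FALSE
```

If the distinction between the individual missing codes matters for an analysis, one option is to keep a copy of the raw column before decoding.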

---

## Step 2: Decode Names

`decode_names()` renames columns from field ID format to snake_case labels using the approved UKB field dictionary available to your project.

```{r decode-names}
df <- decode_names(df)
#> ✔ Renamed 5 columns.
```

### Name conversion examples

| Raw name | Decoded name |
|---|---|
| `participant.eid` | `eid` |
| `participant.p31` | `sex` |
| `participant.p21022` | `age_at_recruitment` |
| `participant.p53_i0` | `date_of_attending_assessment_centre_i0` |
| `p31` | `sex` |
| `p53_i0` | `date_of_attending_assessment_centre_i0` |

Both `extract_pheno()` format (`participant.p31`) and `extract_batch()` format (`p31`) are handled automatically.
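
As a rough illustration of the conversion, a field title can be turned into a snake_case label with a few base-R string operations. The helper `to_snake()` is hypothetical, and the exact rules `decode_names()` applies may differ.

```{r snake-case-sketch}
# Hypothetical helper (not part of ukbflow): lower-case a field title
# and join its words with underscores.
to_snake <- function(title) {
  x <- tolower(title)
  x <- gsub("[^a-z0-9]+", "_", x)  # each run of non-alphanumerics becomes one "_"
  gsub("^_+|_+$", "", x)           # drop leading and trailing underscores
}

to_snake("Date of attending assessment centre")
#> [1] "date_of_attending_assessment_centre"
```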

### Long names

Some UKB field titles are verbose. Names longer than `max_nchar` characters (default: 60) are flagged with a warning. Lower the threshold to flag shorter names as well:

```{r long-names}
df <- decode_names(df, max_nchar = 30)
#> ! 1 column name longer than 30 characters - consider renaming manually:
#> • date_of_attending_assessment_centre_i0
```

Rename manually to something concise:

```{r rename}
names(df)[names(df) == "date_of_attending_assessment_centre_i0"] <- "date_baseline"
```
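
When several columns need shorter names, a named lookup vector avoids repeating the line above for each one. This is a base-R sketch on a toy data frame; `df_demo` and the replacement names are made up for illustration.

```{r rename-many}
# Toy data frame standing in for decode_names() output.
df_demo <- data.frame(
  eid = 1:2,
  date_of_attending_assessment_centre_i0 = as.Date(c("2008-01-01", "2009-06-15")),
  age_at_recruitment = c(55, 62)
)

# Lookup vector: names are the current column names, values the new ones.
short_names <- c(
  date_of_attending_assessment_centre_i0 = "date_baseline",
  age_at_recruitment                     = "age"
)

# Rename only the columns that appear in the lookup; leave the rest as-is.
hits <- names(df_demo) %in% names(short_names)
names(df_demo)[hits] <- unname(short_names[names(df_demo)[hits]])
names(df_demo)
#> [1] "eid"           "date_baseline" "age"
```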

---

## Getting Help

- `?decode_values`, `?decode_names`
- `vignette("extract")` — extracting phenotype data
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)
