---
title: "Mortality Data from SIM with healthbR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Mortality Data from SIM with healthbR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

The **SIM (Sistema de Informações sobre Mortalidade)** is Brazil's national mortality information system, managed by the Ministry of Health through DATASUS. It records individual death certificates (*Declaração de Óbito*), with the cause of death coded in ICD-10 (abbreviated CID-10 in Portuguese).

| Feature | Details |
|---------|---------|
| Coverage | Per federative unit (UF): 26 states plus the Federal District |
| Years | 1996--2024 (CID-10 era) |
| Unit | One row per death certificate |
| Format | .dbc files from DATASUS FTP |

## Getting started

```{r setup}
library(healthbR)
library(dplyr)
```

### Check available years

```{r}
sim_years()

# include preliminary data
sim_years(status = "all")
```

### Module information

```{r}
sim_info()
```

## Downloading data

### Basic download (one state, one year)

```{r}
deaths_ac <- sim_data(year = 2022, uf = "AC")
```

### Multiple states and years

```{r}
deaths_se <- sim_data(year = 2020:2022, uf = c("SP", "RJ", "MG"))
```

### All states (default)

```{r}
# downloads all 27 UFs (26 states plus the Federal District); may take several minutes
deaths_all <- sim_data(year = 2022)
```

### Filter by cause of death

Use CID-10 code prefixes to filter by cause:

```{r}
# Acute myocardial infarction (I21)
mi <- sim_data(year = 2022, uf = "SP", cause = "I21")

# All cardiovascular diseases (Chapter IX)
cardio <- sim_data(year = 2022, uf = "SP", cause = "I")

# Malignant neoplasms (C00-C97); Chapter II also spans D00-D48, which the "C" prefix does not match
cancer <- sim_data(year = 2022, uf = "SP", cause = "C")
```
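
If `cause` matches code prefixes as described above, a single letter does not always cover a whole ICD-10 chapter: Chapter II (neoplasms) runs from C00 to D48. The sketch below uses a made-up vector of codes (not SIM output) to contrast a prefix match with a range match on the three-character category. The range match is a filter you could apply to `CAUSABAS` yourself after downloading; it is not a documented `sim_data()` option.

```{r}
# illustrative codes only, written without the dot as in CAUSABAS
codes <- c("I210", "I219", "I64", "C349", "D126", "D500", "J189")

# prefix match: every code that starts with "I21"
mi_codes <- codes[startsWith(codes, "I21")]

# range match on the 3-character category: Chapter II is C00-D48
category <- substr(codes, 1, 3)
in_chapter_2 <- category >= "C00" & category <= "D48"
codes[in_chapter_2]
```

Here `D500` is left out of the range match because D50-D89 belong to Chapter III.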

### Select variables

```{r}
deaths <- sim_data(
  year = 2022,
  uf = "SP",
  vars = c("CAUSABAS", "DTOBITO", "SEXO", "IDADE", "RACACOR", "CODMUNRES")
)
```

## Age decoding

The `IDADE` variable uses a 3-digit encoding where the first digit indicates the unit and the remaining two indicate the value:

| First digit | Unit | Example |
|-------------|------|---------|
| 0 | Minutes | `005` = 5 minutes |
| 1 | Hours | `112` = 12 hours |
| 2 | Days | `215` = 15 days |
| 3 | Months | `306` = 6 months |
| 4 | Years | `445` = 45 years |
| 5 | 100+ years | `502` = 102 years |

By default, `decode_age = TRUE` adds an `age_years` column:

```{r}
deaths <- sim_data(year = 2022, uf = "AC")
deaths$age_years  # numeric age in years

# disable decoding
deaths_raw <- sim_data(year = 2022, uf = "AC", decode_age = FALSE)
```
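
To make the encoding table concrete, here is a minimal sketch of a decoder written from the table alone. `decode_idade()` is a hypothetical helper, not a healthbR function, and the package's own conversion may differ, for example in how it treats unknown codes or the length of a year.

```{r}
# Hypothetical helper: convert 3-digit IDADE codes to age in years
decode_idade <- function(idade) {
  code  <- suppressWarnings(as.integer(idade))
  unit  <- code %/% 100L   # first digit: unit of measure
  value <- code %% 100L    # last two digits: value in that unit

  # years per unit for first digits 0-4: minutes, hours, days, months, years
  years_per_unit <- c(1 / (60 * 24 * 365.25), 1 / (24 * 365.25), 1 / 365.25, 1 / 12, 1)

  # any first digit outside 0-4 indexes past the vector and yields NA
  out <- value * years_per_unit[unit + 1L]

  # first digit 5 means 100 years plus the value
  over_100 <- which(unit == 5L)
  out[over_100] <- 100 + value[over_100]
  out
}

decode_idade(c("445", "502"))  # 45 and 102 years
decode_idade("306")            # 6 months, expressed in years
```

Codes with any other first digit come back as `NA` in this sketch.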

## Key variables

| Variable | Description |
|----------|-------------|
| CAUSABAS | Underlying cause of death (CID-10) |
| DTOBITO | Date of death |
| SEXO | Sex (1=Male, 2=Female, 0=Unknown) |
| IDADE | Age (3-digit encoded) |
| RACACOR | Race/color (1=White, 2=Black, 3=Yellow, 4=Brown, 5=Indigenous) |
| CODMUNRES | Municipality of residence (IBGE 6 digits) |
| LINHAA-D | Cause of death lines A-D from the certificate |
| ESCMAE | Mother's education level |
| ESTCIV | Marital status |

### Data dictionary

```{r}
sim_dictionary()
sim_dictionary("SEXO")
sim_dictionary("RACACOR")
```

### Explore variables

```{r}
sim_variables()
sim_variables(search = "causa")
```

## Example: Mortality by cause chapter

```{r}
deaths <- sim_data(year = 2022, uf = "SP")

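deaths |>
  # first letter of the ICD-10 code: a rough proxy for the chapter, since
  # some chapters span letters (Chapter I is A00-B99) or split one ("D", "H")
  mutate(chapter = substr(CAUSABAS, 1, 1)) |>
  count(chapter, sort = TRUE)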
```
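
Grouping by first letter is only an approximation of the ICD-10 chapters. If exact chapters matter, a range-based lookup is one option. The sketch below defines a hypothetical helper, `icd10_chapter()` (not part of healthbR), from the standard ICD-10 chapter boundaries. It only checks where each chapter starts, so codes in unused gaps would be assigned to the preceding chapter.

```{r}
# Hypothetical helper: map ICD-10 codes (e.g. "I219") to chapter numerals
icd10_chapter <- function(code) {
  # numeric key: (letter position - 1) * 100 + two-digit category, "D50" -> 350
  key <- (match(substr(code, 1, 1), LETTERS) - 1L) * 100L +
    suppressWarnings(as.integer(substr(code, 2, 3)))

  # first key of each chapter, in ascending order
  starts <- c(A = 0, C = 200, D50 = 350, E = 400, F = 500, G = 600,
              H = 700, H60 = 760, I = 800, J = 900, K = 1000, L = 1100,
              M = 1200, N = 1300, O = 1400, P = 1500, Q = 1600, R = 1700,
              S = 1800, U = 2000, V = 2100, Z = 2500)

  # U codes (Chapter XXII) sort between T and V, hence the order below
  chapters <- c("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
                "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII",
                "XIX", "XXII", "XX", "XXI")

  chapters[findInterval(key, starts)]
}

icd10_chapter(c("B342", "D126", "D500", "I219"))
```

It can replace the `substr()` call above, as in `mutate(chapter = icd10_chapter(CAUSABAS))`.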

## Example: Age-specific mortality rate

Combine SIM data with Census population denominators:

```{r}
# deaths by age group
deaths <- sim_data(year = 2022, uf = "SP") |>
  filter(!is.na(age_years)) |>
  mutate(age_group = cut(age_years,
    breaks = c(0, 1, 5, 15, 30, 45, 60, 80, Inf),
    right = FALSE
  )) |>
  count(age_group, name = "deaths")

# population from Census 2022
pop <- censo_populacao(year = 2022, territorial_level = "state", geo_code = "35")

# join and calculate rates per 100,000
```
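
The join step depends on how the population table is laid out, which is not shown here. As a rough sketch, assume both tables have already been reduced to one row per age group with a shared `age_group` key. The data frames and all numbers below are made up for illustration; they are not SIM or Census figures.

```{r}
library(dplyr)

# made-up counts, standing in for the aggregated SIM result
deaths_by_age <- data.frame(
  age_group = c("[0,1)", "[1,5)", "[5,15)"),
  deaths = c(120, 30, 25)
)

# made-up denominators, standing in for a reshaped Census table
pop_by_age <- data.frame(
  age_group = c("[0,1)", "[1,5)", "[5,15)"),
  population = c(50000, 200000, 500000)
)

rates <- deaths_by_age |>
  left_join(pop_by_age, by = "age_group") |>
  mutate(rate_per_100k = deaths / population * 1e5)

rates
```

In practice the Census age bands would need to be recoded to match the `cut()` labels before joining. Note that `cut()` returns a factor, so the key types should be aligned as well.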

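## Smart type parsing

By default, `sim_data()` parses columns into appropriate types, so `DTOBITO` comes back as a `Date`. Set `parse = FALSE` to keep every column as character.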

```{r}
# parsed types (default)
deaths <- sim_data(year = 2022, uf = "AC")
class(deaths$DTOBITO)  # Date

# all character (backward-compatible)
deaths_raw <- sim_data(year = 2022, uf = "AC", parse = FALSE)
```
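
With `parse = FALSE`, dates have to be converted by hand. DATASUS files typically store dates as `ddmmyyyy` strings. Assuming that is the raw form of `DTOBITO`, a manual conversion might look like this (the value is illustrative, not taken from SIM):

```{r}
# illustrative raw value in ddmmyyyy form (assumed format)
raw_date <- "25122022"

parsed_date <- as.Date(raw_date, format = "%d%m%Y")
format(parsed_date, "%Y-%m-%d")
```

Strings that do not fit the format come back as `NA`, so it is worth counting missing values after converting.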

## Cache and lazy evaluation

```{r}
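sim_cache_status()  # inspect what is cached locally
sim_clear_cache()   # delete the cached files

# lazy query (requires arrow)
lazy <- sim_data(year = 2022, uf = "SP", lazy = TRUE)
lazy |>
  # ischemic heart diseases (I20-I25), selected by string comparison
  filter(CAUSABAS >= "I20", CAUSABAS < "I26") |>
  collect()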
```

## Further reading

- SIM on DATASUS (`datasus.saude.gov.br`)
- [SINAN vignette](sinan-notifiable-diseases.html) for notifiable disease data
- [Census vignette](censo-denominadores.html) for population denominators
