---
title: "Getting started with ibger"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with ibger}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

**ibger** provides a tidyverse-friendly interface to the
[IBGE Aggregate Data API](https://servicodados.ibge.gov.br/api/docs/agregados?versao=3)
(version 3). This is the same API that powers
[SIDRA](https://sidra.ibge.gov.br/) — the automatic data retrieval system
for all surveys and censuses conducted by the Brazilian Institute of
Geography and Statistics (IBGE).

Each SIDRA table corresponds to an **aggregate** in the API. With ibger
you can browse aggregates, inspect their metadata, and retrieve tidy data
— all from R.

## Installation

```{r}
# install.packages("remotes")
remotes::install_github("StrategicProjects/ibger")
```

```{r setup}
library(ibger)
```

## A typical workflow

### Step 1 — Find an aggregate

Use `ibge_aggregates()` to list every aggregate grouped by survey. Optional
filters let you narrow the search:

```{r}
# All aggregates
ibge_aggregates()
#> ✔ 1420 aggregates found.
#> # A tibble: 1,420 × 4
#>   survey_id survey_name          aggregate_id aggregate_name
#>   <chr>     <chr>                <chr>        <chr>
#> 1 AB        Abate de animais     1705         Animais abatidos …
#> 2 AB        Abate de animais     1706         Peso total das ca…
#> ...

# Monthly aggregates only
ibge_aggregates(periodicity = "P5")

# Aggregates with municipality-level data
ibge_aggregates(level = "N6")
```

### Step 2 — Inspect the metadata

Once you have an aggregate ID, `ibge_metadata()` describes its structure
(variables, classifications, territorial levels, and period range):

```{r}
meta <- ibge_metadata(1705)
meta
```

The print method shows a structured summary:

```
── Animais abatidos ──
ID: 1705
Survey: Pesquisa Trimestral do Abate de Animais
Periodicity: trimestral (200101 to 202404)
Territorial levels: N1, N2, N3

── Variables (2) ──
  284: Número de informantes (Unidades)
  285: Cabeças abatidas (Cabeças)

── Classifications (1) ──
  12529: Tipo de rebanho bovino (9 categories)
    115236: Total [level 0]
    115237: Bois [level 1]
    115238: Vacas [level 1]
    ...
```

Each component is accessible directly:

```{r}
meta$variables
#> # A tibble: 2 × 3
#>   id    name                  unit
#>   <chr> <chr>                 <chr>
#> 1 284   Número de informantes Unidades
#> 2 285   Cabeças abatidas      Cabeças

meta$classifications
#> # A tibble: 1 × 3
#>   id    name                        categories
#>   <chr> <chr>                       <list>
#> 1 12529 Tipo de rebanho bovino      <tibble [9 × 4]>

# Unnest to see every category
tidyr::unnest(meta$classifications, categories)

# Geographic levels
meta$territorial_level
#> $administrative
#> [1] "N1" "N2" "N3"

# Time range
meta$periodicity
#> $frequency [1] "trimestral"
#> $start     [1] "200101"
#> $end       [1] "202404"
```

### Step 3 — Retrieve data

`ibge_variables()` is the main workhorse. It sends a single request and
returns a tidy tibble:

```{r}
ibge_variables(1705, localities = "BR")
#> ✔ 12 records retrieved.
#> # A tibble: 12 × 9
#>   variable_id variable_name      variable_unit classification_12529
#>   <chr>       <chr>              <chr>         <chr>
#> 1 284         Número de inform…  Unidades      Total
#> 2 285         Cabeças abatidas   Cabeças       Total
#> ...
#>   locality_id locality_name locality_level period value
#>   <chr>       <chr>         <chr>          <chr>  <chr>
#> 1 1           Brasil        Brasil         202303 2584
#> 2 1           Brasil        Brasil         202303 7802044
#> ...
```

## Specifying localities

The `localities` parameter accepts several convenient formats:

```{r}
# Country total
ibge_variables(1705, localities = "BR")

# All states
ibge_variables(8884, localities = "N3")

# Specific states (RJ = 33, SP = 35)
ibge_variables(8884, localities = list(N3 = c(33, 35)))

# Mix levels: metropolitan areas + a specific municipality
ibge_variables(1705, localities = list(N7 = c(3501, 3301), N6 = 5208707))
```

The geographic level codes follow the IBGE convention:

| Code | Level                      | Example                              |
|------|----------------------------|--------------------------------------|
| `N1` | Brazil                     | `"BR"` or `list(N1 = 1)`            |
| `N2` | Major region               | `list(N2 = 1)` — North              |
| `N3` | State (UF)                 | `list(N3 = 33)` — Rio de Janeiro    |
| `N6` | Municipality               | `list(N6 = 3550308)` — São Paulo/SP |
| `N7` | Metropolitan area          | `list(N7 = 3501)` — RM São Paulo    |

> **Tip**: Not every aggregate is available at every level. Aggregate 1705
> has data for N1, N6, and N7 but not N3. Use `ibge_metadata()` to check.

## Specifying periods

Periods follow the API convention — negative values mean "last N":

```{r}
# Last 6 periods (the default)
ibge_variables(1705, periods = -6, localities = "BR")

# Last 12 periods
ibge_variables(1705, periods = -12, localities = "BR")

# Specific period codes
ibge_variables(8884, periods = c(202301, 202302, 202303), localities = "BR")

# Range (inclusive)
ibge_variables(8884, periods = "202101-202304", localities = "BR")

# Range + extra period
ibge_variables(8884, periods = "202101-202106|202301", localities = "BR")
```

> **Note**: Negative values cannot be mixed with specific periods. A period
> code does not carry its own periodicity: `202001` could mean January 2020
> (monthly), Q1 2020 (quarterly), or S1 2020 (semi-annual), depending on
> the aggregate.
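
Period codes for a quarterly aggregate can be generated rather than typed by
hand. The sketch below assumes the `YYYYQQ` reading mentioned in the note
(quarter number in the last two digits). `quarter_codes()` is a hypothetical
helper for illustration, not part of ibger.

```{r}
# Hypothetical helper (not part of ibger): build quarterly period codes,
# assuming the aggregate uses the YYYYQQ convention.
quarter_codes <- function(first_year, last_year) {
  # expand.grid() varies the first column fastest, so codes come out in order
  grid <- expand.grid(quarter = 1:4, year = first_year:last_year)
  as.integer(sprintf("%d%02d", grid$year, grid$quarter))
}

codes <- quarter_codes(2021, 2023)
head(codes, 4)
#> [1] 202101 202102 202103 202104

# Collapse to the "start-end" range form shown earlier
paste(range(codes), collapse = "-")
#> [1] "202101-202304"
```

Either the vector or the range string can then be passed as `periods`, in the
same way as the hand-written examples above.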

## Filtering with classifications

Many aggregates break their data down further by classifications
(dimensions). For instance, aggregate 1712 (crop production) has one
classification for the type of product (226) and another for the producer
condition (218).

```{r}
# Single category: pineapple (4844) from product classification (226)
ibge_variables(
  aggregate      = 1712,
  localities     = "BR",
  classification = list("226" = 4844)
)

# Multiple categories
ibge_variables(
  aggregate      = 1712,
  localities     = "BR",
  classification = list("226" = c(4844, 96608, 96609))
)

# Multiple classifications
ibge_variables(
  aggregate      = 1712,
  localities     = "BR",
  classification = list("226" = c(4844, 96608), "218" = 4780)
)

# All categories of a classification (can be large!)
ibge_variables(
  aggregate      = 1712,
  periods        = -1,
  localities     = "BR",
  classification = list("226" = "all")
)
```
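
In the examples above the classification IDs are written as literal list
names. When the IDs live in variables (for example, after looking them up
with `ibge_metadata()`), the same named list can be built with
`stats::setNames()`. This is a minimal base R sketch; the variable names are
illustrative.

```{r}
# Classification IDs held in variables (names are illustrative)
product_cls  <- 226
producer_cls <- 218

# setNames() coerces the numeric IDs to character names
cls <- stats::setNames(
  list(c(4844, 96608), 4780),
  c(product_cls, producer_cls)
)

# Same structure as list("226" = c(4844, 96608), "218" = 4780)
names(cls)
#> [1] "226" "218"
```

The resulting `cls` can be passed as the `classification` argument.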

When no classification is specified, the API returns the **Total** category
(ID = 0), which covers all categories combined.

## Automatic validation

Before sending any request, `ibge_variables()` and `ibge_localities()`
validate your parameters against the aggregate's metadata. If something
doesn't match, you get a clear error with the allowed values:

```{r}
# N3 (states) is not available for aggregate 1705
ibge_variables(1705, localities = "N3")
#> Error:
#> ! Geographic level(s) "N3" not available for aggregate 1705.
#> ℹ Available levels: "N1", "N6", and "N7".

# Period out of range
ibge_variables(1705, periods = 199901, localities = "BR")
#> Error:
#> ! Period(s) "199901" out of range for aggregate 1705.
#> ℹ Valid range: "201202" to "202001" (monthly).

# Non-existent variable
ibge_variables(1705, variable = 999, localities = "BR")
#> Error:
#> 355 - IPCA15 - Variação mensal (%)
#> 356 - IPCA15 - Variação acumulada no ano (%)
#> 1120 - IPCA15 - Variação acumulada em 12 meses (%)
#> 357 - IPCA15 - Peso mensal (%)
```

Metadata is fetched once per session and cached. To force a refresh:

```{r}
ibge_clear_cache()
```

Skip validation entirely with `validate = FALSE`:

```{r}
ibge_variables(1705, localities = "BR", validate = FALSE)
```

## Browsing the survey catalog

Beyond aggregate-level data, ibger also provides access to the
[IBGE Metadata API](https://servicodados.ibge.gov.br/api/docs/metadados?versao=2)
(v2), which catalogs IBGE's surveys with institutional and methodological
information such as status, category, collection frequency, and thematic
classifications.

This is useful when you want to understand **what surveys exist** and
**how they are structured** before diving into specific aggregates.

```{r}
# List all 98 IBGE surveys
ibge_surveys()
#> # A tibble: 98 × 8
#>   id    name                                 status category    ...
#>   <chr> <chr>                                <chr>  <chr>
#> 1 AC    Pesquisa Anual da Indústria da Cons… Ativa  Estrutural
#> 2 AA    Pesquisa Nacional de Saúde do Escol… Ativa  Especial
#> ...

# Filter active short-term ("Conjuntural") surveys
library(dplyr)
ibge_surveys(thematic_classifications = FALSE) |>
  filter(status == "Ativa", category == "Conjuntural")

# Check which periods have metadata for the Censo Demográfico
ibge_survey_periods("CD")
#> # A tibble: 9 × 3
#>    year month order
#>   <int> <int> <int>
#> 1  2022    NA     0
#> 2  2010    NA     0
#> ...

# Get full institutional metadata for a specific period
meta <- ibge_survey_metadata("CD", year = 2022)
meta
#> ── CD ──
#> Status: Ativa
#> Category: Estrutural
#> ...
#> ── Metadata occurrences (1) ──
#> Use `meta$occurrences` to explore the full metadata.

# Explore methodology fields
names(meta$occurrences[[1]])
```

Survey codes are validated before each request. If you use a wrong code,
the error suggests similar alternatives:

```{r}
ibge_survey_periods("PMS")
#> Error: Survey code "PMS" not found in the IBGE catalog.
#> ℹ Did you mean one of these?
#>   * SC - Pesquisa Mensal de Serviços
#>   * MC - Pesquisa Mensal de Comércio
#>   ...
```

## API limits and special values

Each request can return at most **100,000 values**, computed as:

> categories × periods × localities ≤ 100,000

If exceeded, the API returns HTTP 500. Split your request into smaller
chunks when working with many localities or categories.
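
One way to stay under the limit is to estimate the request size up front and
split the locality IDs into batches. The sketch below uses base R only.
`chunk_ids()` is a hypothetical helper, and the category count, period count,
and ID vector are made-up figures for illustration.

```{r}
# Hypothetical helper (not part of ibger): split IDs into batches of at most `size`
chunk_ids <- function(ids, size) {
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Made-up request dimensions
n_categories <- 50
n_periods    <- 12
limit        <- 100000

# Largest number of localities that keeps one request within the limit
max_localities <- floor(limit / (n_categories * n_periods))
max_localities
#> [1] 166

ids     <- 1:400  # placeholder IDs, not real IBGE locality codes
batches <- chunk_ids(ids, max_localities)
lengths(batches)
#> [1] 166 166  68
```

Each batch could then be passed as, say, `localities = list(N6 = batch)` and
the resulting tibbles combined with `dplyr::bind_rows()`.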

The `value` column may contain special characters instead of numbers:

| Value | Meaning                                                      |
|-------|--------------------------------------------------------------|
| `-`   | Numeric zero (not from rounding)                             |
| `..`  | Not applicable                                               |
| `...` | Data not available                                           |
| `X`   | Suppressed to avoid identifying individual respondents       |

Because of these symbols, `value` is returned as a character column. Use
`parse_ibge_value()` to convert it to numeric in one step:

```{r}
ibge_variables(7060, localities = "BR") |>
  dplyr::mutate(value = parse_ibge_value(value))
```
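
To make the table above concrete, the conversion can be sketched in base R.
This is an illustration only, not the implementation of `parse_ibge_value()`:
`to_numeric_sketch()` is a hypothetical name, and mapping `-` to 0 and the
other symbols to `NA` is an assumption drawn from the meanings listed in the
table.

```{r}
# Hypothetical stand-in, NOT the package's parse_ibge_value()
to_numeric_sketch <- function(x) {
  out <- rep(NA_real_, length(x))

  # "-" means a true zero according to the table
  out[x %in% "-"] <- 0

  # Plain numbers (optionally negative, optionally with decimals)
  is_num <- grepl("^-?[0-9]+(\\.[0-9]+)?$", x)
  out[is_num] <- as.numeric(x[is_num])

  # "..", "..." and "X" match neither rule and stay NA
  out
}

to_numeric_sketch(c("2584", "-", "..", "...", "X"))
#> [1] 2584    0   NA   NA   NA
```

Collapsing `..`, `...`, and `X` into `NA` loses the distinction between
"not applicable", "not available", and "suppressed", so keep the original
character column if that distinction matters for your analysis.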