---
title: "Bayesian Surprise with Canadian Census Data (cancensus)"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Bayesian Surprise with Canadian Census Data (cancensus)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  warning = FALSE,
  message = FALSE,
  fig.alt = "Canadian Census example visualization of Bayesian surprise or funnel plot values."
)

is_pkgdown <- identical(Sys.getenv("IN_PKGDOWN"), "true")
has_cancensus_key <- nzchar(Sys.getenv("CM_API_KEY")) ||
  nzchar(Sys.getenv("CANCENSUS_API_KEY"))
run_cancensus_examples <- is_pkgdown && has_cancensus_key
```

# Introduction

This vignette demonstrates how to use the `bayesiansurpriser` package with
Canadian Census data accessed through the `cancensus` package. The cancensus
package provides access to Canadian Census data from 1996 through 2021 via
the CensusMapper API.

Canadian Census data is a useful setting for Bayesian Surprise analysis because
it provides:

- Comprehensive population data at multiple geographic levels
- Rich demographic and socioeconomic variables
- Consistent geography across Census years
- Both counts and rates for many variables

As with ACS examples, Census profile counts are best treated carefully. The
surprise maps below model rates directly using distributional models. Funnel
plots are used separately as diagnostics for how population or household counts
affect the reliability of rates.

## Prerequisites

You'll need a CensusMapper API key from https://censusmapper.ca/users/sign_up

```{r load-packages}
library(bayesiansurpriser)
library(cancensus)
library(sf)
library(ggplot2)
library(dplyr)

# Set your CensusMapper API key (do this once)
# set_cancensus_api_key("YOUR_KEY_HERE", install = TRUE)

# Optionally set a cache path for faster subsequent queries
# set_cancensus_cache_path("path/to/cache", install = TRUE)
```

***

# Understanding cancensus Data Structure

Before diving into analysis, let's understand the available datasets and
variables.

```{r list-datasets, eval=FALSE}
# List available Census datasets
datasets <- list_census_datasets()
print(datasets %>% select(dataset, description))
```

```{r list-regions, eval=FALSE}
# List regions for the 2021 Census
regions_2021 <- list_census_regions("CA21")
head(regions_2021 %>% filter(level == "CMA"))
```

***

# Example 1: Low Income Rates by Census Division

Let's examine low income rates across Canada's Census Divisions and identify
which regions have atypical rates relative to the cross-division distribution.

## Fetch Census Data

```{r fetch-income-data, eval=run_cancensus_examples}
# Find relevant vectors for low income
# v_CA21_1085: Total - Low-income status in 2020
# v_CA21_1086: 0 to 17 years in low income
# v_CA21_1090: In low income

# Get low income data at the Census Division level for all of Canada
# Using labels = "short" for cleaner column names
income_data <- get_census(
  dataset = "CA21",
  regions = list(C = "01"),  # All of Canada
  vectors = c(
    total_pop = "v_CA21_1085",  # Total population for low income status
    low_income = "v_CA21_1090"   # Population in low income
  ),
  level = "CD",
  geo_format = "sf",
  labels = "short",
  quiet = TRUE
)

# Clean and filter
income_data <- income_data %>%
  filter(!is.na(total_pop) & !is.na(low_income)) %>%
  filter(total_pop > 0) %>%
  mutate(
    low_income_rate = low_income / total_pop
  )

head(income_data %>% st_drop_geometry() %>% select(GeoUID, `Region Name`, total_pop, low_income, low_income_rate))
```

## Compute Bayesian Surprise

```{r compute-income-surprise, eval=run_cancensus_examples}
# Compute surprise on the low-income-rate distribution.
income_surprise <- surprise(
  income_data,
  observed = low_income_rate,
  models = c("uniform", "gaussian", "sampled")
)

print(income_surprise)
summary(income_surprise)
```

## Visualize Surprising Regions

```{r plot-income-surprise, fig.width=10, fig.height=8, eval=run_cancensus_examples}
# Add surprise values to data
income_data$surprise <- get_surprise(income_surprise, "surprise")
income_data$signed_surprise <- get_surprise(income_surprise, "signed")

# Transform to Statistics Canada Lambert projection for better Canada visualization
income_data_lambert <- st_transform(income_data, "EPSG:3347")

# Plot Canada-wide
ggplot(income_data_lambert) +
  geom_sf(aes(fill = signed_surprise), color = "white", linewidth = 0.1) +
  scale_fill_surprise_diverging(name = "Signed\nsurprise") +
  labs(
    title = "Atypical Low Income Rates by Census Division",
    subtitle = "Signed surprise from the cross-division low-income-rate distribution",
    caption = "Source: Statistics Canada, 2021 Census"
  ) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  )
```

## Identify Most Surprising Regions

```{r top-surprise-regions, eval=run_cancensus_examples}
# Top regions with positive signed surprise
high_income_surprise <- income_data %>%
  st_drop_geometry() %>%
  filter(signed_surprise > 0) %>%
  arrange(desc(surprise)) %>%
  select(`Region Name`, low_income_rate, total_pop, surprise) %>%
  head(10)

cat("Census Divisions with positive signed surprise:\n")
print(high_income_surprise)

# Top regions with negative signed surprise
low_income_surprise <- income_data %>%
  st_drop_geometry() %>%
  filter(signed_surprise < 0) %>%
  arrange(desc(surprise)) %>%
  select(`Region Name`, low_income_rate, total_pop, surprise) %>%
  head(10)

cat("\nCensus Divisions with negative signed surprise:\n")
print(low_income_surprise)
```

***

# Example 2: Housing Analysis in Metro Vancouver

Let's analyze owner-occupied housing rates at the Census Tract level within
the Vancouver CMA.

```{r fetch-housing-data, eval=run_cancensus_examples}
# Get housing tenure data for Vancouver CMA
# v_CA21_4237: Total private households by tenure
# v_CA21_4238: Owner households

housing_data <- get_census(
  dataset = "CA21",
  regions = list(CMA = "59933"),  # Vancouver CMA
  vectors = c(
    total_hh = "v_CA21_4237",  # Total households
    owner_hh = "v_CA21_4238"   # Owner households
  ),
  level = "CT",
  geo_format = "sf",
  labels = "short",
  quiet = TRUE
)

housing_data <- housing_data %>%
  filter(!is.na(total_hh) & !is.na(owner_hh)) %>%
  filter(total_hh > 50) %>%  # Filter small tracts
  mutate(
    owner_rate = owner_hh / total_hh
  )

cat("Number of Census Tracts:", nrow(housing_data), "\n")
```

```{r housing-surprise, fig.width=9, fig.height=7, eval=run_cancensus_examples}
# Compute surprise on the home-ownership-rate distribution.
housing_surprise <- surprise(
  housing_data,
  observed = owner_rate,
  models = c("uniform", "gaussian", "sampled")
)

housing_data$signed_surprise <- get_surprise(housing_surprise, "signed")

# Plot Vancouver
ggplot(housing_data) +
  geom_sf(aes(fill = signed_surprise), color = "white", linewidth = 0.02) +
  scale_fill_surprise_diverging(name = "Signed\nsurprise") +
  labs(
    title = "Atypical Home Ownership Rates in Metro Vancouver",
    subtitle = "Signed surprise from the tract-level owner-rate distribution",
    caption = "Source: Statistics Canada, 2021 Census"
  ) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 9)
  )
```

***

# Example 3: Language at Home Analysis

Identify areas with high model-conditional surprise for non-official language
speakers at home.

```{r fetch-language-data, eval=run_cancensus_examples}
# Get language data for Toronto CMA
# v_CA21_1144: Total - Knowledge of official languages
# v_CA21_1153: Neither English nor French

language_data <- get_census(
  dataset = "CA21",
  regions = list(CMA = "35535"),  # Toronto CMA
  vectors = c(
    total_pop = "v_CA21_1144",  # Total population
    no_official = "v_CA21_1153"   # Neither official language
  ),
  level = "CT",
  geo_format = "sf",
  labels = "short",
  quiet = TRUE
)

language_data <- language_data %>%
  filter(!is.na(total_pop) & !is.na(no_official)) %>%
  filter(total_pop > 100) %>%
  mutate(no_official_rate = no_official / total_pop)
```

```{r language-surprise, fig.width=9, fig.height=7, eval=run_cancensus_examples}
# Compute surprise on the non-official-language-rate distribution.
lang_surprise <- surprise(
  language_data,
  observed = no_official_rate,
  models = c("uniform", "gaussian", "sampled")
)

language_data$signed_surprise <- get_surprise(lang_surprise, "signed")

# Plot Toronto
ggplot(language_data) +
  geom_sf(aes(fill = signed_surprise), color = "white", linewidth = 0.02) +
  scale_fill_surprise_diverging(name = "Signed\nsurprise") +
  labs(
    title = "Atypical Non-Official Language Concentrations in Toronto",
    subtitle = "Signed surprise from the tract-level non-official-language-rate distribution",
    caption = "Source: Statistics Canada, 2021 Census"
  ) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 9)
  )
```

***

# Example 4: Provincial Comparison with Funnel Plot

Compare provinces using a funnel plot to account for population size variation.

```{r provincial-funnel, eval=run_cancensus_examples}
# Get provincial data
provincial_data <- get_census(
  dataset = "CA21",
  regions = list(C = "01"),
  vectors = c(
    total_pop = "v_CA21_1085",  # Total for low income
    low_income = "v_CA21_1090"   # In low income
  ),
  level = "PR",
  geo_format = NA,
  labels = "short",
  quiet = TRUE
)

provincial_data <- provincial_data %>%
  filter(!is.na(total_pop) & total_pop > 0)

# Compute funnel data for points
funnel_df <- compute_funnel_data(
  observed = provincial_data$low_income,
  sample_size = provincial_data$total_pop,
  type = "count",
  limits = c(2, 3)
)
funnel_df$province <- provincial_data$`Region Name`
funnel_df$rate <- funnel_df$observed / funnel_df$sample_size

# National rate
national_rate <- sum(provincial_data$low_income) / sum(provincial_data$total_pop)

# Create continuous funnel bands
n_range <- range(provincial_data$total_pop)
n_seq <- exp(seq(log(n_range[1] * 0.8), log(n_range[2] * 1.2), length.out = 200))
se_seq <- sqrt(national_rate * (1 - national_rate) / n_seq)

funnel_bands <- data.frame(
  n = n_seq,
  rate = national_rate,
  lower_2sd = national_rate - 2 * se_seq,
  upper_2sd = national_rate + 2 * se_seq,
  lower_3sd = national_rate - 3 * se_seq,
  upper_3sd = national_rate + 3 * se_seq
)

# Create funnel plot
ggplot() +
  # 3 SD band (outer)
  geom_ribbon(
    data = funnel_bands,
    aes(x = n, ymin = lower_3sd, ymax = upper_3sd),
    fill = "#9ecae1", alpha = 0.5
  ) +
  # 2 SD band (inner)
  geom_ribbon(
    data = funnel_bands,
    aes(x = n, ymin = lower_2sd, ymax = upper_2sd),
    fill = "#3182bd", alpha = 0.5
  ) +
  # National average line
  geom_hline(
    yintercept = national_rate,
    linetype = "dashed", color = "#636363", linewidth = 0.8
  ) +
  # Points
  geom_point(
    data = funnel_df,
    aes(x = sample_size, y = rate, color = abs(z_score) > 2),
    size = 4
  ) +
  # Labels
  ggrepel::geom_text_repel(
    data = funnel_df,
    aes(x = sample_size, y = rate, label = province),
    size = 3, fontface = "bold",
    max.overlaps = 15,
    segment.color = "gray60"
  ) +
  scale_color_manual(
    values = c("FALSE" = "#636363", "TRUE" = "#e6550d"),
    guide = "none"
  ) +
  scale_x_log10(
    labels = scales::comma,
    breaks = c(1e4, 1e5, 1e6, 1e7)
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Funnel Plot: Provincial Low Income Rates",
    subtitle = "Provinces outside bands have surprising rates given population size",
    x = "Population (log scale)",
    y = "Low Income Rate",
    caption = "Source: Statistics Canada, 2021 Census"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )
```

***

# Example 5: Comparing Census Years

Compare surprise patterns between 2016 and 2021 Census at the provincial level.

```{r temporal-comparison, eval=run_cancensus_examples}
# Get provincial data for 2016 and 2021
# Note: Variable IDs differ between Census years

# 2021 data
prov_2021 <- get_census(
  dataset = "CA21",
  regions = list(C = "01"),
  vectors = c(total_pop = "v_CA21_1085", low_income = "v_CA21_1090"),
  level = "PR",
  geo_format = NA,
  labels = "short",
  quiet = TRUE
)
prov_2021 <- prov_2021[!is.na(prov_2021$total_pop) & prov_2021$total_pop > 0, ]
prov_2021$low_income_rate <- prov_2021$low_income / prov_2021$total_pop
prov_2021$census_year <- 2021

# 2016 data (different vector IDs for same concept)
prov_2016 <- get_census(
  dataset = "CA16",
  regions = list(C = "01"),
  vectors = c(total_pop = "v_CA16_2540", low_income = "v_CA16_2555"),
  level = "PR",
  geo_format = NA,
  labels = "short",
  quiet = TRUE
)
prov_2016 <- prov_2016[!is.na(prov_2016$total_pop) & prov_2016$total_pop > 0, ]
prov_2016$low_income_rate <- prov_2016$low_income / prov_2016$total_pop
prov_2016$census_year <- 2016

# Compute surprise for each year's provincial low-income-rate distribution
surprise_2021 <- surprise(
  prov_2021,
  observed = low_income_rate,
  models = c("uniform", "gaussian", "sampled")
)
prov_2021$signed_surprise <- get_surprise(surprise_2021, "signed")

surprise_2016 <- surprise(
  prov_2016,
  observed = low_income_rate,
  models = c("uniform", "gaussian", "sampled")
)
prov_2016$signed_surprise <- get_surprise(surprise_2016, "signed")

# Compare changes
comparison <- merge(
  prov_2016[, c("GeoUID", "Region Name", "signed_surprise")],
  prov_2021[, c("GeoUID", "signed_surprise")],
  by = "GeoUID",
  suffixes = c("_2016", "_2021")
)
comparison$change <- comparison$signed_surprise_2021 - comparison$signed_surprise_2016
print(comparison[order(abs(comparison$change), decreasing = TRUE), ])
```

Note: This example is not evaluated because it requires multiple API calls.
Run locally with your CensusMapper API key.

***

# Example 6: Custom Model Space for Census Data

For Census data, create a custom model space that matches the quantity being
modeled. Here the observed values are home-ownership rates, so the models are
rate-distribution models rather than population-baseline count models.

```{r custom-models, eval=run_cancensus_examples}
# Using the Vancouver housing data
# Create a custom model space

# Create custom model space for rates
housing_models <- model_space(
  bs_model_gaussian(),
  bs_model_sampled()
)

print(housing_models)

# Use surprise() with custom model space
custom_result <- surprise(
  housing_data,
  observed = owner_rate,
  models = housing_models
)

housing_data$custom_surprise <- get_surprise(custom_result, "signed")

# Summary statistics
cat("Custom model surprise summary:\n")
summary(housing_data$custom_surprise)
```

***

# Interpreting Surprise Values

Understanding what surprise values mean is crucial for proper interpretation:

| Surprise Magnitude | Interpretation | Meaning |
|-------------------|----------------|---------|
| < 0.1 | Trivial | Little update from prior to posterior model probabilities |
| 0.1 - 0.3 | Minor | Small model update |
| 0.3 - 0.5 | Moderate | Noticeable model update |
| 0.5 - 1.0 | Substantial | Strong model update |
| > 1.0 | High | Very strong model update under the chosen model space |

**Key insight**: Surprise measures how much the data discriminates between models, not
just how far a rate is from average. A region with an unusual rate may show low
surprise if all models assign similar likelihood to it. A region with a rate near
the average can still show surprise if it falls in a part of the distribution
where the chosen models disagree. Use the funnel plot separately to understand
whether population size makes a rate estimate unusually reliable or noisy.

```{r interpret-surprise, eval=run_cancensus_examples}
# Categorize regions by surprise level
income_data <- income_data %>%
  mutate(
    surprise_category = cut(
      abs(signed_surprise),
      breaks = c(0, 0.1, 0.3, 0.5, 1.0, Inf),
      labels = c("Trivial", "Minor", "Moderate", "Substantial", "High"),
      include.lowest = TRUE
    )
  )

# Summary by category
cat("Census Divisions by surprise level:\n")
print(table(income_data$surprise_category))
```

***

# Session Info

```{r session-info}
sessionInfo()
```
