---
title: "2. Ensuring spatial consistency: countries, states, and coordinates"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2. Ensuring spatial consistency: countries, states, and coordinates}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  warning = FALSE
)
```

## Introduction

Even after initial formatting, species occurrence data often retain spatial inconsistencies that can compromise subsequent analyses. Common issues include varying spellings for the same country (e.g., Brasil, Brazil, or BR) or state name, missing administrative information, and coordinates that fall outside the political-administrative jurisdiction assigned to the record. This vignette demonstrates how to ensure the spatial consistency of your occurrence records through name standardization, data imputation, verification, and correction.

```{r}
# Load RuHere package
library(RuHere)
```

## Overview of the functions:
+ `standardize_countries()`: standardizes country names and codes.
+ `standardize_states()`: standardizes state/province names and codes.
+ `country_from_coords()`: extracts the country name from geographic coordinates.
+ `states_from_coords()`: extracts the state/province name from geographic coordinates.
+ `check_countries()`: verifies if coordinates fall within the boundaries of the assigned country.
+ `check_states()`: verifies if coordinates fall within the boundaries of the assigned state/province.
+ `fix_countries()`: identifies and corrects common coordinate errors based on country jurisdiction.

## Standardizing country and state names

Standardizing administrative names is the first step: it ensures that all spelling variations and codes are mapped to a single accepted format.

### Occurrence data

At this stage, you should have an occurrence dataset that has been standardized using the `format_columns()` function and merged with `bind_here()`. For additional details on this workflow, see the vignette *“1. Obtaining and preparing species occurrence data”*.

To illustrate how these functions work, we use the example occurrence dataset included in the package, which contains records for three species: the Paraná pine (*Araucaria angustifolia*), the azure jay (*Cyanocorax caeruleus*), and the yellow trumpet tree (*Handroanthus albus*).

```{r, eval = TRUE}
# Loading package occurrence data
data("occurrences", package = "RuHere")
# Number of records per species
table(occurrences$species)
```


### Standardizing countries (`standardize_countries`)

This function harmonizes country names using exact matching and fuzzy matching to correct typos and variations. It compares the input against a comprehensive dictionary of names and codes provided in `rnaturalearthdata::map_units110()`.

```{r}
# Standardize country names
occ_country_std <- standardize_countries(
    occ = occurrences,
    country_column = "country",
    max_distance = 0.1,      # Maximum error distance for fuzzy matching
    # If the country is NA, try to extract it from the coordinates
    # (uses the country_from_coords() function internally)
    lookup_na_country = TRUE
)
```

This function returns a list with two elements:

+ `$occ`: the original data frame with two new columns: `country_suggested` (the standardized or corrected country name) and `country_source` (whether the suggested country came from the original metadata or was imputed from coordinates).

+ `$report`: a summary of the corrections made, showing the original name and the suggested/standardized name.

Below are the first few rows of the modified data frame and the standardization report:

```{r}
# Printing first rows and columns
occ_country_std$occ[1:3, 1:5]
#>   country country_suggested country_source  record_id               species
#> 1      AR         argentina       metadata  gbif_5516  Araucaria angustifolia
#> 2      AR         argentina       metadata gbif_15849  Araucaria angustifolia
#> 3      AR         argentina       metadata  gbif_4935  Araucaria angustifolia

occ_country_std$report[1:5, ]
#>      country country_suggested
#> 1  argentina         argentina
#> 2    bolivia           bolivia
#> 3     brasil            brazil
#> 4         UY           uruguay
#> 5         PT          portugal
```
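
To make the exact-then-fuzzy idea concrete, here is a minimal base R sketch. It is not the package's implementation: the `dictionary` and `codes` objects and the `standardize_sketch()` helper are hypothetical, and we assume that `max_distance` acts as a fraction of the input length (as in `agrep()`), which may differ from what `standardize_countries()` does internally.

```{r}
# Hypothetical mini-dictionary (the package relies on a far larger one)
dictionary <- c("argentina", "bolivia", "brazil", "uruguay", "portugal")
codes <- c(AR = "argentina", BO = "bolivia", BR = "brazil",
           UY = "uruguay", PT = "portugal")

# Illustrative helper, not part of RuHere
standardize_sketch <- function(x, dictionary, codes, max_distance = 0.1) {
  x_clean <- tolower(trimws(x))
  out <- rep(NA_character_, length(x))

  # Step 1: exact match against accepted names
  exact <- !is.na(x_clean) & x_clean %in% dictionary
  out[exact] <- x_clean[exact]

  # Step 2: exact match against codes
  is_code <- !is.na(x_clean) & toupper(x_clean) %in% names(codes)
  out[is_code] <- unname(codes[toupper(x_clean[is_code])])

  # Step 3: fuzzy match for whatever is left, using edit distance.
  # Assumption: allowed edits = ceiling(max_distance * input length)
  for (i in which(is.na(out) & !is.na(x_clean))) {
    d <- drop(utils::adist(x_clean[i], dictionary))
    allowed <- ceiling(max_distance * nchar(x_clean[i]))
    if (min(d) <= allowed) out[i] <- dictionary[which.min(d)]
  }
  out
}

result <- standardize_sketch(
  c("Brasil", "BR", "argentina", "Uruguai", NA, "xyz"),
  dictionary, codes
)
result
# expected values: brazil, brazil, argentina, uruguay, NA, NA
```

Inputs that match nothing within the tolerance stay `NA`, which is safer than forcing a wrong suggestion.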

### Standardizing states (`standardize_states`)

Similarly, this function standardizes state or province names. It uses the previously standardized country column (`country_suggested`) to disambiguate states that might share names across different countries, using as reference the names and postal codes provided in `rnaturalearthdata::states50()`.

```{r}
# Standardize state names
occ_state_std <- standardize_states(
    occ = occ_country_std$occ,
    state_column = "stateProvince",
    country_column = "country_suggested",
    max_distance = 0.1,
    lookup_na_state = TRUE # Try to extract state from coords if value is NA
)
```

Like `standardize_countries()`, the `standardize_states()` function returns a list with two elements:

+ `$occ`: the input data frame with two new columns: `state_suggested` (the standardized or corrected state/province name) and `state_source` (indicates whether the suggested state came from the original metadata or was imputed from coordinates).

+ `$report`: a summary table of the corrections and standardizations made, showing the original name and the suggested name, constrained by the suggested country.

Below are the first few rows of the modified data frame and the standardization report:

```{r}
occ_state_std$occ[1:3, 1:6]
#>   stateProvince state_suggested state_source country_suggested country country_source
#> 1          acre            acre     metadata            brazil  brazil       metadata
#> 2          acre            acre     metadata            brazil  brazil       metadata
#> 3          acre            acre     metadata            brazil  brazil       metadata

occ_state_std$report[1:3, ]
#>       stateProvince           state_suggested  country_suggested
#> 1        sa£o paulo                 sao paulo             brazil
#> 2         tocantins                 tocantins             brazil
#> 3               RS          rio grande do sul             brazil
```
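
The role of the country column can be illustrated with a toy lookup in which candidate states are first restricted to the record's country. This is only a sketch: `state_dict` and `match_state()` are hypothetical, the dictionary holds a handful of hand-typed rows, and only exact matching is shown (no fuzzy step).

```{r}
# Hypothetical state dictionary; "amazonas" exists in more than one country
state_dict <- data.frame(
  country = c("brazil", "brazil", "brazil", "colombia"),
  state   = c("amazonas", "sao paulo", "rio grande do sul", "amazonas"),
  code    = c("AM", "SP", "RS", NA),
  stringsAsFactors = FALSE
)

# Illustrative helper, not part of RuHere
match_state <- function(state, country, dict) {
  mapply(function(s, ctry) {
    if (is.na(s) || is.na(ctry)) return(NA_character_)
    # Only states of the record's country are candidates
    cand <- dict[dict$country == ctry, ]
    s_clean <- tolower(trimws(s))
    hit <- cand$state[cand$state == s_clean |
                        (!is.na(cand$code) & cand$code == toupper(s_clean))]
    if (length(hit) == 1) hit else NA_character_
  }, state, country, USE.NAMES = FALSE)
}

states_out <- match_state(
  state   = c("RS", "Amazonas", "RS"),
  country = c("brazil", "colombia", "colombia"),
  dict    = state_dict
)
states_out
# expected values: rio grande do sul, amazonas, NA
```

The same code (`RS`) resolves for Brazil but not for Colombia, because the lookup never leaves the country assigned to the record.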

## Imputing geographic information from coordinates

Sometimes, records have valid coordinates but lack administrative labels entirely. We can use spatial intersection to retrieve this information.

### Extracting country from coordinates (`country_from_coords`)

This function uses geographic coordinates (`long`, `lat`) and a reference world map (`rnaturalearthdata::map_units110()`) to determine the country for each point.

```{r}
# Explicitly extract country from coordinates for all records
occ_with_country_xy <- country_from_coords(
    occ = occ_state_std$occ,
    from = "all", # 'all' extracts for every record; 'na_only' extracts for missing ones
    output_column = "country_xy"
)

# Compare the original country vs. the one derived from coordinates
head(occ_with_country_xy[, c("country", "country_xy")])
#>   country country_xy
#> 1  brazil     brazil
#> 2  brazil     brazil
#> 3  brazil     brazil
#> 4      BR     brazil
#> 5      BR     brazil
#> 6      BR     brazil
```
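
Conceptually, this is a point-in-polygon query. The sketch below shows the idea in base R with a ray-casting test and two made-up square "regions". It is a simplified illustration only: the real function works with the reference map polygons, and `point_in_polygon()`, `region_from_coords()`, and `regions` are hypothetical names.

```{r}
# Ray casting: count how many polygon edges a horizontal ray crosses
point_in_polygon <- function(x, y, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge between vertices j and i straddle the point's latitude?
    if ((poly_y[i] > y) != (poly_y[j] > y)) {
      x_cross <- poly_x[i] +
        (y - poly_y[i]) * (poly_x[j] - poly_x[i]) / (poly_y[j] - poly_y[i])
      if (x < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Two made-up square regions (not real boundaries)
regions <- list(
  west = list(x = c(-60, -50, -50, -60), y = c(-30, -30, -20, -20)),
  east = list(x = c(-50, -40, -40, -50), y = c(-30, -30, -20, -20))
)

# Return the name of the first region containing each point, or NA
region_from_coords <- function(long, lat, regions) {
  vapply(seq_along(long), function(k) {
    hit <- vapply(regions, function(p) {
      point_in_polygon(long[k], lat[k], p$x, p$y)
    }, logical(1))
    if (any(hit)) names(regions)[which(hit)[1]] else NA_character_
  }, character(1))
}

pts <- data.frame(long = c(-55, -45, 0), lat = c(-25, -25, 0))
pts$region_xy <- region_from_coords(pts$long, pts$lat, regions)
pts
```

Points that fall in no polygon (for example, in the ocean) receive `NA` in this sketch.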

### Extracting state from coordinates (`states_from_coords`)

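Similarly, we can extract state or province names. Here, we extract the state for all records (`from = "all"`) and store the result in a new column, `state_xy`, so it can be compared with the original `stateProvince` values. The `state_source` column, created earlier by `standardize_states()`, is shown alongside to track where the suggested state came from.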

```{r}
# Extract state from coordinates for all records
occ_imputed <- states_from_coords(
    occ = occ_with_country_xy,
    from = "all",
    state_column = "stateProvince",
    output_column = "state_xy"
)

head(occ_imputed[, c("stateProvince", "state_xy", "state_source")])
#>   stateProvince state_xy state_source
#> 1          acre     acre     metadata
#> 2          acre     acre     metadata
#> 3          acre     acre     metadata
#> 4          acre amazonas     metadata
#> 5          acre     acre     metadata
#> 6          acre     acre     metadata
```
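
Row 4 in the output above shows a record labelled `acre` whose coordinates fall in `amazonas`. A quick way to list such disagreements is to compare the two columns directly. The sketch below uses a small hand-made data frame (`toy`) rather than the package data, and it is only an exact-equality comparison: unlike `check_states()`, it applies no distance tolerance near borders.

```{r}
# Hand-made example mirroring the kind of mismatch shown above
toy <- data.frame(
  state_suggested = c("acre", "acre", "acre", "acre"),
  state_xy        = c("acre", "acre", "amazonas", NA),
  stringsAsFactors = FALSE
)

# TRUE = labels agree, FALSE = they disagree, NA = nothing to compare
toy$agrees <- toy$state_suggested == toy$state_xy

# Rows where the metadata and the coordinates point to different states
toy[which(!toy$agrees), ]
```

Using `which()` drops the `NA` comparisons, so records without a coordinate-derived state are not reported as mismatches.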

## Checking and fixing spatial inconsistencies

A critical quality control step is verifying whether the coordinates actually fall within the administrative unit assigned to them. Discrepancies often indicate errors in either the label or the coordinates. 

### Checking country consistency (`check_countries`)

This function compares the coordinates against the boundaries of the country assigned in the `country_suggested` column.

```{r}
# Check if coordinates fall within the assigned country
occ_checked_country <- check_countries(
    occ = occ_imputed,
    country_column = "country_suggested",
    distance = 5,      # Allows a 5 km buffer for border points
    try_to_fix = TRUE  # Automatically attempts to fix inverted/swapped coordinates
)
#> Testing countries...
#> 468 records fall in wrong countries
#> Task 1 of 7: testing if longitude is inverted
#> 0 coordinates with longitude inverted
#> Task 2 of 7: testing if latitude is inverted
#> 0 coordinates with latitude inverted
#> Task 3 of 7: testing if longitude and latitude are inverted
#> 2 coordinates with longitude and latitude inverted
#> Task 4 of 7: testing if longitude and latitude are swapped
#> 1 coordinates with longitude and latitude swapped
#> Task 5 of 7: testing if longitude and latitude are swapped with longitude inverted
#> 0 coordinates with longitude and latitude swapped and latitude inverted
#> Task 6 of 7: testing if longitude and latitude are swapped - with latitude inverted
#> 0 coordinates with longitude and latitude swapped and longitude inverted
#> Task 7 of 7: testing if longitude and latitude are swapped - with longitude latitude inverted
#> 0 coordinates with longitude and latitude swapped and inverted

# The 'correct_country' column indicates validity
head(occ_checked_country[, c("country_suggested", "correct_country", "country_issues")])
#>   country_suggested correct_country country_issues
#> 1            brazil            TRUE        correct
#> 2            brazil            TRUE        correct
#> 3            brazil            TRUE        correct
#> 4            brazil            TRUE        correct
#> 5            brazil            TRUE        correct
#> 6            brazil            TRUE        correct
```

The column `correct_country` is added, indicating `TRUE` if the point falls within the country. Because we set `try_to_fix = TRUE`, the function internally calls `fix_countries()` to identify and correct errors like swapped latitude/longitude, recording the action in `country_issues`.
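
To give a feel for what a tolerance of `distance = 5` kilometres means, the sketch below computes a great-circle (haversine) distance between a record and a nearby border point. Both coordinates are made up, `haversine_km()` is a hypothetical helper, and we do not claim that `check_countries()` computes distances this way.

```{r}
# Great-circle distance in km, assuming a spherical Earth (radius 6371 km)
haversine_km <- function(long1, lat1, long2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Made-up record and made-up nearest border point, 0.02 degrees apart
d <- haversine_km(long1 = -54.60, lat1 = -25.50,
                  long2 = -54.58, lat2 = -25.50)

# Would this record be tolerated by a 5 km buffer?
within_buffer <- d <= 5
within_buffer
```

At this latitude, 0.02 degrees of longitude is roughly 2 km, so a record this close to the border would pass a 5 km tolerance.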

### Checking state consistency (`check_states`)

We perform a similar verification for states. Note that `check_states()` verifies points against the `state_suggested` column.

```{r}
# Check if coordinates fall within the assigned state
occ_checked_state <- check_states(
    occ = occ_checked_country,
    state_column = "state_suggested",
    distance = 5,
    try_to_fix = FALSE # We just want to flag issues here, not auto-fix
)
#> Testing states...
#> 87 records fall in wrong states

head(occ_checked_state[, c("state_suggested", "correct_state")])
#>   state_suggested correct_state
#> 1            acre          TRUE
#> 2            acre          TRUE
#> 3            acre          TRUE
#> 4            acre         FALSE
#> 5            acre          TRUE
#> 6            acre          TRUE
```

The `correct_country` and `correct_state` columns represent the first set of flags: records marked as `FALSE` are potentially erroneous. For additional details on how to explore and remove flagged records, see the vignette *“3. Flagging Records Using Record Information”*.

### Fixing coordinate errors explicitly (`fix_countries`)

If you prefer to run the fixing process separately (instead of inside `check_countries()`), you can use `fix_countries()`. This function runs seven distinct tests to detect issues such as inverted signs or swapped coordinates.

```{r}
# This step is only necessary if you did NOT set try_to_fix = TRUE above
fixing_example <- fix_countries(
    occ = occ_checked_country,
    country_column = "country_suggested",
    correct_country = "correct_country" # Column created by check_countries
)
#> Task 1 of 7: testing if longitude is inverted
#> 0 coordinates with longitude inverted
#> Task 2 of 7: testing if latitude is inverted
#> 0 coordinates with latitude inverted
#> Task 3 of 7: testing if longitude and latitude are inverted
#> 0 coordinates with longitude and latitude inverted
#> Task 4 of 7: testing if longitude and latitude are swapped
#> 0 coordinates with longitude and latitude swapped
#> Task 5 of 7: testing if longitude and latitude are swapped with longitude inverted
#> 0 coordinates with longitude and latitude swapped and latitude inverted
#> Task 6 of 7: testing if longitude and latitude are swapped - with latitude inverted
#> 0 coordinates with longitude and latitude swapped and longitude inverted
#> Task 7 of 7: testing if longitude and latitude are swapped - with longitude latitude inverted
#> 0 coordinates with longitude and latitude swapped and inverted
```

Records identified as "inverted" or "swapped" are corrected in place, and the `country_issues` column is updated to reflect the specific error type found.
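
The logic behind the seven tests can be sketched as generating seven candidate transformations of a coordinate pair and keeping the first one that lands inside the assigned country. The code below is an illustration under simplifying assumptions: the helper names are hypothetical, the issue labels are our own, and a rough bounding box stands in for the real country boundaries.

```{r}
# Seven candidate corrections for one coordinate pair (labels are our own)
candidate_fixes <- function(long, lat) {
  data.frame(
    issue = c("inverted_long", "inverted_lat", "inverted_both",
              "swapped", "swapped_inverted_long",
              "swapped_inverted_lat", "swapped_inverted_both"),
    long = c(-long, long, -long, lat, -lat, lat, -lat),
    lat  = c(lat, -lat, -lat, long, long, -long, -long),
    stringsAsFactors = FALSE
  )
}

# Stand-in for a real point-in-country test: a simple bounding box
inside_box <- function(long, lat, box) {
  long >= box$long[1] & long <= box$long[2] &
    lat >= box$lat[1] & lat <= box$lat[2]
}

# Return the first candidate that falls inside the box, or NULL
find_fix <- function(long, lat, box) {
  cand <- candidate_fixes(long, lat)
  ok <- inside_box(cand$long, cand$lat, box)
  if (any(ok)) cand[which(ok)[1], ] else NULL
}

# Rough, illustrative bounding box for Brazil (not an official boundary)
brazil_box <- list(long = c(-74, -34), lat = c(-34, 6))

# A record labelled Brazil whose longitude and latitude were swapped
fix <- find_fix(long = -25.4, lat = -49.3, box = brazil_box)
fix
```

Here the only candidate that falls inside the box is the swapped pair, so the sketch reports the issue as `swapped` and returns the corrected coordinates.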

Now that our dataset has its countries and states standardized and checked, we can move on to the next step: *“3. Flagging Records Using Associated Information”*.
