---
title: "Data Quality Assessment"
author:
  - Matthias Templ^[University of Applied Sciences Northwestern Switzerland (FHNW)]
  - Barbara Templ^[Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)]
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Data Quality Assessment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

*Vignette 2 of 4 for the **pep725** R package.
The most current version is available on
[GitHub](https://github.com/matthias-da/pep725) and
[CRAN](https://CRAN.R-project.org/package=pep725).*

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

Data quality assessment is a critical but often overlooked step in phenological
analysis. Before calculating trends, normals, or climate sensitivities, you need
to understand what's in your data and make informed decisions about unusual
observations.

### The Data Quality Challenge

Long-term phenological datasets such as PEP725 are compiled from many observers over decades. This richness enables large-scale analyses, but it also introduces data quality challenges that need to be assessed:

| Issue Type | Examples | Consequence if Ignored |
|------------|----------|------------------------|
| **Recording errors** | Typographical errors (DOY 150 instead of 105), incorrect year | Outliers bias statistics |
| **Identification errors** | Confusion between morphologically similar species | Inconsistent or mixed time series |
| **Protocol changes** | Changes in the definition of phenological phases over time | Artificial shifts or trends |
| **Biological anomalies** | Irregular phenological events as a result of climate extremes | May represent real biological signals |

The following sections introduce practical tools from the **pep725** package for diagnosing and handling these issues in a transparent, reproducible way.

### What You'll Learn

| Section | Topic | Key Functions |
|---------|-------|---------------|
| Part 1 | Detecting statistical outliers | `pep_flag_outliers()`, `pep_plot_outliers()` |
| Part 2 | Temporal coverage and completeness | `pep_completeness()` |
| Part 3 | Phase presence validation | `pep_check_phases()` |
| Part 4 | Integrated workflow | Combining all approaches |

### Prerequisites

This vignette assumes you are familiar with:

- Basic phenological concepts (from "Getting Started" vignette)
- Day of Year (DOY) as a timing metric
- BBCH phenological phase codes

## Setup

```{r setup, message=FALSE, warning=FALSE}
library(pep725)
library(data.table)

# Download the synthetic dataset
pep <- pep_download()

# For this vignette, we'll focus on flowering phases
flowering <- pep[phase_id %in% c(60, 65)]
cat("Flowering observations:", nrow(flowering), "\n")
cat("Species:", length(unique(flowering$species)), "\n")
cat("Year range:", min(flowering$year), "-", max(flowering$year), "\n")
```

---

# Part 1: Outlier Detection

## Why Detect Outliers?

An **outlier** is an observation that differs substantially from the expected
pattern. In phenological data, outliers may indicate:

1. **Recording errors** that should be excluded from analysis
   - A typo: DOY 250 instead of DOY 150 (roughly 3 months difference)
   - Wrong year entered: observation assigned to incorrect season

2. **Unusual weather events** worth investigating
   - Very warm winter causing early flowering
   - Late frost delaying spring phenology

3. **Second flowering** or other abnormal phenological events
   - Plants flowering again in autumn/winter after spring flowering
   - Increasingly observed with climate change

**The challenge**: You can't simply delete all outliers. Some are errors, but
others are scientifically valuable observations of unusual events.

## Statistical Methods for Outlier Detection

The `pep_flag_outliers()` function provides six statistical methods for identifying
potential outliers. Each method has different assumptions and strengths:

### Method 1: 30-Day Rule (Simple Threshold)

The simplest approach: flag any observation more than 30 days from the median
for that species/phase combination.

```{r flag-basic}
# Flag outliers using the default 30-day rule
outliers <- pep_flag_outliers(
  pep = flowering,
  method = "30day",
  by = c("species", "phase_id")
)

print(outliers)
```

**How it works:**

```
                     Distribution of flowering dates

        ← Early                                    Late →

   ----|----[=======|=======]----|-----
        ^           ^           ^
     -30 days    median      +30 days

   Anything outside the brackets is flagged as an outlier
```
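To make the bracket diagram concrete, the rule can be sketched in a few lines of base R. This is an illustrative stand-in, not the package's internal code, and the toy dates are invented:

```{r sketch-30day}
# Illustrative sketch of the 30-day rule (not the pep725 implementation)
flag_30day <- function(doy, group, window = 30) {
  expected <- ave(doy, group, FUN = median)  # per-group median DOY
  deviation <- doy - expected
  list(is_outlier = abs(deviation) > window,
       deviation = deviation,
       expected_doy = expected)
}

# Toy series: nine typical spring dates plus one typo (DOY 250)
doy <- c(118, 120, 121, 119, 122, 117, 120, 123, 119, 250)
res <- flag_30day(doy, group = rep("Malus_60", 10))
which(res$is_outlier)  # only the typo (position 10) exceeds the bracket
```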

### Understanding the Output

The function adds new columns to your original data:

| Column | What It Contains | How to Use It |
|--------|------------------|---------------|
| `is_outlier` | TRUE/FALSE flag | Filter data: `data[is_outlier == FALSE]` |
| `deviation` | Days from expected | Prioritize investigation: larger = more extreme |
| `expected_doy` | Reference DOY (median) | Understand what "normal" looks like |

### Method 2: MAD (Median Absolute Deviation) - Recommended

The MAD method is **robust to the presence of multiple outliers**. This is
important because if your data contains many outliers, they can distort
mean-based methods.

```{r flag-methods, eval=FALSE}
# MAD method: flag if > 3 MAD from median (robust)
outliers_mad <- pep_flag_outliers(flowering, method = "mad", threshold = 3)
```

**Why MAD is recommended:**

| Scenario | Mean-based method | MAD method |
|----------|-------------------|------------|
| 1-2 outliers | Works reasonably | Works well |
| Many outliers | Outliers inflate SD, may miss more | Stays robust |
| Asymmetric data | Assumes symmetry | Handles asymmetry |
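The robustness of the MAD is easy to see in a small base-R sketch (illustrative only; the package's internals may differ, and the toy data are invented). `stats::mad()` returns the scaled median absolute deviation:

```{r sketch-mad}
# Robust standardisation: deviations from the median, scaled by the MAD
flag_mad <- function(doy, threshold = 3) {
  abs(doy - median(doy)) / mad(doy) > threshold
}

# Toy series with one typo at DOY 250: the median (120) and the MAD are
# barely affected by the extreme value, so it is flagged cleanly
doy <- c(118, 120, 121, 119, 122, 117, 120, 123, 119, 250)
which(flag_mad(doy))
```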

### Method 3: IQR (Interquartile Range) - Also Robust

The IQR method uses quartiles and is familiar from boxplot "whiskers":

```{r iqr-method, eval=FALSE}
# IQR method: flag if outside 1.5 * IQR (standard boxplot rule)
outliers_iqr <- pep_flag_outliers(flowering, method = "iqr", threshold = 1.5)
```

**How it works:**
- Q1 = 25th percentile, Q3 = 75th percentile
- IQR = Q3 - Q1
- Outlier if: value < Q1 - 1.5×IQR OR value > Q3 + 1.5×IQR
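The same fences can be computed directly in base R (a sketch; note that `quantile()`'s default interpolation type may differ from the package's internal choice):

```{r sketch-iqr}
# Boxplot-style fences: flag values beyond k * IQR from the quartiles
flag_iqr <- function(doy, k = 1.5) {
  q <- quantile(doy, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  doy < q[1] - fence | doy > q[2] + fence
}

doy <- c(118, 120, 121, 119, 122, 117, 120, 123, 119, 250)
which(flag_iqr(doy))  # the DOY 250 typo lies above the upper fence
```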

### Method 4: Z-Score - Sensitive but Less Robust

The z-score method assumes normally distributed data:

```{r zscore-method, eval=FALSE}
# Z-score method: flag if |z| > 3 (assumes normal distribution)
outliers_zscore <- pep_flag_outliers(flowering, method = "zscore", threshold = 3)
```

**Caution**: The z-score method is sensitive to existing outliers. If your data
has many outliers, they will inflate the standard deviation, making the z-score
threshold less effective at detecting them.
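This masking effect is easy to demonstrate with an invented toy series: a single gross error inflates the standard deviation so much that its own z-score stays below the threshold:

```{r sketch-zscore}
doy <- c(118, 120, 121, 119, 122, 117, 120, 123, 119, 250)
z <- abs(doy - mean(doy)) / sd(doy)
max(z)              # about 2.84: the obvious typo is NOT flagged at |z| > 3
sd(doy)             # roughly 41 days ...
sd(doy[doy < 200])  # ... versus about 2 days once the typo is removed
```

The same typo is flagged immediately by the MAD and IQR methods, which is why the robust methods are preferred.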

### Method 5: GAM Residual - Model-based, Covariate-aware

The four methods above are all **univariate within a group** — they compare
each observation to the central tendency of its own station-species-phase
combination. That misses three realistic outlier patterns:

1. **Year-effect outliers.** In a year where most of central Europe bloomed
   8 days earlier, a station that flowered on its 30-year median looks
   "late" only because the reference ignores the shared weather signal.
2. **Covariate-inconsistent reports.** A lowland station that reports a
   DOY typical of a mountain station is invisible to a within-station
   detector because the station's own history may be short.
3. **Trend-inconsistent reports.** A station whose DOY is constant while
   every other station in the network has a trend is anomalous only in
   relation to the trend.

`method = "gam_residual"` tackles all three. For each group (typically
species × phase), it fits a GAM of DOY on year, altitude, latitude, and a
station random intercept (customisable via `formula =`) and flags
observations whose residual — standardised by a robust MAD scale — exceeds
the threshold:

```{r gam-method, eval=FALSE}
# Model-based detection across the whole species x phase group.
outliers_gam <- pep_flag_outliers(
  apple_flowering,
  by = c("genus", "species", "phase_id"),
  method = "gam_residual",
  threshold = 3.5
)
```

The `deviation` column now holds GAM residuals in days (not deviations
from a within-station median), and `expected_doy` holds the fitted
values — so the station-year whose residual is 20 days above the fit is
flagged whether or not its station's history made that value look
"normal".

Groups with fewer observations than `min_n_per_group` (default 50), or in
which the GAM fails to converge, silently fall back to the 30-day rule.
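The idea behind residual-based flagging can be sketched with a plain linear trend standing in for the GAM (the real method also smooths altitude and latitude and adds a station random intercept; the simulated data below are invented):

```{r sketch-residual}
set.seed(42)
year <- rep(1991:2020, each = 3)  # 90 simulated station-years
doy  <- 130 - 0.4 * (year - 1991) + rnorm(length(year), sd = 2)
doy[length(doy)] <- 130           # a 2020 report at the 1991 level

fit <- lm(doy ~ year)                             # stand-in for the GAM
z   <- abs(residuals(fit)) / mad(residuals(fit))  # robust standardisation
which(z > 3.5)   # flags the trend-inconsistent report in the final row
```

Note that this report would look unremarkable to a within-group median rule, because DOY 130 is well inside the overall range of the series; only its distance from the fitted trend gives it away.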

### Method 6: Mahalanobis - Multivariate, Robust (MCD)

All methods above flag observations one at a time. A station-year could
have an entirely reasonable BBCH 60 date and an entirely reasonable BBCH
65 date, but only **one day apart** — biologically impossible, because
first flowering precedes full flowering by ~10 days.

`method = "mahalanobis"` catches this by treating each station-year as
a *vector* across phases and computing the **robust Mahalanobis
distance** of that vector from the centre of the rest of the network.
"Robust" matters: the classical Mahalanobis distance uses the sample
mean and covariance, which are themselves pulled by outliers
(masking). We use the **Minimum Covariance Determinant** estimator
(Rousseeuw 1984, `robustbase::covMcd()`) which gives a high-breakdown
robust centre and scatter, so the outliers stay *outside* the robust
ellipsoid instead of contaminating it:

```{r mahalanobis-method, eval=FALSE}
# Multivariate detection on the phase profile per station-year.
# by = c("genus", "species") pools across phases so the covariance
# is estimated from the joint (BBCH 60, BBCH 65, BBCH 89) distribution.
outliers_mahal <- pep_flag_outliers(
  apple,
  by = c("genus", "species"),
  method = "mahalanobis"
)
```

The default threshold is `sqrt(qchisq(0.975, df = p))` where *p* is the
number of phases — the standard chi-square containment cut-off under
multivariate normality. Flagged station-years get all their phase rows
marked; `deviation` stores the robust MD (not a day-count, so it is
unitless).
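A dependency-free sketch of the multivariate idea, using the classical `stats::mahalanobis()` in place of the robust MCD estimate, with one injected station-year whose marginals are plausible but whose 1-day phase gap is impossible (all data invented):

```{r sketch-mahalanobis}
set.seed(1)
bbch60 <- round(rnorm(30, mean = 115, sd = 4))            # first flowering
bbch65 <- bbch60 + round(rnorm(30, mean = 10, sd = 1.5))  # ~10 days later
x <- rbind(cbind(bbch60, bbch65),
           c(118, 119))  # plausible marginals, impossible 1-day gap

md <- sqrt(mahalanobis(x, colMeans(x), cov(x)))
cutoff <- sqrt(qchisq(0.975, df = ncol(x)))  # default threshold
md[nrow(x)] > cutoff   # the inconsistent station-year stands out
```

With heavier contamination the classical mean and covariance would themselves be distorted by the outliers, which is exactly why the package uses the MCD estimator instead.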

### Choosing a Method

| Method | Best For | Threshold Meaning |
|--------|----------|-------------------|
| `30day` | Simple, interpretable threshold | Absolute days from median |
| `mad` | Most situations (recommended for single-phase) | Number of MADs from median |
| `iqr` | Familiar boxplot-style detection | IQR multiplier |
| `zscore` | Clean data, normal distribution | Standard deviations |
| `gam_residual` | Year / elevation / station covariates available | Robust z on GAM residuals |
| `mahalanobis` | Multiple phases per station-year | `sqrt(chi^2_{0.975, p})` default |

### Summary Statistics

```{r flag-summary}
summary(outliers)
```

### Comparing Outlier Rates Across Species

Different species may have different outlier rates due to observation difficulty
or data quality:

```{r flag-compare}
# Check outliers for grapevine
vine_flowering <- pep[species == "Vitis vinifera" & phase_id %in% c(60, 65)]

if (nrow(vine_flowering) > 0) {
  outliers_vine <- pep_flag_outliers(
    pep = vine_flowering,
    method = "30day",
    by = c("species", "phase_id")
  )

  cat("Outlier comparison:\n")
  cat("All flowering species: ", round(100 * mean(outliers$is_outlier), 2), "%\n")
  cat("Grapevine only:        ", round(100 * mean(outliers_vine$is_outlier), 2), "%\n")
} else {
  cat("No grapevine flowering data available for comparison.\n")
}
```

## Visualizing Outliers

Visual inspection is essential for understanding outlier patterns. The
`pep_plot_outliers()` function provides several visualization types —
four for the univariate methods plus two targeted at the model-based
and multivariate methods:

### 1. Overview Plot

Shows the big picture: how many outliers, which species/phases are affected:

```{r plot-overview, fig.height=6}
# Overview of outlier patterns
pep_plot_outliers(outliers, type = "overview")
```

**What to look for:**
- Which species have the most outliers? (Might indicate data quality issues)
- Are outliers concentrated in certain phases? (Might indicate definition problems)
- What's the overall outlier rate? (Typical: 1-5% for well-curated data)

### 2. Seasonal Distribution

When in the year do outliers occur?

```{r plot-seasonal, fig.height=5}
# When do outliers occur in the year?
pep_plot_outliers(outliers, type = "seasonal")
```

**Interpretation guide:**

| Outlier Timing | Likely Explanation |
|----------------|-------------------|
| Very early (DOY < 50) | Possible data errors (flowering in January/February unlikely for most species) |
| Somewhat early | Warm winters or Mediterranean locations |
| Somewhat late | Cool springs or high-altitude stations |
| Very late (DOY > 250) | Possible **second flowering** events - investigate! |

### 3. Detailed Context

See outliers alongside normal observations:

```{r plot-detail, fig.height=5}
# See outliers in context of all observations
pep_plot_outliers(outliers, type = "detail", n_top = 15)
```

**This plot shows:**
- Full distribution of observations (gray)
- Flagged outliers (highlighted)
- The `n_top` parameter controls how many species/phases to show

### 4. Geographic Distribution

Where are outliers located?

```{r plot-map, fig.height=5}
# Where are outliers located?
pep_plot_outliers(outliers, type = "map")
```

**What to look for:**
- Clustered outliers in one region? Might indicate local data quality issues
- Outliers at network edges? Might be at environmental limits
- Random scatter? Suggests individual observation errors

### 5. Model Diagnostic (for `gam_residual` / `mahalanobis`)

`type = "diagnostic"` produces a paper-ready 4-panel figure tailored to
the detection method. For `gam_residual` the panels are classical GAM
diagnostics — residuals vs fitted, Q-Q against Normal, |residual| vs
altitude, and a per-station worst-case-residual map. For `mahalanobis`
the panels switch automatically to MD-specific diagnostics: sorted MD
with the chi-square threshold line, Q-Q of MD² against χ²_*p*,
mean/max MD over time (exposes network-wide anomalous epochs), and
the per-station worst-case MD map.

```{r plot-diagnostic-gam, eval=FALSE}
# For gam_residual: classical model diagnostics
outliers_gam <- pep_flag_outliers(
  apple_flowering,
  by = c("genus", "species", "phase_id"),
  method = "gam_residual"
)
pep_plot_outliers(outliers_gam, type = "diagnostic")
```

![GAM-residual diagnostic on a synthetic fixture with one injected contamination at station 1, year 2000. Panel A: residuals vs fitted; the injected point appears at residual ≈ +50 days. Panel B: Q-Q against N(0,1); the right-tail departure corresponds to the outlier. Panel C: |residual| vs altitude, showing the model has no altitude bias away from the outlier. Panel D: per-station worst-case residual map.](figures/gam_diagnostic.png)

```{r plot-diagnostic-mahal, eval=FALSE}
# For mahalanobis: MD-specific diagnostics
outliers_mahal <- pep_flag_outliers(
  apple,
  by = c("genus", "species"),
  method = "mahalanobis"
)
pep_plot_outliers(outliers_mahal, type = "diagnostic")
```

![Mahalanobis-specific diagnostic on the same synthetic fixture. Panel A: sorted robust Mahalanobis distances across all station-years, with the $\sqrt{\chi^2_{0.975,\,3}} \approx 3.06$ threshold as a red dashed line; flagged points are red. Panel B: Q-Q of observed $\mathrm{MD}^2$ against $\chi^2_3$ — upper-tail departure identifies the outlier mass. Panel C: mean (blue) and max (red dotted) MD per year — the spike at year 2000 coincides with the injected anomaly. Panel D: per-station worst-case MD map.](figures/mahalanobis_diagnostic.png)

**What to look for:**

- **Residuals vs fitted (GAM)**: a level trend with no structure is
  ideal; a funnel or curve signals model misspecification, not outliers.
- **Q-Q (GAM)**: heavy-tailed deviations in the corners are the real
  outlier signal — light tails mean the threshold is picking up noise.
- **Sorted MD with threshold (Mahalanobis)**: the steep rise at the
  right-hand end is where flagged station-years sit; a long flat
  "knee" straddling the threshold means the boundary between flagged
  and unflagged points is ambiguous and the cut-off deserves scrutiny.
- **Mean / max MD over time (Mahalanobis)**: spikes in the max line
  identify *years* in which many stations reported biologically
  inconsistent phase sequences — useful for correlating with known
  data-collection or instrumental events.

### 6. Phase-profile Plot (primary Mahalanobis figure)

The MD detector flags station-years whose *shape* across phases is
atypical, even when each marginal DOY looks fine. `type = "profile"`
plots the phase profile as parallel coordinates — one line per
station-year, flagged ones in red, with the robust median profile
overlaid in black as a reference:

```{r plot-profile, eval=FALSE}
# Phase-profile parallel coordinates (primary Mahalanobis figure).
pep_plot_outliers(outliers_mahal, type = "profile")
```

![Phase-profile parallel-coordinates plot. Each line is one station-year; the grey envelope shows the natural variability of the network, the red lines are station-years flagged by the robust Mahalanobis detector, and the thick black line is the MCD robust per-phase median profile. Red lines that cross the phases at atypical angles — not just at unusual absolute DOYs — are the classic Mahalanobis flags.](figures/mahalanobis_profile.png)

**How to read it:**

- Every grey line is a "normal-looking" station-year; together they
  form the envelope of the robust ellipsoid.
- Red lines are flagged. Look where they bend oddly: a station-year
  whose BBCH 60 → 65 slope is much flatter or steeper than the
  reference is the classic MD flag.
- The black reference profile is the robust (MCD) per-phase median —
  the centre of the ellipsoid. Anything strongly parallel to this is
  safe; anything that crosses multiple phases at unusual angles is
  suspect.

---

# Part 2: Data Completeness

## Why Check Completeness?

**Data completeness** refers to how many years of observations exist for each
station/phase combination. This matters because:

1. **Trend analysis requires continuous data**: Gaps can bias trend estimates
2. **Normals require representative coverage**: WMO recommends 24+ years in a 30-year period
3. **Missing data isn't random**: Stations often drop out or are added systematically

### Visualizing Completeness Issues

```
Station A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●  <- 100% complete
           1990                      2020

Station B: ●●●●●●●●●●○○○○○○○○○○●●●●●●●●●●  <- Gap in middle
           1990                      2020

Station C: ○○○○○○○○○○○○○○○○○○○○●●●●●●●●●●  <- Only recent data
           1990                      2020

Station D: ●●●●●●●●●●●●●●●●●●●●○○○○○○○○○○  <- Discontinued
           1990                      2020
```

Different completeness patterns have different implications:

| Pattern | Implications for Analysis |
|---------|---------------------------|
| Full coverage | Ideal - reliable trends and normals |
| Gap in middle | Careful - may miss important changes |
| Only recent | Cannot compare to historical period |
| Discontinued | Cannot assess recent changes |
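The core metric behind these patterns, the share of years in a reference window that have at least one observation, can be sketched in base R (function and argument names here are illustrative, not the package's):

```{r sketch-completeness}
# Percentage of years in the reference window with at least one record
completeness_pct <- function(years, window = 1990:2020) {
  100 * length(intersect(unique(years), window)) / length(window)
}

station_b <- c(1990:1999, 2010:2020)  # ten-year gap in the middle
completeness_pct(station_b)           # about 68% of the 31-year window
```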

## Assessing Completeness

```{r completeness-basic}
# Check completeness by station and phase
# Use year_range to focus on a specific period
completeness <- pep_completeness(
  pep = flowering,
  by = c("s_id", "phase_id"),
  year_range = c(1990, 2020)
)

print(completeness)
```

### Understanding Completeness Metrics

The function calculates several metrics:

| Metric | What It Measures | Interpretation |
|--------|------------------|----------------|
| `n_years` | Total years with data | More = better |
| `completeness_pct` | % of year span covered | 70%+ for trends, 80%+ for normals |
| `year_span` | Total span (year_max - year_min + 1) | Context for completeness |
| `year_min` | Earliest observation | Needed for historical comparison |
| `year_max` | Most recent observation | Needed for current status |

### Summary Statistics

```{r completeness-summary}
summary(completeness)
```

## Filtering by Completeness

Use completeness information to select stations appropriate for your analysis:

```{r completeness-filter}
# Get stations with good coverage (>= 70%)
good_coverage <- completeness[completeness_pct >= 70]
cat("Stations with >= 70% coverage:", nrow(good_coverage), "\n")

# Use these for trend analysis
good_stations <- unique(good_coverage$s_id)
flowering_complete <- flowering[s_id %in% good_stations]
cat("Observations from complete stations:", nrow(flowering_complete), "\n")
```

### Completeness Thresholds for Different Analyses

| Analysis Type | Minimum Completeness | Reasoning |
|---------------|----------------------|-----------|
| Trend detection | 50% (15+ years) | Need enough points for trend fitting |
| Climate sensitivity | 60% | Need variability across climate conditions |
| 30-year normals | 80% (24+ years) | WMO standard requirement |
| Station comparison | Same time period | Avoid bias from different eras |

## Visualizing Completeness

```{r completeness-plot, fig.height=5}
plot(completeness)
```

---

# Part 3: Phase Presence Validation

## Why Check Phase Presence?

Before starting an analysis, it's important to verify that your data contains
the phenological phases you need. The `pep_check_phases()` function validates that
expected BBCH phase codes are present in your data.

**Common questions this helps answer:**

- Does my dataset have flowering observations (phase 60)?
- Are flowering and fruit maturity phases both available for apple?
- Which phases are missing for my species of interest?

## Checking Phase Presence

```{r check-basic}
# Check if expected phases are present for apple
apple <- pep[species == "Malus domestica"]

phase_check <- pep_check_phases(
  pep = apple,
  expected = c(60, 65, 87)  # flowering, full flowering, fruit maturity
)

print(phase_check)
```

### Understanding the Output

The function returns information about phase coverage:

| Output | What It Shows |
|--------|---------------|
| `expected` | The phases you asked for |
| `present` | Which expected phases were found |
| `missing` | Which expected phases are absent |
| `complete` | TRUE if all expected phases are present |
| `n_obs` | Number of observations per phase |
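The logic of the check reduces to set operations on BBCH codes. A minimal stand-alone sketch (illustrative only, not the package's internal code):

```{r sketch-check-phases}
check_phases_sketch <- function(observed, expected) {
  list(present  = intersect(expected, unique(observed)),
       missing  = setdiff(expected, observed),
       complete = length(setdiff(expected, observed)) == 0)
}

obs <- c(60, 60, 65, 65, 65)  # no fruit-maturity (87) records
pc <- check_phases_sketch(obs, expected = c(60, 65, 87))
pc$missing    # 87 is absent
pc$complete   # FALSE
```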

## Checking Multiple Species

To check phase presence across multiple species at once:

```{r check-multi}
# Check phases for multiple species
multi_check <- pep_check_phases_multi(
  pep = pep,
  species_list = c("Malus domestica", "Vitis vinifera"),
  expected = c(60, 65, 87)
)

print(multi_check)
```

### Common Phases to Check

| Plant Type | Common Phases | BBCH Codes |
|------------|---------------|------------|
| Cereals | Heading, Flowering, Harvest | 60, 65, 100 |
| Fruit trees | Flowering, Full flowering, Fruit maturity | 60, 65, 87 |
| Deciduous trees | Leaf unfolding, Flowering, Leaf fall | 11, 60, 95 |

---

# Part 4: Integrated Quality Workflow

## Putting It All Together

Here's a recommended workflow for data quality assessment that combines all the
tools covered in this vignette:

```{r workflow}
# ══════════════════════════════════════════════════════════════════════════════
# STEP 1: Assess temporal completeness
# ══════════════════════════════════════════════════════════════════════════════
# Why: Incomplete stations can bias trend estimates and normals

completeness <- pep_completeness(flowering, by = c("s_id", "phase_id"))
good_stations <- completeness[completeness_pct >= 50, s_id]
fl_filtered <- flowering[s_id %in% good_stations]

cat("Kept", length(good_stations), "stations with >= 50% completeness\n")

# ══════════════════════════════════════════════════════════════════════════════
# STEP 2: Flag statistical outliers
# ══════════════════════════════════════════════════════════════════════════════
# Why: Identify observations that deviate from expected patterns

outliers_wf <- pep_flag_outliers(fl_filtered, method = "mad", threshold = 3)

cat("Flagged", sum(outliers_wf$is_outlier), "outliers",
    "(", round(100 * mean(outliers_wf$is_outlier), 1), "% )\n")

# ══════════════════════════════════════════════════════════════════════════════
# STEP 3: Make informed decisions about exclusion
# ══════════════════════════════════════════════════════════════════════════════
# Key principle: Document your decisions!

# Option A: Strict cleaning (for normals calculation)
fl_strict <- outliers_wf[is_outlier == FALSE]

# Option B: Moderate cleaning (keep moderate outliers)
fl_moderate <- outliers_wf[is_outlier == FALSE | abs(deviation) < 60]

cat("Strict cleaning keeps:", nrow(fl_strict), "obs\n")
cat("Moderate cleaning keeps:", nrow(fl_moderate), "obs\n")

# ══════════════════════════════════════════════════════════════════════════════
# STEP 4: Proceed with analysis on cleaned data
# ══════════════════════════════════════════════════════════════════════════════

normals <- pheno_normals(fl_moderate, period = 1991:2020, min_years = 5)
cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
```

## Documentation Template

When publishing research, document your quality control decisions:

```
Data Quality Control:
1. Completeness filtering: Excluded stations with < 50% temporal coverage
   (reduced dataset from N to M stations)

2. Outlier detection: Used MAD method with threshold = 3
   (flagged X observations as outliers, Y% of total)

3. Outlier disposition:
   - Excluded: Z observations with deviations > 60 days (likely errors)
   - Retained: W observations identified as abnormal events (e.g., second flowering)

4. Phase validation: Checked for sequence violations in [species list]
   (identified V cases for manual review)
```

---

# Best Practices Summary

## For Outlier Detection

1. **Use robust methods**: MAD or IQR methods handle multiple outliers better
   than z-scores. The MAD method is generally recommended.

2. **Check seasonally**: Late-season outliers (DOY > 250) may be biologically
   meaningful events (e.g. second flowering) rather than errors.

3. **Verify extremes**: For observations with very large deviations (> 60 days),
   try to check original data sources or contact data providers.

4. **Document decisions**: Record which outliers you exclude and why. This is
   essential for reproducibility and peer review.

5. **Don't delete automatically**: Human judgment is needed to distinguish
   errors from genuine unusual events.

## For Abnormal Event Detection

1. **Consider species biology**: Some species (e.g., certain Rosaceae) are more
   prone to second flowering than others.

2. **Check geographic patterns**: If multiple nearby stations report late events
   in the same year, this suggests a real regional phenomenon.

3. **Look at weather context**: Link abnormal events to weather extremes -
   drought is a common trigger.

4. **Distinguish from errors**: Very isolated observations (single station,
   single year) need verification before treating them as e.g. second flowering.

5. **Keep for separate analysis**: Abnormal phenological events are scientifically
   interesting (e.g. as evidence of climate change impacts) - don't just delete them!

## For Completeness Assessment

1. **Set appropriate thresholds**: Use 80%+ coverage for calculating normals,
   50%+ for trend analysis.

2. **Consider gap patterns**: Many small gaps are usually better than one long
   gap that might coincide with important climate changes.

3. **Check temporal bias**: Ensure you're not comparing stations with data only
   from early periods to stations with only recent data.

4. **Document station selection**: Report how many stations met your completeness
   criteria and how this affected your sample.

## For Phase Presence Checking

1. **Check early**: Run `pep_check_phases()` at the start of your analysis to
   verify required phases are available in your data.

2. **Know required phases**: Different analyses need different phases -
   know which BBCH codes you need before starting.

3. **Check across species**: Use `pep_check_phases_multi()` to verify data
   availability for all species you plan to analyze.

---

## Summary

This vignette covered data quality tools for phenological analysis:

| Function | Purpose | Key Output |
|----------|---------|------------|
| `pep_flag_outliers()` | Identify unusual observations | `is_outlier` flag, `deviation` in days |
| `pep_plot_outliers()` | Visualize outlier patterns | Six plot types: overview, seasonal, detail, map, diagnostic, profile |
| `pep_completeness()` | Assess temporal coverage | Completeness %, gaps, year range |
| `pep_check_phases()` | Check phase presence | Missing phases, observation counts |

### Key Take-Home Messages

1. **Data quality assessment is not optional** - it's a critical step that
   affects the validity of all downstream analyses.

2. **Not all outliers are errors** - some represent genuine biological phenomena
   like second flowering that deserve further study.

3. **Document your decisions** - quality control choices should be transparent
   and reproducible.

4. **Use robust methods** - the MAD method is recommended for phenological data
   because it handles multiple outliers well.

5. **Visual inspection matters** - plots often reveal patterns that summary
   statistics miss.

## Next Steps

Explore the other vignettes for complementary analyses:

- **Getting Started**: Data access, the `pep` class, and basic exploration

```{r eval = FALSE}
vignette("getting-started", package = "pep725")
```

- **Phenological Analysis**: Normals, anomalies, quality grading, and trends

```{r eval = FALSE}
vignette("phenological-analysis", package = "pep725")
```

- **Spatial Phenological Patterns**: Gradients, synchrony, and mapping

```{r eval = FALSE}
vignette("spatial-patterns", package = "pep725")
```

---

## Session Info

```{r session}
sessionInfo()
```
