---
title: "Working With NYC Wetlands Data"
author: "Shannon Joyce"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working With NYC Wetlands Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(nycOpenData)
library(ggplot2)
library(dplyr)
library(knitr)
```

## Introduction
New York City is home to many wetland features. To raise awareness of how numerous they are, [this dataset](https://data.cityofnewyork.us/dataset/NYC-Wetlands/p48c-iqtu/about_data) of their geographic locations and descriptions was created. In R, the `nycOpenData` package can pull this data directly.

The `nycOpenData` package provides a streamlined interface to New York City's open data resources by connecting directly to the NYC Open Data Portal. It is currently used as a primary tool for teaching data acquisition in [Reproducible Research Using R](https://martinezc1-reproducible-research-using-r.share.connect.posit.cloud/), helping students bridge the gap between raw city APIs and tidy data analysis.

By calling `nyc_pull_dataset()` with the dataset ID `"p48c-iqtu"`, we can gather the most recently listed wetland features in New York City and then filter on any column in the dataset.

> Note: `nyc_pull_dataset("p48c-iqtu")` automatically sorts in descending order by the `verificationstatusyear` column. Because of this ordering, the first group of rows returned is `Unverified`, and `verificationstatusyear` is missing for those rows.

## Pulling a Small Sample
To start, let's pull a small sample to see what the data looks like. By default, the function pulls the *10,000 most recent* additions. Let's change that to see only the latest 3 by setting `limit = 3`.

```{r small-sample}
small_sample <- nyc_pull_dataset("p48c-iqtu", limit = 3)
small_sample

# Seeing what columns are in the dataset
names(small_sample)
```

Fantastic! We successfully pulled wetlands data from the NYC Open Data Portal. 

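Let's now pull a larger set of rows to work with. The call below uses `limit = 100` rather than retrieving every record; raise `limit` to pull more.

## Pulling a Larger Dataset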
```{r full-data}
wetlands_data <- nyc_pull_dataset("p48c-iqtu", limit = 100)

# Let's take a look at the first few rows of this larger pull
wetlands_data |>
  slice_head(n = 6)
```

In our small sample, the first few rows all had a verification status of `Unverified`. Let's see what other values appear in that column:

```{r ver-status}
wetlands_data |>
  distinct(verificationstatus)
```
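
`distinct()` lists the values but not how common each one is. As a minimal sketch on a small hand-made tibble (the rows are invented for illustration and are not real wetlands records, and the `Verified` label is a hypothetical stand-in for whatever statuses the dataset actually uses), `count()` tallies the rows per status:

```{r status-counts-sketch}
library(dplyr)

# Toy stand-in for wetlands_data; values are illustrative only
toy_status <- tibble(
  verificationstatus = c("Unverified", "Verified", "Verified", "Unverified", "Verified")
)

# Tally rows per status, most common first
status_counts <- toy_status |>
  count(verificationstatus, sort = TRUE)

status_counts
```

Running the same `count()` call on `wetlands_data` would show how much of the pull is unverified.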

Now that we see the different values in the `verificationstatus` column, let's filter *out* all of the unverified wetland features:

```{r filter-verified}
# Creating the dataset
verified_wetlands <- wetlands_data |> filter(verificationstatus != "Unverified")

# Quick check to make sure our filtering worked
verified_wetlands |>
  slice_head(n = 6)

verified_wetlands |>
  distinct(verificationstatus)
```
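
One detail worth knowing about this kind of filter: `filter()` keeps only rows where the condition is `TRUE`, so any row whose `verificationstatus` is `NA` is dropped along with the `Unverified` rows. Whether this dataset contains missing statuses is not established here. The toy tibble below (invented values) only illustrates the behavior, plus one way to keep such rows if that is preferred:

```{r na-filter-sketch}
library(dplyr)

# Toy data with one missing status; not real wetlands records
toy <- tibble(
  id = 1:4,
  verificationstatus = c("Unverified", "Verified", NA, "Verified")
)

# NA != "Unverified" evaluates to NA, so the NA row is dropped too
dropped_na <- toy |>
  filter(verificationstatus != "Unverified")

# Explicitly keep rows with a missing status
kept_na <- toy |>
  filter(is.na(verificationstatus) | verificationstatus != "Unverified")

nrow(dropped_na)
nrow(kept_na)
```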

Success! Now that we have the verified wetland features from our pull, let's take a look at some descriptive stats.

## Mini Analysis

Let's create a summary table showing how many wetland features were verified each year:

```{r year-summary}
# count() groups by the column itself, so a separate group_by() is not needed
verified_per_year <- verified_wetlands |>
  count(verificationstatusyear)

verified_per_year |> kable(caption = "Verified Wetland Features Per Year")
```
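
Open data APIs often return numeric-looking fields as text. Assuming `verificationstatusyear` arrives as a character column (an assumption; check with `class()` on your own pull), converting it to integer before counting lets later steps treat the years as numbers, for example on a plot axis. The data here is a made-up stand-in:

```{r year-type-sketch}
library(dplyr)

# Toy years stored as text; not real wetlands records
toy_years <- tibble(
  verificationstatusyear = c("2019", "2021", "2019", "2020", "2021", "2021")
)

year_counts <- toy_years |>
  mutate(verificationstatusyear = as.integer(verificationstatusyear)) |>
  count(verificationstatusyear) |>
  arrange(verificationstatusyear)

year_counts
```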

Let's create a bar graph to see how many verified wetland features fall into each classification!

```{r fig.width=7, fig.height=4}
ggplot(data = verified_wetlands, aes(x = classname)) +
  geom_bar(fill = "forestgreen") +
  labs(
    title = "Total Number of Wetland Features By Classification",
    x = "Classification Name",
    y = "Total Count"
  ) +
  theme_minimal()
```
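
With many classifications, bars are easier to compare when they are ordered by count and drawn horizontally so that long labels stay readable. A sketch of that variation, using invented class names rather than the dataset's actual `classname` values:

```{r ordered-bars-sketch, fig.width=7, fig.height=4}
library(dplyr)
library(ggplot2)

# Toy classifications; not real wetlands records
toy_classes <- tibble(
  classname = c("Class A", "Class B", "Class B", "Class C", "Class C", "Class C")
)

# Count first, then plot the precomputed counts with geom_col()
class_counts <- toy_classes |>
  count(classname)

ordered_plot <- ggplot(class_counts, aes(x = reorder(classname, n), y = n)) +
  geom_col(fill = "forestgreen") +
  coord_flip() +
  labs(x = "Classification Name", y = "Total Count") +
  theme_minimal()

ordered_plot
```

Swapping `toy_classes` for `verified_wetlands` would apply the same ordering to the real pull.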

Though this vignette only demonstrates a simple use of this function, the inclusion of geospatial data allows users to map these wetland features using the provided multipolygon coordinates.
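
A hedged sketch of what that mapping could look like with the `sf` package, which this vignette does not otherwise use. It assumes the geometry arrives as well-known text (WKT) in a column named `the_geom`. Both the column name and the format are assumptions, so inspect `names(wetlands_data)` first. The polygon below is a made-up square, not a real wetland, and the chunk only runs if `sf` is installed:

```{r map-sketch, eval = requireNamespace("sf", quietly = TRUE)}
library(ggplot2)

# Toy record with a hypothetical WKT geometry column
toy_geo <- data.frame(
  classname = "Class A",
  the_geom = "MULTIPOLYGON (((-73.95 40.60, -73.94 40.60, -73.94 40.61, -73.95 40.61, -73.95 40.60)))"
)

# Parse the WKT column into an sf object in WGS 84 (EPSG:4326)
toy_sf <- sf::st_as_sf(toy_geo, wkt = "the_geom", crs = 4326)

ggplot(toy_sf) +
  geom_sf(aes(fill = classname)) +
  theme_minimal()
```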

## Summary
The `nycOpenData` package serves as a robust interface for the NYC Open Data Portal, streamlining the path from raw city APIs to actionable insights. By abstracting the complexities of data acquisition (such as pagination, type-casting, and complex filtering), it allows users to focus on analysis rather than data engineering.

As demonstrated in this vignette, the package provides a seamless workflow for targeted data retrieval, automated filtering, and rapid visualization.

## How to Cite
If you use this package for research or educational purposes, please cite it as follows:

Martinez C (2026). nycOpenData: Convenient Access to NYC Open Data API Endpoints. R package version 0.1.6, https://martinezc1.github.io/nycOpenData/.