---
title: "Getting Started with fuzzystring"
author: "Paul Efren Santos Andrade"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with fuzzystring}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(fuzzystring)
```

## Introduction

**fuzzystring** provides fast, flexible fuzzy string joins for `data.frame` and
`data.table` objects using approximate string matching. It is built on top of
`data.table` and `stringdist`. Matched rows are assembled in compiled C++, and
for single-column joins adaptive candidate planning reduces unnecessary
distance evaluations.

## Installation

You can install **fuzzystring** from CRAN:

```r
install.packages("fuzzystring")
```

You can also install the development version from GitHub:

```r
# Using pak (recommended)
pak::pak("PaulESantos/fuzzystring")

# Or using remotes
remotes::install_github("PaulESantos/fuzzystring")
```

## Quick Start

Here is a simple example that matches misspelled diamond cut names against a
reference table:

```{r quick-start}
# Your messy data
x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"), 
  id = 1:3
)

# Reference data
y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"), 
  grp = c("A", "B", "C")
)

# Fuzzy join with max distance of 2 edits
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```
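
To see why `max_dist = 2` is enough here, it can help to look at the raw
distances. The sketch below calls `stringdist::stringdist()` directly (the
package that **fuzzystring** builds on) with the `"osa"` method. The vector
names `messy` and `reference` are illustrative only.

```r
library(stringdist)

messy     <- c("Idea", "Premiom", "Very Good")
reference <- c("Ideal", "Premium", "VeryGood")

# Element-wise optimal string alignment distance for each pair
d <- stringdist(messy, reference, method = "osa")
d
#> [1] 1 1 1
```

Each pair is a single edit apart (one insertion, one substitution, one
deletion), so all three rows fall within the threshold.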

## Key Features

### All Join Types Supported

**fuzzystring** supports all standard join types. The small dataset below is
reused in every example so you can compare how each join type behaves.

```{r join-datasets}
x_join <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Gooood"),
  id = 1:4
)

y_join <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood", "Good"),
  grp = c("A", "B", "C", "D")
)
```

- `fuzzystring_inner_join()`: Only matching rows.
- `fuzzystring_left_join()`: All rows from `x`, matching rows from `y`.
- `fuzzystring_right_join()`: All rows from `y`, matching rows from `x`.
- `fuzzystring_full_join()`: All rows from both tables.
- `fuzzystring_semi_join()`: Rows from `x` that have a match in `y`.
- `fuzzystring_anti_join()`: Rows from `x` that don't have a match in `y`.
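
The join types above differ only in which rows they keep once the matching
pairs are known. The following is a conceptual sketch of that idea using
`stringdist::stringdistmatrix()` and base R. It is not how **fuzzystring** is
implemented internally, and the extra value `"Fair"` is a hypothetical addition
(it is not part of `x_join`) included so that one row has no match.

```r
library(stringdist)

# "Fair" is a hypothetical extra value with no close match in y_names
x_names <- c("Idea", "Premiom", "Very Good", "Gooood", "Fair")
y_names <- c("Ideal", "Premium", "VeryGood", "Good")

# Distance between every x value and every y value
d <- stringdistmatrix(x_names, y_names, method = "osa")

# Row/column positions of all pairs within the threshold
hits <- which(d <= 2, arr.ind = TRUE)

# Inner join: one output row per matching pair
inner_pairs <- data.frame(
  name        = x_names[hits[, "row"]],
  approx_name = y_names[hits[, "col"]],
  distance    = d[hits]
)

# Semi join keeps x values with at least one match; anti join keeps the rest
has_match  <- seq_along(x_names) %in% hits[, "row"]
semi_names <- x_names[has_match]
anti_names <- x_names[!has_match]
```

In this sketch a left join would be `inner_pairs` plus the `anti_names` rows
padded with `NA`, and a full join would also add any unmatched `y` values.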

#### Inner join

```{r join-inner, eval = TRUE}
fuzzystring_inner_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Left join

```{r join-left, eval = TRUE}
fuzzystring_left_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Right join

```{r join-right, eval = TRUE}
fuzzystring_right_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Full join

```{r join-full, eval = TRUE}
fuzzystring_full_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
```

#### Semi join (rows from `x` with a match in `y`)

```{r join-semi, eval = TRUE}
fuzzystring_semi_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2
)
```

#### Anti join (rows from `x` without a match in `y`)

```{r join-anti, eval = TRUE}
fuzzystring_anti_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2
)
```

#### Using the generic `fuzzystring_join()`

If you prefer a single entry point, you can use `fuzzystring_join()` directly
by specifying `mode`.

```{r join-generic, eval = TRUE}
fuzzystring_join(
  x_join, y_join,
  by = c(name = "approx_name"),
  max_dist = 2,
  mode = "left",
  distance_col = "distance"
)
```

### Multiple Distance Methods

You can choose from various distance metrics provided by the `stringdist` package:

```{r distance-methods, eval = FALSE}
# Optimal String Alignment (default)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")

# Damerau-Levenshtein
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")

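# Jaro-Winkler (good for names); distances lie in [0, 1],
# so use a fractional max_dist
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw", max_dist = 0.2)

# Soundex (phonetic matching); the distance is 0 (same code) or 1
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex", max_dist = 0)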
```
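
The methods report distances on different scales, which matters when choosing
`max_dist`. The sketch below calls `stringdist::stringdist()` directly to
compare them on two sample pairs; the variable names are illustrative. Note
that with stringdist's default prefix weight (`p = 0`), `"jw"` computes the
plain Jaro distance.

```r
library(stringdist)

# Edit-based methods count edits, so thresholds are whole numbers
d_lv  <- stringdist("Maira", "Maria", method = "lv")   # swap counts as 2 substitutions
d_osa <- stringdist("Maira", "Maria", method = "osa")  # swap counts as 1 transposition

# Jaro / Jaro-Winkler is normalised to the range [0, 1]
d_jw <- stringdist("Maira", "Maria", method = "jw")

# Soundex is binary: 0 if the phonetic codes agree, 1 otherwise
d_sx <- stringdist("Smyth", "Smith", method = "soundex")

c(lv = d_lv, osa = d_osa, jw = d_jw, soundex = d_sx)
```

A threshold such as `max_dist = 2` is meaningful for `"osa"`, `"lv"`, or
`"dl"`, but it would accept every pair under `"jw"` or `"soundex"`.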

### Case-Insensitive Matching

Use `ignore_case = TRUE` to ignore capitalization:

```{r ignore-case, eval = FALSE}
fuzzystring_inner_join(
  x, y, 
  by = c(name = "approx_name"),
  ignore_case = TRUE,
  max_dist = 1
)
```
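
Without case folding, every letter that differs only in case counts as an
edit. The sketch below illustrates the effect with `stringdist::stringdist()`
and `tolower()`. It assumes that `ignore_case = TRUE` behaves like lowercasing
both inputs before comparison, which is an assumption about the package
internals rather than a documented guarantee.

```r
library(stringdist)

# Case-sensitive: "D", "E", "A", "L" each differ from their lowercase forms
d_sensitive <- stringdist("IDEAL", "Ideal", method = "osa")
d_sensitive
#> [1] 4

# Case-folded (assumed equivalent of ignore_case = TRUE)
d_folded <- stringdist(tolower("IDEAL"), tolower("Ideal"), method = "osa")
d_folded
#> [1] 0
```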

## Advanced Usage

### Multiple Column Joins

You can match on multiple string columns at once. The same distance method and
threshold are applied to each mapped column.

```{r multi-column, eval = FALSE}
x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales")
)

y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  customer_id = 1:2
)

fuzzystring_inner_join(
  x_multi, y_multi,
  by = c(first = "first_ref", last = "last_ref"),
  method = "osa",
  max_dist = 1
)
```
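
To make the per-column rule concrete, the sketch below computes the distance
for each column separately with `stringdist::stringdist()` and combines the
results. It assumes that a pair of rows matches only when every mapped column
is within `max_dist`, and it compares rows position by position, whereas the
join itself considers all combinations of rows.

```r
library(stringdist)

first_dist <- stringdist(c("Jon", "Maira"), c("John", "Maria"), method = "osa")
last_dist  <- stringdist(c("Smyth", "Gonzales"), c("Smith", "Gonzalez"), method = "osa")

# Assumption: a row pair matches only if every column is within max_dist
is_match <- first_dist <= 1 & last_dist <= 1
is_match
#> [1] TRUE TRUE
```

`"Maira"` and `"Maria"` differ by one adjacent transposition, which `"osa"`
counts as a single edit, so both rows pass with `max_dist = 1`.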

## Performance

**fuzzystring** runs most of the join on a compiled C++ path and uses
`data.table` to orchestrate candidate generation. In practice, row expansion
and binding are compiled for every join mode, column types are preserved more
reliably, and adaptive candidate planning helps both duplicate-heavy and
low-duplication workloads.
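
One general way to avoid redundant distance evaluations on duplicate-heavy
input is to compare only the unique values and then expand the result back to
the original rows. The sketch below illustrates that idea with
`stringdist::stringdistmatrix()`. It is a simplified illustration, not the
package's actual planner, and all variable names are hypothetical.

```r
library(stringdist)

x_names <- c("Idea", "Idea", "Premiom", "Idea", "Premiom")
y_names <- c("Ideal", "Premium")

# Compute distances once per distinct x value (2 rows instead of 5)
ux       <- unique(x_names)
d_unique <- stringdistmatrix(ux, y_names, method = "osa")

# Expand back to one row per original x value
d_full <- d_unique[match(x_names, ux), , drop = FALSE]

# Same result as comparing every row directly
d_direct <- stringdistmatrix(x_names, y_names, method = "osa")
all(d_full == d_direct)
```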

For a dedicated comparison against `fuzzyjoin::stringdist_join()`, see the
benchmark article bundled with the package.
