---
title: "validata"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{validata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)


iris <- tibble::tibble(iris)
```


```{r setup}
library(validata)
library(tidyselect)
```

# Distinct

## Confirm Distinct

In data analysis tasks we often have data sets with multiple possible ID columns, but it's not always clear which combination uniquely identifies each row.

`sample_data1` has 125 rows, with 3 ID-type columns and 3 value columns.

```{r}
head(sample_data1)
```

Let's use `confirm_distinct` iteratively to find the uniquely identifying columns of `sample_data1`.

```{r}
sample_data1 %>% 
  confirm_distinct(ID_COL1)
```

```{r}
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2)
```

```{r}
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
```

Here we can conclude that the combination of all 3 ID columns is the primary key for the data.
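
For intuition, the uniqueness test behind each step can be approximated in base R. This is only a sketch on made-up data: `toy` and `is_distinct` are hypothetical names, and it is not how `confirm_distinct` is implemented.

```{r}
# Toy data (hypothetical, not part of validata): two ID columns and one value column
toy <- data.frame(
  id_a  = c(1, 1, 2, 2),
  id_b  = c("x", "y", "x", "y"),
  value = c(10, 20, 30, 40)
)

# A column set is a candidate key when no combination of its values repeats
is_distinct <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

is_distinct(toy, "id_a")             # FALSE: each id_a value appears twice
is_distinct(toy, c("id_a", "id_b"))  # TRUE: every (id_a, id_b) pair is unique
```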

## Determine Distinct

These steps can be automated with the wrapper function `determine_distinct`.

```{r}
sample_data1 %>% 
  determine_distinct(matches("ID"))
```
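
Conceptually, automating the search means trying column combinations until one identifies every row. The sketch below illustrates that idea in base R, assuming a smallest-first search order; `find_key` and `toy_ids` are hypothetical, and `determine_distinct` may search and report differently.

```{r}
# Hypothetical helper: try every combination of candidate columns, smallest first,
# and return the first combination whose values are unique across rows
find_key <- function(df, candidates) {
  for (k in seq_along(candidates)) {
    combos <- combn(candidates, k, simplify = FALSE)
    for (cols in combos) {
      if (anyDuplicated(df[, cols, drop = FALSE]) == 0) return(cols)
    }
  }
  NULL  # no combination of the candidates identifies rows uniquely
}

# Made-up data: no single column is unique, but id_a plus id_b is
toy_ids <- data.frame(
  id_a = c(1, 1, 2, 2),
  id_b = c("x", "y", "x", "y"),
  id_c = c("p", "p", "p", "q")
)

find_key(toy_ids, c("id_a", "id_b", "id_c"))
```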

# Mapping


`confirm_mapping` tells you the mapping between two columns in a data frame:

- 1 - 1 mapping
- 1 - many mapping
- many - 1 mapping
- many - many mapping
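
To make the four categories concrete, here is a rough base R sketch that classifies the relationship between two vectors. `mapping_type` is a hypothetical helper, and reading "1 - many" as "one value of the first column pairs with several values of the second" is an assumption about the naming convention, not a description of the package internals.

```{r}
# Hypothetical helper: classify the relationship between two vectors of equal length
mapping_type <- function(x, y) {
  pairs <- unique(data.frame(x = x, y = y))
  x_to_many <- anyDuplicated(pairs$x) > 0  # some x value pairs with several y values
  y_to_many <- anyDuplicated(pairs$y) > 0  # some y value pairs with several x values
  if (x_to_many && y_to_many) {
    "many - many"
  } else if (x_to_many) {
    "1 - many"
  } else if (y_to_many) {
    "many - 1"
  } else {
    "1 - 1"
  }
}

mapping_type(c("a", "b", "c"), c(1, 2, 3))
mapping_type(c("a", "a", "b"), c(1, 2, 3))
mapping_type(c("a", "b", "c"), c(1, 1, 2))
mapping_type(c("a", "a", "b", "b"), c(1, 2, 1, 2))
```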


## Confirm Mapping

`confirm_mapping` gives the option to view which type of mapping is associated with each individual row. 

```{r}
sample_data1 %>% 
  confirm_mapping(ID_COL1, ID_COL2, view = FALSE)
```

## Determine Mapping

```{r}
sample_data1 %>% 
  determine_mapping(everything())
```

# Overlap

The `overlap` functions give a Venn-style description of the values in 2 columns. This is especially useful before performing a join, when you want to confirm that the data frames have matching keys.

## Confirm Overlap

`confirm_overlap` differs from the other `confirm` functions in that it takes 2 vectors as arguments instead of a data frame. This allows the user to test overlap between columns of different data frames, or between arbitrary vectors if necessary.

```{r}

confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap

```

`confirm_overlap` returns a summary data frame invisibly, allowing you to access individual elements using the helper functions.

```{r}
print(iris_overlap)
```

Find the elements unique to the first column:

```{r}
iris_overlap %>% 
  co_find_only_in_1() %>% 
  head()

```

Find the elements unique to the second column:

```{r}
iris_overlap %>% 
  co_find_only_in_2() %>% 
  head()
```

Find the elements shared by both columns:

```{r}
iris_overlap %>% 
  co_find_in_both() %>% 
  head()
```
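
The three helpers above correspond roughly to ordinary set operations on the distinct values of each vector. As a minimal base R illustration on made-up vectors (`v1` and `v2` are hypothetical, and the helpers themselves return richer output than this):

```{r}
# Compare the distinct values of two vectors with base R set operations
v1 <- c(1, 2, 3, 4)
v2 <- c(3, 4, 5)

setdiff(v1, v2)    # values only in the first vector: 1 2
setdiff(v2, v1)    # values only in the second vector: 5
intersect(v1, v2)  # values shared by both: 3 4
```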

## Determine Overlap

`determine_overlap` takes a data frame and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs of columns with matching types are tested.

```{r eval=FALSE, include=FALSE}
iris %>% 
  determine_overlap(everything())
```

Note that the `overlap` functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see [Complex Upset Plots](https://krassowski.github.io/complex-upset/).
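
To illustrate what "pairwise, matching types only" can look like, the base R sketch below counts shared distinct values for every pair of same-class columns in a made-up data frame. `pairwise_shared` and `toy_cols` are hypothetical, and the summary that `determine_overlap` returns will contain different columns.

```{r}
# Hypothetical sketch: count shared distinct values for every pair of same-class columns
pairwise_shared <- function(df) {
  pairs <- combn(names(df), 2, simplify = FALSE)
  same_class <- Filter(
    function(p) identical(class(df[[p[1]]]), class(df[[p[2]]])),
    pairs
  )
  data.frame(
    col1   = vapply(same_class, `[`, character(1), 1),
    col2   = vapply(same_class, `[`, character(1), 2),
    shared = vapply(
      same_class,
      function(p) length(intersect(df[[p[1]]], df[[p[2]]])),
      integer(1)
    )
  )
}

# Only the (a, b) pair is compared; label is skipped because its class differs
toy_cols <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4), label = c("x", "y", "z"))
pairwise_shared(toy_cols)
```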

# String Length

## Confirm String Length

Get a frequency table of string lengths in a character column.
The table is printed, while the original data frame is returned invisibly with an added column indicating the string lengths.

```{r}
iris %>% 
  confirm_strlen(Species) -> species_len
```

The output is a data frame:

```{r}
head(species_len)
```

## Choose String Length

A helper function for the output of `confirm_strlen` that filters the data frame for chosen string lengths.

```{r}
species_len %>% 
  choose_strlen(len = 6) %>% 
  head()
```
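
The two steps in this section, tabulating string lengths and then filtering on a chosen length, can be approximated in base R with `nchar`. This is a sketch on a made-up vector (`species_names` and `name_lengths` are hypothetical), not the package implementation.

```{r}
# Base R sketch: tabulate string lengths of a character vector
species_names <- c("setosa", "versicolor", "virginica", "setosa")
name_lengths <- nchar(species_names)

table(name_lengths)               # frequency of each string length
species_names[name_lengths == 6]  # keep only strings of length 6
```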

# Diagnose

A reproduction of `diagnose` from the dlookr package. It is usually a good first step when analyzing a data set.

```{r}
iris %>% 
  diagnose()
```
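
For readers curious what a per-column diagnostic involves, a simplified summary can be sketched in base R. The data (`toy_diag`) and the chosen columns are assumptions made for illustration; the output of `diagnose` may include other fields and count values differently.

```{r}
# Hypothetical sketch of a per-column diagnostic summary in base R
toy_diag <- data.frame(
  num = c(1, 2, NA, 2),
  chr = c("a", "b", "b", NA)
)

summary_tbl <- data.frame(
  variable      = names(toy_diag),
  type          = vapply(toy_diag, function(col) class(col)[1], character(1)),
  missing_count = vapply(toy_diag, function(col) sum(is.na(col)), integer(1)),
  # unique() keeps NA as its own value, so NA is counted here
  unique_count  = vapply(toy_diag, function(col) length(unique(col)), integer(1)),
  row.names     = NULL
)

summary_tbl
```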