---
title: "Dealing with DSV files"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dealing with DSV files}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

"Delimiter-separated values" (DSV) files are (unfortunately) still a very common way
to serialize tabular data and are widely used as a data exchange format.
Since there isn't a strict technical standard that specifies how different data types
should be represented in plain text, and a lot of possible customizations are left to
the final users, it is very common to encounter issues when data is imported from this
kind of files.
The purpose of this article is to show some problems you can encounter 
loading data in a SQLite database from DSV files and how to deal with them.

## Data interpretation issues

When dealing with DSV files, one of the most common sources of issues is the way
data is represented in the file. Since DSV files are plain text files, there are
many conventions that can be used to represent different data types (e.g., strings,
numbers, dates, missing data, etc.).

```{r setup}
library(RSQLite.toolkit)
```

Let's start with a simple example using real-world data. To retrieve some example
files, we can use the package [piggyback](https://github.com/ropensci/piggyback)
to download public datasets that have been stored in the 
[RSQLite.toolkit-tests](https://github.com/fab-algo/RSQLite.toolkit-tests)
GitHub repository[^1].

The first example will deal with a table contained in a "standard" CSV file
(i.e. a text file with the first line containing column names, using the
comma to separate fields' values and the dot as a decimal separator for 
numbers); we will load  the data in a new table inside a SQLite database
using the `dbTableFromDSV()` function.

First, we download the data from the repo:

```{r getdata}
library("piggyback", quietly = TRUE)
pb_download(file = "DOSE_V2.10.zip", dest = tempdir(),
            repo = "fab-algo/RSQLite.toolkit-tests", tag = "latest")

unzip(zipfile = file.path(tempdir(), "DOSE_V2.10.zip"), exdir = tempdir())
dir(file.path(tempdir(), "DOSE_V2.10"))

data_file <- file.path(tempdir(), "DOSE_V2.10/DOSE_V2.10.csv")
```

After unzipping the downloaded file, we get the `DOSE_V2.10.csv` file that
contains the dataset we are interested in.
We can now check the number of rows in the input file:

```{r inspect_file2}
n_rows <- length(count.fields(data_file))
n_rows
```

So we expect to have `r format(n_rows-1, nsmall=0)` records in the SQLite
table at the end of the import process.
Now we connect to a sample database (i.e. `tests.sqlite`): 

```{r dbcon}
dbcon <- dbConnect(RSQLite::SQLite(), file.path(tempdir(), "tests.sqlite"))
```
and move the dataset in a table there, calling the `dbTableFromDSV()` function 
specifying the connection to the database (i.e., `dbcon` parameter) and the 
table name (i.e., `table_name` parameter). We don't need to specify the parameters
for the CSV file structure (i.e., `header = TRUE`, `sep = ","`, `dec = "."`, 
`grp = ""`) since the default values will be fine.

```{r, label=loaddata1, error=TRUE}
## do not run: error
dbTableFromDSV(input_file = data_file, dbcon = dbcon, table_name = "DOSE")
```

To understand what happened we need to keep in mind that the `dbTableFromDSV()` 
function calls *`file_schema_dsv`* function, to guess the table structure stored
in the DSV file, and then it calls the *`scan`* function to actually read the
file content in a data frame and write it to the SQLite table.

Both these functions make some assumptions about how the data is stored in the DSV
file as "plain text". To that end, the `scan` function uses the parameters listed
in the following table along with their default values (we inclued also those of
the `read.table` function for comparison):

| parameters   | *scan*  | read.table |
|--------------|---------|------------|
| quote        | `"'\""` | `"\"'"`    |
| allowEscapes | `FALSE` | `FALSE`    |
| skipNul      | `FALSE` | `FALSE`    |
| strip.white  | `FALSE` | `FALSE`    |
| comment.char | `""`    | `"#"`      |
| na.strings   | `"NA"`  | `"NA"`     |
| fill         | `FALSE` | `FALSE`    |
| fileEncoding | `""`    | `""`       |

: `scan` and `read.table` parameters to handle data interpretation.

These parameters control the conventions used to interpret the text read 
from the file:

- *quoting*: convention used to include in strings "special" characters or to
   preserve the exact string without interpretation.
   Parameter: `quote`
- *escaping*: rule used to represent characters that would otherwise be interpreted 
   as delimiters or special commands.
   Parameter: `allowEscapes`
- *NUL characters*: how to manage ASCII NUL characters that could be found 
   inside strings.
   Parameter: `skipNul`
- *whitespace stripping*: whether unwanted spaces from the beginning, end, or middle of a 
   string should be removed. 
   Parameter: `strip.white`
- *commenting*: a character used to identify that one line (or the rest of a line after
   the "comment" identifier) does not contain data and should be ignored.
   Parameter: `comment.char`
- *missing data*: how missing data is represented in the file.
   Parameter: `na.strings`
- *missing fields*: how to manage rows with less fields then expected.
   Parameter: `fill`
- *encoding*: the technical standard used to represent characters in binary form, such as
   ASCII, UTF-8, EBCDIC.
   Parameter: `fileEncoding`

To understand how to use each of these parameters you are strongly encouraged to read 
the help pages of the `scan` function.

It should be noted that the default values are seldom—almost never—the right
ones and we should invest some time in better understanding the strategies used by the
data source. If we have a technical description of the assumptions used in 
creating the file, we can follow those guidelines, otherwise we have to inspect the 
file and determine them through direct examination.

To look into a DSV file (especially big files) it is better to use log file
explorers and avoid general purpose text editors. As an example, we can
use [klogg](https://klogg.filimonov.dev/) that is available for all major
operating systems.

If we open the file and inspect some records, we understand that:

- there are strings with comma and single quote inside, surrounded by 
  double quotes, so the file is using a specific quoting convention;
- missing data is represented by a "zero-lenght" string;
- many strings use special characters that need a specific encoding 
  system; in this file the UTF-8 encoding is used.

For the input file used in this example we need:

- to explicitly force the `quote` parameter to `"\""` (i.e., only the double quote 
  character) and the `na.strings` parameter to `""`;
- to turn off the commenting option: `comment.char = ""`;
- to tell the `base::scan()` function that the input file is encoded in UTF-8: 
  `fileEncoding = "UTF-8"`.

```{r schema}
f_schema1 <- file_schema_dsv(input_file = data_file,
                             quote = "\"", na.strings = "",
                             comment.char = "", fileEncoding = "UTF-8")
```

Now we can see that the guessed schema is correct. Let's look at the first few rows
of the file schema created using these parameters:

```{r schema_head}
f_schema1$schema[1:8, ]
```

The `file_schema_dsv()` function returns a list with many useful information about the
input file, including the `schema` data frame that describes the columns found in the
file (i.e. name, type, size, etc.). For more details about it, please refer to its
documentation. One point to note is that the `year` column has been modified to
`F_year` since `year` is a reserved keyword in SQLite. That's the default behavior of
the `dbTableFromDSV()` function when dealing with column names (see next section for
more details).

Now we can re-run the `dbTableFromDSV()` function specifying the correct parameters
to interpret the data in the input file:

```{r loaddata2}
dbTableFromDSV(input_file = data_file, dbcon = dbcon, table_name = "DOSE",
               drop_table = TRUE, quote = "\"", na.strings = "",
               comment.char = "", fileEncoding = "UTF-8")
```

As we can see now we got all the expected records in the SQLite table. We can
check that by listing the tables in the database, the fields in the `DOSE` table
and counting the number of records in it:

```{r check_db}
dbListTables(dbcon)

dbListFields(dbcon, "DOSE")[1:8]

dbGetQuery(dbcon, "SELECT COUNT(*) AS n_records FROM DOSE")
```

## Issues with column names

Unfortunately, when dealing with DSV files, problems are not limited to data interpretation
only. Another common source of issues are the column names used in the input file.

While DSV files may contain column headers that are valid in their original context
(such as Excel or other applications), these names might not conform to SQLite's naming
conventions or best practices.

SQLite has specific rules and restrictions regarding column names, including:

- Must begin with a letter (A-Z, a-z) or an underscore (_)
- Can contain letters, digits (0-9), and underscores (_)
- Cannot be a reserved keyword in SQLite
- Should avoid special characters and spaces for better compatibility

You can find more details about SQLite's naming conventions in the
[official documentation](https://www.sqlite.org/lang_keywords.html) and
[CREATE TABLE](https://www.sqlite.org/lang_createtable.html) page. There are
also some interesting discussions on Stack Overflow:

- [What are valid table names in SQLite?](https://stackoverflow.com/questions/3694276/what-are-valid-table-names-in-sqlite)
- [What SQLite column name can be/cannot be?](https://stackoverflow.com/questions/3373234/what-sqlite-column-name-can-be-cannot-be)
- [SQLite table and column name requirements](https://stackoverflow.com/questions/23770480/sqlite-table-and-column-name-requirements)

Some typical problems you might encounter include:

- *Special characters*: Column names containing spaces, hyphens, or other 
  special characters (e.g., `"Patient-ID"`, `"Test Result"`, `"Cost($)"`);
- *Reserved keywords*: Column names that match SQLite reserved words
  (e.g., `"SELECT"`, `"TABLE"`, `"WHERE"`);
- *Starting with numbers*: Column names beginning with digits 
  (e.g., `"2025_Sales"`);
- *Case sensitivity*: While SQLite is case-insensitive for column names,
  mixing cases can lead to confusion;
- *Empty or duplicate names*: Missing headers or columns with identical names.

You should always inspect the column names in your DSV files before importing them
into a SQLite database to ensure they meet these criteria. Failing to do so can lead
to errors during the import process or unexpected behavior when querying the database.

### How RSQLite.toolkit handles column names

  The `dbTableFromDSV()` function[^2] offers a built-in mechanism to handle problematic
column names when importing data from DSV files. It does so through the
`id_quote_method` parameter that can be set to one of the following values:

- `DB_NAMES` tries to build a valid SQLite column name:
    a. substituting all characters, that are not letters or digits or
       the `_` character, with the `_` character;
    b. prefixing `N_` to all strings starting with a digit;
    c. prefixing `F_` to all strings equal to any SQL92 keyword.
- `SINGLE_QUOTES` encloses each string in single quotes.
- `SQL_SERVER` encloses each string in square brackets.
- `MYSQL` encloses each string in back ticks.

The default value is `DB_NAMES`.

What actually happens is that to ensure valid
and unique column names, the `dbTableFromDSV()` function calls the
`file_schema_dsv()` function that in turn calls the
`format_column_names()` function, passing to it the column names read from the
DSV file, the selected `id_quote_method` and the `unique_names = TRUE`
parameter, to ensure that all column names are unique.

If you are not satisfied with the way column names are handled, you can always
manually specify the column names to be used in the SQLite table using the
`col_names` parameter of the `dbTableFromDSV()` function.

### Example

Let's create a simple example to demonstrate this behavior; we will use a
DSV file with problematic column names that we will download from the
[RSQLite.toolkit-tests](https://github.com/fab-algo/RSQLite.toolkit-tests)
repository.

```{r getdata2}
library("piggyback", quietly = TRUE)
pb_download(file = "Blockchain_Banking_Scopus_Dataset_2015_2025.zip",
            dest = tempdir(), repo = "fab-algo/RSQLite.toolkit-tests",
            tag = "latest")

unzip(zipfile = file.path(tempdir(),
                          "Blockchain_Banking_Scopus_Dataset_2015_2025.zip"),
      exdir = file.path(tempdir(), "Blockchain"))
dir(file.path(tempdir(), "Blockchain"))

data_file <- file.path(tempdir(),
                       "Blockchain/Blockchain_Banking_Scopus_Dataset_2015_2025.csv") # nolint
```

Now we can inspect the column names used in the input file:

```{r inspect_colnames}
f_schema2 <- file_schema_dsv(input_file = data_file,
                             quote = "\"", na.strings = "",
                             comment.char = "", fileEncoding = "UTF-8")

f_schema2$schema[1:10, ]
```

The `src_names` column contains the original column names found in the input file.
As we can see, there are some problematic column names, such as `Author(s) ID`,
`Document Title`, `Funding Text`, `Art. No.`, etc. 

The `col_names` column contains the transformed column names using the
`DB_NAMES` method (i.e., the default value for the `id_quote_method` parameter).
Comparing the two columns, we can see how some characters have been replaced with
the `_` character, spaces have been replaced with `_`, and some names have been
prefixed with `F_` to avoid conflicts with SQLite reserved keywords (e.g., `Year`).

Now let's see how the column names are handled when using the `SQL_SERVER` method: 

```{r inspect_colnames2}
f_schema2 <- file_schema_dsv(input_file = data_file,
                             quote = "\"", na.strings = "",
                             id_quote_method = "SQL_SERVER",
                             comment.char = "", fileEncoding = "UTF-8")

f_schema2$schema[1:10, ]
``` 

As we can see, now the column names have been enclosed in square brackets
(`[` and `]`) to make them valid SQLite identifiers. This quoting convention
allows the use of special characters and spaces in column names without causing
issues during database operations. We can test that by creating a new table
in the database using this quoting method:

```{r loaddata3}
dbTableFromDSV(input_file = data_file,
               dbcon = dbcon, table_name = "BLOCKCHAIN_BANKING",
               quote = "\"", na.strings = "",
               comment.char = "", fileEncoding = "UTF-8",
               id_quote_method = "SQL_SERVER",
               drop_table = TRUE)
```

```{r check_db2}
dbListTables(dbcon)

dbListFields(dbcon, "BLOCKCHAIN_BANKING")[1:8]

dbGetQuery(dbcon, "SELECT COUNT(*) AS n_records FROM BLOCKCHAIN_BANKING")
```

```{r close_db}

dbDisconnect(dbcon)
```

## Conclusions

To more safely handle DSV files (and avoid unexpected results) when
importing data in a SQLite database using the `dbTableFromDSV()` function, you
should always:

- inspect the input file to understand how data is represented; to that
  end, you can use the `file_schema_dsv()` function to help you in this task and test
  different parameter settings before importing the data;
- specify the correct parameters to interpret data in the input file (i.e. `quote`,
  `na.strings`, `comment.char`, `fileEncoding`, etc.);
- check the column names used in the input file and choose the right strategy
  to handle them, here the `id_quote_method` parameter of the `dbTableFromDSV()`
  function can be very useful.

[^1]: The original source of each dataset is described in the README page of the GitHub repo.

[^2]: The same mechanism to handle column names is used also by the `dbTableFromXlsx()`,
`dbTableFromFeather()` and `dbTableFromDataFrame()` functions.