---
title: "Data Types & Compression"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Types & Compression}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

`h5lite` is designed to seamlessly map R's diverse data structures to HDF5's portable format. This vignette explains the supported R data types, how `h5lite` writes them to HDF5, and how you can precisely control data types and compression when needed.

```{r setup}
library(h5lite)
file <- tempfile(fileext = ".h5")
```

## Supported Data Types

`h5lite` supports reading and writing a wide range of R data types. The table below lists the default mapping when writing to HDF5.

| R Data Type    | HDF5 Equivalent  | Description                                    |
| :------------- | :--------------- | :--------------------------------------------- |
| **Numeric**    | *variable*       | Selects optimal type: `uint8`, `float32`, etc. |
| **Logical**    | `H5T_STD_U8LE`   | Stored as 0 (FALSE) or 1 (TRUE) (`uint8`).     |
| **Character**  | `H5T_STRING`     | Variable or fixed-length UTF-8 strings.        |
| **Complex**    | `H5T_COMPLEX`    | Native HDF5 2.0+ complex numbers.              |
| **Raw**        | `H5T_OPAQUE`     | Raw bytes / binary data.                       |
| **Factor**     | `H5T_ENUM`       | Integer indices with label mapping.            |
| **integer64**  | `H5T_STD_I64LE`  | 64-bit signed integers via `bit64` package.    |
| **POSIXt**     | `H5T_STRING`     | ISO 8601 string (`YYYY-MM-DDTHH:MM:SSZ`).      |
| **List**       | `H5O_TYPE_GROUP` | Recursive container structure.                 |
| **Data Frame** | `H5T_COMPOUND`   | Table of mixed types.                          |
| **NULL**       | `H5S_NULL`       | Creates a placeholder.                         |

## Dimensions: Scalars, Vectors, and Arrays

Atomic data types (Integer, integer64, Double, Logical, Character, Complex, Raw, and POSIXt) can be written to HDF5 as scalars, 1D vectors, or N-dimensional arrays.

* **Scalars:** To write a single value as a true HDF5 scalar (0 dimensions), you must wrap the value in `I()`.
* **Vectors:** Standard R vectors are written as 1D arrays (Simple Dataspace with rank 1).
* **Arrays/Matrices:** R objects with `dim` attributes are written as N-dimensional datasets, preserving their shape.

```{r}
# 1. Scalar (0 dims)
h5_write(I(42), file, "structure/scalar")

# 2. Vector (1 dim)
h5_write(c(1, 2, 3), file, "structure/vector")

# 3. Matrix (2 dims)
h5_write(matrix(1:9, 3, 3), file, "structure/matrix")
```
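
The three cases differ only in the attributes of the R object, which can be checked in base R before writing. This sketch calls no `h5lite` functions; that `h5lite` relies on the `AsIs` class and the `dim` attribute in exactly this way is an assumption drawn from the description above.

```r
scalar <- I(42)               # I() adds the "AsIs" class; the value is unchanged
vec    <- c(1, 2, 3)          # plain vector: no dim attribute
mat    <- matrix(1:9, 3, 3)   # matrix: dim attribute of length 2

inherits(scalar, "AsIs")      # TRUE
is.null(dim(vec))             # TRUE
dim(mat)                      # 3 3
```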

*For more complex dimensional structures, refer to `vignette('matrices')`.*

## Numeric Data

R uses 32-bit integers and 64-bit doubles. When writing with `as = "auto"`, `h5lite` analyzes the range of your data to select the most compact HDF5 type.

* **Default:** Selects the most compact type that can hold the observed range of values.
* **With NA:** `float64` (`H5T_IEEE_F64LE`)
* **Fractional Values:** Double-precision vectors with fractional values default to `float64`.
* **Coercion:** You can override this using `int[8|16|32|64]`, `uint[8|16|32|64]`, `float[16|32|64]`, or `bfloat16`.

```{r}
# Integers between 0 and 255 (uint8)
h5_write(c(1L, 2L, 3L), file, "integers/small")

# Integers with NA -> float64
h5_write(c(1L, NA, 3L), file, "integers/with_na")

# Force larger type (int16)
h5_write(1:100, file, "integers/short", as = "int16")
```
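
The idea of range-based selection can be sketched in base R. The helper `guess_type()` below is hypothetical and is not part of `h5lite`; the thresholds are the standard integer limits, and the real selection logic may differ (for example in how it treats 64-bit ranges).

```r
# Hypothetical helper for illustration only; not h5lite's implementation.
guess_type <- function(x) {
  # Empty input, NA/NaN/Inf, or fractional values cannot use an integer type.
  if (length(x) == 0 || !all(is.finite(x)) || any(x != trunc(x))) {
    return("float64")
  }
  lo <- min(x)
  hi <- max(x)
  if (lo >= 0) {
    if (hi <= 255)        return("uint8")
    if (hi <= 65535)      return("uint16")
    if (hi <= 4294967295) return("uint32")
  } else {
    if (lo >= -128        && hi <= 127)        return("int8")
    if (lo >= -32768      && hi <= 32767)      return("int16")
    if (lo >= -2147483648 && hi <= 2147483647) return("int32")
  }
  "float64"  # wider ranges are simply left as float64 in this sketch
}

guess_type(c(1L, 2L, 3L))   # "uint8"
guess_type(c(1L, NA, 3L))   # "float64"
guess_type(c(-5, 300))      # "int16"
guess_type(c(0.5, 2))       # "float64"
```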

## 64-bit Integers (`integer64`)

* **Default:** `int64` (`H5T_STD_I64LE`)
* **Coercion:** none

R does not natively support 64-bit integers, but `h5lite` supports reading and writing them via the `bit64` package.

```{r}
if (requireNamespace("bit64", quietly = TRUE)) {
  val <- bit64::as.integer64(c("9223372036854775807", "-9223372036854775807"))
  h5_write(val, file, "integers/int64")
}
```

## Double (Numeric) Data

R's default numeric type is double-precision.

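* **Default:** `float64` (`H5T_IEEE_F64LE`) for values with a fractional part or `NA`; whole-valued doubles follow the range-based selection described under Numeric Data.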
* **Coercion:** `int[8|16|32|64]`, `uint[8|16|32|64]`, `float[16|32|64]`, or `bfloat16`

```{r}
data <- rnorm(10)

# Default (float64)
h5_write(data, file, "doubles/default")

# Single precision (float32): 4 bytes per value instead of 8
h5_write(data, file, "doubles/float32", as = "float32")
```
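
The cost of `float32` is reduced precision. The effect can be previewed in base R, without `h5lite`, by round-tripping values through 4-byte floats with `writeBin()` and `readBin()`. This is only an emulation of the narrowing; it assumes the conversion on write behaves like a standard IEEE 754 cast.

```r
x <- c(0.1, 1/3, 123456.789)

# Write as 4-byte floats, then read them back as doubles.
bytes <- writeBin(x, raw(), size = 4)
y <- readBin(bytes, what = "double", n = length(x), size = 4)

length(bytes)          # 12 bytes: 4 per value instead of 8
max(abs(y - x) / x)    # small but non-zero: float32 keeps about 7 significant digits
```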

## Logical Data

* **Default:** `uint8` (`H5T_STD_U8LE`)
* **With NA:** `float64` (`H5T_IEEE_F64LE`)
* **Coercion:** `int[8|16|32|64]`, `uint[8|16|32|64]`, `float[16|32|64]`, or `bfloat16`

```{r}
bools <- sample(c(TRUE, FALSE), 1000, replace = TRUE)

h5_write(bools, file, "logicals/packed")
```

## Character Data

HDF5 supports two methods for storing strings. By default (`as = "auto"`), `h5lite` chooses the best approach:

* **Variable-Length:** Used if the vector contains `NA` or if string lengths are highly inconsistent.
* **Fixed-Length:** Used for short, consistent strings without `NA` to allow for compression.

### Variable-Length

Explicitly requested with `as = "utf8"` or `as = "ascii"`.

* Compressible: **NO**
* Handles `NA`: **YES**

```r
# UTF-8 variable length
h5_write(c("apple", "banana", NA), file, "strings/var_utf8")

# ASCII variable length
h5_write(c("A", "B", "C"), file, "strings/var_ascii", as = "ascii")
```

### Fixed-Length

Use `as = "ascii[10]"` / `as = "utf8[10]"` (explicit size=10) or `as = "ascii[]"` / `as = "utf8[]"` (auto-detect max length).

* Compressible: **YES**
* Handles `NA`: **NO**

```{r}
# UTF-8 fixed length, width auto-detected from the longest string
h5_write(c("apple", "banana"), file, "strings/fixed_utf8", as = "utf8[]")

# ASCII fixed length (1 byte)
h5_write(c("A", "B", "C"), file, "strings/fixed_ascii", as = "ascii[1]")
```
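
HDF5 measures fixed-length string sizes in bytes, not characters, so non-ASCII text needs more room than its character count suggests. The base R sketch below shows how to check the byte width of a vector before choosing an explicit size; that `h5lite` computes the auto-detected width the same way is an assumption.

```r
x <- enc2utf8(c("apple", "banana", "caf\u00e9"))

nchar(x, type = "chars")        # 5 6 4
nchar(x, type = "bytes")        # 5 6 5  (the accented character takes two bytes)

# Smallest byte width that holds every element
max(nchar(x, type = "bytes"))   # 6
```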

> **Technical Note:** `h5lite` uses `H5T_C_S1` for all strings, and `H5T_STR_NULLTERM` padding for all fixed-length strings.


## Dates and Times (`POSIXt`)

R date-time objects (`POSIXct` / `POSIXlt`) are stored as **Strings** in ISO 8601 format (`YYYY-MM-DDTHH:MM:SSZ`). This keeps the values portable to other languages and HDF5 tools that do not share R's epoch-based numeric representation (seconds since 1970-01-01 UTC).

```{r}
now <- Sys.time()
h5_write(now, file, "datetime/iso8601")
```
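
The string format itself can be reproduced in base R, which is useful when parsing the values in another script. This sketch assumes the stored strings are in UTC, as the trailing `Z` suggests; how `h5lite` handles time zones or fractional seconds is not covered here.

```r
stamp <- as.POSIXct("2024-01-15 12:30:00", tz = "UTC")

# Format as ISO 8601 with a literal "T" separator and "Z" suffix.
iso <- format(stamp, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
iso   # "2024-01-15T12:30:00Z"

# Parse the string back into a POSIXct value.
back <- as.POSIXct(iso, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
as.numeric(back) == as.numeric(stamp)   # TRUE
```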

## Complex Data

R complex numbers are written using the new complex floating-point type introduced in HDF5 2.0.0 (`H5T_COMPLEX_IEEE_F64LE`).

**Compatibility Warning:** This data type for complex numbers is a feature specific to HDF5 version 2.0+. Datasets written with this type generally cannot be read by HDF5 readers built against older versions of the library (e.g., HDF5 1.10 or 1.12). Ensure that any downstream tools or libraries used to read these files are updated to support HDF5 2.0 standards.

```{r}
comp <- c(1+2i, 3+4i)
h5_write(comp, file, "complex_data")
```

## Raw Data

Raw vectors (bytes) are stored as HDF5 `OPAQUE` types. This is ideal for storing binary blobs, images, or serialized objects where you need to preserve the exact byte sequence without interpretation.

```{r}
raw_vec <- as.raw(c(0x01, 0xFF, 0x1A))
h5_write(raw_vec, file, "binary_blob")
```
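
For the serialized-object use case, base R's `serialize()` produces a raw vector that could be written in the same way as `raw_vec` above. The sketch covers only the base R half of that round trip and makes no claim about how `h5lite` reads the bytes back.

```r
obj  <- list(id = 7L, tags = c("a", "b"))

# connection = NULL makes serialize() return the bytes as a raw vector.
blob <- serialize(obj, connection = NULL)
is.raw(blob)                        # TRUE

# The original object is recovered exactly from the bytes.
identical(unserialize(blob), obj)   # TRUE
```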

## Factors

R factors are stored as HDF5 `ENUM` types. The mapping from integer codes to factor levels (labels) is stored once, as part of the dataset's datatype definition, so the labels are preserved without duplicating string data for every element.

```{r}
fac <- factor(c("low", "high", "medium", "low"))
h5_write(fac, file, "categorical")
```
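
The two pieces an `ENUM` needs, the label set and the per-element codes, are already present in the R factor and can be inspected directly. Whether the stored enum values match R's 1-based codes is an assumption not confirmed here.

```r
fac <- factor(c("low", "high", "medium", "low"))

levels(fac)       # "high" "low" "medium"  (sorted alphabetically by default)
as.integer(fac)   # 2 1 3 2

# Set the level order explicitly when it carries meaning.
ordered_fac <- factor(c("low", "high", "medium", "low"),
                      levels = c("low", "medium", "high"))
as.integer(ordered_fac)   # 1 3 2 1
```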

## Lists

R lists are mapped to HDF5 **Groups**. Since lists are recursive containers, `h5lite` walks the list and creates a dataset (or subgroup) for every element found. You can use `as = c("element_name" = "skip")` to exclude specific items.

```{r}
my_list <- list(data = 1:100, meta = list(valid = TRUE))
h5_write(my_list, file, "types/list")
```
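
The recursive walk can be pictured with a small base R function that lists the paths such a traversal would visit. `list_paths()` is a hypothetical helper, not an `h5lite` function, and it only handles named elements; how `h5lite` names unnamed elements is not addressed.

```r
# Hypothetical helper for illustration only; not part of h5lite.
list_paths <- function(x, prefix) {
  out <- character(0)
  for (name in names(x)) {
    path <- paste0(prefix, "/", name)
    item <- x[[name]]
    if (is.list(item) && !is.data.frame(item)) {
      out <- c(out, list_paths(item, path))   # nested list: descend into a subgroup
    } else {
      out <- c(out, path)                     # anything else: one dataset
    }
  }
  out
}

my_list <- list(data = 1:100, meta = list(valid = TRUE))
list_paths(my_list, "types/list")
# "types/list/data" "types/list/meta/valid"
```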

## Data Frames

Data frames are stored as HDF5 **Compound** types (tables). Each row is stored as a single record, so the values of a row are kept together in the file. You can use the `as` argument to specify the type of individual columns.

*For a comprehensive guide, see `vignette('data-frames')`.*

```{r}
df <- data.frame(
  id = 1:5,
  score = c(10.5, 20.2, 15.0, 9.8, 30.1)
)

# 1. 'id' coerced to uint16
# 2. 'score' coerced to float32
h5_write(df, file, "types/dataframe", as = c(
  "id"    = "uint16",
  "score" = "float32"
))
```

## NULL

The `NULL` object in R is mapped to a dataset with a **NULL Dataspace** (`H5S_NULL`). This creates a dataset that exists in the file structure but contains no data elements, so no storage is allocated for data.

```{r}
h5_write(NULL, file, "placeholders/empty_slot")
```

## Compression

HDF5 supports transparent data compression using the zlib (gzip) and szip algorithms. You can control the compression behavior using the `compress` argument.

* **`"gzip-5"`** (default): Standard zlib compression at level 5. Levels `"gzip-1"` through `"gzip-9"` are also supported. Safe and universally compatible.
* **`"szip-nn"`**: Szip with Nearest Neighbor coding. Best for continuous, correlated, or floating-point data (e.g., time series or smooth gradients).
* **`"szip-ec"`**: Szip with Entropy Coding. Best for uncorrelated, discrete, or categorical integer data.
* **`"none"`**: Disables compression entirely.

```{r}
# Maximum zlib compression
h5_write(rnorm(1000), file, "data/max", compress = "gzip-9")

# Szip Entropy Coding for discrete integer data
h5_write(sample(1:5, 1000, replace = TRUE), file, "data/szip", compress = "szip-ec")
```

### The Shuffle Filter

When `gzip` compression is enabled, `h5lite` automatically applies the HDF5 **Byte Shuffle Filter** before the data is compressed. The Shuffle Filter does not compress data itself; rather, it rearranges the byte stream to make it more compressible by zlib.

It works by separating the bytes of each value by their significance. For example, in a 4-byte integer array:

1.  All the 1st bytes (least significant) are grouped together.
2.  All the 2nd bytes are grouped together.
3.  And so on.

**Why this helps:**

* **Integers:** Small integers leave their high-order bytes at zero. The shuffle filter groups these zeros into long runs, which zlib compresses extremely efficiently. This allows `int32` data to compress nearly as well as `int8` data if the values are small.
* **Doubles:** Floating-point numbers in a similar range share the same sign and exponent bits, so their high-order bytes are often identical. The shuffle filter groups these identical bytes, creating repetitive patterns that zlib can compress.
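
The byte regrouping described above can be imitated in base R to see its effect on zlib. This is a sketch only: it uses `memCompress()` rather than the HDF5 filter pipeline, ignores chunking, and assumes little-endian 4-byte integers.

```r
set.seed(1)
x <- sample(0:200, 10000, replace = TRUE)   # small values stored in 4-byte integers
bytes <- writeBin(x, raw(), size = 4, endian = "little")

# Shuffle: all 1st bytes first, then all 2nd bytes, and so on.
idx <- as.vector(t(matrix(seq_along(bytes), nrow = 4)))
shuffled <- bytes[idx]

# Unshuffle to confirm that the rearrangement loses nothing.
restored <- raw(length(bytes))
restored[idx] <- shuffled
identical(restored, bytes)                     # TRUE

# Compare compressed sizes; the shuffled stream should be smaller for this data.
length(memCompress(bytes, type = "gzip"))
length(memCompress(shuffled, type = "gzip"))
```

After shuffling, the last three quarters of the stream are all zero bytes, which is the long run that zlib exploits.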

```{r, include = FALSE}
unlink(file)
```
