---
title: "Atomic Vectors"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Atomic Vectors}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

Atomic vectors are the fundamental data structure in R. They include **numeric** (integer and double), **logical**, **character**, **complex**, and **raw** vectors. This vignette explains how `h5lite` maps these R types to HDF5 datasets and provides guidance on controlling storage types and compression.

```{r setup}
library(h5lite)
file <- tempfile(fileext = ".h5")
```

## Basic Usage

Writing a vector to HDF5 is straightforward using `h5_write()`. The package automatically creates the necessary dataset and handles dimensions.

```{r}
# Write a numeric vector
vec <- c(1.5, 2.3, 4.2, 5.1)
h5_write(vec, file, "data/numeric_vector")

# Read it back
res <- h5_read(file, "data/numeric_vector")
print(res)
```

## Scalars vs. 1D Arrays

In R, a "scalar" is simply a vector of length 1. HDF5, however, distinguishes between a **Scalar Dataspace** (a single value with no dimensions) and a **Simple Dataspace** (an array), which for a single value has dimensions `[1]`.

By default, `h5lite` treats length-1 vectors as 1D arrays to maintain consistency with R's vector behavior. To write a true HDF5 scalar, you must wrap the value in `I()`.

```{r}
# 1. Default: 1D Array (Length 1)
h5_write(42, file, "structure/array_1d")

# 2. Explicit Scalar: Wrapped in I()
h5_write(I(42), file, "structure/scalar")

h5_str(file, "structure")
```

*Note: When reading data back into R, both storage formats appear as standard R vectors of length 1.*
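
For readers unfamiliar with `I()`, it is a base R function that marks an object by prepending the class `"AsIs"` without changing its value. The sketch below shows only that base R behavior. That `h5lite` inspects this class to choose a scalar dataspace is an assumption here, not something this vignette confirms.

```r
x <- 42
wrapped <- I(x)

# A bare length-1 vector carries no dimensions
length(x)                   # 1
is.null(dim(x))             # TRUE

# I() adds the "AsIs" class but leaves the value untouched
inherits(wrapped, "AsIs")   # TRUE
unclass(wrapped) == x       # TRUE
```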

## Numeric and Logical Data

### Automatic Type Selection

By default (`as = "auto"`), `h5lite` maps R types to the most efficient HDF5 equivalents automatically.

1.  **Numeric:** `h5lite` analyzes the range of your data and picks the smallest fitting HDF5 type (e.g., `uint8`, `int16`, `int32`, `float64`).
2.  **Logical:** `h5lite` maps logical vectors to `uint8` (0 or 1) in HDF5 to save space.
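
To make the idea of a "smallest fitting type" concrete, here is a minimal base R sketch of a range-based rule. The helper `smallest_type()` is hypothetical and not part of `h5lite`. It assumes that unsigned types are preferred for non-negative whole numbers and that anything fractional, missing, or too wide falls back to `float64`. The package's actual selection logic may differ.

```r
# Hypothetical helper: NOT part of h5lite. It mimics "smallest fitting type"
# under the assumption that unsigned types are preferred for non-negative data.
smallest_type <- function(x) {
  x <- as.numeric(x)  # logicals become 0/1
  # Missing, fractional, or empty input falls back to float64
  if (length(x) == 0 || anyNA(x) || any(x != trunc(x))) return("float64")
  lo <- min(x)
  hi <- max(x)
  if (lo >= 0) {
    if (hi <= 255)        return("uint8")
    if (hi <= 65535)      return("uint16")
    if (hi <= 4294967295) return("uint32")
  } else {
    if (lo >= -128        && hi <= 127)        return("int8")
    if (lo >= -32768      && hi <= 32767)      return("int16")
    if (lo >= -2147483648 && hi <= 2147483647) return("int32")
  }
  "float64"  # anything wider falls back to double
}

smallest_type(c(1L, 2L, 3L))    # "uint8"
smallest_type(c(TRUE, FALSE))   # "uint8"
smallest_type(c(-5, 300))       # "int16"
smallest_type(c(0.5, 2))        # "float64"
```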

### Handling Missing Values (NA)

A key challenge in HDF5 is that standard integer and boolean types do not have a native representation for `NA` (missing values).

To avoid silently corrupting missing values, `h5lite` handles them as follows:

* If an integer or logical vector contains `NA`, it is **automatically promoted to `float64`**.
* The `NA` values are stored as a `NaN` variant in the file.
* When read back, `h5_read()` returns a `numeric` (double) vector with the `NA` values restored.

```{r}
# Integer vector with NO missing values -> Automatic optimal type (uint8)
h5_write(c(1L, 2L, 3L), file, "safe/ints")
h5_typeof(file, "safe/ints")

# Integer vector WITH missing values -> Promoted to float64
h5_write(c(1L, NA, 3L), file, "safe/ints_na")
h5_typeof(file, "safe/ints_na")
```
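
The promotion described above mirrors what happens in plain R when an integer vector is coerced to double. The base R sketch below involves no `h5lite` calls. It shows that the missing value survives as `NA_real_`, which R represents internally as a NaN bit pattern with a reserved payload yet still distinguishes from an ordinary `NaN`.

```r
ints <- c(1L, NA, 3L)
promoted <- as.double(ints)

typeof(ints)       # "integer"
typeof(promoted)   # "double"

# The missing value survives the conversion as NA_real_
is.na(promoted)    # FALSE  TRUE FALSE

# R still tells NA apart from an ordinary NaN
is.nan(promoted)   # FALSE FALSE FALSE
is.nan(NaN)        # TRUE
```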

### Forcing Specific Types

If you know your data range fits into a smaller type (e.g., `int8`, `uint16`), you can use the `as` argument to force a specific storage type.

*Warning: If you force an integer type on data that contains `NA` or values outside that type's range, `h5lite` will throw an error.*

```{r}
# Store small integers as 8-bit signed integers
h5_write(c(10, -5, 100), file, "small_ints", as = "int8")

# Store logicals as 8-bit unsigned integers
h5_write(c(TRUE, FALSE), file, "bools", as = "uint8")
```
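
Since a forced type errors on out-of-range or missing values, it can help to check the data first. The following is a minimal base R sketch. The helpers `int_range()` and `fits_int_type()` are hypothetical and not part of `h5lite`; they simply apply the standard two's-complement and unsigned ranges for a given bit width.

```r
# Hypothetical helpers: NOT part of h5lite
int_range <- function(bits, signed = TRUE) {
  if (signed) c(-2^(bits - 1), 2^(bits - 1) - 1) else c(0, 2^bits - 1)
}

fits_int_type <- function(x, bits, signed = TRUE) {
  rng <- int_range(bits, signed)
  # Must have no NA, be whole-numbered, and lie inside the range
  !anyNA(x) && all(x == trunc(x)) && all(x >= rng[1] & x <= rng[2])
}

int_range(8)                              # -128  127
fits_int_type(c(10, -5, 100), bits = 8)   # TRUE
fits_int_type(c(10, -5, 200), bits = 8)   # FALSE: 200 exceeds 127
fits_int_type(c(10, NA, 100), bits = 8)   # FALSE: NA has no integer encoding
```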

## Character Vectors (Strings)

HDF5 supports two primary methods for storing strings: **Variable-Length** and **Fixed-Length**.

### Automatic Type Selection

By default (`as = "auto"`), `h5lite` chooses the most efficient string representation:

* If the vector contains `NA`, it uses **Variable-Length UTF-8** (which natively supports missing values).
* If there are no missing values and the strings are relatively short and consistent in length, it uses **Fixed-Length UTF-8** to allow for compression and faster access.

### Variable-Length

You can explicitly request variable-length storage using `as = "utf8"` or `as = "ascii"`.

* **Pros:** Most flexible; each string uses only the space it needs; supports `NA` (stored as NULL pointers).
* **Cons:** Cannot be compressed using standard HDF5 filters; slower to read and write for very large datasets.

```{r}
# Variable-length strings: under the default as = "auto", the NA selects variable-length UTF-8
h5_write(c("apple", "banana", NA), file, "strings/var")
```

### Fixed-Length

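You can force fixed-length storage by appending `[n]` to the encoding, as in `as = "ascii[10]"`, where `n` is the number of bytes reserved per string. Empty brackets (`[]`) size the field to the longest string in the vector.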

* **Pros:** Fast; allows compression.
* **Cons:** Truncates strings longer than `n` bytes; pads shorter strings; **does not support `NA`**.

```{r}
# Fixed length strings (10 bytes per string)
h5_write(c("A", "B", "C"), file, "strings/fixed", as = "ascii[10]")

# Auto-detect max length (converts to fixed length based on longest string)
h5_write(c("short", "longer", "longest"), file, "strings/auto_fixed", as = "ascii[]")
```
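
Because `n` counts bytes rather than characters, non-ASCII text can need more room than its visible length suggests. The base R sketch below makes no `h5lite` calls. It compares character and byte counts for UTF-8 strings and derives a width that avoids truncation. Whether a particular fixed-length encoding accepts non-ASCII input is not covered in this vignette, so treat the sketch as a sizing aid only.

```r
# \u escapes keep the source ASCII-only; enc2utf8() ensures UTF-8 encoding
x <- enc2utf8(c("cafe", "caf\u00e9", "na\u00efve"))

nchar(x, type = "chars")   # 4 4 5
nchar(x, type = "bytes")   # 4 5 6

# Width in bytes needed so that no string is truncated
n <- max(nchar(x, type = "bytes"))
n                          # 6
```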

## Compression

Compression in HDF5 requires the dataset to be "chunked". `h5lite` handles chunking parameters automatically when you enable compression.

You can configure compression using the `compress` argument:

* `"gzip-5"` (default): Standard zlib compression at level 5. Levels `"gzip-1"` through `"gzip-9"` are also supported. Safe and universally compatible.
* `"szip-nn"`: Szip with Nearest Neighbor coding. Best for continuous, correlated, or floating-point data (e.g., time series or smooth gradients).
* `"szip-ec"`: Szip with Entropy Coding. Best for uncorrelated, discrete, or categorical integer data.
* `"none"`: Disables compression entirely.

```{r}
# Write a repetitive vector with maximum gzip compression (level 9)
x <- rep(rnorm(100), 100)
h5_write(x, file, "compressed_data", compress = "gzip-9")

# Write a smooth, correlated dataset using szip Nearest Neighbor
smooth_data <- sin(seq(0, 10, length.out = 1000))
h5_write(smooth_data, file, "szip_data", compress = "szip-nn")
```
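
To build intuition for which data benefits from compression, here is a rough illustration using base R's `memCompress()`, which applies zlib to raw bytes in memory. It is not the HDF5 filter pipeline and ignores chunking, so the ratios only indicate the general effect and will not match the sizes `h5lite` produces.

```r
set.seed(1)
repetitive <- rep(rnorm(100), 100)   # the same 100 values, repeated
noisy      <- rnorm(10000)           # no repeated structure

# Serialize to raw bytes, then compress with zlib
raw_rep   <- writeBin(repetitive, raw())
raw_noisy <- writeBin(noisy, raw())
z_rep     <- memCompress(raw_rep, type = "gzip")
z_noisy   <- memCompress(raw_noisy, type = "gzip")

length(raw_rep)                       # 80000 bytes before compression
length(z_rep) / length(raw_rep)       # small ratio: repetition compresses well
length(z_noisy) / length(raw_noisy)   # much larger ratio: random doubles compress poorly
```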

## 64-bit Integers

R does not natively support 64-bit integers, but the `bit64` package provides an `integer64` class. `h5lite` reads and writes these values directly as HDF5 `int64`.

```{r}
if (requireNamespace("bit64", quietly = TRUE)) {
  val <- bit64::as.integer64(c("9223372036854775807", "-9223372036854775807"))
  
  h5_write(val, file, "huge_ints")
  h5_typeof(file, "huge_ints")
  
  in_val <- h5_read(file, "huge_ints", as = "bit64")
  print(class(in_val))
}
```
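
The reason a dedicated class is needed shows up in base R's own numeric limits. The sketch below uses only base R. It shows the 32-bit ceiling of native integers and the point at which doubles stop representing every whole number exactly.

```r
# Largest value a native (32-bit) R integer can hold
.Machine$integer.max                          # 2147483647

# Doubles represent every integer exactly only up to 2^53
2^53 == 2^53 + 1                              # TRUE: the +1 is lost to rounding

# Overflowing a native integer yields NA (with a warning)
suppressWarnings(.Machine$integer.max + 1L)   # NA
```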

```{r, include=FALSE}
unlink(file)
```
