---
title: "Chapter A10: Accelerated EnvelopeBuild Implementation using OpenCL"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter A10: Accelerated EnvelopeBuild Implementation using OpenCL}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(glmbayes)
```

# Chapter A10: Accelerated EnvelopeBuild Implementation using OpenCL

## 1. Introduction

The `glmbayes` package constructs envelopes for accept–reject sampling by evaluating the negative log posterior (f2) and its gradient (f3) at each point of a **tangency grid**. The grid size grows with dimension (e.g., \(3^\ell\) points for \(\ell\) coefficients), and each grid point can be evaluated independently. These evaluations are **embarrassingly parallel** and are a natural target for GPU acceleration.

The optional **OpenCL** implementation accelerates these evaluations by running them on a GPU (or OpenCL-capable CPU). OpenCL is vendor-neutral and works across NVIDIA, AMD, and Intel hardware. When OpenCL is available and `use_opencl = TRUE`, the package uses GPU-accelerated evaluation inside `EnvelopeEval`, which is called by `EnvelopeBuild` and `EnvelopeDispersionBuild`. If OpenCL is not available or `use_opencl = FALSE`, the CPU path (`f2_f3_non_opencl`) is used instead. The package runs correctly in either case.

This chapter describes the OpenCL implementation: the two-layer design (kernel wrapper and kernel runner), how the OpenCL program is assembled from ported nmath/rmath functions, and how the pilot and safeguards integrate with the envelope build. For installation and availability checks, see Chapter 12.


## 2. Architecture Overview

The OpenCL path consists of two layers:

| Layer | Function | Location | Role |
|-------|----------|----------|------|
| **Wrapper** | `f2_f3_opencl` | `kernel_wrappers.cpp` | Flatten R inputs, select kernel, load sources, call runner, reshape outputs |
| **Runner** | `f2_f3_kernel_runner` | `kernel_runners.cpp` | Platform/device setup, buffer creation, kernel launch, readback |

### 2.1 Call Path

```
EnvelopeEval (EnvelopeEval.cpp)
    |-- if use_opencl && pilot/check passes:
    |       f2_f3_opencl() -> f2_f3_kernel_runner()
    \-- else:
            f2_f3_non_opencl() -> CPU famfuncs (f2_f3_binomial_logit, etc.)
```

`EnvelopeEval` receives the grid `G4` (coefficients at each tangency point), the design matrix, prior parameters, and family/link. It chooses the GPU or CPU path based on `use_opencl` and the outcome of the pilot (when applicable). The GPU path calls `f2_f3_opencl`, which assembles the OpenCL program and invokes `f2_f3_kernel_runner`; the CPU path calls `f2_f3_non_opencl`, which dispatches to the appropriate C++ family function.


## 3. Program Construction

OpenCL kernels are not stored as single monolithic files. Instead, the package builds the program by **concatenating** several source components in a fixed order. This mirrors a C/C++ build where headers and libraries are included before the main source.

### 3.1 Assembly Order

The full program string is:

```
all_src = OPENCL.cl
        + rmath (load_kernel_library("rmath"))
        + dpq   (load_kernel_library("dpq"))
        + nmath (load_kernel_library("nmath"))
        + kernel file (load_kernel_source, e.g. "src/f2_f3_binomial_logit.cl")
```

1. **OPENCL.cl** – Global configuration: extensions (`cl_khr_fp64`, `cl_khr_printf`), IEEE constants (`ML_NAN`, `ML_POSINF`), feature detection for `expm1`/`log1p`, and utility macros.
2. **rmath** – Mathematical constants (M_E, M_PI, etc.) and distribution function declarations.
3. **dpq** – R-style density/CDF macros (`R_D__0`, `R_DT_val`, etc.) used in give_log/lower_tail logic.
4. **nmath** – Ported numerical routines (bd0, stirlerr, lgamma, dbinom, dpois, pnorm, etc.).
5. **Kernel file** – The model-specific kernel (e.g., `f2_f3_binomial_logit.cl`) that computes f2 and f3 for each grid point.

### 3.2 load_kernel_source and load_kernel_library

- **`load_kernel_source(relative_path)`** – Loads a single `.cl` file from `inst/cl/` via `system.file("cl", relative_path, package = "glmbayes")`.
- **`load_kernel_library(subdir)`** – Loads all `.cl` files in a subdirectory (e.g., `"nmath"`, `"rmath"`, `"dpq"`), parses `@provides` and `@depends` annotations, performs a **dependency-aware topological sort**, and concatenates the files so that dependents appear after their dependencies.

The `@provides` and `@depends` tags allow modular, maintainable kernel code: each file declares what it provides and what it needs. See `?load_kernel_library` for details.


## 4. Ported Math Libraries (nmath / rmath / dpq)

The likelihood and gradient computations require statistical functions that match R's behavior. Because OpenCL C does not include these, the package **ports** a core set from R's nmath and rmath libraries to OpenCL C.

### 4.1 nmath

| Module | Provides | Purpose |
|--------|----------|---------|
| nmath.cl | ML_*, ME_*, ISNAN, R_FINITE, ML_ERROR, etc. | Constants, error handling, validation macros |
| log1p.cl | Rlog1p | Log(1+x) for numerical stability |
| expm1.cl | expm1 | exp(x)-1 for numerical stability |
| bd0.cl | bd0, ebd0, etc. | Poisson/binomial deviance terms |
| stirlerr.cl | stirlerr | Stirling error for factorial terms |
| lgamma.cl | lgammafn | Log-gamma |
| gamma.cl | gammafn | Gamma function |
| dbinom.cl | dbinom, dbinom_raw | Binomial density |
| dpois.cl | dpois, dpois_raw | Poisson density |
| dgamma.cl | dgamma | Gamma density |
| dnorm.cl | dnorm4 | Normal density |
| pnorm.cl | pnorm5, pnorm_both | Normal CDF (probit) |
| pgamma.cl | pgamma | Gamma CDF (inverse link) |
| d1mach.cl | d1mach | Machine constants |
| chebyshev.cl | chebyshev_eval | Polynomial evaluation |

### 4.2 rmath

`Rmath.cl` provides additional constants (M_E, M_PI, M_LN2, etc.) and distribution functions consistent with R's `Rmath` library.

### 4.3 dpq

`dpq.cl` and `dpq_prelude.cl` provide macros for density and CDF handling with `give_log` and `lower_tail`, matching R's DPQ (density, probability, quantile) conventions.

These ports ensure that the OpenCL kernels produce results **numerically consistent** with the CPU path and with R itself.


## 5. Family/Link Kernels

Each supported family/link has a dedicated kernel file that implements the f2/f3 logic for that model.

| Family | Link | Kernel file |
|--------|------|-------------|
| binomial | logit | `f2_f3_binomial_logit.cl` |
| binomial | probit | `f2_f3_binomial_probit.cl` |
| binomial | cloglog | `f2_f3_binomial_cloglog.cl` |
| poisson | log | `f2_f3_poisson.cl` |
| Gamma | inverse | `f2_f3_gamma.cl` |
| gaussian | identity | `f2_f3_gaussian.cl` |

### 5.1 Kernel Structure

Each kernel follows the same pattern:

1. **Work-item mapping** – `int j = get_global_id(0)`; one work item per grid point \(j\).
2. **Prior term** – Compute \(P(\beta_j - \mu)\) and the quadratic form \(\tfrac{1}{2}(\beta_j - \mu)' P (\beta_j - \mu)\), accumulate into `qf[j]`.
3. **Prior gradient** – Initialize local gradient with \(P(\beta_j - \mu)\).
4. **Data loop** – For each observation \(i\), compute linear predictor, link, likelihood contribution, and gradient contribution. Use ported functions (`dbinom_raw`, `dpois_raw`, `lgamma`, `pnorm5`, etc.) as needed.
5. **Write outputs** – `qf[j]` = negative log posterior; `grad[k*m1 + j]` = gradient (column-major layout).

The kernels use a fixed `MAX_L2` for local arrays (e.g., 64); this limits the number of coefficients when using OpenCL. See the source for current limits.


## 6. f2_f3_opencl Flow

The wrapper `f2_f3_opencl` in `kernel_wrappers.cpp` performs:

1. **Input flattening** – Convert R matrices/vectors (x, b, mu, P, alpha, y, wt) to contiguous C++ vectors in the layout expected by the runner.
2. **Kernel selection** – Map `(family, link)` to `kernel_name` and `kernel_file` (e.g., binomial + logit → `f2_f3_binomial_logit`, `src/f2_f3_binomial_logit.cl`).
3. **Program assembly** – Load OPENCL.cl, rmath, nmath, dpq, and the kernel file; concatenate into `all_src`.
4. **Runner call** – Invoke `f2_f3_kernel_runner(all_src, kernel_name, l1, l2, m1, ...)` with flattened inputs and output buffers.
5. **Output reshaping** – Copy `qf_flat` into an R vector; wrap `grad_flat` as an Armadillo matrix (m1 × l2) for return.

The returned list has components `qf` (negative log posterior per grid point) and `grad` (gradient matrix).


## 7. f2_f3_kernel_runner

The runner in `kernel_runners.cpp` handles the low-level OpenCL API:

1. **Platform and device** – `clGetPlatformIDs`, `clGetDeviceIDs` (CL_DEVICE_TYPE_DEFAULT).
2. **Context and queue** – `clCreateContext`, `clCreateCommandQueueWithProperties`.
3. **Program** – `clCreateProgramWithSource` with the concatenated string, `clBuildProgram`.
4. **Kernel** – `clCreateKernel` with the kernel name.
5. **Buffers** – Create read-only buffers for X, B, mu, P, alpha, y, wt; write-only for qf and grad. Copy host data for inputs.
6. **Launch** – `clEnqueueNDRangeKernel` with `global = m1` (one work item per grid point).
7. **Readback** – `clEnqueueReadBuffer` for qf and grad.
8. **Sanity check** – If both outputs are all zeros, throw (likely kernel failure).
9. **Cleanup** – Release buffers, kernel, program, queue, context.

The runner is GLM-specific and lives in the `glmbayes::opencl` namespace; the general OpenCL utilities (kernel loading, device enumeration) are in `openclPort`.


## 8. Pilot and Safeguards

For large grids (e.g., \(m_1 > 50{,}000\)), `EnvelopeEval` runs `f2_f3_opencl_pilot` before the full evaluation to estimate runtime.

### 8.1 Pilot Logic

- **Warm-up** – Run `f2_f3_opencl` on a small grid slice.
- **Calibration** – Run on slices of ~1% and ~2% of the grid to estimate fixed cost and per-grid cost.
- **Refined estimate** – Compute `refined_est_total_sec = per_grid_sec × m1`.
- **5-minute safeguard** – If `refined_est_total_sec > 300`:
  - **Interactive session**: Prompt `"Do you want to continue? [y/N]: "`.
  - **Non-interactive session**: Proceed automatically (e.g., CI, batch).

This mirrors the parallel sampling pilots in Chapter A09. The envelope build phase cannot be interrupted once the full OpenCL run starts, so the pilot gives users a chance to decline long runs.

### 8.2 When the Pilot Runs

The pilot runs when `m1_total > 50000` and `use_opencl` is true. For smaller grids, the full evaluation runs without a pilot.


## 9. Installation and Availability

OpenCL support is **optional**. The package compiles and runs without it; in that case, `use_opencl` is ignored and the CPU path is always used.

To enable OpenCL:

- Install OpenCL headers and development libraries (see Chapter 12).
- Build the package from source with OpenCL enabled (configure script detects OpenCL).
- Ensure an OpenCL runtime (ICD loader, vendor driver) is available at runtime.

Use `has_opencl()` to check availability and `diagnose_glmbayes()` for a full diagnostic report. See Chapter 12 for platform-specific installation instructions (Windows, Linux, macOS) and AMD ROCm notes.


## 10. Cross-References

- **Chapter A05** – Simulation pipeline (optimization, standardization, envelope build, sampling)
- **Chapter A08** – Envelope-related functions
- **Chapter A09** – Parallel sampling (pilot pattern, 5-minute safeguard)
- **Chapter 12** – GPU acceleration: installation, `has_opencl`, `diagnose_glmbayes`
- **`?EnvelopeBuild`**, **`?EnvelopeEval`** – use_opencl argument
- **`?load_kernel_library`**, **`?load_kernel_source`** – Program assembly
- **`?glmb`**, **`?rglmb`** – use_opencl argument