---
title: "Chapter A09: Parallel Sampling Implementation using RcppParallel"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter A09: Parallel Sampling Implementation using RcppParallel}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(glmbayes)
```

# Chapter A09: Parallel Sampling Implementation using RcppParallel

## 1. Introduction

The `glmbayes` package uses **parallel sampling** for many envelope-based samplers because generating iid draws can be computationally expensive and acceptance rates may be low. By default, `use_parallel = TRUE`, so large fits often run on multiple cores via RcppParallel.

However, RcppParallel-based sampling **cannot be interrupted**. Once the main parallel loop starts, the R session is blocked until it completes. Users may experience the sampler as "freezing" during long runs. To address this, the package runs **pilots** that estimate how long sampling will take and, in interactive sessions, asks users to opt in or out before the non-interruptible phase begins.

This chapter describes the parallel sampling implementation, the pilot logic, and the interactive safeguards. Parallel **envelope construction** (e.g., in `EnvelopeDispersionBuild`) uses a similar pilot pattern and is noted briefly. GPU acceleration via OpenCL is covered in Chapter A10.


## 2. Where Parallel Sampling Is Used

Parallel sampling is used in two main C++ paths:

| Entry point | When | Notes |
|-------------|------|-------|
| `rNormalGLM_cpp` | Non-Gaussian GLMs (Poisson, binomial, Gamma, etc.) with `use_parallel = TRUE` and `n > 1` | Called by `rNormal_reg` |
| `rIndepNormalGammaReg_cpp` | Gaussian with independent Normal–Gamma prior, `use_parallel = TRUE` and `n > 1` | Uses `rIndepNormalGammaReg_std_parallel_cpp` |

**Serial fallback:** When `use_parallel = FALSE` or `n == 1`, the serial path is used instead. The serial path includes `Rcpp::checkUserInterrupt()` in its loop, so it can be interrupted (e.g., with Ctrl+C).


## 3. Why Parallel Sampling Cannot Be Interrupted

RcppParallel uses Intel TBB (Threading Building Blocks) to distribute work across multiple threads. The R main thread is blocked while `RcppParallel::parallelFor()` runs. R's `R_CheckUserInterrupt()` is not safe to call from worker threads, and there is no supported way to cancel a running TBB task from R. As a result:

- During the parallel phase, Ctrl+C and similar interrupts do not reliably stop the run.
- The session may appear to hang until the run completes.

The pilot phase mitigates this by estimating runtime *before* the full parallel run and giving users a chance to decline.


## 4. Pilot-Based Time Estimation and Opt-In

The pilot flow consists of:

1. **Pilot run** – A small serial or parallel batch (e.g., 1 draw or a few draws) to measure per-draw or per-observation time.
2. **Calibration run** – A larger parallel batch (typically up to ~1% of `n`, capped so it takes at most ~5 minutes) to refine the time estimate.
3. **Time estimate** – Extrapolation to the full sample size `n`.
4. **Interactive safeguard** – If the estimate exceeds 5 minutes, the user is prompted in interactive sessions.
5. **Non-interactive behavior** – In non-interactive sessions (e.g., R CMD check, batch jobs), the run proceeds automatically without prompting.

**Threshold:** 300 seconds (5 minutes). Above this, the user is asked whether to continue.

**Prompt:** `"Estimated simulation exceeds 5 minutes. Continue? [y/N]: "` (or similar).

**Responses:** `y`, `yes`, `1`, or `continue` → proceed; `n`, `no`, `2`, or empty → stop with an informative error.


## 5. rNormalGLM Pilot Logic (Non-Gaussian GLMs)

For non-Gaussian GLMs (Poisson, binomial, Gamma, etc.), the `rNormalGLM` C++ path uses `run_rcppparallel_pilot()`:

### 5.1 Single-Draw Test

A single draw is generated in **serial** mode to measure per-draw time and to detect problems early.

### 5.2 max_draws Cap and Zero Accepts

If the pilot hits an internal `max_draws` cap with **zero** accepted draws, this indicates the envelope may be insufficiently tight. The code warns that:

- The envelope may be too loose for the posterior.
- The full run has no cap and cannot be interrupted.
- If acceptance remains zero, the simulation may appear to hang indefinitely.

Recommended actions include:
- Set `use_opencl = TRUE` or increase the requested number of draws (which affects `n_envopt` and may tighten the envelope).
- Try a different `Gridtype` to force a tighter envelope.
- Strengthen the prior.

An interactive prompt then asks: `"Enter 1 to continue full run, 2 to stop and return partial results: "`. If the user chooses 2, partial test results are returned and the full run is aborted.

### 5.3 Calibration Run and Refined Estimate

The calibration batch size is:

\[
m_{\text{stage}} = \min\left( \lceil 0.01 \cdot n \rceil,\; \lfloor 300000 / \text{est\_per\_draw\_ms} \rfloor \right),
\]

where 300000 ms $\approx$ 5 minutes. The calibration run uses `parallelFor(0, m_stage, worker)` to measure per-draw time empirically.

From the calibration:

- `per_candidate_sec` = calibration elapsed time / total candidates used
- `est_per_draw_sec` = `per_candidate_sec` × `E_draws` (expected candidates per accepted draw)
- `est_total_sec` = `est_per_draw_sec` × `n`

### 5.4 Five-Minute Safeguard

If `est_total_sec > 300`:

- **Interactive:** Prompt `"Do you want to continue? [y/N]: "`
- **Non-interactive:** Proceed automatically with a note.

### 5.5 Diagnostics (verbose = TRUE)

With `verbose = TRUE`, the pilot reports:

- Estimated simulation time for `n` draws
- Note that the phase uses RcppParallel and cannot be safely interrupted
- Calibration elapsed time and number of draws
- `avg_candidates_per_draw` (empirical) vs `E_draws`
- `per_candidate_sec` and `est_per_draw_sec`


## 6. rIndepNormalGammaReg Pilot Logic

For Gaussian regression with independent Normal–Gamma priors, the pilot logic is similar in spirit but tailored to joint \((\beta, \phi)\) sampling:

### 6.1 Single-Draw Test

One observation is generated in serial mode to measure per-observation time.

### 6.2 Calibration Run

\[
m_{\text{stage}} = \min\left( \lceil 0.01 \cdot n \rceil,\; \lfloor 300000 / \text{per\_obs\_ms\_serial} \rfloor \right).
\]

The calibration run is parallel: `parallelFor(0, m_stage, worker)`.

### 6.3 Time Estimate

- `per_obs_sec` = calibration elapsed time / `m_stage`
- `est_total_sec` = `per_obs_sec` × `n`

### 6.4 Five-Minute Safeguard

Same as rNormalGLM: if `est_total_sec > 300`, prompt in interactive sessions.


## 7. EnvelopeDispersionBuild Pilot (Envelope Construction)

The RSS and UB2 steps in `EnvelopeDispersionBuild` use closed-form bounds
(`bound_rss_over_dispersion` and `bound_ub2_over_dispersion`) and no longer
perform RSS/UB2 minimization or pilot timing blocks.

### 7.1 Five-Minute Safeguard

Since minimization/pilots are disabled, the time-estimate safeguard is not used.


## 8. Choosing Serial vs Parallel

| Scenario | Recommendation |
|----------|----------------|
| **Debugging or small runs** | `use_parallel = FALSE` – serial path is interruptible |
| **Large production runs** | `use_parallel = TRUE` – faster; rely on the pilot and prompt |
| **Scripts / non-interactive** | No prompt; runs proceed automatically |

To disable parallel sampling:

```{r eval=FALSE}
glmb(formula, data, family, pfamily, n = 5000, use_parallel = FALSE)
rglmb(n = 5000, y, x, family, pfamily, use_parallel = FALSE)
```

Serial sampling is slower but can be interrupted. Parallel sampling is faster but non-interruptible once the main loop starts.


## 9. Technical Details

### 9.1 RcppParallel Worker Pattern

The parallel samplers use the standard RcppParallel `Worker` pattern:

- A struct inheriting from `RcppParallel::Worker`
- `operator()(size_t begin, size_t end)` processes the range `[begin, end)`
- Shared outputs use `RcppParallel::RVector` or `RcppParallel::RMatrix` for thread-safe writes
- Each thread has its own RNG state to avoid contention

### 9.2 Thread Safety

- RNG: Thread-local state; no sharing across workers
- Callbacks: The negative log-posterior and gradient functions (`f2`, `f3`) are called from workers; a mutex protects R callback invocations where needed
- Output: Write-only ranges per worker; no overlapping writes

### 9.3 Dependencies

- **RcppParallel** – Provides TBB and the `parallelFor` interface
- **TBB** – Bundled with RcppParallel; no separate installation required


## 10. Troubleshooting

| Symptom | Likely cause | Suggestion |
|---------|--------------|------------|
| Sampler appears to hang | Long parallel run | Use `verbose = TRUE` to see pilot estimates before the main run |
| Zero accepts in pilot | Envelope too loose | Try `use_opencl = TRUE`, larger `n`, different `Gridtype`, or stronger prior |
| Prompt never appears | Non-interactive session | Expected; runs proceed automatically in batch/CI |
| Want to interrupt a run | Parallel phase has started | Cannot interrupt; use `use_parallel = FALSE` for interruptible runs |


## 11. Cross-References

- **Chapter A05** – Simulation pipeline (optimization, standardization, envelope, sampling)
- **Chapter A08** – Envelope-related functions
- **Chapter A10** – OpenCL acceleration for envelope construction
- **`?glmb`**, **`?rglmb`**, **`?lmb`**, **`?rlmb`** – `use_parallel` argument
- **NEWS.md** – Parallel sampling and pilot-related changes
