---
title: "Getting Started with clinTrialData"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with clinTrialData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
has_arrow <- requireNamespace("arrow", quietly = TRUE)
```

## Introduction

`clinTrialData` is a **community-grown library** of clinical trial example
datasets for R. The package ships with a core set of studies and is designed
to expand over time: anyone can contribute a new data source, and users can
download any available study on demand without waiting for a new package
release.

Data is stored in Parquet format and accessed through the `connector` package,
giving a consistent API regardless of which study you are working with.

Key features:

- **Growing library**: New datasets are added by the community as GitHub Release assets, so no CRAN resubmission is needed
- **On-demand download**: Use `download_study()` to fetch any available study and cache it locally
- **Generic interface**: Use `connect_clinical_data()` to connect to any available data source
- **Automatic discovery**: `list_data_sources()` finds all studies on your machine; `list_available_studies()` shows everything available to download
- **Data protection**: Downloaded and bundled datasets are locked against accidental modification

## Installation

```r
# Install from CRAN
install.packages("clinTrialData")

# Or the development version from GitHub:
# install.packages("remotes")
remotes::install_github("Lovemore-Gakava/clinTrialData")
```

## Available Data Sources

```{r}
library(clinTrialData)

# Studies on your machine (bundled + previously downloaded)
list_data_sources()
```

## Quick Start

### Connect to a Data Source

The package bundles the CDISC Pilot 01 study, so you can connect immediately:

```{r, eval = has_arrow}
# Connect to CDISC Pilot data
db <- connect_clinical_data("cdisc_pilot")

# List available datasets in the ADaM domain
db$adam$list_content_cnt()

# Read the subject-level dataset
adsl <- db$adam$read_cnt("adsl")
head(adsl[, c("USUBJID", "TRT01A", "AGE", "SEX", "RACE")])
```

### Discover and Download Additional Studies

Studies beyond the bundled data can be downloaded from GitHub Releases:

```{r, eval = FALSE}
# What's available to download?
list_available_studies()

# Download a study once; it is cached locally from then on
download_study("cdisc_pilot_extended")

# Where is the cache?
cache_dir()
```

### Explore the Data

```{r, eval = has_arrow}
# Dimensions
dim(adsl)

# Quick structure overview
str(adsl, list.len = 10)
```
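Beyond `dim()` and `str()`, a per-variable overview of types and missing values is often a useful first look. The following is a minimal sketch in base R: `describe_columns()` is a hypothetical helper, not part of `clinTrialData`, and the small data frame is made-up illustration data standing in for `adsl`.

```r
# Hypothetical helper (not part of clinTrialData): one row per variable,
# with its first class and the number of missing values.
describe_columns <- function(df) {
  data.frame(
    variable  = names(df),
    class     = vapply(df, function(x) class(x)[1], character(1),
                       USE.NAMES = FALSE),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1),
                       USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up stand-in for a subject-level dataset
toy <- data.frame(
  USUBJID = c("01-001", "01-002", "01-003"),
  AGE     = c(63, NA, 71),
  SEX     = c("F", "M", NA),
  stringsAsFactors = FALSE
)

overview <- describe_columns(toy)
overview
```

With a real connection in place, the same helper could be applied as `describe_columns(adsl)`.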

## Working with Different Domains

### ADaM Datasets

```{r, eval = has_arrow}
# Read adverse events data
adae <- db$adam$read_cnt("adae")
head(adae[, c("USUBJID", "AEDECOD", "AESEV", "AESER")])
```

### SDTM Datasets

```{r, eval = has_arrow}
# Read demographics
dm <- db$sdtm$read_cnt("dm")
head(dm[, c("USUBJID", "ARM", "AGE", "SEX", "RACE")])
```

## Example Analysis

```{r, eval = has_arrow && requireNamespace("dplyr", quietly = TRUE)}
library(dplyr)

# Basic demographic summary by treatment
adsl |>
  group_by(TRT01A) |>
  summarise(
    n = n(),
    mean_age = mean(AGE, na.rm = TRUE),
    female_pct = mean(SEX == "F", na.rm = TRUE) * 100,
    .groups = "drop"
  )
```
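A common next step is to combine datasets on `USUBJID`, for example to count adverse events by treatment and severity. The sketch below uses base R only and two small made-up data frames in place of `adsl` and `adae`; the values are illustrative and do not come from any study. It assumes the treatment variable has to be brought in from the subject-level data, which may not be necessary if your events dataset already carries one.

```r
# Made-up stand-ins for the subject-level and adverse-event datasets
adsl_toy <- data.frame(
  USUBJID = c("01-001", "01-002", "01-003"),
  TRT01A  = c("Placebo", "Active", "Active"),
  stringsAsFactors = FALSE
)
adae_toy <- data.frame(
  USUBJID = c("01-001", "01-002", "01-002", "01-003"),
  AEDECOD = c("HEADACHE", "NAUSEA", "HEADACHE", "RASH"),
  AESEV   = c("MILD", "MILD", "MODERATE", "SEVERE"),
  stringsAsFactors = FALSE
)

# Attach each subject's treatment to their adverse-event records
ae_with_trt <- merge(
  adae_toy,
  adsl_toy[, c("USUBJID", "TRT01A")],
  by = "USUBJID"
)

# Cross-tabulate event counts by treatment and severity
ae_counts <- table(
  ae_with_trt$TRT01A,
  ae_with_trt$AESEV,
  dnn = c("TRT01A", "AESEV")
)
ae_counts
```

Note that this counts events, not subjects: a subject with two events contributes two rows.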

## Contributing New Data Sources

Anyone can add a new study to the library. Datasets live on
[GitHub Releases](https://github.com/Lovemore-Gakava/clinTrialData/releases),
not inside the package, so **no pull request or CRAN submission is needed**
to add data.

### Step 1: Prepare your data

Organize your Parquet files by domain:

```
your_new_study/
├── adam/
│   ├── adsl.parquet
│   └── adae.parquet
└── sdtm/
    ├── dm.parquet
    └── ae.parquet
```
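If your datasets are currently data frames in R, one way to produce this layout is with `arrow::write_parquet()`. The following is a minimal sketch: the data frames are made-up placeholders, and the study is written under `tempdir()` purely for illustration. The Parquet files are only written when the `arrow` package is installed.

```r
# Target folder for the study (a temporary location for this illustration)
study_dir <- file.path(tempdir(), "your_new_study")

# Placeholder datasets grouped by domain; replace with your own data frames
datasets <- list(
  adam = list(
    adsl = data.frame(USUBJID = c("01-001", "01-002"), AGE = c(63, 71))
  ),
  sdtm = list(
    dm = data.frame(USUBJID = c("01-001", "01-002"), SEX = c("F", "M"))
  )
)

for (domain in names(datasets)) {
  # One sub-folder per domain, e.g. your_new_study/adam
  domain_dir <- file.path(study_dir, domain)
  dir.create(domain_dir, recursive = TRUE, showWarnings = FALSE)

  for (dataset_name in names(datasets[[domain]])) {
    path <- file.path(domain_dir, paste0(dataset_name, ".parquet"))
    # Write the file only if arrow is available
    if (requireNamespace("arrow", quietly = TRUE)) {
      arrow::write_parquet(datasets[[domain]][[dataset_name]], path)
    }
  }
}

# Check the resulting layout
list.files(study_dir, recursive = TRUE)
```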

### Step 2: Upload data and metadata to a GitHub Release

Open an [issue](https://github.com/Lovemore-Gakava/clinTrialData/issues) to
request a release slot, then run the helper script `data-raw/upload_to_release.R`
with the repository root as your working directory:

```r
source("data-raw/upload_to_release.R")

# Upload the data zip
upload_study_to_release("your_new_study", tag = "v1.1.0")

# Generate and upload metadata (enables dataset_info() for your study)
generate_and_upload_metadata(
  source      = "your_new_study",
  description = "Brief description of your study",
  version     = "v1.1.0",
  license     = "Your license here",
  source_url  = "https://link-to-original-data",
  tag         = "v1.1.0"
)
```

### Step 3: Users can inspect and access it immediately

```r
dataset_info("your_new_study")       # inspect before downloading
download_study("your_new_study")     # download and cache
connect_clinical_data("your_new_study")
```

The study is available to all users as soon as it is uploaded, without
waiting for a new package release.