End-to-End Pipeline: From API to Multi-Sport Analysis

vald.extractor

2026-01-18

Introduction

The vald.extractor package provides a robust, production-ready pipeline for extracting, cleaning, and analyzing VALD ForceDecks data across multiple sports. This vignette demonstrates the complete workflow from API authentication to publication-ready visualizations.

Key Problems Solved

  1. Stability: Chunked batch processing prevents timeout errors when working with large datasets (1000+ tests)
  2. Data Cleaning: Automated sports taxonomy mapping standardizes inconsistent team/group names
  3. Reproducibility: All processing steps are documented and version-controlled
  4. Generic Analysis: Test-type suffix removal enables writing analysis code once that works for all test types

Step 1: Authentication and Data Extraction

First, set your VALD API credentials and extract test/trial data:

library(vald.extractor)

# Set credentials
valdr::set_credentials(
  client_id     = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id     = "your_tenant_id",
  region        = "aue"
)

# Fetch data from 2020 onwards in chunks of 100 tests
vald_data <- fetch_vald_batch(
  start_date = "2020-01-01T00:00:00Z",
  chunk_size = 100,
  verbose = TRUE
)

# Extract components
tests_df <- vald_data$tests
trials_df <- vald_data$trials

cat("Extracted", nrow(tests_df), "tests and", nrow(trials_df), "trials\n")

Why chunking matters: Without chunking, organizations with 5000+ tests can hit API timeout errors on a single large request. The chunked approach processes 100 tests at a time and handles failures per chunk: an error is logged and extraction continues with the next chunk instead of halting.
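The chunk-and-continue pattern described above can be sketched in base R. This is a minimal illustration, not the internals of fetch_vald_batch(): fetch_chunk() and process_in_chunks() are hypothetical names, and the simulated failure stands in for a real API timeout.

```r
# Hypothetical stand-in for one API request; it fails for one chunk on purpose
fetch_chunk <- function(ids) {
  if (23 %in% ids) stop("API timeout")
  data.frame(testId = ids, value = ids * 2)
}

process_in_chunks <- function(ids, chunk_size = 100) {
  # Split the ids into consecutive groups of at most chunk_size
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  results <- lapply(seq_along(chunks), function(i) {
    tryCatch(
      fetch_chunk(chunks[[i]]),
      error = function(e) {
        # Log the failure and carry on with the next chunk
        message("ERROR on chunk ", i, ": ", conditionMessage(e))
        NULL  # a failed chunk contributes no rows
      }
    )
  })
  do.call(rbind, results)  # NULL entries are dropped by rbind
}

out <- process_in_chunks(1:50, chunk_size = 10)
nrow(out)  # rows from the four chunks that succeeded
```

Only the chunk containing id 23 is lost; the other chunks are still returned, which is the property that makes partial extraction possible.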

Step 2: Fetch and Standardize Metadata

Retrieve athlete profiles and group memberships via OAuth2:

# Fetch raw metadata
metadata <- fetch_vald_metadata(
  client_id     = "your_client_id",
  client_secret = "your_client_secret",
  tenant_id     = "your_tenant_id",
  region        = "aue"
)

# Standardize: unnest group memberships and create unified athlete records
athlete_metadata <- standardize_vald_metadata(
  profiles = metadata$profiles,
  groups   = metadata$groups
)

head(athlete_metadata)

Understanding the Metadata Structure

The VALD API stores group memberships as a nested array (groupIds). The standardize_vald_metadata() function:

  1. Unnests the array so each athlete-group pair gets its own row
  2. Joins with group names from the Groups API
  3. Collapses back to one row per athlete with all group names concatenated

Result: A clean metadata table where all_group_names contains “Football, U18, Elite” for an athlete in multiple groups.
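The three steps above can be sketched on toy data in base R. This is an illustration of the unnest, join, and collapse idea under assumed column names (groupId and groupName are hypothetical); it is not the code inside standardize_vald_metadata().

```r
# Toy inputs: groupIds is a list column, mimicking the nested array
profiles <- data.frame(profileId = c("a1", "b2"), stringsAsFactors = FALSE)
profiles$groupIds <- list(c("g1", "g2", "g3"), "g1")

groups <- data.frame(
  groupId   = c("g1", "g2", "g3"),
  groupName = c("Football", "U18", "Elite"),
  stringsAsFactors = FALSE
)

# 1. Unnest: one row per athlete-group pair
long <- data.frame(
  profileId = rep(profiles$profileId, lengths(profiles$groupIds)),
  groupId   = unlist(profiles$groupIds),
  stringsAsFactors = FALSE
)

# 2. Join: look up each group name (match() keeps the row order intact)
long$groupName <- groups$groupName[match(long$groupId, groups$groupId)]

# 3. Collapse: back to one row per athlete, names concatenated
athletes <- aggregate(groupName ~ profileId, data = long,
                      FUN = function(x) paste(x, collapse = ", "))
names(athletes)[2] <- "all_group_names"
athletes
```

Athlete a1 ends up with a single row whose all_group_names value lists all three groups, matching the shape described in the result above.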

Step 3: Apply Sports Taxonomy

Map inconsistent team names to standardized sports categories:

athlete_metadata <- classify_sports(
  data = athlete_metadata,
  group_col = "all_group_names",
  output_col = "sports_clean"
)

# Inspect the mapping
table(athlete_metadata$sports_clean)

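The Value Add: This regex-based classification is the core innovation. Organizations often have several inconsistent team or group names for the same sport (for example "FSI", "TCFC", "MCFC", and "Soccer" all referring to Football).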

Without this automation, analysts spend hours manually categorizing athletes. The package includes patterns for 15+ sports and can be easily extended.
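To make the idea concrete, here is a minimal sketch of regex-based classification in base R. The patterns are modelled on the taxonomy example under "Version Control Your Taxonomy"; classify_one() and the "Unclassified" fallback are assumptions for illustration and may not match how classify_sports() behaves.

```r
# Hypothetical pattern list: one regular expression per sport
sports_patterns <- list(
  Football   = "Football|FSI|TCFC|MCFC|Soccer",
  Basketball = "Basketball|BBall",
  Cricket    = "Cricket"
)

# Return the first sport whose pattern matches, or "Unclassified" if none do
classify_one <- function(group_names, patterns) {
  for (sport in names(patterns)) {
    if (grepl(patterns[[sport]], group_names, ignore.case = TRUE)) {
      return(sport)
    }
  }
  "Unclassified"
}

group_strings <- c("TCFC U18, Elite", "Senior BBall", "Rowing Squad")
sports_clean <- vapply(group_strings, classify_one, character(1),
                       patterns = sports_patterns, USE.NAMES = FALSE)
sports_clean
```

Because the first matching pattern wins, the order of the list acts as a priority order when a group string could match more than one sport.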

Step 4: Transform to Wide Format and Join

Combine trials into tests, pivot to wide format, and merge with metadata:

library(dplyr)

# Join trials and tests
all_data <- left_join(trials_df, tests_df, by = c("testId", "athleteId"))

# Aggregate trials and pivot to wide format
structured_test_data <- all_data %>%
  group_by(athleteId, testId, testType, recordedUTC,
           recordedDateOffset, trialLimb, definition_name) %>%
  summarise(
    mean_result = mean(as.numeric(value), na.rm = TRUE),
    mean_weight = mean(as.numeric(weight), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    TestTimestampUTC = lubridate::ymd_hms(recordedUTC),
    TestTimestampLocal = TestTimestampUTC + lubridate::minutes(recordedDateOffset),
    Testdate = as.Date(TestTimestampLocal)
  ) %>%
  select(athleteId, Testdate, testId, testType, trialLimb,
         definition_name, mean_result, mean_weight) %>%
  tidyr::pivot_wider(
    id_cols = c(athleteId, Testdate, testId, mean_weight),
    names_from = c(definition_name, trialLimb, testType),
    values_from = mean_result,
    names_glue = "{definition_name}_{trialLimb}_{testType}"
  ) %>%
  rename(Weight_on_Test_Day = mean_weight)

# Join with metadata
final_analysis_data <- structured_test_data %>%
  mutate(profileId = as.character(athleteId)) %>%
  left_join(
    athlete_metadata %>% mutate(profileId = as.character(profileId)),
    by = "profileId"
  ) %>%
  mutate(
    Testdate = as.Date(Testdate),
    dateofbirth = as.Date(dateOfBirth),
    age = as.numeric((Testdate - dateofbirth) / 365.25),
    sports = sports_clean
  )

cat("Final dataset:", nrow(final_analysis_data), "rows with",
    ncol(final_analysis_data), "columns\n")
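The local test date in the pipeline above comes from shifting the UTC timestamp by recordedDateOffset. The base R sketch below shows why that matters near midnight; it assumes the offset is expressed in minutes, as the lubridate::minutes() call implies, and uses made-up values.

```r
# Assumption: the offset is the local difference from UTC in minutes
recorded_utc <- "2024-03-01T23:30:00Z"
offset_min   <- 600  # e.g. a UTC+10 site

ts_utc   <- as.POSIXct(recorded_utc, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
ts_local <- ts_utc + offset_min * 60  # POSIXct arithmetic is in seconds

date_utc   <- as.Date(ts_utc, tz = "UTC")    # calendar date before the shift
date_local <- as.Date(ts_local, tz = "UTC")  # calendar date after the shift
c(date_utc, date_local)
```

A test recorded late in the evening UTC lands on the following calendar day locally, so skipping the offset would assign it to the wrong test date.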

Step 5: Split by Test Type

The “Don’t Repeat Yourself” (DRY) principle in action:

# Split into separate datasets per test type
test_datasets <- split_by_test(
  data = final_analysis_data,
  metadata_cols = c("profileId", "sex", "Testdate", "dateofbirth",
                    "age", "testId", "Weight_on_Test_Day", "sports")
)

# Access individual test types
cmj_data <- test_datasets$CMJ
dj_data <- test_datasets$DJ

# Crucially: column names are now generic
head(names(cmj_data))
# "profileId", "sex", "Testdate", "PEAK_FORCE_Both", "JUMP_HEIGHT_Both", ...
# Note: "_CMJ" suffix has been removed!

Why this matters: You can now write one analysis function that works for all test types:

analyze_peak_force <- function(test_data) {
  summary(test_data$PEAK_FORCE_Both)  # Works for CMJ, DJ, ISO, etc.
}

# Apply to all test types
lapply(test_datasets, analyze_peak_force)

Without suffix removal, you’d need separate code for PEAK_FORCE_Both_CMJ, PEAK_FORCE_Both_DJ, etc.
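For intuition, suffix-based splitting can be sketched in a few lines of base R. split_by_suffix() is a hypothetical helper written for this illustration, operating on a toy table; the real split_by_test() may work differently.

```r
# Toy wide table with test-type suffixes, as produced by the pivot step
wide <- data.frame(
  profileId            = c("a1", "b2"),
  PEAK_FORCE_Both_CMJ  = c(2400, NA),
  JUMP_HEIGHT_Both_CMJ = c(35.2, NA),
  PEAK_FORCE_Both_DJ   = c(NA, 3100)
)

split_by_suffix <- function(data, test_types, metadata_cols) {
  out <- lapply(test_types, function(tt) {
    suffix <- paste0("_", tt, "$")
    metric_cols <- grep(suffix, names(data), value = TRUE)
    sub_df <- data[, c(metadata_cols, metric_cols), drop = FALSE]
    # Keep only rows that have at least one value for this test type
    keep <- rowSums(!is.na(sub_df[, metric_cols, drop = FALSE])) > 0
    sub_df <- sub_df[keep, , drop = FALSE]
    # Strip the suffix so column names are shared across test types
    names(sub_df) <- sub(suffix, "", names(sub_df))
    sub_df
  })
  names(out) <- test_types
  out
}

datasets <- split_by_suffix(wide, c("CMJ", "DJ"), metadata_cols = "profileId")
names(datasets$CMJ)
names(datasets$DJ)
```

Both resulting tables expose a PEAK_FORCE_Both column, which is what allows a single function to be applied across the whole list.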

Step 6: Patch Missing Metadata (Optional)

Fix missing or incorrect demographic data:

# Create an Excel file with: profileId, sex, dateOfBirth
# Example: corrections.xlsx with rows like:
#   profileId         sex       dateOfBirth
#   abc123           Male      1995-03-15
#   def456           Female    1998-07-22

cmj_data <- patch_metadata(
  data = cmj_data,
  patch_file = "corrections.xlsx",
  patch_sheet = 1,
  id_col = "profileId",
  fields_to_patch = c("sex", "dateOfBirth")
)

# Verify corrections
table(cmj_data$sex)  # "Unknown" values should now be fixed

Step 7: Generate Summary Statistics

Create publication-ready summary tables:

cmj_summary <- summary_vald_metrics(
  data = cmj_data,
  group_vars = c("sex", "sports"),
  exclude_cols = c("profileId", "testId", "Testdate", "dateofbirth", "age")
)

# View summary
print(cmj_summary)

# Export to CSV
write.csv(cmj_summary, "cmj_summary_by_sport_sex.csv", row.names = FALSE)

Output example:

sex    sports      PEAK_FORCE_Both_Mean  PEAK_FORCE_Both_SD  PEAK_FORCE_Both_CV  PEAK_FORCE_Both_N
Male   Football    2450.32               245.67              10.03               45
Male   Basketball  2310.45               198.23              8.58                32
Female Football    1980.12               187.45              9.47                38
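The four statistics in each column group can be reproduced by hand. The sketch below uses toy data and assumes the CV column is the coefficient of variation in percent (SD / mean * 100); summarise_metric() is a hypothetical helper, not part of the package.

```r
# Toy data for a single metric
cmj <- data.frame(
  sex = c("Male", "Male", "Male", "Female", "Female"),
  PEAK_FORCE_Both = c(2400, 2500, 2600, 1900, 2100)
)

# Mean, SD, CV (%) and N for one numeric vector, ignoring missing values
summarise_metric <- function(x) {
  x <- x[!is.na(x)]
  c(Mean = mean(x), SD = sd(x), CV = sd(x) / mean(x) * 100, N = length(x))
}

res <- aggregate(PEAK_FORCE_Both ~ sex, data = cmj, FUN = summarise_metric)
res <- do.call(data.frame, res)  # flatten the matrix column into plain columns
res
```

For the Male rows the mean is 2500 and the SD is 100, giving a CV of 4%.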

Step 8: Compare Across Groups

Create boxplots for cross-sectional comparisons:

plot_vald_compare(
  data = cmj_data,
  metric_col = "PEAK_FORCE_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Peak Force Comparison by Sport and Sex"
)

# Compare jump height
plot_vald_compare(
  data = cmj_data,
  metric_col = "JUMP_HEIGHT_Both",
  group_col = "sports",
  fill_col = "sex",
  title = "CMJ Jump Height Comparison"
)

Advanced: Multi-Test Analysis

Analyze multiple test types simultaneously:

# Define a function to extract a common metric across test types
compare_metric_across_tests <- function(test_datasets, metric = "PEAK_FORCE_Both") {

  results <- lapply(names(test_datasets), function(test_name) {
    test_data <- test_datasets[[test_name]]

    if (metric %in% names(test_data)) {
      data.frame(
        testType = test_name,
        metric = metric,
        mean = mean(test_data[[metric]], na.rm = TRUE),
        sd = sd(test_data[[metric]], na.rm = TRUE),
        n = sum(!is.na(test_data[[metric]]))
      )
    }
  })

  do.call(rbind, results)
}

# Compare peak force across CMJ, DJ, and ISO
force_comparison <- compare_metric_across_tests(test_datasets, "PEAK_FORCE_Both")
print(force_comparison)

Best Practices for Production Use

1. Schedule Regular Updates

# Weekly refresh script
library(vald.extractor)

# Fetch only new data since last update
last_update <- "2024-01-01T00:00:00Z"

new_data <- fetch_vald_batch(
  start_date = last_update,
  chunk_size = 100
)

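# Append to existing database
# (assumes vald_database.RData holds existing_tests and existing_trials)
load("vald_database.RData")
existing_tests <- rbind(existing_tests, new_data$tests)
existing_trials <- rbind(existing_trials, new_data$trials)

# Save under the same object names so the next run can load and append again
save(existing_tests, existing_trials, file = "vald_database.RData")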
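If a refresh window overlaps with data already stored (for example, when the script is rerun with the same last_update), a plain rbind() stores the same test twice. A minimal guard, assuming testId uniquely identifies a test and using toy tables in place of the real ones:

```r
# Toy stand-ins for the stored and newly fetched test tables
existing_tests <- data.frame(testId = c("t1", "t2"), testType = c("CMJ", "DJ"))
new_tests      <- data.frame(testId = c("t2", "t3"), testType = c("DJ", "CMJ"))

combined <- rbind(existing_tests, new_tests)

# Keep the first occurrence of each testId so overlapping fetches do not double-count
combined <- combined[!duplicated(combined$testId), ]
nrow(combined)
```

The trials table needs the same treatment with whatever columns identify a single trial record in your data.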

2. Error Logging

The chunked extraction automatically logs errors without halting:

# Errors are printed to console with chunk information:
# "ERROR on chunk 23 (rows 2201-2300): API timeout"
# "Continuing to next chunk..."

# This ensures partial data extraction even if some chunks fail

3. Version Control Your Taxonomy

Store your sports classification rules in a separate config file:

# sports_taxonomy.R
sports_patterns <- list(
  Football = "Football|FSI|TCFC|MCFC|Soccer",
  Basketball = "Basketball|BBall",
  Cricket = "Cricket"
  # ... add your organization's patterns (list() does not allow a trailing comma)
)

# Then use in classify_sports()

Conclusion

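The vald.extractor package transforms raw VALD API data into analysis-ready datasets with chunked, fault-tolerant extraction, standardized sports categories, and generic column names that let one analysis function serve every test type.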

This workflow is production-tested with 10,000+ tests across 15+ sports and is designed for CRAN submission.

Next Steps

For issues or feature requests, visit: GitHub Issues