Getting Started with SportMiner

Praveen D Chougale and Usha Ananthakumar

2026-01-12

Introduction

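SportMiner is an R package for mining, analyzing, and visualizing scientific literature in sport science domains. It provides an end-to-end workflow: retrieving papers from Scopus, preprocessing abstracts, building a document-term matrix, fitting and comparing topic models, and visualizing topics and keyword networks.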

This vignette demonstrates the core functionality of SportMiner through a practical example.

Installation

install.packages("SportMiner")

Setting Up Your Scopus API Key

Before using SportMiner, you need a Scopus API key. You can obtain one by registering at the Elsevier Developer Portal.

library(SportMiner)

# Option 1: Set directly
sm_set_api_key("your_api_key_here")

# Option 2: Set via environment variable (recommended)
# Add to your .Renviron file:
# SCOPUS_API_KEY=your_api_key_here
# Then restart R and run:
sm_set_api_key()
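
Before relying on Option 2, it can help to confirm that the environment variable is actually visible to the current R session. The following is a minimal base R sketch; the helper name has_scopus_key is hypothetical and not part of SportMiner.

```r
# Hypothetical helper (not part of SportMiner): returns TRUE when the
# named environment variable is set to a non-empty value.
has_scopus_key <- function(var = "SCOPUS_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_scopus_key()) {
  message("SCOPUS_API_KEY is not set; add it to .Renviron and restart R.")
}
```

If the message appears, the key was probably added to .Renviron after the session started, or to a different .Renviron file than the one R reads.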

Step 1: Retrieve Papers from Scopus

Let’s search for papers on talent identification in sport science that use principal component analysis or cluster analysis.

# Define the search query
query <- paste0(
  'TITLE-ABS-KEY(',
  '("talent identification" OR "sport science" OR "athlete") ',
  'AND ',
  '("principal component analysis" OR "PCA" OR "cluster analysis") ',
  ') AND DOCTYPE(ar) AND PUBYEAR > 2010'
)

# Retrieve papers
papers <- sm_search_scopus(
  query = query,
  max_count = 100,
  verbose = TRUE
)

# View the data structure
head(papers[, c("title", "year", "author_keywords")])
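
When the term lists change often, assembling the query string by hand becomes error-prone. The sketch below builds the same kind of query from character vectors; or_group and build_query are hypothetical helpers written for this illustration, not SportMiner functions, and the result matches the query above only up to whitespace.

```r
# Hypothetical helpers (not part of SportMiner).
# Wrap each term in double quotes and join the terms with OR.
or_group <- function(terms) {
  paste0("(", paste0('"', terms, '"', collapse = " OR "), ")")
}

# Combine two term groups into a TITLE-ABS-KEY clause restricted to
# articles published after a given year.
build_query <- function(topic_terms, method_terms, after_year) {
  paste0(
    "TITLE-ABS-KEY(",
    or_group(topic_terms), " AND ", or_group(method_terms),
    ") AND DOCTYPE(ar) AND PUBYEAR > ", after_year
  )
}

query2 <- build_query(
  topic_terms  = c("talent identification", "sport science", "athlete"),
  method_terms = c("principal component analysis", "PCA", "cluster analysis"),
  after_year   = 2010
)
cat(query2, "\n")
```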

Step 2: Preprocess Text Data

Convert the raw abstracts into cleaned, stemmed word counts.

# Preprocess abstracts
processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  min_word_length = 3
)

# View the processed data
head(processed_data)
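
To make the idea of per-document word counts concrete, here is a toy base R sketch on two invented abstracts. It is not SportMiner's implementation: it skips stemming and stop-word removal, and the count_words helper and the column names doc_id, word, and n are assumptions made for this illustration.

```r
# Toy illustration only; NOT how sm_preprocess_text() is implemented.
abstracts <- c(
  doc1 = "Cluster analysis of elite athletes in talent identification.",
  doc2 = "PCA was applied to athlete performance data."
)

# Lowercase, split on non-letters, drop short words, then count.
count_words <- function(text, min_word_length = 3) {
  words <- tolower(unlist(strsplit(text, "[^A-Za-z]+")))
  words <- words[nchar(words) >= min_word_length]
  as.data.frame(table(word = words), responseName = "n",
                stringsAsFactors = FALSE)
}

# One row per (document, word) pair.
word_counts <- do.call(rbind, lapply(names(abstracts), function(id) {
  cbind(doc_id = id, count_words(abstracts[[id]]))
}))
head(word_counts)
```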

Step 3: Create Document-Term Matrix

Transform the word counts into a sparse matrix suitable for topic modeling.

# Create DTM
dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Check dimensions
print(paste("Documents:", nrow(dtm), "| Terms:", ncol(dtm)))
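
The effect of frequency thresholds is easiest to see on a small matrix. The sketch below assumes that a minimum threshold applies to a term's total count and a maximum threshold to the share of documents containing the term; whether sm_create_dtm() defines min_term_freq and max_term_freq exactly this way is an assumption, so check the function's help page.

```r
# Toy document-term matrix: rows are documents, columns are terms.
toy_dtm <- matrix(
  c(2, 1, 0, 1,
    1, 1, 0, 1,
    1, 0, 1, 1,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("athlet", "cluster", "rare", "sport"))
)

term_total <- colSums(toy_dtm)       # total occurrences per term
doc_share  <- colMeans(toy_dtm > 0)  # share of documents containing the term

# Assumed semantics: keep terms that occur at least 2 times overall
# and appear in at most 75% of documents.
keep <- term_total >= 2 & doc_share <= 0.75
filtered <- toy_dtm[, keep, drop = FALSE]
dim(filtered)
```

Here "rare" is dropped for occurring only once and "sport" for appearing in every document, which is the usual motivation for trimming both ends of the vocabulary.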

Step 4: Select Optimal Number of Topics

Use coherence-based selection to find the best number of topics.

# Test different values of k
k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 16, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$results)
print(paste("Optimal k:", k_selection$optimal_k))
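
Coherence curves are often flat near their peak, so it can be worth checking whether a smaller k scores almost as well. The sketch below uses made-up scores and assumed column names (k, coherence); the actual structure of k_selection$results may differ.

```r
# Made-up coherence scores for illustration only; column names are assumed.
toy_results <- data.frame(
  k = seq(4, 16, by = 2),
  coherence = c(0.31, 0.36, 0.41, 0.42, 0.40, 0.38, 0.35)
)

# k with the highest coherence.
best_k <- toy_results$k[which.max(toy_results$coherence)]

# Smallest k whose coherence is within 0.015 of the maximum.
near_best <- toy_results$coherence >= max(toy_results$coherence) - 0.015
parsimonious_k <- min(toy_results$k[near_best])

c(best_k = best_k, parsimonious_k = parsimonious_k)
```

The tolerance of 0.015 is arbitrary; the point is only that a slightly smaller model can be easier to interpret at little cost in coherence.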

Step 5: Train Topic Model

Fit an LDA model using the optimal k.

# Train the model
lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 500
)

Step 6: Visualize Topics

Top Terms per Topic

# Plot top terms
sm_plot_topic_terms(
  model = lda_model,
  n_terms = 10
)

Topic Frequency Distribution

# Plot document distribution
sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)

Step 7: Keyword Co-occurrence Network

Visualize how author keywords co-occur across papers.

# Create network
network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 2,
  top_n = 30
)

print(network_plot)
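
The counts behind such a network can be sketched in base R. The example below uses invented keyword strings and assumes keywords are separated by semicolons; the separator in your Scopus data may differ, and this is not SportMiner's internal code.

```r
# Toy keyword strings; the ";" separator is an assumption about the data.
keywords <- c(
  "talent identification; cluster analysis; youth",
  "cluster analysis; PCA",
  "talent identification; cluster analysis"
)

# Split, trim, lowercase, and de-duplicate keywords within each paper.
kw_list <- lapply(strsplit(keywords, ";", fixed = TRUE),
                  function(x) unique(trimws(tolower(x))))
vocab <- sort(unique(unlist(kw_list)))

# Paper-by-keyword incidence matrix (1 if the paper lists the keyword).
incidence <- t(vapply(kw_list, function(x) as.numeric(vocab %in% x),
                      numeric(length(vocab))))
colnames(incidence) <- vocab

# Keyword-by-keyword co-occurrence counts; zero the diagonal, which
# would otherwise hold each keyword's own paper count.
cooc <- crossprod(incidence)
diag(cooc) <- 0

# Keyword pairs that co-occur in at least 2 papers.
which(cooc >= 2 & upper.tri(cooc), arr.ind = TRUE)
```

A threshold like the 2 used here plays the same role as min_cooccurrence: pairs below it are left out of the network.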

Advanced: Compare Multiple Models

Compare LDA, STM, and CTM to find the best-performing model.

# Run comparison
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)

# Get recommendation
print(paste("Recommended model:", comparison$recommendation))

# Use the recommended model
best_model <- comparison$models[[tolower(comparison$recommendation)]]

Customizing Visualizations

All plotting functions apply theme_sportminer() by default. Because the plots can be extended with +, you can layer further ggplot2 customizations on top.

library(ggplot2)

# Create the base plot
p <- sm_plot_topic_frequency(lda_model, dtm)

# Add customizations
p +
  labs(
    title = "Distribution of Research Topics in Sport Science",
    subtitle = "Based on up to 100 Scopus papers published after 2010"
  ) +
  theme_sportminer(base_size = 14, grid = FALSE)

Best Practices

  1. API Rate Limits: Scopus enforces rate limits. Keep max_count no larger than your analysis needs and pause between large queries.

  2. Reproducibility: Always set a seed when running topic models:

    sm_train_lda(dtm, k = 10, seed = 1729)

  3. Hyperparameter Tuning: Experiment with min_term_freq and max_term_freq in sm_create_dtm() to balance vocabulary size and model performance.

  4. Model Selection: Don’t rely solely on coherence. Inspect the top terms for each topic to ensure interpretability.
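
For the first practice above, one simple way to space out requests is a loop that sleeps between calls. The sketch below is a hypothetical helper, not part of SportMiner, and it uses a stand-in fetch function so that it runs offline; the pause length is a guess, not a documented Scopus limit.

```r
# Hypothetical helper (not part of SportMiner): runs fetch_fn on each
# query and sleeps between calls to stay under a rate limit.
fetch_with_delay <- function(queries, fetch_fn, pause_seconds = 2) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    results[[i]] <- fetch_fn(queries[[i]])
    # No need to wait after the final query.
    if (i < length(queries)) Sys.sleep(pause_seconds)
  }
  results
}

# Stand-in fetcher so the sketch runs without network access.
fake_fetch <- function(q) data.frame(query = q, n_hits = nchar(q))

batches <- fetch_with_delay(
  c("TITLE(athlete)", "TITLE(coach)"),
  fake_fetch,
  pause_seconds = 0.1
)
length(batches)
```

In practice, fetch_fn could be a small wrapper such as function(q) sm_search_scopus(query = q, max_count = 100).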

Next Steps

Citation

If you use SportMiner in your research, please cite:

citation("SportMiner")

References