Getting Started with SportMiner

Introduction

SportMiner is a comprehensive R package for mining, analyzing, and visualizing scientific literature in sport science domains. It provides an end-to-end workflow for:

Retrieving abstracts from the Scopus database
Preprocessing and cleaning text data
Performing advanced topic modeling (LDA, STM, CTM)
Creating publication-ready visualizations
Analyzing keyword co-occurrence networks

This vignette demonstrates the core functionality of SportMiner through a practical example.

Installation

install.packages("SportMiner")

Setting Up Your Scopus API Key

Before using SportMiner, you need a Scopus API key. You can obtain one by registering at the Elsevier Developer Portal.

library(SportMiner)

# Option 1: Set directly
sm_set_api_key("your_api_key_here")

# Option 2: Set via environment variable (recommended)
# Add to your .Renviron file:
# SCOPUS_API_KEY=your_api_key_here
# Then restart R and run:
sm_set_api_key()
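
To confirm that R can see the key after a restart, a quick check along these lines may help. This is a minimal sketch in base R: has_scopus_key() is a hypothetical helper, not a SportMiner function, and it assumes the variable name SCOPUS_API_KEY used in the .Renviron example.

```r
# Hypothetical helper (not part of SportMiner): TRUE when the
# environment variable holds a non-empty value.
has_scopus_key <- function(var = "SCOPUS_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_scopus_key()) {
  message("SCOPUS_API_KEY is not set. Add it to ~/.Renviron and restart R.")
}
```

Keeping the key in .Renviron rather than in a script also keeps it out of version control.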

Step 1: Retrieve Papers from Scopus

Let’s search for journal articles published after 2010 that mention talent identification, sport science, or athletes together with principal component analysis or cluster analysis.

# Define the search query
query <- paste0(
  'TITLE-ABS-KEY(',
  '("talent identification" OR "sport science" OR "athlete") ',
  'AND ',
  '("principal component analysis" OR "PCA" OR "cluster analysis") ',
  ') AND DOCTYPE(ar) AND PUBYEAR > 2010'
)

# Retrieve papers
papers <- sm_search_scopus(
  query = query,
  max_count = 100,
  verbose = TRUE
)

# View the data structure
head(papers[, c("title", "year", "author_keywords")])
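
If you adjust the search terms often, it may be easier to assemble the query from character vectors than to edit one long string. The sketch below uses base R only; or_group() and build_query() are hypothetical helpers, not SportMiner functions. It reproduces the query above, apart from whitespace. Note that PUBYEAR > 2010 is exclusive, so results start in 2011.

```r
# Hypothetical helper: quote each term and join the terms with OR.
or_group <- function(terms) {
  paste0("(", paste0('"', terms, '"', collapse = " OR "), ")")
}

# Hypothetical helper: combine two OR-groups into a Scopus query string
# restricted to journal articles published after `after_year`.
build_query <- function(topic_terms, method_terms, after_year) {
  paste0(
    "TITLE-ABS-KEY(", or_group(topic_terms), " AND ", or_group(method_terms), ")",
    " AND DOCTYPE(ar) AND PUBYEAR > ", after_year
  )
}

query_alt <- build_query(
  topic_terms = c("talent identification", "sport science", "athlete"),
  method_terms = c("principal component analysis", "PCA", "cluster analysis"),
  after_year = 2010
)
```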

Step 2: Preprocess Text Data

Convert the raw abstracts into clean, stemmed word counts.

# Preprocess abstracts
processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  min_word_length = 3
)

# View the processed data
head(processed_data)
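
To make the idea of word counts concrete, here is a rough base R illustration of the tokenize, filter, and count steps. It is not SportMiner's implementation: stemming and stop-word removal are left out, count_words() is a hypothetical helper, and min_word_length is assumed to mean "keep words with at least this many characters".

```r
# Hypothetical helper: lowercase the text, split on anything that is not
# a letter, drop short tokens, and count what remains.
count_words <- function(text, min_word_length = 3) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[nchar(tokens) >= min_word_length]
  counts <- table(tokens)
  data.frame(
    word = as.character(names(counts)),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

toy_counts <- count_words("Elite athletes and elite teams: PCA of 12 tests.")
```

In this toy sentence "elite" is counted twice, while "of" and the number are dropped.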

Step 3: Create Document-Term Matrix

Transform the word counts into a sparse matrix suitable for topic modeling.

# Create DTM
dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Check dimensions
print(paste("Documents:", nrow(dtm), "| Terms:", ncol(dtm)))
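
The two thresholds prune the vocabulary from both ends. The sketch below shows one plausible reading on a toy count matrix, using base R only. The interpretation is an assumption, not taken from the package: min_term_freq is treated as a minimum total count, and max_term_freq as the largest share of documents a term may appear in. prune_terms() is a hypothetical helper.

```r
# Toy document-term counts; each line of numbers below is one term (column).
toy_counts <- matrix(
  c(2, 1, 3, 1,   # "sport": appears in every document
    2, 0, 1, 0,   # "talent": appears in half of the documents
    0, 1, 0, 0),  # "rare": appears once
  nrow = 4,
  dimnames = list(paste0("doc_", 1:4), c("sport", "talent", "rare"))
)

# Hypothetical helper: keep terms that occur often enough overall but
# are not spread across too large a share of the documents.
prune_terms <- function(counts, min_term_freq = 3, max_term_freq = 0.5) {
  total_count <- colSums(counts)
  doc_share <- colSums(counts > 0) / nrow(counts)
  keep <- total_count >= min_term_freq & doc_share <= max_term_freq
  counts[, keep, drop = FALSE]
}

pruned <- prune_terms(toy_counts)
```

Under these assumptions only "talent" survives: "sport" is too widespread to separate topics, and "rare" is too infrequent to support one.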

Step 4: Select Optimal Number of Topics

Use coherence-based selection to find the best number of topics.

# Test different values of k
k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 16, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$results)
print(paste("Optimal k:", k_selection$optimal_k))
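
For intuition about what a coherence score rewards, the sketch below computes the UMass coherence of a topic's top terms on a toy matrix. The vignette does not state which coherence measure sm_select_optimal_k() uses, so treat UMass as one common choice rather than the package's method. umass_coherence() is a hypothetical helper written in base R.

```r
# Toy document-term counts; each line of numbers below is one term (column).
toy_dtm <- matrix(
  c(1, 2, 0,   # "talent"
    1, 1, 0,   # "youth"
    0, 0, 3),  # "injury"
  nrow = 3,
  dimnames = list(paste0("doc_", 1:3), c("talent", "youth", "injury"))
)

# Hypothetical helper: UMass coherence for an ordered vector of top terms.
# Each pair contributes log((co-document count + 1) / document count of
# the higher-ranked term). Every term must occur in at least one document.
umass_coherence <- function(dtm, top_terms) {
  present <- dtm[, top_terms, drop = FALSE] > 0
  score <- 0
  for (m in seq_along(top_terms)[-1]) {
    for (l in seq_len(m - 1)) {
      co_docs <- sum(present[, m] & present[, l])
      score <- score + log((co_docs + 1) / sum(present[, l]))
    }
  }
  score
}

coherent <- umass_coherence(toy_dtm, c("talent", "youth"))
incoherent <- umass_coherence(toy_dtm, c("talent", "injury"))
```

Terms that share documents ("talent" and "youth") score higher than terms that never co-occur ("talent" and "injury").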

Step 5: Train Topic Model

Fit an LDA model using the optimal k.

# Train the model (set a seed so the fit is reproducible)
lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 500,
  seed = 1729
)

Step 6: Visualize Topics

Topic Frequency Distribution

# Plot document distribution
sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)

Topic Trends Over Time

# Add doc_id to papers for joining
papers$doc_id <- paste0("doc_", seq_len(nrow(papers)))

# Plot trends
sm_plot_topic_trends(
  model = lda_model,
  dtm = dtm,
  metadata = papers,
  doc_id_col = "doc_id"
)
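
The trend plot depends on the document identifiers in the model matching those in the metadata. The base R sketch below shows the kind of join involved and a check for identifiers that would drop out. The data frames are toy stand-ins with assumed column names, and how SportMiner assigns document identifiers internally is not shown here.

```r
# Toy stand-ins: one dominant topic per modeled document, plus metadata
# that contains one extra paper (doc_5) with no topic assignment.
assignments <- data.frame(
  doc_id = paste0("doc_", 1:4),
  topic = c(1, 2, 1, 1)
)
metadata <- data.frame(
  doc_id = paste0("doc_", 1:5),
  year = c(2019, 2019, 2020, 2020, 2021)
)

# Papers without a topic assignment vanish silently from the join,
# so list them before merging.
unmatched <- setdiff(metadata$doc_id, assignments$doc_id)

joined <- merge(assignments, metadata, by = "doc_id")

# Number of documents per year and topic.
trend <- as.data.frame(table(year = joined$year, topic = joined$topic))
```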

Step 7: Keyword Co-occurrence Network

Visualize how author keywords co-occur across papers.

# Create network
network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 2,
  top_n = 30
)

print(network_plot)
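
The edges of such a network are pairwise counts: two keywords are linked once for every paper that lists both. The base R sketch below shows that counting step on toy data. count_keyword_pairs() is a hypothetical helper, not the package's implementation, and the "|" separator is an assumption about how the keyword column is delimited.

```r
# Hypothetical helper: count how many papers list each pair of keywords.
count_keyword_pairs <- function(keywords, sep = "|", min_cooccurrence = 2) {
  pairs <- lapply(strsplit(keywords, sep, fixed = TRUE), function(k) {
    # Normalize case and whitespace so "PCA" and " pca" count as one keyword.
    k <- sort(unique(trimws(tolower(k))))
    k <- k[!is.na(k) & nzchar(k)]
    if (length(k) < 2) return(NULL)
    combos <- combn(k, 2)
    paste(combos[1, ], combos[2, ], sep = " -- ")
  })
  counts <- table(unlist(pairs))
  result <- data.frame(
    pair = as.character(names(counts)),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Keep only pairs that reach the threshold, most frequent first.
  result <- result[result$n >= min_cooccurrence, , drop = FALSE]
  result[order(-result$n), , drop = FALSE]
}

toy_keywords <- c("talent | PCA | youth", "PCA | talent", "youth | injury")
edges <- count_keyword_pairs(toy_keywords)
```

With a threshold of 2, only the pair "pca -- talent" remains, because it is the only pair listed by two papers.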

Advanced: Compare Multiple Models

Compare LDA, STM, and CTM to find the best-performing model.

# Run comparison
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)

# Get recommendation
print(paste("Recommended model:", comparison$recommendation))

# Use the recommended model
best_model <- comparison$models[[tolower(comparison$recommendation)]]

Customizing Visualizations

All plotting functions use the custom theme_sportminer() theme, but you can customize further.

library(ggplot2)

# Create the base plot
p <- sm_plot_topic_frequency(lda_model, dtm)

# Add customizations
p +
  labs(
    title = "Distribution of Research Topics in Sport Science",
    subtitle = "Based on up to 100 papers from Scopus (2011-2025)"
  ) +
  theme_sportminer(base_size = 14, grid = FALSE)

Best Practices

API Rate Limits: Scopus enforces request quotas per API key. Keep max_count as small as your analysis allows, and pause between large queries.
Reproducibility: Always set seeds when running topic models:
```
sm_train_lda(dtm, k = 10, seed = 1729)
```
Hyperparameter Tuning: Experiment with min_term_freq and max_term_freq in sm_create_dtm() to balance vocabulary size and model performance.
Model Selection: Don’t rely solely on coherence. Inspect the top terms for each topic to ensure interpretability.
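
One way to space out several large queries is a small wrapper that sleeps between calls. This is a base R sketch: fetch_with_pause() is a hypothetical helper, and the two-second default is an arbitrary placeholder rather than a documented Scopus limit.

```r
# Hypothetical helper: call `fetch` on each query in turn and wait
# `pause_seconds` between consecutive calls.
fetch_with_pause <- function(queries, fetch, pause_seconds = 2) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    # Assign as a one-element list so a NULL result keeps its slot.
    results[i] <- list(fetch(queries[[i]]))
    if (i < length(queries)) Sys.sleep(pause_seconds)
  }
  results
}

# Dry run with a stand-in fetch function and no delay.
demo <- fetch_with_pause(c("a", "b"), toupper, pause_seconds = 0)

# With SportMiner, `fetch` could wrap the search call shown earlier:
# batches <- fetch_with_pause(
#   list(query_a, query_b),
#   function(q) sm_search_scopus(query = q, max_count = 100)
# )
```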

Next Steps

Explore the package documentation for a detailed function reference
Experiment with different preprocessing and modeling parameters
Contact the maintainer with bug reports and feature requests

Citation

If you use SportMiner in your research, please cite:

citation("SportMiner")

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(2), 1-40.