SportMiner: Text Mining and Topic Modeling for Sport Science Literature

Praveen D Chougale

IIT Bombay
praveenmaths89@gmail.com

Usha Ananthakumar

IIT Bombay
usha@som.iitb.ac.in

2026-01-12

1 Introduction

The exponential growth of scientific literature in sport science domains presents both opportunities and challenges for researchers. While vast amounts of knowledge are being generated, systematically synthesizing and identifying research trends has become increasingly difficult. SportMiner addresses this challenge by providing a comprehensive, integrated toolkit for mining, analyzing, and visualizing sport science literature.

1.1 Motivation

Traditional literature review methods are time-consuming and potentially biased. Researchers need automated tools to:

  1. Efficiently retrieve relevant papers from large databases
  2. Systematically process textual content at scale
  3. Discover latent themes using advanced topic modeling
  4. Visualize trends in publication patterns and research focus
  5. Identify knowledge gaps and emerging research directions

1.2 Contributions

SportMiner makes the following contributions:

  1. Integrated workflow from data retrieval through visualization
  2. Scopus API integration for systematic literature searches
  3. Multiple topic modeling algorithms (LDA, CTM, STM) with comparison tools
  4. Publication-ready visualizations following modern design principles
  5. Keyword co-occurrence networks for understanding research connections
  6. Temporal trend analysis for tracking research evolution

2 Installation and Setup

2.1 Installation

Install the released version from CRAN:

install.packages("SportMiner")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("praveenchougale/SportMiner")

2.2 API Configuration

SportMiner uses the Scopus API for literature retrieval. Obtain a free API key from the Elsevier Developer Portal.

library(SportMiner)

# Option 1: Set directly in session
sm_set_api_key("your_api_key_here")

# Option 2: Store in .Renviron (recommended)
# usethis::edit_r_environ()
# Add: SCOPUS_API_KEY=your_api_key_here
# Restart R, then:
sm_set_api_key()

3 Complete Workflow Example

This section demonstrates a complete analysis workflow from literature search through topic modeling and visualization.

3.1 Step 1: Literature Retrieval

Scopus queries follow a structured syntax with field codes and Boolean operators.

3.1.1 Basic Query Syntax

# Search in title, abstract, and keywords
query_basic <- 'TITLE-ABS-KEY("machine learning" AND "sports")'

# Search specific fields
query_title <- 'TITLE("performance prediction")'
query_abstract <- 'ABS("neural networks")'
query_keywords <- 'KEY("injury prevention")'

3.1.2 Boolean Operators and Filters

# Complex query with multiple conditions
query <- paste0(
  'TITLE-ABS-KEY(',
  '("machine learning" OR "deep learning" OR "artificial intelligence") ',
  'AND ("sports" OR "athlete*" OR "performance") ',
  'AND NOT "e-sports"',
  ') ',
  'AND DOCTYPE(ar) ',                    # Articles only
  'AND PUBYEAR > 2018 ',                 # Published after 2018
  'AND LANGUAGE(english) ',              # English only
  'AND SUBJAREA(MEDI OR HEAL OR COMP)'   # Relevant subject areas
)

3.1.3 Available Search Filters

Document Type Filters:

  - DOCTYPE(ar): Journal articles
  - DOCTYPE(re): Review articles
  - DOCTYPE(cp): Conference papers

Date Filters:

  - PUBYEAR = 2024: Exact year
  - PUBYEAR > 2019: After 2019
  - PUBYEAR > 2019 AND PUBYEAR < 2025: Between years

Subject Area Filters:

  - SUBJAREA(MEDI): Medicine
  - SUBJAREA(HEAL): Health Professions
  - SUBJAREA(COMP): Computer Science
  - SUBJAREA(PSYC): Psychology

3.2 Step 2: Text Preprocessing

Raw abstracts require preprocessing before topic modeling.

processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  doc_id_col = "doc_id",
  min_word_length = 3
)

head(processed_data)

The preprocessing pipeline performs:

  1. Tokenization: Split text into individual words
  2. Lowercasing: Convert to lowercase
  3. Stop word removal: Remove common words (the, and, of, etc.)
  4. Number removal: Remove numeric tokens
  5. Stemming: Reduce words to root forms using Porter stemmer
  6. Filtering: Keep only words with minimum length
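To make these steps concrete, here is a minimal base-R sketch of the pipeline. Note that `preprocess_sketch` is a hypothetical illustration, not part of SportMiner; `sm_preprocess_text()` is the supported interface, and step 5 (Porter stemming) is omitted here because it requires an external stemmer.

```r
# Illustrative only: a bare-bones version of steps 1-4 and 6.
preprocess_sketch <- function(text, min_word_length = 3,
                              stopwords = c("the", "and", "of", "in", "a")) {
  tokens <- unlist(strsplit(text, "[^[:alnum:]]+"))   # 1. tokenize
  tokens <- tolower(tokens)                           # 2. lowercase
  tokens <- tokens[!tokens %in% stopwords]            # 3. stop word removal
  tokens <- tokens[!grepl("^[0-9]+$", tokens)]        # 4. number removal
  tokens[nchar(tokens) >= min_word_length]            # 6. length filter
}

preprocess_sketch("The 10 athletes and the coach of Team A")
# → "athletes" "coach" "team"
```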

3.3 Step 3: Document-Term Matrix

Create a sparse matrix representation of term frequencies.

dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Matrix dimensions
print(paste("Documents:", dtm$nrow, "| Terms:", dtm$ncol))

# Sparsity
# Sparsity: share of zero cells (dtm$v stores the nonzero entries
# of the slam simple_triplet_matrix)
sparsity <- 100 * (1 - length(dtm$v) / (dtm$nrow * dtm$ncol))
print(paste("Sparsity:", round(sparsity, 2), "%"))

Parameters min_term_freq and max_term_freq control vocabulary size:

  - min_term_freq: Minimum document frequency (removes rare terms)
  - max_term_freq: Maximum document proportion (removes very common terms)
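The effect of these two thresholds can be illustrated on a plain count matrix. The helper `filter_vocab` below is hypothetical (SportMiner operates on sparse slam matrices, not dense ones); it only shows the document-frequency logic.

```r
# Illustrative only: keep terms whose document frequency is at least
# min_df and whose document proportion is at most max_prop.
filter_vocab <- function(m, min_df = 2, max_prop = 0.5) {
  df <- colSums(m > 0)                       # document frequency per term
  keep <- df >= min_df & df / nrow(m) <= max_prop
  m[, keep, drop = FALSE]
}

m <- rbind(
  c(2, 1, 0, 5),
  c(0, 1, 0, 5),
  c(1, 0, 1, 5),
  c(0, 1, 0, 5)
)
colnames(m) <- c("rare-ish", "common", "rare", "ubiquitous")

colnames(filter_vocab(m, min_df = 2, max_prop = 0.75))
# → "rare-ish" "common"
```

The term appearing in only one document and the term appearing in every document are both dropped, exactly the behavior the two parameters provide.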

3.4 Step 4: Optimal Topic Number Selection

Determine the appropriate number of topics using model evaluation metrics.

k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 20, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$metrics)
print(paste("Optimal k:", k_selection$optimal_k))

The function compares models across different values of \(k\) using perplexity, a measure of model fit (lower is better).
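Formally, for a held-out corpus of \(M\) documents, where document \(d\) has \(N_d\) tokens, perplexity is the exponentiated negative average per-token log-likelihood:

```latex
\mathrm{Perplexity}(D) = \exp\!\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)
```

A model that assigns higher probability to the held-out words yields lower perplexity, which is why lower values indicate better fit.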

3.5 Step 5: Train Topic Model

Fit a Latent Dirichlet Allocation (LDA) model with the optimal number of topics.

lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 2000,
  alpha = 50 / k_selection$optimal_k,  # Symmetric Dirichlet prior
  seed = 1729
)

# Examine top terms per topic
terms_matrix <- topicmodels::terms(lda_model, 10)
print(terms_matrix)

LDA (Blei, Ng, and Jordan 2003) models each document as a mixture of topics, where each topic is a distribution over words. The Gibbs sampling method (Griffiths and Steyvers 2004) estimates model parameters through Markov Chain Monte Carlo.
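In generative terms, LDA assumes the following sampling process, which is also what motivates the symmetric Dirichlet prior \(\alpha\) set above (the \(50/k\) heuristic follows Griffiths and Steyvers 2004):

```latex
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\eta), & k = 1,\dots,K & \quad\text{(topic-word distributions)}\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d = 1,\dots,M & \quad\text{(document-topic mixtures)}\\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d), & w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{aligned}
```

Gibbs sampling inverts this process, repeatedly resampling each token's topic assignment \(z_{d,n}\) conditional on all others until the chain converges.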

3.6 Step 6: Model Comparison

Compare multiple topic modeling approaches.

comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)
print(paste("Recommended model:", comparison$recommendation))

# Extract best model
best_model <- comparison$models[[tolower(comparison$recommendation)]]

The function compares three modeling approaches:

  - LDA: Standard Latent Dirichlet Allocation
  - CTM: Correlated Topic Model (Blei and Lafferty 2007), which allows correlations between topics
  - STM: Structural Topic Model (Roberts et al. 2014; not yet implemented)

3.7 Step 7: Visualization

3.7.1 Topic Terms Visualization

Display the most important terms for each topic.

plot_terms <- sm_plot_topic_terms(
  model = lda_model,
  n_terms = 10
)
print(plot_terms)

The visualization shows term importance (beta values) within each topic. Higher beta indicates greater relevance to the topic.
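In LDA notation, each bar corresponds to an entry of the topic-word matrix, where \(\beta_{k,v}\) is the probability of term \(v\) under topic \(k\):

```latex
\beta_{k,v} = p(w = v \mid z = k), \qquad \sum_{v=1}^{V} \beta_{k,v} = 1 \ \text{ for each topic } k
```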

3.7.2 Topic Frequency Distribution

Show how topics are distributed across the document collection.

plot_freq <- sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)
print(plot_freq)

3.7.3 Keyword Co-occurrence Network

Analyze relationships between author keywords.

network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 3,
  top_n = 30
)
print(network_plot)

Network analysis reveals:

  - Node size: Keyword frequency
  - Edge width: Co-occurrence strength
  - Communities: Clusters of related keywords
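The edge weights underlying such a network are simple pair counts. The helper `keyword_pairs` below is a hypothetical base-R sketch of that counting step (it assumes keywords are "; "-separated, and `sm_keyword_network()` remains the supported interface for parsing, thresholding, and plotting):

```r
# Illustrative only: count how often each unordered keyword pair
# co-occurs within the same paper's keyword string.
keyword_pairs <- function(keyword_strings, sep = "; ") {
  pairs <- lapply(strsplit(keyword_strings, sep, fixed = TRUE), function(kw) {
    kw <- sort(unique(tolower(kw)))
    if (length(kw) < 2) return(NULL)
    t(combn(kw, 2))                      # all unordered keyword pairs
  })
  edges <- apply(do.call(rbind, pairs), 1, paste, collapse = " -- ")
  sort(table(edges), decreasing = TRUE)  # co-occurrence counts = edge weights
}

keyword_pairs(c("GPS; football; training load",
                "training load; football",
                "GPS; injury"))
```

Here "football -- training load" receives weight 2 because the pair appears in two papers; thresholds such as min_cooccurrence simply drop low-weight edges.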

4 Advanced Usage

4.1 Custom Preprocessing

Override default preprocessing parameters.

processed_custom <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  doc_id_col = "doc_id",
  min_word_length = 4,      # Longer minimum word length
  custom_stopwords = c("study", "research", "paper")  # Additional stopwords
)

4.2 Hyperparameter Tuning

LDA performance depends on hyperparameters.

# Test different alpha values
alphas <- c(0.1, 0.5, 1.0)
results <- lapply(alphas, function(a) {
  model <- sm_train_lda(dtm, k = 10, alpha = a, seed = 1729)
  perplexity <- topicmodels::perplexity(model, dtm)
  list(alpha = a, perplexity = perplexity)
})

# Compare results as a data frame
do.call(rbind, lapply(results, as.data.frame))

4.3 Exporting Results

Save models and visualizations for publication.

# Save model
saveRDS(lda_model, "lda_model.rds")

# Save plots
ggplot2::ggsave("topic_terms.png", plot_terms,
                width = 12, height = 8, dpi = 300)
# plot_trends as produced by sm_plot_topic_trends() (see the case study)
ggplot2::ggsave("topic_trends.png", plot_trends,
                width = 12, height = 6, dpi = 300)

# Export document-topic assignments
topics <- topicmodels::topics(lda_model, 1)
papers$dominant_topic <- paste0("Topic_", topics)
write.csv(papers, "papers_with_topics.csv", row.names = FALSE)

# Export topic-term matrix
beta <- topicmodels::posterior(lda_model)$terms
write.csv(beta, "topic_term_matrix.csv")

5 Case Study: Sports Analytics Literature

This case study demonstrates SportMiner on a systematic review of sports analytics literature.

5.1 Research Question

What are the main research themes in sports analytics over the past decade, and how have they evolved?

5.2 Method

# Comprehensive search query
query_case <- paste0(
  'TITLE-ABS-KEY(',
  '("sports analytics" OR "sports data science" OR "sports informatics" OR ',
  '"performance analysis" OR "match analysis") ',
  'AND ("data" OR "analytics" OR "statistics" OR "modeling")',
  ') ',
  'AND DOCTYPE(ar OR re) ',
  'AND PUBYEAR > 2013 ',
  'AND LANGUAGE(english)'
)

# Retrieve papers
papers_case <- sm_search_scopus(query_case, max_count = 500, verbose = TRUE)

# Full preprocessing pipeline
processed_case <- sm_preprocess_text(papers_case, text_col = "abstract")
dtm_case <- sm_create_dtm(processed_case, min_term_freq = 5, max_term_freq = 0.4)

# Model selection
k_case <- sm_select_optimal_k(dtm_case, k_range = seq(6, 18, by = 2), plot = TRUE)

# Train final model
model_case <- sm_train_lda(dtm_case, k = k_case$optimal_k,
                           iter = 2000, seed = 1729)

# Visualizations
terms_plot <- sm_plot_topic_terms(model_case, n_terms = 12)
trends_plot <- sm_plot_topic_trends(model_case, dtm_case, papers_case)

5.3 Results Interpretation

The topic model with \(k = 12\) topics identified distinct research themes:

  1. Performance prediction models: Machine learning for outcome forecasting
  2. Injury prevention: Biomechanical analysis and risk assessment
  3. Tactical analysis: Team strategy and formation analysis
  4. Player evaluation: Rating systems and talent identification
  5. Training optimization: Load monitoring and periodization
  6. Computer vision: Automated video analysis
  7. Wearable sensors: Real-time monitoring systems
  8. Network analysis: Team dynamics and interactions
  9. Social media analytics: Fan engagement analysis
  10. Betting markets: Prediction markets and odds analysis
  11. Fantasy sports: Player selection algorithms
  12. Officials and refereeing: Decision-making analysis

Temporal trends reveal:

  - Increasing focus on deep learning and AI (2018-2024)
  - Declining emphasis on traditional statistical methods
  - Emerging interest in explainable AI and interpretability

6 Computational Performance

SportMiner is designed for efficiency with large document collections.

6.1 Benchmarks

# Test on varying document counts (capped at the corpus size)
sizes <- pmin(c(100, 500, 1000, 2000), dtm_case$nrow)
times <- sapply(sizes, function(n) {
  subset_dtm <- dtm_case[seq_len(n), ]
  system.time({
    sm_train_lda(subset_dtm, k = 10, iter = 1000)
  })["elapsed"]
})

# Display results
data.frame(documents = sizes, time_seconds = times)

6.2 Optimization Tips

  1. Start small: Test on subset before full corpus
  2. Reduce iterations: Use 500-1000 for exploration, 2000+ for final models
  3. Parallel processing: Enable for large k ranges in sm_select_optimal_k()
  4. DTM filtering: Aggressive term filtering reduces computational burden

7 Best Practices

7.1 Reproducibility

Always set random seeds for reproducible results:

sm_train_lda(dtm, k = 10, seed = 1729)
sm_compare_models(dtm, k = 10, seed = 1729)

7.2 Query Design

  1. Start broad, refine iteratively: Begin with general queries, narrow based on results
  2. Test on Scopus web interface: Verify query syntax and result counts
  3. Document your queries: Save queries in a text file or R script
  4. Consider synonyms: Include alternative terms and spellings

7.3 Preprocessing Decisions

7.4 Model Selection

  1. Don’t rely solely on metrics: Inspect topic terms for interpretability
  2. Check topic coherence: Topics should be semantically meaningful
  3. Consider domain knowledge: Validate topics with subject matter experts
  4. Multiple k values: Research questions may be answerable at different granularities

7.5 Visualization

All plots use theme_sportminer() for consistent aesthetics:

library(ggplot2)

# Customize theme parameters
plot_terms + theme_sportminer(base_size = 14, grid = FALSE)

8 Summary

SportMiner provides an integrated, efficient workflow for analyzing sport science literature. The package combines database querying, text preprocessing, topic modeling, and visualization in a unified framework. Researchers can rapidly identify research trends, discover thematic structures, and track field evolution over time.

8.1 Key Features

8.2 Future Development

Planned enhancements include:

8.3 Acknowledgments

We thank the reviewers for valuable feedback that improved this package.

9 References

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (1): 17–35.
Blei, David M, Andrew Y Ng, and Michael I Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022.
Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5): 1–54. https://doi.org/10.18637/jss.v025.i05.
Griffiths, Thomas L, and Mark Steyvers. 2004. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101 (suppl 1): 5228–35.
Grün, Bettina, and Kurt Hornik. 2011. “topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software 40 (13): 1–30. https://doi.org/10.18637/jss.v040.i13.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. 2019. “stm: An R Package for Structural Topic Models.” Journal of Statistical Software 91 (2): 1–40. https://doi.org/10.18637/jss.v091.i02.
Roberts, Margaret E, Brandon M Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G Rand. 2014. “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science 58 (4): 1064–82.
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.

10 Computational Details

sessionInfo()

11 Appendix: Function Reference

11.1 Data Retrieval Functions

11.2 Preprocessing Functions

11.3 Topic Modeling Functions

11.4 Visualization Functions