SportMiner: Text Mining and Topic Modeling for Sport Science Literature

Praveen D Chougale

IIT Bombay
praveenmaths89@gmail.com

Usha Ananthakumar

IIT Bombay
usha@som.iitb.ac.in

2026-01-12

1 Introduction

The exponential growth of scientific literature in sport science domains presents both opportunities and challenges for researchers. While vast amounts of knowledge are being generated, systematically synthesizing and identifying research trends has become increasingly difficult. SportMiner addresses this challenge by providing a comprehensive, integrated toolkit for mining, analyzing, and visualizing sport science literature.

1.1 Motivation

Traditional literature review methods are time-consuming and potentially biased. Researchers need automated tools to:

  1. Efficiently retrieve relevant papers from large databases
  2. Systematically process textual content at scale
  3. Discover latent themes using advanced topic modeling
  4. Visualize trends in publication patterns and research focus
  5. Identify knowledge gaps and emerging research directions

1.2 Contributions

SportMiner makes the following contributions:

  1. Integrated workflow from data retrieval through visualization
  2. Scopus API integration for systematic literature searches
  3. Multiple topic modeling algorithms (LDA, CTM, STM) with comparison tools
  4. Publication-ready visualizations following modern design principles
  5. Keyword co-occurrence networks for understanding research connections
  6. Temporal trend analysis for tracking research evolution

2 Installation and Setup

2.1 Installation

Install the released version from CRAN:

install.packages("SportMiner")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("praveenchougale/SportMiner")

2.2 API Configuration

SportMiner uses the Scopus API for literature retrieval. Obtain a free API key from the Elsevier Developer Portal.

library(SportMiner)

# Option 1: Set directly in session
sm_set_api_key("your_api_key_here")

# Option 2: Store in .Renviron (recommended)
# usethis::edit_r_environ()
# Add: SCOPUS_API_KEY=your_api_key_here
# Restart R, then:
sm_set_api_key()

3 Complete Workflow Example

This section demonstrates a complete analysis workflow from literature search through topic modeling and visualization.

3.1 Step 1: Literature Retrieval

Scopus queries follow a structured syntax with field codes and Boolean operators.

3.1.1 Basic Query Syntax

# Search in title, abstract, and keywords
query_basic <- 'TITLE-ABS-KEY("machine learning" AND "sports")'

# Search specific fields
query_title <- 'TITLE("performance prediction")'
query_abstract <- 'ABS("neural networks")'
query_keywords <- 'KEY("injury prevention")'

3.1.2 Boolean Operators and Filters

# Complex query with multiple conditions
query <- paste0(
  'TITLE-ABS-KEY(',
  '("machine learning" OR "deep learning" OR "artificial intelligence") ',
  'AND ("sports" OR "athlete*" OR "performance") ',
  'AND NOT "e-sports"',
  ') ',
  'AND DOCTYPE(ar) ',                    # Articles only
  'AND PUBYEAR > 2018 ',                 # Published after 2018
  'AND LANGUAGE(english) ',              # English only
  'AND SUBJAREA(MEDI OR HEAL OR COMP)'   # Relevant subject areas
)

3.1.3 Available Search Filters

Document Type Filters:

  - DOCTYPE(ar): Journal articles
  - DOCTYPE(re): Review articles
  - DOCTYPE(cp): Conference papers

Date Filters:

  - PUBYEAR = 2024: Exact year
  - PUBYEAR > 2019: After 2019
  - PUBYEAR > 2019 AND PUBYEAR < 2025: Between years

Subject Area Filters:

  - SUBJAREA(MEDI): Medicine
  - SUBJAREA(HEAL): Health Professions
  - SUBJAREA(COMP): Computer Science
  - SUBJAREA(PSYC): Psychology

3.2 Step 2: Text Preprocessing

Raw abstracts require preprocessing before topic modeling.

processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  doc_id_col = "doc_id",
  min_word_length = 3
)

head(processed_data)

The preprocessing pipeline performs:

  1. Tokenization: Split text into individual words
  2. Lowercasing: Convert to lowercase
  3. Stop word removal: Remove common words (the, and, of, etc.)
  4. Number removal: Remove numeric tokens
  5. Stemming: Reduce words to root forms using Porter stemmer
  6. Filtering: Keep only words with minimum length
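To make these steps concrete, here is a minimal base-R sketch of the pipeline. Note that `preprocess_sketch` is a hypothetical illustration, not part of SportMiner; `sm_preprocess_text()` is the supported interface, and step 5 (Porter stemming) is omitted here because it requires an external stemmer.

```r
# Illustrative only: a bare-bones version of steps 1-4 and 6.
preprocess_sketch <- function(text, min_word_length = 3,
                              stopwords = c("the", "and", "of", "in", "a")) {
  tokens <- unlist(strsplit(text, "[^[:alnum:]]+"))   # 1. tokenize
  tokens <- tolower(tokens)                           # 2. lowercase
  tokens <- tokens[!tokens %in% stopwords]            # 3. stop word removal
  tokens <- tokens[!grepl("^[0-9]+$", tokens)]        # 4. number removal
  tokens[nchar(tokens) >= min_word_length]            # 6. length filter
}

preprocess_sketch("The 10 athletes and the coach of Team A")
# → "athletes" "coach" "team"
```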

3.3 Step 3: Document-Term Matrix

Create a sparse matrix representation of term frequencies.

dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Matrix dimensions
print(paste("Documents:", dtm$nrow, "| Terms:", dtm$ncol))

# Sparsity
# Sparsity: share of zero cells (dtm$v stores the nonzero entries
# of the slam simple_triplet_matrix)
sparsity <- 100 * (1 - length(dtm$v) / (dtm$nrow * dtm$ncol))
print(paste("Sparsity:", round(sparsity, 2), "%"))

Parameters min_term_freq and max_term_freq control vocabulary size:

  - min_term_freq: Minimum document frequency (removes rare terms)
  - max_term_freq: Maximum document proportion (removes very common terms)
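The effect of these two thresholds can be illustrated on a plain count matrix. The helper `filter_vocab` below is hypothetical (SportMiner operates on sparse slam matrices, not dense ones); it only shows the document-frequency logic.

```r
# Illustrative only: keep terms whose document frequency is at least
# min_df and whose document proportion is at most max_prop.
filter_vocab <- function(m, min_df = 2, max_prop = 0.5) {
  df <- colSums(m > 0)                       # document frequency per term
  keep <- df >= min_df & df / nrow(m) <= max_prop
  m[, keep, drop = FALSE]
}

m <- rbind(
  c(2, 1, 0, 5),
  c(0, 1, 0, 5),
  c(1, 0, 1, 5),
  c(0, 1, 0, 5)
)
colnames(m) <- c("rare-ish", "common", "rare", "ubiquitous")

colnames(filter_vocab(m, min_df = 2, max_prop = 0.75))
# → "rare-ish" "common"
```

The term appearing in only one document and the term appearing in every document are both dropped, exactly the behavior the two parameters provide.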

3.4 Step 4: Optimal Topic Number Selection

Determine the appropriate number of topics using model evaluation metrics.

k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 20, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$metrics)
print(paste("Optimal k:", k_selection$optimal_k))

The function compares models across different values of \(k\) using perplexity, a measure of model fit (lower is better).
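Formally, for a held-out corpus of \(M\) documents, where document \(d\) has \(N_d\) tokens, perplexity is the exponentiated negative average per-token log-likelihood:

```latex
\mathrm{Perplexity}(D) = \exp\!\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)
```

A model that assigns higher probability to the held-out words yields lower perplexity, which is why lower values indicate better fit.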

3.5 Step 5: Train Topic Model

Fit a Latent Dirichlet Allocation (LDA) model with the optimal number of topics.

lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 2000,
  alpha = 50 / k_selection$optimal_k,  # Symmetric Dirichlet prior
  seed = 1729
)

# Examine top terms per topic
terms_matrix <- topicmodels::terms(lda_model, 10)
print(terms_matrix)

LDA (Blei, Ng, and Jordan 2003) models each document as a mixture of topics, where each topic is a distribution over words. The Gibbs sampling method (Griffiths and Steyvers 2004) estimates model parameters through Markov Chain Monte Carlo.
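In generative terms, LDA assumes the following sampling process, which is also what motivates the symmetric Dirichlet prior \(\alpha\) set above (the \(50/k\) heuristic follows Griffiths and Steyvers 2004):

```latex
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\eta), & k = 1,\dots,K & \quad\text{(topic-word distributions)}\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d = 1,\dots,M & \quad\text{(document-topic mixtures)}\\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d), & w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{aligned}
```

Gibbs sampling inverts this process, repeatedly resampling each token's topic assignment \(z_{d,n}\) conditional on all others until the chain converges.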

3.6 Step 6: Model Comparison

Compare multiple topic modeling approaches.

comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)
print(paste("Recommended model:", comparison$recommendation))

# Extract best model
best_model <- comparison$models[[tolower(comparison$recommendation)]]

The function compares three modeling approaches:

  - LDA: Standard Latent Dirichlet Allocation
  - CTM: Correlated Topic Model (Blei and Lafferty 2007), which allows correlations between topics
  - STM: Structural Topic Model (Roberts et al. 2014; not yet implemented)

3.7 Step 7: Visualization

3.7.1 Topic Terms Visualization

Display the most important terms for each topic.

plot_terms <- sm_plot_topic_terms(
  model = lda_model,
  n_terms = 10
)
print(plot_terms)

The visualization shows term importance (beta values) within each topic. Higher beta indicates greater relevance to the topic.
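In LDA notation, each bar corresponds to an entry of the topic-word matrix, where \(\beta_{k,v}\) is the probability of term \(v\) under topic \(k\):

```latex
\beta_{k,v} = p(w = v \mid z = k), \qquad \sum_{v=1}^{V} \beta_{k,v} = 1 \ \text{ for each topic } k
```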

3.7.2 Topic Frequency Distribution

Show how topics are distributed across the document collection.

plot_freq <- sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)
print(plot_freq)

3.7.3 Keyword Co-occurrence Network

Analyze relationships between author keywords.

network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 3,
  top_n = 30
)
print(network_plot)

Network analysis reveals:

  - Node size: Keyword frequency
  - Edge width: Co-occurrence strength
  - Communities: Clusters of related keywords
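The edge weights underlying such a network are simple pair counts. The helper `keyword_pairs` below is a hypothetical base-R sketch of that counting step (it assumes keywords are "; "-separated, and `sm_keyword_network()` remains the supported interface for parsing, thresholding, and plotting):

```r
# Illustrative only: count how often each unordered keyword pair
# co-occurs within the same paper's keyword string.
keyword_pairs <- function(keyword_strings, sep = "; ") {
  pairs <- lapply(strsplit(keyword_strings, sep, fixed = TRUE), function(kw) {
    kw <- sort(unique(tolower(kw)))
    if (length(kw) < 2) return(NULL)
    t(combn(kw, 2))                      # all unordered keyword pairs
  })
  edges <- apply(do.call(rbind, pairs), 1, paste, collapse = " -- ")
  sort(table(edges), decreasing = TRUE)  # co-occurrence counts = edge weights
}

keyword_pairs(c("GPS; football; training load",
                "training load; football",
                "GPS; injury"))
```

Here "football -- training load" receives weight 2 because the pair appears in two papers; thresholds such as min_cooccurrence simply drop low-weight edges.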

4 Advanced Usage

4.1 Custom Preprocessing

Override default preprocessing parameters.

processed_custom <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  doc_id_col = "doc_id",
  min_word_length = 4,      # Longer minimum word length
  custom_stopwords = c("study", "research", "paper")  # Additional stopwords
)

4.2 Hyperparameter Tuning

LDA performance depends on hyperparameters.

# Test different alpha values
alphas <- c(0.1, 0.5, 1.0)
results <- lapply(alphas, function(a) {
  model <- sm_train_lda(dtm, k = 10, alpha = a, seed = 1729)
  perplexity <- topicmodels::perplexity(model, dtm)
  list(alpha = a, perplexity = perplexity)
})

# Compare results as a data frame
do.call(rbind, lapply(results, as.data.frame))

4.3 Exporting Results

Save models and visualizations for publication.

# Save model
saveRDS(lda_model, "lda_model.rds")

# Save plots
ggplot2::ggsave("topic_terms.png", plot_terms,
                width = 12, height = 8, dpi = 300)
# plot_trends as produced by sm_plot_topic_trends() (see the case study)
ggplot2::ggsave("topic_trends.png", plot_trends,
                width = 12, height = 6, dpi = 300)

# Export document-topic assignments
topics <- topicmodels::topics(lda_model, 1)
papers$dominant_topic <- paste0("Topic_", topics)
write.csv(papers, "papers_with_topics.csv", row.names = FALSE)

# Export topic-term matrix
beta <- topicmodels::posterior(lda_model)$terms
write.csv(beta, "topic_term_matrix.csv")

5 Case Study: Sports Analytics Literature

This case study demonstrates SportMiner on a systematic review of sports analytics literature.

5.1 Research Question

What are the main research themes in sports analytics over the past decade, and how have they evolved?

5.2 Method

# Comprehensive search query
query_case <- paste0(
  'TITLE-ABS-KEY(',
  '("sports analytics" OR "sports data science" OR "sports informatics" OR ',
  '"performance analysis" OR "match analysis") ',
  'AND ("data" OR "analytics" OR "statistics" OR "modeling")',
  ') ',
  'AND DOCTYPE(ar OR re) ',
  'AND PUBYEAR > 2013 ',
  'AND LANGUAGE(english)'
)

# Retrieve papers
papers_case <- sm_search_scopus(query_case, max_count = 500, verbose = TRUE)

# Full preprocessing pipeline
processed_case <- sm_preprocess_text(papers_case, text_col = "abstract")
dtm_case <- sm_create_dtm(processed_case, min_term_freq = 5, max_term_freq = 0.4)

# Model selection
k_case <- sm_select_optimal_k(dtm_case, k_range = seq(6, 18, by = 2), plot = TRUE)

# Train final model
model_case <- sm_train_lda(dtm_case, k = k_case$optimal_k,
                           iter = 2000, seed = 1729)

# Visualizations
terms_plot <- sm_plot_topic_terms(model_case, n_terms = 12)
trends_plot <- sm_plot_topic_trends(model_case, dtm_case, papers_case)

5.3 Results Interpretation

The topic model with \(k = 12\) topics identified distinct research themes:

  1. Performance prediction models: Machine learning for outcome forecasting
  2. Injury prevention: Biomechanical analysis and risk assessment
  3. Tactical analysis: Team strategy and formation analysis
  4. Player evaluation: Rating systems and talent identification
  5. Training optimization: Load monitoring and periodization
  6. Computer vision: Automated video analysis
  7. Wearable sensors: Real-time monitoring systems
  8. Network analysis: Team dynamics and interactions
  9. Social media analytics: Fan engagement analysis
  10. Betting markets: Prediction markets and odds analysis
  11. Fantasy sports: Player selection algorithms
  12. Officials and refereeing: Decision-making analysis

Temporal trends reveal:

  - Increasing focus on deep learning and AI (2018-2024)
  - Declining emphasis on traditional statistical methods
  - Emerging interest in explainable AI and interpretability

6 Computational Performance

SportMiner is designed for efficiency with large document collections.

6.1 Benchmarks

# Test on varying document counts (capped at the corpus size)
sizes <- pmin(c(100, 500, 1000, 2000), dtm_case$nrow)
times <- sapply(sizes, function(n) {
  subset_dtm <- dtm_case[seq_len(n), ]
  system.time({
    sm_train_lda(subset_dtm, k = 10, iter = 1000)
  })["elapsed"]
})

# Display results
data.frame(documents = sizes, time_seconds = times)

6.2 Optimization Tips

  1. Start small: Test on subset before full corpus
  2. Reduce iterations: Use 500-1000 for exploration, 2000+ for final models
  3. Parallel processing: Enable for large k ranges in sm_select_optimal_k()
  4. DTM filtering: Aggressive term filtering reduces computational burden

7 Best Practices

7.1 Reproducibility

Always set random seeds for reproducible results:

sm_train_lda(dtm, k = 10, seed = 1729)
sm_compare_models(dtm, k = 10, seed = 1729)

7.2 Query Design

  1. Start broad, refine iteratively: Begin with general queries, narrow based on results
  2. Test on Scopus web interface: Verify query syntax and result counts
  3. Document your queries: Save queries in a text file or R script
  4. Consider synonyms: Include alternative terms and spellings

7.3 Preprocessing Decisions

7.4 Model Selection

  1. Don’t rely solely on metrics: Inspect topic terms for interpretability
  2. Check topic coherence: Topics should be semantically meaningful
  3. Consider domain knowledge: Validate topics with subject matter experts
  4. Multiple k values: Research questions may be answerable at different granularities

7.5 Visualization

All plots use theme_sportminer() for consistent aesthetics:

library(ggplot2)

# Customize theme parameters
plot_terms + theme_sportminer(base_size = 14, grid = FALSE)

8 Summary

SportMiner provides an integrated, efficient workflow for analyzing sport science literature. The package combines database querying, text preprocessing, topic modeling, and visualization in a unified framework. Researchers can rapidly identify research trends, discover thematic structures, and track field evolution over time.

8.1 Key Features

8.2 Future Development

Planned enhancements include:

8.3 Acknowledgments

We thank the reviewers for valuable feedback that improved this package.

9 References

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (1): 17–35.
Blei, David M, Andrew Y Ng, and Michael I Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022.
Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5): 1–54. https://doi.org/10.18637/jss.v025.i05.
Griffiths, Thomas L, and Mark Steyvers. 2004. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101 (suppl 1): 5228–35.
Grün, Bettina, and Kurt Hornik. 2011. “topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software 40 (13): 1–30. https://doi.org/10.18637/jss.v040.i13.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. 2019. “stm: An R Package for Structural Topic Models.” Journal of Statistical Software 91 (2): 1–40. https://doi.org/10.18637/jss.v091.i02.
Roberts, Margaret E, Brandon M Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G Rand. 2014. “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science 58 (4): 1064–82.
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.

10 Computational Details

sessionInfo()

11 Appendix: Function Reference

11.1 Data Retrieval Functions

11.2 Preprocessing Functions

11.3 Topic Modeling Functions

11.4 Visualization Functions