---
title: "Getting Started with BCFM: A Complete Workflow"
author: "Allison Tegge, Virginia Tech"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    number_sections: false
vignette: >
  %\VignetteIndexEntry{Getting Started with BCFM: A Complete Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6
)

library(BCFM)
library(dplyr)    
library(tidyr)    
library(ggplot2)   
```

# Introduction

This vignette demonstrates a complete workflow for using the **BCFM** (Bayesian Clustering Factor Models) package. We'll use a simulated dataset that comes with the package to illustrate:

- Data preparation and exploration
- Model selection across different numbers of groups and factors
- Fitting the optimal model
- Comprehensive visualization of results

---

# Load and Explore the Data

For this tutorial, we use a simulated dataset included in the BCFM package. The dataset contains 200 observations, 20 variables, and 5 time points, with a known structure of 4 groups and 3 latent factors.
```{r simulated-data-exploration}
# Load the simulated dataset
data("sim.data", package = "BCFM")

# Examine the structure
str(sim.data)

# Extract dimensions
n_obs <- dim(sim.data)[1]      # 200 observations
n_vars <- dim(sim.data)[2]     # 20 variables

cat("Dataset dimensions:\n")
cat("  Observations:", n_obs, "\n")
cat("  Variables:", n_vars, "\n")
```

---

# Model Selection

The first step is to identify the optimal number of groups (clusters) and latent factors. We use `BCFM.model.selection()` to fit models across a grid of possible configurations and compare them using IC.
```{r run-model-selection}
# Specify variables to use for clustering
cluster.vars <- paste0("V", 1:n_vars)

# Create output directory for results (using tempdir() for CRAN compliance)
output_dir <- file.path(tempdir(), "BCFM_analysis")

# Run model selection
# Note: grouplist and factorlist can be vectors (e.g., c(2, 3, 4)) or sequences (e.g., 2:4)
# The function will fit a model for each combination of groups and factors
BCFM.model.selection(
  data = sim.data,
  cluster.vars = cluster.vars,
  grouplist = 4,          # Number of groups
  factorlist = 3,         # Number of factors
  n.iter = 100,           # Number of MCMC iterations (use 50000+ for real analysis)
  burnin = 10,            # Burnin for IC calculation (use 10000+ for real analysis)
  every = 10,             # Retains the output of every "every" MCMC iterations
  cluster.size = 0.01,    # Minimum proportion required for each cluster (default 0.05)
  output_dir = output_dir, # Specify where to save results
  seed = 123              # Optional seed for reproducibility
)
```

> **Note:** For demonstration, we use minimal iterations (100) and burnin (10). For actual analysis, use `n.iter = 50000` and `burnin = 10000` or higher to ensure convergence.

> **Output Directory:** The `output_dir` parameter specifies where model results will be saved. By default, results are saved to `tempdir()`. For your own analyses, you can specify any directory where you have write permissions.

---

# Examining Output Files

After running model selection, check the output directory:
```{r check-output}
# Check what files were created
list.files(output_dir)

# Load IC results (if multiple models were fitted)
load(file.path(output_dir, "IC.Rdata"))
print(IC.matrix)

# Load the fitted model results
load(file.path(output_dir, "results-covarianceF-g4-f3.Rdata"))

# Examine the structure
str(SDresult$Result, max.level = 1)
```

When testing multiple configurations (e.g., `grouplist = 2:4, factorlist = 2:4`), an `IC.Rdata` file is created for model comparison.

---

# Visualization

The BCFM package provides comprehensive visualization functions to explore your results.

## Factor Loadings Matrix

The factor loadings matrix **B** shows the relationship between observed variables and latent factors. Each column represents a latent factor, and each row represents a variable. Higher absolute values indicate stronger associations.
```{r factor-loadings-matrix}
# Check dimensions and calculate posterior mean of B (after burnin)
n_iter <- dim(SDresult$Result$B)[1]
burnin <- min(100, floor(n_iter * 0.2))  # Use 20% as burnin or 100, whichever is smaller

cat("Total iterations:", n_iter, "\n")
cat("Using burnin:", burnin, "\n")

# Calculate posterior mean after burnin
B.matrix <- apply(SDresult$Result$B[burnin:n_iter, , ], MARGIN = c(2, 3), FUN = mean)
B.matrix <- as.data.frame(B.matrix)

# Set variable names
rownames(B.matrix) <- paste0("Var", 1:nrow(B.matrix))

# Set informative column names for factors
colnames(B.matrix) <- paste0("Factor ", 1:ncol(B.matrix))

# Print the factor loadings matrix
knitr::kable(round(B.matrix, 3), 
             caption = "Posterior Mean Factor Loadings",
             align = 'c')
```

## Latent Profiles

Visualize the mean profiles for each cluster:
```{r latent-profiles-plot, results='hide'}
ggplot_latent.profiles(Gibbs = SDresult$Result)
```

## Cluster Means

Explore the posterior distributions of cluster means:
```{r mu-density-plot, results='hide'}
ggplot_mu.density(Gibbs = SDresult$Result, add.legend = TRUE)
```

## Cluster Probabilities

### Density Plots

Examine the posterior distributions of cluster membership probabilities:
```{r probs-density-plot, results='hide'}
ggplot_probs.density(Gibbs = SDresult$Result)
```

### Trace Plots

Check convergence of cluster probabilities:
```{r probs-trace-plot, results='hide'}
ggplot_probs.trace(Gibbs = SDresult$Result)
```

## Cluster Assignments Heatmap

Visualize the cluster membership assignments over MCMC iterations:
```{r heatmap-plot, results='hide'}
ggplot_Zit.heatmap(Gibbs = SDresult$Result)
```

## Additional Visualizations

The BCFM package includes several other plotting functions to explore your results in depth:

- `ggplot_B.trace()` - Check convergence of factor loadings
- `ggplot_B.CI()` - Examine credible intervals for loadings
- `ggplot_omega.density()` - Visualize within-cluster covariance structures
- `ggplot_sigma2.CI()` - Examine error variances
- `ggplot_tau.CI()` - Examine factor variances
- `ggplot_variability()` - Understand variance decomposition across factors

See the function documentation for details.

---

# Interpreting Results

## Cluster Assignments

The heatmap shows how observations are assigned to clusters. Stable assignments (consistent colors across iterations) indicate good convergence and clear cluster structure.

## Factor Loadings

The loading matrix **B** shows which variables are associated with each latent factor. Large credible intervals that include zero suggest weak associations.

## Cluster Characteristics

The latent profiles plot reveals the distinctive patterns of each cluster across variables, helping to interpret what makes each cluster unique.

## Convergence Diagnostics

The trace plots for factor loadings and cluster probabilities help assess MCMC convergence. Look for:

- No trends over iterations (stationary)
- Good mixing (values move freely)
- Similar patterns across chains

## Variance Components

- **Sigma² (Error variances)**: Larger values indicate more unexplained variance for those variables
- **Tau (Factor variances)**: Shows the relative importance of each latent factor
- **Variance decomposition**: Illustrates how much variance each factor explains

---

# Accessing Model Components

You can directly access various components of the fitted model:
```{r model-components}
# Cluster assignments (most likely cluster for each observation)
cluster_assignments <- SDresult$Result$Z

# Loading matrix
loadings <- SDresult$Result$B

# Cluster means
cluster_means <- SDresult$Result$mu

# Cluster probabilities
cluster_probs <- SDresult$Result$probs

# Factor scores
factor_scores <- SDresult$Result$X

# Show dimensions
cat("Dimensions of key components:\n")
cat("  B (loadings):", dim(loadings), "\n")
cat("  mu (means):", dim(cluster_means), "\n")
cat("  X (scores):", dim(factor_scores), "\n")
```

---

# Summary

This vignette covered:

1. **Model Selection** - Using `BCFM.model.selection()` to fit models
2. **Output Examination** - Exploring result files and IC matrices
3. **Factor Analysis** - Interpreting loadings, convergence, and variance
4. **Clustering** - Visualizing cluster profiles and assignments
5. **Diagnostics** - Assessing model fit and convergence

For more details, see `?BCFM.model.selection` and individual function help pages.

---

# Session Info
```{r session-info}
sessionInfo()
```
