---
title: "LightGBM models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{LightGBM models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
if (requireNamespace("lightgbm", quietly = TRUE)) {
  library(tidypredict)
  library(lightgbm)
  library(dplyr)
  eval_code <- TRUE
} else {
  eval_code <- FALSE
}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = eval_code
)
```

| Function                                                      |Works|
|---------------------------------------------------------------|-----|
|`tidypredict_fit()`, `tidypredict_sql()`, `parse_model()`      |  ✔  |
|`tidypredict_to_column()`                                      |  ✔  |
|`tidypredict_test()`                                           |  ✔  |
|`tidypredict_interval()`, `tidypredict_sql_interval()`         |  ✗  |
|`parsnip`                                                      |  ✔  |

## `tidypredict_` functions

```{r}
library(lightgbm)

# Prepare data
X <- data.matrix(mtcars[, c("mpg", "cyl", "disp")])
y <- mtcars$hp

dtrain <- lgb.Dataset(X, label = y, colnames = c("mpg", "cyl", "disp"))

model <- lgb.train(
  params = list(
    num_leaves = 4L,
    learning_rate = 0.5,
    objective = "regression",
    min_data_in_leaf = 1L
  ),
  data = dtrain,
  nrounds = 10L,
  verbose = -1L
)
```

- Create the R formula
    ```{r}
tidypredict_fit(model)
    ```

- Add the prediction to the original table
    ```{r}
library(dplyr)

mtcars %>%
  tidypredict_to_column(model) %>%
  glimpse()
    ```

- Confirm that `tidypredict` results match the model's `predict()` results. The `xg_df` argument expects the data set as a matrix.
    ```{r}
tidypredict_test(model, xg_df = X)
    ```
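
- Generate the SQL that computes the prediction inside a database. A simulated `dbplyr` connection is enough for illustration:
    ```{r, eval = eval_code && requireNamespace("dbplyr", quietly = TRUE)}
tidypredict_sql(model, dbplyr::simulate_dbi())
    ```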

## Supported objectives

LightGBM supports many objective functions. `tidypredict` supports the following, grouped by the transform that is applied to the raw sum of the trees:

### Regression objectives (identity transform)

- `regression` / `regression_l2` (default)
- `regression_l1`
- `huber`
- `fair`
- `quantile`
- `mape`

### Regression objectives (exp transform)

- `poisson`
- `gamma`
- `tweedie`
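
For instance, with `poisson` the fitted formula applies the `exp()` transform to the sum of the trees. A minimal sketch, reusing the `mtcars` matrix from above with `carb` standing in for a count outcome:

```{r}
dtrain_pois <- lgb.Dataset(X, label = mtcars$carb, colnames = c("mpg", "cyl", "disp"))

model_pois <- lgb.train(
  params = list(
    num_leaves = 4L,
    learning_rate = 0.5,
    objective = "poisson",
    min_data_in_leaf = 1L
  ),
  data = dtrain_pois,
  nrounds = 2L,
  verbose = -1L
)

tidypredict_fit(model_pois)
```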

### Binary classification (sigmoid transform)

- `binary`
- `cross_entropy`

### Multiclass classification

- `multiclass` (softmax transform)
- `multiclassova` (per-class sigmoid)

## Binary classification example

```{r}
X_bin <- data.matrix(mtcars[, c("mpg", "cyl", "disp")])
y_bin <- mtcars$am

dtrain_bin <- lgb.Dataset(X_bin, label = y_bin, colnames = c("mpg", "cyl", "disp"))

model_bin <- lgb.train(
  params = list(
    num_leaves = 4L,
    learning_rate = 0.5,
    objective = "binary",
    min_data_in_leaf = 1L
  ),
  data = dtrain_bin,
  nrounds = 10L,
  verbose = -1L
)

tidypredict_test(model_bin, xg_df = X_bin)
```
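
As with regression, `tidypredict_to_column()` appends the prediction, which for a binary model is the predicted probability, in a column named `fit` by default:

```{r}
mtcars %>%
  tidypredict_to_column(model_bin) %>%
  select(am, fit) %>%
  head()
```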

## Multiclass classification

For multiclass models, `tidypredict_fit()` returns a named list of formulas, one for each class:

```{r}
X_iris <- data.matrix(iris[, 1:4])
colnames(X_iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
y_iris <- as.integer(iris$Species) - 1L

dtrain_iris <- lgb.Dataset(X_iris, label = y_iris, colnames = colnames(X_iris))

model_multi <- lgb.train(
  params = list(
    num_leaves = 4L,
    learning_rate = 0.5,
    objective = "multiclass",
    num_class = 3L,
    min_data_in_leaf = 1L
  ),
  data = dtrain_iris,
  nrounds = 5L,
  verbose = -1L
)

fit_formulas <- tidypredict_fit(model_multi)
names(fit_formulas)
```

Each formula produces the predicted probability for that class:

```{r}
iris %>%
  mutate(
    prob_setosa = !!fit_formulas$class_0,
    prob_versicolor = !!fit_formulas$class_1,
    prob_virginica = !!fit_formulas$class_2
  ) %>%
  select(Species, starts_with("prob_")) %>%
  head()
```

Note that `tidypredict_test()` does not support multiclass models; use `tidypredict_fit()` directly and verify the results by hand, as shown below.
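
A minimal sketch of such a check; the `predict()` comparison assumes a lightgbm version (4.0 or later) that returns multiclass predictions as a matrix with one column per class:

```{r}
probs <- iris %>%
  transmute(
    class_0 = !!fit_formulas$class_0,
    class_1 = !!fit_formulas$class_1,
    class_2 = !!fit_formulas$class_2
  )

# Softmax probabilities should sum to 1 in every row
range(rowSums(probs))

# Largest absolute difference from the model's own predictions
max(abs(as.matrix(probs) - predict(model_multi, X_iris)))
```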

## Categorical features

LightGBM supports native categorical features. When a feature is marked as categorical, `tidypredict` generates appropriate `%in%` conditions:

```{r}
set.seed(123)
n <- 200
cat_data <- data.frame(
  cat_feat = sample(0:3, n, replace = TRUE)
)
cat_data$y <- ifelse(cat_data$cat_feat %in% c(0, 1), 10, -10) + rnorm(n, sd = 2)

X_cat <- matrix(cat_data$cat_feat, ncol = 1)
colnames(X_cat) <- "cat_feat"

dtrain_cat <- lgb.Dataset(
  X_cat,
  label = cat_data$y,
  categorical_feature = "cat_feat"
)

model_cat <- lgb.train(
  params = list(
    num_leaves = 4L,
    learning_rate = 1.0,
    objective = "regression",
    min_data_in_leaf = 1L
  ),
  data = dtrain_cat,
  nrounds = 2L,
  verbose = -1L
)

tidypredict_fit(model_cat)
```
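
In SQL, these `%in%` conditions become `IN` clauses. For illustration, again with a simulated `dbplyr` connection:

```{r, eval = eval_code && requireNamespace("dbplyr", quietly = TRUE)}
tidypredict_sql(model_cat, dbplyr::simulate_dbi())
```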

## parsnip

Models fitted with `parsnip` using the `lightgbm` engine (provided by the `bonsai` package) are also supported by `tidypredict`:

```{r, eval = eval_code && requireNamespace("parsnip", quietly = TRUE) && requireNamespace("bonsai", quietly = TRUE)}
library(parsnip)
library(bonsai)

p_model <- boost_tree(
  trees = 10,
  tree_depth = 3,
  min_n = 1
) %>%
  set_engine("lightgbm") %>%
  set_mode("regression") %>%
  fit(hp ~ mpg + cyl + disp, data = mtcars)

# Extract the underlying lgb.Booster
lgb_model <- p_model$fit

tidypredict_test(lgb_model, xg_df = X)
```

## Parse model spec

Here is an example of the model spec:
```{r}
pm <- parse_model(model)
str(pm, 2)
```

```{r}
str(pm$trees[1])
```
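
Because the parsed spec is plain list data, it can be saved and re-loaded without the original `lgb.Booster`, following the save/re-load workflow from the main `tidypredict` documentation. A sketch, assuming the `yaml` package is available:

```{r, eval = eval_code && requireNamespace("yaml", quietly = TRUE)}
library(yaml)

tmp <- tempfile(fileext = ".yml")
write_yaml(pm, tmp)

# Restore the class attributes so tidypredict recognizes the spec
reloaded <- as_parsed_model(read_yaml(tmp))

tidypredict_fit(reloaded)
```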

## Limitations

- Ranking objectives (`lambdarank`, `rank_xendcg`) are not supported
- Prediction intervals are not supported
- `tidypredict_test()` does not support multiclass models
- LightGBM uses 32-bit floats for split thresholds, which may cause prediction discrepancies at exact split boundaries. See the [float precision](float-precision.html) article for details.
