---
title: "A Practical Example for Working with Datasets"
author: "Zeynel Cebeci"
date: "2026-2-14"
output:
  html_document:
    theme: cosmo
    highlight: tango
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{A Practical Example for Working with Datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
oldopts <- options(width = 120)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = " ",
  fig.width=7,
  fig.height=5
)

```

<style>
body {
max-width: 90%;
margin: auto;
font-size: 11pt;
line-height: 1.2;
text-align: justify;
}

h1 { font-size: 2.2em; }
h2 { font-size: 1.8em; }
h3 { font-size: 1.5em; }

p, li {font-size: 1.05em;}
code, pre {font-size: 0.95em;}
table {font-size: 0.95em;}
</style>

# Introduction

The Normality Transformation via Optimized Skewness and Kurtosis (OSKT) is a normality method that simultaneously evaluates deviations in skewness and kurtosis of non-normal data.

If the package `osktnorm` has already been installed, load it into R working environment by using the following command:

```{r loadlib, eval=TRUE, message=FALSE, warning=FALSE}
library(osktnorm)

```

# Normalizing Data from an Excel File

## Load Required Packages for Excel I/O
To handle Excel file formats, the readxl package is used for importing data and writexl for exporting 
results. The following code chunk checks for their presence and installs them if necessary before loading.

```{r load-excel-pkgs, message=FALSE, warning=FALSE}
if (!requireNamespace("readxl", quietly = TRUE)) install.packages("readxl")
if (!requireNamespace("writexl", quietly = TRUE)) install.packages("writexl")

library(readxl)
library(writexl)

```

## Read Excel File

The target dataset (sativapheno.xlsx) is imported from the package's internal extdata directory. 
The code then subsets specific columns of interest and applies a filter to retain only numeric 
variables, ensuring the data is compatible with the normalization algorithms.

```{r read-excel, message=FALSE, warning=FALSE}
file_path <- system.file("extdata", "sativapheno.xlsx", package = "osktnorm")
pheno <- read_excel(file_path)

# Select some variables from the pheno
phenoset <- as.data.frame(pheno[, c(4,5,6,7,9,10,12,13,14)])
phenoset <- phenoset[, sapply(phenoset, is.numeric)]

head(phenoset)

```

## Visualization of Variables
This section focuses on evaluating the distributional characteristics of the raw phenotypic data before any transformation. 
Visual inspection is a critical first step in determining the degree of skewness and kurtosis present in 
each trait.

### Boxplots
A grid of histograms is generated to visualize the distribution of each variable.
For each plot, the actual data distribution (black line) is compared against a theoretical 
normal distribution curve (red dashed line). This comparison highlights which traits deviate 
significantly from normality.

```{r box-plots, message=FALSE, warning=FALSE}
oldpar <- par(no.readonly = TRUE)
nvar <- ncol(phenoset)
nrow_plot <- ceiling(sqrt(nvar))
ncol_plot <- ceiling(nvar / nrow_plot)

par(mfrow = c(nrow_plot, ncol_plot), mar = c(4, 4, 3, 1))
cols <- rainbow(nvar)

for(i in seq_len(nvar)) {
  x <- phenoset[[i]]
  x <- x[!is.na(x)]
  
  if(length(x) == 0) next
  h_calc <- hist(x, plot = FALSE) 
  d <- density(x)   
  max_y <- max(c(h_calc$density, d$y)) * 1.1
  hist(x, 
       breaks = 15, 
       freq = FALSE, 
       col = adjustcolor(cols[i], 0.5), 
       border = "white",
       main = colnames(phenoset)[i],
       xlab = "Transformed Value", 
       ylab = "Density",
       ylim = c(0, max_y))
  
  lines(d, col = "black", lwd = 2)
  curve(dnorm(x, mean = mean(x), sd = sd(x)), add = TRUE, col = "red", lty = 2, lwd = 2)
}
par(oldpar)
```

### Normality Tests for Original Values

To provide statistical evidence of non-normality, the Shapiro-Wilk test is applied to each trait.
 A data frame is created to summarize the W statistic and p-values. Traits with a p-value <0.05
 are flagged as "NO" (not normal), justifying the need for OSKT normalization.

```{r densplots, message=FALSE, warning=FALSE}
shapiro_results <- data.frame(
  Trait = colnames(phenoset),
  W     = NA_real_,
  pval  = NA_real_,
  IsNormal = NA_character_,
  stringsAsFactors = FALSE
)

for(i in seq_len(ncol(phenoset))) {
  x <- phenoset[[i]]
  x <- x[!is.na(x)]
  
  if(length(x) >= 3) {
    test <- shapiro.test(x)
    shapiro_results$W[i] <- test$statistic
    shapiro_results$pval[i] <- test$p.value
    shapiro_results$IsNormal[i] <- ifelse(test$p.value > 0.05, "YES", "NO")
  } else {
    shapiro_results$W[i] <- NA
    shapiro_results$pval[i] <- NA
    shapiro_results$IsNormal[i] <- NA
  }
}
shapiro_results

```

## Normalization with OSKT
The core normalization process is executed using the osktnorm function. By setting normtests = FALSE, the function focuses on calculating the optimal transformation parameters and generating the normalized 
data frame (phenonormal), where each variable is adjusted to achieve a distribution as close to 
normal as possible.

```{r batch-normalization, message=FALSE, warning=FALSE}

# Normalization only 

norm_res1 <- osktnorm(phenoset, normtests = FALSE)
phenonormal <- norm_res1$normalized
head(phenonormal)

```

## Visualizing the Normalized Values
Following the transformation, the data is re-visualized to confirm the effectiveness of 
the OSKT method. This allows for a direct comparison between the initial skewed distributions 
and the newly normalized results.

### Boxplots 

Boxplots are utilized to examine the center, spread, and outliers of the normalized traits. 
This visualization helps verify that the variables now share a similar scale and that the 
influence of extreme values (outliers) has been effectively stabilized across the entire dataset.

```{r boxplots, message=FALSE, warning=FALSE}
oldpar <- par(no.readonly = TRUE)
par(mar = c(8,4,4,2))
boxplot(
  phenonormal,
  col = rainbow(ncol(phenonormal)),
  border = "black",
  outline = TRUE,
  las = 2,
  main = "OSKT Normalized Values",
  ylab = "Values",
  cex.axis = 0.8
)
par(oldpar)
```

### Histograms and Density Plots
After transformation, a new grid of histograms is generated for the phenonormal dataset. 
This visualization allows for an empirical evaluation of the OSKT method, confirming that 
the distributions now exhibit improved symmetry and follow a bell-shaped curve more 
closely compared to the raw data.

```{r histograms, message=FALSE, warning=FALSE}
oldpar <- par(no.readonly = TRUE)
nvar <- ncol(phenonormal)
nrow_plot <- ceiling(sqrt(nvar))
ncol_plot <- ceiling(nvar / nrow_plot)

par(mfrow = c(nrow_plot, ncol_plot), mar = c(4,4,3,1))
cols <- rainbow(nvar)

for(i in seq_len(nvar)) {
  x <- phenonormal[[i]]
  x <- x[!is.na(x)]
  
  if(length(x) == 0) next
  
  d <- density(x)
  h_calc <- hist(x, plot = FALSE) 
  
  max_y <- max(d$y, h_calc$density) * 1.1
  
  hist(x, freq = FALSE, 
       col = adjustcolor(cols[i], 0.6),
       border = "white",
       main = colnames(phenonormal)[i],
       xlab = "", ylab = "Density",
       ylim = c(0, max_y))
  
  lines(d, col = "black", lwd = 2)
}
par(oldpar)

```

## Optimization Parameters
This section extracts the internal parameters estimated during the transformation process. 
The table provides the optimized values for g (skewness parameter) and h (kurtosis parameter), 
along with the final A^2 value, which represents the minimized objective function for each trait.

```{r optparams, message=FALSE, warning=FALSE}
norm_res2 <- osktnorm(phenoset, normtests = "all")

phenonormal <- norm_res2$normalized
paramstable <- norm_res2$parameters
paramstable
```

## Normality Test Results
A comprehensive suite of statistical diagnostics is performed on the normalized data. 
By requesting normtests = "all", the function computes multiple metrics including 
skewness, kurtosis, and various p-values (Shapiro-Wilk, Anderson-Darling, etc.). 
This table serves as the final validation that the traits have successfully achieved 
or approached normality.

```{r testresults, message=FALSE, warning=FALSE}
testtable <- norm_res2$normtests
testtable

```

## Saving the Normalized Data
In the final step, the successfully transformed and normalized dataset is exported to an external file. 
Using the writexl package, the data is saved as an Excel spreadsheet (sativapheno_normalized.xlsx), 
making it ready for use in subsequent downstream statistical analyses.

```{r write-excel, eval=FALSE, echo=TRUE, message=FALSE, warning=FALSE}
writexl::write_xlsx(phenonormal, path = "sativapheno_normalized.xlsx")

```

```{r final-checks, message=FALSE, warning=FALSE}
 options(oldopts)
```
