CSAMA: Multivariate Analysis and PCA

Author

Bio Data Science^3 faculty

Published

May 26, 2026

Reference: Modern Statistics for modern biology Susan Holmes and Wolfgang Huber. Chapter 7. Multivariate Analysis. Cambridge University Press.

Goal

In this lab we will learn the basics of multivariate analysis and PCA using a few simple examples.

Work through this lab by running all the R code on your computer and making sure that you understand the input and the output. Make alterations where you seem fit. We encourage you to work through this lab with a partner.

Setup

# load required packages
library(GGally)
library(ggplot2)
library(pheatmap)
library(factoextra)
library(ade4)

Get data sets from the web:

turtles <- read.table(url("https://web.stanford.edu/class/bios221/data/PaintedTurtles.txt"),
                      header = TRUE)
download.file(url = "https://web.stanford.edu/class/bios221/data/athletes.RData",
              destfile = "athletes.RData", mode = "wb")

Or load them locally:

turtles <- read.table("PaintedTurtles.txt", header = TRUE)
load("athletes.RData")

head(turtles)
  sex length width height
1   f     98    81     38
2   f    103    84     38
3   f    103    86     42
4   f    105    86     40
5   f    109    88     44
6   f    123    92     50
athletes[1:3, ]
   m100 long weight highj  m400  m110  disc pole javel m1500
1 11.25 7.43  15.48  2.27 48.90 15.13 49.28  4.7 61.32 268.9
2 10.87 7.45  14.97  1.97 47.71 14.46 44.36  5.1 61.76 273.0
3 11.18 7.44  14.20  1.97 48.29 14.81 43.66  5.2 64.16 263.2

Let’s first get to know our data sets.

Questions
  1. How many athletes / turtles do you have in the data sets?
  2. What’s the record distance in the longjump category? And which athlete (number) made this record?
  3. What’s the average time across all athletes for the 100m run?
  4. Can you plot the histogram showing the distribution of the times for the 100m run?
  5. How many athletes of those who run faster than the average in the 100m run, also run faster than the average in the 1500m distance?

It is instructive to first consider 2-dimensional summaries of the data. The function ggpairs() from the GGally package gives a nice summary of the features and how they are correlated with each other.

ggpairs(turtles[, -1], axisLabels = "none")

Questions
  1. What do you see on the diagonal? What do the stars indicate next to the correlation value?
  2. Can you repeat this plot for the athletes data?
  3. Use the pheatmap() function in the package with the same name, pheatmap, to illustrate the pairwise correlations (computed using the cor() function) of the features in the athletes data set.

Preprocessing the data

In many cases, different variables are measured in different units and at different scales. Here, we elect to standardize the data to a common standard deviation. This rescaling is done using the scale() function, which subtracts the mean and divides by the standard deviation, so that every column has a unit standard deviation and mean zero.

scaledTurtles <- data.frame(sex = turtles[, 1], scale(turtles[, -1]))
head(scaledTurtles)
  sex   length   width  height
1   f -1.30300 -1.1390 -0.9929
2   f -1.05888 -0.9023 -0.9929
3   f -1.05888 -0.7445 -0.5163
4   f -0.96123 -0.7445 -0.7546
5   f -0.76593 -0.5867 -0.2780
6   f -0.08239 -0.2712  0.4369
Questions
  1. Can you compute the standard deviation and mean of each column in the turtles data frame? Can you do the same on the scaled dataset, i.e. on scaledturtles? What was the mean of turtles’ heights before standardizing?

We can visualize two columns/dimensions (for example height and width) of the scaled data using ggplot.

ggplot(scaledTurtles, aes(x = width, y = height, group = sex)) +
  geom_point(aes(color = sex)) + coord_fixed()

What is the purpose of the coord_fixed() modifier here?

Dimensionality reduction

In this part, we will use geometrical projections of points in a higher dimensional space and project them down to lower dimensions.

The first example will be the projection of the points in a two-dimensional space (defined by weight and disc distance in the athlete data set) onto a 1-dimensional space. The 1-dimensional space in this case is defined by the weight-axis/x-axis.

But first we need to scale the athlete data set, in the same way as we did it with the turtles data set.

scaledathletes <- data.frame(scale(athletes))
n <- nrow(scaledathletes)
# First, p is a 2-dimensional plot of the points defined by weight (x) and disc (y)
p <- ggplot(scaledathletes, aes(x = weight, y = disc)) + geom_point(shape = 1)

# Then we add the projected points and the projection lines (dashed)
p + geom_point(aes(y = rep(0, n)), color = "#0056B9") +
    geom_segment(aes(xend = weight, yend = rep(0, n)), linetype = "dashed")

Questions

Now try to do the following:

  1. Calculate the standard deviation of the blue points (their \(x\)-coordinates) in the above figure.

  2. Make a similar plot showing projection lines onto the \(y\)-axis and show projected points in yellow. What is the variance of the projected points now?

Summarize 2D-data by a line

In the above example when projecting the 2-dimensional points to the weight axis, we lost the disc information. In order to keep more information, we will now project the 2 dimensional point cloud onto another line.

For this, we first compute a linear model to find the regression line using the lm() function (linear model). We regress disc on weight. The regression line is defined by two parameters: its slope and its intercept. The slope a is given by the second coefficient in the output of lm and its intercept b is the first coefficient:

reg1 <- lm(disc ~ weight, data = scaledathletes)

Extract intercept and slope values

a1 <- reg1$coefficients[1] # Intercept
b1 <- reg1$coefficients[2] # slope

Plot the points p (computed in the code section before) and the regression line.

pline <- p +
    geom_abline(intercept = a1, slope = b1, col = "#0056B9", lwd = 1.5) +
    coord_fixed()

Add the projection lines (from the point to its fitted value)

pline +
    geom_segment(aes(xend = weight, yend = reg1$fitted),
                 color = "#FFD800",
                 arrow = arrow(length = unit(0.15, "cm")))

Question

Can you regress weight on discs and generate a similar plot?

Question

Can you create a plot that shows all points, as well as both regression lines, i.e., a plot that show both the line you get from lm(disc ~ weight) and lm(weight ~ disc)?

A line that minimizes distances in both directions

Below we are plotting a line chosen to minimize the error in both the horizontal and vertical directions. This results in minimizing the diagonal projections onto the line.

Specifically, we compute a line that minimizes the sum of squares of the orthogonal (perpendicular) projections of data points onto it. We call this the principal component line.

X <- cbind(scaledathletes$disc, scaledathletes$weight)
svda <- svd(X)
pc <- X %*% svda$v[, 1] %*% t(svda$v[, 1])
bp <- svda$v[2, 1] / svda$v[1, 1]
ap <- mean(pc[, 2]) - bp * mean(pc[, 1])

p + geom_segment(xend = pc[, 1], yend = pc[, 2], arrow = arrow(length = unit(0.15, "cm"))) +
  geom_abline(intercept = ap, slope = bp, col = "#606060", lwd = 1.5) +
  coord_fixed()

Now let’s see how we can use the learned on a higher-dimensional data set.

Turtle PCA

To start we will come back to the turtles data set. First, we need to make sure we understand the basic features of the data and preprocess it in a way that its in the correct “shape” for running the PCA analysis.

Questions
  1. What are the mean values and standard deviation, of each of the 3 features: length, width and height.
  2. Scale the data.
  3. Explore the correlations between the 3 variables after scaling the data. What do you see?

From the correlations, you see that all 3 variables are strongly correlated. (In the heatmap, note that the color scale already starts with a high value at its lower end.) Hence we expect that the data can be well approximated by a single variable. Let’s do the PCA:

pca1 <- princomp(turtlesc)
pca1
Call:
princomp(x = turtlesc)

Standard deviations:
Comp.1 Comp.2 Comp.3 
1.6955 0.2048 0.1448 

 3  variables and  48 observations.

To look at the relative importance of the principal components, we can look at their variances: the screeplot. The screeplot shows the eigenvalues for the standardized data.

fviz_eig(pca1, geom = "bar", width = 0.4)

Note: Here we see one very large component in this case and two very small ones. In this case the data are (almost) one dimensional.

Questions
  1. What is the percentage of variance explained by the first PC? How can you obtain this value from the pca1 object?
  2. How many PCs are you using if you want to project the turtles data set?

Now, lets plot the samples with their PC1 and PC2 coordinates, together with the variables. The representation of both, the samples and the variables is called a biplot.

fviz_pca_biplot(pca1, label = "var")

Questions
  1. Can you extend this plotting code to color the female samples differently than the male samples?
  2. Did the males or female turtles tend to be larger?

Back to the athletes

Now let us try to run the PCA on a larger data set and interpret the corresponding scree plot. In this case we are using a different library, with a slightly different output of the PCA computation. But the principle is the same.

# The dudi.pca function by default already centers and scales the data by itself
pca.ath <- dudi.pca(athletes, scannf = FALSE)
pca.ath$eig
 [1] 3.4182 2.6064 0.9433 0.8780 0.5566 0.4912 0.4306 0.3068 0.2669 0.1019
Questions
  1. Just like in the above turtle data set. Can you produce a scree plot?
  2. How many PCs are you using if you want to project the athletes data set?
  3. Can you plot the samples with their PC1 and PC2 coordinates, together with the variables in a biplot?
  4. Can you plot the numbers of the athletes onto the samples. What do you notice about the numbers?

Session information

sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so 
LAPACK: /usr/local/lib/R/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ade4_1.7-24      factoextra_2.0.0 pheatmap_1.0.13  GGally_2.4.0    
[5] ggplot2_4.0.3   

loaded via a namespace (and not attached):
 [1] generics_0.1.4     tidyr_1.3.2        rstatix_0.7.3      digest_0.6.39     
 [5] magrittr_2.0.5     evaluate_1.0.5     grid_4.6.0         RColorBrewer_1.1-3
 [9] fastmap_1.2.0      jsonlite_2.0.0     ggrepel_0.9.8      backports_1.5.1   
[13] Formula_1.2-5      purrr_1.2.2        scales_1.4.0       abind_1.4-8       
[17] cli_3.6.6          rlang_1.2.0        withr_3.0.2        yaml_2.3.12       
[21] otel_0.2.0         tools_4.6.0        ggsignif_0.6.4     dplyr_1.2.1       
[25] ggpubr_0.6.3       ggstats_0.13.0     broom_1.0.13       vctrs_0.7.3       
[29] R6_2.6.1           lifecycle_1.0.5    car_3.1-5          htmlwidgets_1.6.4 
[33] MASS_7.3-65        pkgconfig_2.0.3    pillar_1.11.1      gtable_0.3.6      
[37] glue_1.8.1         Rcpp_1.1.1-1.1     xfun_0.57          tibble_3.3.1      
[41] tidyselect_1.2.1   knitr_1.51         farver_2.1.2       htmltools_0.5.9   
[45] rmarkdown_2.31     carData_3.0-6      labeling_0.4.3     compiler_4.6.0    
[49] S7_0.2.2