CSAMA: Multivariate Analysis and PCA

Author

Bio Data Science^3 faculty

Published

May 26, 2026

Reference: Modern Statistics for Modern Biology, Susan Holmes and Wolfgang Huber. Chapter 7, Multivariate Analysis. Cambridge University Press.

Goal

In this lab we will learn the basics of multivariate analysis and PCA using a few simple examples.

Work through this lab by running all the R code on your computer and making sure that you understand the input and the output. Make alterations where you see fit. We encourage you to work through this lab with a partner.

Setup

# load required packages
library(GGally)
library(ggplot2)
library(pheatmap)
library(factoextra)
library(ade4)

Get data sets from the web:

turtles <- read.table(url("https://web.stanford.edu/class/bios221/data/PaintedTurtles.txt"),
                      header = TRUE)
download.file(url = "https://web.stanford.edu/class/bios221/data/athletes.RData",
              destfile = "athletes.RData", mode = "wb")
load("athletes.RData")  # creates the object `athletes`

Or load them locally:

turtles <- read.table("PaintedTurtles.txt", header = TRUE)
load("athletes.RData")

head(turtles)

  sex length width height
1   f     98    81     38
2   f    103    84     38
3   f    103    86     42
4   f    105    86     40
5   f    109    88     44
6   f    123    92     50

athletes[1:3, ]

   m100 long weight highj  m400  m110  disc pole javel m1500
1 11.25 7.43  15.48  2.27 48.90 15.13 49.28  4.7 61.32 268.9
2 10.87 7.45  14.97  1.97 47.71 14.46 44.36  5.1 61.76 273.0
3 11.18 7.44  14.20  1.97 48.29 14.81 43.66  5.2 64.16 263.2
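Note that `load()` restores objects under the names they were saved with, which is why `athletes` appears without any assignment. The following is a minimal sketch of that behaviour; the data frame `toy` and the temporary file are made up for illustration and are not part of the lab data.

```r
# Hypothetical toy data frame, standing in for a saved data set
toy <- data.frame(id = 1:3, time = c(11.2, 10.9, 11.4))

# Save it under the name "toy" to a temporary .RData file
path <- tempfile(fileext = ".RData")
save(toy, file = path)

# Load into a fresh environment so we can see exactly what gets created
env <- new.env()
loaded <- load(path, envir = env)

loaded                              # names of the restored objects
isTRUE(all.equal(env$toy, toy))     # the restored object matches the original
```

Loading into a separate environment is only done here to make the effect visible; in the lab, a plain `load("athletes.RData")` places `athletes` in your workspace.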

Let’s first get to know our data sets.

Questions

How many athletes / turtles do you have in the data sets?
What’s the record distance in the long jump category? And which athlete (number) set this record?
What’s the average time across all athletes for the 100m run?
Can you plot the histogram showing the distribution of the times for the 100m run?
Of the athletes who ran the 100m faster than average, how many also ran the 1500m faster than average?

It is instructive to first consider 2-dimensional summaries of the data. The function ggpairs() from the GGally package gives a nice summary of the features and how they are correlated with each other.

ggpairs(turtles[, -1], axisLabels = "none")

Questions

What do you see on the diagonal? What do the stars indicate next to the correlation value?
Can you repeat this plot for the athletes data?
Use the pheatmap() function in the package with the same name, pheatmap, to illustrate the pairwise correlations (computed using the cor() function) of the features in the athletes data set.

Preprocessing the data

In many cases, different variables are measured in different units and at different scales. Here, we elect to standardize the data to a common standard deviation. This rescaling is done using the scale() function, which subtracts the mean and divides by the standard deviation, so that every column has a unit standard deviation and mean zero.

scaledTurtles <- data.frame(sex = turtles[, 1], scale(turtles[, -1]))
head(scaledTurtles)

  sex   length   width  height
1   f -1.30300 -1.1390 -0.9929
2   f -1.05888 -0.9023 -0.9929
3   f -1.05888 -0.7445 -0.5163
4   f -0.96123 -0.7445 -0.7546
5   f -0.76593 -0.5867 -0.2780
6   f -0.08239 -0.2712  0.4369
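To make the definition of standardization concrete, here is a minimal sketch on a small made-up matrix `m` (an assumption for illustration, not the turtles data). It standardizes the columns by hand and compares the result with `scale()`.

```r
# Made-up measurements: two columns on very different scales
m <- cbind(a = c(1, 2, 3, 4), b = c(100, 300, 200, 400))

# Standardize by hand: subtract column means, divide by column standard deviations
by_hand <- sweep(sweep(m, 2, colMeans(m), "-"), 2, apply(m, 2, sd), "/")

# scale() does the same; it also stores the means and sds as attributes
scaled <- scale(m)

all.equal(by_hand, scaled, check.attributes = FALSE)
attr(scaled, "scaled:center")   # original column means
attr(scaled, "scaled:scale")    # original column standard deviations
```

The attributes are kept only while the result is a matrix; wrapping it in `data.frame()` drops them.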

Questions

Can you compute the standard deviation and mean of each column in the turtles data frame? Can you do the same on the scaled data set, i.e. on scaledTurtles? What was the mean of the turtles’ heights before standardizing?

We can visualize two columns/dimensions (for example height and width) of the scaled data using ggplot.

ggplot(scaledTurtles, aes(x = width, y = height, group = sex)) +
  geom_point(aes(color = sex)) + coord_fixed()

What is the purpose of the coord_fixed() modifier here?

Dimensionality reduction

In this part, we will take points in a higher-dimensional space and project them geometrically onto a lower-dimensional one.

The first example will be the projection of the points in a two-dimensional space (defined by weight and disc distance in the athlete data set) onto a 1-dimensional space. The 1-dimensional space in this case is defined by the weight-axis/x-axis.

But first we need to scale the athletes data set, in the same way as we did with the turtles data set.

scaledathletes <- data.frame(scale(athletes))
n <- nrow(scaledathletes)

# First, p is a 2-dimensional plot of the points defined by weight (x) and disc (y)
p <- ggplot(scaledathletes, aes(x = weight, y = disc)) + geom_point(shape = 1)

# Then we add the projected points and the projection lines (dashed)
p + geom_point(aes(y = rep(0, n)), color = "#0056B9") +
    geom_segment(aes(xend = weight, yend = rep(0, n)), linetype = "dashed")
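Projecting onto the $x$-axis is a special case of projecting onto a line through the origin. As a minimal sketch, assuming made-up points `pts` and a helper `project()` that is not part of the lab code, the projection onto a unit direction `u` can be written as a matrix product:

```r
# Made-up 2-d points (rows are points, columns are x and y)
pts <- cbind(x = c(1, 2, -1, 0.5), y = c(0.5, -1, 2, 1))

# Orthogonal projection onto the line through the origin with direction u
project <- function(pts, u) {
  u <- u / sqrt(sum(u^2))        # normalise, in case u is not unit length
  (pts %*% u) %*% t(u)           # coordinate along u, mapped back into 2-d
}

on_x <- project(pts, c(1, 0))    # projection onto the x-axis
on_x                             # x is kept, y becomes 0
```

With `u = c(1, 0)` this reproduces what the plot does by setting `y` to zero; other directions give projections onto slanted lines.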

Questions

Now try to do the following:

Calculate the standard deviation of the blue points (their $x$-coordinates) in the above figure.
Make a similar plot showing projection lines onto the $y$-axis and show projected points in yellow. What is the variance of the projected points now?

Summarize 2D-data by a line

In the above example, when projecting the 2-dimensional points onto the weight axis, we lost the disc information. In order to keep more information, we will now project the 2-dimensional point cloud onto another line.

For this, we first compute a linear model to find the regression line using the lm() function (linear model). We regress disc on weight. The regression line is defined by two parameters: its intercept and its slope. The intercept a is given by the first coefficient in the output of lm and the slope b by the second coefficient:

reg1 <- lm(disc ~ weight, data = scaledathletes)

Extract intercept and slope values

a1 <- reg1$coefficients[1] # Intercept
b1 <- reg1$coefficients[2] # slope
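Because both variables were standardized, the coefficients take a particularly simple form: the intercept is zero and the slope equals the correlation. The sketch below checks this on simulated data; the variables `x` and `y` are assumptions for illustration, not the athletes data.

```r
set.seed(1)
# Simulated data (an assumption for illustration, not the athletes data)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)
d <- data.frame(scale(cbind(x, y)))   # standardize both columns

fit <- lm(y ~ x, data = d)
coef(fit)          # intercept is 0 up to rounding; slope is the correlation
cor(d$x, d$y)
```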

Plot the points p (computed in the code section before) and the regression line.

pline <- p +
    geom_abline(intercept = a1, slope = b1, col = "#0056B9", lwd = 1.5) +
    coord_fixed()

Add the projection lines (from each point to its fitted value):

pline +
    geom_segment(aes(xend = weight, yend = reg1$fitted),
                 color = "#FFD800",
                 arrow = arrow(length = unit(0.15, "cm")))

Question

Can you regress weight on disc and generate a similar plot?

Question

Can you create a plot that shows all points as well as both regression lines, i.e., a plot that shows both the line you get from lm(disc ~ weight) and the one from lm(weight ~ disc)?

A line that minimizes distances in both directions

Below we plot a line chosen to minimize the error in both the horizontal and vertical directions at once. This amounts to minimizing the perpendicular distances from the points to the line.

Specifically, we compute the line that minimizes the sum of squared orthogonal (perpendicular) distances from the data points to it. We call this the principal component line.

# Column order matches the axes of p: weight (x) first, disc (y) second
X <- cbind(scaledathletes$weight, scaledathletes$disc)
svda <- svd(X)
pc <- X %*% svda$v[, 1] %*% t(svda$v[, 1])
bp <- svda$v[2, 1] / svda$v[1, 1]
ap <- mean(pc[, 2]) - bp * mean(pc[, 1])

p + geom_segment(xend = pc[, 1], yend = pc[, 2], arrow = arrow(length = unit(0.15, "cm"))) +
  geom_abline(intercept = ap, slope = bp, col = "#606060", lwd = 1.5) +
  coord_fixed()
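To see that the principal component line really does better than a regression line in terms of perpendicular distances, the following sketch compares the two on simulated data. The matrix `sim` and the helper `orth_ss()` are hypothetical names introduced for illustration only.

```r
set.seed(2)
# Simulated, standardized 2-d data (hypothetical, for illustration only)
x <- rnorm(100)
y <- 0.7 * x + rnorm(100, sd = 0.7)
sim <- scale(cbind(x, y))

# Sum of squared perpendicular distances from the rows of `dat` to the line
# through the origin with slope b
orth_ss <- function(dat, b) {
  u <- c(1, b) / sqrt(1 + b^2)          # unit vector along the line
  resid <- dat - (dat %*% u) %*% t(u)   # component perpendicular to the line
  sum(resid^2)
}

v1 <- svd(sim)$v[, 1]
b_pc <- v1[2] / v1[1]                   # slope of the principal component line
b_reg <- unname(coef(lm(sim[, 2] ~ sim[, 1]))[2])  # slope of the regression of y on x

c(pc = orth_ss(sim, b_pc), regression = orth_ss(sim, b_reg))
```

For two standardized, positively correlated variables the principal component line has slope 1, whereas the regression slope equals the correlation and is therefore flatter.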

Now let’s see how we can apply what we have learned to a higher-dimensional data set.

Turtle PCA

To start, we come back to the turtles data set. First, we need to make sure we understand the basic features of the data and preprocess it so that it is in the correct “shape” for running the PCA.

Questions

What are the mean values and standard deviations of each of the 3 features: length, width and height?
Scale the data.
Explore the correlations between the 3 variables after scaling the data. What do you see?

From the correlations, you see that all 3 variables are strongly correlated. (In the heatmap, note that the color scale already starts with a high value at its lower end.) Hence we expect that the data can be well approximated by a single variable. Let’s do the PCA on turtlesc, the scaled matrix from question 2, i.e. scale(turtles[, -1]):

pca1 <- princomp(turtlesc)
pca1

Call:
princomp(x = turtlesc)

Standard deviations:
Comp.1 Comp.2 Comp.3 
1.6955 0.2048 0.1448 

 3  variables and  48 observations.
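The squared standard deviations above sum to about 2.94 rather than 3. This is because `princomp()` computes variances with divisor $n$, whereas `scale()` and `prcomp()` use $n - 1$; with 48 turtles, $3 \times 47/48 \approx 2.94$. The sketch below illustrates the relationship on a simulated matrix `Z` (an assumption for illustration, not the turtles data).

```r
set.seed(3)
# Simulated data matrix (hypothetical), 20 observations of 3 variables
Z <- scale(matrix(rnorm(60), nrow = 20))
n_obs <- nrow(Z)

sd_princomp <- unname(princomp(Z)$sdev)   # divisor n
sd_prcomp   <- prcomp(Z)$sdev             # divisor n - 1

# The two differ only by the factor sqrt((n - 1) / n)
all.equal(sd_princomp, sd_prcomp * sqrt((n_obs - 1) / n_obs))

# Hence the variances from princomp() on scaled data sum to p * (n - 1) / n
sum(sd_princomp^2)
```

The difference does not affect the proportions of variance attributed to each component, only their absolute values.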

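To judge the relative importance of the principal components, we look at their variances, which are the eigenvalues of the covariance matrix of the standardized data. The screeplot displays them, one bar per component; by default, fviz_eig() expresses each as a percentage of the total variance.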

fviz_eig(pca1, geom = "bar", width = 0.4)

Note: Here we see one very large component and two very small ones, so the data are (almost) one-dimensional.

Questions

What is the percentage of variance explained by the first PC? How can you obtain this value from the pca1 object?
How many PCs would you use if you wanted to project the turtles data set?

Now, let’s plot the samples with their PC1 and PC2 coordinates, together with the variables. A plot that represents both the samples and the variables is called a biplot.

fviz_pca_biplot(pca1, label = "var")
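The two ingredients of a biplot are both stored in the `princomp` result: the scores (coordinates of the samples on the PCs) and the loadings (the contribution of each variable to each PC). A minimal sketch of how they relate, using a simulated matrix `sim3` that is made up for illustration:

```r
set.seed(4)
# Simulated data (hypothetical): 30 observations of 3 correlated variables
shared <- rnorm(30)
sim3 <- scale(cbind(u = shared + rnorm(30, sd = 0.3),
                    v = shared + rnorm(30, sd = 0.3),
                    w = shared + rnorm(30, sd = 0.3)))

pca_sim <- princomp(sim3)

loadings_mat <- unclass(pca_sim$loadings)    # variables: one column per PC
scores_mat   <- pca_sim$scores               # samples: coordinates on the PCs

# Scores are the centered data multiplied by the loadings
centered <- scale(sim3, center = TRUE, scale = FALSE)
all.equal(centered %*% loadings_mat, scores_mat, check.attributes = FALSE)
```

In the biplot, the points correspond to the first two columns of the scores and the arrows to the first two columns of the loadings; the exact rescaling applied to the arrows is a plotting choice of the function used.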

Questions

Can you extend this plotting code to color the female samples differently than the male samples?
Do the male or the female turtles tend to be larger?

Back to the athletes

Now let us run the PCA on a larger data set and interpret the corresponding scree plot. In this case we are using a different package (ade4), whose PCA function returns its output in a slightly different form. But the principle is the same.

# The dudi.pca function by default already centers and scales the data by itself
pca.ath <- dudi.pca(athletes, scannf = FALSE)
pca.ath$eig

 [1] 3.4182 2.6064 0.9433 0.8780 0.5566 0.4912 0.4306 0.3068 0.2669 0.1019
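Since the variables were centered and scaled, the eigenvalues are those of a correlation matrix and should sum to the number of variables. The sketch below checks this using the values printed above, and on a simulated matrix `toy_mat` (a made-up example) with base R only.

```r
# Eigenvalues as printed in the output above
eig <- c(3.4182, 2.6064, 0.9433, 0.8780, 0.5566,
         0.4912, 0.4306, 0.3068, 0.2669, 0.1019)

sum(eig)                          # close to 10, the number of variables
round(100 * eig / sum(eig), 1)    # percentage of variance per component

# The same property on simulated data (hypothetical), 4 variables
set.seed(5)
toy_mat <- matrix(rnorm(200), ncol = 4)
sum(eigen(cor(toy_mat))$values)   # the trace of a 4 x 4 correlation matrix
```

An eigenvalue above 1 therefore means that the component carries more variance than any single standardized variable does on its own.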

Questions

Just as for the turtles data set above, can you produce a scree plot?
How many PCs would you use if you wanted to project the athletes data set?
Can you plot the samples with their PC1 and PC2 coordinates, together with the variables, in a biplot?
Can you plot the numbers of the athletes onto the samples? What do you notice about the numbers?

Session information

sessionInfo()

R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so 
LAPACK: /usr/local/lib/R/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ade4_1.7-24      factoextra_2.0.0 pheatmap_1.0.13  GGally_2.4.0    
[5] ggplot2_4.0.3   

loaded via a namespace (and not attached):
 [1] generics_0.1.4     tidyr_1.3.2        rstatix_0.7.3      digest_0.6.39     
 [5] magrittr_2.0.5     evaluate_1.0.5     grid_4.6.0         RColorBrewer_1.1-3
 [9] fastmap_1.2.0      jsonlite_2.0.0     ggrepel_0.9.8      backports_1.5.1   
[13] Formula_1.2-5      purrr_1.2.2        scales_1.4.0       abind_1.4-8       
[17] cli_3.6.6          rlang_1.2.0        withr_3.0.2        yaml_2.3.12       
[21] otel_0.2.0         tools_4.6.0        ggsignif_0.6.4     dplyr_1.2.1       
[25] ggpubr_0.6.3       ggstats_0.13.0     broom_1.0.13       vctrs_0.7.3       
[29] R6_2.6.1           lifecycle_1.0.5    car_3.1-5          htmlwidgets_1.6.4 
[33] MASS_7.3-65        pkgconfig_2.0.3    pillar_1.11.1      gtable_0.3.6      
[37] glue_1.8.1         Rcpp_1.1.1-1.1     xfun_0.57          tibble_3.3.1      
[41] tidyselect_1.2.1   knitr_1.51         farver_2.1.2       htmltools_0.5.9   
[45] rmarkdown_2.31     carData_3.0-6      labeling_0.4.3     compiler_4.6.0    
[49] S7_0.2.2
