Regression: randomness and linear models

CSAMA 2026

Davide Risso

Least-squares regression

Least-Squares Regression

If we want to measure the linear association between two continuous variables, we can use the correlation coefficient.
An alternative way to look at the association between two variables is to determine the best line that describe their relationship.
What do we mean by “best”? In the context of regression, “best” means the line that minimizes the sum of squared distances between the observed and fitted values of the response variable.

A simple example

Least-Squares Regression

Mathematically, we can describe the linear relation between the two quantities \(x\) and \(y\) as \[ \hat{y} = \alpha + \beta x. \]

Finding the best line corresponds to minimize the sum of squared distances between the values of \(y\) and the values of \(\alpha + \beta x\), i.e. \[ \min_{\alpha, \beta} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2. \]

Least-Squares Regression

To minimize this quantity, as usual, we compute the partial derivatives.

\[\begin{align*} \frac{\partial}{\partial \alpha} &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0,\\ \frac{\partial}{\partial \beta} &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) (x_i) = 0,\\ \end{align*}\]

We can show that \[\begin{align*} \alpha &= \bar{y} - \beta \bar{x},\\ \beta &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}. \end{align*}\]

Multiple Regression

The ideas of linear regression can be applied in the case where we have more than one covariate.

Instead of fitting the line \(\hat{y} = \alpha + \beta x\), we can consider the more general form \[ \hat{y} = \alpha + \beta_1 x_1 + \ldots + \beta_p x_p, \] where \(p\) is the number of covariates and \(x_1, \ldots, x_p\) are \(p\) variables.

For symmetry, the intercept \(\alpha\) is often denoted with \(\beta_0\).

Example: Cherry trees

head(trees)

  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

Example: Cherry trees

For the Cherry tree data, assume to model the volume (\(y\)) of the trees as a function of girth (\(x_1\)) and height (\(x_2\)): \[ Y_i=\beta_0+ \beta_1 x_{i1}+ \beta_2 x_{i2}+\varepsilon_i\,,\,\,\, i=1, \ldots, 31. \]

Example: Cherry trees

fit <- lm(Volume ~ Girth + Height, data = trees)
fit


Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Coefficients:
(Intercept)        Girth       Height  
   -57.9877       4.7082       0.3393

Linear models

Regression as a statistical model

Up until now, we have only defined the regression line as a mathematical entity that we can compute starting from a vector \(y\) and a matrix \(X\).
However, in inferential statistics, both \(Y\) and \(X\) are random variables, and what we observe are the values of our random sample from the population.
This means that what we want to infer the true relationship between \(X\) and \(Y\) in the population starting from the relationship that we observe in the sample.

Regression as a statistical model

In other words, there is a true parameter \(\beta\) in the population, and we need to estimate it with the values of our sample.
In the context of inference, regression is often referred to as the linear model.

The linear model: matrix notation

Given a set of \(n\) independent observations, and \(p\) covariates, with \(n > p\), the linear model is defined, in matrix notation, as \[ Y = X \beta + \varepsilon, \] where

\(Y\) is a \(n \times 1\) vector containing the response variable.
\(X\) is a \(n \times p\) matrix of covariates, called the design matrix.
\(\beta\) is a \(p \times 1\) vector of regression parameters.
\(\varepsilon\) is a \(n \times 1\) vector of random errors.

The linear model: matrix notation

Assumptions

As with all the statistical models, we need some assumptions.

There is a linear relation between \(Y\) and \(X\).
The error terms \(\varepsilon\) are i.i.d. with mean 0 and constant variance:
- \(E[\varepsilon] = 0\);
- \(Var(\varepsilon) = \sigma^2\) (homoscedasticity).
The error terms \(\varepsilon\) are independent of \(X\):
- \(\varepsilon {\perp\!\!\!\!\perp} X\).

A note on linearity

The term linear in the linear model refers to the linearity of the model with respect to the parameter \(\beta\).

It does not imply that we have linearity in \(X\) and \(X\) may contain non-linear transformation of the data. For instance the following models are valid linear models.

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \]

\[ Y = \beta_0 + \beta_1 \log X + \varepsilon \]

Why fitting linear models

Explanation

Here the idea is the that system under study really is (approximately) linear, and we are interested in the coefficients per se.

It is often of particular interest to find a minimal set of explanatory variables.

Example

We can use a linear model to understand the effect of a certain treatment on gene expression.

We want to understand which genes are affected by the treatment and why.

Why fitting linear models

Prediction

Here a model is a convenient means to create predictions for new cases.

The only interest is in the quality of the predictions.

Example

We can use a linear model to predict whether a patient will react to a therapy based on their gene expression

We want to get the best possible prediction, we do not care much about which genes are involved.

To solve now

Fitting the model: how do we estimate \((\beta, \sigma^2)\)?
Inference: what can we say about \(\beta\) (rarely, about \(\sigma^2\)) based on the \(n\) observations?

Ordinary Least Squares

We do not observe the true parameter \(\beta\) and we need to estimate it from the data.

Luckily, we already know how to estimate \(\beta\), as the value that minimizes the sum of the squares of the differences between \(Y\) and \(X\beta\).

In matrix notation, we have \[ \hat\beta=(X^\top X)^{-1}X^\top y. \] This is called the ordinary least squares (OLS) estimator.

Ordinary Least Squares

To estimate the variance of the random terms, it seems natural to consider the variability of the residuals. An unbiased estimator for \(\sigma^2\) is given by \[ s^2=\displaystyle\frac{\sum(y_i - \hat y_i)^2}{n-p}. \]

Why least squares?

We can prove that \(\hat\beta\) is conditionally unbiased, i.e., \[ E[\hat \beta | X] = \beta. \]

The variance of the OLS estimator is

\[ Var(\hat\beta | X) = \sigma^2 (X^\top X)^{-1}, \]

which is the lowest possible variance among all linear, unbiased estimators (BLUE).

A note on the design matrix

Recall that \[ \hat\beta = (X^\top X)^{-1}X^\top Y. \]

This means that \(X^\top X\) needs to be invertible, which implies that \(X\) is full rank.

Definition

The rank of a matrix is the number of columns that are independent of all the others.

If the rank is equal to the number of columns, the matrix is said to be full rank.

This is often referred to as the non collinearity condition.

A note on the assumptions

Note which assumptions we did not need to get to these results.

We did not assume normality!
We did not assume independence of the columns of \(X\), just non-collinearity.

These results ensure that linear models can be applied in a lot of different settings, without requiring too stringent assumptions.

Example: Cherry trees

betahat <- solve(crossprod(X)) %*% crossprod(X, y)
betahat

            [,1]
[1,] -57.9876589
[2,]   4.7081605
[3,]   0.3392512

res <- y - X %*% betahat
n <- nrow(X)
p <- ncol(X)
s2 <- sum(res^2)/(n-p)
s2

[1] 15.06862

sdb <- sqrt(diag(s2 * solve(crossprod(X))))
sdb

[1] 8.6382259 0.2642646 0.1301512

Example: Cherry trees

s <- summary(fit)
s$sigma^2

[1] 15.06862

s$coefficients[,1:2]

               Estimate Std. Error
(Intercept) -57.9876589  8.6382259
Girth         4.7081605  0.2642646
Height        0.3392512  0.1301512

The Normal Linear Model

Normality assumption

For the next results to hold, we need to restrict our model to the following assumption: \[ \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \]

We need the normality assumption in order to make inference on the coefficients of the model, i.e., to create confidence intervals and test hypotheses on the parameter \(\beta\).

Under the normal linear model

It can be proved that, for \(j=1,\ldots,p\)

\[ \hat\beta_j\sim N(\beta_j, \sigma^2[(X^\top X)^{-1}]_{jj}). \]

Hence, the null hypothesis \(H_0 : \beta_j = 0\) can be tested with a t-test: \[ T_j = \frac{\hat\beta_j}{\sqrt{\hat{V}(\hat\beta_j)}} \sim t_{n-p}. \]

ANOVA

Consider the model \[ Y_i = \beta_0 + \beta_1 x_{i1} +\beta_2 x_{i2} + \varepsilon_i. \]

Want to test the overall significance test \[ H_0: \beta_1 =\beta_2 = 0. \]

ANOVA

Two models

Full: \(Y_i = \beta_0 + \beta_1 x_{i1} +\beta_2 x_{i2} + \varepsilon_i\)
Reduced: \(Y_i = \beta_0 + \varepsilon_i\)

F-statistic, under \(H_0\) (details similar as before): \[ F = \frac{(e^\top_Re_R - e^\top_Fe_F)/2}{e^\top_Fe_F/(n-3)} \sim F_{2, n-3} \] Reject \(H_0\) at level \(\alpha\) if \(F^{obs} > F_{2, n-3; 1-\alpha}.\)

Example

summary(fit)


Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

Generalized Linear Models (GLMs)

The Generalized Linear Model (GLM) is a family of models that includes the linear model and extend it to situations in which the response variable has a distribution of the exponential family.

The exponential family is a family of distributions that includes the Gaussian, Poisson, Binomial, Gamma, among others.

Generalizations of the linear model

We can think of GLMs as a two-way generalization of linear models:

We can use a function of the linear predictor to transform it to a range of values that is more useful for the application.
We can use a different distribution for the response other than the Gaussian.

The three components of a GLM

The three components that we need in order to fully specify a GLM are:

An exponential family distribution for the response.
A link function that connects the linear predictor to the response.
A variance function that specifies how the variance depends on the mean.

Example: the normal linear model

If we assume that:

The response variable is distributed as a Gaussian r.v.
The link function is the identity function.
The variance is constant with respect to the mean.

Then, we have the standard linear model.

Example: the normal linear model

In formulas:

\(Y \sim \mathcal{N}( \mu, \sigma^2)\).
\(\mu = X \beta\).
\(V(\mu) = \sigma^2\).

Example: Poisson regression

Poisson regression (or log-linear regression) assumes that:

The response variable is distributed as a Poisson r.v.
The link function is the log function.
The variance is equal to the mean.

Example: Poisson regression

In formulas:

\(Y \sim \mathcal{Poi}(\mu)\).
\(\log(\mu) = X \beta\).
\(V(\mu) = \mu\).

Important

The log is called the canonical link function because it is equal to the function that transforms the mean into the canonical parameter in the exponential family formulation.

Example: Poisson regression

We measure the expression of two genes, DUSP1 and VCAM1, in airway smooth muscle (ASM) cell lines.
The DUSP1 gene plays an important role in the human cellular response to environmental stress.
The VCAM1 gene is involved in cell adhesion and signal transduction.
The genes are measured with an assay that yields counts as a measure of gene expression.

Example: Poisson regression

Much more to learn

Goodness of fit

xkcd.com/1725/

Extrapolation

xkcd.com/605/

Out-of-sample prediction

xkcd.com/2893/

Sample size and data independence

xkcd.com/2893/

Correlation vs causality

xkcd.com/925/

Under- vs Over-fitting

xkcd.com/2048/

Data quality

xkcd.com/2618/

Up next

Hypothesis testing (in a few minutes!)
Using GLMs for RNA-seq differential expression (tomorrow!)

Thank you for your attention!

xkcd.com/2400/