Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
CSAMA 2026
If we want to measure the linear association between two continuous variables, we can use the correlation coefficient.
An alternative way to look at the association between two variables is to determine the best line that describe their relationship.
What do we mean by “best”? In the context of regression, “best” means the line that minimizes the sum of squared distances between the observed and fitted values of the response variable.
Mathematically, we can describe the linear relation between the two quantities \(x\) and \(y\) as \[ \hat{y} = \alpha + \beta x. \]
Finding the best line corresponds to minimize the sum of squared distances between the values of \(y\) and the values of \(\alpha + \beta x\), i.e. \[ \min_{\alpha, \beta} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2. \]
To minimize this quantity, as usual, we compute the partial derivatives.
\[\begin{align*} \frac{\partial}{\partial \alpha} &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0,\\ \frac{\partial}{\partial \beta} &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) (x_i) = 0,\\ \end{align*}\]
We can show that \[\begin{align*} \alpha &= \bar{y} - \beta \bar{x},\\ \beta &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}. \end{align*}\]
The ideas of linear regression can be applied in the case where we have more than one covariate.
Instead of fitting the line \(\hat{y} = \alpha + \beta x\), we can consider the more general form \[ \hat{y} = \alpha + \beta_1 x_1 + \ldots + \beta_p x_p, \] where \(p\) is the number of covariates and \(x_1, \ldots, x_p\) are \(p\) variables.
For symmetry, the intercept \(\alpha\) is often denoted with \(\beta_0\).
For the Cherry tree data, assume to model the volume (\(y\)) of the trees as a function of girth (\(x_1\)) and height (\(x_2\)): \[ Y_i=\beta_0+ \beta_1 x_{i1}+ \beta_2 x_{i2}+\varepsilon_i\,,\,\,\, i=1, \ldots, 31. \]
Up until now, we have only defined the regression line as a mathematical entity that we can compute starting from a vector \(y\) and a matrix \(X\).
However, in inferential statistics, both \(Y\) and \(X\) are random variables, and what we observe are the values of our random sample from the population.
This means that what we want to infer the true relationship between \(X\) and \(Y\) in the population starting from the relationship that we observe in the sample.
In other words, there is a true parameter \(\beta\) in the population, and we need to estimate it with the values of our sample.
In the context of inference, regression is often referred to as the linear model.
Given a set of \(n\) independent observations, and \(p\) covariates, with \(n > p\), the linear model is defined, in matrix notation, as \[ Y = X \beta + \varepsilon, \] where
As with all the statistical models, we need some assumptions.
The term linear in the linear model refers to the linearity of the model with respect to the parameter \(\beta\).
It does not imply that we have linearity in \(X\) and \(X\) may contain non-linear transformation of the data. For instance the following models are valid linear models.
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \]
\[ Y = \beta_0 + \beta_1 \log X + \varepsilon \]
Explanation
Here the idea is the that system under study really is (approximately) linear, and we are interested in the coefficients per se.
It is often of particular interest to find a minimal set of explanatory variables.
Example
We can use a linear model to understand the effect of a certain treatment on gene expression.
We want to understand which genes are affected by the treatment and why.
Prediction
Here a model is a convenient means to create predictions for new cases.
The only interest is in the quality of the predictions.
Example
We can use a linear model to predict whether a patient will react to a therapy based on their gene expression
We want to get the best possible prediction, we do not care much about which genes are involved.
Fitting the model: how do we estimate \((\beta, \sigma^2)\)?
Inference: what can we say about \(\beta\) (rarely, about \(\sigma^2\)) based on the \(n\) observations?
We do not observe the true parameter \(\beta\) and we need to estimate it from the data.
Luckily, we already know how to estimate \(\beta\), as the value that minimizes the sum of the squares of the differences between \(Y\) and \(X\beta\).
In matrix notation, we have \[ \hat\beta=(X^\top X)^{-1}X^\top y. \] This is called the ordinary least squares (OLS) estimator.
To estimate the variance of the random terms, it seems natural to consider the variability of the residuals. An unbiased estimator for \(\sigma^2\) is given by \[ s^2=\displaystyle\frac{\sum(y_i - \hat y_i)^2}{n-p}. \]
We can prove that \(\hat\beta\) is conditionally unbiased, i.e., \[ E[\hat \beta | X] = \beta. \]
The variance of the OLS estimator is
\[ Var(\hat\beta | X) = \sigma^2 (X^\top X)^{-1}, \]
which is the lowest possible variance among all linear, unbiased estimators (BLUE).
Recall that \[ \hat\beta = (X^\top X)^{-1}X^\top Y. \]
This means that \(X^\top X\) needs to be invertible, which implies that \(X\) is full rank.
Definition
The rank of a matrix is the number of columns that are independent of all the others.
If the rank is equal to the number of columns, the matrix is said to be full rank.
This is often referred to as the non collinearity condition.
Note which assumptions we did not need to get to these results.
These results ensure that linear models can be applied in a lot of different settings, without requiring too stringent assumptions.
For the next results to hold, we need to restrict our model to the following assumption: \[ \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \]
We need the normality assumption in order to make inference on the coefficients of the model, i.e., to create confidence intervals and test hypotheses on the parameter \(\beta\).
It can be proved that, for \(j=1,\ldots,p\)
\[ \hat\beta_j\sim N(\beta_j, \sigma^2[(X^\top X)^{-1}]_{jj}). \]
Hence, the null hypothesis \(H_0 : \beta_j = 0\) can be tested with a t-test: \[ T_j = \frac{\hat\beta_j}{\sqrt{\hat{V}(\hat\beta_j)}} \sim t_{n-p}. \]
Consider the model \[ Y_i = \beta_0 + \beta_1 x_{i1} +\beta_2 x_{i2} + \varepsilon_i. \]
Want to test the overall significance test \[ H_0: \beta_1 =\beta_2 = 0. \]
Two models
Full: \(Y_i = \beta_0 + \beta_1 x_{i1} +\beta_2 x_{i2} + \varepsilon_i\)
Reduced: \(Y_i = \beta_0 + \varepsilon_i\)
F-statistic, under \(H_0\) (details similar as before): \[ F = \frac{(e^\top_Re_R - e^\top_Fe_F)/2}{e^\top_Fe_F/(n-3)} \sim F_{2, n-3} \] Reject \(H_0\) at level \(\alpha\) if \(F^{obs} > F_{2, n-3; 1-\alpha}.\)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
The Generalized Linear Model (GLM) is a family of models that includes the linear model and extend it to situations in which the response variable has a distribution of the exponential family.
The exponential family is a family of distributions that includes the Gaussian, Poisson, Binomial, Gamma, among others.
We can think of GLMs as a two-way generalization of linear models:
We can use a function of the linear predictor to transform it to a range of values that is more useful for the application.
We can use a different distribution for the response other than the Gaussian.
The three components that we need in order to fully specify a GLM are:
An exponential family distribution for the response.
A link function that connects the linear predictor to the response.
A variance function that specifies how the variance depends on the mean.
If we assume that:
The response variable is distributed as a Gaussian r.v.
The link function is the identity function.
The variance is constant with respect to the mean.
Then, we have the standard linear model.
In formulas:
\(Y \sim \mathcal{N}( \mu, \sigma^2)\).
\(\mu = X \beta\).
\(V(\mu) = \sigma^2\).
Poisson regression (or log-linear regression) assumes that:
The response variable is distributed as a Poisson r.v.
The link function is the log function.
The variance is equal to the mean.
In formulas:
\(Y \sim \mathcal{Poi}(\mu)\).
\(\log(\mu) = X \beta\).
\(V(\mu) = \mu\).
Important
The log is called the canonical link function because it is equal to the function that transforms the mean into the canonical parameter in the exponential family formulation.
We measure the expression of two genes, DUSP1 and VCAM1, in airway smooth muscle (ASM) cell lines.
The DUSP1 gene plays an important role in the human cellular response to environmental stress.
The VCAM1 gene is involved in cell adhesion and signal transduction.
The genes are measured with an assay that yields counts as a measure of gene expression.
xkcd.com/1725/
xkcd.com/605/
xkcd.com/2893/
xkcd.com/2893/
xkcd.com/925/
xkcd.com/2048/
xkcd.com/2618/
Hypothesis testing (in a few minutes!)
Using GLMs for RNA-seq differential expression (tomorrow!)
xkcd.com/2400/