## Theory

### Data matrix

I have a data matrix $$X$$ with $$n$$ rows (number of observations) and $$p$$ columns (number of features) with column-wise zero empirical mean. We don’ have response variable $$Y$$.

Each row of $$X$$ (observation), $$\vec{x}_{(i)} = (x_1, \dots, x_p)_{(i)}$$, is a vector in a $$p$$-dimensional space.

### Definition of PCA

PCA is defined as an orthogonal linear transformation of the data to a new coordinate system such that

• the first coordinate (called the first principal component) is the coordinate for which the variance of the projection of the observations into the axis is the greatest

• the second coordinate (called the second principal component) is the coordinate for which the variance of the projection of the observations into the axis is the second greatest

• and so on.

### The new coordinate system

The new coordinate system is defined by the vectors of weights or loadings

$\vec{w}_{(k)} = (w_1, \dots, w_p)_{(k)} \quad \text{for} \quad k=1 \dots m$

### Vector of principal components scores

The vector of principal components scores, $$\vec{t}_{(i)} = (t_1, \dots, t_m)_{(i)}$$, is the map of the vector $$\vec{x}_{(i)}$$ in the new coordinate system.

For a given observation $$\vec{x}_{(i)}$$, the $$k$$ score is given by the projection of $$\vec{x}_{(i)}$$ over $$\vec{w}_{(k)}$$

$t_{k(i)} = \vec{x}_{(i)} \cdot \vec{w}_{(k)} \quad \text{for} \quad i=1 \dots n \qquad k=1 \dots m$

### Matrix notation

If $$W$$ is a matrix in which column $$k$$ is $$\vec{w}_{(k)}$$ and $$T$$ is matrix in which each row is $$t_{(i)}$$, then

$T = X \cdot W$

### How to find the loadings?

It could be demonstrated that the loadings are the eigenvectors of $$X^T X$$ ordered from decreasing value of the eigenvalue.

The loadings are unit vectors.

### Simple example

We create a data matrix

library(tidyverse)

set.seed(40)
X <- tibble(x = seq(-6, 6, 2), y = x) %>%
rowwise() %>%
mutate(x = x + rnorm(1), y = y + rnorm(1)) %>%
ungroup() %>%
mutate(x = x - mean(x), y = y - mean(y)) %>%
as.matrix()

dim(X)
##  7 2
X
##               x           y
## [1,] -4.9848963 -4.13654747
## [2,] -2.9664525 -4.16974178
## [3,] -2.3222197 -3.19032814
## [4,] -0.2916953  0.04908539
## [5,]  2.2157916  0.89361091
## [6,]  3.2335942  4.92753384
## [7,]  5.1158780  5.62638724
p <- ggplot(X %>% as_tibble(), aes(x,y)) +
geom_point() +
coord_equal()
p We use prcomp to calculate the loadings

pr <- prcomp(X, center = FALSE, scale = FALSE)

w <- pr$rotation w_plot <- as_tibble(t(w)) %>% bind_cols(tibble(component = rownames(t(w)))) p + geom_segment(data = w_plot, aes(x = 0, y = 0, xend = x, yend = y, color = component), arrow = arrow(length = unit(0.05, "npc"))) + scale_color_discrete(name = "loadings", breaks = c("PC1", "PC2"), labels = c(expression(bold(w)[(1)]), expression(bold(w)[(2)]))) + coord_equal() We calculate the principal components scores using the loadings X %*% w ## PC1 PC2 ## [1,] -6.4029580 0.98099209 ## [2,] -5.0876012 -0.55036473 ## [3,] -3.9274834 -0.38180056 ## [4,] -0.1569716 0.25070991 ## [5,] 2.1394194 -1.06355886 ## [6,] 5.8316228 0.85375438 ## [7,] 7.6039720 -0.08973222 and check that corresponds to the principal components scores given directly by prcomp pr$x
##             PC1         PC2
## [1,] -6.4029580  0.98099209
## [2,] -5.0876012 -0.55036473
## [3,] -3.9274834 -0.38180056
## [4,] -0.1569716  0.25070991
## [5,]  2.1394194 -1.06355886
## [6,]  5.8316228  0.85375438
## [7,]  7.6039720 -0.08973222
X %*% w == pr$x ## PC1 PC2 ## [1,] TRUE TRUE ## [2,] TRUE TRUE ## [3,] TRUE TRUE ## [4,] TRUE TRUE ## [5,] TRUE TRUE ## [6,] TRUE TRUE ## [7,] TRUE TRUE We plot the the scores in the new coordinate system ggplot(pr$x %>% as_tibble(), aes(PC1, PC2)) +
geom_point() +
coord_equal() ## Checking centering and scaling in prcomp

library(tidyverse)

iris_center_scale <- iris %>%
mutate_if(is.numeric, funs( (. - mean(.)) / sd(.)))

# we check that is properly centered and scaled
iris_center_scale %>% summarise_if(is.numeric, funs(mean, sd))
##   Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1     -4.462602e-16     2.178518e-16     -1.111958e-17    -3.663049e-17
##   Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd
## 1               1              1               1              1

PCA is the centered and scaled data

prcomp(iris_center_scale[-5], scale = FALSE, center = FALSE) %>%
biplot() PCA automatically centering and scaling

prcomp(iris[-5], scale = TRUE, center = TRUE) %>%
biplot() 