Theory

Data matrix

I have a data matrix \(X\) with \(n\) rows (number of observations) and \(p\) columns (number of features) with column-wise zero empirical mean. We don’ have response variable \(Y\).

Each row of \(X\) (observation), \(\vec{x}_{(i)} = (x_1, \dots, x_p)_{(i)}\), is a vector in a \(p\)-dimensional space.

Definition of PCA

PCA is defined as an orthogonal linear transformation of the data to a new coordinate system such that

  • the first coordinate (called the first principal component) is the coordinate for which the variance of the projection of the observations into the axis is the greatest

  • the second coordinate (called the second principal component) is the coordinate for which the variance of the projection of the observations into the axis is the second greatest

  • and so on.

The new coordinate system

The new coordinate system is defined by the vectors of weights or loadings

\[ \vec{w}_{(k)} = (w_1, \dots, w_p)_{(k)} \quad \text{for} \quad k=1 \dots m\]

Vector of principal components scores

The vector of principal components scores, \(\vec{t}_{(i)} = (t_1, \dots, t_m)_{(i)}\), is the map of the vector \(\vec{x}_{(i)}\) in the new coordinate system.

For a given observation \(\vec{x}_{(i)}\), the \(k\) score is given by the projection of \(\vec{x}_{(i)}\) over \(\vec{w}_{(k)}\)

\[t_{k(i)} = \vec{x}_{(i)} \cdot \vec{w}_{(k)} \quad \text{for} \quad i=1 \dots n \qquad k=1 \dots m \]

Matrix notation

If \(W\) is a matrix in which column \(k\) is \(\vec{w}_{(k)}\) and \(T\) is matrix in which each row is \(t_{(i)}\), then

\[T = X \cdot W\]

How to find the loadings?

It could be demonstrated that the loadings are the eigenvectors of \(X^T X\) ordered from decreasing value of the eigenvalue.

The loadings are unit vectors.

Simple example

We create a data matrix

library(tidyverse)

set.seed(40)
X <- tibble(x = seq(-6, 6, 2), y = x) %>%
  rowwise() %>% 
  mutate(x = x + rnorm(1), y = y + rnorm(1)) %>% 
  ungroup() %>% 
  mutate(x = x - mean(x), y = y - mean(y)) %>% 
  as.matrix()

dim(X)
## [1] 7 2
X
##               x           y
## [1,] -4.9848963 -4.13654747
## [2,] -2.9664525 -4.16974178
## [3,] -2.3222197 -3.19032814
## [4,] -0.2916953  0.04908539
## [5,]  2.2157916  0.89361091
## [6,]  3.2335942  4.92753384
## [7,]  5.1158780  5.62638724
p <- ggplot(X %>% as_tibble(), aes(x,y)) +
  geom_point() +
  coord_equal()
p

We use prcomp to calculate the loadings

pr <- prcomp(X, center = FALSE, scale = FALSE)

w <- pr$rotation


w_plot <- as_tibble(t(w)) %>% 
  bind_cols(tibble(component = rownames(t(w))))
                 

p + geom_segment(data = w_plot, 
                 aes(x = 0, y = 0, xend = x, yend = y, color = component),
                 arrow = arrow(length = unit(0.05, "npc"))) +
  scale_color_discrete(name = "loadings", 
                       breaks = c("PC1", "PC2"), 
                       labels = c(expression(bold(w)[(1)]), 
                                  expression(bold(w)[(2)]))) +
  coord_equal()

We calculate the principal components scores using the loadings

X %*% w
##             PC1         PC2
## [1,] -6.4029580  0.98099209
## [2,] -5.0876012 -0.55036473
## [3,] -3.9274834 -0.38180056
## [4,] -0.1569716  0.25070991
## [5,]  2.1394194 -1.06355886
## [6,]  5.8316228  0.85375438
## [7,]  7.6039720 -0.08973222

and check that corresponds to the principal components scores given directly by prcomp

pr$x
##             PC1         PC2
## [1,] -6.4029580  0.98099209
## [2,] -5.0876012 -0.55036473
## [3,] -3.9274834 -0.38180056
## [4,] -0.1569716  0.25070991
## [5,]  2.1394194 -1.06355886
## [6,]  5.8316228  0.85375438
## [7,]  7.6039720 -0.08973222
X %*% w == pr$x
##       PC1  PC2
## [1,] TRUE TRUE
## [2,] TRUE TRUE
## [3,] TRUE TRUE
## [4,] TRUE TRUE
## [5,] TRUE TRUE
## [6,] TRUE TRUE
## [7,] TRUE TRUE

We plot the the scores in the new coordinate system

ggplot(pr$x %>% as_tibble(), aes(PC1, PC2)) +
  geom_point() +
  coord_equal()

Checking centering and scaling in prcomp

library(tidyverse)

iris_center_scale <- iris %>% 
  mutate_if(is.numeric, funs( (. - mean(.)) / sd(.)))

# we check that is properly centered and scaled 
iris_center_scale %>% summarise_if(is.numeric, funs(mean, sd))
##   Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1     -4.462602e-16     2.178518e-16     -1.111958e-17    -3.663049e-17
##   Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd
## 1               1              1               1              1

PCA is the centered and scaled data

prcomp(iris_center_scale[-5], scale = FALSE, center = FALSE) %>% 
  biplot()

PCA automatically centering and scaling

prcomp(iris[-5], scale = TRUE, center = TRUE) %>% 
  biplot()