Theory

Data matrix

I have a data matrix \(X\) with \(n\) rows (number of observations) and \(p\) columns (number of features) with column-wise zero empirical mean. We donโ€™ have response variable \(Y\).

Each row of \(X\) (observation), \(\vec{x}_{(i)} = (x_1, \dots, x_p)_{(i)}\), is a vector in a \(p\)-dimensional space.

Definition of PCA

PCA is defined as an orthogonal linear transformation of the data to a new coordinate system such that

  • the first coordinate (called the first principal component) is the coordinate for which the variance of the projection of the observations into the axis is the greatest

  • the second coordinate (called the second principal component) is the coordinate for which the variance of the projection of the observations into the axis is the second greatest

  • and so on.

The new coordinate system

The new coordinate system is defined by the vectors of weights or loadings

\[ \vec{w}_{(k)} = (w_1, \dots, w_p)_{(k)} \quad \text{for} \quad k=1 \dots m\]

Vector of principal components scores

The vector of principal components scores, \(\vec{t}_{(i)} = (t_1, \dots, t_m)_{(i)}\), is the map of the vector \(\vec{x}_{(i)}\) in the new coordinate system.

For a given observation \(\vec{x}_{(i)}\), the \(k\) score is given by the projection of \(\vec{x}_{(i)}\) over \(\vec{w}_{(k)}\)

\[t_{k(i)} = \vec{x}_{(i)} \cdot \vec{w}_{(k)} \quad \text{for} \quad i=1 \dots n \qquad k=1 \dots m \]

Matrix notation

If \(W\) is a matrix in which column \(k\) is \(\vec{w}_{(k)}\) and \(T\) is matrix in which each row is \(t_{(i)}\), then

\[T = X \cdot W\]

How to find the loadings?

It could be demonstrated that the loadings are the eigenvectors of \(X^T X\) ordered from decreasing value of the eigenvalue.

The loadings are unit vectors.

Simple example

We create a data matrix

library(tidyverse)

set.seed(40)
X <- tibble(x = seq(-6, 6, 2), y = x) %>%
  rowwise() %>% 
  mutate(x = x + rnorm(1), y = y + rnorm(1)) %>% 
  ungroup() %>% 
  mutate(x = x - mean(x), y = y - mean(y)) %>% 
  as.matrix()

dim(X)
## [1] 7 2
X
##               x             y
## 1 -6.000000e+00 -6.000000e+00
## 2 -4.000000e+00 -4.000000e+00
## 3 -2.000000e+00 -2.000000e+00
## 4  5.551115e-17 -1.110223e-16
## 5  2.000000e+00  2.000000e+00
## 6  4.000000e+00  4.000000e+00
## 7  6.000000e+00  6.000000e+00
p <- ggplot(X %>% as_tibble(), aes(x,y)) +
  geom_point() +
  coord_equal()
p

We use prcomp to calculate the loadings

pr <- prcomp(X, center = FALSE, scale = FALSE)

w <- pr$rotation


w_plot <- as_tibble(t(w)) %>% 
  bind_cols(tibble(component = rownames(t(w))))
                 

p + geom_segment(data = w_plot, 
                 aes(x = 0, y = 0, xend = x, yend = y, color = component),
                 arrow = arrow(length = unit(0.05, "npc"))) +
  scale_color_discrete(name = "loadings", 
                       breaks = c("PC1", "PC2"), 
                       labels = c(expression(bold(w)[(1)]), 
                                  expression(bold(w)[(2)]))) +
  coord_equal()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

We calculate the principal components scores using the loadings

X %*% w
##             PC1          PC2
## 1 -8.485281e+00 0.000000e+00
## 2 -5.656854e+00 0.000000e+00
## 3 -2.828427e+00 0.000000e+00
## 4 -3.925231e-17 1.177569e-16
## 5  2.828427e+00 0.000000e+00
## 6  5.656854e+00 0.000000e+00
## 7  8.485281e+00 0.000000e+00

and check that corresponds to the principal components scores given directly by prcomp

pr$x
##             PC1          PC2
## 1 -8.485281e+00 0.000000e+00
## 2 -5.656854e+00 0.000000e+00
## 3 -2.828427e+00 0.000000e+00
## 4 -3.925231e-17 1.177569e-16
## 5  2.828427e+00 0.000000e+00
## 6  5.656854e+00 0.000000e+00
## 7  8.485281e+00 0.000000e+00
X %*% w == pr$x
##    PC1  PC2
## 1 TRUE TRUE
## 2 TRUE TRUE
## 3 TRUE TRUE
## 4 TRUE TRUE
## 5 TRUE TRUE
## 6 TRUE TRUE
## 7 TRUE TRUE

We plot the the scores in the new coordinate system

ggplot(pr$x %>% as_tibble(), aes(PC1, PC2)) +
  geom_point() +
  coord_equal()

Checking centering and scaling in prcomp

library(tidyverse)

iris_center_scale <- iris %>% 
  mutate_if(is.numeric, funs( (. - mean(.)) / sd(.)))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
# we check that is properly centered and scaled 
iris_center_scale %>% summarise_if(is.numeric, funs(mean, sd))
##   Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1     -4.484318e-16     2.034094e-16     -2.895326e-17    -2.989362e-17
##   Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd
## 1               1              1               1              1

PCA is the centered and scaled data

prcomp(iris_center_scale[-5], scale = FALSE, center = FALSE) %>% 
  biplot()

PCA automatically centering and scaling

prcomp(iris[-5], scale = TRUE, center = TRUE) %>% 
  biplot()