Biplot in PCA

How to Perform PCA in R

Here we discuss how to perform principal component analysis (PCA) in R. Although PCA is most useful for data sets with very high dimensionality, we will use the iris data set for a simple demonstration. The code below loads the MASS library, although iris itself ships with R's built-in datasets package and is available without it. The iris data set has four feature variables plus Species, which is the target variable, so we will use only the first four columns for the demonstration.

Importing Libraries

Let’s import the library MASS 

library(MASS,quietly = TRUE)

Reading the Dataset

Store the data set named "iris" in a data frame named "DataFrame".

DataFrame <- iris

Investigating the dataset

Type help("iris") to learn about the data set.

help("iris")

Let's check the structure of the data.

str(DataFrame)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Check the dimension of this data frame

dim(DataFrame)
## [1] 150 5

Check first 3 rows

head(DataFrame,3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

Check the summary of data

summary(DataFrame)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50

Check the number of unique values

apply(DataFrame,2,function(x) length(unique(x)))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##           35           23           43           22            3

Let's check the structure of the data set again. It is unchanged, because none of the steps above modified the data.

str(DataFrame)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...




Performing PCA using prcomp function

Let's run the principal component analysis. Setting center = TRUE and scale. = TRUE means each variable is centered to mean zero and scaled to unit variance before the PCA.

modelPCA <- prcomp(x = DataFrame[, 1:4],
                   center = TRUE,
                   scale. = TRUE)
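To see why scaling matters here, the following is a minimal sketch that fits the PCA with and without scaling and compares the proportion of variance per component. The names X, pca_scaled, pca_unscaled and prop_var are illustrative and not part of the original walkthrough.

```r
# Sketch: compare PCA with and without scaling on the four iris measurements.
X <- iris[, 1:4]

pca_scaled   <- prcomp(X, center = TRUE, scale. = TRUE)
pca_unscaled <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of total variance carried by each component (illustrative helper)
prop_var <- function(fit) fit$sdev^2 / sum(fit$sdev^2)

print(round(rbind(scaled   = prop_var(pca_scaled),
                  unscaled = prop_var(pca_unscaled)), 4))

# Raw variances of the inputs: the variable with the largest variance
# has the most influence on the components when scale. = FALSE
print(round(sapply(X, var), 3))
```

Without scaling, the variable with the largest raw variance (Petal.Length in this data set) pulls the first component toward itself, so the first PC appears to explain more than it does on standardized data.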

Variance Explained 

Plot the variance explained by principal components

plot(modelPCA, type = "l",
     main = "Variance explained by PCA")

Figure: Variance explained by each of the PCs
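The heights in this plot are the component variances, that is, the squared standard deviations stored in the fitted object. As a rough check, and assuming the data were centered and scaled as above, these variances should match the eigenvalues of the correlation matrix of the four features. The variable names below are illustrative.

```r
# Sketch: the variances plotted for a scaled PCA are the eigenvalues
# of the correlation matrix of the input variables.
X   <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- eigen(cor(X))

variances <- pca$sdev^2

# TRUE when the two sets of values agree up to numerical tolerance
print(all.equal(variances, eig$values))

# With standardized inputs the variances add up to the number of variables
print(sum(variances))
```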




Variance explained by the first two PCs

Let's find the proportion of variance explained by the first two principal components (PCs).

sum(modelPCA$sdev[1:2]^2)/sum(modelPCA$sdev^2)
## [1] 0.9581321

We can see from the result above that the first two principal components alone explain 95.8 % of the variance in the data.

Let's check the complete summary of the PCA. It shows the standard deviation and the proportion of variance explained by each principal component. The cumulative proportion for PC3 is 0.995, which means that the first three components together explain 99.5 % of the variability in the data set.

summary(modelPCA)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

We can see from the summary that the proportions of variance explained by the first two principal components add up to 95.8 % (0.7296 + 0.2285 = 0.9581).
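The same numbers can be computed by hand from the standard deviations, which also makes it easy to pick the number of components needed to reach a target. This is a small sketch; the 95 % threshold and the names prop_var, cum_var and n_keep are assumptions made for illustration.

```r
# Sketch: proportion and cumulative proportion of variance, computed by hand.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

prop_var <- pca$sdev^2 / sum(pca$sdev^2)  # proportion per component
cum_var  <- cumsum(prop_var)              # running total

print(round(rbind(proportion = prop_var, cumulative = cum_var), 4))

# Smallest number of components whose cumulative proportion reaches
# the (assumed) 95 % target
threshold <- 0.95
n_keep <- which(cum_var >= threshold)[1]
print(n_keep)
```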

Biplot

Let's draw the biplot, which shows the observations in the space of the first two PCs together with arrows for the original features. Each arrow represents the loadings of one original variable on PC1 and PC2.

biplot(modelPCA)
Figure: Biplot in PCA
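The arrows in the biplot are rescaled versions of the loadings stored in the rotation element of the fitted object. A minimal sketch for inspecting them directly follows; the variable names are illustrative.

```r
# Sketch: inspect the loadings that the biplot arrows are based on.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Rows are the original variables, columns are PC1 and PC2
loadings <- pca$rotation[, 1:2]
print(round(loadings, 3))

# Each loading vector has unit length ...
print(colSums(loadings^2))

# ... and the two vectors are orthogonal (value close to zero)
print(sum(loadings[, 1] * loadings[, 2]))
```

Variables whose loadings have similar size and sign on PC1 and PC2 appear as arrows pointing in similar directions in the biplot.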




Data visualization in Principal components Space

Let's visualize the data. Instead of the original features such as Sepal.Width and Petal.Width, we plot the observations using their coordinates on the first two principal components, and color the points by the Species variable. The PCA has worked: with just two principal components, setosa forms a cluster that is clearly separated from the other two species, while versicolor and virginica lie next to each other with only a little overlap.

library(ggplot2)
# Plot the scores on the first two PCs, one color per species
ggplot(as.data.frame(modelPCA$x[, 1:2])) +
  geom_point(aes(x = PC1, y = PC2),
             col = colors()[c(55, 150, 300)][as.factor(DataFrame$Species)])

Figure: Data points in PC space
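The coordinates plotted here are the scores stored in the x element of the fitted object. As a sketch of where they come from, they can be rebuilt by standardizing the data and multiplying by the loadings; the per-species means at the end are a simple numeric companion to the plot. The names Z and scores are illustrative.

```r
# Sketch: rebuild the PC scores from the standardized data and the loadings.
X   <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

Z      <- scale(X, center = TRUE, scale = TRUE)  # standardized features
scores <- Z %*% pca$rotation                     # project onto the PCs

# TRUE when the hand-computed scores match those returned by prcomp
print(all.equal(unname(scores), unname(pca$x)))

# Mean position of each species on PC1 and PC2
print(aggregate(pca$x[, 1:2], by = list(Species = iris$Species), FUN = mean))
```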