K-means Clustering for Data Analytics in R


Introduction

This tutorial shows how to perform k-means clustering in R and how to find the best value of k for k-means clustering.

Importing library

Let’s open RStudio and follow along!

Let’s import the ggplot2 library, which is needed for the ggplot visualization.

library(ggplot2)

Reading Dataset

Let’s load the built-in data set named “iris” into a data frame named “DataFrame”.

DataFrame <- iris

Looking at data structure

Let’s check the structure of the data set with str().

str(DataFrame)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


2D plot before k-means clustering

Let’s have a look at a 2D plot of Petal.Length and Petal.Width. We can easily see from the plot that the data points are clustered and can be divided into three clusters or groups.

# size is a fixed value, not a mapped variable, so it goes outside aes()
ggplot(DataFrame) + geom_point(aes(x = Petal.Length,
                                   y = Petal.Width,
                                   color = Species),
                               size = 2) +
  scale_x_continuous(name = "Length of Petal") +
  scale_y_continuous(name = "Width of Petal") +
  theme_bw()
[Figure: k-means clusters]

Performing k-means Clustering 

We already know that there are three clusters or groups in this data set based on the Species variable: setosa, versicolor and virginica. So let’s fit the model with the kmeans() function and see what k-means finds. The argument centers=3 means that we are looking for 3 clusters or groups in the data. Only the four numeric columns (1 to 4) are passed in, because k-means works on numeric data.

set.seed(12345)
kmeansClust <- kmeans(DataFrame[, 1:4],centers=3)

Summary of k-means model

The output of this kmeans algorithm is as follows.

kmeansClust
## K-means clustering with 3 clusters of sizes 62, 38, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.901613    2.748387     4.393548    1.433871
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.006000    3.428000     1.462000    0.246000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
## [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
## [141] 2 2 1 2 2 2 1 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 39.82097 23.87947 15.15100
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
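
Because iris carries the true Species labels, one way to check the fit is to cross-tabulate the cluster assignments against them. The sketch below is an illustration, not part of the original walkthrough. The names `fit`, `tab` and `purity` and the `nstart = 25` setting are assumptions added here (nstart reruns k-means from several random starts and keeps the best run). Cluster numbers are arbitrary labels, so they may not match the numbering in the output above.

```r
# Sketch: compare k-means clusters with the known species labels.
# `fit`, `tab`, `purity` and nstart = 25 are illustrative choices,
# not taken from the original post.
set.seed(12345)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Rows are cluster numbers (arbitrary labels), columns are species.
tab <- table(Cluster = fit$cluster, Species = iris$Species)
print(tab)

# Share of flowers that belong to the majority species of their cluster.
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(purity)
```

In a table like this, setosa would be expected to sit in a cluster of its own, while versicolor and virginica share some points, which matches the mixed 1s and 2s in the clustering vector above.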

Variance before Clustering

The total variation in the data before clustering, measured as the total sum of squares ((n - 1) times the sum of the four column variances), is

wss <- (nrow(DataFrame)-1)*sum(apply(DataFrame[,1:4],2,var))
wss
## [1] 681.3706
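
The figure above can be tied back to the k-means output: the total sum of squares splits into a within-cluster part and a between-cluster part, and the ratio between_SS / total_SS is what the summary reports as a percentage. The following is a minimal sketch of that check. The names `X`, `total_ss`, `fit` and `ratio` and the `nstart = 25` setting are assumptions added here for illustration.

```r
# Sketch: check the sum-of-squares decomposition reported by kmeans().
# `X`, `total_ss`, `fit`, `ratio` and nstart = 25 are illustrative choices.
X <- iris[, 1:4]

# Total sum of squares, computed the same way as in the text.
total_ss <- (nrow(X) - 1) * sum(apply(X, 2, var))

set.seed(12345)
fit <- kmeans(X, centers = 3, nstart = 25)

# kmeans() stores the same quantity in $totss.
print(all.equal(total_ss, fit$totss))

# total = within + between
print(all.equal(fit$totss, fit$tot.withinss + fit$betweenss))

# Fraction of the total variation explained by the clustering.
ratio <- fit$betweenss / fit$totss
print(ratio)
```

With the numbers printed earlier, the within-cluster sums 39.82097 + 23.87947 + 15.15100 = 78.85144, and 1 - 78.85144 / 681.3706 is about 0.884, the 88.4 % shown in the summary.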

Finding best value of k

Let’s pretend that we don’t know the Species variable, so we also don’t know the number of clusters in the data. We therefore need to fit the algorithm with different values of the number of clusters, k, say from 2 to 15.

We should expect the elbow point to be around 3, which is equal to the number of species in the data.

# wss[1] already holds the total sum of squares (k = 1); fill in k = 2..15
for (i in 2:15) wss[i] <- kmeans(DataFrame[, 1:4],
                                 centers = i)$tot.withinss

SSE (sum of squared errors) vs k

Look for a bend or elbow in the scree plot of the sum of squared errors (SSE). The location of the elbow in the resulting plot suggests that a suitable number of clusters for this data set is 3:

plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
[Figure: Sum of squared errors (SSE) vs. number of clusters]
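
Since kmeans() starts from randomly chosen centers, a single run per k can land in a poor local optimum and make the elbow curve bumpy. A hedged variant of the loop above fixes the seed and uses several random starts per k. The names `X`, `ks` and `wss_by_k` and the `nstart = 25` value are assumptions chosen for illustration, not settings from the original walkthrough.

```r
# Sketch: elbow curve with repeated random starts for each k.
# `X`, `ks`, `wss_by_k` and nstart = 25 are illustrative choices.
X <- iris[, 1:4]
ks <- 1:15

set.seed(12345)
wss_by_k <- sapply(ks, function(k) {
  if (k == 1) {
    # One cluster: the within-cluster SS is the total sum of squares.
    return((nrow(X) - 1) * sum(apply(X, 2, var)))
  }
  # Keep the best of 25 random starts for this k.
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})

plot(ks, wss_by_k, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

The curve is read the same way as before: the k after which adding clusters stops reducing the within-group sum of squares by much is the candidate elbow.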