Introduction to Random Forest in R
Let’s learn from a hands-on demo of Random Forest in R for machine learning and data analytics. Open RStudio and follow along by typing in the code below. Learn by practice.
Importing the libraries
We need to import libraries such as randomForest in order to use the random forest algorithm in R.
#### Setting the seed so that we get the same results each time we run random forest
set.seed(123)
#### Importing the library MASS for birthwt dataset and library randomForest for
#### randomForest model
library(MASS,quietly = TRUE)
library(randomForest,quietly = TRUE)
## Warning: package 'randomForest' was built under R version 3.2.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
Reading the data
Let’s read the data so that we can implement Random forest algorithm in R.
#### Storing the data set named "birthwt" into a data frame named "DataFrame"
DataFrame <- birthwt
#### Type help("birthwt") to know about the data set
#### Let's check out the structure of the data
str(DataFrame)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
Data Exploration
Before we begin to apply random forest in R, let’s first explore the data set.
#### Check the dimensions of this data frame
dim(DataFrame)
## [1] 189 10
#### Check first 3 rows
head(DataFrame,3)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
#### Check summary of data
summary(DataFrame)
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
Categorical Variables
We need to convert the categorical variables into factor variables in order to use random forest in R.
#### Check the number of unique values
apply(DataFrame,2,function(x) length(unique(x)))
## low age lwt race smoke ptl ht ui ftv bwt
## 2 24 75 3 2 4 2 2 6 131
#### Variables low, race, smoke, ptl, ht, ui and ftv appear to be categorical
#### Converting them into factors with as.factor
cols<-c("low","race","smoke","ptl","ht","ui","ftv")
for(i in cols){
DataFrame[,i]=as.factor(DataFrame[,i])
}
#### Let's check the data set again
str(DataFrame)
## 'data.frame': 189 obs. of 10 variables:
## $ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
## $ ptl : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ht : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ui : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
## $ ftv : Factor w/ 6 levels "0","1","2","3",..: 1 4 2 3 1 1 2 2 2 1 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
Data Partition
We need to partition the data into training and testing sets. The testing set is required in order to measure the accuracy of the random forest model.
#### Let's create the train and test data sets. The target variable is low
library(caTools)
ind<-sample.split(Y = DataFrame$low,SplitRatio = 0.7)
trainDF<-DataFrame[ind,]
testDF<-DataFrame[!ind,]
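Since sample.split performs a stratified split on the target, it is worth sanity-checking that both partitions preserve the class balance of low. A quick check, using the trainDF and testDF objects created above:

```r
#### Sanity check: the class proportions of low should be similar in both sets
prop.table(table(trainDF$low))
prop.table(table(testDF$low))
#### The two partitions together should cover all 189 observations
nrow(trainDF) + nrow(testDF)
```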
#### Random Forest parameters
#1. mtry = number of variables randomly sampled as candidates at each split
#2. ntree = number of trees to grow
#3. nodesize = minimum size of terminal nodes
Model Fitting for Random Forest in R
Let’s now fit the random forest model in R using the randomForest function.
#### Fitting the model
modelRandom<-randomForest(low~.,data = trainDF,mtry=3,ntree=20)
#### Looking at the summary of the model
modelRandom
##
## Call:
## randomForest(formula = low ~ ., data = trainDF, mtry = 3, ntree = 20)
## Type of random forest: classification
## Number of trees: 20
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.79%
## Confusion matrix:
## 0 1 class.error
## 0 89 2 0.02197802
## 1 3 38 0.07317073
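The OOB error printed above can also be extracted from the fitted object programmatically: modelRandom$err.rate stores the cumulative OOB and per-class error rates after each tree, so the last row holds the final estimate. A small sketch:

```r
#### Final OOB error rate = last row of the cumulative error matrix
modelRandom$err.rate[modelRandom$ntree, "OOB"]
```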
Variable Importance in Random Forest in R
Let’s check which of the predictor variables in the random forest model have high importance in predictions. We can use the importance function for this purpose.
#### Plotting the importance of each variables
#### a higher mean decrease in accuracy or mean decrease in Gini score implies
#### a more important variable in the model
importance(modelRandom)
## MeanDecreaseGini
## age 2.43397938
## lwt 1.93633860
## race 0.49444115
## smoke 1.44438665
## ptl 1.77731653
## ht 0.13828860
## ui 0.08424019
## ftv 0.84281661
## bwt 46.56713167
varImpPlot(modelRandom)
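Note that the output above only contains MeanDecreaseGini: randomForest computes the permutation-based MeanDecreaseAccuracy only when the model is fitted with importance = TRUE. A sketch of refitting to obtain both measures (the variable name modelRandomImp is illustrative):

```r
#### Refit with importance = TRUE to also get permutation importance
modelRandomImp <- randomForest(low~., data = trainDF, mtry = 3, ntree = 20,
                               importance = TRUE)
importance(modelRandomImp)   # now includes MeanDecreaseAccuracy
varImpPlot(modelRandomImp)
```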
Predictions
Let’s now check what the random forest model predicts for the test data set and compare these predicted values with the actual values.
#### Predictions
PredictionsWithClass<- predict(modelRandom, testDF, type = 'class')
t<-table(predictions=PredictionsWithClass, actual=testDF$low)
t
## actual
## predictions 0 1
## 0 39 0
## 1 0 18
Accuracy Metric
From the above confusion matrix we can calculate the accuracy metric: the diagonal entries are correct predictions and the off-diagonal entries are incorrect predictions made by the random forest model in R.
#### Accuracy metric
sum(diag(t))/sum(t)
## [1] 1
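The perfect accuracy is explained by the variable importance output above: in the birthwt data, low is an indicator of bwt < 2.5 kg, so keeping the actual birth weight bwt among the predictors leaks the answer. A sketch of refitting without bwt, which gives a more honest picture of predictive performance (modelNoLeak and predNoLeak are illustrative names):

```r
#### Drop the leaking predictor bwt and refit
modelNoLeak <- randomForest(low~. - bwt, data = trainDF, mtry = 3, ntree = 20)
predNoLeak <- predict(modelNoLeak, testDF, type = 'class')
tNoLeak <- table(predictions = predNoLeak, actual = testDF$low)
tNoLeak
#### Accuracy without bwt is typically well below 1
sum(diag(tNoLeak))/sum(tNoLeak)
```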
ROC Curve
Let’s plot the ROC curve for the random forest model in R. The auc function computes the area under the curve.
#### Plotting ROC curve and calculating AUC metric
library(pROC,quietly = TRUE)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
PredictionsWithProbs<- predict(modelRandom, testDF, type = 'prob')
auc<-auc(testDF$low,PredictionsWithProbs[,2])
plot(roc(testDF$low,PredictionsWithProbs[,2]))
##
## Call:
## roc.default(response = testDF$low, predictor = PredictionsWithProbs[, 2])
##
## Data: PredictionsWithProbs[, 2] in 39 controls (testDF$low 0) < 18
## cases (testDF$low 1).
## Area under the curve: 1
Best Random Forest Model
Let’s tune the mtry parameter of the random forest model using the tuneRF function in R.
#### To find the best mtry
#### Note: the target column low must be excluded from the predictor matrix
bestmtry<-tuneRF(trainDF[,-1],trainDF$low,ntreeTry = 200,stepFactor = 1.2,
                 improve = 0.01,trace = TRUE,plot = TRUE)
## mtry = 3 OOB error = 0.76%
## Searching left ...
## Searching right ...
bestmtry
## mtry OOBError
## 3.OOB 3 0.007575758
#### You can now use this bestmtry to fit the model again to get better predictions
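As the comment suggests, the winning mtry can be read off the matrix returned by tuneRF and fed back into randomForest. A sketch (best.m and modelTuned are illustrative names):

```r
#### Pick the mtry with the lowest OOB error and refit with it
best.m <- bestmtry[which.min(bestmtry[, "OOBError"]), "mtry"]
modelTuned <- randomForest(low~., data = trainDF, mtry = best.m, ntree = 200)
modelTuned
```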