Random Forest

Introduction to Random Forest in R

Let’s learn from a hands-on demo of Random Forest in R for machine learning and data analytics. Open your RStudio and type along with the code below. Learn by practice.

Importing the libraries

We need to import libraries such as randomForest in order to use the random forest algorithm in R.
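If randomForest is not installed yet, you can install it once from CRAN before loading it (a minimal sketch; MASS ships with the standard R distribution):

#### One-time installation if the package is missing
# install.packages("randomForest")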

#### Setting the seed so that we get the same results each time we run randomForest
set.seed(123)

#### Importing the library MASS for birthwt dataset and library randomForest for 
#### randomForest model

library(MASS, quietly = TRUE)
library(randomForest, quietly = TRUE)
## Warning: package 'randomForest' was built under R version 3.2.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

Reading the data

Let’s read the data so that we can implement the random forest algorithm in R.

#### Storing the data set named "birthwt" into a data frame named "DataFrame"

DataFrame <- birthwt


#### Type help("birthwt") to know about the data set 

#### Lets check out the structure of the data 
str(DataFrame)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Data Exploration

Before we begin to apply random forest in R, let’s first explore the data set.

#### Check the dimensions of this data frame
dim(DataFrame)
## [1] 189  10
#### Check first 3 rows
head(DataFrame,3)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
#### Check summary of data 
summary(DataFrame)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Categorical Variables

We need to convert the categorical variables into factors in order to use random forest in R correctly.

#### Check the number of unique values 
apply(DataFrame,2,function(x) length(unique(x)))
##   low   age   lwt  race smoke   ptl    ht    ui   ftv   bwt 
##     2    24    75     3     2     4     2     2     6   131
#### Seems like the variables low, race, smoke, ptl, ht, ui, and ftv are categorical
#### To convert them to factors, use as.factor

#### Converting into factor 
cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
for(i in cols){
  DataFrame[, i] <- as.factor(DataFrame[, i])
}
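As a side note, the same conversion can be written without an explicit loop; a minimal equivalent sketch using base R's lapply, which you could use instead of the loop above:

#### Equivalent one-liner: convert all listed columns to factors at once
DataFrame[cols] <- lapply(DataFrame[cols], as.factor)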


#### Lets check the data set again
str(DataFrame)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : Factor w/ 6 levels "0","1","2","3",..: 1 4 2 3 1 1 2 2 2 1 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Data Partition

We need to partition the data into training and testing sets. The testing set is required to assess the accuracy of the random forest model.

#### Let's create the train and test data sets. The target variable is low
library(caTools)
ind <- sample.split(Y = DataFrame$low, SplitRatio = 0.7)
trainDF <- DataFrame[ind, ]
testDF <- DataFrame[!ind, ]
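Since sample.split stratifies on the target, both partitions should keep roughly the same proportion of low values; a quick sketch to verify this with base R:

#### Check that the class proportions are preserved in both partitions
prop.table(table(trainDF$low))
prop.table(table(testDF$low))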


#### Random forest parameters (an illustrative call using all three follows below)

#1. mtry     = number of variables randomly sampled as candidates at each split
#2. ntree    = number of trees to grow
#3. nodesize = minimum size of terminal nodes
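As a quick illustration, all three parameters can be passed directly to randomForest (the values below are placeholders, not tuned choices):

#### Illustrative call using all three tuning parameters
# randomForest(low ~ ., data = trainDF, mtry = 3, ntree = 500, nodesize = 5)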

Model Fitting for Random Forest in R

Let’s now fit the random forest model in R using the randomForest function.


#### Fitting the model
modelRandom <- randomForest(low ~ ., data = trainDF, mtry = 3, ntree = 20)

#### Looking at the summary of the model
modelRandom
## 
## Call:
##  randomForest(formula = low ~ ., data = trainDF, mtry = 3, ntree = 20) 
##                Type of random forest: classification
##                      Number of trees: 20
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.79%
## Confusion matrix:
##    0  1 class.error
## 0 89  2  0.02197802
## 1  3 38  0.07317073
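To see how the out-of-bag error evolves as trees are added, you can plot the fitted object; plot.randomForest draws the OOB error curves against the number of trees. A brief sketch:

#### Plot OOB error rate against the number of trees grown
plot(modelRandom, main = "OOB error vs. number of trees")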

Variable Importance in Random Forest in R

Let’s check which of the predictor variables in the random forest model have high importance for predictions. We can use the importance function for this purpose.

#### Plotting the importance of each variable
#### A higher mean decrease in accuracy or mean decrease in Gini score implies
#### a more important variable in the model


importance(modelRandom)
##       MeanDecreaseGini
## age         2.43397938
## lwt         1.93633860
## race        0.49444115
## smoke       1.44438665
## ptl         1.77731653
## ht          0.13828860
## ui          0.08424019
## ftv         0.84281661
## bwt        46.56713167
varImpPlot(modelRandom)
[Figure: varImpPlot output showing the importance of each variable]
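Only MeanDecreaseGini appears above because permutation-based importance is computed only when the model is fit with importance = TRUE. A brief sketch to obtain MeanDecreaseAccuracy as well (modelRandom2 is just an illustrative name):

#### Refit with importance = TRUE to also get permutation-based accuracy importance
modelRandom2 <- randomForest(low ~ ., data = trainDF, mtry = 3, ntree = 20,
                             importance = TRUE)
importance(modelRandom2)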

Predictions

Let’s now check what the random forest model predicts for the test data set and then compare these predicted values with the actual values.

#### Predictions
PredictionsWithClass<- predict(modelRandom, testDF, type = 'class')
t<-table(predictions=PredictionsWithClass, actual=testDF$low)
t
##            actual
## predictions  0  1
##           0 39  0
##           1  0 18

Accuracy Metric

From the above confusion matrix we can calculate the accuracy metric. The diagonal entries are correct predictions and the off-diagonal entries are incorrect predictions made by the random forest model in R. A perfect score is expected here: bwt (birth weight in grams) is among the predictors, and low is by definition an indicator of bwt below 2.5 kg, which is also why bwt dominated the importance scores above.

#### Accuracy metric
sum(diag(t))/sum(t)
## [1] 1
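Beyond overall accuracy, the same table yields per-class rates; a small sketch computing sensitivity and specificity from t, treating low = 1 as the positive class:

#### Per-class metrics from the confusion matrix (rows = predictions, columns = actual)
sensitivity <- t["1", "1"] / sum(t[, "1"])  # true positives / actual positives
specificity <- t["0", "0"] / sum(t[, "0"])  # true negatives / actual negatives
c(sensitivity = sensitivity, specificity = specificity)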

ROC Curve

Let’s plot the ROC curve for the random forest model in R. The function auc calculates the AUC value.

#### Plotting ROC curve and calculating AUC metric
library(pROC,quietly = TRUE)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
PredictionsWithProbs<- predict(modelRandom, testDF, type = 'prob')
auc<-auc(testDF$low,PredictionsWithProbs[,2])
plot(roc(testDF$low,PredictionsWithProbs[,2]))
[Figure: ROC curve for the random forest model]
## 
## Call:
## roc.default(response = testDF$low, predictor = PredictionsWithProbs[,     2])
## 
## Data: PredictionsWithProbs[, 2] in 39 controls (testDF$low 0) < 18 
## cases (testDF$low 1).
## Area under the curve: 1
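The AUC value stored in auc above can be printed directly; pROC also provides ci.auc for a confidence interval around it (shown commented out as an optional extra):

#### Print the stored AUC and, optionally, a confidence interval
auc
# ci.auc(testDF$low, PredictionsWithProbs[,2])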

Best Random Forest Model

Let’s tune the mtry parameter of the random forest model using the tuneRF function in R.

#### To find the best mtry (note: the response column low must be excluded from the predictors)
bestmtry <- tuneRF(trainDF[, -1], trainDF$low, ntreeTry = 200, stepFactor = 1.2,
                   improve = 0.01, trace = TRUE, plot = TRUE)
## mtry = 3  OOB error = 0.76% 
## Searching left ...
## Searching right ...
[Figure: tuneRF plot of OOB error against mtry]
bestmtry
##       mtry    OOBError
## 3.OOB    3 0.007575758
#### You can now use this bestmtry to fit the model again to get better predictions
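For completeness, here is a brief sketch of that refit; best.m and modelTuned are illustrative names, and the bestmtry matrix returned by tuneRF has columns mtry and OOBError, as shown in the output above:

#### Extract the mtry with the lowest OOB error and refit with more trees
best.m <- bestmtry[which.min(bestmtry[, "OOBError"]), "mtry"]
modelTuned <- randomForest(low ~ ., data = trainDF, mtry = best.m, ntree = 200)
modelTuned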