**Introduction to Random Forest in R**

##### Let’s learn from a concise demo of Random Forest in R for machine learning and data analytics. Open RStudio and type along with the code below. Learn by practice.

**Importing the libraries**

We need to import libraries such as randomForest in order to use the random forest algorithm in R.

```
#### Setting the seed so that we get the same results each time we run randomForest
set.seed(123)
#### Importing the library MASS for birthwt dataset and library randomForest for
#### randomForest model
library(MASS,quietly = TRUE)
library(randomForest,quietly = TRUE)
```

```
## Warning: package 'randomForest' was built under R version 3.2.2
```

```
## randomForest 4.6-12
```

```
## Type rfNews() to see new features/changes/bug fixes.
```

**Reading the data**

Let’s read the data so that we can implement Random forest algorithm in R.

```
#### Storing the data set named "birthwt" into a data frame named "DataFrame"
DataFrame <- birthwt
#### Type help("birthwt") to know about the data set
#### Lets check out the structure of the data
str(DataFrame)
```

```
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
```

**Data Exploration**

Before we begin to apply random forest in R, let’s first explore the data set.

```
#### Check the dimension of this data frame
dim(DataFrame)
```

```
## [1] 189 10
```

```
#### Check first 3 rows
head(DataFrame,3)
```

```
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
```

```
#### Check summary of data
summary(DataFrame)
```

```
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt
## Min. :0.0000 Min. : 709
## 1st Qu.:0.0000 1st Qu.:2414
## Median :0.0000 Median :2977
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
```

**Categorical Variables**

We need to convert the categorical variables into factor variables in order to use random forest in R.

```
#### Check the number of unique values
apply(DataFrame,2,function(x) length(unique(x)))
```

```
## low age lwt race smoke ptl ht ui ftv bwt
## 2 24 75 3 2 4 2 2 6 131
```

```
#### The variables low, race, smoke, ptl, ht, ui and ftv are categorical
#### Use as.factor to convert them into factor variables
cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
for (i in cols) {
  DataFrame[, i] <- as.factor(DataFrame[, i])
}
#### Lets check the data set again
str(DataFrame)
```

```
## 'data.frame': 189 obs. of 10 variables:
## $ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
## $ ptl : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ht : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ ui : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
## $ ftv : Factor w/ 6 levels "0","1","2","3",..: 1 4 2 3 1 1 2 2 2 1 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
```
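The same conversion can also be written without an explicit loop; a compact alternative sketch using lapply:

```r
# Compact alternative: convert all categorical columns in one step with lapply
library(MASS)                    # provides the birthwt data set
DataFrame <- birthwt
cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
DataFrame[cols] <- lapply(DataFrame[cols], as.factor)
str(DataFrame[cols])             # all seven columns are now factors
```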

**Data Partition**

We need to partition the data into training and testing sets. The testing set is required in order to assess the accuracy of the random forest model.

```
#### Let's create the train and test data sets. The target variable is low
library(caTools)
ind <- sample.split(Y = DataFrame$low, SplitRatio = 0.7)
trainDF <- DataFrame[ind, ]
testDF  <- DataFrame[!ind, ]
#### Random Forest parameters
# 1. mtry     = number of variables randomly sampled at each split
# 2. ntree    = number of trees to grow
# 3. nodesize = minimum size of terminal nodes
```
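sample.split from caTools stratifies on low so both sets keep roughly the same class ratio. If caTools is not available, a plain base-R split works as a rough sketch (note this version does not stratify by class):

```r
# Base-R alternative: a simple, non-stratified 70/30 train/test split
library(MASS)                          # provides the birthwt data set
set.seed(123)
DataFrame <- birthwt                   # 189 rows
n        <- nrow(DataFrame)
trainIdx <- sample(seq_len(n), size = floor(0.7 * n))
trainDF2 <- DataFrame[trainIdx, ]      # 132 rows for training
testDF2  <- DataFrame[-trainIdx, ]     # 57 rows for testing
```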

**Model Fitting for Random Forest in R**

Let’s now fit the random forest model in R using the randomForest function.

```
#### Fitting the model
modelRandom <- randomForest(low ~ ., data = trainDF, mtry = 3, ntree = 20)
#### Looking at the summary of the model
modelRandom
```

```
##
## Call:
## randomForest(formula = low ~ ., data = trainDF, mtry = 3, ntree = 20)
## Type of random forest: classification
## Number of trees: 20
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.79%
## Confusion matrix:
##    0  1 class.error
## 0 89  2  0.02197802
## 1  3 38  0.07317073
```
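As a side note, for classification randomForest defaults mtry to the floor of the square root of the number of predictors; with the 9 predictors here that default is 3, which is exactly the value passed above. A quick check:

```r
# Default mtry for a classification forest: floor(sqrt(p)), p = number of predictors
p            <- 9            # birthwt has 10 columns; the target low is excluded
default_mtry <- floor(sqrt(p))
default_mtry                 # 3
```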

**Variable Importance in Random Forest in R**

Let’s check which of the predictor variables in the random forest model have high importance for predictions. We can use the importance function for this purpose.

```
#### Checking the importance of each variable
#### A higher MeanDecreaseAccuracy or MeanDecreaseGini score implies
#### a more important variable in the model
importance(modelRandom)
```

```
## MeanDecreaseGini
## age 2.43397938
## lwt 1.93633860
## race 0.49444115
## smoke 1.44438665
## ptl 1.77731653
## ht 0.13828860
## ui 0.08424019
## ftv 0.84281661
## bwt 46.56713167
```

```
varImpPlot(modelRandom)
```
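To read off a ranking rather than eyeball the plot, the importance matrix can be sorted by MeanDecreaseGini. A self-contained sketch (refitting a small forest on the full data set for illustration, so the exact scores will differ from those above):

```r
# Rank predictors by MeanDecreaseGini, most important first
library(MASS)
library(randomForest)
set.seed(123)
DataFrame <- birthwt
cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
for (i in cols) DataFrame[, i] <- as.factor(DataFrame[, i])
rf  <- randomForest(low ~ ., data = DataFrame, ntree = 20)
imp <- importance(rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```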

**Predictions**

Let’s now check what the random forest model predicts for the test data set and compare the predicted values with the actual values.

```
#### Predictions
PredictionsWithClass <- predict(modelRandom, testDF, type = 'class')
t <- table(predictions = PredictionsWithClass, actual = testDF$low)
t
```

```
##            actual
## predictions  0  1
##           0 39  0
##           1  0 18
```

**Accuracy Metric**

From the above confusion matrix we can calculate the accuracy metric. The diagonal entries are correct predictions and the off-diagonal entries are incorrect predictions made by the random forest model in R. Note that perfect accuracy is expected here: in the birthwt data set, low is simply an indicator of bwt < 2.5 kg, so keeping bwt among the predictors lets the model recover low almost directly.

```
#### Accuracy metric
sum(diag(t))/sum(t)
```

```
## [1] 1
```
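Beyond accuracy, sensitivity and specificity fall straight out of the same table. A sketch using the counts printed above (39 true negatives and 18 true positives):

```r
# Sensitivity and specificity from the 2x2 confusion matrix shown above
t <- matrix(c(39, 0, 0, 18), nrow = 2,
            dimnames = list(predictions = c("0", "1"), actual = c("0", "1")))
accuracy    <- sum(diag(t)) / sum(t)        # (39 + 18) / 57
sensitivity <- t["1", "1"] / sum(t[, "1"])  # true positives / actual positives
specificity <- t["0", "0"] / sum(t[, "0"])  # true negatives / actual negatives
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```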

**ROC Curve**

Let’s plot the ROC curve for the random forest model in R. The auc function calculates the AUC value.

```
#### Plotting ROC curve and calculating AUC metric
library(pROC,quietly = TRUE)
```

```
## Type 'citation("pROC")' for a citation.
```

```
##
## Attaching package: 'pROC'
```

```
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
```

```
PredictionsWithProbs <- predict(modelRandom, testDF, type = 'prob')
auc <- auc(testDF$low, PredictionsWithProbs[, 2])
plot(roc(testDF$low, PredictionsWithProbs[, 2]))
```

```
##
## Call:
## roc.default(response = testDF$low, predictor = PredictionsWithProbs[, 2])
##
## Data: PredictionsWithProbs[, 2] in 39 controls (testDF$low 0) < 18
## cases (testDF$low 1).
## Area under the curve: 1
```

**Best Random Forest Model**

Let’s tune the mtry parameter of the random forest model using the tuneRF function in R.

```
#### To find the best mtry (the response low is excluded from the predictors)
bestmtry <- tuneRF(trainDF[, -1], trainDF$low, ntreeTry = 200,
                   stepFactor = 1.2, improve = 0.01, trace = TRUE, plot = TRUE)
```

```
## mtry = 3 OOB error = 0.76%
## Searching left ...
## Searching right ...
```

```
bestmtry
```

```
## mtry OOBError
## 3.OOB 3 0.007575758
```

```
#### You can now use this bestmtry to fit the model again to get better predictions
```
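Putting the last comment into practice, a self-contained sketch that tunes mtry on the predictors only and refits with the best value (fitted on the full data set here for brevity, so the numbers will differ slightly from the split used above):

```r
# Tune mtry on the predictors, then refit the forest with the best value
library(MASS)
library(randomForest)
set.seed(123)
DataFrame <- birthwt
cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
for (i in cols) DataFrame[, i] <- as.factor(DataFrame[, i])
# The response low (column 1) is excluded from the predictor matrix
bm <- tuneRF(DataFrame[, -1], DataFrame$low, ntreeTry = 200,
             stepFactor = 1.2, improve = 0.01, trace = FALSE, plot = FALSE)
best_m <- bm[which.min(bm[, "OOBError"]), "mtry"]
tuned  <- randomForest(low ~ ., data = DataFrame, mtry = best_m, ntree = 200)
tuned
```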