
Introduction to Deep Learning in R

This is a concise demo of deep learning for machine learning using h2o in R. H2O is "The Open Source In-Memory, Prediction Engine for Big Data Science". The h2o R package provides functions for building GLM, GBM, K-means, Naive Bayes, Principal Components Analysis, Principal Components Regression, Random Forests and Deep Learning (multi-layer neural net) models.
Open your RStudio and follow along!

Importing libraries

#### Let's import the data set from package MASS 
#### Also import h2o package for using h2o
library(MASS)
library(h2o)

Reading the dataset

#### Storing the data set named "Boston" into DataFrame
DataFrame <- Boston

#### To get help on the Boston dataset, uncomment the following line
#### help("Boston")

#### Let's have a look at the structure of the Boston data
str(DataFrame) 
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Data Exploration

#### Histogram of the target or outcome variable "medv"
hist(DataFrame$medv,col=colors()[100:110],
     breaks = 10,main="Histogram of medv",
     xlab="medv"
     )
Histogram of medv
####  Check the dimension of this data frame
dim(DataFrame)
## [1] 506  14
####  Check first 3 rows
head(DataFrame,3)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
#### Check the summary of each variable
summary(DataFrame)
##       crim                 zn                indus         
##  Min.   : 0.006320   Min.   :  0.00000   Min.   : 0.46000  
##  1st Qu.: 0.082045   1st Qu.:  0.00000   1st Qu.: 5.19000  
##  Median : 0.256510   Median :  0.00000   Median : 9.69000  
##  Mean   : 3.613524   Mean   : 11.36364   Mean   :11.13678  
##  3rd Qu.: 3.677083   3rd Qu.: 12.50000   3rd Qu.:18.10000  
##  Max.   :88.976200   Max.   :100.00000   Max.   :27.74000  
##       chas                 nox                  rm          
##  Min.   :0.00000000   Min.   :0.3850000   Min.   :3.561000  
##  1st Qu.:0.00000000   1st Qu.:0.4490000   1st Qu.:5.885500  
##  Median :0.00000000   Median :0.5380000   Median :6.208500  
##  Mean   :0.06916996   Mean   :0.5546951   Mean   :6.284634  
##  3rd Qu.:0.00000000   3rd Qu.:0.6240000   3rd Qu.:6.623500  
##  Max.   :1.00000000   Max.   :0.8710000   Max.   :8.780000  
##       age                dis                 rad           
##  Min.   :  2.9000   Min.   : 1.129600   Min.   : 1.000000  
##  1st Qu.: 45.0250   1st Qu.: 2.100175   1st Qu.: 4.000000  
##  Median : 77.5000   Median : 3.207450   Median : 5.000000  
##  Mean   : 68.5749   Mean   : 3.795043   Mean   : 9.549407  
##  3rd Qu.: 94.0750   3rd Qu.: 5.188425   3rd Qu.:24.000000  
##  Max.   :100.0000   Max.   :12.126500   Max.   :24.000000  
##       tax              ptratio             black         
##  Min.   :187.0000   Min.   :12.60000   Min.   :  0.3200  
##  1st Qu.:279.0000   1st Qu.:17.40000   1st Qu.:375.3775  
##  Median :330.0000   Median :19.05000   Median :391.4400  
##  Mean   :408.2372   Mean   :18.45553   Mean   :356.6740  
##  3rd Qu.:666.0000   3rd Qu.:20.20000   3rd Qu.:396.2250  
##  Max.   :711.0000   Max.   :22.00000   Max.   :396.9000  
##      lstat               medv         
##  Min.   : 1.73000   Min.   : 5.00000  
##  1st Qu.: 6.95000   1st Qu.:17.02500  
##  Median :11.36000   Median :21.20000  
##  Mean   :12.65306   Mean   :22.53281  
##  3rd Qu.:16.95500   3rd Qu.:25.00000  
##  Max.   :37.97000   Max.   :50.00000
#### This will give min and max value for each of the variable
apply(DataFrame,2,range)
##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0
##       black lstat medv
## [1,]   0.32  1.73    5
## [2,] 396.90 37.97   50

Data Transformation & H2O Initialization

Before we start fitting the deep learning model in R, we need to do some data transformation and H2O initialization.

#### It seems the scale of each variable is not the same

### Let's normalize the data set variables to the interval [0,1]
### Normalization is necessary so that each variable is scaled properly
### and none of the variables dominates the model
### The scale function performs min-max scaling here
### Below is the snippet of code for the same

maxValue <- apply(DataFrame, 2, max) 
minValue <- apply(DataFrame, 2, min)
DataFrame<-as.data.frame(scale(DataFrame,center = minValue,
                                         scale = maxValue-minValue))
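As a quick sanity check (on a toy vector, not the Boston data), `scale()` with `center = min` and `scale = max - min` is exactly the usual min-max formula (x - min) / (max - min):

```r
#### Sanity check: scale() with center = min and scale = max - min
#### reproduces min-max scaling (x - min) / (max - min) on a toy vector
x <- c(2, 5, 11, 20)
viaScale <- as.numeric(scale(x, center = min(x), scale = max(x) - min(x)))
manual   <- (x - min(x)) / (max(x) - min(x))
all.equal(viaScale, manual)   # TRUE
range(viaScale)               # 0 1
```

Every column of DataFrame now lies in [0,1], which keeps minValue and maxValue around for undoing the transform later.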

H2O Initialization

####  Let's do the H2O initialization. This will start an H2O cluster on the
####  local machine. There are options for running the same on servers.
####  I'm using 2650 megabytes of RAM out of 8 GB. You can choose according to
####  your RAM configuration.
h2o.init(ip = "localhost",port = 54321,max_mem_size = "2650m")
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\Arpan\AppData\Local\Temp\RtmpI75ZMa/h2o_Arpan_started_from_r.out
##     C:\Users\Arpan\AppData\Local\Temp\RtmpI75ZMa/h2o_Arpan_started_from_r.err
## 
## 
## ...Successfully connected to http://localhost:54321/ 
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         6 seconds 66 milliseconds 
##     H2O cluster version:        3.6.0.8 
##     H2O cluster name:           H2O_started_from_R_Arpan_nhu082 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   2.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
## 
## Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
##        Shut down and restart H2O as shown below to use all your CPUs.
##            > h2o.shutdown()
##            > h2o.init(nthreads = -1)

Data Partition & Modelling

Let’s partition the data set into training and testing sets before we start modelling. We need test data to assess the accuracy of the predictions made by the deep learning model in R. So first we will partition the data set, then discuss the deep learning model parameters in H2O, and finally fit the model.

#### Let's partition the data set into train and test sets
#### (setting a seed so the split is reproducible)

set.seed(1234)
ind<-sample(1:nrow(DataFrame),400)
trainDF<-DataFrame[ind,]
testDF<-DataFrame[-ind,]
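The two index sets are complementary, so no row appears in both train and test. A small sketch of that on a toy data frame (the seed 42 and split size here are arbitrary):

```r
#### Train/test indices from sample() are complementary: no row appears
#### in both sets, and together they cover the whole data frame
set.seed(42)                            # arbitrary seed, for reproducibility
toy   <- data.frame(x = 1:10)
ind   <- sample(1:nrow(toy), 7)         # 7 rows for training
train <- toy[ind, , drop = FALSE]
test  <- toy[-ind, , drop = FALSE]
nrow(train) + nrow(test) == nrow(toy)   # TRUE
length(intersect(train$x, test$x)) == 0 # TRUE
```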


#### To learn about the h2o.deeplearning function, uncomment and run
#### the following code
#### ?h2o.deeplearning


#### Here is a brief overview of the parameters used in h2o.deeplearning

# 1. x = column names of the predictor variables
# 2. y = column name of the target variable, i.e. medv
# 3. activation = Tanh, TanhWithDropout, Rectifier, RectifierWithDropout,
#                 Maxout, etc.
# 4. input_dropout_ratio = fraction of features omitted for each
#                          training row (like random sampling of features)
# 5. l1, l2 = regularization
#    l1 drives weights to exactly zero
#    l2 drives weights close to, but not exactly, zero
# 6. loss = "Automatic", "CrossEntropy" (for classification only),
#           "Quadratic", "Absolute" (experimental) or "Huber"
# 7. distribution = bernoulli, gaussian, multinomial, poisson, gamma, etc.
# 8. stopping_metric = "AUTO", AUC, r2, logloss, etc.
# 9. stopping_tolerance = metric-based stopping criterion
# 10. nfolds = number of folds for cross-validation



#### Let's define x and y
y<-"medv"
x<-setdiff(colnames(DataFrame),y)
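`setdiff()` simply drops the response from the full column list, so `x` ends up holding the 13 predictor names. Illustrated with the Boston column names written out explicitly:

```r
#### setdiff() drops the response column, leaving the 13 predictor names
bostonCols <- c("crim","zn","indus","chas","nox","rm","age",
                "dis","rad","tax","ptratio","black","lstat","medv")
y <- "medv"
x <- setdiff(bostonCols, y)
length(x)       # 13
y %in% x        # FALSE
```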


#### Fitting the Deeplearning  model in H2o
model<-h2o.deeplearning(x=x,
                        y=y,
                        seed = 1234,
                        training_frame = as.h2o(trainDF),
                        nfolds = 3,
                        stopping_rounds = 7,
                        epochs = 400,
                        overwrite_with_best_model = TRUE,
                        activation = "Tanh",
                        input_dropout_ratio = 0.1,
                        hidden = c(10,10),
                        l1 = 6e-4,
                        loss = "Automatic",
                        distribution = "AUTO",
                        stopping_metric = "MSE")
## 
  |=================================================================| 100%
## 
  |=================================================================| 100%

Model Summary

Let’s now check the summary of the fitted deep learning model in R.

#### Let's check the summary of this model
model
## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model ID:  DeepLearning_model_R_1484377135173_5 
## Status of Neuron Layers: predicting medv, regression, gaussian distribution, Quadratic loss, 261 weights/biases, 7.9 KB, 164,000 training samples, mini-batch size 1
##   layer units   type dropout       l1       l2 mean_rate rate_RMS momentum
## 1     1    13  Input 10.00 %                                              
## 2     2    10   Tanh  0.00 % 0.000600 0.000000  0.001421 0.000803 0.000000
## 3     3    10   Tanh  0.00 % 0.000600 0.000000  0.174112 0.350760 0.000000
## 4     4     1 Linear         0.000600 0.000000  0.169970 0.341460 0.000000
##   mean_weight weight_RMS mean_bias bias_RMS
## 1                                          
## 2    0.001932   0.216830 -0.130434 0.400658
## 3   -0.025786   0.239159  0.046105 0.240334
## 4   -0.077879   0.724127  0.720618 0.000000
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## Description: Metrics reported on temporary (load-balanced) training frame
## 
## MSE:  0.004882084245
## R2 :  0.888641872
## Mean Residual Deviance :  0.004882084245
## 
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on cross-validation data. **
## Description: 3-fold cross-validation on training data
## 
## MSE:  0.01677206978
## R2 :  0.6174366931
## Mean Residual Deviance :  0.01677206978
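For a regression model, the reported R2 and MSE are tied together by R2 = 1 - MSE / Var(y), the fraction of variance explained (with the variance taken as the mean squared deviation over the same frame). A sketch of that identity on synthetic data; nothing here depends on H2O:

```r
#### R^2 = 1 - MSE / mean((y - mean(y))^2): fraction of variance explained
set.seed(1)
yTrue <- rnorm(100)
yHat  <- yTrue + rnorm(100, sd = 0.3)   # imperfect predictions
mse   <- mean((yTrue - yHat)^2)
r2    <- 1 - mse / mean((yTrue - mean(yTrue))^2)
```

The gap between the training R2 (0.889) and the cross-validation R2 (0.617) above is the usual sign that the training-frame metrics are optimistic.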

Predictions on test data

Let’s see what the deep learning model predicts for the test data set, and then compare the predictions with the actual values.

#### Let's do the Predictions on test data set 
predictions<-as.data.frame(predict(model,as.h2o(testDF)))
## 
  |=================================================================| 100%
#### Let's check the structure of the predictions. It's a data frame
str(predictions)
## 'data.frame':    106 obs. of  1 variable:
##  $ predict: num  0.342 0.365 0.331 0.234 0.371 ...
#### MSE(Mean Squared Error)
sum((predictions$predict-testDF$medv)^2)/nrow(testDF)
## [1] 0.004096587937
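Keep in mind that this MSE is on the min-max scaled data, so predictions live in [0,1]. To report them in the original medv units (thousands of dollars), invert the transform with original = scaled * (max - min) + min, using the minValue and maxValue vectors saved earlier. A sketch using medv's range from the summary above (min 5, max 50) and the first predicted values shown earlier:

```r
#### Invert min-max scaling: original = scaled * (max - min) + min
minMedv <- 5
maxMedv <- 50
scaledPred <- c(0.342, 0.365, 0.331)   # first predictions from above
origPred   <- scaledPred * (maxMedv - minMedv) + minMedv
origPred   # 20.390 21.425 19.895
```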

Real vs predicted values

#### plotting actual vs predicted values 
plot(testDF$medv,predictions$predict,col='blue',main='Real vs Predicted',
     pch=1,cex=0.9,type = "p",xlab = "Actual",ylab = "Predicted")
abline(0,1,col="black")
Real vs Predicted

Shutting down h2o cluster

#### Let's now shut down the H2O cluster.
h2o.shutdown(prompt=FALSE)
## [1] TRUE