Random Forest using Python

Let's walk through a quick demo of fitting a Random Forest to the Titanic data set for machine learning.

Description: On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. This tragedy led to better safety regulations for ships.

Machine learning problem: predict which passengers survived the tragedy, based on the given data.

What we will do:
1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Modelling with a Random Forest Classifier
4. Prediction on the test data set

Let's import the libraries

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline

Let's import the data set

train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')

Please have a look at the post on handling missing values before you proceed. Let's create a function for cleaning:

def data_cleaning(train):
    # fill missing numeric values with the column median
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    # fill missing Embarked values with the most common port, "S"
    train["Embarked"] = train["Embarked"].fillna("S")

    # encode Sex as 0/1
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    # encode Embarked as 0/1/2
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train

Let's clean the data

train=data_cleaning(train)
test=data_cleaning(test)
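As a quick sanity check, we can run the same cleaning steps on a tiny made-up sample (not the real Titanic data) and confirm that no missing values remain and the categorical columns are encoded as numbers:

```python
import pandas as pd

def data_cleaning(df):
    # same cleaning steps as in the function above
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna("S")
    df.loc[df["Sex"] == "male", "Sex"] = 0
    df.loc[df["Sex"] == "female", "Sex"] = 1
    df.loc[df["Embarked"] == "S", "Embarked"] = 0
    df.loc[df["Embarked"] == "C", "Embarked"] = 1
    df.loc[df["Embarked"] == "Q", "Embarked"] = 2
    return df

# hypothetical three-row sample with one missing value per column
sample = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Fare": [7.25, 71.28, None],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", None, "Q"],
})
cleaned = data_cleaning(sample)
print(cleaned.isnull().sum().sum())  # 0 -> no missing values left
```

The missing Age is filled with the median of the non-missing ages, the missing Embarked value becomes "S" (and hence 0), and both text columns end up numeric, which is what the classifier needs.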

Let's choose the predictor variables. We will not use the Cabin and PassengerId variables.

predictor_Vars = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

Let's choose model parameters

modelRandom = RandomForestClassifier(n_estimators=1000,max_depth=4,max_features=3,random_state=123)


Let's do the 5-fold cross-validation

modelRandomCV = cross_val_score(modelRandom, train[predictor_Vars], train["Survived"], cv=5)


Let's check the accuracy metric of each of the five folds

modelRandomCV
array([ 0.83240223,  0.82681564,  0.8258427 ,  0.79213483,  0.85875706])

Let's see the same information on the plot

plt.plot(modelRandomCV,"p")
[<matplotlib.lines.Line2D at 0xa981470>]
[Plot: accuracy metric of each of the five cross-validation folds]

Let's check the mean model accuracy of all five folds

print(modelRandomCV.mean())

0.827190493466

Let's now fit the model, with the same parameters, on the whole training set instead of the 4/5 of the data used in each cross-validation fold

modelRandom = RandomForestClassifier(n_estimators=1000,max_depth=4,max_features=3,random_state=123)
modelRandom.fit(train[predictor_Vars], train.Survived)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features=3, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False)

Let's predict for the test data set

predictions = modelRandom.predict(test[predictor_Vars])
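One nice extra that Random Forests give us for free is a per-feature importance score, available on the fitted model as `feature_importances_` (so `modelRandom.feature_importances_` can be inspected the same way). A minimal self-contained sketch on synthetic data, not the Titanic set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data: three features, but only the first one
# actually determines the label
rng = np.random.RandomState(123)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=123)
model.fit(X, y)

# importances sum to 1; a higher value means the feature produced
# more useful splits across the trees
for name, imp in zip(["f0", "f1", "f2"], model.feature_importances_):
    print(name, round(imp, 3))
```

On this toy data the first feature dominates the ranking, as expected; on the Titanic model the same attribute shows which of the seven predictors the forest leaned on most.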