# Random Forest using Python

##### Let’s learn from a short demo: fitting a Random Forest on the Titanic data set

Description: On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

Machine learning problem: predict which passengers survived the tragedy, based on the given data.

What we will do:
1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Fit a Random Forest Classifier
4. Predict for the test data set

Let's import the libraries. Note that the old `sklearn.cross_validation` module has been removed from recent scikit-learn releases; `cross_val_score` now lives in `sklearn.model_selection`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline
```

Let's import the data sets. We clean and score a `test` frame later, so both files are loaded here; the test file is assumed to sit in the same folder as `train.csv`:

```python
train = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')
```

Please have a look at the post on missing values before you proceed. Let's create a function for the cleaning:

```python
def data_cleaning(train):
    # Fill missing values: column median for Age and Fare, modal port for Embarked
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")

    # Encode Sex numerically
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    # Encode port of embarkation numerically
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train
```

Let's clean the data

```python
train = data_cleaning(train)
test = data_cleaning(test)
```
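To sanity-check the cleaner without the Kaggle files, here is a minimal sketch on a synthetic frame (the values are made up; the function mirrors `data_cleaning` above):

```python
import pandas as pd

def data_cleaning(train):
    # Same logic as the data_cleaning function above
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2
    return train

# Synthetic three-row frame with one missing value per column (hypothetical data)
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Fare": [7.25, None, 71.28],
    "Embarked": ["S", None, "Q"],
    "Sex": ["male", "female", "male"],
})
toy = data_cleaning(toy)
print(toy)
```

The missing `Age` becomes the median of 22 and 30 (i.e. 26), the missing `Embarked` is filled with "S" and then encoded to 0, and `Sex` is mapped to 0/1.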

Let's choose the predictor variables. We will not use the Cabin and PassengerId variables:

```python
predictor_Vars = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
```

Let's choose the model parameters:

```python
modelRandom = RandomForestClassifier(n_estimators=1000, max_depth=4,
                                     max_features=3, random_state=123)
```


Let's do the 5-fold cross-validation:

```python
modelRandomCV = cross_val_score(modelRandom, train[predictor_Vars],
                                train["Survived"], cv=5)
```

Let's check the accuracy of each of the five folds:

```python
modelRandomCV
```
```
array([ 0.83240223,  0.82681564,  0.8258427 ,  0.79213483,  0.85875706])
```

Let's see the same information on a plot:

```python
plt.plot(modelRandomCV, "p")
```
```
[<matplotlib.lines.Line2D at 0xa981470>]
```

Let's check the mean accuracy across all five folds:

```python
print(modelRandomCV.mean())
```
```
0.827190493466
```

Let's now fit the model with the same parameters on the whole training set, instead of the 4/5 of it used in each cross-validation fold.

```python
modelRandom = RandomForestClassifier(n_estimators=1000, max_depth=4,
                                     max_features=3, random_state=123)
modelRandom.fit(train[predictor_Vars], train.Survived)
```
```
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features=3, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False)
```
Finally, let's predict for the test data set:

```python
predictions = modelRandom.predict(test[predictor_Vars])
```
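A common follow-up, not shown in the post itself, is writing the predictions to a Kaggle-style submission file. A hedged sketch, with hypothetical stand-ins for the real `test` frame and `predictions` array:

```python
import pandas as pd

# Hypothetical stand-ins; in the real workflow these come from the steps above
test = pd.DataFrame({"PassengerId": [892, 893, 894]})
predictions = [0, 1, 0]

# Kaggle's Titanic competition expects exactly these two columns
submission = pd.DataFrame({"PassengerId": test["PassengerId"],
                           "Survived": predictions})
submission.to_csv("titanic_submission.csv", index=False)
```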