Decision Tree Classifier Accuracy metric for five folds

Introduction

Let’s walk through a short demo of fitting a Decision Tree Classifier on the Titanic data set.

Description:

On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

Machine learning Problem

To predict which passengers survived this tragedy, based on the data given.

What we will do:

1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Model used is a Decision Tree Classifier
4. Predict for the test data set

Importing libraries

Let’s import the libraries

import numpy as np
import pandas as pd
from sklearn import tree
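# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cross_validation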


import matplotlib.pyplot as plt
%matplotlib inline

Reading training and testing data sets

Let’s read the training and testing data sets

train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')

Data Cleaning

Let’s create a function for cleaning the training and testing data. Here we are doing two things:
1. Encoding the categorical variables manually
2. Imputing the missing values

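def data_cleaning(df):
    # Impute missing values: median for Age and Fare, "S" for Embarked
    df["Age"] = df["Age"].fillna(df["Age"].median())
    # Fill Fare from the Fare column itself. The outputs printed in this post
    # come from a run that filled Fare from df["Age"], which overwrote every
    # fare with the passenger's age, so they show Fare equal to Age.
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna("S")

    # Encode Sex as integers: male -> 0, female -> 1
    df.loc[df["Sex"] == "male", "Sex"] = 0
    df.loc[df["Sex"] == "female", "Sex"] = 1

    # Encode port of embarkation as integers: S -> 0, C -> 1, Q -> 2
    df.loc[df["Embarked"] == "S", "Embarked"] = 0
    df.loc[df["Embarked"] == "C", "Embarked"] = 1
    df.loc[df["Embarked"] == "Q", "Embarked"] = 2

    return df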
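
To see what the two steps amount to without needing the Titanic files, here is a minimal, dependency-free sketch on a few made-up rows. The rows, the `clean_rows` helper and the two code dictionaries are illustrative assumptions, not part of the original function; the sketch only mirrors its logic (median imputation for Age, "S" as the default port, integer codes for the categories).

```python
from statistics import median

# Made-up rows for illustration only (not taken from the Titanic files).
rows = [
    {"Sex": "male", "Age": 22.0, "Embarked": "S"},
    {"Sex": "female", "Age": None, "Embarked": "C"},
    {"Sex": "female", "Age": 26.0, "Embarked": None},
    {"Sex": "male", "Age": 40.0, "Embarked": "Q"},
]

# Same integer codes as the cleaning function uses.
SEX_CODES = {"male": 0, "female": 1}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}


def clean_rows(rows):
    """Return cleaned copies: impute missing values, then encode categories."""
    # Median of the ages that are actually present.
    age_median = median(r["Age"] for r in rows if r["Age"] is not None)
    cleaned = []
    for r in rows:
        age = r["Age"] if r["Age"] is not None else age_median
        embarked = r["Embarked"] if r["Embarked"] is not None else "S"
        cleaned.append({
            "Sex": SEX_CODES[r["Sex"]],
            "Age": age,
            "Embarked": EMBARKED_CODES[embarked],
        })
    return cleaned


cleaned = clean_rows(rows)
for row in cleaned:
    print(row)
```

The second row gets the median of the known ages, and the third row gets port "S" before encoding, which is the same order of operations as in the function: impute first, encode second.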

 

Let’s clean the data

train=data_cleaning(train)
test=data_cleaning(test)

Selecting Predictor variables

Let’s choose the predictor variables. We will not use the Cabin and PassengerId columns.

predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]

X & y

Let’s separate the predictors and the target. X holds the predictor variables and y is the target variable. We will use these for modelling.

X, y = train[predictor_Vars], train.Survived

Let’s check X

X.iloc[:5]
   Sex  Age  SibSp  Parch  Fare
0    0   22      1      0    22
1    1   38      1      0    38
2    1   26      0      0    26
3    1   35      1      0    35
4    0   35      0      0    35

 

Let’s check y

y.iloc[:5]
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Model Initialization

Let’s initialize the decision tree classifier. You can pass model parameters here if you want; we use the defaults.

modelDecisionTree = tree.DecisionTreeClassifier()

Cross-validation

If you want to know more about cross-validation, please see What is crossvalidation.
Let’s do the 5-fold cross-validation now

modelDecisionTreeCV= cross_validation.cross_val_score(modelDecisionTree,X,y,cv=5)
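
For intuition about what "5 folds" means, here is a minimal sketch of how the row indices can be split so that each fold serves once as the held-out part while the other four are used for training. The helper `k_fold_indices` and the sample size of 10 are made up for illustration. This is a simplification: for a classifier, `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not attempt to reproduce.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # Spread the remainder so the first n_samples % n_folds folds get one extra row.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# 10 samples is a made-up size chosen to keep the output readable.
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row appears in exactly one held-out fold, so every row is used for scoring once and for training four times.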

Let’s check the accuracy metric of each of the five folds

modelDecisionTreeCV
array([ 0.77094972,  0.7877095 ,  0.76966292,  0.74719101,  0.79661017])

Let’s see the same information on a plot

plt.plot(modelDecisionTreeCV,"p")
[<matplotlib.lines.Line2D at 0xa8517f0>]
[Figure: Accuracy metric for 5 folds in the Decision Tree Classifier model]

Let’s check the mean model accuracy of all five folds

print(modelDecisionTreeCV.mean())

0.774424663991
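
The mean alone hides how much the folds disagree with each other. As a small sketch, the five fold accuracies printed earlier can be summarised with the standard library; the variable names here are illustrative, and the scores are simply copied from the cross-validation output rather than recomputed.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross-validation output above.
fold_scores = [0.77094972, 0.7877095, 0.76966292, 0.74719101, 0.79661017]

mean_accuracy = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across the folds

print(f"mean accuracy: {mean_accuracy:.4f}")
print(f"std across folds: {spread:.4f}")
print(f"lowest fold: {min(fold_scores):.4f}, highest fold: {max(fold_scores):.4f}")
```

A spread that is large relative to the differences between candidate models would suggest that a small change in mean accuracy is not meaningful on its own.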

Model Fitting

Let’s now fit the model with the same parameters on the whole training data set, instead of the 4/5 of it used for training in each cross-validation fold

modelDecisionTree= tree.DecisionTreeClassifier()
modelDecisionTree= modelDecisionTree.fit(X, y)

Predictions on test data set

Let’s get the prediction values for the test data set

predictions=modelDecisionTree.predict(test[predictor_Vars])
predictions
array([0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 0], dtype=int64)