January 10, 2017

## Description:

On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

## Machine Learning Problem

Predict which passengers survived this tragedy, based on the data given.

## What we will do:

1. Basic cleaning of missing values in the training and test data sets
2. 5-fold cross-validation
3. Fit a Decision Tree Classifier model
4. Predict on the test data set

### Importing libraries

Let’s import the libraries

``````import numpy as np
import pandas as pd
from sklearn import tree
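# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides cross_val_score
from sklearn import model_selection as cross_validation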

import matplotlib.pyplot as plt
%matplotlib inline
``````

### Reading training and testing data sets

Let’s read the training and test data sets

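``````train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')  # assumed to sit next to train.csv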
``````

### Data Cleaning

Let’s create a function for cleaning the training and test data. It does two things:

1. Encoding the categorical variables manually
2. Imputing the missing values

``````def data_cleaning(train):
    # Impute missing values: median for Age and Fare, "S" for Embarked
    train["Age"] = train["Age"].fillna(train["Age"].median())
    # Fill Fare from the Fare column itself, not from Age
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")

    # Encode Sex as 0 (male) / 1 (female)
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    # Encode Embarked as 0 (S) / 1 (C) / 2 (Q)
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train
``````
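To see the two cleaning steps in isolation, here is a minimal sketch on a made-up three-row frame. The rows, the `toy` frame and the `clean_toy` helper are hypothetical and not part of the Titanic files; the sketch also uses `Series.map` for the encoding, which is an alternative to the `.loc` assignments above rather than the method used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Fare": [7.25, 8.75, np.nan],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})

def clean_toy(df):
    df = df.copy()  # work on a copy so the caller's frame is untouched
    # Imputation: median for the numeric columns, "S" for Embarked
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna("S")
    # Manual encoding through explicit lookup tables
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return df

cleaned = clean_toy(toy)
print(cleaned)
```

Working on a copy is a design choice of this sketch: it keeps the raw frame available for comparison, whereas `data_cleaning` above modifies the frame it receives.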

Let’s clean the data

``````train=data_cleaning(train)
test=data_cleaning(test)
``````

### Selecting Predictor variables

Let’s choose the predictor variables. We will not use the Cabin and PassengerId variables.

``````predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]
``````

### X & y

Let’s separate the predictors and the target. X holds the predictor variables and y is the target variable. We will use these for modelling.

``````X, y = train[predictor_Vars], train.Survived

``````

Let’s check X

``````X.iloc[:5]
``````
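
|   | Sex | Age | SibSp | Parch | Fare |
|---|-----|-----|-------|-------|------|
| 0 | 0   | 22  | 1     | 0     | 22   |
| 1 | 1   | 38  | 1     | 0     | 38   |
| 2 | 1   | 26  | 0     | 0     | 26   |
| 3 | 1   | 35  | 1     | 0     | 35   |
| 4 | 0   | 35  | 0     | 0     | 35   |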

Let’s check y

``````y.iloc[:5]
``````
``````0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
``````

### Model Initialization

Let’s initialize the decision tree classifier model. You can also set model parameters here if you want.

``````modelDecisionTree = tree.DecisionTreeClassifier()

``````
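As an illustration of what setting model parameters can look like, the sketch below limits the depth and leaf size of a tree. The data comes from `make_classification` as a synthetic stand-in, and the values `max_depth=3` and `min_samples_leaf=5` are arbitrary assumptions for the example, not settings used or tuned in this post.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data; shapes and values are invented for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Illustrative parameter choices (assumptions, not tuned values):
# max_depth caps how many splits lie between the root and any leaf,
# min_samples_leaf requires each leaf to hold at least 5 training rows.
shallow_tree = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
shallow_tree.fit(X_demo, y_demo)

# With default parameters the tree grows until its leaves are pure.
default_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

print(shallow_tree.get_depth(), default_tree.get_depth())
```

Constraining the tree is one common way to reduce overfitting; whether it helps on the Titanic data would have to be checked with cross-validation.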

### Cross-validation

Let’s do the 5-fold cross-validation now

``````modelDecisionTreeCV= cross_validation.cross_val_score(modelDecisionTree,X,y,cv=5)
``````

Let’s check the accuracy metric of each of the five folds

``````modelDecisionTreeCV
``````
``````array([ 0.77094972,  0.7877095 ,  0.76966292,  0.74719101,  0.79661017])
``````
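To make clear what the five numbers mean, here is a rough sketch that reproduces the folds by hand on synthetic stand-in data and compares them with `cross_val_score`. It imports from `sklearn.model_selection`, where the helper lives in scikit-learn 0.18 and later; the data and the `_demo` names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, invented for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# For a classifier, cv=5 uses stratified folds: each fold keeps roughly
# the same class balance as the full data set.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])  # train on 4/5 of the rows
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))  # accuracy on the held-out 1/5
manual_scores = np.array(manual_scores)

helper_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)
print(manual_scores)
print(helper_scores)
```

Fixing `random_state` makes the tree deterministic, so the manual loop and the helper can be compared fold by fold.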

Let’s see the same information on a plot

``````plt.plot(modelDecisionTreeCV,"p")
``````
``````[<matplotlib.lines.Line2D at 0xa8517f0>]
``````

Let’s check the mean model accuracy of all five folds

``````print(modelDecisionTreeCV.mean())

``````
``````0.774424663991
``````

### Model Fitting

Let’s now fit the model with the same parameters on the whole training set, instead of the 4/5 of it used in each cross-validation fold

``````modelDecisionTree= tree.DecisionTreeClassifier()
modelDecisionTree= modelDecisionTree.fit(X, y)
``````

### Predictions on test data set

Let’s get the prediction values for the test data set

``````predictions=modelDecisionTree.predict(test[predictor_Vars])
predictions
``````
``````array([0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1,
0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,
0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1,
1, 0, 1, 0], dtype=int64)
``````
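A raw array of 0s and 1s is hard to read, so one possible next step is to pair each prediction with a passenger identifier and count the predicted classes. The sketch below uses invented IDs and predictions; it assumes the test file has a `PassengerId` column, and the names `passenger_ids`, `toy_predictions` and `result` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test["PassengerId"] and the predictions array.
passenger_ids = pd.Series([1, 2, 3], name="PassengerId")
toy_predictions = np.array([0, 1, 0])

# One row per passenger: identifier plus predicted class (1 = survived).
result = pd.DataFrame({"PassengerId": passenger_ids, "Survived": toy_predictions})
print(result)

# How many passengers fall into each predicted class.
print(result["Survived"].value_counts())
```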