January 10, 2017

## Description:

On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

## Machine Learning Problem:

To predict which passengers survived the tragedy, based on the data given.

## What we will do:

1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Modeling with a nearest neighbor classifier
4. Prediction on the test data set

### Importing libraries

Let’s import the libraries

```python
import numpy as np
import pandas as pd
from sklearn import neighbors
# Note: sklearn.cross_validation was deprecated in scikit-learn 0.18 and
# later removed; the same functions now live in sklearn.model_selection
from sklearn import cross_validation

import matplotlib.pyplot as plt
%matplotlib inline
```

Let’s import the train and test data sets. We load the test set here as well, since we will clean it and predict on it later.

```python
# Adjust the paths to wherever your Kaggle Titanic files live
train = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')
```
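Before cleaning, it is worth checking where the gaps actually are. A quick sketch; in the standard Kaggle train.csv, the columns with missing values are Age, Cabin and Embarked:

```python
# Count missing values per column to see what needs imputing
print(train.isnull().sum())
```

Age and Embarked can be imputed; Cabin is missing for most rows, which is one reason we drop it later.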

### Data Cleaning

Let’s create a function for cleaning the training and testing data. Here we are doing two things:
1. Imputing the missing values
2. Encoding the categorical variables manually

```python
def data_cleaning(train):
    # Impute missing values: the median for numeric columns,
    # the most common port ("S") for Embarked
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")

    # Encode Sex as 0/1
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    # Encode the port of embarkation as 0/1/2
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train
```

Let’s clean the data

```python
train = data_cleaning(train)
test = data_cleaning(test)
```
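A quick sanity check to confirm the imputation left no gaps in the columns we plan to use (a minimal sketch):

```python
# After cleaning, these columns should report zero missing values
for df in (train, test):
    print(df[["Sex", "Age", "SibSp", "Parch", "Fare"]].isnull().sum())
```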

### Selecting predictor variables

Let’s choose the predictor variables. We will use only Sex, Age, SibSp, Parch and Fare, leaving out identifiers such as PassengerId and the mostly missing Cabin column.

```python
predictor_Vars = ["Sex", "Age", "SibSp", "Parch", "Fare"]
```

### X & y

Let’s separate the predictors from the target: X holds the predictor variables and y is the target variable (Survived).

```python
X, y = train[predictor_Vars], train.Survived
```

Let’s check X

```python
X.iloc[:5]
```

```
   Sex   Age  SibSp  Parch     Fare
0    0  22.0      1      0   7.2500
1    1  38.0      1      0  71.2833
2    1  26.0      0      0   7.9250
3    1  35.0      1      0  53.1000
4    0  35.0      0      0   8.0500
```

Let’s check y

```python
y.iloc[:5]
```

```
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
```

### Model Initialization

Let’s initialize the nearest neighbor classifier and set its parameters: here we use 4 neighbors, with closer neighbors weighted more heavily in the vote.

```python
modelNeighbors = neighbors.KNeighborsClassifier(n_neighbors=4, weights='distance')
```
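To see what `n_neighbors` and `weights='distance'` actually do, here is a minimal sketch on made-up 1-D points (the data is purely illustrative):

```python
import numpy as np
from sklearn import neighbors

# Toy data: four points on a line, the left pair labeled 0, the right pair 1
X_toy = np.array([[0.0], [1.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])

clf = neighbors.KNeighborsClassifier(n_neighbors=4, weights='distance')
clf.fit(X_toy, y_toy)

# With all four points as neighbors the raw vote is tied 2-2;
# distance weighting resolves the tie in favor of the closer pair.
# 2.2 is nearer to the "1" points, so the weighted vote picks class 1.
print(clf.predict([[2.2]]))        # [1]
print(clf.predict_proba([[2.2]]))  # weighted vote shares per class
```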

### Cross-validation

Let’s do the 5-fold cross-validation: the training data is split into five folds, and the model is fitted on four folds and scored on the held-out fifth, once per fold.

```python
modelNeighborsCV = cross_validation.cross_val_score(modelNeighbors, X, y, cv=5)
```
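Under the hood, this is close to looping over the folds yourself (for classifiers, `cross_val_score` actually stratifies the folds by class, so the scores can differ slightly). A minimal sketch of the manual loop, using the pre-0.18 API to match the import above:

```python
# In scikit-learn 0.18+, use sklearn.model_selection.KFold(n_splits=5)
# and iterate over kf.split(X) instead
from sklearn.cross_validation import KFold

scores = []
for train_idx, test_idx in KFold(len(X), n_folds=5):
    fold_model = neighbors.KNeighborsClassifier(n_neighbors=4, weights='distance')
    fold_model.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(fold_model.score(X.iloc[test_idx], y.iloc[test_idx]))
print(scores)
```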

### Accuracy Metric

Let’s check the accuracy of each of the five folds

```python
modelNeighborsCV
```

```
array([ 0.79329609,  0.77653631,  0.71348315,  0.71910112,  0.77966102])
```

Let’s see the same information on a plot

```python
plt.plot(modelNeighborsCV, "p")  # "p" draws each fold's accuracy as a pentagon marker
```

```
[<matplotlib.lines.Line2D at 0x97b8eb8>]
```
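Axis labels make the plot easier to read; a hedged variant of the same call:

```python
# Same data, with labels so the axes are self-explanatory
plt.plot(modelNeighborsCV, "p")
plt.xlabel("Fold")
plt.ylabel("Accuracy")
plt.title("5-fold cross-validation accuracy")
plt.show()
```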

Let’s check the mean accuracy across all five folds

```python
print(modelNeighborsCV.mean())
```

```
0.756415537769
```
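The fold-to-fold spread is worth reporting alongside the mean; a one-line sketch:

```python
# Mean accuracy with the standard deviation across folds
print("%.3f +/- %.3f" % (modelNeighborsCV.mean(), modelNeighborsCV.std()))
```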

### Model Fitting

Let’s now fit the model with the same parameters on the whole training set, instead of the 4/5 of it used in each cross-validation fold

```python
modelNeighbors = neighbors.KNeighborsClassifier(n_neighbors=4, weights='distance')
modelNeighbors.fit(X, y)
```

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='distance')
```
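One caveat: with `weights='distance'`, every training point is its own zero-distance neighbor, so accuracy measured on the training data itself is optimistic by construction and tells us little. The cross-validation scores above are the honest estimate. For example:

```python
# Expect a value close to 1.0: each training point matches itself exactly
print(modelNeighbors.score(X, y))
```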

### Predictions on test data

Let’s predict Survived for each passenger in the cleaned test set.

```python
predictions = modelNeighbors.predict(test[predictor_Vars])
predictions
```

```
array([0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 1, 0], dtype=int64)
```
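To submit these predictions to Kaggle, they need to be paired with the test set’s `PassengerId` column in a two-column CSV. A minimal sketch (the output filename is arbitrary):

```python
# Kaggle expects one row per test passenger: PassengerId, Survived
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("titanic_knn_submission.csv", index=False)
```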

## Below are some important links for you

What is cross-validation

What are dimensionality reduction techniques