
Introduction

Let’s walk through a concise demo of fitting a Nearest Neighbor classifier on the Titanic data set for machine learning.

Description:

On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

Machine learning problem:

To predict which passengers survived the tragedy, based on the data given.

What we will do :

1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Model: Nearest Neighbor classifier
4. Prediction on the test data set

Importing libraries

Let’s import the libraries

import numpy as np
import pandas as pd
from sklearn import neighbors
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20


import matplotlib.pyplot as plt
%matplotlib inline

Reading dataset

Let’s import the data set

train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')

Data Cleaning

Let’s create a function for cleaning the training and testing data. Here we are doing two things:
1. Encoding the categorical variables manually
2. Imputing the missing values

def data_cleaning(train):
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")


    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train

Let’s clean the data

train=data_cleaning(train)
test=data_cleaning(test)
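Since the real CSVs live on a local path, here is a minimal, self-contained sketch of the same cleaning steps on a hypothetical toy frame (the values are made up), which is a handy way to check that no missing values remain after cleaning:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the Titanic columns used above.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, np.nan],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "C"],
})

# Same steps as data_cleaning: impute, then encode categoricals manually.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())
toy["Embarked"] = toy["Embarked"].fillna("S")
toy.loc[toy["Sex"] == "male", "Sex"] = 0
toy.loc[toy["Sex"] == "female", "Sex"] = 1
toy.loc[toy["Embarked"] == "S", "Embarked"] = 0
toy.loc[toy["Embarked"] == "C", "Embarked"] = 1
toy.loc[toy["Embarked"] == "Q", "Embarked"] = 2

print(toy.isnull().sum().sum())  # → 0: no missing values remain
```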

Selecting predictor variables

Let’s choose the predictor variables. We will not use the Cabin and PassengerId variables.

predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]

X & y

Let’s separate predictors and target. X is the array of predictor variables and y is the target variable.

X, y = train[predictor_Vars], train.Survived

Let’s check X

X.iloc[:5]
   Sex   Age  SibSp  Parch     Fare
0    0  22.0      1      0   7.2500
1    1  38.0      1      0  71.2833
2    1  26.0      0      0   7.9250
3    1  35.0      1      0  53.1000
4    0  35.0      0      0   8.0500

Let’s check y

y.iloc[:5]
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Model Initialization

Let’s initialize the nearest neighbors classifier model and set the model parameters we want.

modelNeighbors = neighbors.KNeighborsClassifier(n_neighbors=4, weights='distance')
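To see what weights='distance' does, here is a small sketch on hypothetical 1-D toy data: closer neighbours get larger votes (weight 1/distance), so a point sitting right next to class-1 samples is labelled 1 even though uniform voting over its 4 nearest neighbours would be split 2–2:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two class-1 points near 0, three class-0 points near 3.
X_toy = np.array([[0.0], [0.1], [3.0], [3.1], [3.2]])
y_toy = np.array([1, 1, 0, 0, 0])

clf = KNeighborsClassifier(n_neighbors=4, weights='distance')
clf.fit(X_toy, y_toy)

# The 4 nearest neighbours of 0.2 are at distances 0.1, 0.2, 2.8, 2.9;
# the two very close class-1 points dominate the weighted vote.
print(clf.predict([[0.2]]))  # → [1]
```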

Cross-validation

Let’s do the 5-fold cross-validation

modelNeighborsCV = cross_val_score(modelNeighbors, X, y, cv=5)

Accuracy Metric

Let’s check the accuracy metric of each of the five folds

modelNeighborsCV

array([ 0.79329609,  0.77653631,  0.71348315,  0.71910112,  0.77966102])
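Conceptually, cv=5 splits the data into five folds, trains on four and scores on the fifth, rotating through all folds. A rough, self-contained equivalent on synthetic stand-in data is sketched below (note: for classifiers, cross_val_score actually uses stratified folds, so this plain KFold loop is only an approximation):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; with the real data these would be X and y above.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    clf = KNeighborsClassifier(n_neighbors=4, weights='distance')
    clf.fit(X_demo[train_idx], y_demo[train_idx])      # train on 4/5 of the data
    scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))  # score on the held-out fold

print(len(scores))  # → 5, one accuracy per fold
```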

Let’s see the same information on the plot

plt.plot(modelNeighborsCV,"p")

[<matplotlib.lines.Line2D at 0x97b8eb8>]


[Plot: Nearest Neighbor classifier accuracy metric across the five cross-validation folds]

Let’s check the mean model accuracy of all five folds

print(modelNeighborsCV.mean())

0.756415537769

Model Fitting

Let’s now fit the model with the same parameters on the whole data set, instead of the 4/5 of the data used in each cross-validation fold.

modelNeighbors = neighbors.KNeighborsClassifier(n_neighbors=4, weights='distance')
modelNeighbors.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='distance')

Predictions on test data 

predictions=modelNeighbors.predict(test[predictor_Vars])
predictions

array([0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 1, 0], dtype=int64)
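If you want to submit these predictions (e.g. to Kaggle), a common final step is to pair them with the test set’s PassengerId column and write a CSV. A minimal sketch, using hypothetical stand-in values for the IDs and predictions (with the real data you would use test["PassengerId"] and the predictions array above):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test["PassengerId"] and the predictions array.
passenger_ids = np.array([892, 893, 894])
preds = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": preds})
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # → (3, 2)
```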

Below are some useful links:

What is cross-validation

What are dimensionality reduction techniques