Missing Value Analysis using Python

Missing Value analysis on Titanic data set for Machine Learning and Data Analytics

Let's import the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let's read the train and test data set .We are using the titanic data set

train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')

Let's check the dimensions of train data set

train.shape
(891, 12)

Column or variable names

train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Check data types of each column

train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB

Here is a nice online descriptive course at good price.

There are 891 rows in train data set.Let's find the number of non-missing values in each column out of 891

train.count()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

We can see from above that Age,Cabin and Embarked has some missing values in it

Let's check the number of missing values in each column

train.apply(lambda x: sum(x.isnull().values), axis = 0)
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Les's check the distribution of missing values across rows

sns.countplot(train.isnull().sum(axis=1))
<matplotlib.axes._subplots.AxesSubplot at 0xa7fb5f8>
missing value barplot
Missing Value barplot

Here is a nice online descriptive course at good price.

We can see from above that only around 180 rows are there which have no missing values in it .This is very important to see

Let's check the exact number of rows which have no missing values

train.shape[0]-train.isnull().any(axis=1).sum()
183

Let's check the data with rows having any missing values in it .I'm choosing Age and Cabin column only and rows from 40 to 50.

train[train.isnull().any(axis=1)][["Age","Cabin"]].iloc[40:50]
AgeCabin
48NaNNaN
4918.0NaN
507.0NaN
5121.0NaN
5329.0NaN
55NaNC52
5621.0NaN
5728.5NaN
585.0NaN
5911.0NaN

Let's check the data which have rows with no missing values.I'm choosing Age and Cabin column only and rows from 40 to 50.

train[train.isnull().sum(axis=1)==0][["Age","Cabin"]].iloc[40:50]
AgeCabin
24544C78
24837D35
25129G6
25262C87
25730B77
26252E67
26340B94
26858C125
26935C99
27337C118

Let's check the data with only columns having missing values i.e Age,Embarked and cabin

train[["Age","Embarked","Cabin"]].iloc[0:5]
AgeEmbarkedCabin
022SNaN
138CC85
226SNaN
335SC123
435SNaN

Let's check the individual variables now like Age

Let's find the mean age

train["Age"].mean()

29.69911764705882

Let's check the missing values in Age column

train["Age"].isnull().sum()
177

Let's impute the missing values with random numbers which are distributed randomly between 1 standard deviation around the mean value of age.

mean_age=train["Age"].mean()
std_age=train["Age"].std()

randnum = np.random.randint(mean_age - std_age,mean_age + std_age, size = 177)
train["Age"][np.isnan(train["Age"])]= randnum

Lets check the cabin variable

Its a categorical variable .Not sure if it can help in predictive modelling but let's check the unique values of cabin

train.Cabin.unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6', 'C23 C25 C27',
       'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33', 'F G73', 'E31',
       'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101', 'F E69', 'D47',
       'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4', 'A32', 'B4',
       'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35', 'C87', 'B77',
       'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19', 'B49', 'D',
       'C22 C26', 'C106', 'C65', 'E36', 'C54', 'B57 B59 B63 B66', 'C7',
       'E34', 'C32', 'B18', 'C124', 'C91', 'E40', 'T', 'C128', 'D37',
       'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44', 'A34', 'C104', 'C111',
       'C92', 'E38', 'D21', 'E12', 'E63', 'A14', 'B37', 'C30', 'D20',
       'B79', 'E25', 'D46', 'B73', 'C95', 'B38', 'B39', 'B22', 'C86',
       'C70', 'A16', 'C101', 'C68', 'A10', 'E68', 'B41', 'A20', 'D19',
       'D50', 'D9', 'A23', 'B50', 'A26', 'D48', 'E58', 'C126', 'B71',
       'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63', 'C62 C64', 'E24', 'C90',
       'C45', 'E8', 'B101', 'D45', 'C46', 'D30', 'E121', 'D11', 'E77',
       'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36', 'B102', 'B69', 'E49',
       'C47', 'D28', 'E17', 'A24', 'C50', 'B42', 'C148'], dtype=object)

We have two options now .Either we can drop this variable or do the exploratory analysis to check if it is of any help in survival prediction.I'm not doing any exploratory analysis further in this direction here.We can do the one hot encoding and use “nan” as one of the dummy variable.