Let’s work through a concise demo of fitting logistic regression on the Titanic data set for machine learning

Description: On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

Machine learning problem: predict which passengers survived the tragedy, based on the given data

What we will do:
1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Fit a logistic regression model
4. Predict on the test data set

Importing Libraries

Let’s import the libraries we will use

``````import numpy as np
import pandas as pd
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline
``````

Reading the training and testing data sets

Let’s read both data sets

``````train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
# the test set is assumed to sit in the same folder
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')
``````

Data Cleaning/Encoding/Preprocessing

Let’s create a function for cleaning the training and testing data. Here we are doing two things.
1. Imputing the missing values in the Age, Fare and Embarked variables
2. Encoding the categorical variables (Sex and Embarked) manually

``````def data_cleaning(train):
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")

    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train
``````

Let’s clean the data

``````train=data_cleaning(train)
test=data_cleaning(test)
``````
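After cleaning, it is worth confirming that no missing values remain. Here is a minimal check on a tiny synthetic frame (not the real Titanic data), assuming the same fillna-and-encode logic as `data_cleaning` above:

```python
import pandas as pd

# Tiny synthetic frame to illustrate the cleaning steps
df = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Fare": [7.25, 71.28, None],
    "Embarked": ["S", None, "Q"],
    "Sex": ["male", "female", "female"],
})

# Same logic as data_cleaning: median imputation, mode for Embarked, manual encoding
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["Embarked"] = df["Embarked"].fillna("S")
df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1

print(df.isnull().sum().sum())  # 0 -> no missing values remain
```

On the real data, `train.isnull().sum()` after cleaning should show zeros for Age, Fare and Embarked.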

Choosing Predictor Variables

Let’s choose the predictor variables. We will not use the Cabin and PassengerId variables

``````predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]
``````

X & y

X is the array of predictor variables and y is the target (dependent) variable. We will pass X & y to the model while fitting.

``````X, y = train[predictor_Vars], train.Survived

``````

Let’s check X

``````X.iloc[:5]
``````
``````   Sex   Age  SibSp  Parch     Fare
0    0  22.0      1      0   7.2500
1    1  38.0      1      0  71.2833
2    1  26.0      0      0   7.9250
3    1  35.0      1      0  53.1000
4    0  35.0      0      0   8.0500
``````

Let’s check y

``````y.iloc[:5]
``````
``````0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
``````

Model Initialization

Let’s initialize the logistic regression model; you can set model parameters such as C or the penalty type here if you wish.

``````modelLogistic = linear_model.LogisticRegression()

``````

Cross-validation

If you want to know what cross-validation is, see What is cross-validation

Let’s do the 5-fold cross-validation

``````from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation

modelLogisticCV = cross_val_score(modelLogistic, X, y, cv=5)
``````
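For intuition, `cross_val_score` with `cv=5` is roughly equivalent to looping over `KFold` splits yourself. Here is a sketch on synthetic stand-in data (the real X and y come from the Titanic frame above):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold

# Synthetic stand-in data: 100 rows, 3 features, label driven by the first feature
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X_demo):
    # Fit on 4/5 of the data, score on the held-out 1/5
    model = linear_model.LogisticRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(len(scores), np.mean(scores))  # five fold accuracies and their mean
```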

Plotting Cross-validation Results

Let’s plot the accuracy metric for all five folds

``````plt.plot(modelLogisticCV,"p")
``````
``````[<matplotlib.lines.Line2D at 0xaf99630>]
``````

Mean Accuracy for cross-validation folds

Let’s check the mean model accuracy of all five folds

``````print(modelLogisticCV.mean())

``````
``````0.787851316429
``````

Model Initialization & Fitting

Once you are satisfied with the cross-validation results, you can fit the model with the same parameters you used during cross-validation. Here I have not chosen any parameters. Let’s now fit the model on the whole data set, instead of the 4/5th used in each cross-validation fold.

``````modelLogistic = linear_model.LogisticRegression()
modelLogistic.fit(X,y)

``````
``````LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
``````
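Once fitted, the coefficients can hint at each predictor’s direction of association with survival. A sketch using synthetic stand-in data and the same predictor names (with the real model you would zip `predictor_Vars` with `modelLogistic.coef_[0]`):

```python
import numpy as np
from sklearn import linear_model

# Illustrative names and synthetic stand-in data (not the real Titanic fit)
names = ["Sex", "Age", "SibSp", "Parch", "Fare"]
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

m = linear_model.LogisticRegression().fit(X_demo, y_demo)

# One coefficient per predictor; the sign shows the direction of association
for name, coef in zip(names, m.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```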

Prediction on Test Data

Let’s predict for the test data set

``````predictions=modelLogistic.predict(test[predictor_Vars])
predictions
``````
``````array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
1, 0, 0, 0], dtype=int64)
``````
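To submit to Kaggle, the predictions are typically written out alongside the matching PassengerId column. A hypothetical sketch with stand-in values (with the real data, use `test["PassengerId"]` and the `predictions` array above):

```python
import pandas as pd

# Stand-in ids and predictions; Kaggle expects PassengerId and Survived columns
test_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions_demo = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": test_ids, "Survived": predictions_demo})
submission.to_csv("submission.csv", index=False)
print(submission.shape)  # (3, 2)
```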

Below are some important links for you

Logistic regression output interpretation

Deviance and AIC in Logistic Regression

How to fit nearest neighbor classifier using python

What are dimensionality reduction techniques

How to fit Decision tree classifier using python

How to fit Naive bayes classifier using python

Xgboost model tuning
