Logistic Regression Using Python
Let’s work through a concise demo of fitting a logistic regression model on the Titanic data set.

Description: On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew on board. The tragedy led to better safety regulations for ships.

Machine learning problem: predict which passengers survived the disaster, based on the data given.

What we will do:
1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Fit a logistic regression model
4. Predict for the test data set

Importing Libraries

Let’s import the following libraries.

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import cross_val_score


import matplotlib.pyplot as plt
%matplotlib inline

Reading the training and testing data sets

Let’s read in the train and test data sets.

train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')
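
Before cleaning, it is worth checking which columns actually have missing values. A quick look (in the standard Kaggle files, Age, Cabin, and Embarked have gaps in train, and Age and Fare in test):

# Count missing values per column in the training data
print(train.isnull().sum())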

Data Cleaning/Encoding/Preprocessing

Let’s create a function for cleaning the training and testing data. Here we are doing two things:
1. Imputing the missing values in the Age, Fare, and Embarked variables
2. Encoding the categorical variables (Sex and Embarked) manually

def data_cleaning(train):
    # Impute missing values: column median for the numeric variables,
    # most common port ("S") for Embarked
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")

    # Encode Sex as 0/1
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    # Encode Embarked as 0/1/2
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train

Let’s clean the data

train=data_cleaning(train)
test=data_cleaning(test)
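
A quick sanity check that the imputation worked, assuming the columns above:

# The imputed columns should have no missing values left
print(train[["Age", "Fare", "Embarked"]].isnull().sum())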

Choosing Predictor Variables

Let’s choose the predictor variables. We will not use the Cabin and PassengerId variables.

predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]


X & y

X is the array of predictor variables and y is the target (dependent) variable. We will pass X and y to the model when fitting.

X, y = train[predictor_Vars], train.Survived
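
As a quick sanity check on dimensions, X should have one row per passenger and five columns, and y one label per row:

print(X.shape, y.shape)   # (891, 5) (891,) for the standard Kaggle train file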

Let’s check X

X.iloc[:5]
   Sex   Age  SibSp  Parch     Fare
0    0  22.0      1      0   7.2500
1    1  38.0      1      0  71.2833
2    1  26.0      0      0   7.9250
3    1  35.0      1      0  53.1000
4    0  35.0      0      0   8.0500

Let’s check y

y.iloc[:5]
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Model Initialization

Let’s initialize the logistic regression model. You can also set hyperparameters here, such as C (inverse regularization strength) or the penalty type, if you wish; a sketch follows the code below.

modelLogistic = linear_model.LogisticRegression()
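
If you do want non-default parameters, they are passed at initialization. For example, a sketch with illustrative, untuned values:

# Smaller C means stronger regularization; the liblinear solver supports 'l1'
modelLogisticL1 = linear_model.LogisticRegression(C=0.5, penalty='l1', solver='liblinear')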

Cross-validation

If you want to know what cross-validation is, follow What is crossvalidation.

Let’s do 5-fold cross-validation.

modelLogisticCV = cross_val_score(modelLogistic, X, y, cv=5)
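
Under the hood this splits the rows into five folds, trains on four, and scores accuracy on the held-out fold. A hand-rolled sketch of the same loop (cross_val_score actually stratifies the folds for classifiers; plain KFold is used here for simplicity):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, score accuracy on the held-out fold
    fold_model = linear_model.LogisticRegression()
    fold_model.fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_scores.append(fold_model.score(X.iloc[test_idx], y.iloc[test_idx]))
print(fold_scores)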


Plotting Cross-validation Results

Let’s plot the accuracy for each of the five folds.

plt.plot(modelLogisticCV, "p")   # "p" draws pentagon markers, one per fold
[<matplotlib.lines.Line2D at 0xaf99630>]
[Figure: accuracy for each of the five cross-validation folds]

Mean Accuracy Across Cross-validation Folds

Let’s check the mean accuracy across all five folds.

print(modelLogisticCV.mean())

0.787851316429

Model Initialization & Fitting

Once you are satisfied with the cross-validation results, you can fit the model with the same parameters you used during cross-validation (here I have not set any). Let’s now fit the model on the whole training set, instead of the 4/5 of it used in each cross-validation fold.

modelLogistic = linear_model.LogisticRegression()
modelLogistic.fit(X,y)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Read about AIC, deviance, and more about Logistic Regression.
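
For a first look at interpretation, the fitted coefficients line up with the predictor columns and are on the log-odds scale. A quick sketch:

# One coefficient per predictor column, on the log-odds scale
coefs = pd.Series(modelLogistic.coef_[0], index=predictor_Vars)
print(coefs)
print("Intercept:", modelLogistic.intercept_[0])

Now let’s predict for the test data set.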

predictions=modelLogistic.predict(test[predictor_Vars])
predictions
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 0], dtype=int64)
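
A common next step (not shown above) is writing the predictions out in the Kaggle submission format; this sketch assumes the test file keeps its PassengerId column, and the output filename is illustrative:

# Kaggle-style submission: PassengerId plus the predicted Survived label
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions
})
submission.to_csv("titanic_predictions.csv", index=False)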

Below are some important links for you

Logistic regression output interpretation

Deviance and AIC in Logistic Regression

How to fit nearest neighbor classifier using python

What are dimensionality reduction techniques

How to fit Decision tree classifier using python

How to fit Naive bayes classifier using python

Xgboost model tuning