##### Fitting Logistic Regression on the Titanic Data Set: A Step-by-Step Demo

**Description**: On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

**Machine learning problem**: Predict which passengers survived the tragedy, based on the data given.

**What we will do** :

1. Basic cleaning of missing values in the train and test data sets

2. 5-fold cross-validation

3. Logistic regression as the model

4. Prediction for the test data set

**Importing Libraries**

Let’s import the libraries we need.

```
import numpy as np
import pandas as pd
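from sklearn import linear_model
try:
    # scikit-learn 0.18+: cross_val_score lives in model_selection
    from sklearn import model_selection as cross_validation
except ImportError:
    # older releases still ship the cross_validation module
    from sklearn import cross_validation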
import matplotlib.pyplot as plt
%matplotlib inline
```

**Reading the training and testing data sets**

Let’s read the two CSV files into pandas DataFrames. Change the paths to wherever the files are saved on your machine.

```
train=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test=pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')
```

**Data Cleaning/Encoding/Preprocessing**

Let’s create a function for cleaning the training and testing data. Here we are doing two things:

1. Imputing the missing values in the Age, Fare and Embarked variables (the median for Age and Fare, "S" for Embarked).

2. Encoding the categorical variables Sex and Embarked manually as integers.

```
def data_cleaning(df):
    """Impute missing values and encode the categorical columns as integers."""
    df = df.copy()  # leave the caller's DataFrame untouched
    # Impute missing values
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())  # fill Fare from Fare, not Age
    df["Embarked"] = df["Embarked"].fillna("S")
    # Encode the categorical variables manually
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return df
```

Let’s clean the data

```
train=data_cleaning(train)
test=data_cleaning(test)
```
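If you want to convince yourself that the imputation and encoding steps described above do what they claim, here is a minimal sketch on a made-up four-row table. The `toy` DataFrame and its values are hypothetical stand-ins, not rows from the Titanic files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the Titanic columns used in this post
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

# Count the missing values per column before cleaning
missing_before = toy.isnull().sum()

# Median imputation for the numeric columns, "S" for Embarked
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())
toy["Embarked"] = toy["Embarked"].fillna("S")

# Manual integer encoding of the categorical columns
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Count the missing values again after cleaning
missing_after = toy.isnull().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Running the same `isnull().sum()` check on the real `train` and `test` frames is a quick way to confirm that none of the chosen columns still has missing values.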

**Choosing Predictor Variables**

Let’s choose the predictor variables. We will leave out the Cabin and PassengerId variables.

```
predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]
```

**X & y**

X holds the predictor variables and y is the target (dependent) variable. We will pass X and y to the model when fitting it.

```
X, y = train[predictor_Vars], train.Survived
```

Let’s check X

```
X.iloc[:5]
```

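| | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 0 | 0 | 22 | 1 | 0 | 22 |
| 1 | 1 | 38 | 1 | 0 | 38 |
| 2 | 1 | 26 | 0 | 0 | 26 |
| 3 | 1 | 35 | 1 | 0 | 35 |
| 4 | 0 | 35 | 0 | 0 | 35 |

Note that the Fare column in this captured output repeats the Age values, which means Fare had been overwritten with Age when the output was produced. When Fare is imputed from its own column, it shows the ticket fares instead.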

Let’s check y

```
y.iloc[:5]
```

```
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int64
```

**Model Initialization**

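Let’s initialize the logistic regression model. If you wish, you can set model parameters here, such as C (the inverse of the regularization strength) or the penalty type (l1 or l2). Here we keep the defaults.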

```
modelLogistic = linear_model.LogisticRegression()
```
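To get a feel for what the C parameter does, here is a small sketch on synthetic data. The data from `make_classification` and the two C values are assumptions chosen for illustration, not settings tuned for the Titanic data.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import make_classification

# Synthetic stand-in data (not the Titanic set): 200 rows, 5 features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C is the inverse of the regularization strength:
# a large C regularizes weakly, a small C regularizes strongly
weak_reg = linear_model.LogisticRegression(C=10.0)
strong_reg = linear_model.LogisticRegression(C=0.01)

weak_reg.fit(X_demo, y_demo)
strong_reg.fit(X_demo, y_demo)

# Stronger regularization pulls the coefficients toward zero
weak_norm = np.linalg.norm(weak_reg.coef_)
strong_norm = np.linalg.norm(strong_reg.coef_)
print(weak_norm, strong_norm)
```

The model with the smaller C ends up with smaller coefficients, which is the trade-off you tune when you move away from the default C=1.0.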

**Cross-validation**

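If you want to know what cross-validation is, see "What is crossvalidation".

Let’s do the 5-fold cross-validation. The data is split into five parts; the model is trained on four of them and scored on the fifth, and this is repeated so that each part serves as the test part once.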

```
modelLogisticCV = cross_validation.cross_val_score(modelLogistic, X, y, cv=5)
```
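If you are curious what the call above does under the hood, the loop below is a rough sketch of the same idea on synthetic data. It assumes the `sklearn.model_selection` module (scikit-learn 0.18 or later), and the `X_demo`/`y_demo` arrays are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression()
# For a classifier, cv=5 corresponds to five stratified folds without shuffling
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Train on four fifths of the rows ...
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # ... and measure accuracy on the held-out fifth
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)

# The one-line helper should give the same five accuracies
library_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print(manual_scores)
print(library_scores)
```

Each entry is the accuracy on one held-out fold, so the array has five values, one per fold.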

**Plotting Cross-validation Results**

Let’s plot the accuracy metric for all five folds

```
plt.plot(modelLogisticCV,"p")
```

```
[<matplotlib.lines.Line2D at 0xaf99630>]
```

**Mean Accuracy for cross-validation folds**

Let’s check the mean model accuracy of all five folds

```
print(modelLogisticCV.mean())
```

```
0.787851316429
```
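To put this number in context, here is a rough back-of-the-envelope comparison against always predicting "did not survive". It uses only the casualty figures from the description at the top of this post, which cover everyone aboard rather than just the training file, so treat it as a loose yardstick and not an exact baseline.

```python
# Figures quoted in the description: 1502 deaths out of 2224 people aboard
deaths, aboard = 1502, 2224

# Accuracy of a "model" that predicts 0 (did not survive) for everyone
baseline = deaths / aboard

# Mean 5-fold accuracy reported above
cv_mean = 0.787851316429

# How far the logistic regression sits above that naive guess
lift = cv_mean - baseline

print(round(baseline, 3))
print(round(lift, 3))
```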

**Model Initialization & Fitting**

Once you are satisfied with the cross-validation results, you can fit the model with the same parameters you used during cross-validation. Here I have kept the default parameters. Let’s now fit the model on the whole training data set instead of the 4/5 of it used in each cross-validation fold.

```
modelLogistic = linear_model.LogisticRegression()
modelLogistic.fit(X,y)
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
```
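Once the model is fitted, its `coef_` and `intercept_` attributes hold the learned weights. As a minimal sketch of how those weights turn into predictions, the snippet below rebuilds the predicted probabilities by hand on synthetic data. The `X_demo`/`y_demo` arrays are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Linear score for each row: intercept + sum(coefficient * feature)
linear_score = X_demo @ model.coef_.ravel() + model.intercept_[0]

# The logistic (sigmoid) function maps the score to a probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-linear_score))

# A probability above 0.5 is the same as a positive linear score
manual_labels = (linear_score > 0).astype(int)

# Compare with what the fitted model reports
library_proba = model.predict_proba(X_demo)[:, 1]
library_labels = model.predict(X_demo)
print(np.allclose(manual_proba, library_proba))
```

A positive coefficient pushes the predicted probability of class 1 up as the feature grows, and a negative one pushes it down.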

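Read about AIC, deviance and more about logistic regression in the links at the end of this post.

**Predicting for the test data set**

Let’s now use the fitted model to predict survival (1) or not (0) for every passenger in the test data set.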

```
predictions=modelLogistic.predict(test[predictor_Vars])
predictions
```

```
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
1, 0, 0, 0], dtype=int64)
```
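If you want to save the predictions, one common layout is a two-column table that pairs each PassengerId with its predicted Survived value. The sketch below assumes that layout and uses hypothetical ids and labels in place of the real `test["PassengerId"]` column and the `predictions` array.

```python
import io
import pandas as pd

# Hypothetical stand-ins: four passenger ids and four predicted labels
passenger_ids = pd.Series([892, 893, 894, 895], name="PassengerId")
toy_predictions = [0, 1, 0, 0]

# One row per passenger: the id next to the predicted label
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": toy_predictions,
})

# Write to an in-memory buffer here; pass a file path instead to save to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the pandas row index out of the file, so only the two named columns are written.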

**Below are some important links for you**

Logistic regression output interpretation

Deviance and AIC in Logistic Regression

How to fit nearest neighbor classifier using Python

What are dimensionality reduction techniques

How to fit Decision tree classifier using Python

How to fit Naive Bayes classifier using Python