## Description:

On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

## Machine Learning Problem:

To predict which passengers survived the tragedy, based on the given data.

## What we will do :

1. Basic cleaning of missing values in the train and test data sets
2. 5-fold cross-validation
3. Fit a Support Vector Machine model
4. Predict for the test data set

### Importing libraries

Let’s import the libraries. Note that `cross_val_score` lives in `sklearn.model_selection`; the old `sklearn.cross_validation` module has been removed from scikit-learn.

``````import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline
``````

### Reading training and testing data set

``````train = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')
``````

### Data Cleaning

Let’s create a function to clean the training and testing data. Here we do two things:
1. Encode the categorical variables manually
2. Impute the missing values

``````def data_cleaning(train):
    # Impute missing values
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")

    # Encode categorical variables manually
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1

    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2

    return train
``````
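As a quick sanity check, the function can be exercised on a tiny hand-made frame. This is only a sketch: the rows below are made up, and the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# The cleaning function from above, repeated so this snippet is self-contained.
def data_cleaning(train):
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2
    return train

# Made-up rows mirroring the Titanic columns, including missing values.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Age": [22.0, np.nan, 30.0],
    "Fare": [7.25, 71.28, np.nan],
    "Embarked": ["S", "C", None],
})
cleaned = data_cleaning(toy)
print(cleaned)  # no NaNs remain; Sex and Embarked are now numeric codes
```

After cleaning, the missing Age is replaced by the median of the known ages, the missing Fare by the median fare, and the missing Embarked by "S" (code 0).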

Let’s clean the data

``````train=data_cleaning(train)
test=data_cleaning(test)
``````

### Selecting Predictor Variables

Let’s choose the predictor variables. We will not use the Cabin and PassengerId variables (Name, Ticket, Pclass and Embarked are left out here as well).

``````predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]
``````

### X & y

Let’s separate the predictors and the target: X is the array of predictor variables and y is the target variable. We will use these when fitting the model.

``````X, y = train[predictor_Vars], train.Survived

``````

Let’s check X

``````X.iloc[:5]
``````
``````   Sex   Age  SibSp  Parch     Fare
0    0  22.0      1      0   7.2500
1    1  38.0      1      0  71.2833
2    1  26.0      0      0   7.9250
3    1  35.0      1      0  53.1000
4    0  35.0      0      0   8.0500
``````

Let’s check y

``````y.iloc[:5]
``````
``````0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
``````
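Before fitting, it is also worth knowing how balanced the target is. A minimal sketch using a toy stand-in for y (on the real data, simply call `y.value_counts()`):

```python
import pandas as pd

# Toy stand-in for the Survived column; on the real data use y.value_counts()
y_toy = pd.Series([0, 1, 1, 0, 0, 1, 0], name="Survived")
counts = y_toy.value_counts()
print(counts)               # absolute class frequencies
print(counts / len(y_toy))  # the same, as proportions
```

A strongly imbalanced target would make plain accuracy a misleading metric, so this check tells us whether the fold accuracies below are meaningful.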

### Model Initialization & Fitting

Let’s choose the Support Vector Classifier parameters and fit the model. Note that with `kernel='linear'` the `gamma` parameter is ignored; it only affects non-linear kernels such as RBF.

``````modelSVM = SVC(kernel='linear', C=0.8, gamma=0.01).fit(X, y)
``````

### Cross-validation

Let’s do the 5-fold cross-validation.

``````modelSVMCV = cross_val_score(modelSVM, X, y, cv=5)
``````

Let’s check the accuracy metric of each of the five folds

``````modelSVMCV
``````
``````array([ 0.80446927,  0.80446927,  0.78651685,  0.75280899,  0.78531073])
``````

Let’s see the same information in a plot

``````plt.plot(modelSVMCV,"p")
``````
``````[<matplotlib.lines.Line2D at 0xad06320>]
``````

Let’s check the mean model accuracy of all five folds

``````print(modelSVMCV.mean())

``````
``````0.786715024929
``````
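Alongside the mean, the spread across folds is worth reporting. A small sketch using the fold accuracies shown above:

```python
import numpy as np

# Fold accuracies from the cross-validation output above
scores = np.array([0.80446927, 0.80446927, 0.78651685, 0.75280899, 0.78531073])

# Mean accuracy plus standard deviation across the five folds
print("mean accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))
```

A small standard deviation here suggests the model's performance is stable across different splits of the training data.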

### Model fitting

If we are satisfied with the cross-validation results, let’s now fit the model with the same parameters on the whole data set, instead of the 4/5 of the data used in each cross-validation fold.

``````modelSVM = SVC(kernel='linear', C=0.8, gamma=0.01)
modelSVM.fit(X, y)
``````
``````SVC(C=0.8, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
``````

### Predictions on test data set

``````predictions = modelSVM.predict(test[predictor_Vars])
predictions
``````
``````array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
1, 0, 0, 0], dtype=int64)
``````
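The usual final step for this competition is writing the predictions to a submission CSV with PassengerId and Survived columns. A minimal sketch with toy stand-ins (on the real data, use the `test` frame read from test.csv, which keeps its PassengerId column, and the `predictions` array above):

```python
import numpy as np
import pandas as pd

# Toy stand-ins so the snippet runs on its own; replace with the real
# `test` frame and `predictions` array from the steps above.
test = pd.DataFrame({"PassengerId": [892, 893, 894]})
predictions = np.array([0, 1, 0])

# Pair each passenger id with its predicted survival label
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)
print(submission)
```

The resulting submission.csv has one row per test passenger and can be uploaded as-is.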