kaggle-titanic

https://www.kaggle.com/c/titanic

import pandas as pd
import numpy as np
import random as rnd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style("white")
survived = ["#ffb3ba", "#bae1ff"]
sns.set_palette(survived)
#sns.palplot(sns.color_palette())

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import warnings
warnings.filterwarnings('ignore')

The Challenge

"The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc)."

https://www.kaggle.com/c/titanic/overview/

Data

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head(10)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Data Types, Describing Data and Missing Values

train_df.info()
print('_'*40)

test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False, cmap="Pastel1")

<AxesSubplot:>

Visually, Cabin features most missing values.

Analysing Features & Initial Data Presumptions:

PassengerId: is a running index (we will drop this feature later on);
Survived: indication whether the passenger survived (1) or not (0);
Pclass: Ticket-class: first (1), second (2), and third (3) class tickets. We might assume that first PClass might correlate positively with the Fare feature as a higher class should cost more relative to second or third ticket-classes;
Name: Name of the passenger. Names contain the family name (indicating possible family relations), titles, such as Mr., Miss., Mrs. (indicating certain age groups or even possibly correlating to Pclass if the name contains a nobility title);
Sex: indicator whether the passenger was (female) or (male);
Age: the age in years of the passenger. Listed as a float64, it might be corectly converted to int64 as no decimal age notation was registered. There are 177 NaN values in the train_df and 86 NaN values in the test_df. To estimate some of those missing values, we might use some of the name titles indicating certain age groups;
SibSp: number of siblings / spouses aboard the Titanic associated with the passenger;
Parch: number of parents / children aboard the Titanic associated with the passenger;
Ticket: alphanumeric variable indicating the ticket number. (If totally random, we might as well drop it too later on);
Fare: how much each passenger paid for the ticket; There's a single missing-value in the test_df;
Cabin: cabin number of each passenger. Around 77% and 78% of the values are missing for the train_df and test_df respectively;
Embarked: shows the port of embarkation as a categorical character value. Either (C), (S) or (Q) (C = Cherbourg; Q = Queenstown; S = Southampton);

Exploratory Data Analysis (EDA)

How Many Survived?

survived = train_df[train_df['Survived']==1]
not_survived = train_df[train_df['Survived']==0]

print("Survived: %i (%.1f percent)\nNot Survived: %i (%.1f percent)\nTotal: %i"\
      %(len(survived), 1.*len(survived)/len(train_df)*100.0,\
        len(not_survived), 1.*len(not_survived)/len(train_df)*100.0, len(train_df)))

Survived: 342 (38.4 percent)
Not Survived: 549 (61.6 percent)
Total: 891

In order to have a better predictive algorithm, we must achieve better than 61.6% accuracy. (Same as predicting everyone dies)

Age Analysis

Converting Age to integer

g = sns.FacetGrid(train_df, col='Survived',hue='Survived')
g.map(plt.hist, "Age",bins=20);
sns.despine(left = True,bottom = True)

a = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
a.map(sns.kdeplot, 'Age', shade= True)
a.set(xlim=(0 , train_df['Age'].max()))
sns.despine(left = True,bottom = True)

The y-axis of a KDE plot represents the Kernel Density Estimate (KDE) of the Probability Density Function (PDF) of a random variable, interpreted as a probability differential. The probability of a value being between two x values is the total area under the curve.

Graphical Observations Regarding Age vs Survival Probability

Children under 8 years old had high survival rate;
Passengers above 40 years old had split odds of survival;
Most passengers that did not survive were between 16 and 32 years old;

Unfortunatly, around 20% of the combined dataframe does not have an 'Age' value.

Sex Analysis

ax = sns.countplot(x = 'Sex', hue = 'Survived', data = train_df)
ax.set(xlabel = 'Sex', ylabel='Total')
sns.despine(left = True,bottom = True)

Graphical Observations Regarding Sex vs Survival Probability

74% of Females Survived vs 19% of Males Survived

PClass Analysis

sns.barplot(x='Pclass', y='Survived', palette='Pastel1', data=train_df)
sns.despine(left = True,bottom = True)

Embarked

sns.barplot(x='Embarked', y='Survived', palette='Pastel1', data=train_df)
sns.despine(left = True,bottom = True)

Fare

a = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
a.map(sns.kdeplot, 'Fare', shade= True)
a.set(xlim=(0 , train_df['Fare'].max()))
sns.despine(left = True,bottom = True)

a = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
a.map(sns.kdeplot, 'Fare', shade= True)
a.set(xlim=(0 , 100))
sns.despine(left = True,bottom = True)

f,ax=plt.subplots(1,3,figsize=(20,8))
sns.distplot(train_df[train_df['Pclass']==1].Fare,ax=ax[0], color="#bae1ff")
ax[0].set_title('Fares in Pclass 1')
sns.distplot(train_df[train_df['Pclass']==2].Fare,ax=ax[1],color="#bae1ff")
ax[1].set_title('Fares in Pclass 2')
sns.distplot(train_df[train_df['Pclass']==3].Fare,ax=ax[2],color="#bae1ff")
ax[2].set_title('Fares in Pclass 3')
sns.despine(left = True,bottom = True)
plt.show()

Feature Engineering

train_test_df = [train_df, test_df]

Sex

sex_mapping = {'male': 0, 'female': 1}
for df in train_test_df:
    df['Sex'] = df['Sex'].map(sex_mapping)

Embarked

ax = sns.countplot(x = 'Embarked', hue = 'Pclass', palette='Pastel1', data = train_df)
ax.set(xlabel = 'Embarked', ylabel='Total')
sns.despine(left = True,bottom = True)

train_test_df = [train_df, test_df]

for df in train_test_df:
    df['Embarked'] = df['Embarked'].fillna('S')

emb_mapping = {'S': 0, 'C': 1, 'Q': 2}
for df in train_test_df:
    df['Embarked'] = df['Embarked'].map(emb_mapping)

SibSp and Parch Analysis

train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
axes = sns.pointplot('FamilySize','Survived', data=train_df)
sns.despine(left = True,bottom = True)

facet = sns.FacetGrid(train_df, hue='Survived', aspect=4)
facet.map(sns.kdeplot, 'FamilySize', shade=True)
facet.set(xlim=(0, train_df['FamilySize'].max()))
plt.xlim(0)
sns.despine(left = True,bottom = True)

train_df = train_df.drop(['Parch','SibSp'], axis=1)
test_df = test_df.drop(['Parch','SibSp'], axis=1)

Name Analysis

In order to fill most accuratly the missing-values associated with Age and further detail our Exploratory Data Analysis, let's extract titles from passenger names. Not only grouping the final result by Pclass should improve our estimates, most importantly we should group by Parch as most definitely it will present the most discrepencies regarding the mean age (e.g. a "Miss" title may translate a group age of either a child, yound-adult or adult, depending whether or not the person is accompanied by other famility members).

train_test_df = [train_df, test_df]

for df in train_test_df:
    df['Title'] = df['Name'].str.extract(' ([A-Za-z]+)\.', expand=False)

train_df['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Jonkheer      1
Sir           1
Mme           1
Lady          1
Ms            1
Countess      1
Don           1
Capt          1
Name: Title, dtype: int64

test_df['Title'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Rev         2
Col         2
Ms          1
Dr          1
Dona        1
Name: Title, dtype: int64

TitleDict = {"Capt": "Officer","Col": "Officer","Major": "Officer","Jonkheer": "Nobility", \
             "Don": "Nobility", "Sir" : "Nobility","Dr": "Nobility","Rev": "Nobility", \
             "Countess":"Nobility", "Mme": "Mrs", "Mlle": "Miss", "Ms": "Mrs","Mr" : "Mr", \
             "Mrs" : "Mrs","Miss" : "Miss","Master" : "Master","Lady" : "Nobility", 'Dona':'Nobility'}

for df in train_test_df:
    df['Title'] = df['Title'].map(TitleDict)

sns.barplot(x='Title', y='Survived', palette='Pastel1', data=train_df)
sns.despine(left = True, bottom = True)

title_mapping = {"Master": 0, "Miss": 1, "Mr": 2, "Mrs": 3, "Nobility": 4, "Officer": 5}

for df in train_test_df:
    df['Title'] = df['Title'].map(title_mapping)

train_df.drop('Name', axis = 1, inplace = True)
test_df.drop('Name', axis = 1, inplace = True)

train_df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PassengerId	Survived	Pclass	Sex	Age	Ticket	Fare	Cabin	Embarked	FamilySize	Title
0	1	0	3	0	22.0	A/5 21171	7.2500	NaN	0	2	2
1	2	1	1	1	38.0	PC 17599	71.2833	C85	1	2	3
2	3	1	3	1	26.0	STON/O2. 3101282	7.9250	NaN	0	1	1
3	4	1	1	1	35.0	113803	53.1000	C123	0	2	3
4	5	0	3	0	35.0	373450	8.0500	NaN	0	1	2

Age

train_df_corr = train_df.corr().abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
train_df_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
train_df_corr[train_df_corr['Feature 1'] == 'Age']

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Feature 1	Feature 2	Correlation Coefficient
7	Age	Age	1.000000
13	Age	Title	0.510545
15	Age	Pclass	0.369226
20	Age	FamilySize	0.301914
40	Age	Fare	0.096067
42	Age	Sex	0.093254
49	Age	Survived	0.077221
66	Age	PassengerId	0.036847
78	Age	Embarked	0.010171

train_df.groupby(['Title', 'Pclass','FamilySize'])['Age'].agg(['mean', 'count'])

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

			mean	count
Title	Pclass	FamilySize
0	1	3	4.000000	1
	1	4	5.960000	2
	2	3	2.416250	8
	2	4	1.000000	1
	3	2	6.140000	3
...	...	...	...	...
4	1	3	47.000000	2
	2	1	42.333333	6
	2	2	41.000000	2
5	1	1	53.250000	4
5	1	3	70.000000	1

74 rows × 2 columns

for df in train_test_df:
    df['Age'].fillna(df.groupby(['Title','Pclass','FamilySize'])['Age'].transform('mean'), inplace=True)

for df in train_test_df:
    df['Age'].fillna(df.groupby(['Title','Pclass'])['Age'].transform('mean'), inplace=True)

sns.set_style("white")
survived = ["#ffb3ba", "#bae1ff"]
sns.set_palette(survived)
#sns.palplot(sns.color_palette())

g = sns.FacetGrid(train_df, col='Survived',hue='Survived')
g.map(plt.hist, "Age",bins=20);
sns.despine(left = True,bottom = True)

a = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
a.map(sns.kdeplot, 'Age', shade= True)
a.set(xlim=(0 , train_df['Age'].max()))
sns.despine(left = True,bottom = True)

facet = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
facet.map(sns.kdeplot, 'Age', shade= True)
facet.set(xlim=(0 , train_df['Age'].max()))
sns.despine(left = True,bottom = True)
facet.set(xticks=np.arange(0,21,1))
plt.xlim(0, 16)

(0.0, 16.0)

facet = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
facet.map(sns.kdeplot, 'Age', shade= True)
facet.set(xlim=(0 , train_df['Age'].max()))
sns.despine(left = True,bottom = True)
facet.set(xticks=np.arange(17,35,1))
plt.xlim(17, 34)

(17.0, 34.0)

facet = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
facet.map(sns.kdeplot, 'Age', shade= True)
facet.set(xlim=(0 , train_df['Age'].max()))
sns.despine(left = True,bottom = True)
facet.set(xticks=np.arange(35,61,1))
plt.xlim(35, 60)

(35.0, 60.0)

facet = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
facet.map(sns.kdeplot, 'Age', shade= True)
facet.set(xlim=(0 , train_df['Age'].max()))
sns.despine(left = True,bottom = True)
facet.set(xticks=np.arange(35,61,1))
plt.xlim(35, 60)

(35.0, 60.0)

for df in train_test_df:
    df.loc[df['Age'] <= 16, 'Age'] = 0
    df.loc[(df['Age'] > 16) & (df['Age'] <= 34), 'Age'] = 1
    df.loc[(df['Age'] > 34) & (df['Age'] <= 43), 'Age'] = 2
    df.loc[(df['Age'] > 43) & (df['Age'] <= 60), 'Age'] = 3
    df.loc[df['Age'] > 60, 'Age'] = 4

Fare

test_df['Fare'].fillna(test_df.groupby(['Pclass'])['Fare'].transform('median'), inplace=True)

facet = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
facet.map(sns.kdeplot, 'Fare', shade= True)
facet.set(xlim=(0 , train_df['Fare'].max()))
sns.despine(left = True,bottom = True)
facet.set(xticks=np.arange(0,21,1))
plt.xlim(0,17)

(0.0, 17.0)

facet = sns.FacetGrid(train_df, hue = 'Survived',aspect=2 )
facet.map(sns.kdeplot, 'Fare', shade= True)
facet.set(xlim=(0 , train_df['Fare'].max()))
sns.despine(left = True,bottom = True)
facet.set(xticks=np.arange(17,31,1))
plt.xlim(17,30)

(17.0, 30.0)

train_test_df = [train_df, test_df]

for df in train_test_df:
    df.loc[df['Fare'] <= 17, 'Fare'] = 0
    df.loc[(df['Fare'] > 17) & (df['Fare'] <= 30), 'Fare'] = 1
    df.loc[(df['Fare'] > 30) & (df['Age'] <= 100), 'Fare'] = 2
    df.loc[df['Fare'] > 100, 'Fare'] = 3

train_df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PassengerId	Survived	Pclass	Sex	Age	Ticket	Fare	Cabin	Embarked	FamilySize	Title
0	1	0	3	0	1.0	A/5 21171	0.0	NaN	0	2	2
1	2	1	1	1	2.0	PC 17599	2.0	C85	1	2	3
2	3	1	3	1	1.0	STON/O2. 3101282	0.0	NaN	0	1	1
3	4	1	1	1	2.0	113803	2.0	C123	0	2	3
4	5	0	3	0	2.0	373450	0.0	NaN	0	1	2

train_df.drop(['PassengerId','Ticket','Cabin'], axis = 1, inplace = True)
test_df.drop(['Ticket','Cabin'], axis = 1, inplace = True)

Modelling

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

((891, 7), (891,), (418, 7))

X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Pclass      418 non-null    int64  
 1   Sex         418 non-null    int64  
 2   Age         418 non-null    float64
 3   Fare        418 non-null    float64
 4   Embarked    418 non-null    int64  
 5   FamilySize  418 non-null    int64  
 6   Title       418 non-null    int64  
dtypes: float64(2), int64(5)
memory usage: 23.0 KB

Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

79.8

SVM

svc = LinearSVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

80.02

KNN

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

85.19

Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

78.0

Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron

72.5

Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

80.02

SGD

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

75.42

Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

88.1

Random Forest

random_forest = RandomForestClassifier(n_estimators=3)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

86.76

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('submissionRF.csv', index=False)

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Decent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Model	Score
8	Decision Tree	88.10
3	Random Forest	86.76
1	KNN	85.19
0	Support Vector Machines	80.02
7	Linear SVC	80.02
2	Logistic Regression	79.80
4	Naive Bayes	78.00
6	Stochastic Gradient Decent	75.42
5	Perceptron	72.50

Cross Validation (K-Fold)

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

k_fold = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 1)

from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=3)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [0.76666667 0.80898876 0.74157303 0.82022472 0.84269663 0.78651685
 0.80898876 0.75280899 0.80898876 0.80898876]
Mean: 0.7946441947565543
Standard Deviation: 0.030334789759678862

importances = pd.DataFrame({'feature':X_train.columns,'Importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('Importance',ascending=False).set_index('feature')
importances.head(7)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Importance
feature
Title	0.248
Sex	0.235
FamilySize	0.179
Age	0.100
Pclass	0.095
Fare	0.081
Embarked	0.062

Grid Search

from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': (10,30,50,70,90,100),
             'criterion': ('gini', 'entropy'),
             'max_depth': (3,5,7,9,10),
             'max_features': ('auto', 'sqrt'),
             'min_samples_split': (2,4,6)
             }

RF_grid = GridSearchCV(RandomForestClassifier(n_jobs = -1, oob_score = False), param_grid = parameters, cv = 3, verbose = True)

RF_grid_model = RF_grid.fit(X_train, Y_train)

Fitting 3 folds for each of 360 candidates, totalling 1080 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1080 out of 1080 | elapsed:  1.5min finished

RF_grid_model.best_estimator_

RandomForestClassifier(criterion='entropy', max_depth=5, n_estimators=10,
                       n_jobs=-1)

RF_grid_model.best_score_

0.839506172839506

Model Evaluation

Confusion Matrix

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
confusion_matrix(Y_train, predictions)

array([[476,  73],
       [107, 235]], dtype=int64)

Precision and Recall

from sklearn.metrics import precision_score, recall_score

print("Precision:", precision_score(Y_train, predictions))
print("Recall:",recall_score(Y_train, predictions))

Precision: 0.762987012987013
Recall: 0.6871345029239766

F-Score

from sklearn.metrics import f1_score
f1_score(Y_train, predictions)

0.7230769230769231

Precision Recall Curve

from sklearn.metrics import precision_recall_curve

# getting the probabilities of our predictions
y_scores = random_forest.predict_proba(X_train)
y_scores = y_scores[:,1]

precision, recall, threshold = precision_recall_curve(Y_train, y_scores)
def plot_precision_and_recall(precision, recall, threshold):
    plt.plot(threshold, precision[:-1], "r-", label="precision", linewidth=5)
    plt.plot(threshold, recall[:-1], "b", label="recall", linewidth=5)
    plt.xlabel("threshold", fontsize=19)
    plt.legend(loc="upper right", fontsize=19)
    plt.ylim([0, 1])

plt.figure(figsize=(14, 7))
plot_precision_and_recall(precision, recall, threshold)
plt.show()

def plot_precision_vs_recall(precision, recall):
    plt.plot(recall, precision, "g--", linewidth=2.5)
    plt.ylabel("recall", fontsize=19)
    plt.xlabel("precision", fontsize=19)
    plt.axis([0, 1.5, 0, 1.5])

plt.figure(figsize=(14, 7))
plot_precision_vs_recall(precision, recall)
plt.show()

ROC AUC Curve

from sklearn.metrics import roc_curve
# compute true positive rate and false positive rate
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_train, y_scores)
# plotting them against each other
def plot_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'r', linewidth=4)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)

plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()

from sklearn.metrics import roc_auc_score
r_a_score = roc_auc_score(Y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)

ROC-AUC-Score: 0.926447874391504

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
imgs		imgs
README.md		README.md
a-random-forest-on-the-atlantic-ocean.ipynb		a-random-forest-on-the-atlantic-ocean.ipynb

joaopereiradsantos/kaggle-titanic

Folders and files

Latest commit

History

Repository files navigation

kaggle-titanic

The Challenge

Data

Data Types, Describing Data and Missing Values

Exploratory Data Analysis (EDA)

How Many Survived?

Age Analysis

Sex Analysis

PClass Analysis

Embarked

Fare

Feature Engineering

Sex

Embarked

SibSp and Parch Analysis

Name Analysis

Age

Fare

Modelling

Logistic Regression

SVM

KNN

Naive Bayes

Perceptron

Linear SVC

SGD

Decision Tree

Random Forest

Cross Validation (K-Fold)

Grid Search

Model Evaluation

Confusion Matrix

Precision and Recall

F-Score

Precision Recall Curve

ROC AUC Curve

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages