Binary classification on Titanic Dataset

6 minute read

This is a classic dataset used in many data mining tutorials and demos – perfect for getting started with exploratory analysis and building binary classification models to predict survival.

Data covers passengers only, not crew.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.preprocessing import StandardScaler

df=pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\titanic.csv")

df.shape

(891, 15)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB

Data cleaning

# Drop duplicate values

df=df.drop_duplicates()
df.shape

(784, 15)

df.isnull().sum()

survived         0
pclass           0
sex              0
age            106
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           582
embark_town      2
alive            0
alone            0
dtype: int64

‘deck’ is mostly empty so we will drop it in the next step.

Age has 106 null values
embarked and embark_town has 2 null values

# Drop columns which are not needed for our analysis or which are duplicates (another column with values meaning the same)

df=df.drop(["deck", "embarked", "adult_male", "alive", "class"], axis=1)

‘deck’ consists of crew data which is excluded in this dataset.

‘embarked’ consists of abbreviated values of ‘embark_town’

‘adult_male’ can be analysed from ‘who’ column; since only a man has adult_male value as 1

‘alive’ is a duplicate of ‘survived’

‘class’ is the textual representation of ‘pclass’

df.head(10)

	survived	pclass	sex	age	sibsp	parch	fare	who	embark_town	alone
0	0	3	male	22.0	1	0	7.2500	man	Southampton	False
1	1	1	female	38.0	1	0	71.2833	woman	Cherbourg	False
2	1	3	female	26.0	0	0	7.9250	woman	Southampton	True
3	1	1	female	35.0	1	0	53.1000	woman	Southampton	False
4	0	3	male	35.0	0	0	8.0500	man	Southampton	True
5	0	3	male	NaN	0	0	8.4583	man	Queenstown	True
6	0	1	male	54.0	0	0	51.8625	man	Southampton	True
7	0	3	male	2.0	3	1	21.0750	child	Southampton	False
8	1	3	female	27.0	0	2	11.1333	woman	Southampton	False
9	1	2	female	14.0	1	0	30.0708	child	Cherbourg	False

df["age"].describe()

count    678.000000
mean      29.869351
std       14.759076
min        0.420000
25%       20.000000
50%       28.250000
75%       39.000000
max       80.000000
Name: age, dtype: float64

# Fill the empty values in 'age' column to the mean value of age

print('Replacing null values with mean value : ', int(df["age"].mean()))
df["age"] = df["age"].fillna(int(df["age"].mean()))
df.head(10)

Replacing null values with mean value :  29

	survived	pclass	sex	age	sibsp	parch	fare	who	embark_town	alone
0	0	3	male	22.0	1	0	7.2500	man	Southampton	False
1	1	1	female	38.0	1	0	71.2833	woman	Cherbourg	False
2	1	3	female	26.0	0	0	7.9250	woman	Southampton	True
3	1	1	female	35.0	1	0	53.1000	woman	Southampton	False
4	0	3	male	35.0	0	0	8.0500	man	Southampton	True
5	0	3	male	29.0	0	0	8.4583	man	Queenstown	True
6	0	1	male	54.0	0	0	51.8625	man	Southampton	True
7	0	3	male	2.0	3	1	21.0750	child	Southampton	False
8	1	3	female	27.0	0	2	11.1333	woman	Southampton	False
9	1	2	female	14.0	1	0	30.0708	child	Cherbourg	False

# Drop rows with empty (null) values

df=df.dropna()

Exploring the Data

a = ['survived', 'pclass', 'sex','sibsp', 'parch', 'who', 'embark_town', 'alone']
b = ['age', 'fare']

for i in a:
    sns.countplot(x=i, data=df)
    plt.show()
    print(df[i].value_counts())

png

0    461
1    321
Name: survived, dtype: int64

png

  405
  212
  165
Name: pclass, dtype: int64

png

male      491
female    291
Name: sex, dtype: int64

png

  515
  201
   27
   18
   14
    5
    2
Name: sibsp, dtype: int64

png

  578
  114
   75
    5
    5
    4
    1
Name: parch, dtype: int64

png

man      451
woman    249
child     82
Name: who, dtype: int64

png

Southampton    568
Cherbourg      155
Queenstown      59
Name: embark_town, dtype: int64

png

True     444
False    338
Name: alone, dtype: int64

From the above countplots, we can make the following conclusions:

Survived

461 passengers survived while 321 didnt.

Passenger Class

Majority of the passengers were in 3rd class, followed by 1st class and then 2nd class

Sex

Number of male passengers was more than female

Port of Embarkation

Majority of the passengers embarked from Southampton, followed by Cherbourg and the Queenstown

Accompanied

There were more passengers who travelled alone than with others

for i in b:
    print(df[i].describe())
    print(df[i].skew())
    sns.distplot(df[i], kde=False)
    plt.show()

count    782.000000
mean      29.700026
std       13.692729
min        0.420000
25%       22.000000
50%       29.000000
75%       36.000000
max       80.000000
Name: age, dtype: float64
0.4190340853087404

png

count    782.000000
mean      34.595913
std       52.176458
min        0.000000
25%        8.050000
50%       15.875000
75%       33.375000
max      512.329200
Name: fare, dtype: float64
4.583205969233933

png

From the above countplots, we can make the following conclusions:

Age

Youngest: 0.42 (5 months)
Average age: 29
Oldest: 80

Fare

Lowest: 0 (free)
Average: 34.5
Highest: 512

The highly positive skewed data (and the info shown above) proves that 75% of the passengers paid a fare lesser than 34.

sns.swarmplot(x="survived", y="fare", data=df)
plt.show()

png

From the figure, we can come to a conclusion that there were a few set of ‘elite’ passengers who paid a very high fare, and were able to survive.

df[['pclass', 'survived']].groupby(['pclass'], as_index=False).mean()

	pclass	survived
0	1	0.627358
1	2	0.509091
2	3	0.256790

sns.countplot(x='pclass',hue="survived", data=df)
plt.show()

png

Survival rate was highest in 1st (upper) class and lowest in 3rd class.

Highest number of passengers who survived were from 1st class.

Highest number of passengers who DID NOT survive were from 3rd class.

df[['sex', 'survived']].groupby(['sex'], as_index=False).mean()

	sex	survived
0	female	0.738832
1	male	0.215886

sns.countplot(x='sex',hue="survived", data=df)
plt.show()

png

Survival rate of female passengers was more than male

age_survived = sns.FacetGrid(df, col='survived')
age_survived.map(plt.hist, 'age', bins=15)
plt.show()

png

Older passengers (of around 80 years old) survived.

Machine Learning Model - Logistic Regression

# Convert Boolean to Integer

df["alone"] = df["alone"].astype(int)
df.head()

	survived	pclass	sex	age	sibsp	fare	who	embark_town	alone
0	0	3	male	22.0	1	7.2500	man	Southampton	0
1	1	1	female	38.0	1	71.2833	woman	Cherbourg	0
2	1	3	female	26.0	0	7.9250	woman	Southampton	1
3	1	1	female	35.0	1	53.1000	woman	Southampton	0
4	0	3	male	35.0	0	8.0500	man	Southampton	1

# Encode the data (convert strings to numbers) so that the model can understand.

le_sex = LabelEncoder()
df["sex"]=le_sex.fit_transform(df["sex"])

le_who = LabelEncoder()
df["who"]=le_who.fit_transform(df["who"])

le_embark_town = LabelEncoder()
df["embark_town"]=le_embark_town.fit_transform(df["embark_town"])

print(df.shape)
df.head(10)

(782, 10)

	survived	pclass	sex	age	sibsp	parch	fare	who	embark_town	alone
0	0	3	1	22.0	1	0	7.2500	1	2	0
1	1	1	0	38.0	1	0	71.2833	2	0	0
2	1	3	0	26.0	0	0	7.9250	2	2	1
3	1	1	0	35.0	1	0	53.1000	2	2	0
4	0	3	1	35.0	0	0	8.0500	1	2	1
5	0	3	1	29.0	0	0	8.4583	1	1	1
6	0	1	1	54.0	0	0	51.8625	1	2	1
7	0	3	1	2.0	3	1	21.0750	0	2	0
8	1	3	0	27.0	0	2	11.1333	2	2	0
9	1	2	0	14.0	1	0	30.0708	0	0	0

# Split the dependent and independent values
x = df.drop("survived", axis=1)
y = df["survived"]

# pre-processing the data
x = StandardScaler().fit(x).transform(x)

# Split the data for training and testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8)

print ('Train set:', xtrain.shape,  ytrain.shape)
print ('Test set:', xtest.shape,  ytest.shape)

Train set: (625, 9) (625,)
Test set: (157, 9) (157,)

# Load logistic regression model from sklearn and fit the training sets
algo = LogisticRegression().fit(xtrain,ytrain)

# find out the predictions for the testing set
ypred = algo.predict(xtest)

# compare predicted values and actual values and find out accuracy

print("Mean Absolute Error: ", mean_absolute_error(ytest,ypred))
print("Accuracy: ", accuracy_score(ytest,ypred))

Mean Absolute Error:  0.17834394904458598
Accuracy:  0.821656050955414

Highest Accuracy : 0.821656050955414

Share on

Twitter Facebook Google+ LinkedIn

Rakshith Poojary

Binary classification on Titanic Dataset

Data cleaning

Exploring the Data

Machine Learning Model - Logistic Regression

Highest Accuracy : 0.821656050955414

Share on

You May Also Enjoy

Classification on Breast Cancer Wisconsin dataset

Binary classification on PIMA Indian Dataset

Model to predict power output of a peaker power plant