Binary classification on Titanic Dataset
This is a classic dataset used in many data mining tutorials and demos – perfect for getting started with exploratory analysis and building binary classification models to predict survival.
Data covers passengers only, not crew.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.preprocessing import StandardScaler
df=pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\titanic.csv")
df.shape
(891, 15)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
Data cleaning
# Drop duplicate values
df=df.drop_duplicates()
df.shape
(784, 15)
df.isnull().sum()
survived         0
pclass           0
sex              0
age            106
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           582
embark_town      2
alive            0
alone            0
dtype: int64
‘deck’ is mostly empty so we will drop it in the next step.
Age has 106 null values
embarked and embark_town has 2 null values
# Drop columns which are not needed for our analysis or which are duplicates (another column with values meaning the same)
df=df.drop(["deck", "embarked", "adult_male", "alive", "class"], axis=1)
‘deck’ consists of crew data which is excluded in this dataset.
‘embarked’ consists of abbreviated values of ‘embark_town’
‘adult_male’ can be analysed from ‘who’ column; since only a man has adult_male value as 1
‘alive’ is a duplicate of ‘survived’
‘class’ is the textual representation of ‘pclass’
df.head(10)
| survived | pclass | sex | age | sibsp | parch | fare | who | embark_town | alone | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | Southampton | False | 
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | Cherbourg | False | 
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | Southampton | True | 
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | Southampton | False | 
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | Southampton | True | 
| 5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | man | Queenstown | True | 
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | man | Southampton | True | 
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.0750 | child | Southampton | False | 
| 8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | woman | Southampton | False | 
| 9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | child | Cherbourg | False | 
df["age"].describe()
count    678.000000
mean      29.869351
std       14.759076
min        0.420000
25%       20.000000
50%       28.250000
75%       39.000000
max       80.000000
Name: age, dtype: float64
# Fill the empty values in 'age' column to the mean value of age
print('Replacing null values with mean value : ', int(df["age"].mean()))
df["age"] = df["age"].fillna(int(df["age"].mean()))
df.head(10)
Replacing null values with mean value :  29
| survived | pclass | sex | age | sibsp | parch | fare | who | embark_town | alone | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | Southampton | False | 
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | Cherbourg | False | 
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | Southampton | True | 
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | Southampton | False | 
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | Southampton | True | 
| 5 | 0 | 3 | male | 29.0 | 0 | 0 | 8.4583 | man | Queenstown | True | 
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | man | Southampton | True | 
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.0750 | child | Southampton | False | 
| 8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | woman | Southampton | False | 
| 9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | child | Cherbourg | False | 
# Drop rows with empty (null) values
df=df.dropna()
Exploring the Data
a = ['survived', 'pclass', 'sex','sibsp', 'parch', 'who', 'embark_town', 'alone']
b = ['age', 'fare']
for i in a:
    sns.countplot(x=i, data=df)
    plt.show()
    print(df[i].value_counts())

0    461
1    321
Name: survived, dtype: int64

3    405
1    212
2    165
Name: pclass, dtype: int64

male      491
female    291
Name: sex, dtype: int64

0    515
1    201
2     27
4     18
3     14
5      5
8      2
Name: sibsp, dtype: int64

0    578
1    114
2     75
5      5
3      5
4      4
6      1
Name: parch, dtype: int64

man      451
woman    249
child     82
Name: who, dtype: int64

Southampton    568
Cherbourg      155
Queenstown      59
Name: embark_town, dtype: int64

True     444
False    338
Name: alone, dtype: int64
From the above countplots, we can make the following conclusions:
Survived
461 passengers survived while 321 didnt.
Passenger Class
Majority of the passengers were in 3rd class, followed by 1st class and then 2nd class
Sex
Number of male passengers was more than female
Port of Embarkation
Majority of the passengers embarked from Southampton, followed by Cherbourg and the Queenstown
Accompanied
There were more passengers who travelled alone than with others
for i in b:
    print(df[i].describe())
    print(df[i].skew())
    sns.distplot(df[i], kde=False)
    plt.show()
count    782.000000
mean      29.700026
std       13.692729
min        0.420000
25%       22.000000
50%       29.000000
75%       36.000000
max       80.000000
Name: age, dtype: float64
0.4190340853087404

count    782.000000
mean      34.595913
std       52.176458
min        0.000000
25%        8.050000
50%       15.875000
75%       33.375000
max      512.329200
Name: fare, dtype: float64
4.583205969233933

From the above countplots, we can make the following conclusions:
Age
Youngest: 0.42 (5 months)
Average age: 29
Oldest: 80
Fare
Lowest: 0 (free)
Average: 34.5
Highest: 512
The highly positive skewed data (and the info shown above) proves that 75% of the passengers paid a fare lesser than 34.
sns.swarmplot(x="survived", y="fare", data=df)
plt.show()

From the figure, we can come to a conclusion that there were a few set of ‘elite’ passengers who paid a very high fare, and were able to survive.
df[['pclass', 'survived']].groupby(['pclass'], as_index=False).mean()
| pclass | survived | |
|---|---|---|
| 0 | 1 | 0.627358 | 
| 1 | 2 | 0.509091 | 
| 2 | 3 | 0.256790 | 
sns.countplot(x='pclass',hue="survived", data=df)
plt.show()

Survival rate was highest in 1st (upper) class and lowest in 3rd class.
Highest number of passengers who survived were from 1st class.
Highest number of passengers who DID NOT survive were from 3rd class.
df[['sex', 'survived']].groupby(['sex'], as_index=False).mean()
| sex | survived | |
|---|---|---|
| 0 | female | 0.738832 | 
| 1 | male | 0.215886 | 
sns.countplot(x='sex',hue="survived", data=df)
plt.show()

Survival rate of female passengers was more than male
age_survived = sns.FacetGrid(df, col='survived')
age_survived.map(plt.hist, 'age', bins=15)
plt.show()

Older passengers (of around 80 years old) survived.
Machine Learning Model - Logistic Regression
# Convert Boolean to Integer
df["alone"] = df["alone"].astype(int)
df.head()
| survived | pclass | sex | age | sibsp | parch | fare | who | embark_town | alone | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | man | Southampton | 0 | 
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | woman | Cherbourg | 0 | 
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | woman | Southampton | 1 | 
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | woman | Southampton | 0 | 
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | man | Southampton | 1 | 
# Encode the data (convert strings to numbers) so that the model can understand.
le_sex = LabelEncoder()
df["sex"]=le_sex.fit_transform(df["sex"])
le_who = LabelEncoder()
df["who"]=le_who.fit_transform(df["who"])
le_embark_town = LabelEncoder()
df["embark_town"]=le_embark_town.fit_transform(df["embark_town"])
print(df.shape)
df.head(10)
(782, 10)
| survived | pclass | sex | age | sibsp | parch | fare | who | embark_town | alone | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 1 | 2 | 0 | 
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 2 | 0 | 0 | 
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 2 | 2 | 1 | 
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 2 | 2 | 0 | 
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 1 | 2 | 1 | 
| 5 | 0 | 3 | 1 | 29.0 | 0 | 0 | 8.4583 | 1 | 1 | 1 | 
| 6 | 0 | 1 | 1 | 54.0 | 0 | 0 | 51.8625 | 1 | 2 | 1 | 
| 7 | 0 | 3 | 1 | 2.0 | 3 | 1 | 21.0750 | 0 | 2 | 0 | 
| 8 | 1 | 3 | 0 | 27.0 | 0 | 2 | 11.1333 | 2 | 2 | 0 | 
| 9 | 1 | 2 | 0 | 14.0 | 1 | 0 | 30.0708 | 0 | 0 | 0 | 
# Split the dependent and independent values
x = df.drop("survived", axis=1)
y = df["survived"]
# pre-processing the data
x = StandardScaler().fit(x).transform(x)
# Split the data for training and testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8)
print ('Train set:', xtrain.shape,  ytrain.shape)
print ('Test set:', xtest.shape,  ytest.shape)
Train set: (625, 9) (625,)
Test set: (157, 9) (157,)
# Load logistic regression model from sklearn and fit the training sets
algo = LogisticRegression().fit(xtrain,ytrain)
# find out the predictions for the testing set
ypred = algo.predict(xtest)
# compare predicted values and actual values and find out accuracy
print("Mean Absolute Error: ", mean_absolute_error(ytest,ypred))
print("Accuracy: ", accuracy_score(ytest,ypred))
Mean Absolute Error:  0.17834394904458598
Accuracy:  0.821656050955414
 
  
  
