Binary classification on PIMA Indian Dataset

6 minute read

This problem is comprised of 768 observations of medical details for Pima indians patents. The records describe instantaneous measurements taken from the patient such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0).

The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes.

There may be some missing values with which you have to deal with.

Build a prediction Algorithm using Decision Tree.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.metrics import mean_absolute_error, accuracy_score, confusion_matrix
df=pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\diabetes.csv")
df.shape
(768, 9)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
df=df.drop_duplicates()
df.shape
(768, 9)
df.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
df.head().T
0 1 2 3 4
Pregnancies 6.000 1.000 8.000 1.000 0.000
Glucose 148.000 85.000 183.000 89.000 137.000
BloodPressure 72.000 66.000 64.000 66.000 40.000
SkinThickness 35.000 29.000 0.000 23.000 35.000
Insulin 0.000 0.000 0.000 94.000 168.000
BMI 33.600 26.600 23.300 28.100 43.100
DiabetesPedigreeFunction 0.627 0.351 0.672 0.167 2.288
Age 50.000 31.000 32.000 21.000 33.000
Outcome 1.000 0.000 1.000 0.000 1.000
# Print number of missing values by count
print((df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']] == 0).sum())
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
dtype: int64
# Replace all 0 with NaN
df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)
# print the first 5 rows of data
df.head().T
0 1 2 3 4
Pregnancies 6.000 1.000 8.000 1.000 0.000
Glucose 148.000 85.000 183.000 89.000 137.000
BloodPressure 72.000 66.000 64.000 66.000 40.000
SkinThickness 35.000 29.000 NaN 23.000 35.000
Insulin NaN NaN NaN 94.000 168.000
BMI 33.600 26.600 23.300 28.100 43.100
DiabetesPedigreeFunction 0.627 0.351 0.672 0.167 2.288
Age 50.000 31.000 32.000 21.000 33.000
Outcome 1.000 0.000 1.000 0.000 1.000
# count the number of NaN (null) values in each column
print(df.isnull().sum())
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
# Replace the null values with the mean of that column
df = df.fillna(df.mean())

# Check the number of null values after replacing
print(df.isnull().sum())
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
df.head().T
0 1 2 3 4
Pregnancies 6.000000 1.000000 8.000000 1.000 0.000
Glucose 148.000000 85.000000 183.000000 89.000 137.000
BloodPressure 72.000000 66.000000 64.000000 66.000 40.000
SkinThickness 35.000000 29.000000 29.153420 23.000 35.000
Insulin 155.548223 155.548223 155.548223 94.000 168.000
BMI 33.600000 26.600000 23.300000 28.100 43.100
DiabetesPedigreeFunction 0.627000 0.351000 0.672000 0.167 2.288
Age 50.000000 31.000000 32.000000 21.000 33.000
Outcome 1.000000 0.000000 1.000000 0.000 1.000
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
a = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']
b = ['Outcome']
plt.rcParams['figure.figsize'] = (20, 10)
df.hist()
plt.show()

png

plt.rcParams['figure.figsize'] = (6, 5)
sns.countplot(x="Outcome",data=df)
plt.show()
print(df["Outcome"].value_counts())

png

0    500
1    268
Name: Outcome, dtype: int64

This shows that there are 500 negative results and 268 positive results

for i in a:
    sns.swarmplot(x="Outcome",y=i,data=df)
    plt.show()

png

png

png

png

png

png

png

png

No clear conclusions can be drawn from the above swarmplots
But these are some readings that could prove useful:

Women with more than 13 pregnancies tend to get diabetes
Women with glucose less than 75 tend to NOT get diabetes.

for i in a:
    print(df[i].describe())
    print(df[i].skew())
    sns.distplot(df[i], kde=False)
    plt.show()
count    768.000000
mean       3.845052
std        3.369578
min        0.000000
25%        1.000000
50%        3.000000
75%        6.000000
max       17.000000
Name: Pregnancies, dtype: float64
0.9016739791518588

png

count    768.000000
mean     121.686763
std       30.435949
min       44.000000
25%       99.750000
50%      117.000000
75%      140.250000
max      199.000000
Name: Glucose, dtype: float64
0.5327186599872982

png

count    768.000000
mean      72.405184
std       12.096346
min       24.000000
25%       64.000000
50%       72.202592
75%       80.000000
max      122.000000
Name: BloodPressure, dtype: float64
0.13730536744146796

png

count    768.000000
mean      29.153420
std        8.790942
min        7.000000
25%       25.000000
50%       29.153420
75%       32.000000
max       99.000000
Name: SkinThickness, dtype: float64
0.8221731383793047

png

count    768.000000
mean     155.548223
std       85.021108
min       14.000000
25%      121.500000
50%      155.548223
75%      155.548223
max      846.000000
Name: Insulin, dtype: float64
3.019083661355125

png

count    768.000000
mean      32.457464
std        6.875151
min       18.200000
25%       27.500000
50%       32.400000
75%       36.600000
max       67.100000
Name: BMI, dtype: float64
0.5982526551146302

png

count    768.000000
mean       0.471876
std        0.331329
min        0.078000
25%        0.243750
50%        0.372500
75%        0.626250
max        2.420000
Name: DiabetesPedigreeFunction, dtype: float64
1.919911066307204

png

count    768.000000
mean      33.240885
std       11.760232
min       21.000000
25%       24.000000
50%       29.000000
75%       41.000000
max       81.000000
Name: Age, dtype: float64
1.1295967011444805

png

From the above distplots, we can make the following conclusions:

Distplot 1 : Pregnancies

Least Pregnancy: 0
Highest Pregnancy: 17
Average Pregnancy: 4

Since it’s positively skewed - Majority of the women have fewer pregnancies.

Distplot 2 : Glucose

Least Glucose level: 44
Highest Glucose level: 199
Average Glucose level: 121.68

Since it’s positively skewed - Majority of the women have lower glucose level.

Distplot 3 : Blood Pressure

Least Blood Pressure: 24 mm Hg
Highest Blood Pressure: 122 mm Hg
Average Blood Pressure: 72.40 mm Hg

Since it’s positively skewed - Majority of the women have a lower Blood Pressure.

Distplot 4 : Skin Thickness

Least Skin Thickness: 7 mm
Highest Skin Thickness: 99 mm
Average Skin Thickness: 29.15 mm

Since it’s positively skewed - Majority of the women have a lower Skin Thickness.

Distplot 5 : Insulin

Least Insulin level: 14 U/ml
Highest Insulin level: 846 U/ml
Average Insulin level: 155.54 U/ml

Since it’s positively skewed - Majority of the women have a lower insulin level.

Distplot 6 : BMI

Least BMI: 18.20
Highest BMI: 67.10
Average BMI: 32.45

Since it’s positively skewed - Majority of the women have a lower BMI.

Distplot 7 : DiabetesPedigreeFunction

Least value: 0.078
Highest value: 2.420
Average value: 0.471

Since it’s positively skewed - Majority of the women have a lower DiabetesPedigreeFunction value.

Distplot 8 : Age

Least age: 21
Highest age: 33
Average age: 81

Since it’s positively skewed - Majority of the women are younger.

print(df.skew())
Pregnancies                 0.901674
Glucose                     0.532719
BloodPressure               0.137305
SkinThickness               0.822173
Insulin                     3.019084
BMI                         0.598253
DiabetesPedigreeFunction    1.919911
Age                         1.129597
Outcome                     0.635017
dtype: float64

Insulin and DiabetesPedigreeFunction are highly skewed data. So we will normalize by using the log values of the colulmns

df['Insulin'] = np.log(df['Insulin'])
print("skew: ", df['Insulin'].skew())
sns.distplot(df['Insulin'], kde=False)
plt.show()

print('-'*40)

df['DiabetesPedigreeFunction'] = np.log(df['DiabetesPedigreeFunction'])
print("skew: ", df['DiabetesPedigreeFunction'].skew())
sns.distplot(df['DiabetesPedigreeFunction'], kde=False)
plt.show()
skew:  -0.7896950416949489

png

----------------------------------------
skew:  0.11417768826564408

png

Machine Learning model - Decision Tree

# Split the dependent and independent values
x = df.drop("Outcome",axis=1)
y = df["Outcome"]
# pre-processing the data
x = StandardScaler().fit(x).transform(x)
# Split the data for training and testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.7)
print ('Train set:', xtrain.shape,  ytrain.shape)
print ('Test set:', xtest.shape,  ytest.shape)
Train set: (537, 8) (537,)
Test set: (231, 8) (231,)
# Load decision tree classifier model from sklearn and fit the training sets
algo = DecisionTreeClassifier(criterion='entropy',max_depth=5)
algo.fit(xtrain,ytrain)
DecisionTreeClassifier(criterion='entropy', max_depth=5)
plt.figure(figsize=(20,10))
plot_tree(algo,filled=True)
plt.show()

png

ypred = algo.predict(xtest)

# compare predicted values and actual values and find out accuracy
print("Accuracy: ", accuracy_score(ytest,ypred))
Accuracy:  0.7445887445887446
print(confusion_matrix(ytest,ypred))
[[112  38]
 [ 21  60]]

Highest Accuracy: 0.7662337662337663