Binary classification on PIMA Indian Dataset

6 minute read

This problem is comprised of 768 observations of medical details for Pima indians patents. The records describe instantaneous measurements taken from the patient such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0).

The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes.

There may be some missing values with which you have to deal with.

Build a prediction Algorithm using Decision Tree.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.metrics import mean_absolute_error, accuracy_score, confusion_matrix

df=pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\diabetes.csv")

df.shape

(768, 9)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

df=df.drop_duplicates()
df.shape

(768, 9)

df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

df.head().T

	0	1	2	3	4
Pregnancies	6.000	1.000	8.000	1.000	0.000
Glucose	148.000	85.000	183.000	89.000	137.000
BloodPressure	72.000	66.000	64.000	66.000	40.000
SkinThickness	35.000	29.000	0.000	23.000	35.000
Insulin	0.000	0.000	0.000	94.000	168.000
BMI	33.600	26.600	23.300	28.100	43.100
DiabetesPedigreeFunction	0.627	0.351	0.672	0.167	2.288
Age	50.000	31.000	32.000	21.000	33.000
Outcome	1.000	0.000	1.000	0.000	1.000

# Print number of missing values by count
print((df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']] == 0).sum())

Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
dtype: int64

# Replace all 0 with NaN
df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)

# print the first 5 rows of data
df.head().T

	0	1	2	3	4
Pregnancies	6.000	1.000	8.000	1.000	0.000
Glucose	148.000	85.000	183.000	89.000	137.000
BloodPressure	72.000	66.000	64.000	66.000	40.000
SkinThickness	35.000	29.000	NaN	23.000	35.000
Insulin	NaN	NaN	NaN	94.000	168.000
BMI	33.600	26.600	23.300	28.100	43.100
DiabetesPedigreeFunction	0.627	0.351	0.672	0.167	2.288
Age	50.000	31.000	32.000	21.000	33.000
Outcome	1.000	0.000	1.000	0.000	1.000

# count the number of NaN (null) values in each column
print(df.isnull().sum())

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

# Replace the null values with the mean of that column
df = df.fillna(df.mean())

# Check the number of null values after replacing
print(df.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

df.head().T

	0	1	2	3	4
Pregnancies	6.000000	1.000000	8.000000	1.000	0.000
Glucose	148.000000	85.000000	183.000000	89.000	137.000
BloodPressure	72.000000	66.000000	64.000000	66.000	40.000
SkinThickness	35.000000	29.000000	29.153420	23.000	35.000
Insulin	155.548223	155.548223	155.548223	94.000	168.000
BMI	33.600000	26.600000	23.300000	28.100	43.100
DiabetesPedigreeFunction	0.627000	0.351000	0.672000	0.167	2.288
Age	50.000000	31.000000	32.000000	21.000	33.000
Outcome	1.000000	0.000000	1.000000	0.000	1.000

df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

a = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']
b = ['Outcome']

plt.rcParams['figure.figsize'] = (20, 10)
df.hist()
plt.show()

png

plt.rcParams['figure.figsize'] = (6, 5)
sns.countplot(x="Outcome",data=df)
plt.show()
print(df["Outcome"].value_counts())

png

0    500
1    268
Name: Outcome, dtype: int64

This shows that there are 500 negative results and 268 positive results

for i in a:
    sns.swarmplot(x="Outcome",y=i,data=df)
    plt.show()

png

No clear conclusions can be drawn from the above swarmplots
But these are some readings that could prove useful:

Women with more than 13 pregnancies tend to get diabetes
Women with glucose less than 75 tend to NOT get diabetes.

for i in a:
    print(df[i].describe())
    print(df[i].skew())
    sns.distplot(df[i], kde=False)
    plt.show()

count    768.000000
mean       3.845052
std        3.369578
min        0.000000
25%        1.000000
50%        3.000000
75%        6.000000
max       17.000000
Name: Pregnancies, dtype: float64
0.9016739791518588

png

count    768.000000
mean     121.686763
std       30.435949
min       44.000000
25%       99.750000
50%      117.000000
75%      140.250000
max      199.000000
Name: Glucose, dtype: float64
0.5327186599872982

png

count    768.000000
mean      72.405184
std       12.096346
min       24.000000
25%       64.000000
50%       72.202592
75%       80.000000
max      122.000000
Name: BloodPressure, dtype: float64
0.13730536744146796

png

count    768.000000
mean      29.153420
std        8.790942
min        7.000000
25%       25.000000
50%       29.153420
75%       32.000000
max       99.000000
Name: SkinThickness, dtype: float64
0.8221731383793047

png

count    768.000000
mean     155.548223
std       85.021108
min       14.000000
25%      121.500000
50%      155.548223
75%      155.548223
max      846.000000
Name: Insulin, dtype: float64
3.019083661355125

png

count    768.000000
mean      32.457464
std        6.875151
min       18.200000
25%       27.500000
50%       32.400000
75%       36.600000
max       67.100000
Name: BMI, dtype: float64
0.5982526551146302

png

count    768.000000
mean       0.471876
std        0.331329
min        0.078000
25%        0.243750
50%        0.372500
75%        0.626250
max        2.420000
Name: DiabetesPedigreeFunction, dtype: float64
1.919911066307204

png

count    768.000000
mean      33.240885
std       11.760232
min       21.000000
25%       24.000000
50%       29.000000
75%       41.000000
max       81.000000
Name: Age, dtype: float64
1.1295967011444805

png

From the above distplots, we can make the following conclusions:

Distplot 1 : Pregnancies

Least Pregnancy: 0
Highest Pregnancy: 17
Average Pregnancy: 4

Since it’s positively skewed - Majority of the women have fewer pregnancies.

Distplot 2 : Glucose

Least Glucose level: 44
Highest Glucose level: 199
Average Glucose level: 121.68

Since it’s positively skewed - Majority of the women have lower glucose level.

Distplot 3 : Blood Pressure

Least Blood Pressure: 24 mm Hg
Highest Blood Pressure: 122 mm Hg
Average Blood Pressure: 72.40 mm Hg

Since it’s positively skewed - Majority of the women have a lower Blood Pressure.

Distplot 4 : Skin Thickness

Least Skin Thickness: 7 mm
Highest Skin Thickness: 99 mm
Average Skin Thickness: 29.15 mm

Since it’s positively skewed - Majority of the women have a lower Skin Thickness.

Distplot 5 : Insulin

Least Insulin level: 14 U/ml
Highest Insulin level: 846 U/ml
Average Insulin level: 155.54 U/ml

Since it’s positively skewed - Majority of the women have a lower insulin level.

Distplot 6 : BMI

Least BMI: 18.20
Highest BMI: 67.10
Average BMI: 32.45

Since it’s positively skewed - Majority of the women have a lower BMI.

Distplot 7 : DiabetesPedigreeFunction

Least value: 0.078
Highest value: 2.420
Average value: 0.471

Since it’s positively skewed - Majority of the women have a lower DiabetesPedigreeFunction value.

Distplot 8 : Age

Least age: 21
Highest age: 33
Average age: 81

Since it’s positively skewed - Majority of the women are younger.

print(df.skew())

Pregnancies                 0.901674
Glucose                     0.532719
BloodPressure               0.137305
SkinThickness               0.822173
Insulin                     3.019084
BMI                         0.598253
DiabetesPedigreeFunction    1.919911
Age                         1.129597
Outcome                     0.635017
dtype: float64

Insulin and DiabetesPedigreeFunction are highly skewed data. So we will normalize by using the log values of the colulmns

df['Insulin'] = np.log(df['Insulin'])
print("skew: ", df['Insulin'].skew())
sns.distplot(df['Insulin'], kde=False)
plt.show()

print('-'*40)

df['DiabetesPedigreeFunction'] = np.log(df['DiabetesPedigreeFunction'])
print("skew: ", df['DiabetesPedigreeFunction'].skew())
sns.distplot(df['DiabetesPedigreeFunction'], kde=False)
plt.show()

skew:  -0.7896950416949489

png

----------------------------------------
skew:  0.11417768826564408

png

Machine Learning model - Decision Tree

# Split the dependent and independent values
x = df.drop("Outcome",axis=1)
y = df["Outcome"]

# pre-processing the data
x = StandardScaler().fit(x).transform(x)

# Split the data for training and testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.7)

print ('Train set:', xtrain.shape,  ytrain.shape)
print ('Test set:', xtest.shape,  ytest.shape)

Train set: (537, 8) (537,)
Test set: (231, 8) (231,)

# Load decision tree classifier model from sklearn and fit the training sets
algo = DecisionTreeClassifier(criterion='entropy',max_depth=5)
algo.fit(xtrain,ytrain)

DecisionTreeClassifier(criterion='entropy', max_depth=5)

plt.figure(figsize=(20,10))
plot_tree(algo,filled=True)
plt.show()

png

ypred = algo.predict(xtest)

# compare predicted values and actual values and find out accuracy
print("Accuracy: ", accuracy_score(ytest,ypred))

Accuracy:  0.7445887445887446

print(confusion_matrix(ytest,ypred))

[[112  38]
 [ 21  60]]

Highest Accuracy: 0.7662337662337663

Share on

Twitter Facebook Google+ LinkedIn

Rakshith Poojary

Binary classification on PIMA Indian Dataset

Machine Learning model - Decision Tree

Highest Accuracy: 0.7662337662337663

Share on

You May Also Enjoy

Binary classification on Titanic Dataset

Classification on Breast Cancer Wisconsin dataset

Model to predict power output of a peaker power plant