Classification on Breast Cancer Wisconsin dataset

Question 1
Use the Breast Cancer Wisconsin data set from the UCI Machine Learning Repository to build a classification model.

The data set has 16 missing values, which you will have to replace with a suitable value (mean, mode, median, or 0). You may also have to drop some less important features before developing a prediction model.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score
df=pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\breast-cancer-wisconsin.csv")
df.shape
(699, 11)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB

The “Bare Nuclei” column is of object dtype, which means it holds non-numeric entries: the missing values are stored as the string '?' rather than as nulls.
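A quick way to see which entries keep a column from being numeric is to coerce it and inspect what fails to convert. This is a minimal sketch on a small made-up Series, not the real file; the sample values and the names `raw`, `coerced`, and `bad_values` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the "Bare Nuclei" column (values are made up for illustration)
raw = pd.Series(["1", "10", "?", "4", "?", "2"], name="Bare Nuclei")

# Entries that cannot be parsed as numbers become NaN
coerced = pd.to_numeric(raw, errors="coerce")

# Show which original strings failed to convert, and how often
bad_values = raw[coerced.isna()].value_counts()
print(bad_values)
```

Run against the real column, the same check would list every placeholder string in use, which is safer than assuming there is only one.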

df=df.drop_duplicates()
df.shape
(691, 11)

drop_duplicates() removed 8 duplicate rows (699 rows down to 691).
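The duplicate count can also be checked directly before dropping anything. A minimal sketch on a made-up frame (the `toy` data and the variable names are illustrative, not taken from the data set):

```python
import pandas as pd

# Toy frame with one repeated row (made-up values)
toy = pd.DataFrame({"id": [1, 2, 2, 3], "Class": [2, 4, 4, 2]})

# duplicated() flags every repeat of an earlier row; the first occurrence is kept
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_duplicates, deduped.shape)
```

The difference in row count before and after `drop_duplicates()` should equal the number of flagged rows.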

df.isnull().sum()
Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64
df = df.drop("Sample code number", axis=1)
df.head().T
                             0   1  2  3  4
Clump Thickness              5   5  3  6  4
Uniformity of Cell Size      1   4  1  8  1
Uniformity of Cell Shape     1   4  1  8  1
Marginal Adhesion            1   5  1  1  3
Single Epithelial Cell Size  2   7  2  3  2
Bare Nuclei                  1  10  2  4  1
Bland Chromatin              3   3  3  3  3
Normal Nucleoli              1   2  1  7  1
Mitoses                      1   1  1  1  1
Class                        2   2  2  2  2

The “Sample code number” column is dropped because it is only an identifier and is not useful for prediction.

# Replace all '?' with NaN
df = df.replace({'?':np.nan})

# print the count of null values
print("Null values: ", df["Bare Nuclei"].isnull().sum() ,"\n")

# Convert object dtype to Int64 so we can perform describe() and find out the mean value
df["Bare Nuclei"] = df["Bare Nuclei"].astype(float).astype('Int64')
print(df["Bare Nuclei"].describe())

# Replace the null values with the mean, truncated to an integer
fill_value = int(df["Bare Nuclei"].mean())
print("\nReplacing null values with integer value of mean: ", fill_value)
df["Bare Nuclei"] = df["Bare Nuclei"].fillna(fill_value)
Null values:  16 

count    675.000000
mean       3.537778
std        3.637871
min        1.000000
25%        1.000000
50%        1.000000
75%        6.000000
max       10.000000
Name: Bare Nuclei, dtype: float64

Replacing null values with integer value of mean:  3
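The question allows mean, median, mode, or 0 as the fill value, and the summary above shows why the choice matters: the column is skewed, with a median of 1 but a mean of about 3.54. The sketch below compares the candidates on a made-up column; the values in `col` and the names `candidates` and `filled` are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Toy column with missing entries (made-up values, skewed toward 1)
col = pd.Series([1, 1, 1, 2, 10, np.nan, 1, 8, np.nan, 1], name="Bare Nuclei")

# Candidate fill values named in the question
candidates = {
    "mean": col.mean(),
    "median": col.median(),
    "mode": col.mode().iloc[0],
    "zero": 0,
}

# Fill the gaps with each candidate and keep the results side by side
filled = {name: col.fillna(value) for name, value in candidates.items()}

for name, value in candidates.items():
    print(f"{name:>6}: fill value = {value}")
```

For a skewed column like this one, the median or mode is often a more typical fill value than the mean, though which is best depends on the model and the data.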

Machine Learning Model - Logistic Regression

# Separate the features (x) from the target (y)
x = df.drop("Class", axis=1)
y = df["Class"]
# Standardize the features to zero mean and unit variance
x = StandardScaler().fit_transform(x)
# Split the data for training (80%) and testing (20%)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8)
print('Train set:', xtrain.shape, ytrain.shape)
print('Test set:', xtest.shape, ytest.shape)
Train set: (552, 9) (552,)
Test set: (139, 9) (139,)
# Fit a logistic regression model on the training set
algo = LogisticRegression().fit(xtrain,ytrain)
# Predict the classes of the test set
ypred = algo.predict(xtest)

# Compare the predictions with the true labels
print("Mean Absolute Error: ", mean_absolute_error(ytest,ypred))
print("Accuracy: ", accuracy_score(ytest,ypred))
Mean Absolute Error:  0.05755395683453238
Accuracy:  0.9712230215827338
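Because the classes are coded as 2 (benign) and 4 (malignant), every wrong prediction is off by exactly 2, so the mean absolute error here is just twice the error rate: 2 × (1 − 0.9712) ≈ 0.0576, matching the value printed above. A confusion matrix is usually more informative for a classifier. The sketch below shows both on made-up labels; `y_true` and `y_pred` are illustrative and are not the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Made-up labels using the dataset's coding: 2 = benign, 4 = malignant
y_true = np.array([2, 2, 4, 4, 2, 4, 2, 2, 4, 2])
y_pred = np.array([2, 2, 4, 2, 2, 4, 2, 4, 4, 2])

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# Every wrong prediction is off by |4 - 2| = 2, so MAE is twice the error rate
print("Accuracy:", acc)
print("MAE:", mae, "| 2 * (1 - accuracy):", 2 * (1 - acc))

# Rows are true classes, columns are predicted classes, in the order [2, 4]
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
print(cm)
```

The off-diagonal cells separate malignant cases predicted as benign from benign cases predicted as malignant, a distinction that a single accuracy figure hides.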

Highest accuracy observed: 0.9784172661870504 (train_test_split is not seeded, so the score varies from run to run)
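Two details of the workflow above affect how far a single accuracy figure can be trusted: the scaler is fitted on all rows before the split, so the test rows influence the scaling statistics, and the split is unseeded. One common alternative, sketched here under stated assumptions, is to put the scaler inside a pipeline and score it with seeded cross-validation. The data comes from `make_classification` as a synthetic stand-in for the cleaned frame, and the fold count and seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 9 features, like the cleaned frame (not the real data)
X, y = make_classification(n_samples=600, n_features=9, n_informative=5, random_state=0)

# The scaler sits inside the pipeline, so it is re-fitted on each training fold only
model = make_pipeline(StandardScaler(), LogisticRegression())

# Seeded, stratified folds make the scores repeatable from run to run
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

Reporting the mean and spread across folds gives a steadier estimate than the best score from repeated random splits.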