Classification on Breast Cancer Wisconsin dataset
Question 1
Use the Breast Cancer Wisconsin data set from the UCI Machine Learning Repository to build a classification model.
The data set has 16 missing values, which you will have to replace with a suitable value (mean, mode, median, or 0). You may also have to drop some less important features before developing a prediction model.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score
df = pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\breast-cancer-wisconsin.csv")
df.shape
(699, 11)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sample code number 699 non-null int64
1 Clump Thickness 699 non-null int64
2 Uniformity of Cell Size 699 non-null int64
3 Uniformity of Cell Shape 699 non-null int64
4 Marginal Adhesion 699 non-null int64
5 Single Epithelial Cell Size 699 non-null int64
6 Bare Nuclei 699 non-null object
7 Bland Chromatin 699 non-null int64
8 Normal Nucleoli 699 non-null int64
9 Mitoses 699 non-null int64
10 Class 699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
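The “Bare Nuclei” column has the object dtype, which means it contains at least one non-numeric entry. In this file the missing values are recorded as '?', so pandas does not count them as null at this stage.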
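One way to confirm what is hiding in an object column is to try parsing it as numbers and look at what fails. The following is a minimal sketch on a made-up Series (the values and the names `bare_nuclei`, `as_numbers` and `invalid_entries` are illustrative assumptions, not taken from the data file):

```python
import pandas as pd

# Toy stand-in for the "Bare Nuclei" column; the values are made up for illustration.
bare_nuclei = pd.Series(["1", "10", "?", "4", "?"], dtype=object, name="Bare Nuclei")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_numbers = pd.to_numeric(bare_nuclei, errors="coerce")

# Entries that were present in the raw column but failed to parse.
invalid_entries = bare_nuclei[as_numbers.isna() & bare_nuclei.notna()]

print(invalid_entries.value_counts())
print("Invalid entries:", len(invalid_entries))
```

Run against the real column, the same check should list every placeholder string that keeps the column from being numeric.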
df = df.drop_duplicates()
df.shape
(691, 11)
There were 8 duplicate rows, so the row count drops from 699 to 691.
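The duplicate count can also be read directly before dropping anything. A small sketch on a made-up frame (the `toy` data is hypothetical) shows that `duplicated()` flags every row that repeats an earlier one, which matches the number of rows `drop_duplicates()` removes:

```python
import pandas as pd

# Made-up rows; the second and fourth rows are exact copies of the first.
toy = pd.DataFrame({
    "Sample code number": [100, 100, 200, 100],
    "Clump Thickness": [5, 5, 3, 5],
})

# duplicated() marks a row True when an identical row appeared earlier.
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print("Duplicates:", n_duplicates)
print("Rows removed:", len(toy) - len(deduplicated))
```

Note that rows only count as duplicates here if every column matches, including the identifier column.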
df.isnull().sum()
Sample code number 0
Clump Thickness 0
Uniformity of Cell Size 0
Uniformity of Cell Shape 0
Marginal Adhesion 0
Single Epithelial Cell Size 0
Bare Nuclei 0
Bland Chromatin 0
Normal Nucleoli 0
Mitoses 0
Class 0
dtype: int64
df = df.drop("Sample code number", axis=1)
df.head().T
Column | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Clump Thickness | 5 | 5 | 3 | 6 | 4 |
Uniformity of Cell Size | 1 | 4 | 1 | 8 | 1 |
Uniformity of Cell Shape | 1 | 4 | 1 | 8 | 1 |
Marginal Adhesion | 1 | 5 | 1 | 1 | 3 |
Single Epithelial Cell Size | 2 | 7 | 2 | 3 | 2 |
Bare Nuclei | 1 | 10 | 2 | 4 | 1 |
Bland Chromatin | 3 | 3 | 3 | 3 | 3 |
Normal Nucleoli | 1 | 2 | 1 | 7 | 1 |
Mitoses | 1 | 1 | 1 | 1 | 1 |
Class | 2 | 2 | 2 | 2 | 2 |
The “Sample code number” column is dropped because it is only an identifier and is not useful for prediction.
# Replace all '?' placeholders with NaN
df = df.replace({'?': np.nan})
# Print the count of null values
print("Null values: ", df["Bare Nuclei"].isnull().sum(), "\n")
# Convert the object dtype to nullable Int64 so describe() can compute the mean
df["Bare Nuclei"] = df["Bare Nuclei"].astype(float).astype('Int64')
print(df["Bare Nuclei"].describe())
# Replace the null values with the mean, truncated to an integer by int()
fill_value = int(df["Bare Nuclei"].mean())
print("\nReplacing null values with integer value of mean: ", fill_value)
df["Bare Nuclei"] = df["Bare Nuclei"].fillna(fill_value)
Null values: 16
count 675.000000
mean 3.537778
std 3.637871
min 1.000000
25% 1.000000
50% 1.000000
75% 6.000000
max 10.000000
Name: Bare Nuclei, dtype: float64
Replacing null values with integer value of mean: 3
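The summary above shows a median of 1 against a mean of about 3.54, so the column is skewed toward low values, and `int()` truncates the mean to 3 rather than rounding it. The assignment also allows the median or mode as fill values. The sketch below compares the two choices on made-up numbers (the Series `col` is hypothetical and only meant to mimic that skew):

```python
import numpy as np
import pandas as pd

# Made-up values on the 1-10 scale, with two missing entries.
col = pd.Series([1, 1, 1, 2, 10, 10, np.nan, np.nan], name="Bare Nuclei")

# mean() and median() skip NaN by default; int() truncates toward zero.
mean_fill = int(col.mean())
median_fill = int(col.median())

filled_mean = col.fillna(mean_fill)
filled_median = col.fillna(median_fill)

print("mean fill:", mean_fill, "median fill:", median_fill)
```

For a skewed column the two fill values can differ noticeably, so it may be worth checking whether the choice changes the model's results.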
Machine Learning Model - Logistic Regression
# Separate the independent variables (features) from the dependent variable (target)
x = df.drop("Class", axis=1)
y = df["Class"]
# Pre-process the data: standardize each feature to zero mean and unit variance
x = StandardScaler().fit_transform(x)
# Split the data for training and testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8)
print('Train set:', xtrain.shape, ytrain.shape)
print('Test set:', xtest.shape, ytest.shape)
Train set: (552, 9) (552,)
Test set: (139, 9) (139,)
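In the cells above, the scaler is fitted on all rows before the split, so the test rows influence the preprocessing, and the split changes on every run because no `random_state` is set. The following is a minimal sketch of one way to avoid both, using scikit-learn's `make_pipeline` on synthetic stand-in data. The arrays `x_demo` and `y_demo`, the seed values and the class proportions are assumptions for illustration, and the results printed in this notebook were not produced this way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned table: 200 rows, 9 features, labels 2 and 4.
rng = np.random.default_rng(0)
y_demo = rng.choice([2, 4], size=200, p=[0.65, 0.35])
x_demo = rng.integers(1, 11, size=(200, 9)).astype(float)
x_demo[y_demo == 4] += 3.0  # shift one class so there is something to learn

# Split first; stratify keeps the class ratio similar in both parts,
# and random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when transforming the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x_train, y_train)

test_accuracy = model.score(x_test, y_test)
print("Train set:", x_train.shape, "Test set:", x_test.shape)
print("Test accuracy:", test_accuracy)
```

The same pattern applies to imputation: a fill value learned inside the pipeline would be computed from the training rows alone.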
# Load logistic regression model from sklearn and fit the training sets
algo = LogisticRegression().fit(xtrain,ytrain)
# find out the predictions for the testing set
ypred = algo.predict(xtest)
# compare predicted values and actual values and find out accuracy
print("Mean Absolute Error: ", mean_absolute_error(ytest,ypred))
print("Accuracy: ", accuracy_score(ytest,ypred))
Mean Absolute Error: 0.05755395683453238
Accuracy: 0.9712230215827338
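The class labels in this data set are 2 (benign) and 4 (malignant), so every misclassification is off by exactly 2 and the mean absolute error is simply 2 × (1 − accuracy). The printed values agree: 2 × (1 − 0.97122) ≈ 0.05755. The error figure therefore adds no information beyond the accuracy, whereas a confusion matrix shows which class the mistakes fall in. Here is a small sketch on made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Made-up labels for illustration: 2 = benign, 4 = malignant.
y_true = np.array([2, 2, 2, 2, 4, 4, 4, 2, 4, 2])
y_pred = np.array([2, 2, 2, 4, 4, 4, 2, 2, 4, 2])

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# Every mistake is off by |4 - 2| = 2, so MAE equals 2 * (1 - accuracy).
print("accuracy:", acc, "MAE:", mae)

# Rows are true classes and columns are predicted classes, in the order of labels.
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
print(cm)
```

In the matrix, the off-diagonal entry in the second row counts malignant samples predicted as benign, which is usually the more costly kind of error in a screening setting.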