Model to predict power output of a peaker power plant
The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. The features are hourly averages of the ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (EP) of the plant.
The power output of a peaker power plant varies with environmental conditions. The business problem is therefore to predict that output as a function of the ambient conditions, which would let the grid operator make economic tradeoffs about how many peaker plants to turn on (or whether to buy expensive power from another grid).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
df=pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\combined_cycle_power_plant.csv", sep=";")
df.shape
(9568, 5)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   temperature        9568 non-null   float64
 1   exhaust_vacuum     9568 non-null   float64
 2   ambient_pressure   9568 non-null   float64
 3   relative_humidity  9568 non-null   float64
 4   energy_output      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB
df=df.drop_duplicates()
df.shape
(9527, 5)
df.isnull().sum()
temperature          0
exhaust_vacuum       0
ambient_pressure     0
relative_humidity    0
energy_output        0
dtype: int64
df.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| temperature | 9.59 | 12.04 | 13.87 | 13.72 | 15.14 | 
| exhaust_vacuum | 38.56 | 42.34 | 45.08 | 54.30 | 49.64 | 
| ambient_pressure | 1017.01 | 1019.72 | 1024.42 | 1017.89 | 1023.78 | 
| relative_humidity | 60.10 | 94.67 | 81.69 | 79.08 | 75.00 | 
| energy_output | 481.30 | 465.36 | 465.48 | 467.05 | 463.58 | 
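# Summary statistics, skewness and a histogram for every column
columns = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity", "energy_output"]
for col in columns:
    print(df[col].describe())
    print(df[col].skew())
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(df[col])
    plt.show()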
count    9527.000000
mean       19.658225
std         7.444397
min         1.810000
25%        13.530000
50%        20.350000
75%        25.710000
max        37.110000
Name: temperature, dtype: float64
-0.1361069178515444

count    9527.000000
mean       54.293421
std        12.686309
min        25.360000
25%        41.740000
50%        52.080000
75%        66.510000
max        81.560000
Name: exhaust_vacuum, dtype: float64
0.1968187812768364

count    9527.000000
mean     1013.237084
std         5.940526
min       992.890000
25%      1009.085000
50%      1012.920000
75%      1017.200000
max      1033.300000
Name: ambient_pressure, dtype: float64
0.273845628693525

count    9527.000000
mean       73.334951
std        14.607513
min        25.560000
25%        63.375000
50%        75.000000
75%        84.850000
max       100.160000
Name: relative_humidity, dtype: float64
-0.43513848893895307

count    9527.00000
mean      454.33591
std        17.03908
min       420.26000
25%       439.75000
50%       451.52000
75%       468.36500
max       495.76000
Name: energy_output, dtype: float64
0.3057905126118896

From the histograms above, we can draw the following conclusions:
(All ambient variables are hourly averages.)
Histogram 1 : Temperature
Lowest Temperature: 1.81°C
Highest Temperature: 37.11°C
Average Temperature: 19.66°C
Slightly negatively skewed (-0.14): most readings lie towards the higher end of the temperature range
Histogram 2 : Exhaust Vacuum
Lowest Exhaust Vacuum: 25.36 cm Hg
Highest Exhaust Vacuum: 81.56 cm Hg
Average Exhaust Vacuum: 54.29 cm Hg
Slightly positively skewed (0.20): most readings lie towards the lower end of the Exhaust Vacuum range
Histogram 3 : Ambient Pressure
Lowest Ambient Pressure: 992.89 millibar
Highest Ambient Pressure: 1033.30 millibar
Average Ambient Pressure: 1013.24 millibar
Slightly positively skewed (0.27): most readings lie towards the lower end of the Ambient Pressure range
Histogram 4 : Relative Humidity
Lowest Relative Humidity: 25.56%
Highest Relative Humidity: 100.16%
Average Relative Humidity: 73.33%
Negatively skewed (-0.44): most readings lie towards the higher end of the Relative Humidity range
Histogram 5 : Energy Output
Lowest Energy Output: 420.26 MW
Highest Energy Output: 495.76 MW
Average Energy Output: 454.34 MW
Slightly positively skewed (0.31): most readings lie towards the lower end of the Energy Output range
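To make the skewness readings above concrete, here is a minimal sketch of how the sign of the skewness relates to the shape of a distribution. It implements the bias-corrected (adjusted Fisher-Pearson) estimator in plain Python, which is assumed here to correspond to what `Series.skew()` reports. The function name `sample_skewness` and the three small lists are made up for illustration and are not taken from the dataset.

```python
import math


def sample_skewness(values):
    """Bias-corrected (adjusted Fisher-Pearson) sample skewness.

    Hypothetical helper for illustration; needs at least three values.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    m3 = sum((v - mean) ** 3 for v in values)  # sum of cubed deviations
    if m2 == 0:
        return 0.0  # constant data has no skew
    return n * math.sqrt(n - 1) / (n - 2) * m3 / m2 ** 1.5


# Made-up readings, not taken from the dataset
right_tailed = [1, 2, 2, 3, 10]            # long tail towards high values
left_tailed = [-v for v in right_tailed]   # mirror image: tail towards low values
symmetric = [1, 2, 3, 4, 5]

print("right-tailed:", sample_skewness(right_tailed))  # positive
print("left-tailed: ", sample_skewness(left_tailed))   # negative
print("symmetric:   ", sample_skewness(symmetric))     # zero
```

A positive value means the bulk of the readings sits at the low end with a tail towards high values; a negative value means the opposite. Values close to zero, such as the one for temperature, indicate a nearly symmetric distribution.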
sns.pairplot(df)
plt.show()

The pairplot of Temperature vs Energy Output (or vice versa) shows a clear negative correlation between the two.
sns.relplot(x="energy_output", y="temperature", data=df)
plt.title('Energy Output vs Temperature', fontsize=20)
plt.show()

The scatter plot indicates that:
As temperature increases, Energy Output decreases.
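The visual impression of a negative relationship can also be quantified with the Pearson correlation coefficient. The sketch below computes it from its definition in plain Python, using only the five rows shown by `df.head()` earlier, so the result is indicative at best; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# The five rows displayed by df.head() above (a tiny sample, for illustration only)
temperature = [9.59, 12.04, 13.87, 13.72, 15.14]
energy_output = [481.30, 465.36, 465.48, 467.05, 463.58]

r = pearson_r(temperature, energy_output)
print(f"Pearson r on the five sample rows: {r:.3f}")
```

On the full dataset, `df.corr()` returns the same coefficient for every pair of columns at once.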
Machine Learning model - Linear Regression
# Split the dependent and independent values
x = df.drop("energy_output", axis=1)
y = df["energy_output"]
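# Split the data for training and testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.7)
# Standardize the features; fit the scaler on the training set only so that
# no information from the test set leaks into the pre-processing step
scaler = StandardScaler().fit(xtrain)
xtrain = scaler.transform(xtrain)
xtest = scaler.transform(xtest)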
print ('Train set:', xtrain.shape,  ytrain.shape)
print ('Test set:', xtest.shape,  ytest.shape)
Train set: (6668, 4) (6668,)
Test set: (2859, 4) (2859,)
# Load linear regression model from sklearn and fit the training sets
algo=LinearRegression().fit(xtrain, ytrain)
# find out the predictions for the testing set
ypred = algo.predict(xtest)
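As a rough illustration of what an ordinary least squares fit does, the sketch below fits a straight line with a single feature (temperature) in closed form. It is not the four-feature model fitted above: it uses only the five rows shown by `df.head()`, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The five rows displayed by df.head() above (a tiny sample, for illustration only)
temperature = [9.59, 12.04, 13.87, 13.72, 15.14]
energy_output = [481.30, 465.36, 465.48, 467.05, 463.58]

slope, intercept = fit_line(temperature, energy_output)
print(f"energy_output ~ {intercept:.2f} + ({slope:.2f}) * temperature")
```

The slope is the estimated change in energy output (MW) per degree of temperature; its sign matches the negative relationship seen in the plots.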
# compare predicted values with actual values using MAE and the R^2 score
print("Mean Absolute Error: ", mean_absolute_error(ytest,ypred))
print("R^2 score: ", r2_score(ytest,ypred))
Mean Absolute Error:  3.602932416142143
R^2 score:  0.9306277586738139
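The second figure printed above is the coefficient of determination (R²), which measures the share of the variance in the target that the model explains rather than a classification-style accuracy. The sketch below computes both metrics from their definitions in plain Python; the helper names and the four actual/predicted values are hypothetical and are not results from the model.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_manual(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot


# Hypothetical values in MW, chosen only to make the arithmetic easy to follow
actual = [480.0, 465.0, 470.0, 455.0]
predicted = [478.0, 467.0, 468.0, 457.0]

print("MAE:", mean_absolute_error_manual(actual, predicted))
print("R^2:", r2_manual(actual, predicted))
```

A model that always predicts the mean of the actual values scores an R² of 0 and a perfect model scores 1, so a value around 0.93 means that roughly 93% of the variance in energy output is explained by the four ambient variables. The MAE is in the same unit as the target, here MW.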
 
  
  
