Model to predict power output of a peaker power plant

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), while the plant was set to work at full load. The features are hourly averages of four ambient variables: Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V). The target is the net hourly electrical energy output (EP) of the plant.

The power output of a peaker power plant varies with environmental conditions. The business problem is therefore to predict that output as a function of those conditions, which would let the grid operator make economic tradeoffs about how many peaker plants to turn on, or whether to buy expensive power from another grid.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
df=pd.read_csv(r"D:\Learning\DLithe-ML\Assignment\combined_cycle_power_plant.csv", sep=";")
df.shape
(9568, 5)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   temperature        9568 non-null   float64
 1   exhaust_vacuum     9568 non-null   float64
 2   ambient_pressure   9568 non-null   float64
 3   relative_humidity  9568 non-null   float64
 4   energy_output      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB
df=df.drop_duplicates()
df.shape
(9527, 5)
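
The two shapes imply that 9568 - 9527 = 41 rows were exact duplicates. Before dropping them it can help to count them explicitly. The following is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) containing one exact duplicate row
toy = pd.DataFrame({
    "temperature": [9.59, 12.04, 9.59],
    "energy_output": [481.30, 465.36, 481.30],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped.shape)
```

On the real data, `df.duplicated().sum()` run before `drop_duplicates()` should report the same count as the difference between the two shapes.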
df.isnull().sum()
temperature          0
exhaust_vacuum       0
ambient_pressure     0
relative_humidity    0
energy_output        0
dtype: int64
df.head().T
                         0        1        2        3        4
temperature           9.59    12.04    13.87    13.72    15.14
exhaust_vacuum       38.56    42.34    45.08    54.30    49.64
ambient_pressure   1017.01  1019.72  1024.42  1017.89  1023.78
relative_humidity    60.10    94.67    81.69    79.08    75.00
energy_output       481.30   465.36   465.48   467.05   463.58
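# Summary statistics, skewness and a histogram for every column
columns = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity", "energy_output"]
for col in columns:
    print(df[col].describe())
    print(df[col].skew())
    # histplot replaces the deprecated distplot(..., kde=False)
    sns.histplot(df[col])
    plt.show()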
count    9527.000000
mean       19.658225
std         7.444397
min         1.810000
25%        13.530000
50%        20.350000
75%        25.710000
max        37.110000
Name: temperature, dtype: float64
-0.1361069178515444

[Figure: histogram of temperature]

count    9527.000000
mean       54.293421
std        12.686309
min        25.360000
25%        41.740000
50%        52.080000
75%        66.510000
max        81.560000
Name: exhaust_vacuum, dtype: float64
0.1968187812768364

[Figure: histogram of exhaust_vacuum]

count    9527.000000
mean     1013.237084
std         5.940526
min       992.890000
25%      1009.085000
50%      1012.920000
75%      1017.200000
max      1033.300000
Name: ambient_pressure, dtype: float64
0.273845628693525

[Figure: histogram of ambient_pressure]

count    9527.000000
mean       73.334951
std        14.607513
min        25.560000
25%        63.375000
50%        75.000000
75%        84.850000
max       100.160000
Name: relative_humidity, dtype: float64
-0.43513848893895307

[Figure: histogram of relative_humidity]

count    9527.00000
mean      454.33591
std        17.03908
min       420.26000
25%       439.75000
50%       451.52000
75%       468.36500
max       495.76000
Name: energy_output, dtype: float64
0.3057905126118896

[Figure: histogram of energy_output]

The distribution plots above support the following observations.
(All ambient variables are hourly averages.)

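Distplot 1: Temperature

Minimum Temperature: 1.81°C
Maximum Temperature: 37.11°C
Mean Temperature: 19.66°C

The skewness is slightly negative (-0.14): the tail extends toward lower temperatures and the median (20.35°C) lies above the mean, so most hourly readings sit toward the higher end of the range.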

Distplot 2: Exhaust Vacuum

Minimum Exhaust Vacuum: 25.36 cm Hg
Maximum Exhaust Vacuum: 81.56 cm Hg
Mean Exhaust Vacuum: 54.29 cm Hg

The skewness is slightly positive (0.20): the tail extends toward higher values and the median (52.08 cm Hg) lies below the mean, so most hourly readings sit toward the lower end of the range.

Distplot 3: Ambient Pressure

Minimum Ambient Pressure: 992.89 millibar
Maximum Ambient Pressure: 1033.30 millibar
Mean Ambient Pressure: 1013.24 millibar

The skewness is slightly positive (0.27): the tail extends toward higher pressures and the median (1012.92 millibar) lies below the mean, so most hourly readings sit toward the lower end of the range.

Distplot 4: Relative Humidity

Minimum Relative Humidity: 25.56%
Maximum Relative Humidity: 100.16%
Mean Relative Humidity: 73.33%

The skewness is negative (-0.44): the tail extends toward lower humidity and the median (75.00%) lies above the mean, so most hourly readings sit toward the higher end of the range.

Distplot 5: Energy Output

Minimum Energy Output: 420.26 MW
Maximum Energy Output: 495.76 MW
Mean Energy Output: 454.34 MW

The skewness is positive (0.31): the tail extends toward higher output and the median (451.52 MW) lies below the mean, so most hourly readings sit toward the lower end of the range.
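
To make the sign convention for skewness concrete, here is a small sketch with two hypothetical samples (the numbers are chosen only for illustration). A long tail toward high values gives a positive skew and pulls the mean above the median; mirroring the sample flips both.

```python
import pandas as pd

# Hypothetical samples, chosen only to illustrate the sign of the skewness
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 12])  # long tail toward high values
left_tailed = -right_tailed                           # mirror image: long tail toward low values

for name, s in [("right_tailed", right_tailed), ("left_tailed", left_tailed)]:
    # Print skewness next to mean and median to show how the tail shifts the mean
    print(name, "skew:", round(s.skew(), 3), "mean:", s.mean(), "median:", s.median())
```

The same comparison of mean and median can be read directly from the `describe()` output for each column above.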

sns.pairplot(df)
plt.show()

[Figure: pairplot of all five variables]

The Temperature vs Energy Output panel of the pairplot shows a clear negative relationship between the two variables.

sns.relplot(x="energy_output", y="temperature", data=df)
plt.title('Energy Output vs Temperature', fontsize=20)
plt.show()

[Figure: Energy Output vs Temperature scatter plot]

The scatter plot suggests that:
As temperature increases, Energy Output decreases.
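
A visual impression like this can be backed by a number. The sketch below computes the Pearson correlation coefficient on synthetic data; the slope, intercept and noise level are arbitrary choices for illustration, not estimates from the power plant dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: output falls as temperature rises, plus noise.
# The slope, intercept and noise level are arbitrary illustrative choices.
temperature = rng.uniform(2, 37, size=500)
energy_output = 495 - 2.0 * temperature + rng.normal(0, 5, size=500)
demo = pd.DataFrame({"temperature": temperature, "energy_output": energy_output})

# Pearson correlation coefficient: between -1 and 1, negative when one
# variable tends to fall as the other rises
r = demo["temperature"].corr(demo["energy_output"])
print(round(r, 3))
```

On the real frame, `df.corr()["energy_output"]` would give the correlation of every feature with the target in one call.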

Machine Learning model - Linear Regression

# Split the dependent and independent values
x = df.drop("energy_output", axis=1)
y = df["energy_output"]
# Split the data for training and testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.7)
# Pre-process the data: fit the scaler on the training set only,
# so that test-set statistics do not leak into training
scaler = StandardScaler().fit(xtrain)
xtrain = scaler.transform(xtrain)
xtest = scaler.transform(xtest)
print('Train set:', xtrain.shape, ytrain.shape)
print('Test set:', xtest.shape, ytest.shape)
Train set: (6668, 4) (6668,)
Test set: (2859, 4) (2859,)
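
The shapes printed above follow from how a fractional `train_size` is applied: the training count is 70% of the 9527 rows rounded down, and the test set takes the remainder. The sketch below reproduces that arithmetic on placeholder arrays. The `random_state` argument is an addition that the run above does not use; it only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 9527                    # rows left after drop_duplicates()
x_demo = np.zeros((n_samples, 4))   # placeholder features, only the shape matters here
y_demo = np.zeros(n_samples)

# random_state is not used in the run above; it makes the split repeatable
xtr, xte, ytr, yte = train_test_split(x_demo, y_demo, train_size=0.7, random_state=42)

print(xtr.shape, xte.shape)
print(int(np.floor(0.7 * n_samples)), n_samples - int(np.floor(0.7 * n_samples)))
```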
# Load linear regression model from sklearn and fit the training sets
algo=LinearRegression().fit(xtrain, ytrain)
# find out the predictions for the testing set
ypred = algo.predict(xtest)
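
Once a linear model is fitted, its coefficients show how each feature moves the prediction. The sketch below uses synthetic data with an arbitrary, made-up relationship (it is not the power plant data) to show how to read `coef_` and `intercept_` when the inputs are standardized.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in with two features; the true relationship is an arbitrary illustrative choice
features = rng.normal(size=(1000, 2))
target = 450 - 15 * features[:, 0] + 3 * features[:, 1] + rng.normal(0, 1, size=1000)

scaled = StandardScaler().fit_transform(features)
model = LinearRegression().fit(scaled, target)

# With standardized inputs, each coefficient is the change in the target
# per one standard deviation of that feature
print(model.coef_)
# With centered inputs, the intercept equals the mean of the target
print(model.intercept_)
```

For the model in this post, pairing `algo.coef_` with the four feature column names would show which ambient variable has the largest effect per standard deviation.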

# compare predicted values and actual values using MAE and the R2 score
# (R2 is the proportion of variance explained, not a classification accuracy)

print("Mean Absolute Error: ", mean_absolute_error(ytest,ypred))
print("R2 score: ", r2_score(ytest,ypred))
Mean Absolute Error:  3.602932416142143
R2 score:  0.9306277586738139
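
The second figure reported above is the coefficient of determination R², the share of the variance in the target that the model explains; it is not a classification accuracy. The mean absolute error is in the units of the target, so the model is off by about 3.6 MW on average. The sketch below computes both metrics by hand on a few hypothetical values (made up for illustration) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted values, for illustration only
actual = np.array([480.0, 465.0, 450.0, 440.0, 470.0])
predicted = np.array([478.0, 468.0, 452.0, 441.0, 466.0])

# MAE: average size of the error, in the units of the target
mae = np.mean(np.abs(actual - predicted))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
print(mean_absolute_error(actual, predicted), r2_score(actual, predicted))
```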