Today we will practice on a public dataset from the UCI Machine Learning Repository.

Problem

Abstract:

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was set to work with full load.

Data Set Information:

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.
A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the Steam Turbine, the other three ambient variables affect the GT performance.
For comparability with our baseline studies, and to allow 5x2 fold statistical tests be carried out, we provide the data shuffled five times. For each shuffling 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.
We provide the data both in .ods and in .xlsx formats.

Attribute Information:

Features consist of hourly average ambient variables
- Temperature (T) in the range 1.81°C to 37.11°C,
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar,
- Relative Humidity (RH) in the range 25.56% to 100.16%,
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg,
- Net hourly electrical energy output (EP) 420.26-495.76 MW.
The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.

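1. Preparation

The first step is setting up the environment (I will assume you have already done that).

Important: check that the libraries used below (matplotlib, numpy, pandas and scikit-learn) are installed.

If any of them is missing, install it with pip directly from the console.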
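As a quick way to run that check, the sketch below asks Python which of the libraries it can find. The helper name `missing_packages` is made up for this example; the list of import names is taken from the import statements used later in this post (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names of the libraries this walkthrough uses.
# Note: the scikit-learn package is imported under the name "sklearn".
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

Anything reported as missing can then be installed with pip, for example `pip install scikit-learn` for the `sklearn` import name.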

Alright, let's get to the main topic:

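First, we build a linear regression model from the data above, namely:

$$PE = \theta_0 + \theta_1 \cdot AT + \theta_2 \cdot V + \theta_3 \cdot AP + \theta_4 \cdot RH$$

What we need to learn is θ, that is, the five parameters $\theta_0, \dots, \theta_4$.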
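To make the role of θ concrete, here is a minimal sketch of how a linear model turns one sample into a prediction. The function name `predict_pe` and all numbers are made up for illustration; they are not values learned from the CCPP data.

```python
def predict_pe(theta, features):
    """Evaluate theta_0 + theta_1 * x_1 + ... + theta_n * x_n for one sample."""
    intercept, weights = theta[0], theta[1:]
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return intercept + sum(w * x for w, x in zip(weights, features))


# Made-up parameters (theta_0 .. theta_4) and a made-up sample (AT, V, AP, RH).
theta = [1.0, 2.0, 3.0, 4.0, 5.0]
sample = [1.0, 1.0, 1.0, 1.0]
print(predict_pe(theta, sample))  # 15.0
```

Training the model means choosing θ so that these predictions are as close as possible to the measured PE values.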

After downloading the data you will find an xlsx file inside.

I first opened it in Excel and then saved it in CSV format.

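First, declare the libraries to import:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

Then read the data with pandas:

# Replace the placeholder with the path to your own CSV file
data = pd.read_csv('your own path')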

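Displaying the data (for example with data.head()) should show the first rows of the table, which means pandas read the data successfully!

Next, let's look at the dimensions of the data: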

print(data.shape)

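The result is: (9568, 5)

This means we have 9568 samples, each with 5 columns.

Now we prepare the features X. We use the four columns AT (the temperature, T in the description above), V, AP and RH as the sample features.

Then we prepare the sample output y. We use PE (the energy output, EP in the description above) as the sample output.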

X = data[['AT', 'V', 'AP', 'RH']]
X.head()
#print(X.head())
y = data[['PE']]
y.head()
#print(y.head())

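Runs perfectly!

2. Splitting the training set and the test set

We split the X and y samples into two parts: a training set and a test set. Here we use a simple hold-out split (simple cross-validation).

Tip: if you try to use train_test_split from the sklearn.cross_validation module, you will find that the module cannot be imported. It was deprecated in version 0.18 and later removed, and all the refactored classes and functions were moved to the model_selection module. When something like this breaks, check the release notes.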

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

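The test set holds 2392 of the 9568 samples: 2392/9568 = 0.25

Runs perfectly! You can see that 25% of the samples are used as the test set and 75% as the training set, which is the default split of train_test_split when no test_size is given.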
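The idea behind such a split can be sketched in a few lines of plain Python: shuffle the sample indices, then cut off a quarter of them for testing. This is only an illustration under the assumption of a 25% test fraction. The helper `simple_split` is hypothetical, it is not scikit-learn's implementation, and its shuffled order will not match what train_test_split produces for the same seed.

```python
import math
import random


def simple_split(n_samples, test_fraction=0.25, seed=1):
    """Shuffle the indices 0..n_samples-1 and split them into train/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(9568)
print(len(train_idx), len(test_idx))  # 7176 2392
```

Every sample lands in exactly one of the two sets, so the model is later evaluated only on rows it never saw during fitting.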

3. Fitting a linear model with scikit-learn

Here we fit the model with the least squares method (the one we all "learned in second grade").

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)

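This gives the final model: the printed linreg.intercept_ is θ_0, and the four entries of linreg.coef_ are the coefficients θ_1 to θ_4 of AT, V, AP and RH, in that order.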
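For intuition about what an ordinary least squares fit computes, the sketch below solves a small least squares problem directly with NumPy. The data is synthetic and noise-free, and the "true" parameters are made up, so this is not the CCPP result; scikit-learn's internals may also differ in detail from this direct call.

```python
import numpy as np

# Synthetic, noise-free data with made-up parameters (not the CCPP data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
true_intercept = 3.0
true_coef = np.array([-2.0, 0.5, 0.1, -0.3])
y_demo = true_intercept + X_demo @ true_coef

# Prepend a column of ones so the intercept is learned as theta_0.
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least squares: find theta minimising the squared error ||A @ theta - y||^2.
theta, residuals, rank, singular_values = np.linalg.lstsq(A, y_demo, rcond=None)
print(theta)  # theta[0] is the intercept, theta[1:] are the coefficients
```

Because the synthetic targets contain no noise, the recovered θ should match the made-up parameters up to floating-point error. On real data the fit only minimises the squared error and does not reproduce the targets exactly.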

4. Model evaluation

For linear regression, we judge how good the model is by its Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) on the test set.

Further reading on variance, standard deviation, mean squared error, root mean square (RMS) and root mean squared error (RMSE): https://blog.csdn.net/zzb714121/article/details/125339827

# Predict on the test set with the fitted model
y_pred = linreg.predict(X_test)
from sklearn import metrics
# Compute the MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
# Compute the RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
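The two metrics are simple enough to compute by hand, which can help when checking what the library call returns. The sketch below uses plain Python and made-up numbers; the function names `mse` and `rmse` are chosen for this example only.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up values: three exact predictions and one that is off by 2.
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.0, 2.0, 3.0, 6.0]
print(mse(y_true_demo, y_pred_demo), rmse(y_true_demo, y_pred_demo))  # 1.0 1.0
```

Since PE is measured in MW, the RMSE is also in MW, which makes it easier to interpret than the MSE.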

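5. Improving the model

When I drop one feature, RH, and run the model again, the fit is worse than with RH included: both the MSE and the RMSE increase.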
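The kind of comparison described here can be sketched on synthetic data. Everything below is made up for illustration: the data is random rather than the CCPP data, the last column merely plays the role of RH, and `fit_and_mse` is a hypothetical helper built on NumPy's least squares solver instead of the LinearRegression object used in this post.

```python
import numpy as np


def fit_and_mse(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept on the training data, return the test MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    theta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((y_test - A_test @ theta) ** 2))


# Synthetic data in which every one of the four features really matters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5, 3.0])

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

mse_all = fit_and_mse(X_tr, y_tr, X_te, y_te)
mse_dropped = fit_and_mse(X_tr[:, :3], y_tr, X_te[:, :3], y_te)  # last column removed
print("all features:", mse_all)
print("one feature dropped:", mse_dropped)
```

In this constructed example the dropped column carries real information, so the test error goes up once it is removed. On real data the effect of dropping a feature has to be measured, as was done with RH here.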

We can use cross-validation to evaluate the model more reliably while we keep improving it.

Here we use 10-fold cross-validation:

X = data[['AT', 'V', 'AP', 'RH']]
y = data[['PE']]
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(linreg, X, y, cv=10)
# Compute the MSE with scikit-learn
print("MSE:", metrics.mean_squared_error(y, predicted))
# Compute the RMSE with scikit-learn
print("RMSE:", np.sqrt(metrics.mean_squared_error(y, predicted)))
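To see what 10-fold cross-validation does with the 9568 samples, the sketch below generates consecutive, unshuffled folds in plain Python. The generator `kfold_indices` is a hypothetical helper written for this example. It follows the usual convention that the first n_samples % n_splits folds get one extra sample, and it is an illustration of the idea, not scikit-learn's code.

```python
def kfold_indices(n_samples, n_splits=10):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds hold one more sample than the others.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


fold_sizes = [len(test) for _, test in kfold_indices(9568, 10)]
print(fold_sizes)  # [957, 957, 957, 957, 957, 957, 957, 957, 956, 956]
```

Each sample is used for testing exactly once, by a model that was fitted on the other nine folds. This is why a cross-validated prediction is available for every row of the data.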

Plot the results to inspect them:

fig, ax = plt.subplots()
ax.scatter(y, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

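This plot shows the predicted values against the measured values. The closer a point lies to the diagonal line y = x, the lower its prediction error.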
