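Disclaimer

The code in this article is adapted from the Kaggle author DIPAMVASANI. The aim is to keep the essential parts and analyze them, without letting peripheral detail obscure the core of the original. For that reason the plotting modules have been removed and explanatory comments added to the key code.

Approach

Inspecting the data

Inspecting the data matters. I previously skipped this step and missed a lot of useful information. So check the dataset first to make sure nothing essential is overlooked. In particular, keep the y value (the target to predict) clearly separate from the x values (the attributes).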
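As a minimal sketch of keeping the target and the attributes apart, the snippet below uses a small hand-made table. Only the column name `G3` mirrors the dataset used later; the other columns and all values are invented for illustration.

```python
import pandas as pd

# Toy data: invented values, only the column name G3 mirrors the real dataset
toy = pd.DataFrame({
    "studytime": [1, 2, 3, 4],
    "failures": [2, 1, 0, 0],
    "G3": [8, 10, 14, 17],
})

y = toy["G3"]                 # y: the target to predict
X = toy.drop(columns=["G3"])  # X: the attributes used as inputs

print(toy.dtypes)        # which columns are numeric and which are text
print(toy.isna().sum())  # number of missing values per column
```

Checking `dtypes` and missing values up front is one simple way to avoid overlooking information before any modelling starts.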

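Visualization

This part of the code is not needed for the final analysis, so it has been removed.

Analysis

Building the models

First try predicting with a single constant taken from the training set (the code below uses the median of G3) as a baseline. If a later model does worse than that, it is goodbye to that model.
Finally, split the dataset (mind the random seed), run each model on the split to get its error, and pick the best one. That is all there is to it.

If this part is not clear yet, don't worry; the sections below go through it in detail.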
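The workflow described above can be sketched on synthetic data. Everything here is an assumption made for illustration: the random features, the linear target, and the `mae` helper are not part of the original code, and only one model is compared against the baseline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 3 random features, target driven by the first one
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 10 + 2 * X[:, 0] + rng.normal(scale=0.5, size=200)

# A fixed random_state gives the same split on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

def mae(pred, true):
    """Mean absolute error between predictions and true values."""
    return float(np.mean(np.abs(pred - true)))

# Baseline: predict the training median for every test sample
baseline_mae = mae(np.median(y_train), y_test)

# Candidate model: must beat the baseline to be worth keeping
model = LinearRegression().fit(X_train, y_train)
model_mae = mae(model.predict(X_test), y_test)

print(f"baseline MAE: {baseline_mae:.3f}")
print(f"linear regression MAE: {model_mae:.3f}")
```

A model whose error is not below the baseline has learned nothing useful from the features.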

Code walkthrough

First read in the data and use the describe function to get a rough picture of it.

import pandas as pd

# Some basic analysis
student = pd.read_csv('./input/student-mat.csv')
print(student.head())
print('Total number of students:', len(student))
print(student['G3'].describe())

Then one-hot encode the whole dataset.

# Encoding categorical variables
# Select only categorical variables
category_df = student.select_dtypes(include=['object'])  # pick the non-numeric columns

# One hot encode the variables
dummy_df = pd.get_dummies(category_df)  # get_dummies performs the one-hot encoding

# Put the grade back in the dataframe
dummy_df['G3'] = student['G3']

# Find correlations with grade
print(dummy_df.corr()['G3'].sort_values())

# Applying one hot encoding to our data and finding correlation again
# selecting the most correlated values and dropping the others
labels = student['G3']  # the values of the G3 column

# drop the school and grade columns
student = student.drop(['school', 'G1', 'G2'], axis='columns')  # drop three columns

# One-Hot Encoding of Categorical Variables
student = pd.get_dummies(student)  # one-hot encode the whole dataset
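To see what `pd.get_dummies` does to a frame, here is a small sketch on a made-up table. The column names and values are chosen for illustration, and `dtype=int` is an optional choice made here so the indicator columns show as 0/1.

```python
import pandas as pd

# Toy frame (invented values): one text column, one numeric column
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "age": [15, 16, 17],
})

# Text columns are replaced by one indicator column per category;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Each category becomes its own column named `<column>_<value>`, which is why the feature names printed later in the correlation ranking carry such suffixes.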

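Next comes the correlation analysis.

# Find correlations with the Grade: rank the columns by correlation with G3 (descending)
# student.corr() builds the correlation matrix, .abs() takes absolute values,
# ['G3'] picks the G3 column; sort_values is ascending by default, hence ascending=False
most_correlated = student.corr().abs()['G3'].sort_values(ascending=False)

# Maintain the top 8 most correlation features with Grade
# [:9] is [0:9], positions 0 through 8: G3 itself plus the 8 most correlated features
most_correlated = most_correlated[:9]
print(most_correlated)

# .loc selects by label: rows before the comma, columns after it.
# Keep every student's values for the nine selected columns.
student = student.loc[:, most_correlated.index]
print(student.head())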
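The ranking step can be tried on a tiny table first. This is a sketch with invented values; only the idea (absolute correlation with `G3`, sorted descending, then a positional slice) is taken from the code above.

```python
import pandas as pd

# Toy data (invented): failures tracks G3 closely, absences barely at all
toy = pd.DataFrame({
    "G3":       [5, 8, 10, 12, 15, 18],
    "failures": [3, 2, 2, 1, 0, 0],
    "absences": [4, 0, 6, 2, 5, 1],
})

# Absolute correlation of every column with G3, strongest first
ranked = toy.corr().abs()["G3"].sort_values(ascending=False)
print(ranked)

# G3 correlates perfectly with itself, so it always comes first;
# taking k + 1 entries keeps G3 plus the k strongest features (k = 1 here)
top = ranked.iloc[:2]
selected = toy.loc[:, top.index]
print(selected.columns.tolist())
```

Taking the absolute value matters: `failures` is strongly negatively correlated with the grade, and without `.abs()` it would sort to the bottom.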

Then use the simplest possible prediction, a constant equal to the median of the training-set G3, as the baseline method.

import numpy as np
from sklearn.model_selection import train_test_split

# splitting the data into training and testing data (75% and 25%)
# we mention the random state to achieve the same split everytime we run the code
# train_test_split(features, labels, test-set fraction, random_state):
# random_state=None (default) gives a different split each run, an integer gives the same split
X_train, X_test, y_train, y_test = train_test_split(student, labels, test_size=0.25, random_state=42)
print(X_train.head())

# Calculate mae and rmse
def evaluate_predictions(predictions, true):
    mae = np.mean(abs(predictions - true))  # mean absolute error (np.mean works on an array)
    rmse = np.sqrt(np.mean((predictions - true) ** 2))  # root mean squared error (np.sqrt also works on an array)
    return mae, rmse

# find the median
median_pred = X_train['G3'].median()  # take the median

# create a list with all values as median
median_preds = [median_pred for _ in range(len(X_test))]  # a list as long as the test set, filled with median_pred

# store the true G3 values for passing into the function
true = X_test['G3']  # keep the true results

# Display the naive baseline metrics
mb_mae, mb_rmse = evaluate_predictions(median_preds, true)
print('Median Baseline  MAE: {:.4f}'.format(mb_mae))
print('Median Baseline RMSE: {:.4f}'.format(mb_rmse))
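A worked example makes the two metrics concrete. The four grades below are invented, and the constant prediction of 11 happens to be their median; the function mirrors `evaluate_predictions` above, with an explicit conversion to arrays added here as an assumption for safety.

```python
import numpy as np

def evaluate_predictions(predictions, true):
    """Return (MAE, RMSE) for two equally long sequences."""
    predictions = np.asarray(predictions, dtype=float)
    true = np.asarray(true, dtype=float)
    mae = np.mean(np.abs(predictions - true))
    rmse = np.sqrt(np.mean((predictions - true) ** 2))
    return mae, rmse

true = np.array([10, 12, 8, 14])         # invented grades
median_preds = np.full(len(true), 11.0)  # constant prediction

# Errors are 1, -1, 3, -3:
#   MAE  = (1 + 1 + 3 + 3) / 4 = 2
#   RMSE = sqrt((1 + 1 + 9 + 9) / 4) = sqrt(5)
mae, rmse = evaluate_predictions(median_preds, true)
print(mae, rmse)
```

RMSE is never smaller than MAE, and the gap grows when a few errors are much larger than the rest, because squaring weights them more heavily.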

Finally, run the models and produce the results. (The last part is the only place where a figure is generated. It compares the models and is worth understanding.)

# evaluate() is defined in the full script further down
results = evaluate(X_train, X_test, y_train, y_test)
print(results)

plt.figure(figsize=(12, 8))

# Mean absolute error
ax = plt.subplot(1, 2, 1)
results.sort_values('mae', ascending=True).plot.bar(y='mae', color='b', ax=ax, fontsize=20)
plt.title('Model Mean Absolute Error', fontsize=20)
plt.ylabel('MAE', fontsize=20)

# Root mean squared error
ax = plt.subplot(1, 2, 2)
results.sort_values('rmse', ascending=True).plot.bar(y='rmse', color='r', ax=ax, fontsize=20)
plt.title('Model Root Mean Squared Error', fontsize=20)
plt.ylabel('RMSE', fontsize=20)

plt.show()
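Besides reading the bar chart, the best model can be picked programmatically from a table shaped like `results`. The sketch below uses placeholder numbers and placeholder model names; they are not results from the student dataset.

```python
import pandas as pd

# Placeholder numbers only, shaped like the `results` table (index = model names)
results = pd.DataFrame(
    {"mae": [3.4, 3.1, 3.6], "rmse": [4.5, 4.2, 4.6]},
    index=["Model A", "Model B", "Baseline"],
)

# Model with the lowest mean absolute error
best = results["mae"].idxmin()

# Models that actually beat the constant baseline
baseline_mae = results.loc["Baseline", "mae"]
beats_baseline = results.index[results["mae"] < baseline_mae].tolist()

print(results.sort_values("mae"))
print("best:", best)
print("better than baseline:", beats_baseline)
```

Keeping the baseline as a row in the same table makes it easy to spot any model that fails to improve on it.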

Runnable code

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

# Splitting data into training/testing
from sklearn.model_selection import train_test_split

# Some basic analysis
student = pd.read_csv('./input/student-mat.csv')
print(student.head())
print('Total number of students:', len(student))
print(student['G3'].describe())

# Correlation (numeric columns only; text columns cannot be correlated directly)
print(student.select_dtypes(include='number').corr()['G3'].sort_values())

# Encoding categorical variables
# Select only categorical variables
category_df = student.select_dtypes(include=['object'])  # pick the non-numeric columns

# One hot encode the variables
dummy_df = pd.get_dummies(category_df)  # get_dummies performs the one-hot encoding

# Put the grade back in the dataframe
dummy_df['G3'] = student['G3']

# Find correlations with grade
print(dummy_df.corr()['G3'].sort_values())

# Applying one hot encoding to our data and finding correlation again
# selecting the most correlated values and dropping the others
labels = student['G3']  # the values of the G3 column

# drop the school and grade columns
student = student.drop(['school', 'G1', 'G2'], axis='columns')  # drop three columns

# One-Hot Encoding of Categorical Variables
student = pd.get_dummies(student)  # one-hot encode the whole dataset

# Find correlations with the Grade: rank the columns by correlation with G3 (descending)
# student.corr() builds the correlation matrix, .abs() takes absolute values,
# ['G3'] picks the G3 column; sort_values is ascending by default, hence ascending=False
most_correlated = student.corr().abs()['G3'].sort_values(ascending=False)

# Maintain the top 8 most correlation features with Grade
# [:9] is [0:9], positions 0 through 8: G3 itself plus the 8 most correlated features
most_correlated = most_correlated[:9]
print(most_correlated)

# .loc selects by label: rows before the comma, columns after it.
# Keep every student's values for the nine selected columns.
student = student.loc[:, most_correlated.index]
print(student.head())

# splitting the data into training and testing data (75% and 25%)
# we mention the random state to achieve the same split everytime we run the code
# train_test_split(features, labels, test-set fraction, random_state):
# random_state=None (default) gives a different split each run, an integer gives the same split
X_train, X_test, y_train, y_test = train_test_split(student, labels, test_size=0.25, random_state=42)
print(X_train.head())

# Calculate mae and rmse
def evaluate_predictions(predictions, true):
    mae = np.mean(abs(predictions - true))  # mean absolute error (np.mean works on an array)
    rmse = np.sqrt(np.mean((predictions - true) ** 2))  # root mean squared error (np.sqrt also works on an array)
    return mae, rmse

# find the median
median_pred = X_train['G3'].median()  # take the median

# create a list with all values as median
median_preds = [median_pred for _ in range(len(X_test))]  # a list as long as the test set, filled with median_pred

# store the true G3 values for passing into the function
true = X_test['G3']  # keep the true results

# Display the naive baseline metrics
mb_mae, mb_rmse = evaluate_predictions(median_preds, true)
print('Median Baseline  MAE: {:.4f}'.format(mb_mae))
print('Median Baseline RMSE: {:.4f}'.format(mb_rmse))

# Evaluate several ml models by training on training set and testing on testing set
def evaluate(X_train, X_test, y_train, y_test):
    # Names of models
    model_name_list = ['Linear Regression', 'ElasticNet Regression',
                       'Random Forest', 'Extra Trees', 'SVM',
                       'Gradient Boosted', 'Baseline']

    # G3 is the target, so it must not stay among the features
    X_train = X_train.drop('G3', axis='columns')
    X_test = X_test.drop('G3', axis='columns')

    # Instantiate the models
    model1 = LinearRegression()
    model2 = ElasticNet()  # combines the ridge (L2) and lasso (L1) penalties to curb overfitting
    model3 = RandomForestRegressor()  # an ensemble of decision trees
    model4 = ExtraTreesRegressor()  # similar to a random forest, with more randomized splits
    model5 = SVR()  # support vector regression, the regression counterpart of an SVM classifier
    model6 = GradientBoostingRegressor(n_estimators=50)  # gradient boosting regression

    # Dataframe for results: columns are the metrics, index holds the model names
    results = pd.DataFrame(columns=['mae', 'rmse'], index=model_name_list)

    # Train and predict with each model
    for i, model in enumerate([model1, model2, model3, model4, model5, model6]):
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        # Metrics
        mae = np.mean(abs(predictions - y_test))
        rmse = np.sqrt(np.mean((predictions - y_test) ** 2))

        # Insert results into the dataframe
        model_name = model_name_list[i]
        results.loc[model_name, :] = [mae, rmse]

    # Median Value Baseline Metrics
    baseline = np.median(y_train)
    baseline_mae = np.mean(abs(baseline - y_test))
    baseline_rmse = np.sqrt(np.mean((baseline - y_test) ** 2))
    results.loc['Baseline', :] = [baseline_mae, baseline_rmse]

    return results

results = evaluate(X_train, X_test, y_train, y_test)
print(results)

plt.figure(figsize=(12, 8))

# Mean absolute error
ax = plt.subplot(1, 2, 1)
results.sort_values('mae', ascending=True).plot.bar(y='mae', color='b', ax=ax, fontsize=20)
plt.title('Model Mean Absolute Error', fontsize=20)
plt.ylabel('MAE', fontsize=20)

# Root mean squared error
ax = plt.subplot(1, 2, 2)
results.sort_values('rmse', ascending=True).plot.bar(y='rmse', color='r', ax=ax, fontsize=20)
plt.title('Model Root Mean Squared Error', fontsize=20)
plt.ylabel('RMSE', fontsize=20)

plt.show()
