Maximizing Profit on Loan Applications: A Hands-On Machine Learning Project

(Data Analysis and Machine Learning, week 2 hands-on project)

1. Project Overview

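This project uses borrower information published by an online lending site. From the historical data we build a model that predicts, for each new applicant, whether to grant the loan, with the goal of maximizing profit.
Website: http://lendingclub.com
Data: https://pan.baidu.com/s/1NoU9oGnRS70d8663SQPKVQ
Extraction code: oldg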

2. Data Cleaning: Filtering Out Useless Features

# Drop columns with too many missing values; dropna(thresh=n) keeps a column only if it has at least n non-NaN values
# Drop columns that carry no useful signal, such as "desc" (description) and "url" (link)
# Save the result to a new file
import pandas as pd
# The file path contains Chinese characters, so engine="python" is specified; otherwise read_csv
# fails with OSError: Initializing from file failed.
# skiprows=1 is also needed; otherwise the drop below fails with
# KeyError: "labels ['desc' 'url'] not contained in axis"
loans2007 = pd.read_csv(r"E:\BaiduNetdiskDownload\唐宇迪机器学习\贷款利润最大化\LoanStats3a.csv", engine="python", skiprows=1)
half = len(loans2007) // 2  # integer threshold: half the number of rows
loans2007 = loans2007.dropna(thresh=half, axis=1)
loans2007 = loans2007.drop(["desc", "url"], axis=1)
loans2007.to_csv(r"E:\BaiduNetdiskDownload\唐宇迪机器学习\贷款利润最大化\loans2007.csv", index=False)
# Print the first row
# Print the number of columns
loans2007 = pd.read_csv(r"E:\BaiduNetdiskDownload\唐宇迪机器学习\贷款利润最大化\loans2007.csv", engine="python")
print(loans2007.iloc[0])
print(loans2007.shape[1])  # shape is (m, n): m rows, n columns
# The more features there are, the easier it is to overfit, because low-value features add noise to the data
# Drop information only known after the loan is granted, such as funded_amnt / funded_amnt_inv;
# drop duplicated or highly correlated information, such as grade and sub_grade;
# drop information that cannot be evaluated, such as emp_title
loans2007 = loans2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans2007 = loans2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
loans2007 = loans2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
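
The effect of `dropna(thresh=..., axis=1)` is easier to see on a tiny frame. The sketch below uses made-up data (the column names `a`, `b`, `c` are purely illustrative and not from the LendingClub file) and the same "at least half the rows must be non-NaN" rule as the cleaning code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" is complete, "b" is mostly missing, "c" is half missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, np.nan],
})

half = len(toy) // 2  # a column needs at least 2 non-NaN values to be kept
kept = toy.dropna(thresh=half, axis=1)

# "b" has only one non-NaN value, so it is the only column removed.
print(list(kept.columns))
```

Note that `axis=1` applies the threshold per column; with `axis=0` the same call would filter rows instead.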

3. Data Preprocessing

# Build the label
# Use value_counts() to count how often each value of a column occurs
# Keep only the rows whose 'loan_status' is "Fully Paid" or "Charged Off"
# Convert the string values to numbers with DataFrame.replace(), passing {column: {old value: new value}}
print(loans2007["loan_status"].value_counts())
loans2007 = loans2007[(loans2007["loan_status"] == "Fully Paid") | (loans2007["loan_status"] == "Charged Off")]
status_replace = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
loans2007 = loans2007.replace(status_replace)  # remember to assign the result back
# Drop columns that hold only a single value; NaN has to be excluded before counting
columns = loans2007.columns
drop_columns = []
for col in columns:
    col_num = loans2007[col].dropna().unique()
    if len(col_num) == 1:
        drop_columns.append(col)
loans2007 = loans2007.drop(drop_columns, axis=1)
print(drop_columns)
print(loans2007.shape)
# index=False keeps the row index from being written out as an extra column
loans2007.to_csv(r"E:\BaiduNetdiskDownload\唐宇迪机器学习\贷款利润最大化\filtered_loans2007.csv", index=False)
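
As a compact illustration of the two steps above (building a 0/1 label and dropping single-valued columns), here is a minimal sketch on invented rows. The helper name `single_value_columns` and the column `pymnt_flag` are hypothetical, and `map` is used in place of `replace` only to keep the toy example short.

```python
import numpy as np
import pandas as pd

def single_value_columns(df):
    """Return the names of columns with exactly one distinct non-NaN value."""
    return [col for col in df.columns if df[col].dropna().nunique() == 1]

# Hypothetical rows; only the two final loan statuses are kept as labels.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
    "pymnt_flag": ["n", "n", np.nan, "n"],  # constant once NaN is ignored
    "amount": [1000, 2500, 4000, 1200],
})

toy = toy[toy["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
toy["loan_status"] = toy["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

constant = single_value_columns(toy)
toy = toy.drop(constant, axis=1)
print(constant)
print(toy)
```

Dropping NaN before counting matters: a column holding one real value plus NaN would otherwise look like it has two distinct values and survive the filter.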

4. Conditions and Approach for Maximizing Profit

# Load the data
# Count missing values
import pandas as pd
loans = pd.read_csv(r"E:\BaiduNetdiskDownload\唐宇迪机器学习\贷款利润最大化\filtered_loans2007.csv", engine="python")
null_counts = loans.isnull().sum()
print(null_counts)
print(loans["emp_length"].isnull().sum())  # emp_length reports 1073 missing values; why?
# Where few values are missing, drop the affected rows; where many are missing, drop the feature itself
# Drop the "pub_rec_bankruptcies" feature column
# Drop the rows that contain missing values
# Count how many columns there are of each data type
loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())
# Use select_dtypes to pick the object-typed columns and look at the first row
object_columns = loans.select_dtypes(include=["object"])
print(object_columns.iloc[0])
# purpose and title mean roughly the same thing, so keeping one of them is enough
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
# Process the features with a dictionary mapping, as before
# Drop the last_credit_pull_d, earliest_cr_line, addr_state and title columns
# Strip the percent sign from "int_rate" and "revol_util", then convert them to float
print(loans["emp_length"].unique())
mapping_dict = {"emp_length": {'10+ years': 10, '< 1 year': 0, '3 years': 3, '8 years': 8, '9 years': 9, '4 years': 4, '5 years': 5, '1 year': 1, '6 years': 6, '2 years': 2, '7 years': 7}}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])  # one-hot encode
loans = pd.concat([loans, dummy_df], axis=1)  # append the encoded columns to the main data
loans = loans.drop(cat_columns, axis=1)  # drop the original columns now that they are encoded
loans = loans.drop("pymnt_plan", axis=1)
loans.to_csv('cleaned_loans2007.csv', index=False)
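
The two conversions used above, stripping the percent sign and one-hot encoding, can be checked on a few invented rows. This is only a sketch: the values are made up, and the real `term` strings in the LendingClub file may be formatted differently.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
toy = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%", "7.90%"],
    "term": ["36 months", "60 months", "36 months"],
})

# "10.65%" -> 10.65
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype("float")

# One indicator column per distinct category, named "<column>_<value>".
dummies = pd.get_dummies(toy[["term"]])
toy = pd.concat([toy.drop("term", axis=1), dummies], axis=1)

print(toy.columns.tolist())
print(toy)
```

Each row ends up with exactly one active indicator among the `term_*` columns, which is what lets a purely numeric model consume a categorical feature.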

5. Prediction and Handling Class Imbalance

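The money earned by correctly lending to a good borrower is not the same as the money lost by wrongly lending to a bad one, so plain accuracy is not a suitable metric. Instead we rely on TPR (true positive rate) and FPR (false positive rate), and we want a high TPR together with a low FPR.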
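
To make the two rates concrete, here is a small pure-Python sketch with invented labels, where 1 means the loan was repaid and 0 means it was charged off. The function name `tpr_fpr` is hypothetical, and the sketch assumes both classes are present so that neither denominator is zero.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels; 1 = repaid, 0 = charged off."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Invented ground truth: 6 good borrowers, 2 bad ones.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]

# A model that lends to everyone looks fine on accuracy but has FPR = 1.
lend_to_all = [1] * len(y_true)
accuracy_all = sum(t == p for t, p in zip(y_true, lend_to_all)) / len(y_true)
rates_all = tpr_fpr(y_true, lend_to_all)

# A more selective model rejects one good and one bad borrower.
selective = [1, 1, 1, 1, 1, 0, 0, 1]
rates_selective = tpr_fpr(y_true, selective)

print(accuracy_all, rates_all, rates_selective)
```

The lend-to-all model reaches 75% accuracy here while approving every bad borrower, which is exactly the failure that accuracy alone would hide.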
We start with logistic regression, LogisticRegression().

# Load the data
# info() shows the number of samples, whether values are missing, and the column types
import pandas as pd
loans = pd.read_csv("cleaned_loans2007.csv")
print(loans.info())
# Prepare the features and the label, then start training the model
cols = loans.columns
train_col = cols.drop("loan_status")
features = loans[train_col]
target = loans["loan_status"]
# Check how well logistic regression classifies
# Measure the TP and FP metrics: both TPR and FPR turn out to be very high
# We suspect the model lends to everyone, so print the first 20 predictions to check
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression()
# No shuffling, so random_state is omitted (recent scikit-learn rejects random_state when shuffle=False)
kf = KFold(5, shuffle=False)
# Cross-validation
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
#print(fp_filter)
fp = len(predictions[fp_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
print(predictions[:20])

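The result is poor: the classifier predicts 1 for every sample. The cause is the roughly 6:1 ratio of positive to negative samples, which misleads the classifier. We therefore introduce class weights so that positive samples influence the result less and negative samples more, which balances the two classes.
Adding class weights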

# Add class weights
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression(class_weight="balanced")  # the only change from the previous run
# No shuffling, so random_state is omitted (recent scikit-learn rejects random_state when shuffle=False)
kf = KFold(5, shuffle=False)
# Cross-validation
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
#print(fp_filter)
fp = len(predictions[fp_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
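
For intuition about what `class_weight="balanced"` does, scikit-learn documents the weight of each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula in plain Python on an invented 6:1 label list; the helper name `balanced_weights` is hypothetical and is not a scikit-learn function.

```python
from collections import Counter

def balanced_weights(labels):
    """Per-class weight: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Invented labels with the 6:1 positive-to-negative ratio discussed above.
labels = [1] * 6 + [0] * 1
weights = balanced_weights(labels)

# The minority class is up-weighted by exactly the imbalance ratio.
ratio = weights[0] / weights[1]
print(weights, ratio)
```

With a 6:1 imbalance the negative class ends up weighted about six times as heavily as the positive class, so a misclassified bad loan costs the model roughly as much as six misclassified good ones.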

Defining the weights ourselves

# Custom class weights
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
penalty = {0: 5, 1: 1}  # weight class 0 (charged off) five times as heavily as class 1
lr = LogisticRegression(class_weight=penalty)
# No shuffling, so random_state is omitted (recent scikit-learn rejects random_state when shuffle=False)
kf = KFold(5, shuffle=False)
# Cross-validation
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
#print(fp_filter)
fp = len(predictions[fp_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)

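Next we try a random forest. class_weight again adjusts the relative weight of positive and negative samples. Other parameters can be changed as well: n_estimators=10 builds 10 trees, and raising it to 100 improved the results only slightly; they are still not very good.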

# Random forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, KFold
rf = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=1)
# No shuffling, so random_state is omitted (recent scikit-learn rejects random_state when shuffle=False)
kf = KFold(5, shuffle=False)
predictions = cross_val_predict(rf, features, target, cv=kf)
predictions = pd.Series(predictions)
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
#print(fp_filter)
fp = len(predictions[fp_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)

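When positive and negative samples are imbalanced, we can:

Adjust the class weight parameters
Try models other than random forest and logistic regression, such as AdaBoost or support vector machines
Bring back some of the features removed earlier, or compute new features from the existing ones
Combine other models with the current one and derive the classification from several models together
Tune model parameters, such as the number of trees or other AdaBoost strategies, to improve the results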
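
Trying several of these options is easier if the repeated TP/FP/TN/FN bookkeeping from the listings above is wrapped in one helper. The following is a minimal sketch under stated assumptions: `evaluate` is a hypothetical helper, the data is a synthetic stand-in from `make_classification` rather than the loan file, `max_iter=1000` is set only to avoid convergence warnings, and the folds are shuffled here, which differs from the unshuffled folds used earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

def evaluate(model, features, target, n_splits=5):
    """Return (TPR, FPR) computed from out-of-fold predictions."""
    kf = KFold(n_splits, shuffle=True, random_state=1)
    predictions = cross_val_predict(model, features, target, cv=kf)
    tp = ((predictions == 1) & (target == 1)).sum()
    fn = ((predictions == 0) & (target == 1)).sum()
    fp = ((predictions == 1) & (target == 0)).sum()
    tn = ((predictions == 0) & (target == 0)).sum()
    return tp / (tp + fn), fp / (fp + tn)

# Synthetic imbalanced data (about 15% class 0), standing in for the loan features.
X, y = make_classification(n_samples=700, n_features=10,
                           weights=[0.15, 0.85], random_state=1)

candidates = {
    "logistic, no weights": LogisticRegression(max_iter=1000),
    "logistic, balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "logistic, {0: 5, 1: 1}": LogisticRegression(max_iter=1000, class_weight={0: 5, 1: 1}),
    "random forest, balanced": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=1),
}

results = {name: evaluate(model, X, y) for name, model in candidates.items()}
for name, (tpr, fpr) in results.items():
    print(f"{name}: TPR={tpr:.3f} FPR={fpr:.3f}")
```

Swapping in the real `features` and `target` in place of `X` and `y` would let each idea in the list be compared on the same two numbers.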

Personal summary:

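What sets this project apart is its size: more than 40,000 records and a very large number of feature columns. A large sample count works in our favor, and more is better, but the more features there are, the easier it is to overfit, and the choice of feature columns largely determines how good the model is. We therefore cleaned the feature columns in several ways: we removed columns that are useless, redundant, only available after the loan has been granted, or impossible to evaluate, as well as columns that hold a single value. During preprocessing we found that the raw data has no label that directly states whether to grant the loan, so we built the label ourselves.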

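We then handled missing values: where few values were missing we dropped the affected rows, and where many were missing we dropped the feature itself. Because sklearn only works with numeric variables, we also converted the remaining string columns.
The last step was prediction. We started with logistic regression and found that both TPR and FPR were very high (we want a high TPR and a low FPR), which made us suspect that the model approved every applicant. That in turn was caused by the imbalance between positive and negative samples in the training set, so we introduced class weights and then defined custom weights to balance the samples. After that we tried a random forest, again with class weights, and tuned its parameters in search of better results.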

Overall, I learned a lot from this project. Coding skill and fluency with model APIs still have to be built by working through real projects.