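Bank Loan Prediction Analysis (Loan Prediction)

A predictive analysis of loan data: using Python to examine which applicant attributes affect loan approval and to predict which customers are more likely to obtain a bank loan.

Data source, Loan Prediction: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

Question: which customers are more likely to obtain a bank loan?

Import the data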

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the data
full_data = pd.read_csv('loan_train.csv')
full_data.shape

(614, 13)

The data has 614 rows and 13 columns.

View the first five rows

full_data.head()

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
0	LP001002	Male	No	0	Graduate	No	5849	0.0	NaN	360.0	1.0	Urban	Y
1	LP001003	Male	Yes	1	Graduate	No	4583	1508.0	128.0	360.0	1.0	Rural	N
2	LP001005	Male	Yes	0	Graduate	Yes	3000	0.0	66.0	360.0	1.0	Urban	Y
3	LP001006	Male	Yes	0	Not Graduate	No	2583	2358.0	120.0	360.0	1.0	Urban	Y
4	LP001008	Male	No	0	Graduate	No	6000	0.0	141.0	360.0	1.0	Urban	Y

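I. Understanding the data

Loan_ID: loan ID

Gender: gender (Male, Female)

ApplicantIncome: applicant's income

CoapplicantIncome: co-applicant's income

Credit_History: credit history flag (0 or 1)

Dependents: number of dependents (0, 1, 2, 3+)

Education: education level (Graduate, Not Graduate)

LoanAmount: loan amount

Loan_Amount_Term: loan term

Loan_Status: loan status (Y, N)

Married: marital status (No, Yes)

Property_Area: property area (Urban, Semiurban, Rural)

Self_Employed: employment type, self-employed or not

View descriptive statistics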

full_data.describe()

	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History
count	614.000000	614.000000	592.000000	600.00000	564.000000
mean	5403.459283	1621.245798	146.412162	342.00000	0.842199
std	6109.041673	2926.248369	85.587325	65.12041	0.364878
min	150.000000	0.000000	9.000000	12.00000	0.000000
25%	2877.500000	0.000000	100.000000	360.00000	1.000000
50%	3812.500000	1188.500000	128.000000	360.00000	1.000000
75%	5795.000000	2297.250000	168.000000	360.00000	1.000000
max	81000.000000	41667.000000	700.000000	480.00000	1.000000

View the dataset summary

full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB

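The data has missing values, which are handled in a later step.

II. Univariate analysis

1. The target variable Loan_Status

# Count each value of the target variable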
full_data['Loan_Status'].value_counts()

Y    422
N    192
Name: Loan_Status, dtype: int64

# Proportion of each value
full_data['Loan_Status'].value_counts(normalize=True)

Y    0.687296
N    0.312704
Name: Loan_Status, dtype: float64

sns.countplot(x='Loan_Status', data=full_data, palette = 'Set1')

Of the 614 applicants, 422 (about 69%) had their loan approved.

2. Gender

full_data['Gender'].value_counts(normalize=True)

Male      0.813644
Female    0.186356
Name: Gender, dtype: float64

sns.countplot(x='Gender', data=full_data, palette = 'Set1')

About 81% of the applicants with a recorded gender are male.

3. Married

full_data['Married'].value_counts(normalize=True).plot.bar(title= 'Married')

About 65% of the applicants are married.

4. Dependents

Dependents=full_data['Dependents'].value_counts(normalize=True)
Dependents

0     0.575960
1     0.170284
2     0.168614
3+    0.085142
Name: Dependents, dtype: float64

Dependents.plot.bar(title= 'Dependents')

Most applicants have no dependents, about 58% of those with a recorded value.

5. Self_Employed

Self_Employed=full_data['Self_Employed'].value_counts(normalize=True)
print(Self_Employed)

No     0.859107
Yes    0.140893
Name: Self_Employed, dtype: float64

Self_Employed.plot.bar(title= 'Self_Employed')

About 14% of the applicants are self-employed.

6. Loan_Amount_Term

full_data['Loan_Amount_Term'].value_counts().plot.bar(title= 'Loan_Amount_Term')

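Most loans have a term of 360. The dataset documents Loan_Amount_Term in months, so this is a 30-year term rather than 360 days.

7. Credit_History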

Credit_History=full_data['Credit_History'].value_counts(normalize=True)
print(Credit_History)

1.0    0.842199
0.0    0.157801
Name: Credit_History, dtype: float64

Credit_History.plot.bar(title= 'Credit_History')

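About 84% of the applicants with a recorded value have Credit_History = 1, that is, they have repaid their previous debts.

8. Education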

Education=full_data['Education'].value_counts(normalize=True)
print(Education)

Graduate        0.781759
Not Graduate    0.218241
Name: Education, dtype: float64

Education.plot.bar(title= 'Education')

About 78% of the applicants are graduates.

III. Bivariate analysis: each feature versus the target variable (Loan_Status)

1. Gender and loan status

Gender=pd.crosstab(full_data['Gender'],full_data['Loan_Status'])
Gender.plot(kind="bar", stacked=True, figsize=(5,5))

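Male applicants account for most of the approved loans. They also make up most of the applicants, so the stacked counts alone do not show whether their approval rate is higher.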
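Because a stacked count chart mixes group size with approval rate, a row-normalized cross-tabulation can make the comparison clearer. The following is a minimal sketch on made-up toy rows, not the real data; the names `toy`, `counts`, and `rates` are illustrative only.

```python
import pandas as pd

# Toy rows standing in for the Gender and Loan_Status columns (made-up values).
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Loan_Status": ["Y", "Y", "Y", "N", "Y", "N"],
})

# Raw counts: this is what a stacked bar chart of the crosstab draws.
counts = pd.crosstab(toy["Gender"], toy["Loan_Status"])

# Row-normalized shares: each row sums to 1, so column "Y" is the approval rate.
rates = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="index")

print(counts)
print(rates)
```

The same `normalize="index"` argument could be applied to any of the crosstabs in this section, for example before calling `.plot(kind="bar", stacked=True)`.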

2. Marital status and loan status

Married=pd.crosstab(full_data['Married'],full_data['Loan_Status'])
Married.plot(kind="bar", stacked=True, figsize=(5,5))

Married applicants account for the most approved loans.

3. Number of dependents and loan status

Dependents=pd.crosstab(full_data['Dependents'],full_data['Loan_Status'])
Dependents.plot(kind="bar", stacked=True, figsize=(5,5))

Applicants with no dependents also account for the most approved loans.

4. Education and loan status

Education=pd.crosstab(full_data['Education'],full_data['Loan_Status'])
Education.plot(kind="bar", stacked=True, figsize=(5,5))

Graduates account for more approved loans than non-graduates.

5. Self-employment and loan status

Self_Employed=pd.crosstab(full_data['Self_Employed'],full_data['Loan_Status'])
Self_Employed.plot(kind="bar", stacked=True, figsize=(5,5))

Applicants who are not self-employed account for the most approved loans.

6. Credit history and loan status

Credit_History=pd.crosstab(full_data['Credit_History'],full_data['Loan_Status'])
Credit_History.plot(kind="bar", stacked=True, figsize=(5,5))

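Applicants with a credit history of 1 are more likely to have their loan approved, which suggests that a good credit record raises the chance of getting a loan.

7. Property area and loan status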

Property_Area=pd.crosstab(full_data['Property_Area'],full_data['Loan_Status'])
Property_Area.plot(kind="bar", stacked=True, figsize=(5,5))

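Semiurban areas have more approved loans than rural or urban areas.

IV. Visualizing correlations with a heatmap

The heatmap is used to view the correlations among all numeric variables.

First, convert the categorical features to numeric values so that they can be included in the correlation heatmap.

The value 3+ in Dependents is changed to 3 to make it a numeric variable. The categories of the target variable are also converted to 0 and 1 so that its correlation with the numeric variables can be computed.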

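# Assign each result back to the column rather than using inplace=True on a selection
full_data['Gender'] = full_data['Gender'].replace({'Female': 0, 'Male': 1})
full_data['Married'] = full_data['Married'].replace({'No': 0, 'Yes': 1})  # the data spells the value 'No'
full_data['Dependents'] = full_data['Dependents'].replace({'0': 0, '1': 1, '2': 2, '3+': 3})
full_data['Education'] = full_data['Education'].replace({'Not Graduate': 0, 'Graduate': 1})
full_data['Self_Employed'] = full_data['Self_Employed'].replace({'No': 0, 'Yes': 1})
full_data['Property_Area'] = full_data['Property_Area'].replace({'Semiurban': 0, 'Urban': 1, 'Rural': 2})
# Encode the target as 0/1 so that it appears in the correlation matrix
full_data['Loan_Status'] = full_data['Loan_Status'].replace({'N': 0, 'Y': 1})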

The heatmap represents values through color. Darker cells indicate a higher correlation.

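# Correlate the numeric columns only; text columns such as Loan_ID are skipped
matrix = full_data.select_dtypes(include='number').corr()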
f, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(matrix,vmax=.8, square=True,cmap="BuPu",annot=True);

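The most strongly correlated pairs are (ApplicantIncome, LoanAmount) and (Credit_History, Loan_Status).

LoanAmount is also correlated with CoapplicantIncome. This suggests that the applicant's income is closely related to the loan amount, and that credit history is closely related to loan status.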
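Reading the strongest pairs off a heatmap by eye can be error-prone, so it may help to list them programmatically. The sketch below ranks the pairs of a correlation matrix by absolute value; the toy columns `income`, `loan`, and `term` and their values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns; in the analysis above this would be the numeric part of full_data.
toy = pd.DataFrame({
    "income": [1, 2, 3, 4, 5],
    "loan": [2, 4, 6, 8, 11],
    "term": [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by absolute correlation, strongest first, keeping the sign.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top, which a color scale capped with `vmax` can make easy to overlook.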

V. Handling missing values and outliers

Check the continuous features for outliers

Applicant income

plt.figure()
plt.subplot(121)
sns.distplot(full_data['ApplicantIncome']);
plt.subplot(122)
full_data['ApplicantIncome'].plot.box(figsize=(16,5))
plt.show()

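The income distribution is right-skewed: most values sit at the low end and the shape is not normal. The box plot confirms a large number of outliers and a wide income gap, so the variable needs further treatment.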
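The points a box plot draws beyond its whiskers follow, by matplotlib's default, the 1.5 × IQR rule, so the outliers can also be counted directly. This is a minimal sketch on a made-up income series; the values and the names `income`, `lower`, `upper`, and `outliers` are illustrative assumptions.

```python
import pandas as pd

# Made-up incomes with one extreme value, standing in for ApplicantIncome.
income = pd.Series([2500, 3000, 3200, 3800, 4100, 4500, 5800, 81000])

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Fences used by the default box plot whiskers (whis=1.5).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are the points drawn individually on the box plot.
outliers = income[(income < lower) | (income > upper)]
print(len(outliers), "outlier(s):", outliers.tolist())
```

Applied to the real column, the same mask would give a count to go alongside the visual impression from the plot.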

Plot income separately by education level

full_data.boxplot(column='ApplicantIncome', by = 'Education')
plt.suptitle("")

Graduates include many high earners, who appear as outliers.

Loan amount

plt.figure(1)
plt.subplot(121)
df=full_data.dropna()
sns.distplot(df['LoanAmount'])
plt.subplot(122)
full_data['LoanAmount'].plot.box(figsize=(16,5))
plt.show()

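The loan amount is roughly normally distributed, but the box plot shows many outliers, which are handled below.

Handling missing values

Count the missing values per column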

full_data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

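The features Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History have missing values.

Ways to fill the missing values:

For numeric variables: impute with the mean or the median

For categorical variables: impute with the mode. Here most columns are filled with the mode, and LoanAmount is filled with its mean.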

# Categorical and flag-like columns: fill with the most frequent value
full_data['Gender'] = full_data['Gender'].fillna(full_data['Gender'].value_counts().idxmax())
full_data['Married'] = full_data['Married'].fillna(full_data['Married'].value_counts().idxmax())
full_data['Dependents'] = full_data['Dependents'].fillna(full_data['Dependents'].value_counts().idxmax())
full_data['Self_Employed'] = full_data['Self_Employed'].fillna(full_data['Self_Employed'].value_counts().idxmax())
# LoanAmount: fill with the mean of the observed values
full_data['LoanAmount'] = full_data['LoanAmount'].fillna(full_data['LoanAmount'].mean(skipna=True))
full_data['Loan_Amount_Term'] = full_data['Loan_Amount_Term'].fillna(full_data['Loan_Amount_Term'].value_counts().idxmax())
full_data['Credit_History'] = full_data['Credit_History'].fillna(full_data['Credit_History'].value_counts().idxmax())
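The text lists the median as an alternative to the mean for numeric columns. The sketch below shows mode and median imputation side by side on a made-up frame; it is an illustration of that alternative, not the method used above, and the values are invented so that one large loan amount pulls the mean upward.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "LoanAmount": [100.0, np.nan, 120.0, 700.0],
})

filled = toy.copy()

# Categorical column: fill with the mode (most frequent value).
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode()[0])

# Numeric column: the median is less sensitive to the extreme value 700 than the mean.
filled["LoanAmount"] = filled["LoanAmount"].fillna(filled["LoanAmount"].median())

print(filled)
print("mean would have been:", toy["LoanAmount"].mean())
```

With skewed columns such as the loan amount, the choice between mean and median can change the imputed value noticeably, so it is worth checking both.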

Check whether any missing values remain

full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               614 non-null float64
Married              614 non-null object
Dependents           614 non-null float64
Education            614 non-null int64
Self_Employed        614 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null float64
Loan_Amount_Term     614 non-null float64
Credit_History       614 non-null float64
Property_Area        614 non-null int64
Loan_Status          614 non-null object
dtypes: float64(7), int64(3), object(3)
memory usage: 62.4+ KB

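Every column now has 614 non-null values, so no missing values remain.

Handling outliers

The outliers are handled with a log transform, which reduces their influence and brings the distributions closer to normal.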

full_data['LoanAmount_log'] = np.log(full_data['LoanAmount'])
full_data['LoanAmount_log'].hist(bins=20)

full_data['ApplicantIncomeLog'] = np.log(full_data['ApplicantIncome'])
full_data['ApplicantIncomeLog'].hist(bins=20)
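One way to check that a log transform actually reduces skew is to compare the sample skewness before and after. The following sketch uses a made-up income series with one extreme value; the numbers and variable names are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up, strictly positive incomes with one very large value.
income = pd.Series([1500, 2500, 3000, 3800, 4500, 6000, 9000, 81000], dtype=float)

# np.log requires positive inputs; a column containing zeros would give -inf.
log_income = np.log(income)

skew_before = income.skew()
skew_after = log_income.skew()

print("skewness before log:", round(skew_before, 2))
print("skewness after log: ", round(skew_after, 2))
```

For a column that can be zero, such as CoapplicantIncome, `np.log1p` (which computes log(1 + x)) is a common substitute, since `np.log(0)` is negative infinity.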

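With the outliers handled, the next step is to build a model and measure its prediction accuracy.

VI. Building the model (logistic regression)

The Loan_ID variable has no effect on loan status, so it is dropped.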

full_data=full_data.drop('Loan_ID',axis=1)

Separate the target variable Loan_Status from the features and store it in its own variable

X = full_data.drop('Loan_Status', axis=1)
y = full_data.Loan_Status

X=pd.get_dummies(X)
full_data=pd.get_dummies(full_data)
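For readers unfamiliar with `pd.get_dummies`: it leaves numeric columns as they are and expands any column still stored as text into one indicator column per category. A minimal sketch on a made-up frame (three rows borrowed in spirit from the preview above, reduced to two columns for illustration):

```python
import pandas as pd

# One numeric column and one text column.
toy = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "Loan_Status": ["Y", "N", "Y"],
})

# Text columns become indicator columns named <column>_<category>.
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded)
```

Note that applying `get_dummies` to a frame that still contains a text-valued target splits the target itself into two columns, which is usually only wanted for the feature matrix.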

Import train_test_split

from sklearn.model_selection import train_test_split
# Create the training and validation sets
x_train, x_cv, y_train, y_cv = train_test_split(X,y, test_size =0.3)
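Without a fixed seed, each run of the split above produces a different validation set and therefore a slightly different accuracy. A possible variant, sketched here on made-up arrays, fixes `random_state` and uses `stratify` so that both parts keep the same share of approved loans; the seed 42 and the toy arrays are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and labels: 14 approved ("Y") and 6 rejected ("N").
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["Y"] * 14 + ["N"] * 6)

# random_state makes the split repeatable; stratify keeps the Y/N ratio in both parts.
x_tr, x_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train size:", len(y_tr), "validation size:", len(y_val))
print("share of N in validation:", (y_val == "N").mean())
```

With an imbalanced target like Loan_Status, stratifying avoids a validation set that happens to contain unusually few rejected applications.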

Import LogisticRegression and accuracy_score from sklearn

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Create and train the logistic regression model

model = LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,penalty='l2', random_state=None, solver='liblinear', tol=0.0001,verbose=0, warm_start=False)

Evaluate the model

pred_cv = model.predict(x_cv)
accuracy_score(y_cv,pred_cv)

0.8054054054054054

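The predictions are about 80% accurate: the model correctly identifies the loan status of roughly 80% of the applicants in the validation set.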
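An accuracy figure is easier to judge against a baseline. Since about 69% of all applicants were approved, always predicting Y would already be right about 69% of the time on the full data. The sketch below compares accuracy with such a majority-class baseline and tabulates the errors; the label lists are made up and do not reproduce the model's actual predictions.

```python
import pandas as pd

# Made-up true labels and predictions for ten applicants.
y_true = pd.Series(["Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "Y"])
y_pred = pd.Series(["Y", "Y", "N", "Y", "Y", "Y", "Y", "N", "Y", "N"])

# Accuracy: share of predictions that match the true label.
accuracy = (y_true == y_pred).mean()

# Baseline: always predict the most frequent true label.
majority = y_true.value_counts().idxmax()
baseline = (y_true == majority).mean()

# Confusion table: rows are actual labels, columns are predicted labels.
confusion = pd.crosstab(y_true, y_pred, rownames=["actual"], colnames=["predicted"])

print("accuracy:", accuracy, "baseline:", baseline)
print(confusion)
```

The confusion table also shows which kind of error dominates, for example rejected applications predicted as approved, which a single accuracy number hides.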

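Summary

This exercise covered the basic workflow of data analysis, including filling missing values, handling outliers, and data visualization. Many modeling methods remain unfamiliar, so further study of Python data analysis and model building is needed.
This was mainly an exercise for practicing Python; more data analysis methods remain to be learned.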