Task1: Understanding the Competition

Data anonymization

Evaluation metric

Code

Reading the data

import pandas as pd
import numpy as np
path = './data/'
Train_data = pd.read_csv(path+'used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')
print('Train data shape:', Train_data.shape)
print('TestA data shape:', Test_data.shape)
Train data shape: (150000, 31)
TestA data shape: (50000, 30)
Train_data.head()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482

5 rows × 31 columns

Classification metrics

Reference on classification metrics: https://www.cnblogs.com/guoyaohua/p/classification-metrics.html

#accuracy
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
print(accuracy_score(y_true, y_pred))
0.75
# precision, recall, f1-score
from sklearn import metrics
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
print('precision ', metrics.precision_score(y_true, y_pred))
print('recall ', metrics.recall_score(y_true, y_pred))
print('F1-score ', metrics.f1_score(y_true, y_pred))
precision  1.0
recall  0.6666666666666666
F1-score  0.8
# AUC
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_scores))
0.75
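
To make the numbers above less of a black box, here is a minimal sketch that recomputes them by hand from the same toy labels, without sklearn. The helper name pairwise_auc is made up for illustration; it counts the share of (positive, negative) pairs in which the positive sample gets the higher score, which is one common way to read AUC.

```python
# Hand-computed versions of the metrics above (plain Python, no sklearn).
y_true = [0, 1, 1, 1]
y_pred = [0, 1, 0, 1]

# Confusion-matrix cells for the positive class (label 1).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)


def pairwise_auc(labels, scores):
    """Hypothetical helper: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(accuracy, precision, recall, f1, auc)
```

Three of the four positive/negative pairs are ordered correctly, which is where the AUC of 0.75 comes from.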

Computing regression metrics

# Mean absolute percentage error; undefined when y_true contains 0
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
# MSE
print('MSE:',metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE
print('MAE:',metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:',mape(y_true, y_pred))
## R2-score
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:',r2_score(y_true, y_pred))
MSE: 0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.1461904761904762
R2-score: 0.9486081370449679
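
The R2 value is the least obvious of the five, so here is a small sketch that derives it from its definition, 1 - SS_res / SS_tot, using the same four points. The function name r2_from_scratch is an illustrative choice, not part of sklearn.

```python
def r2_from_scratch(y_true, y_pred):
    """Coefficient of determination computed directly from its definition."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


r2 = r2_from_scratch([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print('R2-score:', r2)
```

Here SS_res is 1.5 and SS_tot is 29.1875, so the predictions leave about 5% of the variance unexplained.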

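Task2: Data Analysis

EDA goals

  • The main value of EDA lies in getting familiar with the dataset and validating it, to confirm that it can be used for the machine learning or deep learning work that follows.

  • Once the dataset is understood, the next step is to examine the relationships among the variables and between each variable and the prediction target.

  • EDA guides practitioners through the data processing and feature engineering steps, so that the structure of the dataset and its feature set make the subsequent prediction problem more reliable.

  • Complete the exploratory analysis, summarize the data with charts or text, and check in.

Code

1 Loading the data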

# Load libraries
import warnings
warnings.filterwarnings('ignore')
# Warnings show up frequently when working in Jupyter Notebook,
# but many of them can be ignored, and a flood of them clutters the output.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # missing-value visualization
# Load the training and test sets
path = './data/'
Train_data = pd.read_csv(path+'used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([Train_data.head(), Train_data.tail()])
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482
149995 149995 163978 20000607 121.0 10 4.0 0.0 1.0 163 15.0 ... 0.280264 0.000310 0.048441 0.071158 0.019174 1.988114 -2.983973 0.589167 -1.304370 -0.302592
149996 149996 184535 20091102 116.0 11 0.0 0.0 0.0 125 10.0 ... 0.253217 0.000777 0.084079 0.099681 0.079371 1.839166 -2.774615 2.553994 0.924196 -0.272160
149997 149997 147587 20101003 60.0 11 1.0 1.0 0.0 90 6.0 ... 0.233353 0.000705 0.118872 0.100118 0.097914 2.439812 -1.630677 2.290197 1.891922 0.414931
149998 149998 45907 20060312 34.0 10 3.0 1.0 0.0 156 15.0 ... 0.256369 0.000252 0.081479 0.083558 0.081498 2.075380 -2.633719 1.414937 0.431981 -1.659014
149999 149999 177672 19990204 19.0 28 6.0 0.0 1.0 193 12.5 ... 0.284475 0.000000 0.040072 0.062543 0.025819 1.978453 -3.179913 0.031724 -1.483350 -0.342674

10 rows × 31 columns

Train_data.shape
(150000, 31)
pd.concat([Test_data.head(), Test_data.tail()])
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 150000 66932 20111212 222.0 4 5.0 1.0 1.0 313 15.0 ... 0.264405 0.121800 0.070899 0.106558 0.078867 -7.050969 -0.854626 4.800151 0.620011 -3.664654
1 150001 174960 19990211 19.0 21 0.0 0.0 0.0 75 12.5 ... 0.261745 0.000000 0.096733 0.013705 0.052383 3.679418 -0.729039 -3.796107 -1.541230 -0.757055
2 150002 5356 20090304 82.0 21 0.0 0.0 0.0 109 7.0 ... 0.260216 0.112081 0.078082 0.062078 0.050540 -4.926690 1.001106 0.826562 0.138226 0.754033
3 150003 50688 20100405 0.0 0 0.0 0.0 1.0 160 7.0 ... 0.260466 0.106727 0.081146 0.075971 0.048268 -4.864637 0.505493 1.870379 0.366038 1.312775
4 150004 161428 19970703 26.0 14 2.0 0.0 0.0 75 15.0 ... 0.250999 0.000000 0.077806 0.028600 0.081709 3.616475 -0.673236 -3.197685 -0.025678 -0.101290
49995 199995 20903 19960503 4.0 4 4.0 0.0 0.0 116 15.0 ... 0.284664 0.130044 0.049833 0.028807 0.004616 -5.978511 1.303174 -1.207191 -1.981240 -0.357695
49996 199996 708 19991011 0.0 0 0.0 0.0 0.0 75 15.0 ... 0.268101 0.108095 0.066039 0.025468 0.025971 -3.913825 1.759524 -2.075658 -1.154847 0.169073
49997 199997 6693 20040412 49.0 1 0.0 1.0 1.0 224 15.0 ... 0.269432 0.105724 0.117652 0.057479 0.015669 -4.639065 0.654713 1.137756 -1.390531 0.254420
49998 199998 96900 20020008 27.0 1 0.0 0.0 1.0 334 15.0 ... 0.261152 0.000490 0.137366 0.086216 0.051383 1.833504 -2.828687 2.465630 -0.911682 -2.057353
49999 199999 193384 20041109 166.0 6 1.0 NaN 1.0 68 9.0 ... 0.228730 0.000300 0.103534 0.080625 0.124264 2.914571 -1.135270 0.547628 2.094057 -1.552150

10 rows × 30 columns

Test_data.shape
(50000, 30)

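2 Data overview

  1. describe gives summary statistics for each column: count, mean, standard deviation (std), minimum (min), the quartiles (25%, 50%, 75%; the 50% value is the median), and maximum (max). Its main use is to get a quick sense of each column's range and to spot anomalous values. For example, values such as 999, 9999 or -1 sometimes turn out to be another way of encoding nan, which is worth watching for.
  2. info shows the type of each column, which helps reveal special symbols other than nan that mark anomalies.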
Train_data.describe()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 150000.000000 150000.000000 1.500000e+05 149999.000000 150000.000000 145494.000000 141320.000000 144019.000000 150000.000000 150000.000000 ... 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000
mean 74999.500000 68349.172873 2.003417e+07 47.129021 8.052733 1.792369 0.375842 0.224943 119.316547 12.597160 ... 0.248204 0.044923 0.124692 0.058144 0.061996 -0.001000 0.009035 0.004813 0.000313 -0.000688
std 43301.414527 61103.875095 5.364988e+04 49.536040 7.864956 1.760640 0.548677 0.417546 177.168419 3.919576 ... 0.045804 0.051743 0.201410 0.029186 0.035692 3.772386 3.286071 2.517478 1.288988 1.038685
min 0.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 -9.168192 -5.558207 -9.639552 -4.153899 -6.546556
25% 37499.750000 11156.000000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 ... 0.243615 0.000038 0.062474 0.035334 0.033930 -3.722303 -1.951543 -1.871846 -1.057789 -0.437034
50% 74999.500000 51638.000000 2.003091e+07 30.000000 6.000000 1.000000 0.000000 0.000000 110.000000 15.000000 ... 0.257798 0.000812 0.095866 0.057014 0.058484 1.624076 -0.358053 -0.130753 -0.036245 0.141246
75% 112499.250000 118841.250000 2.007111e+07 66.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 ... 0.265297 0.102009 0.125243 0.079382 0.087491 2.844357 1.255022 1.776933 0.942813 0.680378
max 149999.000000 196812.000000 2.015121e+07 247.000000 39.000000 7.000000 6.000000 1.000000 19312.000000 15.000000 ... 0.291838 0.151420 1.404936 0.160791 0.222787 12.357011 18.819042 13.847792 11.147669 8.658418

8 rows × 30 columns

Test_data.describe()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 50000.000000 50000.000000 5.000000e+04 50000.000000 50000.000000 48587.000000 47107.000000 48090.000000 50000.000000 50000.000000 ... 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000
mean 174999.500000 68542.223280 2.003393e+07 46.844520 8.056240 1.782185 0.373405 0.224350 119.883620 12.595580 ... 0.248669 0.045021 0.122744 0.057997 0.062000 -0.017855 -0.013742 -0.013554 -0.003147 0.001516
std 14433.901067 61052.808133 5.368870e+04 49.469548 7.819477 1.760736 0.546442 0.417158 185.097387 3.908979 ... 0.044601 0.051766 0.195972 0.029211 0.035653 3.747985 3.231258 2.515962 1.286597 1.027360
min 150000.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 -9.160049 -5.411964 -8.916949 -4.123333 -6.112667
25% 162499.750000 11203.500000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 ... 0.243762 0.000044 0.062644 0.035084 0.033714 -3.700121 -1.971325 -1.876703 -1.060428 -0.437920
50% 174999.500000 52248.500000 2.003091e+07 29.000000 6.000000 1.000000 0.000000 0.000000 109.000000 15.000000 ... 0.257877 0.000815 0.095828 0.057084 0.058764 1.613212 -0.355843 -0.142779 -0.035956 0.138799
75% 187499.250000 118856.500000 2.007110e+07 65.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 ... 0.265328 0.102025 0.125438 0.079077 0.087489 2.832708 1.262914 1.764335 0.941469 0.681163
max 199999.000000 196805.000000 2.015121e+07 246.000000 39.000000 7.000000 6.000000 1.000000 20000.000000 15.000000 ... 0.291618 0.153265 1.358813 0.156355 0.214775 12.338872 18.856218 12.950498 5.913273 2.624622

8 rows × 29 columns

Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48587 non-null float64
fuelType             47107 non-null float64
gearbox              48090 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_13                 50000 non-null float64
v_14                 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
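
The overview notes mention that values like 999, 9999 or -1 can be nan in disguise. A minimal sketch of how such sentinels might be converted, assuming a toy frame: the column contents and the sentinel list below are made up for illustration, and whether a given value really is a sentinel has to be judged per column (for instance, -1 can be a legitimate value).

```python
import numpy as np
import pandas as pd

# Toy data; values and sentinel list are illustrative assumptions.
toy = pd.DataFrame({
    'power': [60, 9999, 163, -1],
    'kilometer': [12.5, 15.0, 999, 5.0],
})
SENTINELS = [999, 9999, -1]

# Map every sentinel to NaN so describe() and isnull() see it as missing.
cleaned = toy.replace(SENTINELS, np.nan)
print(cleaned.describe())
print(cleaned.isnull().sum())
```

After the replacement, describe() no longer reports 9999 as the maximum of power, and the hidden gaps show up in isnull().sum().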

Checking for missing data and anomalies

This covers two parts: nan values and anomalous values.

# Count the nan values in each column
Train_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   1
brand                   0
bodyType             4506
fuelType             8680
gearbox              5981
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
price                   0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
Test_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             1413
fuelType             2893
gearbox              1910
power                   0
kilometer               0
notRepairedDamage       0
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
dtype: int64
# visualization
missing = Train_data.isnull().sum()
missing = missing[missing>0]
print(missing)
model          1
bodyType    4506
fuelType    8680
gearbox     5981
dtype: int64
missing.sort_values(inplace=True) #sort and replace
print(missing)
model          1
bodyType    4506
gearbox     5981
fuelType    8680
dtype: int64
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1a22b660d0>

[Figure output_27_1.png: bar chart of missing-value counts per column]

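The two statements above make it easy to see which columns contain nan and how many. The main question is whether the nan count is really large. If it is small, filling the values is the usual choice; with tree models such as lgb the gaps can be left as they are so the trees handle them during optimization; if a column has too many nan values, consider dropping it.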
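
The rule of thumb above can be sketched as a small report. This is only an illustration: the helper name summarize_missing and the 0.5 cut-off are assumptions, not values from this article, and the right threshold depends on the column and the model.

```python
import numpy as np
import pandas as pd


def summarize_missing(df, drop_threshold=0.5):
    """Per-column missing ratio plus a rough suggestion.

    drop_threshold is a hypothetical cut-off chosen for illustration.
    """
    ratio = df.isnull().mean()
    ratio = ratio[ratio > 0].sort_values()
    suggestion = np.where(ratio > drop_threshold,
                          'consider dropping',
                          'fill, or leave to a tree model')
    return pd.DataFrame({'missing_ratio': ratio, 'suggestion': suggestion})


toy = pd.DataFrame({
    'a': [1, np.nan, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1],
    'c': [1, 2, 3, 4],
})
report = summarize_missing(toy)
print(report)
```

Working with ratios rather than raw counts makes the same threshold usable on both the 150000-row training set and the 50000-row test set.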

missingno https://blog.csdn.net/Andy_shenzl/article/details/81633356

msno.matrix(Train_data.sample(250))
<matplotlib.axes._subplots.AxesSubplot at 0x1a221d8fd0>

[Figure output_29_1.png: missingno matrix for a 250-row sample of Train_data]

msno.bar(Train_data.sample(1000))
<matplotlib.axes._subplots.AxesSubplot at 0x1a213cbe10>

[Figure output_30_1.png: missingno bar chart for a 1000-row sample of Train_data]

msno.matrix(Test_data.sample(250))
<matplotlib.axes._subplots.AxesSubplot at 0x1a24420b10>

[Figure output_31_1.png: missingno matrix for a 250-row sample of Test_data]

msno.bar(Test_data.sample(1000))
<matplotlib.axes._subplots.AxesSubplot at 0x1a243f4a90>

[Figure output_32_1.png: missingno bar chart for a 1000-row sample of Test_data]

# Check for anomalous values
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

Every column except notRepairedDamage is numeric. Because notRepairedDamage is of object type, it may contain anomalous values; listing its distinct values shows what is going on.

Train_data['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64
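# Replace '-' with nan (assign back instead of a chained inplace call, which newer pandas warns about)
Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)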
Train_data['notRepairedDamage'].value_counts()
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Train_data.isnull().sum()
SaleID                   0
name                     0
regDate                  0
model                    1
brand                    0
bodyType              4506
fuelType              8680
gearbox               5981
power                    0
kilometer                0
notRepairedDamage    24324
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
dtype: int64
Test_data['notRepairedDamage'].value_counts()
0.0    37249
-       8031
1.0     4720
Name: notRepairedDamage, dtype: int64
Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].replace('-', np.nan)
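
Note that after the replacement the column is still of object type. One possible alternative, shown here on a toy Series rather than the real data, is pd.to_numeric with errors='coerce', which turns anything that cannot be parsed as a number into nan and yields a float column in one step. The sketch assumes the column holds strings such as '0.0' and '-', which is what the object dtype suggests but is not confirmed here.

```python
import pandas as pd

# Toy stand-in for the notRepairedDamage column (assumed to hold strings).
s = pd.Series(['0.0', '-', '1.0', '0.0'], name='notRepairedDamage')

# Unparseable entries such as '-' become NaN; the result is float64.
numeric = pd.to_numeric(s, errors='coerce')
print(numeric.dtype)
print(numeric.isnull().sum())
```

The downside is that coercion hides every unparseable token, not just '-', so it is worth checking value_counts() first, as done above.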

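Skewed data
The following two categorical features are severely skewed and generally do not help prediction, so they are dropped here. You can keep digging into them, but it is usually not worthwhile.
How to find skewed features in a dataset: https://blog.csdn.net/Pysamlam/article/details/103982408
https://cloud.tencent.com/developer/article/1584553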

Train_data.iloc[:, 0:10]
SaleID name regDate model brand bodyType fuelType gearbox power kilometer
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0
... ... ... ... ... ... ... ... ... ... ...
149995 149995 163978 20000607 121.0 10 4.0 0.0 1.0 163 15.0
149996 149996 184535 20091102 116.0 11 0.0 0.0 0.0 125 10.0
149997 149997 147587 20101003 60.0 11 1.0 1.0 0.0 90 6.0
149998 149998 45907 20060312 34.0 10 3.0 1.0 0.0 156 15.0
149999 149999 177672 19990204 19.0 28 6.0 0.0 1.0 193 12.5

150000 rows × 10 columns

Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    125676 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Train_data.iloc[:, 0:16]
SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 1046 0 0 20160404 1850
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 NaN 4366 0 0 20160309 3600
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 0.0 2806 0 0 20160402 6222
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 0.0 434 0 0 20160312 2400
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0.0 6977 0 0 20160313 5200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
149995 149995 163978 20000607 121.0 10 4.0 0.0 1.0 163 15.0 0.0 4576 0 0 20160327 5900
149996 149996 184535 20091102 116.0 11 0.0 0.0 0.0 125 10.0 0.0 2826 0 0 20160312 9500
149997 149997 147587 20101003 60.0 11 1.0 1.0 0.0 90 6.0 0.0 3302 0 0 20160328 7500
149998 149998 45907 20060312 34.0 10 3.0 1.0 0.0 156 15.0 0.0 1877 0 0 20160401 4999
149999 149999 177672 19990204 19.0 28 6.0 0.0 1.0 193 12.5 0.0 235 0 0 20160305 4700

150000 rows × 16 columns

#all_features = Train_data.iloc[:,0:16].drop(['price'], axis=1)
all_features = Train_data.drop(['price', 'SaleID'], axis = 1)
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for i in all_features:
    if all_features[i].dtype in numeric_dtypes:
        numeric.append(i)
# Draw box plots for all numeric features
sns.set_style('white')
fg, ax = plt.subplots(figsize=(8,7))
ax.set_xscale("log")
ax = sns.boxplot(data=all_features[numeric], orient='h', palette='Set1')
ax.xaxis.grid(False)
ax.set(ylabel="Feature names")
ax.set(xlabel="Numeric values")
ax.set(title="Numeric Distribution of Features")
sns.despine(trim=True, left=True)

[Figure output_47_0.png: horizontal box plots of the numeric features, log-scaled x axis]

# Find the clearly skewed numeric features
skew_features = all_features[numeric].skew().sort_values(ascending=False)
high_skew = skew_features[skew_features>0.5]
skew_index = high_skew.index
print("There are {} numeric features with Skew > 0.5:".format(high_skew.shape[0]))
skewness = pd.DataFrame({'Skew' :high_skew})
skew_features.head(10)
There are 12 numeric features with Skew > 0.5:
seller      387.298335
power        65.863178
v_7           5.130233
v_2           4.842556
v_11          3.029146
fuelType      1.595486
model         1.484388
gearbox       1.317514
brand         1.150760
bodyType      0.991530
dtype: float64
Train_data['seller'].value_counts()
0    149999
1         1
Name: seller, dtype: int64
Test_data['offerType'].value_counts()
0    50000
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
Train_data['power'].describe()
count    150000.000000
mean        119.316547
std         177.168419
min           0.000000
25%          75.000000
50%         110.000000
75%         150.000000
max       19312.000000
Name: power, dtype: float64
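
The seller and offerType columns were removed because almost every row holds the same value. A rough way to detect such near-constant columns directly, sketched on toy data: the helper name dominant_share and the 0.95 threshold are illustrative assumptions.

```python
import pandas as pd


def dominant_share(df):
    """Share of rows taken by the most frequent value in each column."""
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )


# Toy frame mimicking the shape of the problem, not the real data.
toy = pd.DataFrame({
    'seller': [0] * 99 + [1],
    'offerType': [0] * 100,
    'brand': list(range(10)) * 10,
})
share = dominant_share(toy)
# 0.95 is an illustrative threshold for "near constant".
near_constant = share[share > 0.95].index.tolist()
print(share)
print(near_constant)
```

Unlike the skewness statistic, this check also catches columns with a single value, such as offerType, whose skewness is not informative.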

Understanding the distribution of the target

Train_data['price']
0         1850
1         3600
2         6222
3         2400
4         5200
          ...
149995    5900
149996    9500
149997    7500
149998    4999
149999    4700
Name: price, Length: 150000, dtype: int64
Train_data['price'].value_counts()
500      2337
1500     2158
1200     1922
1000     1850
2500     1821
         ...
25321       1
8886        1
8801        1
37920       1
8188        1
Name: price, Length: 3763, dtype: int64
# Overall distribution
import scipy.stats as st
y = Train_data['price']
plt.figure(1)
plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
<matplotlib.axes._subplots.AxesSubplot at 0x1a30f68950>

[Figure output_56_1.png: price histogram with Johnson SU fit]

plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
<matplotlib.axes._subplots.AxesSubplot at 0x1a34a89890>

[Figure output_57_1.png: price histogram with normal fit]

[Figure output_57_2.png: price histogram with log-normal fit]

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.346487
Kurtosis: 18.995183

[Figure output_58_1.png: distribution plot of price]
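
For readers who want to see what the two numbers measure, here is a minimal sketch based on central moments: skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3. The function name is made up for illustration. pandas' skew() and kurt() apply a small-sample bias correction, so their results differ slightly from these plain moment formulas, though the gap shrinks as the sample grows.

```python
import numpy as np


def moment_skew_kurt(values):
    """Skewness and excess kurtosis from plain (biased) central moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # variance
    m3 = np.mean(centered ** 3)  # asymmetry
    m4 = np.mean(centered ** 4)  # tail weight
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Symmetric data: skewness is zero.
sym_skew, sym_kurt = moment_skew_kurt([1, 2, 3, 4, 5])
# One large value on the right: skewness is positive, as with price.
tail_skew, tail_kurt = moment_skew_kurt([1, 1, 1, 2, 10])
print(sym_skew, sym_kurt, tail_skew, tail_kurt)
```

A skewness well above zero together with a large positive kurtosis, as reported for price, points to a long right tail with a few very expensive cars.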

sns.distplot(Train_data.skew(numeric_only=True), color='blue', axlabel='Skewness')
<matplotlib.axes._subplots.AxesSubplot at 0x1a3c7461d0>

[Figure output_59_1.png: distribution of per-column skewness]

sns.distplot(Train_data.kurt(numeric_only=True), color='orange', axlabel='Kurtosis')
<matplotlib.axes._subplots.AxesSubplot at 0x1a47403850>

[Figure output_60_1.png: distribution of per-column kurtosis]

## 3) Check the frequency of the target values
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

[Figure output_61_0.png: histogram of price]

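The frequencies show that very few values exceed 20000. These could be treated as special values (outliers) and either imputed or dropped in an earlier step.
After a log transform the distribution is much more even, so the prediction can be made on the log-transformed target, which is a common trick in prediction problems.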

plt.hist(np.log(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

[Figure output_63_0.png: histogram of log(price)]
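
When the model is trained on the log of the target, its predictions have to be mapped back to prices. A minimal round-trip sketch on a handful of prices taken from the listings above; it uses np.log1p and np.expm1 rather than the plain np.log used in the plot, a swap made here only because log1p also tolerates zeros.

```python
import numpy as np

# A few prices that appear in the outputs above.
prices = np.array([500, 1500, 1850, 3600, 6222, 37920], dtype=float)

# Forward transform: compress the long right tail.
log_prices = np.log1p(prices)

# Inverse transform: what would be applied to model predictions.
restored = np.expm1(log_prices)

print(log_prices.round(3))
print(restored.round(2))
```

The raw prices span a range of more than 37000, while the transformed values span less than 5, which is why the histogram looks so much more even after the transform.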

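2.3.6 Features fall into categorical and numeric types; check the unique-value distribution of the categorical features

  • name - car code
  • regDate - car registration date
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox
  • power - engine power
  • kilometer - distance driven
  • notRepairedDamage - the car has damage that has not been repaired
  • regionCode - code of the region where the car is viewed
  • seller - seller [deleted]
  • offerType - offer type [deleted]
  • creatDate - date the ad was posted
  • price - car price
  • 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' (embedding vectors derived from a large amount of information such as car reviews and tags) [engineered anonymous features]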
Y_train = Train_data['price']
# This way of separating features suits data that has not been label-encoded directly.
# It does not apply here; the split has to be made by hand from the meaning of each feature.
# Numeric features
# numeric_features = Train_data.select_dtypes(include=[np.number])
# numeric_features.columns
# # Categorical features
# categorical_features = Train_data.select_dtypes(include=['object'])
# categorical_features.columns
# Numeric features
numeric_features = Train_data.select_dtypes(include=[np.number])
numeric_features.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'regionCode', 'creatDate', 'price','v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9','v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
# Categorical features (np.object was removed in NumPy 1.24; the string 'object' works everywhere)
categorical_features = Train_data.select_dtypes(include=['object'])
categorical_features.columns
Index(['notRepairedDamage'], dtype='object')
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]
# Unique-value distribution of each categorical feature
for cat_fea in categorical_features:
    print('Distribution of feature ' + cat_fea + ':')
    print("{} has {} different values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
Distribution of feature name:
name has 99662 different values
708       282
387       282
55        280
1541      263
203       233
         ...
5074        1
7123        1
11221       1
13270       1
174485      1
Name: name, Length: 99662, dtype: int64
Distribution of feature model:
model has 248 different values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...
245.0        2
209.0        2
240.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
Distribution of feature brand:
brand has 40 different values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
Distribution of feature bodyType:
bodyType has 8 different values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
Distribution of feature fuelType:
fuelType has 7 different values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
Distribution of feature gearbox:
gearbox has 2 different values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
notRepairedDamage has 2 different values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
regionCode has 7905 different values
419     369
764     258
125     137
176     136
462     134
       ...
6414      1
7063      1
4239      1
5931      1
7267      1
Name: regionCode, Length: 7905, dtype: int64
# Unique-value distribution of each categorical feature (test set)
for cat_fea in categorical_features:
    print('Distribution of feature ' + cat_fea + ':')
    print("{} has {} different values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
Distribution of feature name:
name has 37453 different values
55       97
708      96
387      95
1541     88
713      74
         ..
22270     1
89855     1
42752     1
48899     1
11808     1
Name: name, Length: 37453, dtype: int64
Distribution of feature model:
model has 247 different values
0.0      3896
19.0     3245
4.0      3007
1.0      1981
29.0     1742
         ...
242.0       1
240.0       1
244.0       1
243.0       1
246.0       1
Name: model, Length: 247, dtype: int64
Distribution of feature brand:
brand has 40 different values
0     10348
4      5763
14     5314
10     4766
1      4532
6      3502
9      2423
5      1569
13     1245
11      919
7       795
3       773
16      771
8       704
25      695
27      650
21      544
15      511
20      450
19      450
12      389
22      363
30      324
17      317
26      303
24      268
28      225
32      193
29      117
31      115
18      106
2       104
37       92
34       77
33       76
36       67
23       62
35       53
38       23
39        2
Name: brand, dtype: int64
bodyType的特征分布如下:
bodyType特征有个8不同的值
0.0    13985
1.0    11882
2.0     9900
3.0     4433
4.0     3303
5.0     2537
6.0     2116
7.0      431
Name: bodyType, dtype: int64
fuelType的特征分布如下:
fuelType特征有个7不同的值
0.0    30656
1.0    15544
2.0      774
3.0       72
4.0       37
6.0       14
5.0       10
Name: fuelType, dtype: int64
gearbox的特征分布如下:
gearbox特征有个2不同的值
0.0    37301
1.0    10789
Name: gearbox, dtype: int64
notRepairedDamage的特征分布如下:
notRepairedDamage特征有个2不同的值
0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
regionCode的特征分布如下:
regionCode特征有个6971不同的值
419     146
764      78
188      52
125      51
759      51
       ...
7753      1
7463      1
7230      1
826       1
112       1
Name: regionCode, Length: 6971, dtype: int64
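
The counts above show that the training and test sets do not cover the same set of values (regionCode, for example, has 7905 distinct values in the training set and 6971 in the test set). A minimal sketch of how one might check for test-set values that never occur in training follows. The helper name `unseen_values` and the toy frames are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def unseen_values(train: pd.Series, test: pd.Series) -> set:
    """Return the values that occur in `test` but never in `train` (NaN ignored)."""
    return set(test.dropna().unique()) - set(train.dropna().unique())

# Toy stand-ins for Train_data / Test_data; the values are made up.
toy_train = pd.DataFrame({"brand": [0, 4, 14, 10], "gearbox": [0.0, 1.0, 0.0, None]})
toy_test = pd.DataFrame({"brand": [0, 4, 39], "gearbox": [1.0, 0.0, 0.0]})

# One entry per column: the set of values seen only in the test frame.
report = {col: unseen_values(toy_train[col], toy_test[col]) for col in toy_train.columns}
print(report)
```

On the real data the same dictionary comprehension could be run over `categorical_features` with `Train_data` and `Test_data` in place of the toy frames.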

2.3.7 Numeric feature analysis

numeric_features.append('price')
numeric_features
['power','kilometer','v_0','v_1','v_2','v_3','v_4','v_5','v_6','v_7','v_8','v_9','v_10','v_11','v_12','v_13','v_14','price']
Train_data.head()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482

5 rows × 29 columns

# Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False), '\n')
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64
f, ax = plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price', y=1,size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x1a460ce610>

[Figure: heatmap of the correlation matrix of the numeric features (output_77_1.png)]
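
In the sorted correlation output above, v_3 appears last even though its correlation with price (-0.73) is the largest in absolute value. A small sketch of ranking features by absolute correlation instead is given here. The function `rank_by_abs_corr` and the synthetic columns are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def rank_by_abs_corr(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every column with `target`, ordered by absolute value."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)  # keeps the sign, changes only the order

# Synthetic data: one positively related, one negatively related, one unrelated column.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "pos": x,
    "neg": -x + rng.normal(scale=0.1, size=500),
    "noise": rng.normal(size=500),
})
toy["target"] = x

ranked = rank_by_abs_corr(toy, "target")
print(ranked)
```

With the notebook's variables, `rank_by_abs_corr(Train_data[numeric_features], 'price')` would place v_3 at the top next to v_12 and v_8.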

del price_numeric['price']
# Check the skewness and kurtosis of each numeric feature
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          ' ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt()))
power           Skewness: 65.86   Kurtosis: 5733.45
kilometer       Skewness: -1.53   Kurtosis: 001.14
v_0             Skewness: -1.32   Kurtosis: 003.99
v_1             Skewness: 00.36   Kurtosis: -01.75
v_2             Skewness: 04.84   Kurtosis: 023.86
v_3             Skewness: 00.11   Kurtosis: -00.42
v_4             Skewness: 00.37   Kurtosis: -00.20
v_5             Skewness: -4.74   Kurtosis: 022.93
v_6             Skewness: 00.37   Kurtosis: -01.74
v_7             Skewness: 05.13   Kurtosis: 025.85
v_8             Skewness: 00.20   Kurtosis: -00.64
v_9             Skewness: 00.42   Kurtosis: -00.32
v_10            Skewness: 00.03   Kurtosis: -00.58
v_11            Skewness: 03.03   Kurtosis: 012.57
v_12            Skewness: 00.37   Kurtosis: 000.27
v_13            Skewness: 00.27   Kurtosis: -00.44
v_14            Skewness: -1.19   Kurtosis: 002.39
price           Skewness: 03.35   Kurtosis: 019.00
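
Two notes on the table above. pandas' `kurt()` reports excess kurtosis (a normal distribution scores 0), which is why several features show negative values. The price column is strongly right-skewed (skewness 3.35). A common remedy for this, which is not something the original notebook does at this point, is a log transform. The sketch below uses a synthetic log-normal column as a stand-in for price to show the effect on skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "price" column (log-normal); not the competition data.
toy_price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))

skew_raw = toy_price.skew()            # skewness of the raw values
skew_log = np.log1p(toy_price).skew()  # skewness after log(1 + x)
print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```

If a model is trained on `np.log1p(price)`, its predictions have to be mapped back with `np.expm1` before they are scored.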
## 3) Visualize the distribution of each numeric feature
#https://blog.csdn.net/weixin_42398658/article/details/82960379
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

[Figure: distribution plot of each numeric feature, one panel per feature (output_80_0.png)]

f
variable value
0 power 60.0
1 power 0.0
2 power 163.0
3 power 193.0
4 power 68.0
... ... ...
2699995 price 5900.0
2699996 price 9500.0
2699997 price 7500.0
2699998 price 4999.0
2699999 price 4700.0

2700000 rows × 2 columns
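
The row count follows from how `pd.melt` works: each of the 18 columns in `numeric_features` is stacked into the `value` column, so 150000 rows × 18 columns gives 2700000 rows. A minimal sketch on a three-row frame (the values are made up) shows the same reshaping.

```python
import pandas as pd

# Three rows and three value columns; values are made up for illustration.
toy = pd.DataFrame({
    "power": [60, 0, 163],
    "kilometer": [12.5, 15.0, 12.5],
    "price": [100, 200, 300],
})

# Wide to long: one (variable, value) row per original cell.
long = pd.melt(toy, value_vars=["power", "kilometer", "price"])
print(long.shape)
print(long.head())
```

`FacetGrid(f, col="variable")` then draws one panel per distinct entry in the `variable` column.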

# Visualize the pairwise relationships between numeric features
# pairplot https://www.jianshu.com/p/6e18d21a4cad
sns.set()
columns = ['price', 'v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
# `size` was renamed to `height` in seaborn 0.9
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()

[Figure: pairwise scatter plots of price and the selected numeric features, with KDE plots on the diagonal (output_82_0.png)]

# Visualize the regression relationship between price and several variables
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)
v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)
v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)
power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)
v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)
v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)
v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)
v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)
v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)
v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
<matplotlib.axes._subplots.AxesSubplot at 0x1a45584790>

[Figure: regression plots of price against v_12, v_8, v_0, power, v_5, v_2, v_6, v_1, v_14 and v_13 (output_83_1.png)]

2.3.8 Categorical feature analysis

# Number of distinct values per categorical feature
for fea in categorical_features:
    print(Train_data[fea].nunique())
99662
248
40
8
7
2
2
7905
categorical_features
['name','model','brand','bodyType','fuelType','gearbox','notRepairedDamage','regionCode']
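
The counts above show why name (99662 distinct values) and regionCode (7905) are too sparse to plot per category. The notebook picks the remaining features by hand; the same selection could be sketched with a cardinality threshold. The helper `low_cardinality` and the threshold are assumptions for illustration. With the counts above, any threshold between 248 and 7905 would reproduce the hand-picked list.

```python
import pandas as pd

def low_cardinality(df: pd.DataFrame, columns, max_unique: int) -> list:
    """Keep the columns whose number of distinct values is at most `max_unique`."""
    return [c for c in columns if df[c].nunique() <= max_unique]

# Toy frame: "name" is unique per row, the other two columns have few values.
toy = pd.DataFrame({
    "name": range(100),
    "brand": [i % 5 for i in range(100)],
    "gearbox": [i % 2 for i in range(100)],
})

selected = low_cardinality(toy, ["name", "brand", "gearbox"], max_unique=10)
print(selected)
```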
# Box plot visualization of the categorical features
# Box plots: https://blog.csdn.net/bi_hu_man_wu/article/details/80807287
# category dtype: https://blog.csdn.net/liuweiyuxiang/article/details/78185475
# First plot the non-sparse categorical features, i.e. all except name and regionCode
categorical_features = ['model','brand','bodyType','fuelType','gearbox','notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        # print(Train_data[c].cat)
        # 'MISSING' must be registered as a category before it can be used as a fill value
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')
    # print(Train_data[c])

def boxplot(x, y, **kargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col='variable', col_wrap = 2, sharex=False, sharey=False)
g = g.map(boxplot, 'value', 'price')

[Figure: box plots of price for each categorical feature (output_87_0.png)]
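
The loop above calls `cat.add_categories` before `fillna` because a categorical column only accepts fill values that are already among its categories. A self-contained sketch of that two-step pattern on a toy gearbox column (made-up values) is shown here.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry, converted to the category dtype.
s = pd.Series([0.0, 1.0, np.nan, 0.0], name="gearbox").astype("category")

# Step 1: register the placeholder as a valid category.
# Step 2: fill the missing entries with it.
s = s.cat.add_categories(["MISSING"]).fillna("MISSING")

print(s.cat.categories)
print(s.value_counts())
```

Treating missing values as their own category lets the box plots and count plots show them as a separate group instead of silently dropping them.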

Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
# Violin plot visualization
# Box plots and violin plots: https://www.cnblogs.com/zhhfan/p/11344310.html
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

[Figures: violin plots of price for model, brand, bodyType, fuelType, gearbox and notRepairedDamage (output_89_0.png to output_89_5.png)]

categorical_features
['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
# Bar plot visualization
def bar_plot(x, y, **kargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False)
g = g.map(bar_plot, 'value', 'price')

[Figure: bar plots of price for each categorical feature (output_91_0.png)]

# Category frequency visualization
# countplot barplot https://blog.csdn.net/BF02jgtRS00XKtCx/article/details/103998227
def count_plot(x, **kargs):
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col='variable', col_wrap=2, sharex=False, sharey=False)
g = g.map(count_plot, 'value')

[Figure: count plots of each categorical feature (output_92_0.png)]

for cat in categorical_features:
    print(Train_data[cat].value_counts())
0.0        11762
19.0        9573
4.0         8445
1.0         6038
29.0        5186
           ...
240.0          2
242.0          2
245.0          2
247.0          1
MISSING        1
Name: model, Length: 249, dtype: int64
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
0.0        41420
1.0        35272
2.0        30324
3.0        13491
4.0         9609
5.0         7607
6.0         6482
MISSING     4506
7.0         1289
Name: bodyType, dtype: int64
0.0        91656
1.0        46991
MISSING     8680
2.0         2212
3.0          262
4.0          118
5.0           45
6.0           36
Name: fuelType, dtype: int64
0.0        111623
1.0         32396
MISSING      5981
Name: gearbox, dtype: int64
0.0        111361
MISSING     24324
1.0         14315
Name: notRepairedDamage, dtype: int64
categorical_features
['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
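
The raw counts above are easier to compare as shares. As a small worked example, the sketch below copies the three notRepairedDamage counts from the output and converts them to fractions of the 150000 training rows.

```python
import pandas as pd

# Counts copied from the notRepairedDamage output above.
counts = pd.Series({"0.0": 111361, "MISSING": 24324, "1.0": 14315})

# Fraction of rows in each category.
share = counts / counts.sum()
print(share.round(4))
```

About 16% of the rows have no notRepairedDamage value. On the real column, `Train_data['notRepairedDamage'].value_counts(normalize=True)` returns the same shares directly.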

2.3.9 Generate a data report

import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file('./example.html')

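Data exploration helps us discover properties of the data and relationships within it, which is very useful for the feature construction that follows.

  1. For an initial analysis (inspecting the data directly, or using summary functions such as .sum(), .mean() and .describe()), look at: the number of samples, the size of the training set, whether there are time features, whether it is a time-series problem, what each (non-anonymous) feature means, the feature types (string, int, float, time), how missing values appear (note how they are represented in the data: some are empty, others are a marker such as "NAN"), and the mean and variance of each feature.

  2. Analyze and record how to handle features whose values are missing in more than 30% of the samples; this helps later model validation and tuning. Decide whether such a feature should be filled (and how: mean, zero, mode, and so on), dropped, or handled by first splitting the samples into groups and predicting each group with a model built on different features.

  3. Analyze outliers separately: check whether samples with abnormal feature values also have an abnormal label (far from the mean, or a special symbol), whether outliers should be removed or replaced with normal values, and whether they come from recording errors or from faults in the machine itself.

  4. Analyze the label separately, for example its distribution.

  5. Further analysis can plot the features, and plot features jointly with the label (statistical plots, scatter plots), to get an intuitive view of the feature distributions; this step can also reveal outliers in the data. Box plots show how far feature values deviate, and joint plots of feature against feature and of feature against label help analyze the relationships between them.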
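
Points 2 and 3 of the checklist above can be sketched in a few lines. The 30% threshold comes from the text; the 1.5 × IQR rule is the usual box-plot convention and is an assumption here, not something the notebook prescribes. The helper names and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_ratio(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy frame with made-up values: one extreme power reading, several missing flags.
toy = pd.DataFrame({
    "power": [60, 0, 163, 193, 68, 75, 110, 90, 19000],
    "notRepairedDamage": [0.0, np.nan, np.nan, 1.0, np.nan, 0.0, np.nan, 0.0, 0.0],
})

ratios = missing_ratio(toy)
high_missing = ratios[ratios > 0.3].index.tolist()  # columns above the 30% mark
outliers = toy.loc[iqr_outlier_mask(toy["power"]), "power"].tolist()

print(high_missing)
print(outliers)
```

Whether a flagged value is dropped, clipped, or kept is a modelling decision; the mask only marks candidates for a closer look.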

