

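  1. Ask the question
  2. Understand the data
  • Collect the data
  • Import the data
  • Inspect the dataset
  3. Clean the data
  • Data preprocessing
  • Feature engineering
  4. Build the model
  5. Evaluate the model
  6. Put the solution into practice
  • Submit the results to Kaggle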




2.1 Collecting the data



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
train = pd.read_csv("/Users/qxh/Desktop/titanic/train.csv")
test = pd.read_csv("/Users/qxh/Desktop/titanic/test.csv")
Training set shape: (891, 12)
Test set shape: (418, 11)
# DataFrame.append was removed in pandas 2.0; use pd.concat to stack the two sets
full = pd.concat([train, test], ignore_index=True)
Combined dataset shape: (1309, 12)

2.3 Inspecting the dataset

Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 female 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 5 3 male 0 0.0 373450
Age Fare Parch PassengerId Pclass SibSp Survived
count 1046.000000 1308.000000 1309.000000 1309.000000 1309.000000 1309.000000 891.000000
mean 29.881138 33.295479 0.385027 655.000000 2.294882 0.498854 0.383838
std 14.413493 51.758668 0.865560 378.020061 0.837836 1.041658 0.486592
min 0.170000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000
25% 21.000000 7.895800 0.000000 328.000000 2.000000 0.000000 0.000000
50% 28.000000 14.454200 0.000000 655.000000 3.000000 0.000000 0.000000
75% 39.000000 31.275000 0.000000 982.000000 3.000000 1.000000 1.000000
max 80.000000 512.329200 9.000000 1309.000000 3.000000 8.000000 1.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
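
The `info()` output above lists non-null counts, so the gaps have to be worked out by subtraction. A minimal sketch of counting missing values per column directly; the `toy` frame is made-up illustration data standing in for `full`, not the Titanic data itself:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `full`
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Number of missing values in each column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```

Running the same two calls on `full` would list Cabin, Survived, Age, Embarked and Fare as the columns with gaps, matching the counts in the `info()` output.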

3. Data cleaning




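  • Age: 1,046 of 1,309 values are present and 263 are missing; fill with the mean.
  • Ticket fare (Fare): 1,308 values are present and 1 is missing; fill with the mean.
  • Port of embarkation (Embarked): 1,307 values are present and 2 are missing; fill with the most frequent value.
  • Cabin number (Cabin): only 295 values are present and 1,014 are missing; with so many gaps, fill with a new marker U (unknown).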
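
The filling code itself does not appear in this export. The four rules above could be sketched as follows; the `toy` frame is invented for illustration, and the marker `"U"` is taken from the Cabin output shown later in the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same four columns
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
    "Embarked": ["S", None, "S"],
    "Cabin": ["C85", None, None],
})

# Numeric columns: fill with the column mean
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].mean())
# Categorical column with few gaps: fill with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
# Column that is mostly missing: fill with an explicit "unknown" marker
toy["Cabin"] = toy["Cabin"].fillna("U")
print(toy)
```

Survived is deliberately left alone: its 418 missing values are the test-set labels that the model is supposed to predict.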
count     1307
unique       3
top          S
freq       914
Name: Embarked, dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3.2 Feature engineering


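  1. Numeric data:
  • Passenger ID (PassengerId)
  • Age (Age)
  • Ticket fare (Fare)
  • Number of same-generation relatives aboard, i.e. siblings and spouses (SibSp)
  • Number of different-generation relatives aboard, i.e. parents and children (Parch)
  2. Time series: none
  3. Categorical data (direct categories):
  • Sex (Sex): male, female
  • Port of embarkation (Embarked): S = Southampton, England (departure point); C = Cherbourg, France (first stop); Q = Queenstown, Ireland (second stop)
  • Passenger class (Pclass): 1 = first class, 2 = second class, 3 = third class
  4. Categorical data (strings): features may be extractable from these
  • Passenger name (Name)
  • Cabin number (Cabin)
  • Ticket number (Ticket)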
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

3.2.1 Categorical data (direct categories)


  • Sex (Sex)
sex_mapDict = {'male':1,'female':0}
full['Sex'] = full['Sex'].map(sex_mapDict)
  • Port of embarkation (Embarked)
embarkedDf = pd.DataFrame()
embarkedDf = pd.get_dummies(full['Embarked'],prefix='Embarked')
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
full = pd.concat([full,embarkedDf],axis=1)
  • Passenger class (Pclass)
pcalssDf = pd.DataFrame()
pcalssDf = pd.get_dummies(full['Pclass'],prefix='Pclass')
Pclass_1 Pclass_2 Pclass_3
0 0 0 1
1 1 0 0
2 0 0 1
3 1 0 0
4 0 0 1
full = pd.concat([full,pcalssDf],axis=1)
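
The one-hot encoding used for Embarked and Pclass can be tried on a few rows in isolation. This is only a sketch on invented data; `dtype=int` is an addition here so that the columns print as 0/1 on pandas versions where `get_dummies` defaults to booleans:

```python
import pandas as pd

# Hypothetical four-passenger sample
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# One new 0/1 column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one 1 across the new columns, which is why the original column can be dropped afterwards without losing information.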


  • Extract the title from the name
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
def get_title(name):
    str1 = name.split(',')[1]
    str2 = str1.split('.')[0]
    # strip() removes the given characters from both ends; here it removes the surrounding spaces
    str3 = str2.strip()
    return str3
titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(get_title)
the Countess
title_mapDict = {
    "Capt":         "Officer",
    "Col":          "Officer",
    "Major":        "Officer",
    "Jonkheer":     "Royalty",
    "Don":          "Royalty",
    "Sir":          "Royalty",
    "Dr":           "Officer",
    "Rev":          "Officer",
    "the Countess": "Royalty",
    "Dona":         "Royalty",
    "Mme":          "Mrs",
    "Mlle":         "Miss",
    "Ms":           "Mrs",
    "Mr":           "Mr",
    "Mrs":          "Mrs",
    "Miss":         "Miss",
    "Master":       "Master",
    "Lady":         "Royalty",
}
titleDf['Title'] = titleDf['Title'].map(title_mapDict)
titleDf = pd.get_dummies(titleDf['Title'])
Master Miss Mr Mrs Officer Royalty
0 0 0 1 0 0 0
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
full = pd.concat([full,titleDf],axis=1)
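
As an alternative to the split-based `get_title`, the title can be pulled out with a vectorized regular expression. This is a sketch rather than the notebook's method; the three names are sample strings and `mapping` is a deliberately shortened, hypothetical stand-in for `title_mapDict`:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the next period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical, shortened mapping; titles missing from it become NaN after map()
mapping = {"Mr": "Mr", "Miss": "Miss"}
grouped = titles.map(mapping)

# List the titles the mapping does not cover before one-hot encoding
unmapped = titles[grouped.isnull()].unique()
print(unmapped)
```

Checking for unmapped titles is worthwhile because `map` silently produces NaN for any key missing from the dictionary, and `get_dummies` would then give that passenger an all-zero row.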
  • Cabin number (Cabin)
0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c : c[0])
0    U
1    C
2    U
3    C
4    U
Name: Cabin, dtype: object
cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )
Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 0 0 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1
3 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1
full = pd.concat([full,cabinDf],axis=1)
full.drop('Cabin',axis=1,inplace= True)
Age Fare Parch PassengerId Pclass Sex SibSp Survived Ticket Embarked_C ... Royalty Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U
0 22.0 7.2500 0 1 3 1 1 0.0 A/5 21171 0 ... 0 0 0 0 0 0 0 0 0 1
1 38.0 71.2833 0 2 1 0 1 1.0 PC 17599 1 ... 0 0 0 1 0 0 0 0 0 0
2 26.0 7.9250 0 3 3 0 0 1.0 STON/O2. 3101282 0 ... 0 0 0 0 0 0 0 0 0 1
3 35.0 53.1000 0 4 1 0 1 1.0 113803 0 ... 0 0 0 1 0 0 0 0 0 0
4 35.0 8.0500 0 5 3 1 0 0.0 373450 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 27 columns

3.2.3 Numeric data

  • Family size and family category
familyDf = pd.DataFrame()
# Family size = parents and children (Parch) + siblings and spouses (SibSp) + the passenger
familyDf['family_size'] = full['Parch'] + full['SibSp'] + 1
count    1309.000000
mean        1.883881
std         1.583639
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: family_size, dtype: float64
%matplotlib notebook
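# Family categories:
#   family_single: family size == 1 (travelling alone)
#   family_small:  2 <= family size <= 4 (medium family)
#   family_large:  family size >= 5 (large family)
familyDf['family_single'] = familyDf['family_size'].map(lambda s : 1 if s==1 else 0)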
familyDf['family_small'] = familyDf['family_size'].map(lambda s : 1 if 2<=s<=4 else 0)
familyDf['family_large'] = familyDf['family_size'].map(lambda s : 1 if s>4 else 0)
family_size family_single family_small family_large
0 2 0 1 0
1 2 0 1 0
2 1 1 0 0
3 2 0 1 0
4 1 1 0 0
full = pd.concat([full,familyDf],axis=1)
full.drop([ 'Parch','SibSp','family_size' ],axis=1, inplace=True)
Age Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S ... Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large
0 22.0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 ... 0 0 0 0 0 0 1 0 1 0
1 38.0 71.2833 2 1 0 1.0 PC 17599 1 0 0 ... 1 0 0 0 0 0 0 0 1 0
2 26.0 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 ... 0 0 0 0 0 0 1 1 0 0
3 35.0 53.1000 4 1 0 1.0 113803 0 0 1 ... 1 0 0 0 0 0 0 0 1 0
4 35.0 8.0500 5 3 1 0.0 373450 0 0 1 ... 0 0 0 0 0 0 1 1 0 0

5 rows × 28 columns
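
The three lambda-based indicator columns can also be produced by binning. The sketch below uses `pd.cut` with the same boundaries on a few invented family sizes; it is an alternative formulation, not the notebook's code:

```python
import numpy as np
import pandas as pd

# Hypothetical family sizes covering every category boundary
family_size = pd.Series([1, 2, 4, 5, 11], name="family_size")

# Right-closed bins: (0, 1] single, (1, 4] small, (4, inf) large
category = pd.cut(
    family_size,
    bins=[0, 1, 4, np.inf],
    labels=["family_single", "family_small", "family_large"],
)

# One indicator column per category, as in the lambda-based version
family_dummies = pd.get_dummies(category, dtype=int)
print(family_dummies)
```

Keeping the boundaries in one `bins` list makes it harder for the three conditions to drift apart, for example one written as `s > 4` and another as `s >= 5`.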

  • Age (Age) and ticket fare (Fare)


import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# Series.reshape is deprecated; reshape the underlying NumPy array instead
age_scale_param = scaler.fit(full['Age'].values.reshape(-1,1))
# Reuse the fitted scaler to transform, and flatten the result back to one dimension
full['Age_scaled'] = age_scale_param.transform(full['Age'].values.reshape(-1,1)).ravel()
Age Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled
0 22.0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 ... 0 0 0 0 0 1 0 1 0 -0.611972
1 38.0 71.2833 2 1 0 1.0 PC 17599 1 0 0 ... 0 0 0 0 0 0 0 1 0 0.630431
2 26.0 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 ... 0 0 0 0 0 1 1 0 0 -0.301371
3 35.0 53.1000 4 1 0 1.0 113803 0 0 1 ... 0 0 0 0 0 0 0 1 0 0.397481
4 35.0 8.0500 5 3 1 0.0 373450 0 0 1 ... 0 0 0 0 0 1 1 0 0 0.397481

5 rows × 29 columns

full.drop([ 'Age'],axis=1, inplace=True)
Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S Master ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled
0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 0 ... 0 0 0 0 0 1 0 1 0 -0.611972
1 71.2833 2 1 0 1.0 PC 17599 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0.630431
2 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 0 ... 0 0 0 0 0 1 1 0 0 -0.301371
3 53.1000 4 1 0 1.0 113803 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0.397481
4 8.0500 5 3 1 0.0 373450 0 0 1 0 ... 0 0 0 0 0 1 1 0 0 0.397481

5 rows × 28 columns

# Refit the same scaler on Fare, again reshaping the NumPy array rather than the Series
fare_scale_param = scaler.fit(full['Fare'].values.reshape(-1,1))
full['Fare_scaled'] = fare_scale_param.transform(full['Fare'].values.reshape(-1,1)).ravel()
Fare PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S Master ... Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled Fare_scaled
0 7.2500 1 3 1 0.0 A/5 21171 0 0 1 0 ... 0 0 0 0 1 0 1 0 -0.611972 -0.503595
1 71.2833 2 1 0 1.0 PC 17599 1 0 0 0 ... 0 0 0 0 0 0 1 0 0.630431 0.734503
2 7.9250 3 3 0 1.0 STON/O2. 3101282 0 0 1 0 ... 0 0 0 0 1 1 0 0 -0.301371 -0.490544
3 53.1000 4 1 0 1.0 113803 0 0 1 0 ... 0 0 0 0 0 0 1 0 0.397481 0.382925
4 8.0500 5 3 1 0.0 373450 0 0 1 0 ... 0 0 0 0 1 1 0 0 0.397481 -0.488127

5 rows × 29 columns

full.drop([ 'Fare'],axis=1, inplace=True)
PassengerId Pclass Sex Survived Ticket Embarked_C Embarked_Q Embarked_S Master Miss ... Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled Fare_scaled
0 1 3 1 0.0 A/5 21171 0 0 1 0 0 ... 0 0 0 0 1 0 1 0 -0.611972 -0.503595
1 2 1 0 1.0 PC 17599 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 0.630431 0.734503
2 3 3 0 1.0 STON/O2. 3101282 0 0 1 0 1 ... 0 0 0 0 1 1 0 0 -0.301371 -0.490544
3 4 1 0 1.0 113803 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0.397481 0.382925
4 5 3 1 0.0 373450 0 0 1 0 0 ... 0 0 0 0 1 1 0 0 0.397481 -0.488127

5 rows × 28 columns
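
`StandardScaler` replaces each value x with (x - mean) / std, where std is the population standard deviation (divisor n, not n - 1). A small check of that formula, using the first four ages from the table only as sample numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four sample ages as a column vector, the 2-D shape scikit-learn expects
ages = np.array([22.0, 38.0, 26.0, 35.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(ages)

# Same computation by hand; NumPy's std() uses ddof=0 by default
manual = (ages - ages.mean()) / ages.std()
print(np.allclose(scaled, manual))
```

The values differ from the Age_scaled column above because the notebook's mean and standard deviation come from all 1,309 rows, not from these four.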

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
PassengerId      1309 non-null int64
Pclass           1309 non-null int64
Sex              1309 non-null int64
Survived         891 non-null float64
Ticket           1309 non-null object
Embarked_C       1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
Master           1309 non-null uint8
Miss             1309 non-null uint8
Mr               1309 non-null uint8
Mrs              1309 non-null uint8
Officer          1309 non-null uint8
Royalty          1309 non-null uint8
Cabin_A          1309 non-null uint8
Cabin_B          1309 non-null uint8
Cabin_C          1309 non-null uint8
Cabin_D          1309 non-null uint8
Cabin_E          1309 non-null uint8
Cabin_F          1309 non-null uint8
Cabin_G          1309 non-null uint8
Cabin_T          1309 non-null uint8
Cabin_U          1309 non-null uint8
family_single    1309 non-null int64
family_small     1309 non-null int64
family_large     1309 non-null int64
Age_scaled       1309 non-null float64
Fare_scaled      1309 non-null float64
dtypes: float64(3), int64(6), object(1), uint8(18)
memory usage: 125.4+ KB



corrDf = full.corr()
PassengerId Pclass Sex Survived Embarked_C Embarked_Q Embarked_S Master Miss Mr ... Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U family_single family_small family_large Age_scaled Fare_scaled
PassengerId 1.000000 -0.038354 0.013406 -0.005007 0.048101 0.011585 -0.049836 0.002254 -0.050027 0.014116 ... -0.008136 0.000306 -0.045949 -0.023049 0.000208 0.028546 0.002975 -0.063415 0.025731 0.031416
Pclass -0.038354 1.000000 0.124617 -0.338481 -0.269658 0.230491 0.091320 0.095257 0.024487 0.121492 ... -0.225649 0.013122 0.052133 -0.042750 0.713857 0.147393 -0.218303 0.127306 -0.366371 -0.558477
Sex 0.013406 0.124617 1.000000 -0.543351 -0.066564 -0.088651 0.115193 0.164375 -0.672819 0.870678 ... -0.040340 -0.006655 -0.083285 0.020558 0.137396 0.284537 -0.255196 -0.077748 0.057397 -0.185484
Survived -0.005007 -0.338481 -0.543351 1.000000 0.168240 0.003650 -0.149683 0.085221 0.332795 -0.549199 ... 0.145321 0.057935 0.016040 -0.026456 -0.316912 -0.203367 0.279855 -0.125147 -0.070323 0.257307
Embarked_C 0.048101 -0.269658 -0.066564 0.168240 1.000000 -0.164166 -0.778262 -0.014172 -0.014351 -0.065538 ... 0.027566 -0.020010 -0.031566 -0.014095 -0.258257 -0.107874 0.159594 -0.092825 0.076179 0.286241
Embarked_Q 0.011585 0.230491 -0.088651 0.003650 -0.164166 1.000000 -0.491656 -0.009091 0.198804 -0.080224 ... -0.042877 -0.020282 -0.019941 -0.008904 0.142369 0.127214 -0.122491 -0.018423 -0.012718 -0.130054
Embarked_S -0.049836 0.091320 0.115193 -0.149683 -0.778262 -0.491656 1.000000 0.018297 -0.113886 0.108924 ... 0.002960 0.030575 0.040560 0.018111 0.137351 0.014246 -0.062909 0.093671 -0.059153 -0.169894
Master 0.002254 0.095257 0.164375 0.085221 -0.014172 -0.009091 0.018297 1.000000 -0.110595 -0.258902 ... 0.001860 0.058311 -0.013690 -0.006113 0.041178 -0.265355 0.120166 0.301809 -0.363923 0.011596
Miss -0.050027 0.024487 -0.672819 0.332795 -0.014351 0.198804 -0.113886 -0.110595 1.000000 -0.585809 ... 0.008700 -0.003088 0.061881 -0.013832 -0.004364 -0.023890 -0.018085 0.083422 -0.254146 0.092051
Mr 0.014116 0.121492 0.870678 -0.549199 -0.065538 -0.080224 0.108924 -0.258902 -0.585809 1.000000 ... -0.032953 -0.026403 -0.072514 0.023611 0.131807 0.386262 -0.300872 -0.194207 0.165476 -0.192192
Mrs 0.033299 -0.179945 -0.571176 0.344935 0.098379 -0.100374 -0.022950 -0.093887 -0.212435 -0.497310 ... 0.045538 0.013376 0.042547 -0.011742 -0.162253 -0.354649 0.361247 0.012893 0.198091 0.139235
Officer 0.002231 -0.137341 0.087288 -0.031316 0.003678 -0.003212 -0.001202 -0.029567 -0.066899 -0.156611 ... -0.024048 -0.017076 -0.008281 -0.003698 -0.067030 0.013303 0.003966 -0.034572 0.162818 0.028696
Royalty 0.004400 -0.104916 -0.020408 0.033391 0.077213 -0.021853 -0.054250 -0.015002 -0.033945 -0.079466 ... -0.012202 -0.008665 -0.004202 -0.001876 -0.071672 0.008761 -0.000073 -0.017542 0.059466 0.026214
Cabin_A -0.002831 -0.202143 0.047561 0.022287 0.094914 -0.042105 -0.056984 -0.000711 -0.035697 0.015372 ... -0.023510 -0.016695 -0.008096 -0.003615 -0.242399 0.045227 -0.029546 -0.033799 0.125177 0.020094
Cabin_B 0.015895 -0.353414 -0.094453 0.175095 0.161595 -0.073613 -0.095790 -0.017168 0.035069 -0.096776 ... -0.041103 -0.029188 -0.014154 -0.006320 -0.423794 -0.087912 0.084268 0.013470 0.113458 0.393743
Cabin_C 0.006092 -0.430044 -0.077473 0.114652 0.158043 -0.059151 -0.101861 -0.047456 -0.013418 -0.068072 ... -0.050016 -0.035516 -0.017224 -0.007691 -0.515684 -0.137498 0.141925 0.001362 0.167993 0.401370
Cabin_D 0.000549 -0.265341 -0.057396 0.150716 0.107782 -0.061459 -0.056023 -0.042192 -0.012516 -0.030261 ... -0.034317 -0.024369 -0.011817 -0.005277 -0.353822 -0.074310 0.102432 -0.049336 0.132886 0.072737
Cabin_E -0.008136 -0.225649 -0.040340 0.145321 0.027566 -0.042877 0.002960 0.001860 0.008700 -0.032953 ... 1.000000 -0.022961 -0.011135 -0.004972 -0.333381 -0.042535 0.068007 -0.046485 0.106600 0.073949
Cabin_F 0.000306 0.013122 -0.006655 0.057935 -0.020010 -0.020282 0.030575 0.058311 -0.003088 -0.026403 ... -0.022961 1.000000 -0.007907 -0.003531 -0.236733 0.004055 0.012756 -0.033009 -0.072644 -0.037567
Cabin_G -0.045949 0.052133 -0.083285 0.016040 -0.031566 -0.019941 0.040560 -0.013690 0.061881 -0.072514 ... -0.011135 -0.007907 1.000000 -0.001712 -0.114803 -0.076397 0.087471 -0.016008 -0.085977 -0.022857
Cabin_T -0.023049 -0.042750 0.020558 -0.026456 -0.014095 -0.008904 0.018111 -0.006113 -0.013832 0.023611 ... -0.004972 -0.003531 -0.001712 1.000000 -0.051263 0.022411 -0.019574 -0.007148 0.032461 0.001179
Cabin_U 0.000208 0.713857 0.137396 -0.316912 -0.258257 0.142369 0.137351 0.041178 -0.004364 0.131807 ... -0.333381 -0.236733 -0.114803 -0.051263 1.000000 0.175812 -0.211367 0.056438 -0.271918 -0.507197
family_single 0.028546 0.147393 0.284537 -0.203367 -0.107874 0.127214 0.014246 -0.265355 -0.023890 0.386262 ... -0.042535 0.004055 -0.076397 0.022411 0.175812 1.000000 -0.873398 -0.318944 0.116675 -0.274826
family_small 0.002975 -0.218303 -0.255196 0.279855 0.159594 -0.122491 -0.062909 0.120166 -0.018085 -0.300872 ... 0.068007 0.012756 0.087471 -0.019574 -0.211367 -0.873398 1.000000 -0.183007 -0.038189 0.197281
family_large -0.063415 0.127306 -0.077748 -0.125147 -0.092825 -0.018423 0.093671 0.301809 0.083422 -0.194207 ... -0.046485 -0.033009 -0.016008 -0.007148 0.056438 -0.318944 -0.183007 1.000000 -0.161210 0.170853
Age_scaled 0.025731 -0.366371 0.057397 -0.070323 0.076179 -0.012718 -0.059153 -0.363923 -0.254146 0.165476 ... 0.106600 -0.072644 -0.085977 0.032461 -0.271918 0.116675 -0.038189 -0.161210 1.000000 0.171521
Fare_scaled 0.031416 -0.558477 -0.185484 0.257307 0.286241 -0.130054 -0.169894 0.011596 0.092051 -0.192192 ... 0.073949 -0.037567 -0.022857 0.001179 -0.507197 -0.274826 0.197281 0.170853 0.171521 1.000000

27 rows × 27 columns

corrDf['Survived'].sort_values(ascending =False)
Survived         1.000000
Mrs              0.344935
Miss             0.332795
family_small     0.279855
Fare_scaled      0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Master           0.085221
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
Age_scaled      -0.070323
family_large    -0.125147
Embarked_S      -0.149683
family_single   -0.203367
Cabin_U         -0.316912
Pclass          -0.338481
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
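
Sorting by signed value puts the strongest negative correlations (Mr, Sex) at the bottom of the list, although they are the most informative features. A sketch of ranking by absolute correlation instead, on an invented six-row frame:

```python
import pandas as pd

# Hypothetical data: Sex is perfectly anti-correlated with Survived here
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [1, 0, 0, 1, 0, 1],
    "Fare":     [7.0, 71.0, 8.0, 8.0, 53.0, 9.0],
})

# Correlation of every feature with the label, excluding the label itself
corr = toy.corr()["Survived"].drop("Survived")

# Reorder by strength of the relationship while keeping the sign visible
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```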
full_x = pd.concat([
    titleDf,              # title
    pcalssDf,             # passenger class
    familyDf,             # family size
    full['Fare_scaled'],  # ticket fare
    full['Age_scaled'],   # age
    cabinDf,              # cabin number
    embarkedDf,           # port of embarkation
    full['Sex'],          # sex
], axis=1)
Master Miss Mr Mrs Officer Royalty Pclass_1 Pclass_2 Pclass_3 family_size ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Embarked_C Embarked_Q Embarked_S Sex
0 0 0 1 0 0 0 0 0 1 2 ... 0 0 0 0 0 1 0 0 1 1
1 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 1 0 0 0
2 0 1 0 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 0
3 0 0 0 1 0 0 1 0 0 2 ... 0 0 0 0 0 0 0 0 1 0
4 0 0 1 0 0 0 0 0 1 1 ... 0 0 0 0 0 1 0 0 1 1

5 rows × 28 columns



4.1 Building the training and test datasets

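# The first 891 rows of the combined data are the labeled Kaggle training set
sourceRow = 891
source_x = full_x.loc[0:sourceRow-1, :]
source_y = full.loc[0:sourceRow-1, 'Survived']
# The remaining 418 rows are the Kaggle test set, whose labels must be predicted
pred_x = full_x.loc[sourceRow:, :]
Source data (Kaggle training set) shape: (891, 28)
Prediction data (Kaggle test set) shape: (418, 28)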
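# train_test_split is a commonly used cross-validation helper: it randomly splits
# the samples into training data and test data in the given proportion.
# sklearn.cross_validation was removed in scikit-learn 0.20; the function now
# lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split
# Training and test sets used to build the model
train_x, test_x, train_y, test_y = train_test_split(source_x, source_y, train_size=.8)
print('Source features:', source_x.shape, 'training features:', train_x.shape, 'test features:', test_x.shape)
print('Source labels:', source_y.shape, 'training labels:', train_y.shape, 'test labels:', test_y.shape)
Source features: (891, 28) training features: (712, 28) test features: (179, 28)
Source labels: (891,) training labels: (712,) test labels: (179,)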
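
Without a fixed seed, every run of `train_test_split` produces a different split and therefore a different score. The sketch below adds `random_state` and `stratify`, neither of which is used in the notebook, on invented arrays to show a reproducible split that preserves the class balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, balanced binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the 50/50 label ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)
print(X_tr.shape, X_te.shape, y_tr.sum(), y_te.sum())
```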

4.2 Choosing a machine learning algorithm

# Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

4.3 Training the model

model.fit(train_x, train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,penalty='l2', random_state=None, solver='liblinear', tol=0.0001,verbose=0, warm_start=False)

5. Model evaluation

model.score(test_x , test_y )
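
For a classifier, `score` returns the accuracy on the data it is given, so a single hold-out score depends on which rows happened to land in the test split. A sketch comparing it with 5-fold cross-validation on synthetic data; the data, the 150/50 split and `cv=5` are assumptions made for this example only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Synthetic, roughly linearly separable data (not the Titanic features)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold-out evaluation: train on the first 150 rows, score on the last 50
model = LogisticRegression()
model.fit(X[:150], y[:150])
holdout_acc = model.score(X[150:], y[150:])

# score() is the same number as accuracy_score on the predictions
same_acc = accuracy_score(y[150:], model.predict(X[150:]))

# Cross-validation gives one accuracy per fold, which shows the spread
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(holdout_acc, cv_scores.mean())
```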


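pred_y = model.predict(pred_x)
# Survived is stored as float in the combined data, so cast the predictions to int for Kaggle
pred_y = pred_y.astype(int)
passenger_id = full.loc[sourceRow:, 'PassengerId']
predDf = pd.DataFrame({'PassengerId': passenger_id, 'Survived': pred_y})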
(418, 2)
PassengerId Survived
891 892 0
892 893 1
893 894 0
894 895 0
895 896 1
predDf.to_csv( '/Users/qxh/Desktop/titanic/titanic_pred.csv' , index = False )
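
Before uploading, it is worth reading the file back and checking its layout: two columns named PassengerId and Survived, integer 0/1 predictions, and no index column. A sketch of that round trip using an in-memory buffer and three invented rows instead of the real file:

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for predDf
pred = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# Write without the index, exactly as in the to_csv call above
buffer = io.StringIO()
pred.to_csv(buffer, index=False)

# Read it back the way a grader would see it
buffer.seek(0)
check = pd.read_csv(buffer)
print(check.columns.tolist(), check.shape)
```

For the real submission the same checks apply, with an expected shape of (418, 2).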



