I. Overview

This article analyzes a dataset on which Titanic passengers survived. The dataset contains information about each passenger (age, cabin class, name, and so on) together with whether that passenger survived.

    Field reference


  • PassengerId — passenger ID
  • Survived — whether the passenger survived (1 = yes, 0 = no)
  • Pclass — cabin class
  • Name — name
  • Sex — sex
  • Age — age
  • SibSp — number of siblings/spouses aboard
  • Parch — number of parents/children aboard
  • Ticket — ticket number
  • Fare — ticket fare
  • Cabin — cabin number
  • Embarked — port of embarkation


Note: SibSp and Parch count only relatives who were also on board.

II. Workflow

1. Build a prediction model with linear regression
2. Build a prediction model with logistic regression
3. Build a prediction model with a decision tree
4. Build a prediction model with a random forest
5. Build an ensemble model (ensemble learning) that aggregates several models (this article combines gradient boosting and logistic regression)

III. Results

Going down the list, the models' accuracy trends upward.

1. Data import and preprocessing

import pandas  # running in an IPython notebook
titanic = pandas.read_csv("titanic_train.csv")
titanic.head(5)
print(titanic.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200
titanic.iloc[19:30]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
amount_Age_NaN = len(titanic.loc[titanic["Age"].isnull().values, :])
print("Number of missing values in the Age column:", amount_Age_NaN)
Number of missing values in the Age column: 177
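
The same count can be read off more directly with isnull().sum(); a one-line equivalent sketch, not part of the original notebook:

# Alternative sketch: isnull().sum() counts the missing values in a column.
print("Number of missing values in the Age column:", titanic["Age"].isnull().sum())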

1.1 The raw data above shows that the Age column has 177 missing values (displayed as NaN). They need to be filled with the fillna function; here we fill with the median.

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
print(titanic.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.361582    0.523008
std     257.353842    0.486592    0.836071   13.019697    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   22.000000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   35.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200

1.2 Inspect how many distinct values a column has, and map the strings to numbers

print(titanic["Sex"].unique())# Replace all the occurences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
['male' 'female']
print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
['S' 'C' 'Q' nan]
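
An equivalent, more compact way to do the same mapping uses Series.map. This is an alternative sketch to the .loc assignments above, not a step to run in addition (the columns are already numeric at this point):

# Alternative sketch: the same string-to-integer mapping with Series.map,
# run in place of (not after) the .loc assignments above.
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})
titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})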

2. Training the models


2.1 Building a classifier with linear regression

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.model_selection import KFold  # KFold has moved to the model_selection module

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset. It returns the row indices
# corresponding to train and test. With shuffle=False (the default) the splits are
# deterministic, so we don't need to set random_state.
kf = KFold(n_splits=3)

predictions = []
accuracy = []
test_idices = []
# kf.split divides the data into n_splits folds and yields the *indices* of the
# train and test rows, not the data itself.
# Note: when KFold's shuffle parameter is False, the test indices are generated
# sequentially from 0, i.e. 0, 1, 2, 3, ..., len(data) - 1
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
    test_idices.append(test)

Note: at this point predictions is a list holding three ndarrays, because the cross validation ran n_splits=3 times.


The output below shows that when shuffle=False, KFold generates the test indices sequentially, starting from 0 (the long index arrays are abridged):

test_idices
[array([  0,   1,   2, ..., 294, 295, 296]),
 array([297, 298, 299, ..., 591, 592, 593]),
 array([594, 595, 596, ..., 888, 889, 890])]
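
The same sequential behavior is easy to verify on a toy array; a standalone sketch, separate from the notebook's data:

# Sketch: with shuffle=False (the default), KFold hands out sequential test indices.
from sklearn.model_selection import KFold
import numpy as np

for train_idx, test_idx in KFold(n_splits=3).split(np.arange(6)):
    print(train_idx, test_idx)
# [2 3 4 5] [0 1]
# [0 1 4 5] [2 3]
# [0 1 2 3] [4 5]
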
import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate((predictions[0], predictions[1], predictions[2]), axis=0)
test_idices = np.concatenate((test_idices[0], test_idices[1], test_idices[2]), axis=0)

# Map predictions to outcomes (the only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1    # regression output above 0.5 is treated as "survived"
predictions[predictions <= .5] = 0

# accuracy = len(predictions[predictions == titanic["Survived"].iloc[test_idices].values]) / len(predictions)
# Because the folds were not shuffled, the concatenated predictions line up with row
# indices 0, 1, 2, ... in order, so the line above is equivalent to the simpler form below.
accuracy = len(predictions[predictions == titanic["Survived"].values]) / len(predictions)
print(accuracy)
0.7833894500561167
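
If the folds had been shuffled, the direct comparison above would misalign predictions and labels. An index-aligned version (a sketch of my own, not in the original) works in both cases:

# Sketch: reorder the ground-truth labels to match the order the predictions were made in.
truth = titanic["Survived"].iloc[test_idices].values
accuracy = (predictions == truth).mean()
print(accuracy)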

2.2 Building a classifier with logistic regression (despite the name, logistic regression is a classification model)


from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Initialize our algorithm
alg = LogisticRegression(random_state=1, solver='liblinear')
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.7878787878787877
2.2.1 Loading and preprocessing the test set
titanic_test = pandas.read_csv("test.csv")
# Note: missing ages in the test set are filled with the *training* set's median,
# so that train and test are treated consistently.
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

2.3 Building a classifier with a random forest

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
# Old API for reference: kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
kf = KFold(n_splits=5)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.8013935095097608
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = KFold(n_splits=3)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
0.8148148148148148
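
The two parameter settings above were chosen by hand; a systematic search is a natural next step. A sketch with GridSearchCV, where the particular grid values are illustrative assumptions rather than part of the original:

# Sketch: grid search over the forest's main regularization parameters.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [10, 50, 100],      # number of trees
    "min_samples_split": [2, 4, 8],     # minimum rows needed to split a node
    "min_samples_leaf": [1, 2, 4],      # minimum samples allowed in a leaf
}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(titanic[predictors], titanic["Survived"])
print(search.best_params_, search.best_score_)
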
2.4 Feature engineering: FamilySize, NameLength, and Title

# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
# The .apply method generates a new series
# lambda arg1, arg2, ..., argn: expression
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

Note: the keyword lambda defines an anonymous function: lambda arg1, arg2, ..., argn: expression
  • the names before the colon are the function's parameters
  • an anonymous function needs no return statement; the expression's value is the return value
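
A one-line illustration of the syntax (my own toy example, not from the original):

# Sketch: an anonymous function applied immediately; prints 5.
print((lambda a, b: a + b)(2, 3))
# The NameLength column above could equally be written as titanic["Name"].apply(len).
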
# The re module handles regular expressions.
import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.
    # Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))
# Add in the title column.
titanic["Title"] = titles
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Col           2
Major         2
Jonkheer      1
Lady          1
Sir           1
Capt          1
Don           1
Countess      1
Ms            1
Mme           1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64

2.4.1 Selecting the most useful features with univariate tests, and plotting each feature's importance for classification

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform the p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.title("importance of features")
plt.xlabel("features")
plt.ylabel("importance")
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
score = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=5)
print("Cross-validated score:", score.mean())

Cross-validated score: 0.8193563556396282

3. Solving with an ensemble of models

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]],
    [LogisticRegression(random_state=1, solver='liblinear'),
     ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross validation folds
kf = KFold(n_splits=3)
predictions = []
for train, test in kf.split(titanic):
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)
# Compute accuracy by comparing to the training data.
accuracy = len(predictions[predictions == titanic["Survived"]]) / len(predictions)
print('Model accuracy:', accuracy)
Model accuracy: 0.8215488215488216

titles = titanic_test["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set, but not the training set
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k, v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles
# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"].values))

# Now, we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]
1     240
2      79
3      72
4      21
7       2
6       2
10      1
5       1
dtype: int64

3.1 Building the ensemble and predicting on a dataset with unknown outcomes

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]algorithms = [[GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],[LogisticRegression(random_state=1,solver='liblinear'), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]full_predictions = []
for alg, predictors in algorithms:# Fit the algorithm using the full training data.alg.fit(titanic[predictors], titanic["Survived"])# Predict using the test dataset.  We have to convert all the columns to floats to avoid an error.predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]full_predictions.append(predictions)# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0predictions
array([0., 0., 0., 0., 1., 0., 1., 0., 1., 0., ..., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0.])
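
To turn these test-set predictions into a Kaggle-style submission file, a minimal sketch (the int cast and the output filename are my assumptions, not part of the original):

# Sketch: write PassengerId plus the predicted label to a CSV for submission.
submission = pandas.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": predictions.astype(int),  # Kaggle expects integer 0/1 labels
})
submission.to_csv("kaggle_submission.csv", index=False)  # hypothetical filename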
