I. Introduction

1. Data

(1) Training set (train.csv)
(2) Test set (test.csv)
(3) Sample submission file (gender_submission.csv)
For the training set, the outcome is provided for each passenger. A model is built on passenger "features" such as sex and ticket class, and feature engineering can be used to create new features. The task is then, for each passenger in the test set, to use the trained model to predict whether they survived the sinking of the Titanic.

2. Attribute Descriptions

Attribute     Description
PassengerId   Passenger ID
Survived      Survived or not (1 = yes, 0 = no)
Pclass        Ticket class (1 = highest; 2 and 3 are the lower classes)
Name          Passenger name
Sex           Sex
Age           Age
SibSp         Number of siblings/spouses aboard
Parch         Number of parents/children aboard
Ticket        Ticket number
Fare          Fare paid
Cabin         Cabin number
Embarked      Port of embarkation
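
To see these attributes on a few real rows, a quick peek works (a sketch, assuming the competition CSVs sit in the working directory):

import pandas as pd
train = pd.read_csv('train.csv')
# Show the first five rows of a handful of the columns described above
print(train[['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'Fare']].head())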

3. Site Notes

1. Registration on the site is via "register". During sign-up you need to connect through a "special tool" (a proxy), otherwise the human-verification step cannot complete. A workaround is described at: https://www.cnblogs.com/liuxiaomin/p/11785645.html

2. Data download and result submission (a minimal submission-file sketch follows below)
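
Kaggle scores a CSV with exactly two columns, PassengerId and Survived, in the same format as gender_submission.csv. A minimal sketch of producing such a file; `model` and the preprocessed `X_test` are assumed to exist already (they are built in sections II to IV below):

import pandas as pd
test = pd.read_csv('test.csv')
predictions = model.predict(X_test)  # `model` is a hypothetical fitted classifier
submission = pd.DataFrame({'PassengerId': test['PassengerId'],
                           'Survived': predictions.astype(int)})
submission.to_csv('submission.csv', index=False)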

II. Data Preprocessing

1. Import third-party libraries and data files

# Import third-party libraries
import pandas as pd
import numpy as np
# Read the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
submit = pd.read_csv('gender_submission.csv')

2. Check for missing values
Code:

print(train.info())

Result: (screenshot omitted; info() reports 891 entries in total, with Age at 714 non-null, Cabin at 204 non-null, and Embarked at 889 non-null)

3. Inspect variable statistics
Code:

print(train.describe())

Result: (screenshot omitted)

4. Summary
From the missing-value check and the variable statistics above, the training data has missing values in "Age" and "Embarked": out of 891 rows, the count for "Age" is only 714 and the count for "Embarked" is only 889. The missing "Age" values therefore need to be handled (the two missing "Embarked" values are filled with the mode later in the pipeline).
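
The same counts can also be read off directly, without scanning the info() output; a quick check:

# Count missing values per column in the training set
print(train.isnull().sum())
# Age          177
# Cabin        687
# Embarked       2
# (all other columns: 0)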

III. Data Processing

1. Missing-value handling
Code:

# Fill missing Age values with the median
train['Age'] = train['Age'].fillna(train['Age'].median())
# Check for missing values again
print(train.info())

Result: (screenshot omitted; Age now shows 891 non-null)

2. Data conversion
Convert the columns whose dtype is "object" into numeric values.
Code:

print(train['Sex'].unique())
# train.loc[0] selects the row with index 0
# train.loc[0, 'PassengerId'] selects the value at row 0, column PassengerId
train.loc[train['Sex'] == 'male', 'Sex'] = 0
train.loc[train['Sex'] == 'female', 'Sex'] = 1
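
An equivalent, more concise way to do the same conversion, and the one the full pipeline below actually uses, is a dictionary map (a sketch; it replaces the two .loc assignments above rather than following them):

# Map Sex to integers in one step
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1}).astype(int)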

Complete pipeline:

"""导入库"""
# 数据分析与整理
import pandas as pd
import numpy as np
import random as rnd
# 可视化
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# 机器学习
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier"""获取数据"""
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]for dataset in combine:dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)# expand=False表示返回DataFrame
# 用一个更常见的名字替换许多标题,分类稀有标题
for dataset in combine:dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]for dataset in combine:dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)all_data = pd.concat([train_df, test_df], ignore_index = True)
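
As a quick aside, this standalone example (hypothetical, not part of the pipeline) shows what the Title regex above captures:

import pandas as pd
names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
# The pattern grabs the word between a space and the following period
print(names.str.extract(r' ([A-Za-z]+)\.', expand=False))  # -> Mr, Miss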
# Fill missing Age values with a random forest
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor

train_df = all_data[all_data['Survived'].notnull()]
test_df = all_data[all_data['Survived'].isnull()]

# Split the training data into two folds (train : cv = 1 : 1)
train_split_1, train_split_2 = model_selection.train_test_split(train_df, test_size=0.5, random_state=0)

def predict_age_use_cross_validation(df1, df2, dfTest):
    # Fit on the rows of df1 with a known Age, predict the missing ages of
    # both folds, then refit on df2 plus the known test ages for the test set
    age_df1 = pd.get_dummies(df1[['Age', 'Pclass', 'Sex', 'Title']])
    age_df2 = pd.get_dummies(df2[['Age', 'Pclass', 'Sex', 'Title']])
    known_age = age_df1[age_df1.Age.notnull()].values
    unknow_age_df1 = age_df1[age_df1.Age.isnull()].values
    unknown_age = age_df2[age_df2.Age.isnull()].values
    print(unknown_age.shape)
    y = known_age[:, 0]
    X = known_age[:, 1:]
    rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
    rfr.fit(X, y)
    predictedAges = rfr.predict(unknown_age[:, 1:])
    df2.loc[df2.Age.isnull(), 'Age'] = predictedAges
    predictedAges = rfr.predict(unknow_age_df1[:, 1:])
    df1.loc[df1.Age.isnull(), 'Age'] = predictedAges
    age_Test = pd.get_dummies(dfTest[['Age', 'Pclass', 'Sex', 'Title']])
    age_Tmp = pd.get_dummies(df2[['Age', 'Pclass', 'Sex', 'Title']])
    age_Tmp = pd.concat([age_Test[age_Test.Age.notnull()], age_Tmp])
    known_age1 = age_Tmp.values
    unknown_age1 = age_Test[age_Test.Age.isnull()].values
    y = known_age1[:, 0]
    x = known_age1[:, 1:]
    rfr.fit(x, y)
    predictedAges = rfr.predict(unknown_age1[:, 1:])
    dfTest.loc[dfTest.Age.isnull(), 'Age'] = predictedAges
    return dfTest

# Run the imputation twice with the folds swapped, then average the results
t1 = train_split_1.copy()
t2 = train_split_2.copy()
tmp1 = test_df.copy()
t5 = predict_age_use_cross_validation(t1, t2, tmp1)
t1 = pd.concat([t1, t2])

t3 = train_split_1.copy()
t4 = train_split_2.copy()
tmp2 = test_df.copy()
t6 = predict_age_use_cross_validation(t4, t3, tmp2)
t3 = pd.concat([t3, t4])

train_df['Age'] = (t1['Age'] + t3['Age']) / 2
test_df['Age'] = (t5['Age'] + t6['Age']) / 2
all_data = pd.concat([train_df, test_df])
combine = [train_df, test_df]  # refresh so the loops below act on the new frames
print(train_df.describe())
print(test_df.describe())
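
The two-fold swap above averages two imputations so that neither half of the training data is filled by a model trained on itself. A simpler, single-model sketch of the same idea, shown here for comparison only (slightly leakier, and not what the pipeline uses):

# Single-pass Age imputation over the combined data: one regressor fit on
# every row with a known Age, predicting the rest
age_df = pd.get_dummies(all_data[['Age', 'Pclass', 'Sex', 'Title']])
known = age_df[age_df['Age'].notnull()]
unknown = age_df[age_df['Age'].isnull()]
rfr = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
rfr.fit(known.drop('Age', axis=1), known['Age'])
all_data.loc[all_data['Age'].isnull(), 'Age'] = rfr.predict(unknown.drop('Age', axis=1))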
# Fallback: iterate over Sex (0 or 1) and Pclass (1, 2, 3) and estimate a
# median age for each of the six combinations, for any Age still missing
guess_ages = np.zeros((2, 3))
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) &
                               (dataset['Pclass'] == j + 1)]['Age'].dropna()
            age_guess = guess_df.median()
            # Round to the nearest 0.5
            guess_ages[i, j] = int(age_guess / 0.5 + 0.5) * 0.5
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) &
                        (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]
    dataset['Age'] = dataset['Age'].astype(int)

# Cut Age into five bands and inspect the survival rate per band
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

# Replace Age with ordinal band codes
for dataset in combine:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]

# FamilySize and IsAlone features
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

# Interaction feature: Age * Pclass
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

# Fill the two missing Embarked values with the mode, then map to integers
freq_port = train_df.Embarked.dropna().mode()[0]
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

# Fill the single missing Fare in the test set, then bin Fare into quartiles
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
for dataset in combine:
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]

# Convert the training-set Title values to numeric codes (after the earlier
# grouping, only Mr/Miss/Mrs/Master/Rare actually occur; the rest are inert)
train_df.loc[train_df['Title'] == 'Mr', 'Title'] = 0
train_df.loc[train_df['Title'] == 'Miss', 'Title'] = 1
train_df.loc[train_df['Title'] == 'Mrs', 'Title'] = 2
train_df.loc[train_df['Title'] == 'Master', 'Title'] = 3
train_df.loc[train_df['Title'] == 'Dr', 'Title'] = 4
train_df.loc[train_df['Title'] == 'Rev', 'Title'] = 5
train_df.loc[train_df['Title'] == 'Major', 'Title'] = 6
train_df.loc[train_df['Title'] == 'Col', 'Title'] = 7
train_df.loc[train_df['Title'] == 'Mlle', 'Title'] = 8
train_df.loc[train_df['Title'] == 'Mme', 'Title'] = 9
train_df.loc[train_df['Title'] == 'Don', 'Title'] = 10
train_df.loc[train_df['Title'] == 'Lady', 'Title'] = 11
train_df.loc[train_df['Title'] == 'Countess', 'Title'] = 12
train_df.loc[train_df['Title'] == 'Jonkheer', 'Title'] = 13
train_df.loc[train_df['Title'] == 'Sir', 'Title'] = 14
train_df.loc[train_df['Title'] == 'Capt', 'Title'] = 15
train_df.loc[train_df['Title'] == 'Ms', 'Title'] = 16
train_df.loc[train_df['Title'] == 'Rare', 'Title'] = 17

# Convert the test-set Title values to numeric codes
test_df.loc[test_df['Title'] == 'Mr', 'Title'] = 0
test_df.loc[test_df['Title'] == 'Miss', 'Title'] = 1
test_df.loc[test_df['Title'] == 'Mrs', 'Title'] = 2
test_df.loc[test_df['Title'] == 'Master', 'Title'] = 3
test_df.loc[test_df['Title'] == 'Dr', 'Title'] = 4
test_df.loc[test_df['Title'] == 'Rev', 'Title'] = 5
test_df.loc[test_df['Title'] == 'Major', 'Title'] = 6
test_df.loc[test_df['Title'] == 'Col', 'Title'] = 7
test_df.loc[test_df['Title'] == 'Mlle', 'Title'] = 8
test_df.loc[test_df['Title'] == 'Mme', 'Title'] = 9
test_df.loc[test_df['Title'] == 'Don', 'Title'] = 10
test_df.loc[test_df['Title'] == 'Lady', 'Title'] = 11
test_df.loc[test_df['Title'] == 'Countess', 'Title'] = 12
test_df.loc[test_df['Title'] == 'Jonkheer', 'Title'] = 13
test_df.loc[test_df['Title'] == 'Sir', 'Title'] = 14
test_df.loc[test_df['Title'] == 'Capt', 'Title'] = 15
test_df.loc[test_df['Title'] == 'Ms', 'Title'] = 16
test_df.loc[test_df['Title'] == 'Rare', 'Title'] = 17

# Drop the placeholder Survived column from the test set (it is all NaN after
# the earlier concat) and build the model inputs; PassengerId was already
# dropped from train_df above
test_df = test_df.drop(['Survived'], axis=1)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"].astype(int)
X_test = test_df.drop("PassengerId", axis=1).copy()
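
A quick sanity check on the resulting shapes (the row counts 891 and 418 are fixed by the competition data):

# Expect (891, n_features), (891,), (418, n_features)
print(X_train.shape, Y_train.shape, X_test.shape)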

IV. Models

1. Decision Tree

# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
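
Note that acc_decision_tree is accuracy on the training data, which flatters high-variance models like an unpruned decision tree. A fairer estimate comes from cross-validation (a quick sketch):

from sklearn.model_selection import cross_val_score
# Mean 5-fold accuracy is a better guide to leaderboard performance
scores = cross_val_score(decision_tree, X_train, Y_train, cv=5)
print(round(scores.mean() * 100, 2))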


2. SVC

# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc


3. Logistic Regression

# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log


4. KNN

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn


5. Gaussian Naive Bayes

# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian


6. Perceptron

# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron


7. Stochastic Gradient Descent

# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
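
With all seven training scores computed, they can be collected into one frame for a side-by-side comparison (a sketch using the acc_* variables defined above):

# Rank the models by training accuracy (training scores only; see the
# cross-validation caveat under the decision tree)
models = pd.DataFrame({
    'Model': ['Decision Tree', 'SVC', 'Logistic Regression', 'KNN',
              'Gaussian NB', 'Perceptron', 'SGD'],
    'Score': [acc_decision_tree, acc_svc, acc_log, acc_knn,
              acc_gaussian, acc_perceptron, acc_sgd]})
print(models.sort_values(by='Score', ascending=False))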

Final: Random Forest with Grid Search

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest

pipe = Pipeline([('select', SelectKBest(k='all')),
                 ('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])
param_test = {'classify__n_estimators': list(range(20, 50, 2)),
              'classify__max_depth': list(range(3, 60, 3))}
gsearch = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='roc_auc', cv=10)
gsearch.fit(X_train,Y_train)
print(gsearch.best_params_, gsearch.best_score_)
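
To turn the tuned model into a submission, the best estimator found by the grid search can predict on X_test (a sketch; PassengerId is float after the earlier concat, hence the cast):

# Predict with the best pipeline and write the two-column submission file
predictions = gsearch.best_estimator_.predict(X_test).astype(int)
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'].astype(int),
                           'Survived': predictions})
submission.to_csv('submission.csv', index=False)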


———————————————————
This write-up turned out pretty rough; I'm going back to shore up the fundamentals. QAQ
