I just got started with data science and wanted to play with Kaggle on my own. After working through the beginner projects Titanic and House Prices, I found I could write a basic baseline, but getting the details right, to the point of producing a presentable score, still requires theoretical analysis.

This post covers the principles and techniques behind Kaggle competitions. Everything comes from Coursera; since the course is in English and fairly easy to follow, English is used directly here.

  • Reference
    How to Win a Data Science Competition: Learn from Top Kagglers

Features: numeric, categorical, ordinal, datetime, coordinate, text

Numeric features

All models can be divided into tree-based models and non-tree-based models.

Scaling

For example: if we apply the KNN algorithm, we calculate the distance between an instance and the other objects. It is obvious that the feature with the largest scale dominates the distance.

Tree-based models do not depend on scaling

Non-tree-based models hugely depend on scaling

How to do it

sklearn:

  1. To [0,1]
    sklearn.preprocessing.MinMaxScaler
    X = (X - X.min()) / (X.max() - X.min())
  2. To mean=0, std=1
    sklearn.preprocessing.StandardScaler
    X = (X - X.mean()) / X.std()

  • If you use KNN, you can go one step further: recall that the bigger a feature is, the more important it will be for KNN. So you can optimize the scaling parameters to boost the features that seem more important and see if this helps.
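
A minimal sketch of both scalers, assuming a made-up toy matrix:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)  # each column mapped to [0, 1]
X_std = StandardScaler().fit_transform(X)   # each column to mean 0, std 1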

Outliers

Outliers make the fitted model deviate toward them.

We can clip feature values between two chosen lower and upper bounds
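
A minimal sketch of clipping, assuming the 1st and 99th percentiles as bounds; percentiles are one common choice, not the only one:

import numpy as np

x = np.array([-100.0, 1.0, 2.0, 3.0, 4.0, 1e5])

# Bounds chosen from the data itself
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)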

  • Rank Transformation

If we have outliers, the rank transformation behaves better than scaling: it moves the outliers closer to the other objects

Linear models, KNN, and neural networks can benefit from this method.

rank([-100, 0, 1e5]) == [0, 1, 2]
rank([1000, 1, 10]) == [2, 0, 1]

scipy:

scipy.stats.rankdata
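
Note that scipy's ranks start at 1, so they are shifted by one relative to the 0-based notation above; a constant offset does not matter to the model. A minimal sketch:

from scipy.stats import rankdata

rankdata([-100, 0, 1e5])  # array([1., 2., 3.])
rankdata([1000, 1, 10])   # array([3., 1., 2.])

To rank the test data consistently, either concatenate train and test before ranking, or store the mapping obtained on the train set.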

  • Other method

    1. Log transform: np.log(1 + x)
    2. Raising to the power < 1: np.sqrt(x + 2/3)

Feature Generation

Depends on

a. Prior knowledge
b. Exploratory data analysis
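
For example, prior knowledge about housing suggests a price-per-square-meter feature. A minimal sketch, with made-up column names:

import pandas as pd

df = pd.DataFrame({'price': [300000, 450000], 'area': [50, 90]})

# Prior knowledge: price per square meter is often more informative
# than price and area taken separately
df['price_per_sqm'] = df['price'] / df['area']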


Ordinal features

Examples:

  • Ticket class: 1,2,3
  • Driver’s license: A, B, C, D
  • Education: kindergarten, school, undergraduate, bachelor, master, doctoral

Processing

1.Label Encoding

  • Alphabetical (sorted)
    [S,C,Q] -> [2, 1, 3]

    sklearn.preprocessing.LabelEncoder

  • Order of appearance
    [S,C,Q] -> [1, 2, 3]

    pandas.factorize

Both variants work fine with tree-based models, because trees can split on the feature and extract most of the useful information from the categories on their own. Non-tree-based models, on the other hand, usually can't use this feature effectively.
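
A minimal sketch of both variants on the Titanic Embarked values (note that both encoders count from 0, unlike the 1-based numbers above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['S', 'C', 'Q', 'S'])

# Alphabetical (sorted): C -> 0, Q -> 1, S -> 2
alpha = LabelEncoder().fit_transform(s)  # [2, 0, 1, 2]

# Order of appearance: S -> 0, C -> 1, Q -> 2
appear, uniques = pd.factorize(s)        # [0, 1, 2, 0]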

2.Frequency Encoding
[S,C,Q] -> [0.5, 0.3, 0.2]

encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

If several categories share exactly the same frequency, they become indistinguishable after this encoding; applying scipy.stats.rankdata to the frequencies is one way to break such ties.

Frequency encoding is also helpful for linear models:
if the frequency of a category is correlated with the target value, a linear model will utilize this dependency.

3.One-hot Encoding

pandas.get_dummies

It gives each category of a feature its own new column and is often used for non-tree-based models.
It can slow down tree-based models, so we introduce sparse matrices; most libraries can work with sparse matrices directly, e.g. XGBoost and LightGBM.
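
A minimal sketch with pandas.get_dummies; sparse=True keeps the memory footprint small for high-cardinality features:

import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One new 0/1 column per category
dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')

# Sparse variant for features with many categories
sparse_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked', sparse=True)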

Feature generation

Interactions of categorical features can help linear models and KNN

By concatenating strings, as sketched below.
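
A minimal sketch of such an interaction on two Titanic columns:

import pandas as pd

df = pd.DataFrame({'Pclass': [1, 3, 3], 'Sex': ['male', 'female', 'male']})

# Concatenate the two categorical features into one combined feature,
# then one-hot encode the result for a linear model or KNN
df['Pclass_Sex'] = df['Pclass'].astype(str) + '_' + df['Sex']
interactions = pd.get_dummies(df['Pclass_Sex'])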


Datetime and Coordinates

Date and time

1.Periodicity
2.Time since

a. Row-independent moment
For example: time since 00:00:00 UTC, 1 January 1970.
b. Row-dependent important moment
For example: the number of days left until the next holiday, or the time passed since the last holiday.

3.Difference between dates

We can add a date_diff feature that indicates the number of days between these events
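
A minimal sketch of all three kinds of datetime features with pandas (the column names are made up):

import pandas as pd

df = pd.DataFrame({'purchase': ['2017-01-01', '2017-01-05'],
                   'last_visit': ['2016-12-25', '2017-01-04']})
df['purchase'] = pd.to_datetime(df['purchase'])
df['last_visit'] = pd.to_datetime(df['last_visit'])

# 1. Periodicity
df['day_of_week'] = df['purchase'].dt.dayofweek
df['month'] = df['purchase'].dt.month

# 2a. Row-independent time since: days since the Unix epoch
df['days_since_epoch'] = (df['purchase'] - pd.Timestamp('1970-01-01')).dt.days

# 3. Difference between two dates
df['date_diff'] = (df['purchase'] - df['last_visit']).dt.days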

Coordinates

1.Interesting places from train/test data or additional data

Generate the distance from the instance to a flat or an old building (anything that is meaningful)

2.Aggregate statistics

For example, the price of the surrounding buildings

3.Rotation

Sometimes a rotated copy of the coordinates makes it easier for the model to classify instances precisely, since decision trees split on one feature at a time and axis-aligned splits fit diagonal boundaries poorly.
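
A minimal sketch of adding a rotated copy of the coordinates (45 degrees is just one choice; several angles can be tried):

import numpy as np

def rotate(x, y, degrees):
    # Rotate coordinates around the origin by the given angle
    theta = np.radians(degrees)
    return (x * np.cos(theta) - y * np.sin(theta),
            x * np.sin(theta) + y * np.cos(theta))

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.5, 2.5])
x45, y45 = rotate(x, y, 45)  # extra features for the model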


Missing data

Hidden Nan, numeric

When drawing a histogram, we may see a suspicious peak at a particular value, e.g. at -1.

It is obvious that -1 is a hidden NaN: a placeholder that has no meaning for this feature.

Fillna approaches

1.-999, -1, etc. (outside the feature range)

It is useful in that it gives trees the possibility to take missing values into a separate category. The downside is that the performance of linear models and neural networks can suffer.

2.mean,median

The second method is usually beneficial for simple linear models and neural networks. But again, for trees it can be harder to select the objects which had missing values in the first place.

3.Reconstruct:

  • Isnull

  • Prediction


* Replace the missing data with the mean or median grouped by another feature.
But sometimes this can be screwed up, e.g. when missing values were already filled with a placeholder such as -999 before the means were computed.

The way to handle this is to ignore missing values while calculating the means for each category.

  • Treating values which are not present in the train data

Just generate a new feature indicating the number of occurrences in the data (frequency)

  • XGBoost can handle NaN natively

4.Remove rows with missing values

This one is possible, but it can lead to loss of important samples and a quality decrease.
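
A minimal sketch combining the isnull indicator with a group-wise fill (column names are made up; pandas ignores NaN when computing the group median):

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                   'size': [100.0, np.nan, 50.0, 60.0]})

# Binary isnull indicator
df['size_isnull'] = df['size'].isnull().astype(int)

# Fill with the median of the group defined by another feature
df['size'] = df['size'].fillna(df.groupby('city')['size'].transform('median'))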


Text

Bag of words

Text preprocessing

1.Lowercase

2.Lemmatization and Stemming

3.Stopwords

Examples:
1.Articles or prepositions
2.Very common words

sklearn.feature_extraction.text.CountVectorizer:
max_df

  • max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

CountVectorizer

The number of times a term occurs in a given document

sklearn.feature_extraction.text.CountVectorizer
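
A minimal sketch (get_feature_names_out assumes a recent scikit-learn; older versions call it get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

texts = ['the cat sat', 'the cat sat on the mat']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)    # sparse document-term matrix
vocab = vectorizer.get_feature_names_out()  # one column per term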

TFiDF

In order to re-weight the count features into floating point values suitable for usage by a classifier

  • Term frequency
    tf = 1 / x.sum(axis=1)[:, None]
    x = x * tf

  • Inverse Document Frequency
    idf = np.log(x.shape[0] / (x > 0).sum(0))
    x = x * idf
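
scikit-learn packages both steps as sklearn.feature_extraction.text.TfidfVectorizer; note that its exact formula differs slightly from the manual version above (by default it adds smoothing and L2-normalizes each row):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['the cat sat', 'the cat sat on the mat']
tfidf = TfidfVectorizer().fit_transform(texts)  # sparse TF-iDF matrix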

N-gram

sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer

  • ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
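
A minimal sketch of word and character n-grams; char n-grams can help when the vocabulary is huge or contains misspellings:

from sklearn.feature_extraction.text import CountVectorizer

texts = ['the cat sat']

# Word unigrams and bigrams
word_ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Character 2- and 3-grams
char_ngrams = CountVectorizer(ngram_range=(2, 3), analyzer='char').fit_transform(texts)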

Embeddings(~word2vec)

It converts each word to a vector in some sophisticated space, which usually has several hundred dimensions

a. Relatively small vectors

b. Values in the vector can be interpreted only in some cases

c. Words with similar meaning often have similar embeddings

Example: the classic word2vec analogy, where vector('king') - vector('man') + vector('woman') lies close to vector('queen').
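
A minimal sketch of training word2vec with gensim (assumed available; parameter names follow gensim 4.x):

from gensim.models import Word2Vec

sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1)

vec = model.wv['cat']         # the embedding vector for 'cat'
model.wv.most_similar('cat')  # words with similar embeddings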

Reposted from: https://www.cnblogs.com/bjwu/p/8970821.html
