Contents

  • Parameter Tuning
    • num_leaves and max_depth
    • min_data_in_leaf and min_sum_hessian_in_leaf
    • monotonic constraints
    • group_column and ignore_column
    • categorical_feature
    • lambda_l1 and lambda_l2
    • bagging_fraction and bagging_freq
    • Categorical Feature Support
  • lambdaRank label
  • Parameter Tuning
    • Learning rate and number of iterations
    • max_depth and num_leaves
    • max_bin and min_data_in_leaf
    • feature_fraction bagging_fraction bagging_freq
    • lambda_l1 and lambda_l2
    • monotonic constraints
    • learning_rate
    • Feature importance
  • Code
  • Common gbm attributes
  • predict
  • Deployment
  • Model file

Parameter Tuning

LightGBM official documentation

num_leaves and max_depth

  • num_leaves is the main parameter controlling model complexity. In theory, borrowing from depth-wise trees, we could set num_leaves = 2^(max_depth). In practice this simple conversion works poorly: with the same number of leaves, a leaf-wise tree grows much deeper than a depth-wise tree, which can cause over-fitting. So when tuning num_leaves, keep it below 2^(max_depth). For example, per the official docs, when max_depth=7 a depth-wise tree can reach good accuracy, but setting num_leaves to 127 may cause over-fitting, while setting it to 70 or 80 may give better accuracy than the depth-wise tree. In fact, depth itself means little in a leaf-wise tree, since there is no sensible mapping from leaves to depth. A rough parameter sketch follows.
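A minimal illustration of the rule above (the concrete values are hypothetical, not from the original experiments):

# keep num_leaves well below 2**max_depth (2**7 = 128) to curb over-fitting
params = {
    'max_depth': 7,
    'num_leaves': 80,
}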

min_data_in_leaf and min_sum_hessian_in_leaf

  • min_data_in_leaf is a very important parameter for preventing over-fitting in leaf-wise trees. Its optimal value depends on the number of training samples and on num_leaves. Setting it large avoids growing an overly deep tree, but may cause under-fitting. In practice, a few hundred or a few thousand is enough for a large dataset. You can also use max_depth to limit the tree depth explicitly.

  • min_sum_hessian_in_leaf, default = 1e-3, type = double, aliases: min_sum_hessian_per_leaf, min_sum_hessian, min_hessian, min_child_weight, constraints: min_sum_hessian_in_leaf >= 0.0
    minimal sum hessian in one leaf; like min_data_in_leaf, it can be used to deal with over-fitting

Parameters Tuning (official docs)

monotonic constraints

  • It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. This may happen due to business considerations, or because of the type of scientific question being investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality, constraints can be used to improve the predictive performance of the model. A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted response, as sketched below.
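In LightGBM this is expressed through the monotone_constraints parameter, one value per feature in column order: 1 forces a non-decreasing relationship, -1 a non-increasing one, 0 no constraint. A minimal sketch (the feature order and the signs are hypothetical):

# suppose the columns are [price, discount, popularity]
params = {
    'objective': 'lambdarank',
    'monotone_constraints': [-1, 1, 0],  # score falls with price, rises with discount
}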

group_column and ignore_column

  • group_column, default = “”, type = int or string, aliases: group, group_id, query_column, query, query_id
    used to specify the query/group id column (see the sketch after this list)
    use number for index, e.g. query=0 means column_0 is the query id
    add a prefix name: for column name, e.g. query=name:query_id
    Note: works only in case of loading data directly from file
    Note: data should be grouped by query_id, for more information, see Query Data
    Note: index starts from 0 and it doesn’t count the label column when passing type is int, e.g. when label is column_0 and query_id is column_1, the correct parameter is query=0

  • ignore_column, default = “”, type = multi-int or string, aliases: ignore_feature, blacklist
    used to specify some ignoring columns in training
    use number for index, e.g. ignore_column=0,1,2 means column_0, column_1 and column_2 will be ignored
    add a prefix name: for column name, e.g. ignore_column=name:c1,c2,c3 means c1, c2 and c3 will be ignored
    Note: works only in case of loading data directly from file
    Note: index starts from 0 and it doesn’t count the label column when passing type is int
    Note: despite the fact that specified columns will be completely ignored during the training, they still should have a valid format allowing LightGBM to load file successfully
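A minimal sketch of both options when loading directly from a file (the column indices are hypothetical, and passing these as Dataset params is my assumption about the Python-package equivalent of the file-loading config):

import lightgbm as lgb

# query id in column_0 (the index does not count the label column);
# skip columns 5 and 6 during training
train_data = lgb.Dataset(
    'data/train.csv',
    params={'group_column': 0, 'ignore_column': '5,6'},
)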

categorical_feature

  • categorical_feature, default = “”, type = multi-int or string, aliases: cat_feature, categorical_column, cat_column
    used to specify categorical features
    use number for index, e.g. categorical_feature=0,1,2 means column_0, column_1 and column_2 are categorical features
    add a prefix name: for column name, e.g. categorical_feature=name:c1,c2,c3 means c1, c2 and c3 are categorical features
    Note: only supports categorical with int type (not applicable for data represented as pandas DataFrame in Python-package)
    Note: index starts from 0 and it doesn’t count the label column when passing type is int
    Note: all values should be less than Int32.MaxValue (2147483647)
    Note: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero
    Note: all negative values will be treated as missing values
    Note: the output cannot be monotonically constrained with respect to a categorical feature

lambda_l1 and lambda_l2

In XGBoost, the regularization term contains both the number of leaves and the leaf weights.

The difference between the L1 and L2 regularization terms:
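For reference, in XGBoost the regularization term for a tree f with T leaves and leaf weights w_j is

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert$$

where gamma penalizes the number of leaves, lambda is the L2 strength and alpha the L1 strength. LightGBM's lambda_l1 and lambda_l2 play the same roles on the leaf outputs: L2 shrinks the weights smoothly toward zero, while L1 can drive them exactly to zero.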

bagging_fraction and bagging_freq

  • bagging_fraction, default = 1.0, type = double, aliases: sub_row, subsample, bagging, constraints: 0.0 < bagging_fraction <= 1.0
    like feature_fraction, but this will randomly select part of data without resampling
    can be used to speed up training
    can be used to deal with over-fitting
    Note: to enable bagging, bagging_freq should be set to a non zero value as well

  • bagging_freq , default = 0, type = int, aliases: subsample_freq
    frequency for bagging
    0 means disable bagging; k means perform bagging at every k iteration. Every k-th iteration, LightGBM will randomly select bagging_fraction * 100 % of the data to use for the next k iterations
    Note: to enable bagging, bagging_fraction should be set to value smaller than 1.0 as well

  • feature_fraction , default = 1.0, type = double, aliases: sub_feature, colsample_bytree, constraints: 0.0 < feature_fraction <= 1.0
    LightGBM will randomly select a subset of features on each iteration (tree) if feature_fraction is smaller than 1.0. For example, if you set it to 0.8, LightGBM will select 80% of features before training each tree
    can be used to speed up training
    can be used to deal with over-fitting
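Putting the three together (the values are purely illustrative):

# resample 80% of the rows every 5 iterations, 90% of the columns per tree
params = {
    'bagging_fraction': 0.8,
    'bagging_freq': 5,        # must be non-zero, or bagging stays disabled
    'feature_fraction': 0.9,
}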

Categorical Feature Support

  • LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies Fisher (1958) to find the optimal split over categories as described here. This often performs better than one-hot encoding.
  • Use categorical_feature to specify the categorical features. Refer to the parameter categorical_feature in Parameters.
  • Categorical features must be encoded as non-negative integers (int) less than Int32.MaxValue (2147483647). It is best to use a contiguous range of integers started from zero.
  • Use min_data_per_group, cat_smooth to deal with over-fitting (when #data is small or #category is large).
  • For a categorical feature with high cardinality (#category is large), it often works best to treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or by embedding the categories in a low-dimensional numeric space.
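A minimal sketch with the Python package (the data here is synthetic):

import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 3)
X[:, 0] = np.random.randint(0, 5, size=100)  # integer-encoded categorical column
y = np.random.randint(0, 2, size=100)

# declare column 0 as categorical; values must be non-negative integers
train_data = lgb.Dataset(X, label=y, categorical_feature=[0])

With a pandas DataFrame, columns of dtype 'category' are picked up automatically.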

lambdaRank label

  • The label should be of type int, such that larger numbers correspond to higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).
  • Use label_gain to set the gain(weight) of int label.
  • Use lambdarank_truncation_level to truncate the max DCG.

label_gain can be used to set the gain for each label value; how to choose the gain values well is still an open question.

  • label_gain, default = 0,1,3,7,15,31,63,…,2^30-1, type = multi-double
  • used only in lambdarank application
  • relevant gain for labels; for example, with the default label gains the gain of label 2 is 3; values are separated by ,
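A sketch of a lambdarank setup that overrides the default gains (the gain values are illustrative):

# four relevance grades: 0=bad, 1=fair, 2=good, 3=perfect
params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [10, 40],
    'label_gain': [0, 1, 3, 7],           # gain per label value
    'lambdarank_truncation_level': 40,    # truncate the max DCG at position 40
}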

Parameter Tuning

Learning rate and number of iterations

First fix the learning rate at a fairly high value, say learning_rate = 0.1, and use cross-validation with early_stopping_round to find the best number of iterations.
The iteration count goes by several aliases: n_estimators / num_iterations / num_round / num_boost_round. Set it to a large number first, then read the optimal number of iterations off the CV results, as in the code below.

Training stops early via early_stopping_round: if a metric on one validation set has not improved in the last early_stopping_round rounds, the model stops training.
lgb.cv can be set to return the models or just the evaluation history; the history looks like {'ndcg@10-mean': [...], 'ndcg@10-stdv': [...]}.
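A minimal sketch of this step (the file path and base params mirror the code section below):

import lightgbm as lgb

params = {'objective': 'lambdarank', 'metric': 'ndcg', 'ndcg_eval_at': [10], 'learning_rate': 0.1}
train_data = lgb.Dataset('data/train.csv')  # assumes query-grouped data

cv_res = lgb.cv(params, train_data, num_boost_round=1000, nfold=5,
                stratified=False, early_stopping_rounds=50)
# the history stops at the best iteration, so its length is the best round count
print(len(cv_res['ndcg@10-mean']), max(cv_res['ndcg@10-mean']))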

max_depth and num_leaves

max_bin and min_data_in_leaf

feature_fraction bagging_fraction bagging_freq

lambda_l1 and lambda_l2

monotonic constraints

learning_rate

Feature importance

Code

import sys
import lightgbm as lgb

# NOTE: the original post defines `params` elsewhere; a minimal lambdarank
# baseline is assumed here so the script runs end to end.
params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [10, 40],
    'learning_rate': 0.1,
}


class GridSearch(object):

    def __init__(self):
        self.gbm = None
        self.optimal_params = {}

    def grid_search(self, grids, paral=False, is_print=True):
        """Search one group of parameters. grids: {param_name: [values]}"""
        res_d = {}
        grid_l = list(grids.items())
        params.update(self.optimal_params)
        # parallel search, not needed for now
        if paral:
            pass
        # free combination of parameter values
        else:
            # enumerate all possible parameter combinations
            param_l = [[x] for x in grid_l[0][1]]  # initialize with the first parameter
            for i in range(1, len(grid_l)):
                param_l = [l + [p] for l in param_l for p in grid_l[i][1]]
            name_l = [tup[0] for tup in grid_l]
            for i in range(len(param_l)):
                _d = dict(zip(name_l, param_l[i]))
                params_copy = params.copy()
                params_copy.update(_d)
                # k = 'lambda_l1: 0.001,lambda_l2: 0.1'
                k = ','.join([str(tup[0]) + ": " + str(tup[1]) for tup in zip(name_l, param_l[i])])
                print("combination {}: {}".format(i, k))
                self.gbm = lgb.train(params_copy, train_data, num_boost_round=400,
                                     valid_sets=[valid_data],
                                     early_stopping_rounds=40, verbose_eval=20)
                res_d[k] = (self.gbm.best_score['valid_0']['ndcg@10'],
                            self.gbm.best_score['valid_0']['ndcg@40'])
        # res_d items: (k, (ndcg@10, ndcg@40)), k = 'lambda_l1: 0.001,lambda_l2: 0.1'
        ndcg_10 = sorted(res_d.items(), key=lambda kv: kv[1][0], reverse=True)
        self.optimal_params.update(dict([x.split(': ') for x in ndcg_10[0][0].split(',')]))
        print(self.optimal_params)
        ndcg_40 = sorted(res_d.items(), key=lambda kv: kv[1][1], reverse=True)
        if is_print:
            for k, v in ndcg_10:
                print(k, v)
            print("-" * 40)
            for k, v in ndcg_40:
                print(k, v)
        return ndcg_10, ndcg_40

    def all_parameters_search(self, grids_l):
        """Automatically search every parameter group in turn."""
        # with open('train_record', 'w') as f:
        for grids in grids_l:
            # grids.update(self.optimal_params)  # carry over the best params from the previous step
            ndcg_10, ndcg_40 = self.grid_search(grids, paral=False, is_print=False)
            for k, v in ndcg_10:
                print('{}\t{},{}\n'.format(k, v[0], v[1]))
            for k, v in ndcg_40:
                print('{}\t{},{}\n'.format(k, v[1], v[0]))
            print("-" * 40 + '\n')
        print(self.optimal_params)

    def print_feature_importance(self):
        """Print each feature's importance."""
        importances = self.gbm.feature_importance(importance_type='split')
        feature_names = self.gbm.feature_name()
        total = 0.  # renamed from `sum`, which shadowed the builtin
        for value in importances:
            total += value
        name_impo = sorted(list(zip(feature_names, importances)), key=lambda x: x[1], reverse=True)
        for name, impo in name_impo:
            print('{} : {} : {}'.format(name, impo, impo / total))


grids_l = [
    # {'max_depth': list(range(3, 8, 1)), 'num_leaves': list(range(5, 100, 5))},
    # {'max_bin': list(range(5, 256, 10)), 'min_data_in_leaf': list(range(50, 1001, 50))},
    {'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
     'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
     'bagging_freq': range(0, 81, 10)},
    {'lambda_l1': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
     'lambda_l2': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]},
    {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
    {'learning_rate': [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7, 1.0]},
]

if __name__ == "__main__":
    t = int(sys.argv[1])
    # t = 1
    # categorical_feature='name:u_fs,up_sex' would fail here: no feature names are provided
    train_data = lgb.Dataset("data/train.csv", categorical_feature=[18, 31], free_raw_data=False)
    valid_data = lgb.Dataset("data/valid.csv", categorical_feature=[18, 31], free_raw_data=False)
    if t == 0:
        cv_res = lgb.cv(params, train_data, num_boost_round=1000, nfold=5,
                        stratified=False, early_stopping_rounds=50)
        print("iteration num: {}".format(len(cv_res['ndcg@10-mean'])))
        print("ndcg@10: {} ndcg@40: {}".format(max(cv_res['ndcg@10-mean']), max(cv_res['ndcg@40-mean'])))
    elif t == 1:  # grid search
        gs = GridSearch()
        gs.grid_search({'learning_rate': [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7, 1.0]})
        gs.print_feature_importance()
    elif t == 2:
        gs = GridSearch()
        gs.all_parameters_search(grids_l)

Pay close attention to the startup log; it reflects how the parameters were actually applied.
It mainly contains:

  • the parameter settings, usually accompanied by a few warnings
  • an overall profile of the data; for ranking: the number of categories, the number of queries, the average number of rows per query, and so on
[LightGBM] [Info] Construct bin mappers from text data time 0.33 seconds
[LightGBM] [Debug] Number of queries in train3.csv: 118403. Average number of rows per query: 33.868660.
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.769392
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.492899
[LightGBM] [Debug] init for col-wise cost 0.265492 seconds, init for row-wise cost 0.571826 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.318483 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 5951
[LightGBM] [Info] Number of data points in the train set: 4010151, number of used features: 31

Common gbm attributes

gbm = lgb.train(params, train_data, num_boost_round=400, valid_sets=[valid_data])
gbm.best_score
>>> defaultdict(<class 'collections.OrderedDict'>, {'valid_0': OrderedDict([('ndcg@10', 0.49198254166476096), ('ndcg@30', 0.5681340145051615)])})
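Other commonly used Booster attributes and methods (a quick sketch):

gbm.best_iteration                               # best round found by early stopping
gbm.feature_name()                               # list of feature names
gbm.feature_importance(importance_type='split')  # or importance_type='gain'
gbm.num_trees()                                  # total number of trees kept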

predict

At prediction time, the presence or absence of the query column does not affect the predicted scores.

>>> ypred4 = gbm.predict('test4.csv', data_has_header=True)
[LightGBM] [Warning] Feature (sid) is missed in data file. If it is weight/query/group/ignore_column, you can ignore this warning.
>>> ypred4[:10]
array([ 0.54267792,  0.39272917,  0.31842769,  0.10324354, -0.05312303,
0.10855625,  0.0766676 ,  0.1336972 ,  1.57561062,  0.14458557])

Further reading: LightGBM parameters explained and how to tune them
LightGBM tuning notes

Deployment

https://zhuanlan.zhihu.com/p/265888201

Model file

https://zhuanlan.zhihu.com/p/269882594
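The saved model is a plain-text file; a minimal save/load sketch (gbm as trained above):

import lightgbm as lgb

gbm.save_model('model.txt')                # dumps trees, params and feature names as text
bst = lgb.Booster(model_file='model.txt')  # reload for serving
ypred = bst.predict('test4.csv', data_has_header=True)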
