Table of Contents

  • Introduction
  • I. Data Competition Introduction
  • II. Problem Introduction
  • III. Approach to the Problem
  • IV. Baseline Code Walkthrough
    • 1. Autoencoder Model
    • 2. Fully Connected (MLP) Model
    • 3. Complete Code

Introduction

Jane Street Market Prediction

I. Data Competition Introduction

II. Problem Introduction


III. Approach to the Problem

IV. Baseline Code Walkthrough

Link (extraction code: 1234)

1. Autoencoder Model

def create_autoencoder(input_dim, output_dim, noise=0.05):
    i = Input(shape=(input_dim,))
    # Autoencoder: x = decoder(encoder(x)) => 130 -> 64 -> 130
    # Encoder: reduces the dimensionality of the data
    encoded = BatchNormalization()(i)
    encoded = GaussianNoise(noise)(encoded)
    encoded = Dense(64, activation='relu')(encoded)
    # Decoder: restores the original dimensionality
    decoded = Dropout(0.2)(encoded)
    decoded = Dense(input_dim, name='decoded')(decoded)
    # Then train a classification head on the decoded data
    x = Dense(32, activation='relu')(decoded)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)
    x = Dense(32, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)
    x = Dense(output_dim, activation='sigmoid', name='label_output')(x)
    encoder = Model(inputs=i, outputs=encoded)
    autoencoder = Model(inputs=i, outputs=[decoded, x])
    # The loss has two parts. Reconstruction: mean squared error; classification: cross-entropy
    autoencoder.compile(optimizer=Adam(0.005),
                        loss={'decoded': 'mse', 'label_output': 'binary_crossentropy'})
    return autoencoder, encoder
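A minimal usage sketch, assuming the Keras imports listed in the complete code below and the competition's shapes (130 input features, 5 resp-based targets):

autoencoder, encoder = create_autoencoder(130, 5, noise=0.05)
autoencoder.summary()  # 'decoded' output: (None, 130); 'label_output': (None, 5)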

2. Fully Connected (MLP) Model

def create_model(input_dim, output_dim, encoder):
    inputs = Input(shape=(input_dim,))
    # The encoder reduces dimensionality and learns a more useful representation of the data
    x = encoder(inputs)
    x = Concatenate()([x, inputs])  # use both raw and encoded features
    x = BatchNormalization()(x)
    x = Dropout(0.13)(x)
    # Several hidden layers
    hidden_units = [384, 896, 896, 394]
    for idx, hidden_unit in enumerate(hidden_units):
        x = Dense(hidden_unit)(x)
        x = BatchNormalization()(x)
        x = Lambda(tf.keras.activations.relu)(x)
        x = Dropout(0.25)(x)
    # Output layer
    x = Dense(output_dim, activation='sigmoid')(x)
    model = Model(inputs=inputs, outputs=x)
    # label_smoothing applies label smoothing to the targets
    model.compile(optimizer=Adam(0.0005),
                  loss=BinaryCrossentropy(label_smoothing=0.05),
                  metrics=[tf.keras.metrics.AUC(name='auc')])
    return model
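The label_smoothing=0.05 argument squeezes the hard 0/1 targets toward 0.5 before the cross-entropy is computed. A minimal sketch of the transformation applied to binary labels, y_smooth = y * (1 - alpha) + alpha / 2:

import numpy as np

alpha = 0.05
y_hard = np.array([0., 1., 1., 0.])
y_smooth = y_hard * (1 - alpha) + 0.5 * alpha  # hard 0/1 labels become 0.025/0.975
print(y_smooth)  # [0.025 0.975 0.975 0.025]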

3. Complete Code

The main tf.keras classes used in the code below:

tf.keras.layers.BatchNormalization
tf.keras.layers.Lambda
tf.keras.layers.GaussianNoise
tf.keras.layers.Activation
tf.keras.losses.BinaryCrossentropy

from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, Concatenate, Lambda, GaussianNoise, Activation
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
from tqdm import tqdm
from random import choices

# PurgedGroupTimeSeriesSplit: splits the dataset along the time axis
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args

# modified code for group gaps; source:
# https://github.com/getgaurav2/scikit-learn/blob/d4a3af5cc9da3a76f0266932644b884c99724c57/sklearn/model_selection/_split.py#L2243
class PurgedGroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.

    Allows for a gap in groups to avoid potentially leaking info from
    train into test if the model has windowed or lag features.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third-party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_group_size : int, default=Inf
        Maximum group size for a single training set.
    group_gap : int, default=None
        Gap between train and test.
    max_test_group_size : int, default=Inf
        We discard this number of groups from the end of each train split.
    """

    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_group_size=np.inf,
                 max_test_group_size=np.inf,
                 group_gap=None,
                 verbose=False):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_group_size = max_train_group_size
        self.group_gap = group_gap
        self.max_test_group_size = max_test_group_size
        self.verbose = verbose

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.

        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        group_gap = self.group_gap
        max_test_group_size = self.max_test_group_size
        max_train_group_size = self.max_train_group_size
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if groups[idx] in group_dict:
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds, n_groups))
        group_test_size = min(n_groups // n_folds, max_test_group_size)
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []
            group_st = max(0, group_test_start - group_gap - max_train_group_size)
            for train_group_idx in unique_groups[group_st:(group_test_start - group_gap)]:
                train_array_tmp = group_dict[train_group_idx]
                train_array = np.sort(np.unique(
                    np.concatenate((train_array, train_array_tmp)),
                    axis=None), axis=None)
            train_end = train_array.size
            for test_group_idx in unique_groups[group_test_start:
                                                group_test_start + group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                    np.concatenate((test_array, test_array_tmp)),
                    axis=None), axis=None)
            test_array = test_array[group_gap:]
            if self.verbose > 0:
                pass
            yield [int(i) for i in train_array], [int(i) for i in test_array]
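# Quick sanity check of the splitter on toy data (hypothetical arrays, not the
# competition's): with group_gap=1 the training groups stop one group before the
# test block, and the first group_gap rows of the test block are discarded.
X_demo = np.arange(20).reshape(20, 1)      # 20 rows
groups_demo = np.repeat(np.arange(10), 2)  # 10 "dates", 2 rows per date
cv = PurgedGroupTimeSeriesSplit(n_splits=3, group_gap=1)
for fold_demo, (tr, te) in enumerate(cv.split(X_demo, groups=groups_demo)):
    print(fold_demo, tr[-1], te)  # e.g. fold 0: train ends at row 5, test is [9, 10, 11]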
# Load the training data
# TRAINING toggles between training and submission: TRAINING = True to train, TRAINING = False to predict
TRAINING = True
USE_FINETUNE = False
FOLDS = 4  # 4 folds
SEED = 42

# Read the data, using part of the dataset as our training set
train = pd.read_csv('train.csv',nrows = None)
# Filter with the query expression 'date > 85', then reset to an integer index
train = train.query('date > 85').reset_index(drop = True)
# Cast float64 => float32 to reduce memory usage
train = train.astype({c: np.float32 for c in train.select_dtypes(include='float64').columns}) #limit memory use
# Fill missing values with the column means
train.fillna(train.mean(),inplace=True)
# Keep only the rows that satisfy the condition (weight > 0)
train = train.query('weight > 0').reset_index(drop = True)
# Build the action column
#train['action'] = (train['resp'] > 0).astype('int')
train['action'] = ((train['resp_1'] > 0) & (train['resp_2'] > 0) & (train['resp_3'] > 0) & (train['resp_4'] > 0) & (train['resp'] > 0)).astype('int')
# The 130 features
features = [c for c in train.columns if 'feature' in c]
resp_cols = ['resp_1', 'resp_2', 'resp_3', 'resp', 'resp_4']
# X,y
X = train[features].values
y = np.stack([(train[c] > 0).astype('int') for c in resp_cols]).T #Multitarget
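# (Sketch: np.stack builds a (5, N) array with one binary row per resp column,
#  and .T transposes it to (N, 5), i.e. five labels per sample.)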
# Mean of each feature column (the first feature is skipped)
f_mean = np.mean(train[features[1:]].values, axis=0)

# Autoencoder
def create_autoencoder(input_dim, output_dim, noise=0.05):
    i = Input(shape=(input_dim,))
    # Autoencoder: x = decoder(encoder(x)) => 130 -> 64 -> 130
    # Encoder: reduces the dimensionality of the data
    encoded = BatchNormalization()(i)
    encoded = GaussianNoise(noise)(encoded)
    encoded = Dense(64, activation='relu')(encoded)
    # Decoder: restores the original dimensionality
    decoded = Dropout(0.2)(encoded)
    decoded = Dense(input_dim, name='decoded')(decoded)
    # Then train a classification head on the decoded data
    x = Dense(32, activation='relu')(decoded)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)
    x = Dense(32, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)
    x = Dense(output_dim, activation='sigmoid', name='label_output')(x)
    encoder = Model(inputs=i, outputs=encoded)
    autoencoder = Model(inputs=i, outputs=[decoded, x])
    # The loss has two parts. Reconstruction: mean squared error; classification: cross-entropy
    autoencoder.compile(optimizer=Adam(0.005),
                        loss={'decoded': 'mse', 'label_output': 'binary_crossentropy'})
    return autoencoder, encoder

# Fully connected network (MLP)
def create_model(input_dim, output_dim, encoder):
    inputs = Input(shape=(input_dim,))
    # The encoder reduces dimensionality and learns a more useful representation of the data
    x = encoder(inputs)
    # Concatenate the encoded representation with the raw inputs: both the original
    # information and the reduced representation are kept, and the network decides
    # what to use. The downside is a longer vector, more parameters, and harder optimization.
    x = Concatenate()([x, inputs])  # use both raw and encoded features
    x = BatchNormalization()(x)
    x = Dropout(0.13)(x)
    # Several hidden layers
    hidden_units = [384, 896, 896, 394]
    for idx, hidden_unit in enumerate(hidden_units):
        x = Dense(hidden_unit)(x)
        x = BatchNormalization()(x)
        x = Lambda(tf.keras.activations.relu)(x)
        x = Dropout(0.25)(x)
    # Output layer
    x = Dense(output_dim, activation='sigmoid')(x)
    model = Model(inputs=inputs, outputs=x)
    # label_smoothing applies label smoothing to the targets
    model.compile(optimizer=Adam(0.0005),
                  loss=BinaryCrossentropy(label_smoothing=0.05),
                  metrics=[tf.keras.metrics.AUC(name='auc')])
    return model

# Define and train the autoencoder. We add Gaussian noise to the training data;
# after training we freeze the encoder layers to prevent further training.
autoencoder, encoder = create_autoencoder(X.shape[-1],y.shape[-1],noise=0.1)
if TRAINING:
    autoencoder.fit(X, (X, y),
                    epochs=1000,
                    batch_size=4096,
                    validation_split=0.1,
                    callbacks=[EarlyStopping('val_loss', patience=10, restore_best_weights=True)])
    encoder.save_weights('./encoder.hdf5')
else:
    encoder.load_weights('./encoder.hdf5')
encoder.trainable = False

# Training and prediction
FOLDS = 5
SEED = 42

oof = np.zeros((X.shape[0], 5))
if TRAINING:
    gkf = PurgedGroupTimeSeriesSplit(n_splits=FOLDS, group_gap=20)
    splits = list(gkf.split(y, groups=train['date'].values))
    for fold, (train_indices, test_indices) in enumerate(splits):
        model = create_model(130, 5, encoder)
        X_train, X_test = X[train_indices], X[test_indices]
        y_train, y_test = y[train_indices], y[test_indices]
        # Train on the training split first, then fine-tune on the test split
        model.fit(X_train, y_train,
                  validation_data=(X_test, y_test),
                  epochs=100,
                  batch_size=4096,
                  callbacks=[EarlyStopping('val_auc', mode='max', patience=10, restore_best_weights=True)])
        model.save_weights(f'./model_{SEED}_{fold}.hdf5')
        model.compile(Adam(0.00001), loss='binary_crossentropy')
        model.fit(X_test, y_test, epochs=3, batch_size=4096)
        model.save_weights(f'./model_{SEED}_{fold}_finetune.hdf5')
        oof[test_indices] = model.predict(X_test)
else:
    models = []
    for f in range(FOLDS):
        model = create_model(130, 5, encoder)
        if USE_FINETUNE:
            model.load_weights(f'./model_{SEED}_{f}_finetune.hdf5')
        else:
            model.load_weights(f'./model_{SEED}_{f}.hdf5')
        models.append(model)

# Scoring
from sklearn.metrics import roc_auc_score, roc_curve

score_oof = roc_auc_score(train['action'].values,
                          np.median(np.where(oof >= 0.5, 1, 0).astype(int), axis=1))
print(score_oof)
# The AUC here is consistent with the online (leaderboard) score

# Submission
if not TRAINING:
    f = np.median  # aggregate with the median
    models = models[-2:]
    import janestreet
    env = janestreet.make_env()
    th = 0.503  # decision threshold
    # Iterate over the test samples served by the Jane Street environment
    for (test_df, pred_df) in tqdm(env.iter_test()):
        if test_df['weight'].item() > 0:
            x_tt = test_df.loc[:, features].values
            if np.isnan(x_tt[:, 1:].sum()):
                # Fill missing values with the training-set means
                x_tt[:, 1:] = np.nan_to_num(x_tt[:, 1:]) + np.isnan(x_tt[:, 1:]) * f_mean
            # Average each resp_ output across the models
            pred = np.mean([model(x_tt, training=False).numpy() for model in models], axis=0)
            # Median of the per-target predictions
            pred = f(pred)
            # Turn pred into an action via the threshold
            pred_df.action = np.where(pred >= th, 1, 0).astype(int)
        else:
            pred_df.action = 0
        env.predict(pred_df)
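For intuition, the aggregation inside the loop reduces an ensemble of per-target probabilities to a single action. A small illustration with made-up numbers (two models, one row, five resp-based outputs; not real predictions):

import numpy as np

p1 = np.array([[0.52, 0.48, 0.55, 0.60, 0.51]])  # hypothetical model 1 outputs
p2 = np.array([[0.50, 0.46, 0.57, 0.58, 0.49]])  # hypothetical model 2 outputs
pred = np.mean([p1, p2], axis=0)  # average across models -> shape (1, 5)
pred = np.median(pred)            # median across the 5 targets -> 0.51
action = int(pred >= 0.503)       # compare with the threshold -> action = 1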
