Table of Contents

  • Links to previous articles
  • Before we start
  • Jigsaw Multilingual Toxic Comment Classification
    • Overview of the competition
    • Data
  • 1st place solution
    • Ensembling to mitigate Transformer training variability
    • Pseudo-Labeling
    • Multilingual XLM-Roberta models
    • Train foreign language monolingual Transformer
    • Finetuning pre-trained foreign language monolingual FastText models
    • Post-processing
      • Pseudo-code
  • 4th place solution special
    • Speeding up training
    • Training strategy
    • Post-processing
  • Common ideas in top solutions
    • Two stage training
  • A checklist of contemporary NLP classification techniques
  • Links to previous articles

Links to previous articles

Before we start

Two of my previous posts might be helpful in getting a general understanding of the top solutions to this competition. Please feel free to check them out.

  • Knowledge Distillation clearly explained
  • Common Multilingual Language Modeling methods (M-Bert, LASER, MultiFiT, XLM)

Jigsaw Multilingual Toxic Comment Classification

Use TPUs to identify toxic comments across multiple languages.

Overview of the competition

Jigsaw Multilingual Toxic Comment Classification is the 3rd annual competition organized by the Jigsaw team. It follows Toxic Comment Classification Challenge, the original 2018 competition, and Jigsaw Unintended Bias in Toxicity Classification, which required the competitors to consider biased ML predictions in their new models. This year, the goal is to use English-only training data to run toxicity predictions on foreign languages (tr, ru, it, fr, pt, es).

Kagglers are predicting the probability that a comment is toxic. A toxic comment would receive a 1.0. A benign, non-toxic comment would receive a 0.0. In the test set, all comments are classified as either a 1.0 or a 0.0. The whole test set was visible in this competition.

Data

  • jigsaw-toxic-comment-train.csv: from Jigsaw Toxic Comment Classification Challenge (2018).
           0.0      1.0    total
count   202165    21384   223549
  • jigsaw-unintended-bias-train.csv: from Jigsaw Unintended Bias in Toxicity Classification (2019)
           0.0      1.0     total
count  1789968   112226   1902194
  • validation.csv: comments from Wikipedia talk pages in different non-English languages
  • test.csv: comments from Wikipedia talk pages in different non-English languages

Here are the value counts in the validation data:

      0.0    1.0   total
es   2078    422    2500
it   2012    488    2500
tr   2680    320    3000

The test set contains comments in six non-English languages (tr, ru, it, fr, pt, es), while the validation set covers only three of them (es, it, tr).

1st place solution

Ensembling to mitigate Transformer training variability

Since the performance of Transformer models is impacted heavily by initialization and data order, they went with an iterative blending approach, refining the test set predictions across submissions with a weighted average of the previous best submission and the current model’s predictions. They began with a simple average, and gradually increased the weight of the previous best submission.

Note: The predictions are an exponential moving average of all past model predictions and the current model’s prediction.
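The exact blending weights they used are not public; the following is a minimal sketch of this kind of iterative blending with NumPy, where the probability arrays are random placeholders standing in for real submission files.

import numpy as np

def blend_with_previous_best(prev_best, current_preds, weight_prev=0.5):
    """Weighted average of the previous best submission and the current model's
    predictions. Raising weight_prev over successive submissions makes the blend
    behave like an exponential moving average of all past model predictions."""
    return weight_prev * prev_best + (1.0 - weight_prev) * current_preds

# hypothetical usage: placeholder arrays stand in for real prediction files
prev_best = np.random.rand(1000)      # best blended test-set predictions so far
current_preds = np.random.rand(1000)  # predictions of the newly trained model
new_submission = blend_with_previous_best(prev_best, current_preds, weight_prev=0.7)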

Pseudo-Labeling

They observed a performance improvement when they used test-set predictions as training data, the intuition being that it helps models learn the test-set distribution. Using all test-set predictions as soft labels worked better than any other version of pseudo-labelling (e.g., hard labels, confidence-thresholded PLs, etc). Towards the end of the competition, they discovered a minor but material boost in LB when they upsampled the PLs.
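A minimal sketch of how soft pseudo-labels can be folded back into the training data with pandas; the column names (comment_text, pred, toxic), the example rows, and the upsampling factor are illustrative assumptions, not taken from the write-up.

import pandas as pd

# test_df stands in for the test comments plus the current blended predictions
test_df = pd.DataFrame({
    "comment_text": ["yorum bir", "commentaire deux"],
    "pred": [0.87, 0.03],
})

# soft labels: keep the predicted probability itself as the training target
pseudo_df = test_df.rename(columns={"pred": "toxic"})[["comment_text", "toxic"]]

# optional upsampling of the pseudo-labelled rows, which the authors found gave a
# small but material LB boost (the factor of 2 here is an arbitrary placeholder)
pseudo_df = pd.concat([pseudo_df] * 2, ignore_index=True)

# BCE loss accepts soft targets, so pseudo_df can simply be concatenated with the
# translated 2018 training data before fine-tuning, e.g.:
# combined = pd.concat([train_df, pseudo_df], ignore_index=True)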

Multilingual XLM-Roberta models

As with most teams, they began with a vanilla XLM-Roberta model, incorporating translations of the 2018 dataset in the 6 test-set languages as training data. They used a vanilla classification head on the [CLS] token of the last layer with the Adam optimizer and a binary cross-entropy loss function, and finetuned the entire model with a low learning rate. Given that Transformer models with several hundred million trainable weights were being put to the relatively simple task of making a binary prediction, they didn't spend much time on hyper-parameter optimization, architectural tweaks, or pre-processing.
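The write-up does not include code; below is a minimal sketch of such a vanilla setup with Hugging Face Transformers and PyTorch. The learning rate, max length, and the toy training step are assumptions, not the team's actual values.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ToxicClassifier(nn.Module):
    """XLM-R with a vanilla classification head on the first (<s>/[CLS]) token."""
    def __init__(self, model_name="xlm-roberta-base"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # toxicity logit from the first token

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = ToxicClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # low learning rate (assumed value)
criterion = nn.BCEWithLogitsLoss()                         # also accepts soft pseudo-labels

# toy training step on a single French comment
batch = tokenizer(["ceci est un commentaire"], padding=True, truncation=True,
                  max_length=192, return_tensors="pt")
labels = torch.tensor([[0.0]])
loss = criterion(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward()
optimizer.step()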

Train foreign language monolingual Transformer

Inspired by the MultiFiT paper, they observed a dramatic performance boost when they used pretrained foreign language monolingual Transformer models from Hugging Face for the test-set languages (e.g., CamemBERT for French samples, RuBERT for Russian, BERTurk for Turkish, BETO for Spanish, etc.).

They finetuned models for each of the 6 languages:

  • Combining translations of the 2018 Toxic Comments with pseudo-labels for test-set samples and hard labels for validation-set samples in that specific language. These pseudo-labels initially come from the XLM-R multilingual models, and are then continuously refined by the monolingual Transformer models.

Note: A training run would, for example, have 10K test-set samples + 70K subsampled 2018 translation samples + 2.5K validation samples. These models are quite good at few-shot learning, so <100K samples is sufficient.

  • Training the corresponding monolingual model, predicting the same samples, then blending the predictions back with the “main branch” of all predictions. The process was synergistic: cycling cross-lingual model -> monolingual models -> cross-lingual model, etc. led to continual performance improvements.

For each model run, they would reload weight initializations from the pretrained models to prevent overfitting. In other words, the continuing improvements of the predictions were being driven by refinements in the pseudo-labels they were providing to the models as training data.

For a given monolingual model, predicting only test-set samples in that language worked best. Translating test-set samples in other languages to the model’s language and predicting them worsened performance.

Finetuning pre-trained foreign language monolingual FastText models

After they exhausted the Hugging Face monolingual model library, they trained monolingual FastText bidirectional GRU models on the 2018 Toxic Comments, using pretrained embeddings for the test-set languages, to continue refining the test-set predictions (albeit with a lower weight when combined with the main branch of predictions). This gave a small but meaningful performance boost.
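A minimal PyTorch sketch of a bidirectional GRU classifier over pretrained FastText vectors; the random embedding matrix, vocabulary size, and hidden size below are placeholders, not the team's settings.

import torch
import torch.nn as nn

class FastTextBiGRU(nn.Module):
    """Bidirectional GRU over (frozen) pretrained FastText word vectors."""
    def __init__(self, embedding_matrix, hidden_size=128):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.gru = nn.GRU(emb_dim, hidden_size, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_size, 1)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)
        out, _ = self.gru(emb)       # (batch, seq_len, 2 * hidden_size)
        pooled, _ = out.max(dim=1)   # max-pool over the sequence
        return self.head(pooled)     # toxicity logit

# random matrix standing in for real FastText vectors (e.g. cc.<lang>.300.vec)
embedding_matrix = torch.randn(30000, 300)
model = FastTextBiGRU(embedding_matrix)
logits = model(torch.randint(0, 30000, (4, 64)))  # batch of 4 token-id sequences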

Post-processing

Intuition:

They consider the trend of subsequent submissions of a specific language (e.g. Russian) for each example in the test dataset. If the trend of that example is positive, they nudge the example further in the positive direction and vice versa.

They measure the trend by taking the differences of all subsequent submissions for the specific language and averaging those differences. The nudge that they give to the new submission is based on a predefined weight, typically a weight of 1 or 1.5.

Pseudo-code

Given the following:

weight = predefined weight (typically 1 or 1.5)
pred_best = current best predictions on LB
diff_avg = average of differences of consecutive subs (trend)

Then for each example in test of the specific language (e.g. Turkish):

if diff_avg < 0:  # negative trend
    pred_new = (1 + weight * diff_avg) * pred_best             # nudge downwards
else:             # positive trend
    pred_new = (1 - weight * diff_avg) * pred_best + diff_avg  # nudge upwards
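Here is a minimal pandas sketch that applies this nudge per language, assuming a submission frame with lang and toxic columns and a precomputed diff_avg per language; all numeric values are placeholders.

import pandas as pd

weight = 1.0  # predefined weight, typically 1 or 1.5
diff_avg_by_lang = {"tr": 0.004, "ru": -0.002}  # placeholder trend values

sub = pd.DataFrame({
    "lang": ["tr", "tr", "ru"],
    "toxic": [0.91, 0.12, 0.55],  # current best predictions on LB
})

for lang, diff_avg in diff_avg_by_lang.items():
    mask = sub["lang"] == lang
    pred_best = sub.loc[mask, "toxic"]
    if diff_avg < 0:  # negative trend: nudge downwards
        sub.loc[mask, "toxic"] = (1 + weight * diff_avg) * pred_best
    else:             # positive trend: nudge upwards
        sub.loc[mask, "toxic"] = (1 - weight * diff_avg) * pred_best + diff_avg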

4th place solution special

Speeding up training

  • One way is to downsample the negative samples, which gives a more balanced dataset and a smaller dataset at the same time. If you keep only as many negative samples as positive ones, the dataset is reduced roughly 5x, and so is the time to run one epoch.

  • Another speedup comes from padding by batch. The main idea is to limit the amount of padding to what is necessary for the current batch: inputs are padded to the length of the longest input in the batch rather than to a fixed length. This technique has been used in previous competitions and accelerates training significantly. They refined the idea by sorting the samples by length, so that the samples in a given batch have similar lengths, which reduces the need for padding even further; if all inputs in a batch have the same length, there is no padding at all. Since the samples are sorted, they cannot be shuffled individually during training, so the batches themselves are shuffled instead. This yields an extra 2x speedup compared to the original padding by batch.

To learn how to implement padding by batch, read the sequence bucketing (dynamic padding) section of my previous summary on Tweet Sentiment Extraction; a minimal sketch is also given below.
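The sketch below illustrates the length-sorted, pad-by-batch idea in plain PyTorch; the token-id sequences, batch size, and pad id are made-up placeholders.

import random
import torch

def make_sorted_batches(token_id_lists, batch_size=32, pad_id=0):
    """Sort samples by length, cut them into batches, pad each batch only to its
    own longest sample, then shuffle the batches (not the individual samples)."""
    order = sorted(range(len(token_id_lists)), key=lambda i: len(token_id_lists[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(token_id_lists[i]) for i in idx)
        padded = [token_id_lists[i] + [pad_id] * (max_len - len(token_id_lists[i]))
                  for i in idx]
        batches.append((torch.tensor(padded), idx))
    random.shuffle(batches)  # shuffle batches instead of individual samples
    return batches

# hypothetical usage with variable-length token-id sequences
samples = [[5, 8, 2], [5, 9, 1, 4, 7, 2], [5, 3, 2, 2, 6]]
for batch, original_indices in make_sorted_batches(samples, batch_size=2):
    print(batch.shape)  # each batch is padded only to its own longest sample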

Training strategy

One main ingredient of their good result is a stepwise fine-tuning towards single-language models for tr, it and es. Their training pipeline is described below.

As input data, they used the English 2018 Toxic Comments and its six translations, available as public datasets, which combined are roughly 1.4M comments. Training a model directly on the combined dataset is not optimal, since each comment appears 7x in each epoch (albeit in different languages) and the model overfits on that. So they divided the combined dataset into 7 stratified parts, where each part contains each comment only once. For each part they then fine-tuned a Transformer in a 3-step manner:

  • Step 1: fine-tune on all 7 languages for 2 epochs
  • Step 2: fine-tune only on the full validation.csv, which contains only tr, it and es
  • Step 3: fine-tune separately on each language in the validation data, resulting in 3 models

They then used the step 1 model for predicting ru, the step 2 model for predicting pt and fr, and the respective step 3 models for tr, it and es. Using the step 2 model for pt and fr gave a significant boost compared to using the step 1 model, most likely due to the language similarity between it, es, fr and pt.
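A minimal sketch of this 3-step schedule; finetune is a hypothetical helper that continues training a model on a dataframe, the epoch counts for steps 2 and 3 are assumptions, and the lang column comes from validation.csv.

import copy

def finetune(model, df, epochs):
    """Hypothetical helper: continues training `model` on dataframe `df`."""
    ...
    return model

def stepwise_finetune(base_model, part_df, valid_df):
    """3-step fine-tuning schedule applied to each of the 7 stratified parts."""
    # Step 1: fine-tune on all 7 languages for 2 epochs -> later used to predict ru
    step1 = finetune(base_model, part_df, epochs=2)
    # Step 2: fine-tune only on validation.csv (tr, it, es) -> used to predict pt and fr
    step2 = finetune(copy.deepcopy(step1), valid_df, epochs=1)
    # Step 3: fine-tune separately per validation language -> used for tr, it and es
    step3 = {lang: finetune(copy.deepcopy(step2), valid_df[valid_df["lang"] == lang], epochs=1)
             for lang in ["tr", "it", "es"]}
    return step1, step2, step3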

Post-processing

The competition metric is sensitive to the global ranking of all predictions, not to per-language rankings. So they made sure the languages had the correct relation to each other by scaling the predictions of each language individually:

test_df.loc[test_df["lang"] == "es", "toxic"] *= 1.06
test_df.loc[test_df["lang"] == "fr", "toxic"] *= 1.04
test_df.loc[test_df["lang"] == "it", "toxic"] *= 0.97
test_df.loc[test_df["lang"] == "pt", "toxic"] *= 0.96
test_df.loc[test_df["lang"] == "tr", "toxic"] *= 0.98

I tried out this post-processing and it boosted my rank from 64th to top 30.

Common ideas in top solutions

I read through the top 10 solutions and they are indeed roughly the same. Unlike Tweet Sentiment Extraction, where every top solution has its own novel and fancy model structure, the top solutions in this competition mostly share the following ideas:

  • Pseudo-Labeling (using soft labels).

  • Use monolingual Transformer models.

  • Two stage training.

  • Post-processing (each team has a different post-processing strategy).

Two stage training

This is the last common idea that I haven't explained yet. The main idea is the following.

  • One-stage training means training on the same data throughout. This policy consistently gets lower scores.

  • Two-stage training means training on two different datasets in sequence: e.g., first train on the Toxic Comment Classification dataset (plus the Unintended Bias in Toxicity dataset), then train on the 8k validation data. This policy consistently gets higher scores. A minimal sketch follows below.
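This sketch of two-stage training assumes that finetune is a hypothetical helper that continues training a model on a dataframe and that the competition CSV files listed in the Data section are available locally.

import pandas as pd

def finetune(model, df, epochs):
    """Hypothetical helper: continue training `model` on dataframe `df`."""
    ...
    return model

# Stage 1 data: the large English datasets (and, in practice, their translations)
toxic_2018_df = pd.read_csv("jigsaw-toxic-comment-train.csv")
bias_df = pd.read_csv("jigsaw-unintended-bias-train.csv")
stage1_df = pd.concat([toxic_2018_df, bias_df], ignore_index=True)

# Stage 2 data: the 8k multilingual validation.csv, which matches the test distribution
stage2_df = pd.read_csv("validation.csv")

model = None  # stands in for a pretrained multilingual Transformer

model = finetune(model, stage1_df, epochs=2)  # stage 1: fit on the big English data
model = finetune(model, stage2_df, epochs=1)  # stage 2: adapt to the target languages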

A checklist of contemporary NLP classification techniques

The 1st place author shared a checklist of contemporary NLP classification techniques. The following is a selection of what did not work for them in this competition, but I still believe these ideas are worth trying in any competition or project.

  • Document-level embedders (e.g., LASER)
  • Further MLM pretraining of Transformer models using task data
  • Alternative ensembling mechanisms (rank-averaging, stochastic weight averaging)
  • Alternative loss functions (e.g., focal loss, histogram loss)
  • Alternative pooling mechanisms for the classification head (e.g., max-pool CNN across tokens, using multiple hidden layers etc.)
  • Non-FastText pretrained embeddings (e.g., Flair, GloVe, BPEmb)
  • Freeze-finetuning for the Transformer models
  • Regularization (e.g., multisample dropout, input mixup, manifold mixup, sentencepiece-dropout)
  • Backtranslation as data augmentation
  • English translations as train/test-time augmentation
  • Self-distilling to relabel the 2018 data
  • Adversarial training by perturbing the embeddings layer using FGM
  • Multi-task learning
  • Temperature scaling on pseudo-labels
  • Semi-supervised learning using the test data
  • Composing two models into an encoder-decoder model
  • Use of translation focused pretrained models (e.g., mBart)

Reference:

  • 1st place solution: https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160862, https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160986
  • 4th place solution: https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160980
  • 6th place solution: https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/161095
  • 11th place solution: https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/160961
  • Jigsaw Multilingual: Quick EDA & TPU Modeling: https://www.kaggle.com/ipythonx/jigsaw-multilingual-quick-eda-tpu-modeling
  • Jigsaw Train Multilingual Coments (Google API) For multilingual models training: https://www.kaggle.com/miklgr500/jigsaw-train-multilingual-coments-google-api

The following are monolingual language models from the Hugging Face community that were mentioned in the 6th place solution:

  • es:
    • https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased
    • https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased
    • https://github.com/dccuchile/beto
  • fr:
    • https://huggingface.co/camembert/camembert-large
    • https://huggingface.co/camembert-base
    • https://huggingface.co/flaubert/flaubert-large-cased
    • https://huggingface.co/flaubert/flaubert-base-uncased
    • https://huggingface.co/flaubert/flaubert-base-cased
  • tr:
    • (128k vocabulary) https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased
    • (128k vocabulary) https://huggingface.co/dbmdz/bert-base-turkish-128k-cased
    • (32k vocabulary) https://huggingface.co/dbmdz/bert-base-turkish-uncased
    • (32k vocabulary) https://huggingface.co/dbmdz/bert-base-turkish-cased
  • it:
    • https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased
    • https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
    • https://huggingface.co/dbmdz/bert-base-italian-uncased
    • https://huggingface.co/dbmdz/bert-base-italian-cased
  • pt:
    • https://huggingface.co/neuralmind/bert-large-portuguese-cased
    • https://huggingface.co/neuralmind/bert-base-portuguese-cased
  • ru:
    • https://huggingface.co/DeepPavlov/bert-base-bg-cs-pl-ru-cased
    • https://huggingface.co/DeepPavlov/rubert-base-cased

Links to previous articles
