Model Overview

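Pretrained NLP models have grown steadily in parameter count over the past few years, and limited compute makes them hard to deploy in production. DistilBERT was proposed as a compressed version of BERT, currently the most popular pretrained model: it retains 97% of BERT's performance while being 40% smaller and 60% faster at inference.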

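To exploit the inductive biases that the larger model learns during pretraining, the authors introduce a triple loss that combines a language modeling loss, a distillation loss, and a cosine-distance loss.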

Model Improvements

Knowledge Distillation

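Knowledge distillation is a model compression technique in which a small model (the student) is trained to reproduce the outputs of a large model (the teacher).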

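In supervised learning, a classification model is usually trained to maximize the probability of the correct label. A standard training objective therefore minimizes the cross-entropy between the model's predicted distribution and the one-hot empirical distribution of the training labels. A model that performs well assigns a high probability to the correct class and near-zero probabilities to all other classes. Some of those near-zero probabilities are nevertheless larger than others, and these differences partly reflect the model's generalization ability and how well it will perform on the test set.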

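When the student is trained to mimic the teacher's output distribution, the loss is a cross-entropy over soft targets, $L_{ce}=-\sum_i t_i \log(s_i)$, where $t_i$ and $s_i$ are the probabilities that the teacher and the student assign to class $i$ (probabilities, not raw logits). A hard label is a one-hot encoding in which each example belongs to exactly one class; a soft label gives a probability for every class. The standard softmax is replaced by a softmax with temperature, $p_i=\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$, where $z_i$ is the model's score for class $i$. $T$ controls how smooth the output distribution is: as $T$ increases, the gaps between classes shrink; as $T$ decreases, the gaps widen.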
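The effect of the temperature and the soft-target loss can be illustrated with a minimal pure-Python sketch. The function names and the example scores below are made up for illustration and are not taken from the paper's code.

```python
import math

def softmax_temperature(scores, T=1.0):
    """Softmax with temperature T over a list of class scores z_i."""
    # Subtracting the max score avoids overflow and leaves the result unchanged.
    m = max(scores)
    exps = [math.exp((z - m) / T) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores, T=2.0):
    """Soft-target cross-entropy -sum_i t_i * log(s_i), both sides at temperature T."""
    t = softmax_temperature(teacher_scores, T)
    s = softmax_temperature(student_scores, T)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

# Hypothetical scores for three classes.
teacher_scores = [4.0, 1.0, 0.5]
student_scores = [3.0, 1.5, 0.2]

sharp = softmax_temperature(teacher_scores, T=1.0)    # standard softmax
smooth = softmax_temperature(teacher_scores, T=4.0)   # flatter distribution
loss = distillation_loss(teacher_scores, student_scores, T=2.0)
```

With a larger `T` the top class receives less probability mass, so the relative sizes of the small probabilities become more visible to the student.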

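During training, the student and the teacher use the same temperature $T$ (with $T>1$). At inference time, $T$ is set to 1, which recovers the standard softmax.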

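The final training objective is a linear combination of the distillation loss $L_{ce}$ and a supervised training loss, which in this paper is the masked language modeling loss $L_{mlm}$. On top of this, a cosine embedding loss $L_{cos}$ is added, computed from the cosine similarity between the student's and the teacher's hidden state vectors, so that the two sets of vectors tend to point in the same direction.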
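As a rough sketch of how the three terms fit together, the snippet below assumes the common $1-\cos$ form for the cosine embedding loss and uses hypothetical weights of 1.0 for each term; the notes above do not specify the weights, and the loss values are made up.

```python
import math

def cosine_embedding_loss(student_hidden, teacher_hidden):
    """1 - cos(s, t): zero when the two vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(l_ce, l_mlm, l_cos, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    """Linear combination of the three losses; the weights are hypothetical."""
    return w_ce * l_ce + w_mlm * l_mlm + w_cos * l_cos

# Parallel vectors give a loss near 0; orthogonal vectors give a loss of 1.
aligned = cosine_embedding_loss([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_embedding_loss([1.0, 0.0], [0.0, 1.0])

# Made-up values for L_ce and L_mlm, combined with the orthogonal case above.
total = triple_loss(l_ce=0.8, l_mlm=2.1, l_cos=orthogonal)
```

Note that the cosine term depends only on direction, not on vector length, which is why the parallel vectors above incur no penalty even though their norms differ.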

Parameter Initialization

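Because the student and the teacher share the same hidden dimensionality, the student is initialized from the teacher's parameters, taking one teacher layer out of every two.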
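One way to picture this initialization is the sketch below, which copies every second layer of a 12-layer teacher into a 6-layer student. The layer representation and the choice to start from layer 0 are assumptions made for illustration; the notes do not say which layer indices are kept.

```python
import copy

def init_student_from_teacher(teacher_layers, keep_every=2):
    """Copy every `keep_every`-th teacher layer into a new student layer list."""
    # deepcopy so that later updates to the student leave the teacher untouched.
    return [copy.deepcopy(layer) for layer in teacher_layers[::keep_every]]

# Hypothetical teacher: 12 layers, each with a tiny placeholder weight list.
teacher_layers = [
    {"name": f"layer_{i}", "weights": [float(i)] * 3} for i in range(12)
]

student_layers = init_student_from_teacher(teacher_layers)
student_names = [layer["name"] for layer in student_layers]

# Changing a student weight does not affect the teacher's copy.
student_layers[1]["weights"][0] = -1.0
```

This only works because both models use the same hidden size, so a teacher layer's weight matrices fit the student layer without reshaping.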

Distillation

DistilBERT is distilled on very large batches (up to 4,000 examples per batch) by using gradient accumulation. It also uses dynamic masking and drops the NSP (next sentence prediction) objective.
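Gradient accumulation can be sketched on a toy one-parameter regression: gradients from several small micro-batches are summed, and the parameter is updated once, as if the whole batch had been processed at the same time. The model, data, and learning rate below are hypothetical and unrelated to the actual DistilBERT training code.

```python
def grad_sum(w, batch):
    """Sum of per-example gradients of 0.5 * (w * x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch)

def accumulated_step(w, data, micro_batch_size, lr):
    """One optimizer step whose gradient is accumulated over micro-batches."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        total += grad_sum(w, micro_batch)   # accumulate, no update yet
    grad = total / len(data)                # average over the effective batch
    return w - lr * grad                    # single update for the whole batch

# Toy data following y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w_accumulated = accumulated_step(0.0, data, micro_batch_size=2, lr=0.01)
w_full_batch = accumulated_step(0.0, data, micro_batch_size=len(data), lr=0.01)
```

Both calls produce the same update, which is the point of the technique: the effective batch size can exceed what fits in memory in a single forward and backward pass.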

Data and Compute Power

DistilBERT is trained on the same corpus as BERT: English Wikipedia and the Toronto Book Corpus. Training took roughly 90 hours on eight 16GB V100 GPUs.

Model Architecture

The DistilBERT student has the same general architecture as BERT, except that the token-type embeddings and the pooler are removed and the number of layers is halved.

References

Paper: https://arxiv.org/abs/1910.01108

Code: https://github.com/huggingface/transformers


DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter（2019-10-2）