Table of Contents

  • Before we start
  • Surface-Level Patterns in Attention
  • Probing Individual Attention Heads
  • Probing Attention Head Combinations
  • Clustering Attention Heads

Before we start

In this post, I focus mainly on the conclusions the authors reach in the paper, which I think are worth sharing.

In this paper, the authors study the attention maps of a pre-trained BERT model. Their analysis covers all 144 attention heads of BERT-base (12 layers with 12 heads each).
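As a concrete starting point, here is a minimal sketch (not the authors' code) of how such attention maps can be pulled out of a pre-trained BERT-base with the HuggingFace `transformers` library; a single forward pass returns all 12 layers × 12 heads = 144 maps.

```python
import torch
from transformers import BertModel, BertTokenizer

# Load a pre-trained BERT-base and ask it to return its attention maps.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# A tuple of 12 tensors (one per layer), each of shape
# (batch_size, num_heads, seq_len, seq_len) -> 12 x 12 = 144 attention heads.
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)
```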

Surface-Level Patterns in Attention

  1. There are heads that specialize to attending heavily on the next or previous token, especially in earlier layers of the network.

  2. A substantial amount of BERT’s attention focuses on a few tokens. For example, over half of BERT’s attention in layers 6-10 focuses on [SEP]. One possible explanation is that [SEP] is used to aggregate segment-level information which can then be read by other heads.

    However, if this explanation were true, they would expect attention heads processing [SEP] to attend broadly over the whole segment to build up these representations. Instead, they attend almost entirely (more than 90%) to themselves and the other [SEP] token.

    They speculate that attention over these special tokens might be used as a sort of “no-op” when the attention head’s function is not applicable.

  3. Some attention heads, especially in lower layers, have very broad attention. The output of these heads is roughly a bag-of-vectors representation of the sentence.

  4. They also measured the attention entropies for all heads using only the attention from the [CLS] token. The last layer has a high entropy from [CLS], indicating very broad attention. This finding makes sense given that the representation of the [CLS] token is used as input for the “next sentence prediction” task during pre-training, so it attends broadly in the last layer to aggregate a representation of the whole input. A small sketch of how such surface-level statistics can be computed is given after this list.
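The snippet below is a hedged sketch of the kind of surface-level statistics discussed above: the average attention each head pays to [SEP] and the average entropy of each head's attention distribution. It reuses the `tokenizer`, `inputs`, and `attentions` variables from the first snippet; the paper averages these statistics over a large corpus rather than over one sentence.

```python
import torch

# Positions of [SEP] tokens in the example sentence from the previous snippet.
sep_positions = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero(as_tuple=True)[0]

for layer, attn in enumerate(attentions):
    attn = attn[0]                                               # (num_heads, seq_len, seq_len)
    # Average attention mass each head puts on [SEP], over all source positions.
    sep_attn = attn[:, :, sep_positions].sum(-1).mean(-1)        # (num_heads,)
    # Average entropy of each head's attention distribution.
    entropy = -(attn * torch.log(attn + 1e-9)).sum(-1).mean(-1)  # (num_heads,)
    print(f"layer {layer + 1}: mean attn to [SEP] = {sep_attn.mean():.2f}, "
          f"mean entropy = {entropy.mean():.2f}")
```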

Probing Individual Attention Heads

  1. There is no single attention head that does well at syntax “overall”.

  2. They do find that certain attention heads specialize to specific dependency relations, sometimes achieving high accuracy.

  3. Despite not being explicitly trained on these tasks, BERT’s attention heads perform remarkably well, illustrating how syntax-sensitive behavior can emerge from self-supervised training alone.

  4. While the similarity between machine-learned attention weights and human-defined syntactic relations is striking, they note that these are the relations on which attention heads happen to do particularly well. They would not say that individual attention heads capture dependency structure as a whole. A toy sketch of this per-relation evaluation is given after this list.
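Below is a toy sketch of the per-relation evaluation idea: treat the most-attended-to word as a head's prediction of the syntactic head and score it against gold dependency annotations. The gold heads here are placeholders and word-piece/word alignment is ignored for simplicity; this illustrates the setup, it is not the authors' evaluation code.

```python
import torch

def head_accuracy(attn_head, gold_heads):
    """attn_head: (seq_len, seq_len) attention map of a single head for one sentence.
    gold_heads: gold_heads[i] is the index of word i's syntactic head (None for the root)."""
    predictions = attn_head.argmax(dim=-1)   # most-attended-to position for each word
    correct = total = 0
    for i, gold in enumerate(gold_heads):
        if gold is None:
            continue
        correct += int(predictions[i].item() == gold)
        total += 1
    return correct / max(total, 1)

# Hypothetical example: score one head (zero-indexed layer 7, head 9) against
# made-up gold annotations; in practice, use real dependency parses.
attn_head = attentions[7][0, 9]
gold_heads = [None] * attn_head.shape[0]     # placeholder gold annotations
print(head_accuracy(attn_head, gold_heads))
```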

Probing Attention Head Combinations

The probing classifiers are essentially simple graph-based dependency parsers: given an input word, the classifier produces a probability distribution over the other words in the sentence, indicating how likely each of them is to be the syntactic head of the current word.
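A simplified reconstruction of such an attention-only probe is sketched below: the score for "word j is the syntactic head of word i" is a learned linear combination of all 144 heads' attention weights in both directions, followed by a softmax over candidate heads. The exact parameterization differs from the paper's probe; this is only meant to convey the idea.

```python
import torch
import torch.nn as nn

class AttentionOnlyProbe(nn.Module):
    """Scores candidate syntactic heads as a learned mix of BERT's attention maps."""
    def __init__(self, num_heads_total=144):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_heads_total))      # weight per head, word -> candidate
        self.w_rev = nn.Parameter(torch.zeros(num_heads_total))  # weight per head, candidate -> word

    def forward(self, attn):
        # attn: (num_heads_total, seq_len, seq_len) attention maps for one sentence
        scores = torch.einsum("h,hij->ij", self.w, attn) + \
                 torch.einsum("h,hij->ij", self.w_rev, attn.transpose(1, 2))
        # Row i is a distribution over candidate syntactic heads for word i.
        return torch.log_softmax(scores, dim=-1)

# Stack the 144 maps from the first snippet and run the (untrained) probe.
all_heads = torch.cat([a[0] for a in attentions], dim=0)   # (144, seq_len, seq_len)
probe = AttentionOnlyProbe()
log_probs = probe(all_heads)                               # (seq_len, seq_len)
```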

  1. Their results from probing both individual attention heads and combinations of them suggest that BERT learns some aspects of syntax purely as a by-product of self-supervised training.

  2. This adds to a growing body of work indicating that indirect supervision from rich pre-training tasks like language modeling can also produce models that are sensitive to language’s hierarchical structure.

Clustering Attention Heads

They compute the distance between every pair of attention heads. Formally, the distance between two heads $\mathrm{H}_i$ and $\mathrm{H}_j$ is measured as

$$\sum_{\text{token} \in \text{data}} JS\big(\mathrm{H}_i(\text{token}), \mathrm{H}_j(\text{token})\big)$$

where $JS$ is the Jensen-Shannon divergence between attention distributions. Using these distances, they visualize the attention heads by applying multidimensional scaling to embed each head in two dimensions, such that the Euclidean distance between embeddings reflects the Jensen-Shannon distance between the corresponding heads as closely as possible.
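A hedged sketch of this computation, using `scipy`'s Jensen-Shannon distance and scikit-learn's MDS, is shown below. It operates on the stacked `all_heads` tensor from the probing snippet and on a single sentence, whereas the paper accumulates the divergences over tokens from a large corpus. The clustering observations then follow.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

# (144, seq_len, seq_len) attention maps for one example sentence (see above).
maps = all_heads.numpy()
num_heads, seq_len, _ = maps.shape

# Pairwise head distances: sum of JS divergences over token positions.
dist = np.zeros((num_heads, num_heads))
for i in range(num_heads):
    for j in range(i + 1, num_heads):
        # scipy returns the JS *distance* (sqrt of the divergence), hence the square
        d = sum(jensenshannon(maps[i, t], maps[j, t]) ** 2 for t in range(seq_len))
        dist[i, j] = dist[j, i] = d

# Embed each head in 2D so Euclidean distances approximate the JS-based distances.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
```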

  1. Heads within the same layer are often fairly close to each other in this embedding, meaning that they have similar attention distributions. This finding is a bit surprising given that Tu et al. (2018) show that encouraging attention heads to have different behaviors can improve Transformer performance at machine translation.

  2. Many attention heads can be pruned away without substantially hurting model performance. Interestingly, the important attention heads that remain after pruning tend to be the ones with identified behaviors. A small sketch of masking out heads follows below.
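For the mechanical side of pruning, the `head_mask` argument of HuggingFace's `BertModel` can zero out chosen heads at inference time, as in the hedged sketch below; the paper's pruning observation comes from experiments on downstream tasks, which this snippet does not reproduce.

```python
import torch

# head_mask: (num_layers, num_heads); 1.0 keeps a head, 0.0 masks it out.
head_mask = torch.ones(12, 12)
head_mask[6:, :] = 0.0            # e.g. mask every head in the last six layers

with torch.no_grad():
    pruned_outputs = model(**inputs, head_mask=head_mask)
print(pruned_outputs.last_hidden_state.shape)
```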


Reference:

  • What Does BERT Look At? An Analysis of BERT’s Attention: https://arxiv.org/pdf/1906.04341.pdf
