This post collects the reasons why BERT-derived sentence embeddings perform poorly on semantic-similarity tasks, together with the relevant explanations (anisotropy, representation degeneration, cone-shaped embedding space), and also covers the connection between contrastive learning and anisotropy discussed in SimCSE.

It mainly records the relevant papers and their main arguments, kept here for reference.

Table of Contents

Problem Statement:

Explanations from the Related Papers:

1. Representation Degeneration Problem in Training Natural Language Generation Models

2. BERT-flow, Chapter 2: Understanding the Sentence Embedding Space of BERT

2.1 The Connection between Semantic Similarity and BERT Pre-training:

2.2 Anisotropic Embedding Space Induces Poor Semantic Similarity:

3. SimCSE, Chapter 5: Connection to Anisotropy

4. Alignment and Uniformity

Related Papers:



Problem Statement:

Why do the BERT-induced sentence embeddings perform so poorly at retrieving semantically similar sentences?

Reimers and Gurevych (2019) demonstrate that such BERT sentence embeddings lag behind the state-of-the-art sentence embeddings in terms of semantic similarity. On the STS-B dataset, BERT sentence embeddings are even less competitive to averaged GloVe (Pennington et al., 2014) embeddings, which is a simple and non-contextualized baseline proposed several years ago.
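As a concrete picture of the setup being criticized, here is a minimal sketch of the naive pipeline: mean-pool BERT's last-layer context embeddings into a sentence vector and score a pair by cosine similarity. The model name and example sentences are illustrative choices, not taken from the papers above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(sentence: str) -> torch.Tensor:
    """Average the last-layer context embeddings over non-padding tokens."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a = sentence_embedding("A man is playing a guitar.")
b = sentence_embedding("Someone is playing an instrument.")
print(torch.cosine_similarity(a, b).item())
```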

Explanations from the Related Papers:

1. Representation Degeneration Problem in Training Natural Language Generation Models

This paper introduces the representation degeneration problem (anisotropy).

We observe that when training a model for natural language generation tasks through likelihood maximization with the weight tying trick, especially with big training datasets, most of the learnt word embeddings tend to degenerate and be distributed into a narrow cone, which largely limits the representation power of word embeddings.
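The cone can be made visible with a quick measurement: in an isotropic space the average cosine similarity between random word pairs is near zero, while a narrow cone pushes it well above zero. A minimal sketch, using BERT's input embedding table merely as a convenient example of a learned word-embedding matrix (the paper's observation concerns NLG models trained with weight tying):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
emb = model.get_input_embeddings().weight.detach()   # (V, d) embedding table

def mean_pairwise_cosine(emb: torch.Tensor, n_pairs: int = 100_000) -> float:
    """Monte-Carlo estimate of the average cosine between random word pairs."""
    i = torch.randint(0, emb.size(0), (n_pairs,))
    j = torch.randint(0, emb.size(0), (n_pairs,))
    return torch.cosine_similarity(emb[i], emb[j], dim=-1).mean().item()

print(mean_pairwise_cosine(emb))  # markedly above 0 suggests a narrow cone
```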

......

2. BERT-flow, Chapter 2: Understanding the Sentence Embedding Space of BERT

This chapter explains the connection between BERT-style pretraining tasks and semantic similarity, and analyzes why the semantic-similarity performance is poor.

2.1 The Connection between Semantic Similarity and BERT Pre-training:

  • The similarity between BERT sentence embeddings can be reduced to the similarity between BERT context embeddings, $h_c^\top h_{c'}$. However, as shown in Equation 1, the pretraining of BERT does not explicitly involve the computation of $h_c^\top h_{c'}$. Therefore, we can hardly derive a mathematical formulation of what $h_c^\top h_{c'}$ exactly represents.
  • Co-Occurrence Statistics as the Proxy for Semantic Similarity: roughly speaking, it is semantically meaningful to compute the dot product between a context embedding and a word embedding, since maximum-likelihood training pushes $h_c^\top w_x$ toward $\log p(x \mid c)$, which equals $\mathrm{PMI}(x, c) + \log p(x)$, and pointwise mutual information is a standard proxy for semantic similarity (see the sketch after this list).
  • Higher-Order Co-Occurrence Statistics as Context-Context Semantic Similarity: during pretraining, the semantic relationship between two contexts c and c′ could be inferred and reinforced through their connections to words.
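The dot product in question is easy to inspect directly: with the weight-tying trick, BERT's masked-LM logit for a word x in context c is, up to the MLM transform layer and bias terms, exactly $h_c^\top w_x$. A rough sketch (the transform layer is deliberately omitted here, so the resulting ranking is only approximate):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("The cat sat on the [MASK].", return_tensors="pt")
with torch.no_grad():
    h = bert(**inputs).last_hidden_state[0]            # context embeddings h_c
pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
w = bert.get_input_embeddings().weight                 # tied word embeddings w_x
scores = h[pos] @ w.T                                  # h_c^T w_x for every word x
print(tok.convert_ids_to_tokens(scores.topk(5).indices.tolist()))
```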

2.2 Anisotropic Embedding Space Induces Poor Semantic Similarity:

  • To investigate the underlying problem of the failure, we use word embeddings as a surrogate because words and contexts share the same embedding space. If the word embeddings exhibit some misleading properties, the context embeddings will also be problematic, and vice versa.
  • Gao et al. (2019) and Wang et al. (2020) have pointed out that, for language modeling, the maximum likelihood training with Equation 1 usually produces an anisotropic word embedding space. "Anisotropic" means word embeddings occupy a narrow cone in the vector space.
  • Observation 1: Word Frequency Biases the Embedding Space.
  • Observation 2: Low-Frequency Words Disperse Sparsely. We observe that, in the learned anisotropic embedding space, high-frequency words concentrate densely and low-frequency words disperse sparsely.
  • Due to the sparsity, many "holes" could be formed around the low-frequency word embeddings in the embedding space, where the semantic meaning can be poorly defined. Note that BERT sentence embeddings are produced by averaging the context embeddings, which is a convexity-preserving operation. However, the holes violate the convexity of the embedding space. Both observations can be checked empirically, as sketched after this list.
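A rough sketch of those checks, assuming `emb` is the (V, d) word-embedding matrix from the earlier sketch and `freq` is a hypothetical (V,) tensor of corpus token counts that you would compute from your own data:

```python
import torch

# Assumed inputs (hypothetical, not provided by any library):
#   emb  -- (V, d) word-embedding matrix, e.g. BERT's input embedding table
#   freq -- (V,) tensor of corpus token counts for the same vocabulary

def mean_knn_distance(emb: torch.Tensor, idx: torch.Tensor, k: int = 5) -> float:
    """Mean Euclidean distance to the k nearest neighbors; larger = sparser."""
    d = torch.cdist(emb[idx], emb)                        # (len(idx), V)
    # drop column 0 of the top-(k+1): the distance of each word to itself
    return d.topk(k + 1, largest=False).values[:, 1:].mean().item()

order = freq.argsort(descending=True)
top, rare = order[:1000], order[-1000:]
print("high-freq kNN distance:", mean_knn_distance(emb, top))   # expected: small
print("low-freq  kNN distance:", mean_knn_distance(emb, rare))  # expected: large

# Observation 1, one simple proxy: embedding norm correlates with log frequency.
stats = torch.stack([freq.float().clamp(min=1).log(), emb.norm(dim=1)])
print("norm/log-freq correlation:", torch.corrcoef(stats)[0, 1].item())
```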

3. SimCSE, Chapter 5: Connection to Anisotropy

This chapter explains the connection between SimCSE and anisotropy, i.e., why SimCSE is effective.

we take a singular spectrum perspective—which is a common practice in analyzing word embeddings (Mu and Viswanath, 2018; Gao et al., 2019; Wang et al., 2020), and show that the contrastive objective can “flatten” the singular value distribution of sentence embeddings and make the representations more isotropic.
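A minimal sketch of that singular-spectrum check, assuming `sent_emb` is an (N, d) matrix of sentence embeddings produced by any encoder; a spectrum dominated by the first few singular values indicates anisotropy, while a flatter spectrum indicates more isotropic representations:

```python
import torch

# Assumed input: sent_emb -- (N, d) matrix of sentence embeddings.

def singular_spectrum(sent_emb: torch.Tensor) -> torch.Tensor:
    """Singular values (descending) of the mean-centered embedding matrix."""
    centered = sent_emb - sent_emb.mean(dim=0, keepdim=True)
    return torch.linalg.svdvals(centered)

s = singular_spectrum(sent_emb)
print((s / s.sum())[:10])  # share of the spectrum in the top-10 directions
```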

......

4. Alignment and Uniformity

This paper introduces alignment and uniformity as tools for analyzing, evaluating, and training sentence embeddings: alignment measures how close the embeddings of positive pairs are, and uniformity measures how evenly the embeddings are distributed on the unit hypersphere.
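Both quantities have simple closed forms in Wang and Isola (2020). A small sketch, assuming `x` and `y` are (N, d) L2-normalized embeddings of N positive pairs:

```python
import torch

def align_loss(x: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    # Alignment: expected distance between positive-pair embeddings
    # (lower is better).
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniform_loss(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    # Uniformity: log of the mean Gaussian potential over all pairs; lower
    # means the embeddings spread more evenly over the unit hypersphere.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```

SimCSE's analysis is phrased in exactly these terms: the positive pairs keep alignment low, while the contrastive denominator pushes uniformity down.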

......

Related Papers:

  • Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Representation Degeneration Problem in Training Natural Language Generation Models. In International Conference on Learning Representations (ICLR).
  • Wang et al. 2020. Improving Neural Language Generation with Spectrum Control. In ICLR. https://openreview.net/pdf?id=ByxY8CNtvr
  • BERT-flow: Li et al. 2020. On the Sentence Embeddings from Pre-trained Language Models. In EMNLP.
  • SimCSE: Gao et al. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In EMNLP.
  • Tongzhou Wang and Phillip Isola. 2020. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In ICML. http://proceedings.mlr.press/v119/wang20k/wang20k.pdf
