BLEU SCORE

BLEU Score was first proposed in a paper at the ACL conference in 2002.

Paper link
BLEU: a Method for Automatic Evaluation of Machine Translation

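1. What is BLEU

BLEU (BiLingual Evaluation Understudy), literally a "bilingual evaluation stand-in", is a metric that measures how similar a machine-translated text (the candidate) is to a reference translation (the reference).

"Bilingual" means the evaluation is carried out between two texts, "evaluation" means measuring how similar those two texts are, and "understudy" means that BLEU stands in for human evaluation.

In English, an understudy is a stand-in actor. The authors point out that human evaluation is expensive and cannot be reused, so they propose a metric that acts as an "understudy" to experienced human judges.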

We present this method as an automated understudy to skilled human judges

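The core idea of the evaluation is that a machine translation is better the closer it comes to a professional human translation.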

The closer a machine translation is to a professional human translation, the better it is.

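2. How BLEU works

Consider the following example first. There are two machine translations, Candidate 1 and Candidate 2, and three human reference translations, Reference 1, 2 and 3.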

Example1

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.

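Intuitively, Candidate 1 is the better machine translation. Several of its subsequences, such as It is a guide to action, which, and ensures that the military, appear in the References, whereas Candidate 2 has very few matching subsequences.

So the main job of BLEU is to find the subsequences (n-grams) that the Candidate shares with the References, and to turn the amount of shared n-grams into a score for the Candidate.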
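The idea of counting shared n-grams can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names `tokenize` and `ngrams` are made up for this post, and tokenization is just lowercasing, dropping periods and splitting on whitespace. It counts how many bigrams of each Candidate in Example1 occur in at least one Reference.

```python
def tokenize(text):
    # Simplifying assumption: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

def ngrams(tokens, n):
    # All contiguous subsequences of length n, in order of appearance.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
candidates = {
    "Candidate 1": "It is a guide to action which ensures that the military always obeys the commands of the party.",
    "Candidate 2": "It is to insure the troops forever hearing the activity guidebook that party direct.",
}

# Pool the bigrams of all references into one set.
ref_bigrams = set()
for ref in references:
    ref_bigrams.update(ngrams(tokenize(ref), 2))

shared = {}
for name, text in candidates.items():
    cand_bigrams = ngrams(tokenize(text), 2)
    matched = sum(bg in ref_bigrams for bg in cand_bigrams)
    shared[name] = (matched, len(cand_bigrams))
    print(name, shared[name])
# Candidate 1 (10, 17)
# Candidate 2 (1, 13)
```

Candidate 1 shares 10 of its 17 bigrams with the References, Candidate 2 only 1 of 13, which matches the intuition above.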

2.1 Modified n-gram precision

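The example above suggests that the more n-grams the Candidate shares with the References, the higher its quality. But this very intuitive idea has a flaw.

Consider the following example: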
Example2

Candidate:       the the the the the the the
Reference1:     the cat is on the mat
Reference2:     there is a cat on the mat

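The Candidate consists of 7 occurrences of the, which is obviously a terrible translation. But if we only follow the idea from Example1 and judge the Candidate by the number of shared n-grams, then for 1-grams (unigrams) the Candidate is a perfect translation with a score of 7 / 7 = 1.0. The Candidate has 7 words, every one of them is the, and the appears in the References.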
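A minimal sketch of this naive counting makes the problem concrete. The function name `naive_unigram_precision` is made up for illustration; it simply counts every candidate word that occurs anywhere in any reference.

```python
def naive_unigram_precision(candidate, references):
    # Vocabulary of all words that occur in at least one reference.
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(ref.split())
    cand_tokens = candidate.split()
    # Every candidate word found in the vocabulary counts as a match,
    # no matter how often it is repeated.
    matches = sum(tok in ref_vocab for tok in cand_tokens)
    return matches, len(cand_tokens)

candidate = "the the the the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]

print(naive_unigram_precision(candidate, references))  # (7, 7)
```

All 7 words match, so the naive score is 7 / 7 even though the sentence is useless.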

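The authors therefore propose modified unigram precision and modified n-gram precision. Since a unigram is just the special case n = 1, we go straight to modified n-gram precision.

First we extract all n-grams, i.e. contiguous subsequences of length n, from the Candidate and the References. Then each n-gram is scored with the following formula.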
$$
Count_{clip}(x) = \min(Count(x), \max(Count_{ref}(x)))
$$

$Count(x)$ is the number of times the n-gram x occurs in the Candidate. $Count_{ref}(x)$ holds the number of times x occurs in each Reference, so it can be seen as an array with one entry per Reference, and $\max(Count_{ref}(x))$ is the maximum of that array. $\min(Count(x), \max(Count_{ref}(x)))$ takes the smaller of the two.

$Count_{clip}(x)$ is then one term of the numerator of the precision, so modified n-gram precision is computed as follows:
$$
modified\_precision(n) = \frac{\sum_{x \in n\text{-gram}} Count_{clip}(x)}{\sum_{x \in n\text{-gram}} Count(x)}
$$

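With this definition, modified_precision(1) for Example2 is 2 / 7 rather than 1.0: the occurs 7 times in the Candidate but at most 2 times in a single Reference (Reference1), so its count is clipped to 2.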
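The clipping rule can be sketched with `collections.Counter`. The function names `ngram_counts` and `modified_precision` are chosen here for illustration, and whitespace tokenization is assumed. For Example2, the occurs 7 times in the Candidate, 2 times in Reference1 and once in Reference2, so the clipped count is min(7, max(2, 1)) = 2.

```python
from collections import Counter
from fractions import Fraction

def ngram_counts(tokens, n):
    # Count(x) for every n-gram x in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngram_counts(candidate.split(), n)
    ref_counts = [ngram_counts(ref.split(), n) for ref in references]
    # Count_clip(x) = min(Count(x), max over references of Count_ref(x)).
    # A Counter returns 0 for n-grams it has never seen.
    clipped = sum(
        min(count, max(rc[x] for rc in ref_counts))
        for x, count in cand.items()
    )
    total = sum(cand.values())
    return Fraction(clipped, total) if total else Fraction(0)

candidate = "the the the the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]

print(modified_precision(candidate, references, 1))  # 2/7
```

`Fraction` is used only so that the result prints as an exact ratio; a float would work just as well.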

Let's work through one more example.

Example3

Candidate:  the cat the cat the cat on the mat
Reference1: the cat the cat on the mat
Reference2: the cat is on the mat

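The References here are not particularly good translations either; they only serve as an example for computing the modified 2-gram precision.

2-gram    the cat   cat the   cat on   on the   the mat
Count     3         2         1        1        1
Score     2         1         1        1        1

The final score is (2 + 1 + 1 + 1 + 1) / (3 + 2 + 1 + 1 + 1) = 6 / 8 = 0.75

The cat scores 2 because of the Count clip rule above: it occurs 3 times in the Candidate but at most 2 times in a single Reference (Reference1), so min(3, 2) = 2.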
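The per-bigram breakdown for Example3 can be reproduced with a short script. This is a sketch assuming whitespace tokenization; `ngram_counts` and `rows` are illustrative names. For each candidate bigram it records the count in the Candidate, the maximum count in any single Reference, and the clipped count.

```python
from collections import Counter

def ngram_counts(text, n):
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

candidate = "the cat the cat the cat on the mat"
references = ["the cat the cat on the mat", "the cat is on the mat"]

cand = ngram_counts(candidate, 2)
refs = [ngram_counts(ref, 2) for ref in references]

rows = {}
for gram, count in cand.items():
    max_ref = max(rc[gram] for rc in refs)  # best count in any single reference
    rows[" ".join(gram)] = (count, max_ref, min(count, max_ref))
    print(" ".join(gram), rows[" ".join(gram)])

clipped_total = sum(clip for _, _, clip in rows.values())
total = sum(cand.values())
print(clipped_total, "/", total)  # 6 / 8
```

The row for the cat comes out as (3, 2, 2): three occurrences in the Candidate, at most two in one Reference, so two are credited.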

2.2 Sentence length

With modified n-gram precision we have dealt with the precision side of translation quality, but the problem as a whole is not solved yet. Consider the next example.

Example4

Candidate: of the
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party

The Candidate consists of just the two words of the. Yet modified_precision(1) and modified_precision(2) are both 1.0. By the current standard it is a perfect translation, while in reality it is terrible.
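A quick check of this claim, as a sketch with illustrative helper names and the simplifying assumption that tokenization is lowercasing, dropping periods and splitting on whitespace:

```python
from collections import Counter
from fractions import Fraction

def ngram_counts(text, n):
    # Assumed tokenization: lowercase, drop periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngram_counts(candidate, n)
    refs = [ngram_counts(ref, n) for ref in references]
    clipped = sum(min(c, max(rc[g] for rc in refs)) for g, c in cand.items())
    return Fraction(clipped, sum(cand.values()))

candidate = "of the"
references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party",
]

p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
print(p1, p2)  # 1 1
```

Both precisions are exactly 1 because of, the and the bigram of the all occur in Reference 2 and Reference 3.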

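This tells us that a system that lazily emits very short sentences can easily obtain very high scores. So we also need to address recall: the system should produce as many of the words in the References as possible and output text of roughly the same length as the References.

But simply giving higher scores to Candidates that contain more n-grams leads to the following problem: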

Example5

Candidate1:  I always invariably perpetually do.
Candidate2:  I always do.
Reference 1: I always do.
Reference 2: I invariably do.
Reference 3: I perpetually do.

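In Example5, by the current standard both Candidate1 and Candidate2 score 1.0 on modified unigram precision. Candidate1 also recalls more words from the References than Candidate2, yet it is clearly the worse translation.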
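The scores for Example5 can be checked with the same kind of sketch (illustrative helper names; tokenization assumed to be lowercasing, dropping periods and splitting on whitespace). The bigram row is an extra comparison added here and is not part of the argument above.

```python
from collections import Counter
from fractions import Fraction

def ngram_counts(text, n):
    # Assumed tokenization: lowercase, drop periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngram_counts(candidate, n)
    refs = [ngram_counts(ref, n) for ref in references]
    clipped = sum(min(c, max(rc[g] for rc in refs)) for g, c in cand.items())
    return Fraction(clipped, sum(cand.values()))

references = ["I always do.", "I invariably do.", "I perpetually do."]
cand1 = "I always invariably perpetually do."
cand2 = "I always do."

unigram = (modified_precision(cand1, references, 1),
           modified_precision(cand2, references, 1))
bigram = (modified_precision(cand1, references, 2),
          modified_precision(cand2, references, 2))
print(unigram[0], unigram[1])  # 1 1
print(bigram[0], bigram[1])    # 1/2 1
```

On unigrams the two Candidates are indistinguishable, and Candidate1 covers more reference words, which is why rewarding recall across several References is risky.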

To handle both the overly short translation in Example4 and the overly long translation in Example5, the authors propose the sentence brevity penalty.

Sentence brevity penalty

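The idea behind the sentence brevity penalty is as follows. If a Candidate is longer than the Reference, we do not need to construct an extra heuristic penalty, because it is already penalized by the precision term: the longer a sentence is, the more likely it is to contain n-grams that are not in the References, which lowers its score. But if the Candidate is shorter than the Reference, the system can be lazy and generate only a small fragment of the Reference, and still get a high score. So we add an extra penalty that pushes the system to generate longer Candidates.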

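BP is defined as follows, where c is the length of the Candidate and r is the reference length (with several References, the paper uses the reference length closest to the Candidate's). If the Candidate is longer than the Reference, BP = 1. Otherwise BP = e^{1 - r/c}, a number in (0, 1] that equals 1 only when c = r.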
$$
BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}
$$
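As a sketch, BP is a two-branch function. The name `brevity_penalty` and the handling of an empty candidate are assumptions made here, not something stated in the text. The first call uses c = 2 for the Candidate of Example4 and r = 16, the length of its shortest Reference under whitespace tokenization.

```python
import math

def brevity_penalty(c, r):
    """BP for candidate length c and reference length r, both in tokens."""
    if c == 0:
        return 0.0  # assumption: an empty candidate gets no credit
    if c > r:
        return 1.0  # longer candidates are left to the precision term
    return math.exp(1 - r / c)

print(brevity_penalty(2, 16))   # e^-7, about 0.0009
print(brevity_penalty(16, 16))  # 1.0
print(brevity_penalty(18, 16))  # 1.0
```

The two-word Candidate of Example4 keeps its perfect precisions but is multiplied by roughly 0.0009, so its final score is close to zero.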

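The final BLEU score is computed as follows. N is usually 4, i.e. only the 1-gram through 4-gram precisions are used. $p_n$ is modified_precision(n) and $w_n$ is its weight; the simplest choice is uniform weights, $w_n = 1 / N$. The exponent is a weighted sum of log precisions, so with uniform weights BLEU is BP times the geometric mean of the precisions.
$$
BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$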
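Putting the pieces together, here is a minimal sentence-level sketch with uniform weights and N = 4. It is not a reference implementation. Several choices are assumptions made for illustration: the function names, the tokenization (lowercase, drop periods, split on whitespace), returning 0 when any precision is 0 (the logarithm is undefined there), and taking r as the reference length closest to the candidate length with ties going to the shorter one. The paper itself computes BLEU over a whole test corpus rather than one sentence.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, refs, n):
    cand_counts = ngram_counts(cand, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate shorter than n tokens
    ref_counts = [ngram_counts(ref, n) for ref in refs]
    clipped = sum(min(c, max(rc[g] for rc in ref_counts))
                  for g, c in cand_counts.items())
    return clipped / total

def brevity_penalty(c, r):
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    cand = tokenize(candidate)
    refs = [tokenize(ref) for ref in references]
    precisions = [modified_precision(cand, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; treat the score as 0
    c = len(cand)
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    # Uniform weights w_n = 1 / max_n.
    log_mean = sum(math.log(p) / max_n for p in precisions)
    return brevity_penalty(c, r) * math.exp(log_mean)

references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
cand2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

print(bleu(cand1, references))
print(bleu(cand2, references))  # 0.0, no 3-gram of Candidate 2 matches
```

Candidate 2 scores 0 under this sketch because none of its 3-grams occurs in a Reference. Practical implementations usually apply smoothing so that a single zero precision does not wipe out the whole sentence score.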
