Paper Notes: BLEU, an Evaluation Metric for Machine Translation

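First published on 彼得攀的小站
In text generation, how to evaluate the quality of the generated text is an important question. In 2002, Kishore Papineni et al. proposed BLEU, an influential evaluation metric. The paper has been cited nearly ten thousand times (9000+) and is one of the must-read papers in NLP. BLEU was originally applied to machine translation.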

Paper: https://www.aclweb.org/anthology/P02-1040

Basic idea

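Motivation:
If a machine translation is close to a human translation, the machine translation is considered good.
Basic idea of the implementation:
Compare the candidate translation produced by the machine against several human reference translations. The closer the candidate is to the references, the higher its precision.
Method:
Count the n-grams that appear in both the system translation and the reference translations, then divide the number of matched n-grams by the total number of n-grams in the system translation to obtain the score.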

n-gram precision

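An n-gram treats n consecutive elements of a sentence as one unit. A sentence of length 18 has 18 1-grams (each word is a 1-gram) and 17 2-grams. Let:

candidate denote the machine translation output
reference translation denote a human reference translation

Then the n-gram precision is the fraction of the candidate's n-grams that also appear in any of the reference translations.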
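As a minimal sketch of the definition above, n-grams can be enumerated with a sliding window. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; the paper does not prescribe them.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# An 18-word sentence, split on whitespace (assumed tokenization).
sentence = ("it is a guide to action which ensures that the military "
            "always obeys the commands of the party")
tokens = sentence.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(len(tokens), len(unigrams), len(bigrams))  # 18 18 17
```

A sentence of length L yields L - n + 1 n-grams, which is why the counts drop by one for each increase in n.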

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.

In this example, Candidate 1 has a 1-gram precision of 17/18 and a 2-gram precision of 10/17: the number of n-grams that occur in both the candidate and the references, divided by the total number of n-grams in the candidate.
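The figures for Candidate 1 can be reproduced with a short sketch. The function names and the tokenizer (lowercasing and dropping punctuation) are assumptions of this example, not part of the paper.

```python
import re


def tokenize(sentence):
    """Lowercase and keep word characters only (an assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def plain_precision(candidate, references, n):
    """Return (matched, total): candidate n-grams that occur in any reference."""
    ref_ngrams = set()
    for ref in references:
        ref_ngrams.update(ngrams(tokenize(ref), n))
    cand_ngrams = ngrams(tokenize(candidate), n)
    matched = sum(1 for gram in cand_ngrams if gram in ref_ngrams)
    return matched, len(cand_ngrams)


candidate_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")
references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]

print(plain_precision(candidate_1, references, 1))  # (17, 18)
print(plain_precision(candidate_1, references, 2))  # (10, 17)
```

The only unmatched 1-gram is "obeys", which appears in none of the three references.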

modified n-gram precision

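The computation above has a flaw: every occurrence of a candidate n-gram counts as a match as long as that n-gram appears somewhere in a reference, no matter how few times the references actually contain it. A candidate can therefore inflate its score by repeating a common word, as in the following example: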

Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.

The candidate's 1-gram precision is 7/7 = 1, which is clearly unreasonable. The paper therefore introduces a correction: the modified 1-gram precision here is 2/7, where 2 = min(7, 2). That is, the number of times an n-gram is counted as shared between the candidate and the references is clipped: $Count_{clip} = \min(count, Max\_Ref\_Count)$, where:

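count is the number of times the n-gram occurs in the candidate; in the example above, "the" occurs 7 times
ref_count is the number of times the n-gram occurs in a single reference translation, and Max_Ref_Count is the maximum of these counts over all references; in the example above, "the" occurs 2 times in Reference 1 and 1 time in Reference 2, so the maximum is 2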
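The clipping rule can be sketched with `collections.Counter`. The function names and the tokenizer are again assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and keep word characters only (an assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one sentence."""
    cand_counts = Counter(ngrams(tokenize(candidate), n))

    # Max_Ref_Count: the largest count of each n-gram in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Count_clip = min(count, Max_Ref_Count), summed over candidate n-grams.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


candidate = "the the the the the the the."
references = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision(candidate, references, 1))  # (2, 7)
```

Taking the maximum over references, rather than the sum, is what keeps a word from being credited more often than any single reference uses it.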

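Evaluating translation quality thus becomes a matter of finding n-gram matches, and n-gram matches have the following properties:

1-gram matches reflect adequacy
longer n-gram matches reflect fluency

Adequacy and fluency are both criteria that human judges use to assess translation quality.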

With counts for a single sentence in hand, summing them over all candidate sentences yields a metric for the whole document (where n is the n-gram order):

$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-}gram \in C} Count_{clip}(n\text{-}gram)}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-}gram' \in C'} Count(n\text{-}gram')}$$
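A minimal sketch of the document-level precision pools the clipped counts and the total counts over all sentences before dividing. The two-sentence corpus below reuses the examples from this note. The function names and the tokenizer are assumptions of this example.

```python
import re
from collections import Counter
from fractions import Fraction


def tokenize(sentence):
    """Lowercase and keep word characters only (an assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_counts(candidate, references, n):
    """Return (clipped matches, total n-grams) for one candidate sentence."""
    cand = Counter(ngrams(tokenize(candidate), n))
    max_ref = Counter()
    for ref in references:
        max_ref |= Counter(ngrams(tokenize(ref), n))  # element-wise maximum
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())


def corpus_precision(corpus, n):
    """p_n: pooled clipped matches divided by pooled candidate n-gram counts."""
    matched = total = 0
    for candidate, references in corpus:
        m, t = sentence_counts(candidate, references, n)
        matched += m
        total += t
    return Fraction(matched, total)


corpus = [
    ("It is a guide to action which ensures that the military always obeys the commands of the party.",
     ["It is a guide to action that ensures that the military will forever heed Party commands.",
      "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
      "It is the practical guide for the army always to heed the directions of the party."]),
    ("the the the the the the the.",
     ["The cat is on the mat.", "There is a cat on the mat."]),
]
print(corpus_precision(corpus, 1))  # 19/25
```

Pooling gives (17 + 2) / (18 + 7) = 19/25 = 0.76, whereas averaging the two sentence-level precisions 17/18 and 2/7 would give about 0.62. Pooling weights each sentence by its number of n-grams.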

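The paper's experiments on the choice of n show that the modified n-gram precision decays roughly exponentially as n grows. BLEU therefore uses a weighted average of the logarithms of the modified n-gram precisions, which amounts to a geometric mean of the precisions, with n up to 4. Sentence length also affects the score: very short translations can more easily obtain a high score, while overly long translations tend to score low. The effect of translation length on the score therefore has to be taken into account.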

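Modified n-gram precision already penalizes overly long candidates: the longer the candidate, the smaller the share of its n-grams that can match, especially once it exceeds the length of the reference translations
A brevity penalty factor is introduced in addition, to penalize translations that are too short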

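BLEU is finally computed as follows. Let c be the length of the candidate corpus and r the effective reference corpus length. The brevity penalty is

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

and the score is

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where the paper's baseline uses N = 4 and uniform weights $w_n = 1/N$.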
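A minimal sketch of the final combination, following the paper's definition: the brevity penalty is 1 when the candidate length c exceeds the effective reference length r and exp(1 - r/c) otherwise, and it multiplies the weighted geometric mean of the modified precisions. The function names are assumptions, and the precision values in the example are hypothetical inputs chosen for easy arithmetic, not results from the paper.

```python
import math


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if c == 0:
        return 0.0  # an empty candidate gets no credit (assumed convention)
    return 1.0 if c > r else math.exp(1.0 - r / c)


def bleu(precisions, c, r, weights=None):
    """Combine modified precisions p_1..p_N with the brevity penalty."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)  # uniform w_n = 1/N
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; returning 0 is an assumed convention
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_mean)


# Hypothetical precisions p_1..p_4.
print(round(bleu([0.5, 0.5, 0.5, 0.5], c=20, r=20), 4))  # 0.5
print(round(bleu([0.5, 0.5, 0.5, 0.5], c=10, r=20), 4))  # 0.1839
```

In the second call the candidate is half as long as the reference, so the score is scaled by exp(1 - 2) ≈ 0.37 even though the precisions are unchanged.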

Evaluation

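The paper also establishes the following:

Experiments show that the geometric mean agrees with human judgment better than the arithmetic mean; n-grams longer than 4 are generally not used
A good translation does not necessarily score close to 1 under BLEU, and even a high-quality human translation may not. The score is higher the closer the translation is to the reference translations, so the number of reference translations matters: experiments show that, for a good translation, the more reference translations there are per source sentence, the higher the BLEU score. The paper also evaluates the number of references: as long as the references in the corpus are stylistically diverse (that is, not all from the same translator), BLEU remains a reliable metric even with a single reference translation per sentence
The paper compares BLEU with human evaluation: on the same corpus, a linear regression of BLEU scores against human judgments gives correlation coefficients close to 1 (0.96 and 0.99), indicating that BLEU is a high-quality evaluation metric
The basic unit of BLEU evaluation is the sentence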