What Will it Take to Fix Benchmarking in NLU? [Bowman 2021]

Teaser
[Church & Hestness] revisit the arguments that motivated the NLP community’s shift toward quantitative benchmarking in the early 1990s and warn that the overwhelming success of this shift has indirectly laid the groundwork for the widespread use of poor-quality benchmarks. Go ahead, keep roasting us! (This is probably where I branched out and ended up reading through a whole stream of last-century papers about evaluation in language.)

tl;dr
We lay out 4 criteria that we argue NLU benchmarks should meet. Most current benchmarks fail to meet these criteria, and adversarial data collection does not meaningfully address the causes of these failures.
A call for effort toward restoring a healthy evaluation ecosystem.
Such an ecosystem might involve frequent formative evaluations on a conventional non-adversarial benchmark, in conjunction with periodic organized evaluations in an adversarial setting.

[Church & Hestness, A survey of 25 years of evaluation]: Progress suffers in the absence of a trustworthy metric for benchmark-driven work. Newcomers and non-specialists are discouraged from trying to contribute, and specialists are given significant freedom to cherry-pick ad-hoc evaluation settings that mask a lack of progress.

High-level Keypoints
  1. Scientific Goal

Building machines that can demonstrate a comprehensive and reliable understanding of everyday natural language text in the context of some specific well-posed task, language variety, and topic domain.

  2. Distinguish between a task and a benchmark:

tl;dr: task is abstract, benchmark is concrete

A task is a language-related skill or competency we want a model to demonstrate in the context of a specific input-output format. A benchmark attempts to evaluate performance on a task by grounding it in a text domain and instantiating it with a concrete dataset and evaluation metric. There is no general way to prove that a concrete benchmark faithfully measures performance on an abstract task. Nevertheless, since we can only evaluate models on concrete benchmarks, we have no choice but to strengthen the correspondence between the two as best we can.

Questioning the role of adversarial data collection
  • OOD test sets ensure that current models will perform poorly, but ultimately only obscure the abilities that we want our benchmarks to measure.
  • Collecting examples on which current models fail is neither necessary nor sufficient to create a useful benchmark. This approach can create a counterproductive incentive for researchers to develop models that are different without being better, since a model can top the leaderboard either by producing fewer errors than the adversary or by simply producing different errors.
  • One could attempt to mitigate this by avoiding the use of some of the data, so as to minimize the degree to which the idiosyncratic mistakes of the new model line up with those of the old one.
  • The ill-posed incentives can slow progress and contribute to spurious claims of discovery.
4 Criteria
  1. Validity

Test the full set of language phenomena/skills/variation

Benchmarks should be sufficiently free of annotation artifacts that a system cannot reach near-human performance by any means other than demonstrating the required language-related behaviors.

My take: this is stated somewhat ambiguously, because behavior is still a surface-level notion; the underlying path a model takes from input to output does not have to be human-like. If the criterion really only refers to surface behavior, then in a sense it has already been met: the core complaint about bias today is that current datasets cannot cover the full set of language phenomena, yet everyone hastens to conclude that system X demonstrates skill Y even when the benchmark only tests a small subset of that skill. So it is not that there has been no progress, but rather that people do not faithfully report what they did and routinely overshoot when describing what their machines can do. If, on the other hand, the criterion means that the underlying path must also match what happens in a human brain, that goal is far-fetched: we would first have to wait for the folks in biology to make headway, and it would draw plenty of skepticism anyway, because machines are not here to replace humans, only to assist them (at least in commercial applications), so on the engineering side all that matters is similar behavior.

If a benchmark fully meets this challenge, we should expect any clear improvement on the benchmark to translate to similar performance on any other valid and reasonable evaluation data for the same task and language domain. Sadly, right now there is still a huge gap between solving the dataset and acquiring the skill.

  • Problem with naturally-occurring examples: they minimize the effort of creating benchmarks and minimize the risk that the benchmark is somehow skewed in a way that omits important phenomena, but they do nothing to separate skills of interest from factual world knowledge, and can be overwhelmingly dominated by the latter.
  • Expert-authored examples: these focus on specific types of model failure, which is counterproductive when our goal is to build a broad-coverage benchmark that sets priorities and guides progress toward the solution of some task.
  • Crowdsourcing often focuses heavily on repetitive, easy cases and often fails to isolate key behaviors.
  • Adversarial Filtering can make a dataset artificially harder but does not ensure validity, because it can systematically eliminate coverage of linguistic phenomena or skills that are necessary for the task but already well handled by the adversary model. This mode-seeking (as opposed to mass-covering) characteristic of AF, if left unchecked, tends to reduce dataset diversity and thus make validity harder to achieve (see the sketch just below this list).
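To make the mode-seeking concern concrete, here is a minimal sketch of the basic Adversarial Filtering loop (the `adversary_predict` helper and the example format are my own illustrative assumptions, not taken from the paper): only examples that the adversary gets wrong survive, so any phenomenon the adversary already handles well is systematically thinned out of the filtered set.

```python
# Minimal sketch of Adversarial Filtering (AF).
# `adversary_predict` is a hypothetical stand-in for the adversary model's
# prediction function; examples are assumed to be {"input": ..., "label": ...}.

def adversarial_filter(candidate_examples, adversary_predict):
    """Keep only the examples on which the adversary model fails."""
    kept = []
    for ex in candidate_examples:
        if adversary_predict(ex["input"]) != ex["label"]:
            kept.append(ex)  # adversary failed -> counted as a "hard" example
    return kept

# Usage (hypothetical): hard_set = adversarial_filter(raw_pool, my_model.predict)
# Any skill my_model already masters is now underrepresented in hard_set,
# which is exactly the validity risk described above.
```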
  2. Reliable Annotation

tl;dr: Data is consistently labeled

Clearly there is ample room for improvement, but we have no guarantee that our benchmarks will detect those improvements when they are made. Painfully true: most benchmarks were collected by crowdsourcing with relatively limited quality control, such that we have no reason to expect that perfect performance on their metrics is achievable, or that the benchmark will meaningfully distinguish between systems with superhuman metric performance.

Comparison between machine performance and human inter-annotator agreement may not be totally fair: machines learn to predict the aggregate behavior of a pool of human annotators, whereas reported human performance reflects individual annotators reporting their own judgments rather than attempting to predict the most frequently assigned label.

  • Avoid carelessly mislabeled examples
  • Avoid unclear labels due to underspecified guidelines
  • Avoid ambiguous examples under the relevant metric due to legitimate disagreements in interpretation among annotators. Such disagreements might stem from dialectal variants or different interpretations of time-sensitive states of the world.
  3. Offer adequate statistical power

tl;dr: discriminative enough to detect any qualitatively relevant performance difference. Note that this is a requirement on the benchmark, and the paper earlier defined a benchmark to include a concrete dataset + metrics.

When it comes to discriminative power, we should expect to be spending more and more time in the long tail of our data difficulty distributions: as models improve, the differences that matter increasingly show up only on rare, hard examples.
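As a back-of-the-envelope illustration of what adequate power requires (my own sketch, not a calculation from the paper), the snippet below uses statsmodels' standard two-proportion power analysis to ask how many test examples are needed to reliably detect a 90% vs. 92% accuracy difference. Treating the two systems as independent samples is a simplification; paired tests such as McNemar's on a shared test set need somewhat fewer examples.

```python
# Back-of-the-envelope power calculation (assumes independent samples,
# which is a simplification for two systems scored on the same test set).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_old, p_new = 0.90, 0.92                       # hypothetical system accuracies
effect = proportion_effectsize(p_new, p_old)    # Cohen's h
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.8, alternative="two-sided")
print(f"~{n:.0f} test examples per system to separate {p_old:.0%} from {p_new:.0%}")
```

Under these assumptions the answer is on the order of 1,500 examples per system, and the required size grows quickly as the gap shrinks or as the decisive examples concentrate in the long tail.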

  4. Disincentivize the use of harmful biases

This is challenging because of deep issues with the precise specification of what constitutes harmful bias. There is no precise enumeration of social biases that will be broadly satisfactory across applications and cultural contexts, and, admittedly, building such a list of attributes is deeply political.

Sketching a Solution
  1. Improving Validity
  • Crowdsourcing + Experts: start from relatively high-quality crowdsourced datasets, then use expert effort to augment them in ways that mitigate annotation artifacts.

  • It is also possible to make small interventions during the crowdsourcing process, such as offering additional bonus payments for examples that avoid overused words and constructions.

  2. Handling Annotation Errors and Disagreements
  • Multiple annotations for the same example can largely resolve the issue of mistaken annotations.
  • Careful planning and pilot work can largely resolve the issue of ambiguous guidelines.
  • Handling legitimate disagreements can take two approaches: (a) treat ambiguous examples the same way as mislabeled ones, systematically identifying and discarding them during a validation phase; or (b) decline to assign single, discrete labels to ambiguous examples, for instance by asking models to predict the empirical distribution of labels that trustworthy annotators assign (a minimal sketch follows this list).
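A minimal sketch of option (b), under the assumption that each example comes with labels from several annotators (the function name and data layout are my own, not the paper's): the model is scored against the annotators' empirical label distribution rather than a single discrete gold label.

```python
# Score a model against the empirical label distribution from multiple annotators
# instead of a single gold label. Inputs are hypothetical for illustration.
import numpy as np

def soft_label_cross_entropy(annotator_labels, model_probs, num_classes):
    """Cross-entropy between the annotators' empirical label distribution
    and the model's predicted distribution (lower is better)."""
    counts = np.bincount(annotator_labels, minlength=num_classes)
    target = counts / counts.sum()                      # e.g. (0.6, 0.2, 0.2)
    return -np.sum(target * np.log(np.clip(model_probs, 1e-12, 1.0)))

# Example: 5 annotators label an NLI item as [0, 0, 0, 1, 2].
# A model predicting (0.6, 0.2, 0.2) scores better here than one that is
# overconfident in class 0 on a genuinely ambiguous item.
loss = soft_label_cross_entropy(np.array([0, 0, 0, 1, 2]),
                                model_probs=np.array([0.6, 0.2, 0.2]),
                                num_classes=3)
```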
  3. Improving Statistical Power

Go on, roast us again: ultimately, the community needs to compare the cost of making serious investments in better benchmarks to the cost of wasting researcher time and computational resources due to our inability to measure progress.

  4. Disincentives for Biased Models

A viable approach could involve the expanded use of auxiliary metrics: benchmark creators can introduce a family of additional expert-constructed test datasets and metrics that each isolate and measure a specific type of bias. A model would then be evaluated in parallel on these additional bias test sets (a rough sketch of such a reporting scheme follows below). Because these metrics target specific types of bias, benchmark maintainers could more easily adapt as changing norms or changing downstream applications demand coverage of additional potential harms.
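A rough sketch of what parallel reporting could look like (the `evaluate` stub, suite names, and dictionary layout are illustrative assumptions, not an existing leaderboard API): the main metric and each bias-specific suite are computed and reported side by side, so covering a new potential harm just means adding another suite.

```python
# Illustrative sketch of reporting a main metric alongside auxiliary bias metrics.
# `model.predict`, the dataset format, and the suite names are all hypothetical.

def evaluate(model, dataset):
    """Accuracy of `model` on a list of (input, label) pairs."""
    correct = sum(model.predict(x) == y for x, y in dataset)
    return correct / len(dataset)

def leaderboard_entry(model, main_benchmark, bias_suites):
    entry = {"main_accuracy": evaluate(model, main_benchmark)}
    # Each suite isolates one bias type and is reported as its own metric,
    # so maintainers can add or retire suites as norms and applications change.
    for name, suite in bias_suites.items():
        entry[f"bias/{name}"] = evaluate(model, suite)
    return entry

# bias_suites = {"gender_occupation": [...], "dialect_robustness": [...]}  # hypothetical
```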

The difficulty lies in developing community infrastructure to encourage the widespread reporting of such metrics, plausibly involving peer-review norms, explicit publication-venue policies, and the introduction of bias-oriented metrics to public leaderboards.

What this paper doesn’t contain
  • We set aside computational efficiency and data efficiency.
  • We set aside few-shot learning. While few-shot learning represents a potentially impactful direction for engineering research, artificial constraints on the use of training data do not fit the broad goals laid out above and do not fit many applied settings.
