What Will it Take to Fix Benchmarking in NLU? [Bowman 2021]

Teaser
[Church & Hestness] revisit the arguments that motivated the NLP community’s shift toward quantitative benchmarking in the early 1990s and warn that the overwhelming success of this shift has indirectly laid the groundwork for the widespread use of poor-quality benchmarks. Go ahead, keep roasting us! (This is probably where I branched out and ended up reading through a whole stream of last-century papers about evaluation in language.)

tl;dr
We lay out 4 criteria that we argue NLU benchmarks should meet. Most current benchmarks fail to meet these criteria, and adversarial data collection does not meaningfully address the causes of these failures.
A call for effort toward restoring a healthy evaluation ecosystem.
Such an ecosystem might involve frequent formative evaluations on a conventional non-adversarial benchmark, in conjunction with periodic organized evaluations in an adversarial setting.

[Church & Hestness, A survey of 25 years of evaluation]: Progress suffers in the absence of a trustworthy metric for benchmark-driven work. Newcomers and non-specialists are discouraged from trying to contribute, and specialists are given significant freedom to cherry-pick ad-hoc evaluation settings that mask a lack of progress.

High-level Keypoints
  1. Scientific Goal

Building machines that can demonstrate a comprehensive and reliable understanding of everyday natural language text in the context of some specific well-posed task, language variety, and topic domain.

  2. Distinguish between a task and a benchmark:

tl;dr: task is abstract, benchmark is concrete

A task is a language-related skill or competency we want a model to demonstrate in the context of a specific input-output format. A benchmark attempts to evaluate performance on a task by grounding it in a text domain and instantiating it with a concrete dataset and evaluation metric. There is no general way to prove that a concrete benchmark faithfully measures performance on an abstract task. Nevertheless, since we can only evaluate models on concrete benchmarks, we have no choice but to strengthen the correspondence between the two as best we can.

Questioning the role of adversarial data collection
  • OOD test sets ensure that current models will perform poorly, but ultimately only obscure the abilities that we want our benchmarks to measure.
  • Collecting examples on which current models fail is neither necessary nor sufficient to create a useful benchmark. This approach can create a counterproductive incentive for researchers to develop models that are different without being better, since a model can top the leaderboard either by producing fewer errors than the adversary or by simply producing different errors.
  • One could attempt to mitigate this by avoiding the use of some of the data, so as to minimize the degree to which the idiosyncratic mistakes of the new model line up with those of the old one.
  • The ill-posed incentives can slow progress and contribute to spurious claims of discovery.
4 Criteria
  1. Validity

Test the full set of language phenomena/skills/variation

Benchmarks should be sufficiently free of annotation artifacts that a system cannot reach near-human performance by any means other than demonstrating the required language-related behaviors.

My take: this is stated somewhat ambiguously, because behavior is still a surface-level notion; the underlying path a model takes from input to output does not have to be human-like. If the criterion really only refers to surface behavior, then in a sense it has already been met: the core complaint about bias today is that current datasets cannot cover the full set of language phenomena, yet everyone hastens to conclude that system X demonstrates skill Y even when the benchmark only tests a small subset of that skill. So it is not that there has been no progress, but rather that people do not faithfully report what they did and routinely overshoot when describing what their machines can do. If, on the other hand, the criterion means that the underlying path must also match what happens in a human brain, that goal is far-fetched: we would first have to wait for the folks in biology to make headway, and it would draw plenty of skepticism anyway, because machines are not here to replace humans, only to assist them (at least in commercial applications), so on the engineering side all that matters is similar behavior.

If a benchmark fully meets this challenge, we should expect any clear improvement on the benchmark to translate to similar performance on any other valid and reasonable evaluation data for the same task and language domain. Sadly, right now there is still a huge gap between solving the dataset and acquiring the skill.

  • Problem with naturally-occurring examples: they minimize the effort of creating benchmarks and minimize the risk that the benchmark is somehow skewed in a way that omits important phenomena, but they do nothing to separate skills of interest from factual world knowledge, and can be overwhelmingly dominated by the latter.
  • Expert-authored examples: these focus on specific types of model failure, which is counterproductive when our goal is to build a broad-coverage benchmark that sets priorities and guides progress toward the solution of some task.
  • Crowdsourcing often focuses heavily on repetitive, easy cases and often fails to isolate key behaviors.
  • Adversarial Filtering can make a dataset artificially harder but does not ensure validity, because it can systematically eliminate coverage of linguistic phenomena or skills that are necessary for the task but already well handled by the adversary model. This mode-seeking (as opposed to mass-covering) characteristic of AF, if left unchecked, tends to reduce dataset diversity and thus make validity harder to achieve (see the sketch just below this list).
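To make the mode-seeking concern concrete, here is a minimal sketch of the basic Adversarial Filtering loop (the `adversary_predict` helper and the example format are my own illustrative assumptions, not taken from the paper): only examples that the adversary gets wrong survive, so any phenomenon the adversary already handles well is systematically thinned out of the filtered set.

```python
# Minimal sketch of Adversarial Filtering (AF).
# `adversary_predict` is a hypothetical stand-in for the adversary model's
# prediction function; examples are assumed to be {"input": ..., "label": ...}.

def adversarial_filter(candidate_examples, adversary_predict):
    """Keep only the examples on which the adversary model fails."""
    kept = []
    for ex in candidate_examples:
        if adversary_predict(ex["input"]) != ex["label"]:
            kept.append(ex)  # adversary failed -> counted as a "hard" example
    return kept

# Usage (hypothetical): hard_set = adversarial_filter(raw_pool, my_model.predict)
# Any skill my_model already masters is now underrepresented in hard_set,
# which is exactly the validity risk described above.
```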
  2. Reliable Annotation

tl;dr: Data is consistently labeled

Clearly there is ample room for improvement, but we have no guarantee that our benchmarks will detect those improvements when they are made. Painfully true: most benchmarks were collected by crowdsourcing with relatively limited quality control, such that we have no reason to expect that perfect performance on their metrics is achievable, or that the benchmark will meaningfully distinguish between systems with superhuman metric performance.

Comparison between machine performance and human inter-annotator agreement may not be totally fair: machines learn to predict the aggregate behavior of a pool of human annotators, whereas reported human performance reflects individual annotators reporting their own judgments rather than attempting to predict the most frequently assigned label.

  • Avoid carelessly mislabeled examples
  • Avoid unclear labels due to underspecified guidelines
  • Avoid ambiguous examples under the relevant metric due to legitimate disagreements in interpretation among annotators. Such disagreements might stem from dialectal variants or different interpretations of time-sensitive states of the world.
  3. Offer adequate statistical power

tl;dr: discriminative enough to detect any qualitatively relevant performance difference. Note that this is a requirement on the benchmark, and the paper earlier defined a benchmark to include a concrete dataset + metrics.

When it comes to discriminative power, we should expect to be spending more and more time in the long tail of our data difficulty distributions: as models improve, the differences that matter increasingly show up only on rare, hard examples.
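As a back-of-the-envelope illustration of what adequate power requires (my own sketch, not a calculation from the paper), the snippet below uses statsmodels' standard two-proportion power analysis to ask how many test examples are needed to reliably detect a 90% vs. 92% accuracy difference. Treating the two systems as independent samples is a simplification; paired tests such as McNemar's on a shared test set need somewhat fewer examples.

```python
# Back-of-the-envelope power calculation (assumes independent samples,
# which is a simplification for two systems scored on the same test set).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_old, p_new = 0.90, 0.92                       # hypothetical system accuracies
effect = proportion_effectsize(p_new, p_old)    # Cohen's h
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.8, alternative="two-sided")
print(f"~{n:.0f} test examples per system to separate {p_old:.0%} from {p_new:.0%}")
```

Under these assumptions the answer is on the order of 1,500 examples per system, and the required size grows quickly as the gap shrinks or as the decisive examples concentrate in the long tail.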

  4. Disincentivize the use of harmful biases

This is challenging because of deep issues with the precise specification of what constitutes harmful bias. There is no precise enumeration of social biases that will be broadly satisfactory across applications and cultural contexts, and, admittedly, building such a list of attributes is deeply political.

Sketching a Solution
  1. Improving Validity
  • Crowdsourcing + Experts: start from relatively high-quality crowdsourced datasets, then use expert effort to augment them in ways that mitigate annotation artifacts.

  • It is also possible to make small interventions during the crowdsourcing process, such as offering additional bonus payments for examples that avoid overused words and constructions.

  2. Handling Annotation Errors and Disagreements
  • Multiple annotations for the same example can largely resolve the issue of mistaken annotations.
  • Careful planning and pilot work can largely resolve the issue of ambiguous guidelines.
  • Handling legitimate disagreements can take two approaches: (a) treat ambiguous examples the same way as mislabeled ones, systematically identifying and discarding them during a validation phase; or (b) decline to assign single, discrete labels to ambiguous examples, for instance by asking models to predict the empirical distribution of labels that trustworthy annotators assign (a minimal sketch follows this list).
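A minimal sketch of option (b), under the assumption that each example comes with labels from several annotators (the function name and data layout are my own, not the paper's): the model is scored against the annotators' empirical label distribution rather than a single discrete gold label.

```python
# Score a model against the empirical label distribution from multiple annotators
# instead of a single gold label. Inputs are hypothetical for illustration.
import numpy as np

def soft_label_cross_entropy(annotator_labels, model_probs, num_classes):
    """Cross-entropy between the annotators' empirical label distribution
    and the model's predicted distribution (lower is better)."""
    counts = np.bincount(annotator_labels, minlength=num_classes)
    target = counts / counts.sum()                      # e.g. (0.6, 0.2, 0.2)
    return -np.sum(target * np.log(np.clip(model_probs, 1e-12, 1.0)))

# Example: 5 annotators label an NLI item as [0, 0, 0, 1, 2].
# A model predicting (0.6, 0.2, 0.2) scores better here than one that is
# overconfident in class 0 on a genuinely ambiguous item.
loss = soft_label_cross_entropy(np.array([0, 0, 0, 1, 2]),
                                model_probs=np.array([0.6, 0.2, 0.2]),
                                num_classes=3)
```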
  3. Improving Statistical Power

Go on, roast us again: ultimately, the community needs to compare the cost of making serious investments in better benchmarks to the cost of wasting researcher time and computational resources due to our inability to measure progress.

  4. Disincentives for Biased Models

A viable approach could involve the expanded use of auxiliary metrics: benchmark creators can introduce a family of additional expert-constructed test datasets and metrics that each isolate and measure a specific type of bias. A model would then be evaluated in parallel on these additional bias test sets (a rough sketch of such a reporting scheme follows below). Because these metrics target specific types of bias, benchmark maintainers could more easily adapt as changing norms or changing downstream applications demand coverage of additional potential harms.
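A rough sketch of what parallel reporting could look like (the `evaluate` stub, suite names, and dictionary layout are illustrative assumptions, not an existing leaderboard API): the main metric and each bias-specific suite are computed and reported side by side, so covering a new potential harm just means adding another suite.

```python
# Illustrative sketch of reporting a main metric alongside auxiliary bias metrics.
# `model.predict`, the dataset format, and the suite names are all hypothetical.

def evaluate(model, dataset):
    """Accuracy of `model` on a list of (input, label) pairs."""
    correct = sum(model.predict(x) == y for x, y in dataset)
    return correct / len(dataset)

def leaderboard_entry(model, main_benchmark, bias_suites):
    entry = {"main_accuracy": evaluate(model, main_benchmark)}
    # Each suite isolates one bias type and is reported as its own metric,
    # so maintainers can add or retire suites as norms and applications change.
    for name, suite in bias_suites.items():
        entry[f"bias/{name}"] = evaluate(model, suite)
    return entry

# bias_suites = {"gender_occupation": [...], "dialect_robustness": [...]}  # hypothetical
```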

The difficulty lies in developing community infrastructure to encourage the widespread reporting of such metrics, plausibly involving peer-review norms, explicit publication-venue policies, and the introduction of bias-oriented metrics to public leaderboards.

What this paper doesn’t contain
  • We set aside computational efficiency and data efficiency.
  • We set aside few-shot learning. While few-shot learning represents a potentially impactful direction for engineering research, artificial constraints on the use of training data do not fit the broad goals laid out above and do not fit many applied settings.
