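Pretraining language models on large text corpora improves performance on downstream NLP tasks. Besides learning linguistic knowledge, such models may also store relational knowledge present in the training data, and may be able to answer queries phrased as "fill-in-the-blank" cloze statements.
Compared with structured knowledge bases, language models require no schema engineering, let practitioners query an open class of relations, extend easily to more data, and need no human supervision to train.
Findings from an analysis of the relational knowledge in state-of-the-art pretrained language models: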
1 Introduction
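Relational knowledge is already present in off-the-shelf pretrained language models such as ELMo and BERT. How much relational knowledge do they store? How does this differ across types of knowledge, such as facts about entities, common sense, and general question answering?
How does their performance, without fine-tuning, compare with symbolic knowledge bases automatically extracted from text?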


LAMA consists of a set of knowledge sources, each made up of a set of facts.

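We say that a pretrained language model knows a fact (subject, relation, object), such as (Dante, born-in, Florence), if it can successfully predict the masked object in a cloze sentence that expresses the fact, such as "Dante was born in [MASK]". We test several types of knowledge: relations between entities stored in Wikidata, commonsense relations between concepts from ConceptNet, and the knowledge needed to answer natural-language questions in SQuAD. In the last case, we manually map a subset of SQuAD questions to cloze sentences.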
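To make the definition above concrete, here is a minimal sketch of turning a fact into a cloze query. The `[X]`/`[Y]` placeholder convention, the `to_cloze` helper, and the example template are assumptions made for illustration; these notes do not specify how templates are written.

```python
MASK = "[MASK]"  # assumed mask token, as used by BERT-style models


def to_cloze(template: str, subject: str, mask: str = MASK) -> str:
    """Fill the subject slot of a relation template and mask the object slot."""
    return template.replace("[X]", subject).replace("[Y]", mask)


# Hypothetical template for a "place of birth" relation.
birth_template = "[X] was born in [Y]."
query = to_cloze(birth_template, "Dante")
print(query)  # Dante was born in [MASK].
```

The model is then asked to fill the masked position, and it counts as knowing the fact if it predicts the object (here, Florence).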






2 Background
2.1 Unidirectional Language Models
2.2 Bidirectional “Language Models”
3 Related Work
4 The LAMA Probe
We introduce the LAMA (LAnguage Model Analysis) probe to test the factual and commonsense
knowledge in language models:
It provides a set of knowledge sources, each composed of a corpus of facts. Facts are either subject-relation-object triples or question-answer pairs. Each fact is converted into a cloze statement that is used to query the language model for a missing token.
We evaluate each model based on how highly it ranks the ground truth token against every other word in a fixed candidate vocabulary.
Assumption: models which rank ground truth tokens high for these cloze statements have more factual knowledge.
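The ranking step can be sketched as follows. This is an illustration under stated assumptions, not the probe's actual code: `rank_of_truth` is a hypothetical helper, and the scores are made-up numbers standing in for a model's scores at the masked position over a fixed candidate vocabulary.

```python
from typing import Dict


def rank_of_truth(scores: Dict[str, float], truth: str) -> int:
    """Return the 1-based rank of `truth` among all candidates (highest score first)."""
    target = scores[truth]
    # Count the candidates that score strictly higher than the ground truth token.
    return 1 + sum(1 for s in scores.values() if s > target)


# Made-up scores for the query "Dante was born in [MASK]."
toy_scores = {"Florence": 0.6, "Rome": 0.3, "Paris": 0.1}
print(rank_of_truth(toy_scores, "Florence"))  # 1
print(rank_of_truth(toy_scores, "Paris"))     # 3
```

A lower rank for the ground truth token is read as stronger evidence that the model holds the fact.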
4.1 Knowledge Sources
We cover a variety of sources of factual and commonsense knowledge. For each source, we describe the origin of fact triples (or question-answer pairs), how we transform them into cloze
templates, and to what extent aligned texts exist in Wikipedia that are known to express a particular fact. We use the latter information in supervised baselines that extract knowledge representations directly from the aligned text.
4.2 Models
4.3 Baselines
Freq: For a subject and relation pair......
RE: For the relation-based knowledge source......
DrQA: For open-domain question answering......
4.4 Metrics
We consider rank-based metrics and compute results per relation along with mean values across all relations. To account for multiple valid objects for a subject-relation pair (i.e., for N-M relations), we follow Bordes et al. (2013) and, when ranking at test time, remove from the candidates all other valid objects in the training data other than the one we test. We use the mean precision at k (P@k). For a given fact, this value is 1 if the object is ranked among the top k results, and 0 otherwise.
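A minimal sketch of P@k with the filtering step described above. The function names, the toy scores, and the example fact are assumptions for illustration only; they are not taken from the LAMA code.

```python
from typing import Dict, List, Set


def precision_at_k(scores: Dict[str, float], truth: str,
                   other_valid: Set[str], k: int) -> int:
    """Return 1 if `truth` is in the top k after dropping other valid objects, else 0."""
    # Remove the other valid objects so they cannot push the tested object down.
    candidates = {t: s for t, s in scores.items()
                  if t == truth or t not in other_valid}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return 1 if truth in ranked[:k] else 0


def mean_precision_at_k(per_fact: List[int]) -> float:
    """Average the 0/1 values over the facts of one relation."""
    return sum(per_fact) / len(per_fact)


# Made-up scores for an N-M relation where several objects are valid.
toy_scores = {"French": 0.5, "German": 0.3, "Italian": 0.2, "Tokyo": 0.1}
hit = precision_at_k(toy_scores, "Italian", {"French", "German"}, k=1)
miss = precision_at_k(toy_scores, "Italian", set(), k=1)
print(hit, miss, mean_precision_at_k([hit, miss]))  # 1 0 0.5
```

Without the filter, the other valid objects outrank the tested one and the fact would be scored as a miss at k = 1, which is the effect the filtering is meant to avoid.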

4.5 Considerations
Manually Defined Templates
Single Token
Object Slots
Intersection of Vocabularies
5 Results
6 Discussion and Conclusion

