Paper link: https://arxiv.org/pdf/1906.01213v1.pdf

Code link:

Source: ACL 2019

Author: Jasminexjf

Time: 2019-06-25

ACL is a top conference in natural language processing (NLP), covering research directions such as language analysis, information extraction, information retrieval, question answering, sentiment analysis and opinion mining, summarization and text generation, text classification and mining, NLP for Web 2.0, machine translation, and spoken language processing. ACL is rated as a Class A conference in the China Computer Federation (CCF) list of recommended international academic conferences.

Paper information: Title: Progressive Self-Supervised Attention Learning for Aspect-Level Sentiment Analysis (Authors: Jialong Tang, Ziyao Lu, Jinsong Su, Yubin Ge, Linfeng Song, Le Sun and Jiebo Luo). The co-first authors are Jialong Tang, a Ph.D. student (class of 2018) at the Institute of Software, Chinese Academy of Sciences, and Ziyao Lu, a master's student (class of 2018) at the School of Software, Xiamen University; the corresponding author is Associate Professor Jinsong Su.

Overview 1:

This paper addresses the problem that neural networks tend to over-learn strong (frequent) patterns while under-learning weak (infrequent) patterns during training, and proposes a progressive self-supervised attention learning algorithm that effectively alleviates this issue. The method is based on the idea of erasure: it allows the model to progressively mine the information in the text that deserves attention and to balance the degree to which strong and weak patterns are learned. Experiments on three public aspect-level sentiment analysis datasets and two classic base models show that the proposed method achieves solid performance.

Overview 2:

In aspect-level sentiment classification, it has become common practice in recent years to use attention mechanisms to capture the information in the context that is most relevant to the given aspect. However, attention mechanisms tend to focus excessively on a small number of high-frequency words with strong sentiment polarity, while ignoring lower-frequency words.

This paper proposes a progressive self-supervised attention learning algorithm that automatically and progressively mines important supervision information from the text, which is then used to constrain the learning of the attention mechanism during model training. The team iteratively erases, on each training instance, the context word that has an active/misleading effect on the sentiment prediction. The erased words are recorded and replaced by a special token in the next training round. Finally, the team designs different supervision signals for different cases and uses them as a regularization term in the final training objective to constrain the learning of the attention mechanism.

Experimental results on SemEval-14 REST and LAPTOP, as well as the colloquial TWITTER dataset, show that the proposed progressive attention mechanism achieves significant improvements over several state-of-the-art base models.

The basic architecture is as follows (figure from the paper, not reproduced here):


1. Abstract:

In aspect-level sentiment classification (ASC), it is prevalent to equip dominant neural models with attention mechanisms, for the sake of acquiring the importance of each context word on the given aspect. However, such a mechanism tends to excessively focus on a few frequent words with sentiment polarities, while ignoring infrequent ones. In this paper, we propose a progressive self-supervised attention learning approach for neural ASC models, which automatically mines useful attention supervision information from a training corpus to refine attention mechanisms. Specifically, we iteratively conduct sentiment predictions on all training instances. Particularly, at each iteration, the context word with the maximum attention weight is extracted as the one with active/misleading influence on the correct/incorrect prediction of every instance, and then the word itself is masked for subsequent iterations. Finally, we augment the conventional training objective with a regularization term, which enables ASC models to continue equally focusing on the extracted active context words while decreasing weights of those misleading ones.

2. Introduction

Aspect-level sentiment classification (ASC), as an indispensable task in sentiment analysis, aims at inferring the sentiment polarity of an input sentence in a certain aspect.
However, the existing attention mechanism in ASC suffers from a major drawback. Specifically, it is prone to overly focus on a few frequent words with sentiment polarities, while little attention is paid to low-frequency ones. As a result, the performance of attentional neural ASC models is still far from satisfactory. We speculate that this is because there widely exist "apparent patterns" and "inapparent patterns" in the training data. Here, "apparent patterns" are interpreted as high-frequency words with strong sentiment polarities, and "inapparent patterns" refer to low-frequency words. As mentioned above, NNs are easily affected by these two modes: "apparent patterns" tend to be overly learned, while "inapparent patterns" often cannot be fully learned.


Example:

In the first three training sentences, given that the context word "small" frequently occurs with negative sentiment, the attention mechanism pays more attention to it and directly associates sentences containing it with negative sentiment. This inevitably causes another informative context word, "crowded", to be partially neglected even though it also carries negative sentiment. Consequently, a neural ASC model incorrectly predicts the sentiment of the last two test sentences: in the first test sentence, the model fails to capture the negative sentiment expressed by "crowded"; in the second test sentence, the attention mechanism directly focuses on "small" although it is not related to the given aspect.


Therefore, the authors propose a novel progressive self-supervised attention learning approach for neural ASC models.

The contributions are three-fold:

(1) Through in-depth analysis, we point out the existing drawback of the attention mechanism for ASC.

(2) We propose a novel incremental approach to automatically extract attention supervision information for neural ASC models. To the best of our knowledge, our work is the first attempt to explore automatic attention supervision information mining for ASC.

(3) We apply our approach to two dominant neural ASC models: Memory Network (MN) (Tang et al., 2016b; Wang et al., 2018) and Transformation Network (TNet) (Li et al., 2018). Experimental results on several benchmark datasets demonstrate the effectiveness of our approach.


3. Background

3.1 Memory Network (MN)

The final vector representation v(t) of the aspect t is defined as the averaged embedding of its words;

and the aspect-related sentence representation o is computed with attention over the context memory:

o = \sum_{i} \mathrm{Softmax}\left(v(t)^{\top} M m_{i}\right) h_{i}
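As a quick illustration of this attention step, here is a minimal numpy sketch (dimensions and inputs are made up for the example; this is not the authors' implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: N context words, hidden size d.
N, d = 6, 8
rng = np.random.default_rng(0)

m = rng.normal(size=(N, d))   # memory slices m_i (context word representations)
h = rng.normal(size=(N, d))   # hidden states h_i of the context words
v_t = rng.normal(size=d)      # aspect vector v(t): averaged aspect word embeddings
M = rng.normal(size=(d, d))   # bilinear attention parameter M

scores = m @ M.T @ v_t        # v(t)^T M m_i for every context word i
alpha = softmax(scores)       # attention weights over the context words
o = alpha @ h                 # o = sum_i alpha_i * h_i, the aspect-related sentence vector
print(alpha, o.shape)
```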

3.2 Transformation Network (TNet/TNet-ATT)

(1) The bottom layer is a Bi-LSTM that transforms the input x into the contextualized word representations h^{(0)}(x) = (h_1^{(0)}, h_2^{(0)}, \cdots, h_N^{(0)}) (i.e., the hidden states of the Bi-LSTM).

(2) The middle part, as the core of the whole model, contains L layers of Context-Preserving Transformation (CPT), where word representations are updated as h^{(l+1)}(x) = CPT(h^{(l)}(x)). The key operation of CPT layers is the Target-Specific Transformation: it contains another Bi-LSTM for generating v(t) via an attention mechanism, and then incorporates v(t) into the word representations. Besides, CPT layers are also equipped with a Context-Preserving Mechanism (CPM) to preserve the context information and learn more abstract word-level features. In the end, we obtain the word-level semantic representations h(x) = (h_1, h_2, \cdots, h_N), with h_i = h_i^{(L)}.

(3) The topmost part is a CNN layer used to produce the aspect-related sentence representation o for the sentiment classification.

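To make the three-part structure concrete, here is a heavily simplified PyTorch-style skeleton (a sketch only: the layer sizes are arbitrary, the CPT/CPM blocks are reduced to a single attention-plus-residual step, and this is not the TNet implementation of Li et al. (2018)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTNet(nn.Module):
    """Toy three-part structure: Bi-LSTM -> L simplified CPT layers -> CNN + pooling."""
    def __init__(self, emb_dim=50, hidden=50, num_cpt_layers=2, num_classes=3):
        super().__init__()
        self.ctx_lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.asp_lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden
        self.L = num_cpt_layers
        self.fuse = nn.Linear(2 * d, d)          # fuses a word rep with its aspect vector (simplified TST)
        self.conv = nn.Conv1d(d, 50, kernel_size=3, padding=1)
        self.out = nn.Linear(50, num_classes)

    def forward(self, x_emb, t_emb):
        # (1) bottom Bi-LSTM: contextualized word representations h^(0)(x)
        h, _ = self.ctx_lstm(x_emb)              # (B, N, d)
        ht, _ = self.asp_lstm(t_emb)             # (B, M, d): aspect word states
        # (2) L simplified CPT layers
        for _ in range(self.L):
            # attention of each context word over the aspect states -> per-word aspect vector v(t)
            attn = torch.softmax(h @ ht.transpose(1, 2), dim=-1)   # (B, N, M)
            v_t = attn @ ht                                        # (B, N, d)
            new_h = torch.tanh(self.fuse(torch.cat([h, v_t], dim=-1)))
            h = new_h + h                                          # residual step standing in for the CPM
        # (3) top CNN + max pooling -> aspect-related sentence representation o
        c = torch.relu(self.conv(h.transpose(1, 2)))               # (B, 50, N)
        o = c.max(dim=2).values                                    # (B, 50)
        return F.log_softmax(self.out(o), dim=-1)

# usage with random tensors: batch of 2, 7 context words, 3 aspect words
model = TinyTNet()
log_probs = model(torch.randn(2, 7, 50), torch.randn(2, 3, 50))
print(log_probs.shape)   # torch.Size([2, 3])
```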

3.3 Training Objective (NLL)
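Both MN and TNet are trained with the standard negative log-likelihood (cross-entropy) objective over the training corpus D; written out in the usual notation (my notation, following the paper's setup):

J(D;\theta) = -\sum_{(x,t,y)\in D} \log p(y \mid x, t; \theta)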

4. Model

We first use the initial training corpus D to conduct model training and obtain the initial model parameters θ(0) (Line 1). Then, we continue training the model for K iterations, during which the influential context words of all training instances are iteratively extracted (Lines 6-25). During this process, for each training instance (x,t,y), we introduce two word sets initialized as ∅ (Lines 2-5) to record its extracted context words: (1) s_a(x) consists of context words with active effects on the sentiment prediction of x; each word of s_a(x) will be encouraged to remain attended to in the refined model training. (2) s_m(x) contains context words with misleading effects, whose attention weights are expected to be decreased. Specifically, at the k-th training iteration, we adopt the following steps to deal with (x,t,y):


Step 1: lines 9 to 11

Step 2: line 12

Step 3: lines 13 to 20

Step 4: lines 21 to 24 (for details, please see the paper)

where E(\alpha(x')) = -\sum_{i=1}^{N} \alpha(x_{i}') \log \alpha(x_{i}')

Through K iterations of the above steps, we manage to extract influential context words of all training instances. Table 2 illustrates the context word mining process of the first sentence shown in Table 1. In this example, we iteratively extract three context words in turn: “small”, “crowded” and “quick”. The former two words are included in s_a(x), while the last one is contained in s_m(x). Finally, the extracted context words of each training instance will be included into D, forming a final training corpus Ds with attention supervision information (Lines 26-29), which will be used to carry out the last model training (Line 30).
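A schematic Python sketch of this erase-and-extract loop (my paraphrase of the description above, not the authors' code; `predict_with_attention`, the `MASK` token, and the entropy threshold `eps` are hypothetical placeholders):

```python
import math

MASK = "<mask>"   # special token that replaces erased words

def entropy(alpha):
    """E(alpha) = -sum_i alpha_i * log(alpha_i), entropy of an attention distribution."""
    return -sum(a * math.log(a) for a in alpha if a > 0)

def extract_influential_words(data, predict_with_attention, K=5, eps=1.0):
    """data: list of (x, t, y) with x a list of context words, t the aspect, y the gold label.
    predict_with_attention(x, t) -> (predicted_label, attention_weights) is assumed to be
    a trained ASC model (hypothetical interface). Returns per-instance word sets s_a / s_m."""
    s_a = {i: set() for i in range(len(data))}   # words with active effects
    s_m = {i: set() for i in range(len(data))}   # words with misleading effects
    masked = [list(x) for (x, t, y) in data]     # working copies that get progressively erased
    for _ in range(K):
        for i, (x, t, y) in enumerate(data):
            y_hat, alpha = predict_with_attention(masked[i], t)
            if entropy(alpha) > eps:             # attention already flat: no word dominates
                continue                         # (exact use of the threshold is my guess)
            j = max(range(len(alpha)), key=lambda k: alpha[k])   # most-attended context word
            word = masked[i][j]
            if word == MASK:
                continue
            if y_hat == y:
                s_a[i].add(word)                 # helped a correct prediction
            else:
                s_m[i].add(word)                 # misled an incorrect prediction
            masked[i][j] = MASK                  # erase it for the next iteration
    return s_a, s_m
```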

5. Experiments

5.3 Parameter Settings

The dimension of the GloVe word embeddings is 300; OOV words are initialized from U[-0.25, 0.25]; other parameters are initialized from U[-0.01, 0.01]; dropout is applied; the optimizer is Adam with a learning rate of 0.001.

The number of iterations is K = 5; the regularization coefficient \gamma per dataset:

LAPTOP: 0.1
REST: 0.5
TWITTER: 0.1

Evaluation metrics: accuracy and macro-F1.
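For reference, both metrics can be computed with scikit-learn (a generic snippet with made-up labels, unrelated to the paper's actual results):

```python
from sklearn.metrics import accuracy_score, f1_score

# toy gold / predicted labels over the three classes (positive / negative / neutral)
y_true = ["pos", "neg", "neu", "neg", "pos", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```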

dataset split: 80% for training and 20% for testing.

5.4 Results

The paper also explores how performance changes with the threshold \varepsilon_\alpha.

Case study:

6. Conclusion and Future Work
