BERT Word Embeddings Tutorial

This article is translated from the BERT Word Embeddings Tutorial, with parts of the content condensed. Please credit the source when reposting.

1. Loading Pre-Trained BERT

Install the PyTorch interface for BERT from Hugging Face. The library also includes interfaces for other pre-trained language models, such as OpenAI's GPT and GPT-2.

If you run this code on Google Colab, you must install the library every time you reconnect.

!pip install transformers

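BERT is a pre-trained model released by Google. It was trained on Wikipedia and Book Corpus (a dataset of 10,000+ books of different genres). Google released several BERT variants; here we use the smaller of the two available sizes ("base" and "large"), in the version that ignores letter case ("uncased").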

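transformers provides many BERT classes for different tasks. Here we use the most basic one, BertModel. Its output is not tailored to any specific task, which makes it a good choice for extracting embeddings.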

Now let's import PyTorch, the pre-trained BERT model, and the BERT tokenizer.

import torch
from transformers import BertTokenizer, BertModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
# logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
%matplotlib inline
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

2. Input Formatting

Because BERT is a pre-trained model that expects input in a specific format, we need:

A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
Tokens that conform with the fixed vocabulary used in BERT
The Token IDs for the tokens, from BERT’s tokenizer
Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
Segment IDs used to distinguish different sentences
Positional Embeddings used to show token position within the sequence

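Fortunately, the tokenizer.encode_plus function takes care of all of this for us. But since this is an introduction to working with BERT, we will mostly carry out these steps by hand.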

For a usage example of tokenizer.encode_plus, see this article.
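As a rough illustration of the token IDs and mask IDs listed above, here is a minimal sketch that builds both by hand for one padded sequence. The toy vocabulary, the `build_inputs` helper, and the chosen `max_len` are assumptions made for this example only; the real tokenizer uses its own vocabulary file and handles these details internally.

```python
# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary is far larger.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "man": 5, "went": 6}


def build_inputs(tokens, vocab, max_len):
    """Return (token_ids, mask_ids), both padded to max_len."""
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    # Look up each token; anything missing falls back to the [UNK] id.
    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    n_pad = max_len - len(tokens)
    # 1 marks a real token, 0 marks a padding position.
    mask_ids = [1] * len(tokens) + [0] * n_pad
    token_ids = token_ids + [vocab["[PAD]"]] * n_pad
    return token_ids, mask_ids


tokens = ["[CLS]", "the", "man", "went", "[SEP]"]
token_ids, mask_ids = build_inputs(tokens, TOY_VOCAB, max_len=8)
print(token_ids)  # [2, 4, 5, 6, 3, 0, 0, 0]
print(mask_ids)   # [1, 1, 1, 1, 1, 0, 0, 0]
```

The mask lets the model ignore the padded positions, which only matters once several sentences of different lengths are batched together.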

2.1 Special Tokens

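BERT can take one or two sentences as input. With two sentences, [SEP] separates them and the [CLS] token always appears at the start of the text. With one sentence, both tokens are still required, and [SEP] then marks the end of the sentence. For example: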

2-sentence input:

[CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]

1-sentence input:

[CLS] The man went to the store. [SEP]
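The marking step can be sketched as a small helper. The function name `add_special_tokens` is hypothetical, and the sketch assumes the pair layout `[CLS] A [SEP] B [SEP]`, in which the second sentence is also closed by a [SEP].

```python
def add_special_tokens(sentence_a, sentence_b=None):
    """Wrap one sentence, or a sentence pair, in BERT's special tokens."""
    marked = "[CLS] " + sentence_a + " [SEP]"
    if sentence_b is not None:
        # Assumption: the second sentence is terminated by its own [SEP].
        marked += " " + sentence_b + " [SEP]"
    return marked


print(add_special_tokens("The man went to the store."))
# [CLS] The man went to the store. [SEP]
print(add_special_tokens("The man went to the store.",
                         "He bought a gallon of milk."))
# [CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]
```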

2.2 Tokenization

The BERT tokenizer provides a tokenize method. Let's see how it handles a sentence.

text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
# Print out the tokens.
print (tokenized_text)

# ['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']

Notice how the word "embeddings" is represented: ['em', '##bed', '##ding', '##s']

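The original word has been split into smaller subwords and characters. The two hash signs (##) in front of some of these subwords mean that the subword or character is part of a larger word and is preceded by another subword. So, for example, '##bed' and 'bed' are different tokens: the first is used when the subword "bed" occurs inside a larger word, the second is a standalone token.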

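Why does it work this way? Because the BERT tokenizer was created with a WordPiece model. This model greedily builds a fixed-size vocabulary of the individual characters, subwords, and words that best fit the language data. Since the tokenizer of our BERT model limits the vocabulary size to about 30,000, the WordPiece vocabulary contains all English characters plus roughly 30,000 of the most common words and subwords found in the English corpus the model was trained on. The vocabulary contains four kinds of entries: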

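Whole words
Subwords that occur at the start of a word or on their own (the "em" in "embeddings" gets the same vector as the standalone "em" in "go get em")
Subwords that are not at the start of a word, which are prefixed with "##"
Individual characters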

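More specifically, the tokenizer first checks whether the whole word is in the vocabulary. If it is not, the tokenizer tries to break the word into the largest possible subwords that are in the vocabulary, and as a last resort it decomposes the word into individual characters. A word can therefore always be represented at least as a sequence of its individual characters, as long as those characters are themselves in the vocabulary. As a result, out-of-vocabulary words are not assigned to a catch-all token such as "UNK" but are decomposed into subword and character tokens.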
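The lookup described above can be sketched as a greedy longest-match-first search. This is a simplified illustration, not the library's implementation: the `wordpiece_tokenize` function and the tiny `toy_vocab` are assumptions for the example, and the sketch returns a single "[UNK]" when some part of the word cannot be matched at all.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Not even a single character matched: give up on this word.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the example.
toy_vocab = {"em", "bed", "##bed", "##ding", "##s", "here", "for"}

print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("bed", toy_vocab))         # ['bed']
```

Because the search always takes the longest piece available at each position, a word that is in the vocabulary as a whole is never split.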

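So even though the word "embeddings" is not in the vocabulary, it is not treated as an unknown word. It is split into the subword tokens ['em', '##bed', '##ding', '##s'], which retain some of the contextual meaning of the original word. We can even average these subword embedding vectors to produce an approximate vector for the original word. For more on WordPiece, see the original paper.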
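Averaging subword vectors into one word vector might look like the sketch below. It uses plain Python lists and made-up two-dimensional vectors so that it stays self-contained; the helper names `average_vectors` and `word_vectors_from_pieces` are hypothetical, and with real BERT output the same grouping would be applied to 768-dimensional tensors.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def word_vectors_from_pieces(tokens, vectors):
    """Group '##' pieces with the piece that starts the word, then average."""
    groups = []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith("##") and groups:
            # Continuation piece: extend the current word and collect its vector.
            groups[-1][0] += tok[2:]
            groups[-1][1].append(vec)
        else:
            groups.append([tok, [vec]])
    return [(word, average_vectors(vecs)) for word, vecs in groups]


# Made-up vectors for illustration only.
tokens = ["em", "##bed", "for"]
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(word_vectors_from_pieces(tokens, vectors))
# [('embed', [2.0, 3.0]), ('for', [5.0, 6.0])]
```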

Here are some examples of the tokens in our vocabulary:

list(tokenizer.vocab.keys())[5000:5020]

['knight',
'lap',
'survey',
'ma',
'##ow',
'noise',
'billy',
'##ium',
'shooting',
'guide',
'bedroom',
'priest',
'resistance',
'motor',
'homes',
'sounded',
'giant',
'##mer',
'150',
'scenes']

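After splitting the text into tokens, we have to convert the sentence into a list of vocabulary indices. From here on we use the example sentence below, which contains several instances of the word "bank" with different meanings.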

# Define a new example sentence with multiple meanings of the word "bank"
text = "After stealing money from the bank vault, the bank robber was seen " \
"fishing on the Mississippi river bank."
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Display the words with their indices.
for tup in zip(tokenized_text, indexed_tokens):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))

[CLS] 101
after 2,044
stealing 11,065
money 2,769
from 2,013
the 1,996
bank 2,924
vault 11,632
, 1,010
the 1,996
bank 2,924
robber 27,307
was 2,001
seen 2,464
fishing 5,645
on 2,006
the 1,996
mississippi 5,900
river 2,314
bank 2,924
. 1,012
[SEP] 102

2.3 Segment ID

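BERT expects 0s and 1s to distinguish the two sentences. That is, for each token in tokenized_text we must specify which sentence it belongs to. For a single sentence, a series of 1s is enough; for two sentences, assign 0 to every token of the first sentence (including its [SEP]) and 1 to every token of the second sentence.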

# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
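For the two-sentence case, the segment IDs can be derived from the position of the first [SEP]. The helper below is a hypothetical sketch, not part of the tokenizer API. Note that for a single sentence it yields all 0s rather than the all-1s list used above; either way every token shares one segment.

```python
def make_segment_ids(tokens, sep_token="[SEP]"):
    """Assign 0 up to and including the first [SEP], and 1 afterwards."""
    segment_ids = []
    current = 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == sep_token:
            # Everything after the first [SEP] belongs to the second sentence.
            current = 1
    return segment_ids


pair = ["[CLS]", "the", "man", "[SEP]", "he", "bought", "milk", "[SEP]"]
print(make_segment_ids(pair))  # [0, 0, 0, 0, 1, 1, 1, 1]
```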

3. Extracting Embeddings

3.1 Running BERT on our text

Next, we need to convert the data to PyTorch tensors.

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

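Calling from_pretrained fetches the model from the internet. When we load bert-base-uncased, the model definition is printed in the logging output. The model is a deep neural network with 12 layers. Explaining what each layer does is beyond the scope of this article; see the earlier posts on my blog for background.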

model.eval() puts the model in evaluation mode rather than training mode. In evaluation mode, dropout regularization is turned off.

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True, # Whether the model returns all hidden-states.
)
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

Next, let's pass the example text through the model and fetch the hidden states of the network.

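torch.no_grad() tells PyTorch not to construct the compute graph during the forward pass (since we won't be running backpropagation here), which reduces memory consumption and speeds things up.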

# Run the text through BERT, and collect all of the hidden states produced
# by the embedding layer and all 12 encoder layers.
with torch.no_grad():
    # Pass the segment IDs by keyword: the second positional argument of
    # BertModel.forward is `attention_mask`, not `token_type_ids`.
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Evaluating the model will return a different number of objects based on
    # how it's configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item will be the
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]
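The effect of model.eval() and torch.no_grad() can be observed on a much smaller network. The sketch below uses a hypothetical stand-in (one linear layer followed by dropout) instead of BERT, so nothing needs to be downloaded; the names `toy_model` and `x` are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in for BERT: one linear layer followed by dropout.
toy_model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

# Training mode: dropout is active and the output tracks gradient history.
toy_model.train()
train_out = toy_model(x)

# Evaluation mode without gradient tracking, as used for feature extraction.
toy_model.eval()
with torch.no_grad():
    eval_out_1 = toy_model(x)
    eval_out_2 = toy_model(x)

print(torch.equal(eval_out_1, eval_out_2))  # True: dropout is disabled
print(train_out.requires_grad)              # True
print(eval_out_1.requires_grad)             # False
```

The two settings are independent: eval() changes how layers such as dropout behave, while no_grad() only stops PyTorch from recording operations for backpropagation.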

3.2 Understanding the Output

The information in hidden_states is a bit complicated. The variable has four dimensions, in this order:

The layer number (13 layers)
The batch number (1 sentence)
The word / token number (22 tokens in our sentence)
The hidden unit / feature number (768 features)

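Wait a moment, 13 layers? Didn't we say BERT only has 12? That is because the first element is the output of the word embedding layer, and the rest are the outputs of the 12 encoder layers.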

The second dimension (the batch size) is used when submitting multiple sentences to the model at once; here we have just one sentence.

print ("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")
layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

Number of layers: 13 (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 22
Number of hidden units: 768

A quick look at the range of values for a given token and layer shows that most of them fall within [-2, 2], with a small number of values around -12.

# For the 5th token in our sentence, select its feature values from layer 5.
token_i = 5
layer_i = 5
vec = hidden_states[layer_i][batch_i][token_i]
# Plot the values as a histogram to show their distribution.
plt.figure(figsize=(10,10))
plt.hist(vec, bins=200)
plt.show()

Grouping the values by layer makes sense for the model, but for our purposes we want them grouped by token.

Current dimensions: [layers, batches, tokens, features]

Desired dimensions: [tokens, layers, features]

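Fortunately, PyTorch's permute function makes it easy to rearrange the dimensions. However, the first dimension of hidden_states is currently a Python sequence (one tensor per layer) rather than a tensor, so we first combine the layers into a single tensor.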

# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)
token_embeddings.size()
# torch.Size([13, 1, 22, 768])

Next we drop the "batch" dimension, since we don't need it.

# Remove dimension 1, the "batches".
token_embeddings = token_embeddings.squeeze(dim=1)
token_embeddings.size()
# torch.Size([13, 22, 768])

Finally, we use permute to swap the "layers" and "tokens" dimensions.

# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)
token_embeddings.size()
# torch.Size([22, 13, 768])
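To make the index bookkeeping concrete, the same regrouping can be written with nested Python lists. This is only an illustration of what the stack, squeeze, and permute steps achieve, using a made-up miniature `toy_hidden` (3 layers, 1 batch item, 2 tokens, 1 feature); the helper `to_token_major` is hypothetical.

```python
def to_token_major(hidden_states):
    """Turn hidden_states[layer][batch][token] into result[token][layer]."""
    # Drop the batch dimension by keeping only the first (and only) batch item.
    per_layer = [layer[0] for layer in hidden_states]
    # zip(*...) swaps the layer and token axes.
    return [list(token_layers) for token_layers in zip(*per_layer)]


# The single feature value encodes its origin as layer * 10 + token.
toy_hidden = [[[[layer * 10 + token] for token in range(2)]] for layer in range(3)]
regrouped = to_token_major(toy_hidden)
print(regrouped)
# [[[0], [10], [20]], [[1], [11], [21]]]
```

After regrouping, `regrouped[t][l]` holds the vector that was previously at `toy_hidden[l][0][t]`, which is the same correspondence the permuted tensor provides.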

3.3 Creating word and sentence vectors from hidden states

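We would like a single vector for each token, or a single vector for the whole sentence. But for each token of the input we have 13 separate vectors, each of length 768. To get a single vector we need to combine some of the layer vectors. But which layer, or which combination of layers, works best?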

Word Vectors

We create word vectors in two ways. The first is to concatenate the last four layers, which gives each token a vector of length 4 x 768 = 3,072.

# Stores the token vectors, with shape [22 x 3,072]
token_vecs_cat = []
# `token_embeddings` is a [22 x 13 x 768] tensor.
# For each token in the sentence...
for token in token_embeddings:
    # `token` is a [13 x 768] tensor
    # Concatenate the vectors (that is, append them together) from
    # the last four layers.
    # Each layer vector is 768 values, so `cat_vec` is length 3072.
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    # Use `cat_vec` to represent `token`.
    token_vecs_cat.append(cat_vec)
print ('Shape is: %d x %d' % (len(token_vecs_cat), len(token_vecs_cat[0])))
# Shape is: 22 x 3072

The second way is to sum the last four layers.

# Stores the token vectors, with shape [22 x 768]
token_vecs_sum = []
# `token_embeddings` is a [22 x 13 x 768] tensor.
# For each token in the sentence...
for token in token_embeddings:
    # `token` is a [13 x 768] tensor
    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)
    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)
print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))
# Shape is: 22 x 768
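Both combinations can also be computed without a Python loop by slicing the tensor directly. The sketch below runs on a random stand-in tensor named `fake_embeddings` (an assumption for the example) with the same [tokens, layers, features] layout, so it does not need the model.

```python
import torch

torch.manual_seed(0)
# Random stand-in with the same layout as the real tensor: [22 x 13 x 768].
fake_embeddings = torch.randn(22, 13, 768)

# Concatenate layers -1, -2, -3, -4 (the same order as the loop version)
# along the feature axis.
vecs_cat = torch.cat([fake_embeddings[:, i, :] for i in (-1, -2, -3, -4)], dim=1)
# Sum the last four layers for every token at once.
vecs_sum = fake_embeddings[:, -4:, :].sum(dim=1)

print(vecs_cat.shape)  # torch.Size([22, 3072])
print(vecs_sum.shape)  # torch.Size([22, 768])
```

The results are single tensors rather than lists of tensors, which is usually more convenient for later batch operations.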

Sentence Vectors

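There are many strategies for getting a single vector for a whole sentence. A simple one is to average the vectors of all tokens in the second-to-last layer.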

# `hidden_states` has shape [13 x 1 x 22 x 768]
# `token_vecs` is a tensor with shape [22 x 768]
token_vecs = hidden_states[-2][0]
# Calculate the average of all 22 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)
print("Our final sentence embedding vector of shape:", sentence_embedding.size())
# Our final sentence embedding vector of shape: torch.Size([768])

3.4 Confirming contextually dependent vectors

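To confirm that the values of these vectors really are context-dependent, let's look at the vectors for the different instances of the word "bank" in our example sentence: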

“After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.”

for i, token_str in enumerate(tokenized_text):
    print(i, token_str)

0 [CLS]
1 after
2 stealing
3 money
4 from
5 the
6 bank
7 vault
8 ,
9 the
10 bank
11 robber
12 was
13 seen
14 fishing
15 on
16 the
17 mississippi
18 river
19 bank
20 .
21 [SEP]

For this comparison, we use the word vectors built by summing the last four layers and print them out.

print('First 5 vector values for each instance of "bank".')
print('')
print("bank vault ", str(token_vecs_sum[6][:5]))
print("bank robber ", str(token_vecs_sum[10][:5]))
print("river bank ", str(token_vecs_sum[19][:5]))

First 5 vector values for each instance of "bank".
bank vault tensor([ 3.3596, -2.9805, -1.5421, 0.7065, ...])
bank robber tensor([ 2.7359, -2.5577, -1.3094, 0.6797, ...])
river bank tensor([ 1.5266, -0.8895, -0.5152, -0.9298, ...])

The values clearly differ, but computing the cosine similarity between the vectors gives a more precise comparison.

from scipy.spatial.distance import cosine
# Calculate the cosine similarity between the word bank
# in "bank robber" vs "bank vault" (same meaning).
same_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[6])
# Calculate the cosine similarity between the word bank
# in "bank robber" vs "river bank" (different meanings).
diff_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[19])
print('Vector similarity for *similar* meanings: %.2f' % same_bank) # 0.94
print('Vector similarity for *different* meanings: %.2f' % diff_bank) # 0.69
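For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, which is why the code above subtracts SciPy's cosine distance from 1. A minimal pure-Python sketch follows; the function name `cosine_similarity` is chosen for this example only.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

Because the measure depends only on direction, vectors of very different magnitudes can still score close to 1.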

3.5 Pooling Strategy & Layer Choice

BERT Authors

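The BERT authors tested different strategies by feeding different vector combinations as input features to an NER task and observing the resulting F1 scores.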

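Although concatenating the last four layers gave the best results on this particular task, many of the other methods performed nearly as well. It is generally advisable to test different versions for your specific application, since results may vary.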

Han Xiao's BERT-as-service

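Han Xiao created an open-source project on GitHub named bert-as-service, which is intended to create word embeddings for your text using BERT. He experimented with different approaches to combining these embeddings and shared some conclusions and rationale on the project's FAQ page.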

Han Xiao's view is that:

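The first layer is the embedding layer. Since it has no contextual information, the same word has the same vector in different contexts
As the embeddings move deeper into the network, they pick up more and more contextual information with each layer
As you approach the final layer, however, the embeddings start to pick up information specific to BERT's pre-training tasks (MLM and NSP)
The second-to-last layer is therefore a reasonable choice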
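Since the best choice depends on the application, it can help to make the layer strategy a parameter and compare the options. The helper below is a hypothetical sketch on plain lists: `pool_layers` and the strategy names are invented for this example, and the toy input has 5 layers of 2 features instead of 13 layers of 768.

```python
def pool_layers(token_layers, strategy="second_to_last"):
    """Combine one token's per-layer vectors (embedding layer first)."""
    if strategy == "last":
        return list(token_layers[-1])
    if strategy == "second_to_last":
        return list(token_layers[-2])
    if strategy == "sum_last_four":
        return [sum(vals) for vals in zip(*token_layers[-4:])]
    if strategy == "concat_last_four":
        # Same order as earlier in the article: layers -1, -2, -3, -4.
        return [x for layer in reversed(token_layers[-4:]) for x in layer]
    raise ValueError("unknown strategy: " + strategy)


# Toy token: layer i holds the vector [i, 10 * i].
toy_token = [[i, 10 * i] for i in range(5)]
print(pool_layers(toy_token))                      # [3, 30]
print(pool_layers(toy_token, "sum_last_four"))     # [10, 100]
print(pool_layers(toy_token, "concat_last_four"))  # [4, 40, 3, 30, 2, 20, 1, 10]
```

Keeping the strategy as an argument makes it straightforward to evaluate each option on a downstream task before settling on one.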