The Simplest Way to Get ELMo Word Vectors

Introduction

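This post shows the simplest way to get the word representations that ELMo produces. I read several other write-ups, but in the end this is all I actually needed: I only want the word vectors it generates.
A quick introduction to ELMo: it comes from AllenNLP's NAACL 2018 best paper, Deep contextualized word representations. Adding ELMo to existing models improved their results on tasks such as NLI.
So, straight to how to get the ELMo vectors. There are now versions for TensorFlow, PyTorch, and Keras. This post uses the official ELMo snippet approach: instead of wiring ELMo into a model, it returns a Tensor of word vectors directly. I only want the word vectors, and training the model myself would cost both time and hardware.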

Environment

First, create a new environment in conda:

conda create -n allennlp python=3.6

Then install allennlp (make sure gcc works on your machine, because the build needs a C++ toolchain):

pip install allennlp

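Just keep your network connection up. There is a lot to download, including PyTorch and the rest of the dependencies.
Next, download the pretrained options and weights that AllenNLP provides.
URLs: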

https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json
https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5

Keeping local copies of these two files makes them easy to reuse.
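The download step can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names `local_path` and `download_if_missing` are made up for illustration, and the `/files` target directory is an assumption taken from the paths used in the Method section; adjust it to wherever you keep the files.

```python
import os
import urllib.request

# Base URL and file names copied from the two links above.
BASE_URL = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/"
FILES = [
    "elmo_2x4096_512_2048cnn_2xhighway_options.json",
    "elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5",
]


def local_path(filename, target_dir="/files"):
    """Return where a file is stored locally (hypothetical layout)."""
    return os.path.join(target_dir, filename)


def download_if_missing(filename, target_dir="/files"):
    """Download one file unless a local copy already exists, then return its path."""
    path = local_path(filename, target_dir)
    if not os.path.exists(path):
        os.makedirs(target_dir, exist_ok=True)
        urllib.request.urlretrieve(BASE_URL + filename, path)
    return path


# Usage (the weights file is large, so this can take a while):
# for name in FILES:
#     print(download_if_missing(name))
```

Because existing files are skipped, running the script again does not download anything twice.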

Method

Here is how to get word vectors from these two files:

from allennlp.commands.elmo import ElmoEmbedder

# Paths to the two downloaded files
options_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "/files/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = ElmoEmbedder(options_file, weight_file)

# batch_to_embeddings converts the tokens to character ids internally (via batch_to_ids)
context_tokens = [['I', 'love', 'you', '.'], ['Sorry', ',', 'I', 'don', "'t", 'love', 'you', '.']]  # tokenized reference sentences
elmo_embedding, elmo_mask = elmo.batch_to_embeddings(context_tokens)

print(elmo_embedding)
print(elmo_mask)

Result

Embedding:
tensor([[[[ 0.6923, -0.3261,  0.2283,  ...,  0.1757,  0.2660, -0.1013],
          [-0.7348, -0.0965, -0.1411,  ..., -0.3411,  0.3681,  0.5445],
          [ 0.3645, -0.1415, -0.0662,  ...,  0.1163,  0.1783, -0.7290],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

         [[-1.1051, -0.4092, -0.4365,  ..., -0.6326,  0.4735, -0.2577],
          [ 0.0899, -0.4828, -0.5596,  ...,  0.4372,  0.3840, -0.7343],
          [-0.5538, -0.1473, -0.2441,  ...,  0.2551,  0.0873,  0.2774],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

         [[-3.2634, -0.9448, -0.3199,  ..., -1.2070,  0.6930, -0.2016],
          [-0.3688, -0.7632, -0.0715,  ...,  0.6294,  1.6869, -0.6655],
          [-1.0870, -1.4243, -0.2445,  ...,  0.0825,  0.5020,  0.2765],
          ...,
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]],


        [[[ 0.5042, -0.6629, -0.0231,  ..., -0.3084, -0.9741, -0.7230],
          [ 0.1131,  0.1575,  0.1414,  ...,  0.3718, -0.1432, -0.0248],
          [ 0.6923, -0.3261,  0.2283,  ...,  0.1757,  0.2660, -0.1013],
          ...,
          [-0.7348, -0.0965, -0.1411,  ..., -0.3411,  0.3681,  0.5445],
          [ 0.3645, -0.1415, -0.0662,  ...,  0.1163,  0.1783, -0.7290],
          [-0.8872, -0.2004, -1.0601,  ..., -0.2655,  0.2115,  0.1977]],

         [[ 0.1221, -0.7032,  0.0169,  ..., -0.3249, -0.4935, -0.4965],
          [ 0.3399, -0.4682,  0.1888,  ..., -0.0565,  0.1001, -0.0416],
          [-0.8135, -0.8491, -0.3264,  ..., -0.5674,  0.2638,  0.2006],
          ...,
          [ 0.4460, -0.4475, -0.1583,  ...,  0.4372,  0.3840, -0.7343],
          [-0.1287,  0.0161,  0.0315,  ...,  0.2551,  0.0873,  0.2774],
          [-1.2373, -0.3373,  0.1098,  ..., -0.0276, -0.0181,  0.0602]],

         [[-0.0830, -1.5891, -0.2576,  ..., -1.2944,  0.1082,  0.6745],
          [-0.0724, -0.7200,  0.1463,  ...,  0.6919,  0.9144, -0.1260],
          [-2.3460, -1.1714, -0.7065,  ..., -1.2885,  0.4679,  0.3800],
          ...,
          [ 0.1246, -0.6929,  0.6330,  ...,  0.6294,  1.6869, -0.6655],
          [-0.5757, -1.0845,  0.5794,  ...,  0.0825,  0.5020,  0.2765],
          [-1.2392, -0.6155, -0.9032,  ...,  0.0524, -0.0852,  0.0805]]]])
Mask:
tensor([[1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])

Tips

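The output is a word embedding tensor of shape 2 * 3 * 8 * 1024. The 2, 3, and 8 depend on the input batch and the model configuration.
2 is the batch_size; 3 is the two biLM layer outputs plus the output of the CNN layer that encodes characters; 8 is the length of the longest token list (shorter lists are padded to match); 1024 is the dimension of each layer's output.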
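Most downstream models want one vector per token rather than three, so the layer axis usually has to be reduced. The sketch below shows a few common options on a random stand-in tensor with the same layout as `elmo_embedding`, so it runs without the model files. It assumes that index 0 on the layer axis is the character-CNN output and that the LSTM layers follow; the printed result above is consistent with this, since the vector for 'I' in layer 0 is identical in both sentences. The variable names are illustrative only.

```python
import torch

# Stand-in with the same layout as elmo_embedding: (batch, layers, tokens, dim).
elmo_embedding = torch.randn(2, 3, 8, 1024)
batch, layers, tokens, dim = elmo_embedding.shape

# Option 1: keep only the context-insensitive character-CNN layer (assumed index 0).
char_cnn_layer = elmo_embedding[:, 0]        # (batch, tokens, dim)

# Option 2: keep only the last (top) biLM layer.
top_lstm_layer = elmo_embedding[:, -1]       # (batch, tokens, dim)

# Option 3: average the three layers.
layer_average = elmo_embedding.mean(dim=1)   # (batch, tokens, dim)

# Option 4: concatenate the three layers along the feature axis.
layer_concat = elmo_embedding.permute(0, 2, 1, 3).reshape(batch, tokens, layers * dim)

print(char_cnn_layer.shape, layer_average.shape, layer_concat.shape)
```

Which option works best depends on the task; a plain average is a simple default, not the learned weighted sum described in the ELMo paper.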
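The mask has shape 2 * 8: 2 is the batch_size and 8 is the length of the longest list. The first list has 4 tokens and the second has 8, so positions that hold a real token are 1 and padded positions are 0.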
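To make the padding logic concrete, here is a small pure-Python sketch that rebuilds the same mask from the token lists. `build_mask` is a hypothetical helper written for illustration; it is not how allennlp computes the mask internally, it only reproduces the values.

```python
context_tokens = [
    ["I", "love", "you", "."],
    ["Sorry", ",", "I", "don", "'t", "love", "you", "."],
]


def build_mask(batch):
    """Return 1 for real tokens and 0 for padding, padded to the longest sentence."""
    max_len = max(len(sentence) for sentence in batch)
    return [[1] * len(s) + [0] * (max_len - len(s)) for s in batch]


mask = build_mask(context_tokens)

# Summing each row recovers the true sentence lengths.
lengths = [sum(row) for row in mask]

# Expected embedding shape: (batch_size, 3 layers, longest sentence, 1024).
expected_shape = (len(context_tokens), 3, len(mask[0]), 1024)

print(mask)
print(lengths)
print(expected_shape)
```

The recovered lengths are what you would use to slice the padded rows off each sentence before passing the vectors on.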

References

https://cstsunfu.github.io/2018/06/ELMo/
https://blog.csdn.net/sinat_26917383/article/details/81913790