Word Embedding and the Skip-Gram Model in Practice

What is Word Embedding?

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.

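Word embedding is essentially a more advanced way of vectorizing words. It is a mapping that sends each word to a point in the n-dimensional space $R^{n}$. After this vectorization, words of the same kind tend to end up with vectors that are closer together. For example, suppose that in a corpus "I like apple" and "I like watermelon" occur often, while apple and China rarely occur together. Then the vectors for apple and watermelon should be closer to each other than the vectors for apple and China.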
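To make the distance idea concrete, here is a minimal sketch. The three vectors are invented by hand purely for illustration (they are not learned from any corpus), and `euclidean` is a hypothetical helper name.

```python
import math

# Hypothetical, hand-picked vectors: NOT learned embeddings.
vectors = {
    "apple":      [0.9, 0.8, 0.1],
    "watermelon": [0.8, 0.9, 0.2],
    "china":      [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_fruit = euclidean(vectors["apple"], vectors["watermelon"])
d_other = euclidean(vectors["apple"], vectors["china"])
print(d_fruit < d_other)  # True
```

A good embedding is expected to produce this kind of ordering on its own, without anyone choosing the numbers.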

Common word vectorization schemes include the one-hot model and the bag-of-words model, but neither has the property described above.

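In addition, with the bag-of-words model the resulting vector usually has a very high dimension and consists mostly of zeros, and this is especially true of the one-hot model. With word embedding, by contrast, a word can usually be represented by a real-valued vector of only 150-300 dimensions.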

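Skip-Gram is one way of learning such word vectors. Starting from the one-hot vectors of the words, it uses a simple classification neural network to learn a vector for each word. Skip-Gram also takes the contextual meaning of a word into account: training samples are built from the words that appear next to it in the corpus. The "skip" in Skip-Gram marks the boundary of the context. Take the sentence: I live in china and I like Chinese food. If china is the center word and we set skip=3, so that every word within three positions of china counts as a context word, this amounts to saying that we regard the following word pairs as related:

<I, china>, <live, china>, <in, china>, <china, and>, <china, I>, <china, like>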
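The window in this example can be reproduced with a few lines of Python. This is only a sketch: `window` is a hypothetical helper, and the tokenization reuses the simple letters-and-hyphens pattern used later in this article.

```python
import re

sentence = "I live in china and I like Chinese food."
tokens = re.findall(r"[A-Za-z-]+", sentence)

def window(tokens, center_index, skip):
    """Return the words within `skip` positions left and right of the center word."""
    left = tokens[max(0, center_index - skip):center_index]
    right = tokens[center_index + 1:center_index + skip + 1]
    return left, right

center = tokens.index("china")
left, right = window(tokens, center, skip=3)
print(left)   # ['I', 'live', 'in']
print(right)  # ['and', 'I', 'like']
```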

Next, let's see how to use Skip-Gram to vectorize words.

First we define a fairly simple corpus.

Pumas are large, cat-like animals which are found in America. When reports came into London Zoo that a wild puma had been spotted forty-five miles south of London, they were not taken seriously. However, as the evidence began to accumulate, experts from the Zoo felt obliged to investigate, for the descriptions given by people who claimed to have seen the puma were extraordinarily similar. The hunt for the puma began in a small village where a woman picking blackberries saw ‘a large cat’ only five yards away from her. It immediately ran away when she saw it, and experts confirmed that a puma will not attack a human being unless it is cornered. The search proved difficult, for the puma was often observed at one place in the morning and at another place twenty miles away in the evening. Wherever it went, it left behind it a trail of dead deer and small animals like rabbits. Paw prints were seen in a number of places and puma fur was found clinging to bushes. Several people complained of ‘cat-like noises’ at night and a businessman on a fishing trip saw the puma up a tree. The experts were now fully convinced that the animal was a puma, but where had it come from? As no pumas had been reported missing from any zoo in the country, this one must have been in the possession of a private collector and somehow managed to escape. The hunt went on for several weeks, but the puma was not caught. It is disturbing to think that a dangerous wild animal is still at large in the quiet countryside.

This is a passage from New Concept English, Book 3.
1. Clean the data: remove punctuation and extract the words

__author__ = 'jmh081701'
import re

def getWords(data):
    # Match runs of letters and hyphens, so words like "cat-like" stay whole
    rule = r"([A-Za-z-]+)"
    pattern = re.compile(rule)
    words = pattern.findall(data)
    return words

Let's test it by printing the first 5 words:

>>words = getWords(data)
>>print(words[0:5])
['Pumas', 'are', 'large', 'cat-like', 'animals']

2. Find how many distinct words there are and count their frequencies

def enumWords(words):
    # Map each distinct word to the number of times it occurs
    rst = {}
    for each in words:
        if each not in rst:
            rst[each] = 1
        else:
            rst[each] += 1
    return rst

The enumWords function counts the distinct words extracted above and returns a dict whose keys are the words and whose values are their frequencies.

>>words = getWords(data)
>>words = enumWords(words)
>>print(len(words))
154

There are 154 distinct words in total.
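The same counting can also be done with `collections.Counter` from the standard library. The sketch below runs on a short made-up sample string rather than the full passage, so its numbers are not the article's.

```python
import re
from collections import Counter

# Hypothetical sample text, used only to keep the example short.
sample = "a puma was seen, and the puma ran away"
sample_words = re.findall(r"[A-Za-z-]+", sample)

freq = Counter(sample_words)   # behaves like the dict returned by enumWords
print(len(freq))               # 8 distinct words
print(freq["puma"])            # 2
print(freq.most_common(1))     # [('puma', 2)]
```

`Counter` is a dict subclass, so code that expects the plain dict from enumWords should keep working with it.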

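3. Preliminary data cleaning: build the vocabulary
If needed, we can drop words with very low frequency so that the vocabulary does not become too large.
In this example, however, there are only 154 distinct words, so there is no real need to drop the rare ones.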

>>vocaburary=list(words)
>>print(vocaburary[0:5])
['accumulate', 'which', 'reported', 'spotted', 'think']
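For a larger corpus, the frequency filter mentioned above could look like the sketch below. The function name `build_vocabulary`, the `min_count` parameter, and the counts are all assumptions made for illustration; the article itself keeps every word.

```python
def build_vocabulary(freq, min_count=1):
    """Keep words whose frequency is at least min_count, in sorted order."""
    return sorted(w for w, c in freq.items() if c >= min_count)

# Made-up counts, not taken from the passage above.
freq = {"puma": 9, "the": 12, "blackberries": 1, "zoo": 2}

print(build_vocabulary(freq, min_count=2))  # ['puma', 'the', 'zoo']
print(len(build_vocabulary(freq)))          # 4
```

Sorting is a design choice in this sketch: it makes the position of each word reproducible between runs, which matters once positions are used as one-hot indices.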

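4. Use one-hot encoding for an initial vectorization of each word
In our corpus the vocabulary has only 154 words, so the one-hot encoding of each word is a 154-dimensional vector in which exactly one component is 1 and all the others are 0.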

$[0,0,0,\dots,1,\dots,0,0,0]$
The i-th component being 1 means that the word is at position i in the vocabulary.
For example, the first 5 words of our vocabulary are:

['accumulate', 'which', 'reported', 'spotted', 'think']

Then the one-hot encoding of 'accumulate' is:
$[1,0,0,\dots,0,0,0,0]$
Similarly, the one-hot encoding of 'which' is:
$[0,1,0,\dots,0,0,0,0]$
Let's write a function that one-hot encodes a given word:

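import numpy

def onehot(word, vocaburary):
    # Build a 1 x len(vocaburary) vector with a single 1 at the word's index
    l = len(vocaburary)
    vec = numpy.zeros(shape=[1, l], dtype=numpy.float32)
    index = vocaburary.index(word)
    vec[0][index] = 1.0
    return vec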

The function returns a numpy array: a 1 x n row vector.
Let's test it:

>>print(onehot(vocaburary[0],vocaburary))
[[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
···
···0.]]
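Because `list.index` scans the vocabulary on every call, a larger vocabulary may benefit from a precomputed word-to-index dict. The following is a sketch of that variant, not the article's code; `onehot_fast`, `word_to_index`, and the 5-word `vocab` are names and data assumed for the example.

```python
import numpy

# First five vocabulary words shown above, used here as a toy vocabulary.
vocab = ["accumulate", "which", "reported", "spotted", "think"]
word_to_index = {w: i for i, w in enumerate(vocab)}
identity = numpy.eye(len(vocab), dtype=numpy.float32)

def onehot_fast(word):
    """Return a 1 x n one-hot row vector using a dict lookup."""
    return identity[word_to_index[word]].reshape(1, -1).copy()

vec = onehot_fast("which")
print(vec.shape)          # (1, 5)
print(int(vec.argmax()))  # 1
```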

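5. Define the context boundary skip_size
The context boundary states that the skip_size words before and the skip_size words after a word in the text are closely related to it in meaning. skip_size is usually not large; 2-5 is enough. The skip_size can also be chosen at random.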

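import random

def getContext(words):
    rst = []
    for index, word in enumerate(words):
        # Draw a random window size for each center word
        skip_size = random.randint(1, 5)
        # Pair the center word with the words before it
        for i in range(max(0, index - skip_size), index):
            rst.append([word, words[i]])
        # Pair it with the words after it (+1 so the window is symmetric)
        for i in range(index + 1, min(len(words), index + skip_size + 1)):
            rst.append([word, words[i]])
    return rst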

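Note that words here is the return value of getWords(), that is, the words of the original text in their original order, not the frequency dict.
Test: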

>>Context=getContext(getWords(data))
>>print(Context[0:5])
[['Pumas', 'are'], ['are', 'Pumas'], ['are', 'large'], ['are', 'cat-like'], ['large', 'are']]

6. Design the neural network to train

First, let's be clear about the format of the samples:

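The word pairs obtained in step 5 are our training samples. Each item returned in step 5 is a word pair; we treat its first element as X and its second element as Y, and one-hot encode both to form a training sample.
For example, take the first element of Context above: ['Pumas', 'are']
We use the one-hot encoding of 'Pumas' as X and the one-hot encoding of 'are' as Y.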

>>Context=getContext(getWords(data))
>>X=onehot(Context[0][0],vocaburary)
>>Y=onehot(Context[0][1],vocaburary)
>>print(X)
[[ 0.
···  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  ···
]]
>>print(Y)
[[ 0. ....  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.   0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
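Encoding one pair at a time works, but training usually wants all samples stacked into two matrices. The sketch below shows one way to do that; `build_dataset` is a hypothetical helper, and the three-word vocabulary is a toy stand-in for the real 154-word one.

```python
import numpy

def build_dataset(pairs, vocab):
    """Stack one-hot rows for (center, context) pairs into matrices X and Y."""
    index = {w: i for i, w in enumerate(vocab)}
    X = numpy.zeros((len(pairs), len(vocab)), dtype=numpy.float32)
    Y = numpy.zeros((len(pairs), len(vocab)), dtype=numpy.float32)
    for row, (center, context) in enumerate(pairs):
        X[row, index[center]] = 1.0
        Y[row, index[context]] = 1.0
    return X, Y

# First three pairs from the Context output above, with a toy vocabulary.
pairs = [["Pumas", "are"], ["are", "Pumas"], ["are", "large"]]
vocab = ["Pumas", "are", "large"]

X, Y = build_dataset(pairs, vocab)
print(X.shape, Y.shape)  # (3, 3) (3, 3)
```

Row k of X and row k of Y together form the k-th training sample.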

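OK, so the input and the output are both 154-dimensional one-hot vectors, each stored as a 1 x 154 row vector.
Next we design a simple multilayer perceptron; we want to represent each word with a 30-dimensional vector.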

Input layer: 1 x 154
Hidden layer: weight matrix W, shape 154 x 30
bias b, shape 1 x 30
activation function: none
Output layer: weight matrix W', shape 30 x 154
bias b', shape 1 x 154
activation function: softmax
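The layer shapes above can be checked with a small NumPy forward pass. This is a sketch under assumptions: the weights are random (untrained), the initialization scale of 0.1 is arbitrary, and `W2`/`b2` stand for W' and b'.

```python
import numpy

V, D = 154, 30                            # vocabulary size, embedding size
rng = numpy.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, D))    # hidden weights, 154 x 30
b = numpy.zeros((1, D))                   # hidden bias, 1 x 30
W2 = rng.normal(scale=0.1, size=(D, V))   # output weights W', 30 x 154
b2 = numpy.zeros((1, V))                  # output bias b', 1 x 154

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = numpy.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    h = x @ W + b               # hidden layer, no activation
    p = softmax(h @ W2 + b2)    # probability of each word being a context word
    return h, p

x = numpy.zeros((1, V))
x[0, 3] = 1.0                   # one-hot input for the word at index 3
h, p = forward(x)
print(h.shape, p.shape)         # (1, 30) (1, 154)
print(numpy.allclose(h, W[3]))  # True while b is zero
```

The last line shows why the rows of W serve as word vectors: multiplying a one-hot input by W simply selects one row.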

After training, the hidden-layer matrix W is what we want: the i-th row of W is the vector representation of the i-th word in the vocabulary.
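The article does not show the training step itself, so the following is only one possible sketch: full-batch gradient descent on softmax cross-entropy, written in plain NumPy. The function name `train_skipgram`, the learning rate, the epoch count, and the three-word toy data are all assumptions; a real run would use the 154-word one-hot samples and dim=30.

```python
import numpy

def train_skipgram(X, Y, dim=30, lr=0.5, epochs=200, seed=0):
    """Train the two-layer network; return the embedding matrix W and the loss history."""
    rng = numpy.random.default_rng(seed)
    n, V = X.shape
    W = rng.normal(scale=0.1, size=(V, dim))
    b = numpy.zeros((1, dim))
    W2 = rng.normal(scale=0.1, size=(dim, V))
    b2 = numpy.zeros((1, V))
    losses = []
    for _ in range(epochs):
        H = X @ W + b                        # hidden layer, no activation
        Z = H @ W2 + b2
        Z -= Z.max(axis=1, keepdims=True)    # numerical stability
        P = numpy.exp(Z)
        P /= P.sum(axis=1, keepdims=True)    # softmax
        loss = -numpy.mean(numpy.sum(Y * numpy.log(P + 1e-12), axis=1))
        losses.append(float(loss))
        dZ = (P - Y) / n                     # gradient of the mean loss w.r.t. logits
        dH = dZ @ W2.T                       # computed before W2 is updated
        W2 -= lr * (H.T @ dZ)
        b2 -= lr * dZ.sum(axis=0, keepdims=True)
        W -= lr * (X.T @ dH)
        b -= lr * dH.sum(axis=0, keepdims=True)
    return W, losses

# Toy data: 3-word vocabulary, (center, context) index pairs.
eye = numpy.eye(3)
X = eye[[0, 1, 1, 2]]
Y = eye[[1, 0, 2, 1]]

W, losses = train_skipgram(X, Y, dim=2)
print(W.shape)                 # (3, 2): one 2-dimensional vector per word
print(losses[-1] < losses[0])  # True: the loss went down
```

Row i of the returned W is the learned vector for word i, matching the description above.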
