Implementing Word2Vec with TensorFlow

In the previous chapter we derived the theory behind Word2Vec and covered its background in detail. This chapter walks through a TensorFlow implementation of Word2Vec using the Skip-Gram model. The focus is on the core code that defines and trains the model; the data-reading code is only covered briefly.

Importing the packages

import collections
import math
import os
import random
import zipfile
import numpy as np
import urllib.request
import tensorflow as tf
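
The code in this chapter targets the TensorFlow 1.x graph API (placeholder, Session, and so on). If you only have TensorFlow 2.x installed, a commonly used shim is the v1 compatibility module; this is a minimal sketch of that assumption and is not part of the original code. It would replace the plain import tensorflow as tf above.

# Optional: run the TF 1.x style code in this chapter on a TensorFlow 2.x install.
# This assumes a TF 2.x environment; skip it entirely on TF 1.x.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # restores placeholders, Session, and graph mode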

Defining a download function and reading the text data

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename +
                        '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words."""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

words = read_data(filename)
print('Data size', len(words))

Building the vocabulary

Put the 50,000 most frequent words into the dictionary; any word outside the dictionary is mapped to the unknown token 'UNK', which gets index 0.

vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
del words  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

data_index = 0
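
To make the two mappings concrete: dictionary maps a word to its integer id, and reverse_dictionary maps the id back to the word. A small sanity check along these lines (the word 'the' is just an assumed example; any frequent word in text8 works):

# Quick check of the word/id mappings (word choice is arbitrary).
word = 'the'                    # assumed to be in the top-50000 vocabulary
idx = dictionary.get(word, 0)   # falls back to 0 ('UNK') if the word is absent
print(word, '->', idx, '->', reverse_dictionary[idx])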

Generating training batches for the Skip-Gram model

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)

for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]],
          '->', labels[i, 0], reverse_dictionary[labels[i, 0]])
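
To see why generate_batch(batch_size=8, num_skips=2, skip_window=1) emits two (center, context) pairs per center word, here is a toy illustration that builds the same kind of pairs on a made-up sentence, without the deque bookkeeping. The sentence is purely hypothetical and not part of the dataset:

# Toy illustration of Skip-Gram pair generation on a hypothetical sentence.
toy = "the quick brown fox jumps over the lazy dog".split()
skip = 1  # same role as skip_window above
for pos, center in enumerate(toy):
    # every word within `skip` positions of the center becomes a context label
    for ctx in range(max(0, pos - skip), min(len(toy), pos + skip + 1)):
        if ctx != pos:
            print(center, '->', toy[ctx])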

Building and training the model (the key part!)

First create a Graph and set it as the default graph, and define placeholders for the training data. Then use tf.random_uniform to randomly initialize the embedding vectors for all words in embeddings, and use tf.nn.embedding_lookup to look up the vectors corresponding to the inputs train_inputs. The training objective is the NCE loss; NCE uses negative sampling to cut down the amount of computation, and tf.reduce_mean aggregates the loss over the batch.

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

valid_size = 16       # Random set of words to evaluate similarity on.
valid_window = 100    # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64      # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default():
    # Input data.
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Ops and variables pinned to the CPU because of missing GPU implementation.
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Construct the variables for the NCE loss.
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Compute the average NCE loss for the batch.
    # tf.nn.nce_loss automatically draws a new sample of the negative labels
    # each time we evaluate the loss.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=train_labels,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Compute the cosine similarity between minibatch examples and all embeddings.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    # Add variable initializer.
    init = tf.global_variables_initializer()
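
If you want to compare NCE against another negative-sampling objective, TensorFlow 1.x also provides sampled softmax, which accepts the same arguments as tf.nn.nce_loss used above. This is an optional substitution, not part of the original tutorial; the snippet would replace the loss definition inside the with graph.as_default() block:

# Optional alternative to NCE: sampled softmax with the same weights/biases.
# Drop-in replacement for the `loss = ...` line inside the graph block above.
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=nce_weights,
                               biases=nce_biases,
                               labels=train_labels,
                               inputs=embed,
                               num_sampled=num_sampled,
                               num_classes=vocabulary_size))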

Training

num_steps = 100001

with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print("Initialized")

    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including
        # it in the list of returned values for session.run()).
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps).
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
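
After training, final_embeddings is an ordinary NumPy array of shape (vocabulary_size, embedding_size), so it can be saved and queried without TensorFlow. A minimal sketch; the file name and the lookup word 'king' are arbitrary choices for illustration:

# Persist the trained (L2-normalized) embeddings and look one word up again.
np.save('word2vec_embeddings.npy', final_embeddings)  # arbitrary file name
vec = final_embeddings[dictionary.get('king', 0)]      # 'king' is just an example word
print(vec.shape)  # (embedding_size,), i.e. (128,)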

Visualizing the word vectors

import matplotlib.pyplot as plt  # needed for plotting; not imported above

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)
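
plot_with_labels only draws points that have already been reduced to two dimensions. A common way to do the reduction, following the standard TensorFlow word2vec tutorial this code is based on, is scikit-learn's t-SNE; a sketch under that assumption (scikit-learn and matplotlib must be installed, plot_only=500 is a typical choice, and in newer scikit-learn versions the n_iter argument is named max_iter):

# Reduce the first 500 embeddings to 2-D with t-SNE and plot them.
from sklearn.manifold import TSNE

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [reverse_dictionary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)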

Finally, the trained embeddings are compressed onto a two-dimensional plane, where the relationships between words can be observed.
