by Kavita Ganesan

How to get started with Word2Vec — and then how to make it work

The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell you who you are.”

If you have two words that have very similar neighbors (meaning: the context in which they’re used is about the same), then these words are probably quite similar in meaning or are at least related. For example, the words shocked, appalled, and astonished are usually used in a similar context.

Using this underlying assumption, you can use Word2Vec to surface similar concepts, find unrelated concepts, compute similarity between two words, and more!

Down to business

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work. I’ve long heard complaints about poor performance in general, but it really is a combination of two things: (1) your input data and (2) your parameter settings.

Note that the training algorithms in the Gensim package were actually ported from the original Word2Vec implementation by Google and extended with additional functionality.

Imports and logging

First, we start with our imports and get logging established:

# imports needed and logging
import gzip
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Dataset

Our next task is finding a really good dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data in the relevant domain. For example, if your goal is to build a sentiment lexicon, then using a dataset from the medical domain or even Wikipedia may not be effective. So, choose your dataset wisely.

For this tutorial, I am going to use data from the OpinRank dataset, which comes from some of my Ph.D. work. This dataset has full user reviews of cars and hotels. I have specifically gathered all of the hotel reviews into one big file which is about 97 MB compressed and 229 MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review.

Now, let’s take a closer look at this data below by printing the first line.

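A minimal sketch of that step. The file name `reviews_data.txt.gz` is an assumption here; substitute whatever path you saved the OpinRank archive to:

```python
import gzip
import os

def print_first_line(path):
    """Print the raw first line of a gzipped text file."""
    with gzip.open(path, "rb") as f:
        print(next(f))

# Hypothetical path to the compressed OpinRank hotel reviews.
data_file = "reviews_data.txt.gz"
if os.path.exists(data_file):
    print_first_line(data_file)
```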
You should see the following:

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Beijing, then you will be ok.I chose to have some breakfast in the hotel, which was really tasty and there was a good selection of dishes. There are a couple of computers to use in the communal area, as well as a pool table. There is also a small swimming pool and a gym area.I would definitely stay in this hotel again, but only if I did not plan to travel to central Beijing, as it can take a long time. The location is ok if you plan to do a lot of shopping, as there is a big shopping centre just few minutes away from the hotel and there are plenty of eating options around, including restaurants that serve a dog meat!\t\r\n"

You can see that this is a pretty good, full review with lots of words and that’s what we want. We have approximately 255,000 such reviews in this dataset.

To avoid confusion, note that Gensim’s Word2Vec tutorial says you need to pass a sequence of sentences as the input to Word2Vec. However, you can actually pass in a whole review as a sentence (that is, a much larger piece of text) if you have a lot of data, and it should not make much of a difference. In the end, all we are using the dataset for is to get all neighboring words for a given target word.

Read files into a list

Now that we’ve had a sneak peek of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below that I am directly reading the compressed file. I’m also doing a mild pre-processing of the reviews using gensim.utils.simple_preprocess. This does some basic pre-processing such as tokenization, lowercasing, and so on, and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official Gensim documentation site.

Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step. So, we are essentially passing on a list of lists, where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.

After building the vocabulary, we just need to call train(...) to start training the Word2Vec model. Behind the scenes we are actually training a simple neural network with a single hidden layer. But we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.

Training Word2Vec on the OpinRank dataset takes about 10–15 minutes, so please be patient while running your code on this dataset.

The fun part — some results!

Let’s get to the fun stuff already! Since we trained on user reviews, it would be nice to see similarity on some adjectives. This first example shows a simple look up of words similar to the word ‘dirty’. All we need to do here is to call the most_similar function and provide the word ‘dirty’ as the positive example. This returns the top 10 similar words.

Ooh, that looks pretty good. Let’s look at more.

Similar to polite:

Similar to france:

Similar to shocked:

Overall, the results actually make sense. All of the related words tend to be used in the same context for the given query word.

Now you could even use Word2Vec to compute similarity between two words in the vocabulary by invoking the similarity(...) function and passing in the relevant words.

Under the hood, the above three snippets compute the cosine similarity between the two specified words using the word vectors of each. From the scores above, it makes sense that dirty is highly similar to smelly but dirty is dissimilar to clean. If you compute the similarity between two identical words, the score will be 1.0, the maximum of the cosine similarity range of [-1.0, 1.0]. You can read more about cosine similarity scoring here.

You will find more examples of how you could use Word2Vec in my Jupyter Notebook.

A closer look at the parameter settings

To train the model earlier, we had to set some parameters. Now, let’s try to understand what some of them mean. For reference, this is the command that we used to train the model.

model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)

size

The size of the dense vector that is to represent each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, it’s good to experiment with various sizes. A value of 100–150 has worked well for me for similarity lookups.

window

The maximum distance between the target word and its neighboring word. If a neighbor’s position is greater than the maximum window width to the left or the right, then it is not considered as being related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as it’s not overly narrow or overly broad. If you are not too sure about this, just use the default value.

min_count

Minimum frequency count of words. The model will ignore words that do not satisfy the min_count. Extremely infrequent words are usually unimportant, so it’s best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.

workers

The number of threads to use behind the scenes when training the model.

When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine if you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You have a lexicon for not just sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million StackOverflow questions and answers, you could find related tags and recommend those for exploration. You can do this by treating each set of co-occurring tags as a “sentence” and train a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.

Source code

To use this tutorial’s Jupyter Notebook, you can go to my GitHub repo and follow the instructions on how to get the notebook running locally. I plan to upload the pre-trained vectors which could be used for your own work.

To follow Kavita’s articles via email, please subscribe to her blog. This article was originally published at kavita-ganesan.com.

Source: https://www.freecodecamp.org/news/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3/
