Word Embedding / Network Embedding

Deep Learning, Natural Language Processing

Word embedding is a method to capture the "meaning" of a word with a low-dimensional vector, and it can be used in a variety of tasks in Natural Language Processing (NLP).


Before beginning the word embedding tutorial, we should have an understanding of vector spaces and similarity metrics.


Vector Space

A sequence of numbers used to identify a point in space is called a vector, and if we have a whole bunch of vectors that all belong to the same dataset, they are called a vector space.


Words in a text can also be represented in higher dimensions in a vector space, where words having the same meaning will have similar representations. For example,


Image by Allison Parrish, from GitHub

The above image shows a vector representation of words on scales of cuteness and size of animals. We can see that there is a semantic relationship between words based on their shared properties. It is difficult to visualize higher-dimensional relationships between words, but the math behind them is the same, so it works similarly in higher dimensions as well.


Similarity Metrics

A similarity metric is used to calculate the distance between vectors in a vector space; it measures the similarity or distance between two data points. This allows us to capture words that are used in similar ways, so that they end up with similar representations, naturally capturing their meaning. There are many similarity metrics available, but we will discuss Euclidean distance and cosine similarity.


Euclidean Distance

One way to calculate how far apart two data points are in a vector space is to compute the Euclidean distance.


import math

def distance2d(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

So, the distance between “capybara” (70, 30) and “panda” (74, 40) from the above image example:

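With the distance2d function above and the coordinates read off the figure, this works out to roughly (the numeric value below is just the arithmetic carried out):

distance2d(70, 30, 74, 40)
# sqrt((70 - 74)**2 + (30 - 40)**2) = sqrt(116) ≈ 10.77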

… is less than the distance between “tarantula” and “elephant” from the above image example:


This shows that "panda" and "capybara" are more similar to each other than "tarantula" and "elephant" are.


Cosine Similarity

It is a measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them.


from numpy import dot
from numpy.linalg import norm

cos_sim = dot(a, b) / (norm(a) * norm(b))
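As a quick illustration (not from the original article), applying this to the two animal vectors from the figure:

import numpy as np
from numpy import dot
from numpy.linalg import norm

a = np.array([70, 30])  # capybara: (cuteness, size)
b = np.array([74, 40])  # panda
cos_sim = dot(a, b) / (norm(a) * norm(b))
print(cos_sim)  # ≈ 0.996 -- the two vectors point in almost the same direction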

Now the question is: what are word embeddings and why do we use them?

In simple words, they are vector representations of words in sentences, documents, etc.


Word embedding is a learned representation of words in the form of numeric vectors. It learns a densely distributed representation for a predefined fixed-size vocabulary from a corpus of text. The word embedding representation is able to reveal many hidden relationships between words. For example, vector("king") − vector("lords") is similar to vector("queen") − vector("princess").


It is an improvement over traditional methods of representing words, such as the bag-of-words model, which produces large sparse vectors that are computationally impractical for representing an entire vocabulary. These representations were sparse because of the vast vocabularies involved: a given word or document would be represented by a large vector composed mostly of zero values.


Two popular methods of learning word embeddings from the text include:


1. Word2Vec.


2. GloVe.


There are pre-trained models that were trained over a large corpus of text. We can use them for our use case.


In addition to these methods, a word embedding can be learned using a deep learning model. This can be a slower approach, but we can design it for our own use case: the model will be trained on a specific training dataset per our own requirements. Keras provides a very easy and flexible Embedding layer that can be used for neural networks on text data.



Importing Modules

Let's get started with importing our modules and dataset and checking the dataset's head. I took the dataset from Kaggle: IMDB Movie Review - NLP.


import pandas as pd
import numpy as np
from numpy import array
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
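A minimal loading step might look like the following (the CSV file name is an assumption about the Kaggle download; adjust it to the actual file):

df = pd.read_csv('IMDB Dataset.csv')  # file name is an assumption
print(df.head())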

We’ll use Scikit-learn to divide our dataset into a training set and test set. We’ll train the word embedding on 70% of the data and test it on 30%.

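A sketch of that split (the column names 'review' and 'sentiment', the 0/1 label conversion, and the random seed are assumptions, since the article does not show this code):

from sklearn.model_selection import train_test_split

X = df['review'].values                                 # assumed column name
y = (df['sentiment'] == 'positive').astype(int).values  # assumed string labels -> 0/1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)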

INTEGER ENCODING ALL THE DOCUMENTS

After this, every unique word will be represented by an integer. For this, we use the one_hot function available in Keras. Note that vocab_size is specified as the total number of unique words so as to ensure a unique integer encoding for each and every word.


Note one important thing: the integer encoding for a word remains the same across different texts, e.g. 'year' is denoted by 23518 in each and every document.

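A sketch of the encoding step (variable names are hypothetical, and building vocab_size from the unique words of the raw text is an assumption about how the article computed it):

# Total number of unique words across the reviews (an assumption about how
# vocab_size was obtained).
all_words = set(word for review in list(X_train) + list(X_test) for word in review.split())
vocab_size = len(all_words)

# one_hot hashes every word to an integer no larger than vocab_size; the same
# word always maps to the same integer, which is why 'year' stays 23518 everywhere.
encoded_train = [one_hot(review, vocab_size) for review in X_train]
encoded_test = [one_hot(review, vocab_size) for review in X_test]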

Let's now have a look at one of the reviews. We'll compare this sentence with its transformation as we move through the next steps.


I really didn't like this movie because it didn't really bring across the messages and ideas L'Engle brought out in her novel. We had read the novel in our English class and i absolutely loved it, i'm afraid i can't say the same for the film. There were some serious differences between the novel and the adapted version and it just didn't do any credit to the imaginative genius that is Madeleine L'Engle! This is the reason i gave it such a poor rating. Don't see this movie if you are a big fan of L'Engle's texts because you will be sorely disappointed. However, if you are watching the movie for entertainment purposes (or educational as was my case) then it is an alright movie!

This review will be converted into an integer representation where each number represents a unique word.


[24608, 32542, 30289, 58025, 50966, 19624, 43296, 35850, 30289, 32542, 31519, 11569, 30465, 7968, 12928, 34105, 8750, 49668, 38039, 40264, 3503, 45016, 63074, 41404, 53275, 30465, 45016, 40264, 28666, 47101, 44909, 12928, 24608, 62202, 46727, 35850, 24425, 5515, 24608, 25601, 35725, 30465, 10577, 55918, 30465, 13875, 62286, 22967, 5067, 9001, 33291, 1247, 30465, 45016, 12928, 30465, 23555, 44142, 12928, 35850, 41976, 30289, 20229, 15687, 7845, 50705, 30465, 58301, 14031, 11556, 1495, 26143, 8750, 50966, 1495, 30465, 63056, 24608, 39847, 35850, 30936, 54227, 33469, 55622, 8193, 3111, 50966, 19624, 9403, 51670, 40033, 54227, 42254, 52367, 44935, 63226, 17625, 43296, 51670, 65642, 30053, 42863, 34757, 32894, 9403, 51670, 40033, 1112, 30465, 19624, 55918, 55169, 57666, 10193, 50176, 59413, 10480, 63135, 56156, 64520, 35850, 1495, 49938, 59074, 19624]

Padding the Text (to make all the texts the same length)

The Keras Embedding layer requires all individual documents to be of the same length. Hence we will pad the shorter documents with 0s for now. Therefore, in the Keras Embedding layer, 'input_length' will be equal to the length (i.e. the number of words) of the longest document.


To pad the shorter documents, I am using the pad_sequences function from the Keras library.

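Before padding, we first find the longest encoded document (a sketch using the hypothetical encoded_train / encoded_test lists from the previous step):

maxlen = max(len(review) for review in encoded_train + encoded_test)
print('The maximum number of words in any document is : ', maxlen)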

The maximum number of words in any document is :  1719

Here, we found that the maximum number of words a sentence holds is 1719, so we will pad according to that. In padding, we add zeros (0) to any sentence shorter than max_length; for shorter sentences, the 0s are added at the beginning of the sentence.

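The padding call itself might look like this (padding='pre' matches the zeros-at-the-beginning behaviour described above and is also the Keras default; variable names are hypothetical):

padded_train = pad_sequences(encoded_train, maxlen=maxlen, padding='pre')
padded_test = pad_sequences(encoded_test, maxlen=maxlen, padding='pre')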

For example:


array([    0,     0,     0, ..., 32875, 18129, 60728])

WE WILL BE CREATING THE EMBEDDINGS USING THE KERAS EMBEDDING LAYER

Now all the texts are of the same length (after padding), so we are ready to create and use the embedding layer.


PARAMETERS OF THE EMBEDDING LAYER


'input_dim' = the vocab size that we choose; it is the number of unique words in the vocabulary.


'output_dim' = the number of dimensions we wish to embed into. Each word will be represented by a vector of this same dimensionality.


'input_length' = the length of the longest text, which is stored in the maxlen variable in this example; a model sketch using these parameters follows below.

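A minimal sketch of the model these parameters describe (the 8-dimensional embedding, the Flatten layer, and the single sigmoid output unit are read off the summary below; the optimizer and loss are assumptions):

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
print(model.summary())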

Model: "sequential_1"_________________________________________________________________Layer (type)                 Output Shape              Param #   =================================================================embedding_1 (Embedding)      (None, 1719, 8)           527680    _________________________________________________________________flatten_1 (Flatten)          (None, 13752)             0         _________________________________________________________________dense_1 (Dense)              (None, 1)                 13753     =================================================================Total params: 541,433Trainable params: 541,433Non-trainable params: 0_________________________________________________________________None

Let’s now check the model accuracy on our training set.

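A sketch of the training and evaluation step (the number of epochs and the variable names are assumptions):

model.fit(padded_train, y_train, epochs=10, verbose=0)  # epochs is an assumption
loss, accuracy = model.evaluate(padded_train, y_train)
print('Training Accuracy is', accuracy * 100)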

6000/6000 [==============================] - 1s 170us/step
Training Accuracy is 100.0

The next step is to check its accuracy on the test set.

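And the same evaluation on the held-out test reviews (again with the hypothetical variable names from above):

loss, accuracy = model.evaluate(padded_test, y_test)
print('Testing Accuracy is', accuracy * 100)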

4000/4000 [==============================] - 1s 179us/step
Testing Accuracy is 86.57500147819519

We get a training accuracy of 100% because the embedding was trained on that data; the test data contains some words the model has never seen, so the accuracy there is a bit lower.


In practice, I would recommend performing word embedding with a fixed pre-trained embedding and learning on top of it. That should improve performance on the test data.


What's Next

Now we have learned how to represent words in the form of continuous numbers. Compared to other forms of text representation, such as bag-of-words or TF-IDF (term frequency-inverse document frequency), word embeddings give much better semantic relationships between words and can significantly improve the performance of natural language processing (NLP) tasks.


Now, I would suggest you try word embedding yourself on your own NLP task, and you will find a significant improvement in performance. You can also experiment with implementing word embeddings on the same dataset by using a pre-trained word embedding such as Word2Vec as a fixed layer and performing learning on top of it.


Most often, you will notice that pre-trained models have higher accuracy on the test set; the reason is that they have already been trained on a large variety of NLP datasets. But if you have enough data and want to perform a specific task, then training your own word embedding will be the better choice.


The code for word embedding is available on GitHub.


Thanks for reading. I hope this helps you understand word embedding and its importance in natural language processing (NLP).


Follow me on Medium. As always, I welcome feedback and constructive criticism and can be reached on LinkedIn.


Translated from: https://medium.com/towards-artificial-intelligence/introduction-to-word-embedding-5ba5cf97d296
