Word co-occurrence matrix:
Count how often words co-occur within a window of a pre-specified size (window_size), and use the counts of the words co-occurring around each word as that word's vector.
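
As a concrete illustration, here is a minimal sketch of the counting, using a hypothetical one-sentence toy document and window_size=1 (the assignment's full implementation appears below):

import numpy as np

toy_doc = ['<START>', 'all', 'that', 'glitters', 'is', 'not', 'gold', '<END>']
vocab = sorted(set(toy_doc))
word2ind = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
window_size = 1
for i, w in enumerate(toy_doc):
    # Count every word within `window_size` positions of the center word.
    for j in range(max(0, i - window_size), min(len(toy_doc), i + window_size + 1)):
        if j != i:
            M[word2ind[w], word2ind[toy_doc[j]]] += 1

print(M[word2ind['glitters'], word2ind['that']])  # 1.0: "that" appears once next to "glitters"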

SVD (Singular Value Decomposition):
The discrete word vectors obtained from the co-occurrence matrix are high-dimensional and sparse. We can reduce the dimensionality of the raw vectors to obtain dense, continuous word vectors.
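
A minimal sketch of the idea, assuming the toy matrix M from the sketch above: keep only the top-k singular components, so each word's row becomes a dense k-dimensional vector. (The scikit-learn TruncatedSVD used later computes the same U * S product without forming the full decomposition.)

U, S, Vt = np.linalg.svd(M, full_matrices=False)  # full SVD of the toy matrix
k = 2
M_reduced = U[:, :k] * S[:k]  # rows are dense k-dimensional word vectors (U * S)
print(M_reduced.shape)        # (len(vocab), 2)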

References:
https://blog.csdn.net/m0_37565948/article/details/84989565
https://blog.csdn.net/m0_37565948/article/details/84990043

The assignment's sanity-check code is not reproduced here; small toy checks are sketched after each function below instead.

# All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------
import sys
assert sys.version_info[0] == 3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)
# ----------------

def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)

Sample output: the first three processed documents, each a list of lowercased tokens wrapped in <START> and <END> markers.

Finding the distinct words:

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1

    # ------------------
    # Write your implementation here.
    corpus_words = sorted(list(set(word for doc in corpus for word in doc)))
    # Alternative:
    # corpus_words = {word for doc in corpus for word in doc}
    # corpus_words = sorted(list(corpus_words))
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words
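
A quick check on a hypothetical two-document toy corpus (this is not the assignment's autograder test):

test_corpus = [['<START>', 'all', 'that', 'glitters', '<END>'],
               ['<START>', 'all', 'is', 'not', 'gold', '<END>']]
test_words, test_num_words = distinct_words(test_corpus)
print(test_words)      # ['<END>', '<START>', 'all', 'glitters', 'gold', 'is', 'not', 'that']
print(test_num_words)  # 8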

Compute the word co-occurrence matrix; the default window size is 4.

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.

              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".

        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2ind = {}

    # ------------------
    # Write your implementation here.
    M = np.zeros([num_words, num_words])
    for i, word in enumerate(words):
        word2ind[word] = i
    for doc in corpus:
        for cur_idx, word in enumerate(doc):
            # Look at every offset in [-window_size, window_size] around the center word.
            for window_idx in range(-window_size, window_size + 1):
                neighbor_idx = cur_idx + window_idx
                # Skip positions outside the document and the center word itself.
                if neighbor_idx < 0 or neighbor_idx >= len(doc) or neighbor_idx == cur_idx:
                    continue
                co_occur_word = doc[neighbor_idx]
                word_idx, co_occur_idx = word2ind[word], word2ind[co_occur_word]
                M[word_idx][co_occur_idx] += 1
    # ------------------

    return M, word2ind
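
Continuing with the toy corpus above, a small check with window_size=1: in the first document, "glitters" sits next to "that" and "<END>", so its row should hold a 1 in each of those columns.

M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
print(M_test.shape)                                               # (8, 8)
print(M_test[word2ind_test['glitters'], word2ind_test['that']])   # 1.0
print(M_test[word2ind_test['glitters'], word2ind_test['<END>']])  # 1.0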

Dimensionality reduction:
Construct a method that reduces the dimensionality of the matrix to produce k-dimensional embeddings.

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced
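
Example usage on the toy matrix from the check above (TruncatedSVD requires n_components to be smaller than the matrix dimension, which holds for k=2 on the 8x8 toy matrix):

M_test_reduced = reduce_to_k_dim(M_test, k=2)
print(M_test_reduced.shape)  # (8, 2): one dense 2-d embedding per vocabulary word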

Plot a set of 2-D vectors in two-dimensional space:

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.

        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    # Write your implementation here.
    for word in words:
        x, y = M_reduced[word2ind[word]]
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
    # ------------------
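
A hypothetical quick usage on the toy embeddings (the real plot over the Reuters corpus follows in the main cell below):

plot_embeddings(M_test_reduced, word2ind_test, ['glitters', 'gold'])
plt.show()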

Main routine:

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length, so the plot
# reflects each vector's direction (cosine similarity) rather than raw counts.
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis]  # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']
plot_embeddings(M_normalized, word2ind_co_occurrence, words)

Output plot: the ten selected oil-related words in the 2-D embedding space.
