Accessing Text Corpora

Gutenberg Corpus

Method 1 (verbose)

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427

Method 2:

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

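The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words. The sents() function divides the text into sentences, where each sentence is a list of words.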
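
As a rough sketch of how the three views fit together, the function below computes average word length, average sentence length, and lexical diversity from them. The name corpus_stats and the toy inputs are assumptions made for illustration; with NLTK installed you would pass gutenberg.raw(fileid), gutenberg.words(fileid) and gutenberg.sents(fileid) instead.

```python
def corpus_stats(raw_text, words, sents):
    """Return (avg word length, avg sentence length, lexical diversity), rounded."""
    num_chars = len(raw_text)                       # characters, spaces included
    num_words = len(words)                          # tokens, punctuation included
    num_sents = len(sents)                          # sentences (lists of tokens)
    num_vocab = len(set(w.lower() for w in words))  # distinct lowercased tokens
    return (round(num_chars / num_words),
            round(num_words / num_sents),
            round(num_words / num_vocab))


# Toy stand-ins for the raw(), words() and sents() views of one file.
raw_text = "The cat sat. The dog ran."
words = ['The', 'cat', 'sat', '.', 'The', 'dog', 'ran', '.']
sents = [['The', 'cat', 'sat', '.'], ['The', 'dog', 'ran', '.']]

print(corpus_stats(raw_text, words, sents))
```

Because raw() counts spaces, the first figure slightly overstates the true average word length.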

Web and Chat Text

Informal language

>>> from nltk.corpus import webtext

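Brown Corpus

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.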

>>> from nltk.corpus import brown

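Reuters Corpus

The documents are classified into 90 topics and split into two sets, "training" and "test". Thus the document with fileid 'test/14826' belongs to the test set.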

>>> from nltk.corpus import reuters

Corpus methods accept either a single fileid or a list of fileids as an argument.

>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',
'test/15287', 'test/15341', 'test/15618', 'test/15618', 'test/15648', ...]
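
The "single item or list" convention can be imitated in a few lines of plain Python. This is only a toy model, not how NLTK implements it: FILE_CATEGORIES is a hypothetical two-entry table, and the single category given for 'training/9880' is an assumption consistent with the output above.

```python
# Hypothetical miniature of a categorized corpus: fileid -> categories.
FILE_CATEGORIES = {
    'training/9865': ['barley', 'corn', 'grain', 'wheat'],
    'training/9880': ['money-fx'],  # assumed for this sketch
}


def categories(fileids):
    """Return the sorted categories of one fileid or of a list of fileids."""
    if isinstance(fileids, str):   # accept a single fileid as well as a list
        fileids = [fileids]
    found = set()
    for fileid in fileids:
        found.update(FILE_CATEGORIES[fileid])
    return sorted(found)


print(categories('training/9865'))
print(categories(['training/9865', 'training/9880']))
```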

Inaugural Address Corpus

Each text is one president's address.

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

The year of each text appears in its filename. To get the year out of the filename, we extract the first four characters with fileid[:4].
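
The same slicing idea extends to the rest of the filename. The helper parse_fileid below is a hypothetical name, and it assumes every fileid has the form YYYY-Name.txt, as the three examples shown above do.

```python
def parse_fileid(fileid):
    """Split a fileid like '1789-Washington.txt' into (year, president)."""
    year = int(fileid[:4])   # first four characters are the year
    name = fileid[5:-4]      # skip the hyphen, drop the '.txt' suffix
    return year, name


fileids = ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']
print([parse_fileid(f) for f in fileids])
```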

Basic Corpus Functions Defined in NLTK

(More documentation can be found with help(nltk.corpus.reader), or by reading the online Corpus HOWTO at http://www.nltk.org/howto.)

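Example	Description
fileids()	The files of the corpus
fileids([categories])	The files of the corpus corresponding to these categories
categories()	The categories of the corpus
categories([fileids])	The categories of the corpus corresponding to these files
raw()	The raw content of the corpus
raw(fileids=[f1,f2,f3])	The raw content of the specified files
raw(categories=[c1,c2])	The raw content of the specified categories
words()	The words of the whole corpus
words(fileids=[f1,f2,f3])	The words of the specified files
words(categories=[c1,c2])	The words of the specified categories
sents()	The sentences of the whole corpus
sents(fileids=[f1,f2,f3])	The sentences of the specified files
sents(categories=[c1,c2])	The sentences of the specified categories
abspath(fileid)	The location of the given file on disk
encoding(fileid)	The encoding of the file (if known)
open(fileid)	Open a stream for reading the given corpus file
root()	The path to the root of the locally installed corpus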

Conditional Frequency Distributions

Counting Words by Genre

FreqDist() takes a simple list as input; ConditionalFreqDist() takes a list of pairs.

>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))

Let's look at just two genres, news and romance. For each genre ②, we loop over every word in the genre ③, producing pairs of genre and word ①.

>>> genre_word = [(genre, word) # ①
... for genre in ['news', 'romance'] # ②
... for word in brown.words(categories=genre)] # ③
>>> len(genre_word)
170576

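We use this list of pairs to create a ConditionalFreqDist and save it in a variable cfd. As usual, we can type the name of the variable to inspect it ① and verify that it has two conditions ②.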

>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd # ①
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions() # ②
['news', 'romance']

We can access the two conditions; each one is just a frequency distribution.

>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him',
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193
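
To see what the structure explored above amounts to, here is a minimal plain-Python imitation built on defaultdict and Counter. It is a sketch of the idea only, not NLTK's implementation; conditional_freq_dist is a hypothetical name and the pairs are toy data standing in for the Brown Corpus.

```python
from collections import Counter, defaultdict


def conditional_freq_dist(pairs):
    """Count samples separately for each condition in (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Toy (genre, word) pairs.
pairs = [
    ('news', 'the'), ('news', 'could'), ('news', 'the'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'she'),
]
cfd = conditional_freq_dist(pairs)

print(sorted(cfd))                  # the conditions
print(cfd['romance']['could'])      # count of 'could' under 'romance'
print(sum(cfd['news'].values()))    # total number of outcomes under 'news'
```

Each condition maps to its own counter, which is why cfd['romance']['could'] reads as "look up the condition, then the sample".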

Generating Random Text with Bigrams

The bigrams() function takes a list of words and builds a list of consecutive word pairs.

>>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
... 'and', 'the', 'earth', '.']
>>> list(nltk.bigrams(sent))
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'),
('created', 'the'), ('the', 'heaven'), ('heaven', 'and'), ('and', 'the'),
('the', 'earth'), ('earth', '.')]
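
The heading promises text generation, so here is a hedged sketch of one simple way to do it without NLTK: build a table of which words follow which, then repeatedly emit the most frequent successor. The names bigrams, build_model and generate are local to this sketch, the word list is toy data, and always taking the most likely next word makes the output deterministic rather than truly random.

```python
from collections import Counter, defaultdict


def bigrams(words):
    """Return the list of consecutive word pairs."""
    return list(zip(words, words[1:]))


def build_model(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second in bigrams(words):
        model[first][second] += 1
    return model


def generate(model, word, num=15):
    """Emit up to num words, always moving to the most frequent successor."""
    output = []
    for _ in range(num):
        output.append(word)
        if not model.get(word):   # no known successor: stop early
            break
        word = model[word].most_common(1)[0][0]
    return output


words = 'the cat sat on the mat the cat ran'.split()
model = build_model(words)
print(bigrams(words)[:3])
print(generate(model, 'on', num=3))
```

With a large corpus this approach tends to get stuck in loops, because the most likely successor of a word is always the same.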

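Conditional frequency distributions in NLTK: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counts

Example	Description
cfdist = ConditionalFreqDist(pairs)	Create a conditional frequency distribution from a list of pairs
cfdist.conditions()	The list of conditions (sorted alphabetically in NLTK 2; in insertion order in NLTK 3)
cfdist[condition]	The frequency distribution for this condition
cfdist[condition][sample]	The frequency of the given sample for this condition
cfdist.tabulate()	Tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)	Tabulation limited to the specified samples and conditions
cfdist.plot()	Plot the conditional frequency distribution
cfdist.plot(samples, conditions)	Plot limited to the specified samples and conditions
cfdist1 < cfdist2	Test whether samples in cfdist1 occur less frequently than in cfdist2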
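
To make the restricted tabulation concrete, the sketch below formats counts for chosen conditions and samples from a plain defaultdict of Counters. It imitates the idea behind tabulate() but is not NLTK's code; the function signature and the toy counts are assumptions for illustration.

```python
from collections import Counter, defaultdict


def tabulate(cfd, conditions, samples):
    """Return text rows: one row per condition, one column per sample."""
    width = max(len(s) for s in samples) + 1
    label_width = max(len(c) for c in conditions)
    rows = [' ' * label_width + ''.join(s.rjust(width) for s in samples)]
    for condition in conditions:
        counts = ''.join(str(cfd[condition][s]).rjust(width) for s in samples)
        rows.append(condition.ljust(label_width) + counts)
    return rows


# Toy counts of modal verbs in two genres.
cfd = defaultdict(Counter)
cfd['news'].update(['can', 'can', 'will'])
cfd['romance'].update(['could', 'can'])

for row in tabulate(cfd, ['news', 'romance'], ['can', 'could', 'will']):
    print(row)
```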

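Lexical Resources

Wordlist Corpora

Filtering a text: this program computes the vocabulary of a text, then removes all items that occur in an existing wordlist, leaving just the uncommon or misspelled words.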

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorrence', 'abominably', 'abridgement', 'accordant', 'accustomary',
'adieus', 'affability', 'affectedly', 'aggrandizement', 'alighted', 'allenham',
'amiably', 'annamaria', 'annuities', 'apologising', 'arbour', 'archness', ...]
>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abou', 'abourted', 'abs', 'ack', 'acros',
'actualy', 'adduser', 'addy', 'adoted', 'adreniline', 'ae', 'afe', 'affari', 'afk',
'agaibn', 'agurlwithbigguns', 'ahah', 'ahahah', 'ahahh', 'ahahha', 'ahem', 'ahh', ...]
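
The filter itself is just a set difference, which is easy to check on a small scale. In this variant the wordlist is passed in as a second parameter (an assumption made so the sketch runs without NLTK data), and both the text and the wordlist are toy examples.

```python
def unusual_words(text, known_words):
    """Return the sorted alphabetic tokens of text that are not in known_words."""
    text_vocab = set(w.lower() for w in text if w.isalpha())
    known_vocab = set(w.lower() for w in known_words)
    return sorted(text_vocab - known_vocab)


text = ['She', 'was', 'actualy', 'very', 'amiable', ',', 'afk', '2day']
wordlist = ['she', 'was', 'actually', 'very', 'Amiable']

print(unusual_words(text, wordlist))
```

Tokens such as ',' and '2day' never reach the comparison because isalpha() rejects them first.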

Stopwords corpus: high-frequency words such as the and to

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]
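
A typical use of a stopword list is to measure how much of a text is left once those words are filtered out. The sketch below assumes a hypothetical helper content_fraction and a four-word toy list; with NLTK you would pass stopwords.words('english') instead.

```python
def content_fraction(text, stopword_list):
    """Return the fraction of tokens in text that are not stopwords."""
    stopword_set = set(stopword_list)   # set membership tests are fast
    content = [w for w in text if w.lower() not in stopword_set]
    return len(content) / len(text)


toy_stopwords = ['the', 'to', 'a', 'and']
text = ['The', 'dog', 'ran', 'to', 'the', 'park']

print(content_fraction(text, toy_stopwords))
```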

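More Python: Reusing Code

Functions

A function is defined with the keyword def, followed by the function name and any input parameters, and then the body of the function.

A Python function: this function tries to generate the plural form of any English noun.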

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'
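
The word "tries" matters here: the rules are heuristics, and a few extra calls show where they break down. The function is repeated so the snippet runs on its own; the test words are chosen for this sketch.

```python
def plural(word):
    """Guess the plural of an English noun with four simple suffix rules."""
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'


# Cases the rules handle well.
print(plural('wish'), plural('box'), plural('corpus'))

# Cases where the rules over-generalize.
print(plural('fan'))   # the 'an' -> 'en' rule fires for a regular noun
print(plural('boy'))   # the 'y' -> 'ies' rule ignores the preceding vowel
```

Note that 'corpus' comes out as 'corpuses', which is acceptable English but not the form 'corpora' used throughout these notes.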

Modules

How to import from another module

from module_name import method_name
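
To try this end to end without touching your own files, the sketch below writes a tiny module into a temporary directory and then imports it. The module name textproc_demo and its one-line plural function are made up for the demonstration; normally you would create the .py file in a text editor and simply write from textproc_demo import plural.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a small module to a temporary directory (a stand-in for a file
# saved from a text editor).
module_dir = Path(tempfile.mkdtemp())
module_source = (
    "def plural(word):\n"
    "    # Naive rule for the demo: just append 's'.\n"
    "    return word + 's'\n"
)
(module_dir / "textproc_demo.py").write_text(module_source)

# Make the directory importable, then import the module by name.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
textproc_demo = importlib.import_module("textproc_demo")

plural = textproc_demo.plural   # same effect as: from textproc_demo import plural
print(plural('wagon'))
```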

WordNet

Senses and Synonyms

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]

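Thus, motorcar has just one possible meaning, and it is identified as car.n.01, the first noun sense of car.

The entity car.n.01 is called a synset, or "synonym set": a collection of words (or "lemmas") that share the same meaning.

>>> wn.synset('car.n.01').lemma_names()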
['car', 'auto', 'automobile', 'machine', 'motorcar']

Synsets also come with a general definition and some example sentences:

>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').examples()
['he needs a car to get to work']

Lemma: a pairing of a synset with a word (e.g. car.n.01.automobile, car.n.01.motorcar)

>>> wn.synset('car.n.01').lemmas() # ① get all the lemmas of a given synset
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'),
Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
>>> wn.lemma('car.n.01.automobile') # ② look up a particular lemma
Lemma('car.n.01.automobile')
>>> wn.lemma('car.n.01.automobile').synset() # ③ get the synset corresponding to a lemma
Synset('car.n.01')
>>> wn.lemma('car.n.01.automobile').name() # ④ get the "name" of a lemma
'automobile'

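Note: if the interpreter prints "bound method" instead of a value, this is probably a version difference. In NLTK 3 these accessors are methods, so add parentheses after the name, e.g. lemma_names().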

The WordNet Hierarchy

Hyponyms: the more immediate, more specific concepts

>>> motorcar = wn.synset('car.n.01')
>>> types_of_motorcar = motorcar.hyponyms()
>>> types_of_motorcar[26]
Synset('ambulance.n.01')
>>> sorted([lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()])
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon',
'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible',
'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car',
'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap',
'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover',
'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car',
'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer',
'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan',
'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car',
'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car',
'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon',
'wagon']

We can also navigate the hierarchy by visiting hypernyms.

>>> motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
>>> paths = motorcar.hypernym_paths()
>>> len(paths)
2
>>> [synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
>>> [synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',
'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01',
'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

The most general hypernym (or root hypernym) of a synset can be obtained as follows:

>>> motorcar.root_hypernyms()
[Synset('entity.n.01')]
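
Why are there two paths? In the output above, wheeled_vehicle.n.01 is reached both through container.n.01 and through vehicle.n.01. The toy model below copies just those links into a dictionary (synset names shortened, everything else in WordNet omitted) and walks them recursively. It is a sketch of the idea, not WordNet's implementation, and the function names only mirror the NLTK methods.

```python
# Toy hypernym table: each concept maps to its direct hypernyms.
HYPERNYMS = {
    'car': ['motor_vehicle'],
    'motor_vehicle': ['self-propelled_vehicle'],
    'self-propelled_vehicle': ['wheeled_vehicle'],
    'wheeled_vehicle': ['container', 'vehicle'],   # two parents: paths fork here
    'container': ['instrumentality'],
    'vehicle': ['conveyance'],
    'conveyance': ['instrumentality'],
    'instrumentality': ['artifact'],
    'artifact': ['whole'],
    'whole': ['object'],
    'object': ['physical_entity'],
    'physical_entity': ['entity'],
    'entity': [],                                  # the root has no hypernym
}


def hypernym_paths(concept):
    """Return every path from a root down to concept, as lists of names."""
    parents = HYPERNYMS[concept]
    if not parents:
        return [[concept]]
    return [path + [concept]
            for parent in parents
            for path in hypernym_paths(parent)]


def root_hypernyms(concept):
    """Return the distinct roots reachable from concept."""
    return sorted({path[0] for path in hypernym_paths(concept)})


paths = hypernym_paths('car')
print(len(paths))
print([len(path) for path in paths])
print(root_hypernyms('car'))
```

The two path lengths differ by one because the route through vehicle passes through the extra conveyance node.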

Use dir() to see the lexical relations and the other methods defined on a synset. For example, try dir(wn.synset('harmony.n.02')).

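Summary

A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g. the Brown Corpus, nltk.corpus.brown.
Some text corpora are categorized, e.g. by genre or topic; sometimes the categories of a corpus overlap each other.
A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
Python functions let you associate a name with a particular block of code and reuse that code as often as necessary.
Some functions, known as "methods", are associated with an object; we call them by giving the object name, followed by a period, followed by the method name, as in x.funct(y) or word.isalpha().
To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
WordNet is a semantically oriented dictionary of English, consisting of synonym sets (synsets) and organized into a network.
Some functions are not available by default, but must be accessed using Python's import statement.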
