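Natural Language Processing (NLP) | Getting Started with NLTK and Processing English Corpora

NLTK (Natural Language Toolkit) is a natural language processing toolkit that makes it easy to carry out a range of tasks, including tokenization, part-of-speech tagging, named entity recognition, and syntactic parsing.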

Installation

$ pip install nltk
$ python
>>> import nltk
>>> nltk.download()

Check that the installation succeeded:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Common NLTK operations (English corpora)

Splitting text into sentences

import nltk
text = "Don't hesitate to ask questions. Be positive."
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))
# Out: ["Don't hesitate to ask questions.", 'Be positive.']

Splitting text into sentences (for large batches of sentences, or for a specific language)

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize(text))
# Out: ["Don't hesitate to ask questions.", 'Be positive.']

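Tokenization method 1: TreebankWordTokenizer follows the conventions of the Penn Treebank corpus and tokenizes by splitting contractions apart. nltk.word_tokenize applies this style of tokenization.

words = nltk.word_tokenize(text)
print(words)
# Out: ['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '.', 'Be', 'positive', '.']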

Tokenization method 2: WordPunctTokenizer tokenizes by splitting on punctuation. Every punctuation mark becomes a separate token and every word is kept.

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
words = tokenizer.tokenize(text)
print(words)
# Out: ['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions', '.', 'Be', 'positive', '.']

Tokenization method 3 (others): RegexpTokenizer, WhitespaceTokenizer, BlanklineTokenizer, and so on.
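
The three tokenizers named above can be tried side by side. The following is a minimal sketch, not a recommendation of any particular pattern: the regular expression `[\w']+` and the two-paragraph sample string are assumptions chosen for illustration. None of these classes needs downloaded NLTK data.

```python
from nltk.tokenize import BlanklineTokenizer, RegexpTokenizer, WhitespaceTokenizer

text = "Don't hesitate to ask questions. Be positive."

# RegexpTokenizer: keep runs of word characters and apostrophes, drop the rest.
regexp_tokens = RegexpTokenizer(r"[\w']+").tokenize(text)
print(regexp_tokens)
# ["Don't", 'hesitate', 'to', 'ask', 'questions', 'Be', 'positive']

# WhitespaceTokenizer: split on whitespace only, so punctuation stays attached.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print(whitespace_tokens)
# ["Don't", 'hesitate', 'to', 'ask', 'questions.', 'Be', 'positive.']

# BlanklineTokenizer: treat blank lines as separators, which yields paragraphs.
paragraphs = BlanklineTokenizer().tokenize("First paragraph.\n\nSecond paragraph.")
print(paragraphs)
# ['First paragraph.', 'Second paragraph.']
```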
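
Frequency distributions: nltk.probability.FreqDist

fdist = FreqDist(samples)   # create a frequency distribution from the given samples (a list of words)
fdist[sample] += 1          # increment the count for a sample (NLTK 3 removed fdist.inc(sample))
fdist['monstrous']          # number of times the given sample occurred
fdist.freq('monstrous')     # relative frequency of the given sample
fdist.N()                   # total number of samples
fdist.most_common(n)        # the n most frequent samples with their counts, in decreasing order (in NLTK 3, fdist.keys() is no longer sorted)
for sample in fdist:        # iterate over the samples (not sorted by frequency in NLTK 3)
fdist.max()                 # sample with the greatest count
fdist.tabulate()            # print the frequency distribution as a table
fdist.plot()                # plot the frequency distribution
fdist.plot(cumulative=True) # plot the cumulative frequency distribution
fdist1 < fdist2             # test whether samples in fdist1 occur less frequently than in fdist2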
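
To make the listing concrete, here is a small sketch that runs the counting calls on a toy word list. The list itself is an assumption made up for illustration; the calls are the NLTK 3 forms, where FreqDist behaves like `collections.Counter`.

```python
from nltk.probability import FreqDist

# Hypothetical toy sample: 8 tokens, "the" occurs 3 times.
words = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]
fdist = FreqDist(words)

count_the = fdist["the"]       # absolute count
freq_the = fdist.freq("the")   # relative frequency: count / total
total = fdist.N()              # total number of samples
top = fdist.most_common(1)     # most frequent sample with its count
most_frequent = fdist.max()    # sample with the greatest count

print(count_the, freq_the, total)  # 3 0.375 8
print(top, most_frequent)          # [('the', 3)] the

# Incrementing a count replaces the old fdist.inc(sample) call.
fdist["cat"] += 1
total_after = fdist.N()
print(fdist["cat"], total_after)   # 2 9
```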

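Conditional frequency distributions: nltk.probability.ConditionalFreqDist

cfdist = ConditionalFreqDist(pairs)                         # create a conditional frequency distribution from a list of (condition, sample) pairs
cfdist.conditions()                                         # list the conditions
cfdist[condition]                                           # the frequency distribution for this condition
cfdist[condition][sample]                                   # count of the given sample under this condition
cfdist.tabulate()                                           # tabulate the conditional frequency distribution
cfdist.tabulate(samples=samples, conditions=conditions)     # tabulation limited to the given samples and conditions
cfdist.plot()                                               # plot the conditional frequency distribution
cfdist.plot(samples=samples, conditions=conditions)         # plot limited to the given samples and conditions
cfdist1 < cfdist2                                           # test whether samples in cfdist1 occur less often than in cfdist2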
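
A short sketch of these calls on hand-written data follows. The (condition, sample) pairs are hypothetical, with a genre name as the condition and a word as the sample. `sorted()` is applied to the conditions because their order is not assumed to be alphabetical.

```python
from nltk.probability import ConditionalFreqDist

# Hypothetical (condition, sample) pairs: the genre is the condition.
pairs = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "heart"), ("romance", "heart"),
]
cfdist = ConditionalFreqDist(pairs)

conditions = sorted(cfdist.conditions())
news_the = cfdist["news"]["the"]        # count of "the" under "news"
romance_top = cfdist["romance"].max()   # most frequent sample under "romance"

print(conditions)    # ['news', 'romance']
print(news_the)      # 2
print(romance_top)   # heart

# Print a table restricted to chosen samples and conditions (keyword arguments).
cfdist.tabulate(samples=["the", "heart"], conditions=["news", "romance"])
```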

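The nltk.text.Text class supports basic statistics and analysis of a text

Text(words)                     # constructor; the argument is a list of words
concordance(word, width, lines) # show the contexts in which word occurs
common_contexts(words)          # show the contexts that the given words share
similar(word)                   # show words that appear in contexts similar to word
collocations(num, window_size)  # show the most common two-word collocations
count(word)                     # number of times word occurs
dispersion_plot(words)          # plot the positions at which the words occur in the document
vocab()                         # return the vocabulary of the text as a frequency distribution over its distinct words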
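
As a quick illustration of the Text methods, the sketch below wraps a single pre-tokenized sentence. The token list is an assumption (one sentence split on spaces by hand); real use would pass the output of a tokenizer over a whole document.

```python
from nltk.text import Text

# One hand-tokenized sentence stands in for a full document.
tokens = (
    "It is a truth universally acknowledged , that a single man in "
    "possession of a good fortune , must be in want of a wife ."
).split()
text = Text(tokens)

a_count = text.count("a")   # occurrences of the token "a"
vocab = text.vocab()        # FreqDist over the distinct tokens
print(len(text), a_count, len(vocab))  # 26 4 20

# Print each occurrence of "man" with some surrounding context.
text.concordance("man", width=40, lines=5)
```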

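Corpora bundled with nltk.corpus

gutenberg   # a small selection of texts from Project Gutenberg, an archive of about 36000 free e-books, mostly classic works
webtext     # web content such as online fiction, forums, and online advertisements
nps_chat    # a corpus of more than ten thousand chat posts, mainly instant messages
brown       # a million-word electronic corpus of English containing texts from 500 different sources, categorized by genre (news, editorial, and so on)
reuters     # the Reuters corpus: more than ten thousand news documents, over a million words, classified into 90 topics and split into a training set and a test set
inaugural   # a corpus of speeches: several dozen texts, all presidential inaugural addresses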

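Corpus operations

fileids()                       # list of file names in the corpus
fileids(categories=[c1,c2])     # file names belonging to the given categories
raw(fileids=[f1,f2])            # raw text of the given files, as a string
raw(categories=[c1,c2])         # raw text of the given categories
sents(fileids=[f1,f2])          # list of sentences in the given files
sents(categories=[c1,c2])       # list of sentences in the given categories
words(fileids=[f1,f2])          # list of words in the given files
words(categories=[c1,c2])       # list of words in the given categories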
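
The reader methods above can be exercised without downloading any corpus. The sketch below is an assumption-laden stand-in: it builds a tiny throwaway corpus in a temporary directory and reads it with CategorizedPlaintextCorpusReader, where the file names, their contents, and the "category is the file-name prefix" rule are all hypothetical. Bundled categorized corpora such as brown are queried with the same `fileids`, `raw`, and `words` calls.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical miniature corpus: the category is the file-name prefix before "_".
documents = {
    "news_a.txt": "Stocks rose today.",
    "news_b.txt": "Rain is expected.",
    "fiction_a.txt": "The dragon slept.",
}

with tempfile.TemporaryDirectory() as root:
    for name, content in documents.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(content)

    reader = CategorizedPlaintextCorpusReader(
        root, r".*\.txt", cat_pattern=r"([a-z]+)_.*\.txt"
    )

    all_files = reader.fileids()                         # every file in the corpus
    categories = reader.categories()                     # every category
    news_files = reader.fileids(categories=["news"])     # files in one category
    news_a_raw = reader.raw(fileids=["news_a.txt"])      # raw text of one file
    fiction_words = list(reader.words(categories=["fiction"]))  # words by category

print(all_files)
print(categories)
print(news_files)
print(news_a_raw)
print(fiction_words)
```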

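Stemming: stemming can be defined as the process of obtaining a word's stem by removing its affixes. Taking the word raining as an example, a stemmer removes the affix from raining and returns its root or stem, rain. To improve the accuracy of information retrieval, most search engines use stemming to obtain stems and store them as index terms.
- Method 1: use the PorterStemmer class in NLTK for stemming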
```
import nltk
from nltk.stem import PorterStemmer
stemmerporter = PorterStemmer()
stemmerporter.stem('happiness')
# Out: 'happi'
```
- Method 2: the LancasterStemmer class implements the Lancaster stemming algorithm in NLTK
```
import nltk
from nltk.stem import LancasterStemmer
stemmerlan=LancasterStemmer()
stemmerlan.stem('happiness')
# Out: 'happy'
```
- Method 3: in NLTK we can also build our own stemmer with the RegexpStemmer class. It takes a regular expression as a string and removes any prefix or suffix of a word that matches it
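
A minimal sketch of method 3 follows. The pattern `ing$|s$` and the `min=4` cutoff are assumptions picked for illustration; a real stemmer would need a more careful pattern.

```python
from nltk.stem import RegexpStemmer

# Strip a trailing "ing" or "s"; words shorter than 4 characters are left as-is.
stemmer = RegexpStemmer(r"ing$|s$", min=4)

stems = [stemmer.stem(word) for word in ["raining", "cars", "was", "rain"]]
print(stems)  # ['rain', 'car', 'was', 'rain']
```

The `min` argument is what keeps the short word "was" from being cut down to "wa".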
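
Part-of-speech tagging: POS tagging is the process of assigning a word-class label (noun, verb, adjective, and so on) to each token in a sentence. In NLTK, taggers live in the nltk.tag package and inherit from the TaggerI base class.

import nltk
text1 = nltk.word_tokenize("It is a pleasant day today")
nltk.pos_tag(text1)
# Out: [('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('pleasant', 'JJ'), ('day', 'NN'), ('today', 'NN')]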
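
To illustrate the TaggerI relationship, the sketch below builds a toy rule-based tagger with RegexpTagger, one of the tagger classes in nltk.tag. This is not the pretrained model behind nltk.pos_tag: the patterns and the sample sentence are assumptions, and the rules are far too crude for real text.

```python
from nltk.tag import RegexpTagger, TaggerI

# Hypothetical rules, tried in order; the last one is a catch-all.
patterns = [
    (r"^(a|an|the)$", "DT"),  # determiners
    (r"^is$", "VBZ"),         # third-person singular "is"
    (r".*ing$", "VBG"),       # words ending in "ing"
    (r".*", "NN"),            # everything else: noun
]
tagger = RegexpTagger(patterns)

tagged = tagger.tag("the dog is barking".split())
print(tagged)
# [('the', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('barking', 'VBG')]

# Every NLTK tagger implements the TaggerI interface.
print(isinstance(tagger, TaggerI))  # True
```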

Some text-processing operations

Removing punctuation (Chinese and English)

import re
import string

def filter_punctuation(words):
    new_words = []
    # ASCII punctuation plus common Chinese punctuation marks
    illegal_char = string.punctuation + '【·！…（）—：“”？《》、；】'
    pattern = re.compile('[%s]' % re.escape(illegal_char))
    for word in words:
        # Delete every punctuation character; drop tokens that become empty
        new_word = pattern.sub(u'', word)
        if not new_word == u'':
            new_words.append(new_word)
    return new_words

words_no_punc = filter_punctuation(words)
print(words_no_punc)
# Out: ['Don', 't', 'hesitate', 'to', 'ask', 'questions', 'Be', 'positive']

Converting text case

print(text.lower())
print(text.upper())
# Out:
# don't hesitate to ask questions. be positive.
# DON'T HESITATE TO ASK QUESTIONS. BE POSITIVE.

Handling stop words (English)

from nltk.corpus import stopwords
stops=set(stopwords.words('english'))
words = [word for word in words if word.lower() not in stops]
print(words)

Using NLTK in practice

Below, NLTK is used to process the article text_en.

# First read the file and get its words
def readWords():
    data = ""
    with open('data/text_en.txt', 'r', encoding='utf-8-sig') as f:
        data = f.read()
    words = nltk.word_tokenize(data)
    return words

(1) Tokenization and stemming

# Tokenization was already done when the file was read; stemming uses LancasterStemmer
def splitWordAndLancaster(words):
    stemmerlan = LancasterStemmer()
    wordsStem = [stemmerlan.stem(word) for word in words]
    return wordsStem
# The first 100 items of the output:
# ['the', 'project', 'gutenberg', 'ebook', 'of', 'prid', 'and', 'prejud', ',', 'by', 'jan', 'aust', 'chapt', '1', 'it', 'is', 'a', 'tru', 'univers', 'acknowledg', ',', 'that', 'a', 'singl', 'man', 'in', 'possess', 'of', 'a', 'good', 'fortun', ',', 'must', 'be', 'in', 'want', 'of', 'a', 'wif', '.', 'howev', 'littl', 'known', 'the', 'feel', 'or', 'view', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'ent', 'a', 'neighbo', ',', 'thi', 'tru', 'is', 'so', 'wel', 'fix', 'in', 'the', 'mind', 'of', 'the', 'surround', 'famy', ',', 'that', 'he', 'is', 'consid', 'the', 'right', 'property', 'of', 'som', 'on', 'or', 'oth', 'of', 'their', 'daught', '.', '``', 'my', 'dear', 'mr.', 'bennet', ',', "''", 'said', 'his', 'lady']

(2) Removing stop words

def handleStopWords(words):
    stops = set(stopwords.words('english'))
    words = [word for word in words if word.lower() not in stops]
    return words
# The first 100 items of the output:
# ['Project', 'Gutenberg', 'EBook', 'Pride', 'Prejudice', ',', 'Jane', 'Austen', 'Chapter', '1', 'truth', 'universally', 'acknowledged', ',', 'single', 'man', 'possession', 'good', 'fortune', ',', 'must', 'want', 'wife', '.', 'However', 'little', 'known', 'feelings', 'views', 'man', 'may', 'first', 'entering', 'neighbourhood', ',', 'truth', 'well', 'fixed', 'minds', 'surrounding', 'families', ',', 'considered', 'rightful', 'property', 'one', 'daughters', '.', '``', 'dear', 'Mr.', 'Bennet', ',', "''", 'said', 'lady', 'one', 'day', ',', '``', 'heard', 'Netherfield', 'Park', 'let', 'last', '?', "''", 'Mr.', 'Bennet', 'replied', '.', '``', ',', "''", 'returned', ';', '``', 'Mrs.', 'Long', ',', 'told', '.', "''", 'Mr.', 'Bennet', 'made', 'answer', '.', '``', 'want', 'know', 'taken', '?', "''", 'cried', 'wife', 'impatiently', '.', '``', 'want']

(3) Filtering punctuation

# Chinese and English punctuation marks
def filterPunctuation(words):
    new_words = []
    # The list must not contain '\s': after re.escape it would add a literal
    # backslash and the letter "s" to the character class, stripping every "s"
    illegal_char = string.punctuation + u'.,;《》？！“”‘’@#￥%…&×（）——+【】{};；●，。&～、|:：'
    pattern = re.compile('[%s]' % re.escape(illegal_char))
    for word in words:
        new_word = pattern.sub(u'', word)
        if not new_word == u'':
            new_words.append(new_word)
    return new_words
# The first 100 items of the output:
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Pride', 'and', 'Prejudice', 'by', 'Jane', 'Austen', 'Chapter', '1', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', 'that', 'he', 'is', 'considered', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', 'My', 'dear', 'Mr', 'Bennet', 'said', 'his', 'lady', 'to', 'him', 'one', 'day', 'have', 'you', 'heard', 'that', 'Netherfield', 'Park']

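(4) Filtering low-frequency words (n <= threshold)

from nltk.probability import FreqDist

def filterLowFrequency(words):
    threshold = 20
    new_words = []
    fdist = FreqDist(words)
    # Keep each distinct word that occurs more than threshold times
    for word in fdist:
        if fdist[word] > threshold:
            new_words.append(word)
    return new_words
# The first 100 items of the output: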
# ['The', 'of', 'and', ',', 'by', 'Jane', 'Chapter', 'It', 'is', 'a', 'truth', 'that', 'man', 'in', 'good', 'fortune', 'must', 'be', 'want', 'wife', '.', 'little', 'known', 'the', 'feelings', 'or', 'such', 'may', 'on', 'his', 'first', 'neighbourhood', 'this', 'so', 'well', 'fixed', 'he', 'considered', 'some', 'one', 'other', 'their', 'daughters', '``', 'My', 'dear', 'Mr.', 'Bennet', "''", 'said', 'lady', 'to', 'him', 'day', 'have', 'you', 'heard', 'Netherfield', 'let', 'at', 'last', '?', 'replied', 'had', 'not', 'But', 'it', 'returned', 'she', ';', 'for', 'Mrs.', 'has', 'just', 'been', 'here', 'told', 'me', 'all', 'about', 'made', 'no', 'answer', 'Do', 'know', 'who', 'taken', 'cried', 'You', 'tell', 'I', 'hearing', 'This', 'was', 'invitation', 'enough', 'Why', 'my', 'young', 'large']
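
The same filter can be tried on a small list by turning the threshold into a parameter. The sketch below is a hypothetical variant of the function above: the name `filter_low_frequency`, the `threshold` argument, and the sample words are assumptions for illustration.

```python
from nltk.probability import FreqDist


def filter_low_frequency(words, threshold):
    """Return the distinct words that occur more than `threshold` times."""
    fdist = FreqDist(words)
    return [word for word in fdist if fdist[word] > threshold]


sample = ["to", "be", "or", "not", "to", "be", "that", "is", "to"]
kept = filter_low_frequency(sample, threshold=1)
print(kept)  # ['to', 'be']
```

As in the original function, the result lists each surviving word once, in order of first appearance, rather than the filtered token sequence.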

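(5) Draw a dispersion plot to see where the given words (Elizabeth, Darcy, Wickham, Bingley, Jane) occur in the text

# Uses the Text class from the nltk.text module
from nltk.text import Text

def drawPlacement(words):
    text = Text(words)
    text.dispersion_plot(["Elizabeth", "Darcy", "Wickham", "Bingley", "Jane"])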

(6) Plot the frequency distribution of the 20 most frequent meaningful words

def drawFreqMap(words):
    fdist = FreqDist(words)
    fdist.plot(20)
