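Keras Natural Language Processing (Part 5)

2.5 Tokenizing and Cleaning Text with NLTK

The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text. It provides tools for loading and cleaning text, which we can use to prepare the data our machine learning models need.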

2.5.1 Installing NLTK

You can install NLTK with your preferred package manager, for example pip:

pip install nltk

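After installing NLTK, you also need to install the NLTK data, which includes a large set of text collections that you can later use to try out other tools in NLTK. Start by downloading the data: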

import nltk
nltk.download()
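
Called with no arguments, nltk.download() opens an interactive downloader. For scripted setups, a minimal sketch like the following fetches only named resources. The resource names are assumptions about what the examples in this section need (the sentence tokenizer models and the stop word lists); whether the tokenizer package is called punkt or punkt_tab depends on the installed NLTK version. fetch_nltk_data is a hypothetical helper name.

```python
import nltk

# Assumed resource names: "punkt" / "punkt_tab" hold the sentence tokenizer
# models (which one is needed depends on the NLTK version), and "stopwords"
# holds the stop word lists.
RESOURCES = ("punkt", "punkt_tab", "stopwords")


def fetch_nltk_data(resources=RESOURCES):
    """Download each named resource and report whether it succeeded."""
    return {name: nltk.download(name, quiet=True) for name in resources}


# Usage (needs network access):
# status = fetch_nltk_data()
# print(status)
```

A name that does not exist in the installed version is simply reported as a failed download, so listing both tokenizer packages is harmless.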

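2.5.2 Splitting into Sentences

A good first step is to split the text into sentences. Some modeling tasks, such as word2vec, prefer input in the form of paragraphs or sentences, so you can first split the text into sentences and then split each sentence into words. NLTK provides the sent_tokenize() function for splitting text into sentences.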

from nltk import sent_tokenize
# load the text
with open('Metamorphosis.txt', 'r') as f:
    text = f.read()
# split into sentences
sentences = sent_tokenize(text)
print(sentences[0])

The result is:

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
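
To feed a model such as word2vec, the text is often turned into a list of sentences in which each sentence is itself a list of tokens. The sketch below shows that nested shape using only the standard library. The regular expressions are a deliberately naive stand-in chosen for illustration, not the algorithm behind sent_tokenize(), and they will mis-split abbreviations such as "Dr."; split_sentences and to_token_lists are hypothetical helper names.

```python
import re


def split_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_token_lists(text):
    """Return one list of word tokens per sentence."""
    word = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")
    return [word.findall(sentence) for sentence in split_sentences(text)]


sample = "He lay on his armour-like back. What's happened to me? It wasn't a dream."
for tokens in to_token_lists(sample):
    print(tokens)
```

In a real pipeline the two helpers would be replaced by sent_tokenize() and word_tokenize(), keeping the same list-of-lists structure.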

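2.5.3 Splitting into Words

NLTK provides the word_tokenize() function for splitting a string into tokens. It splits on whitespace and punctuation: commas and periods become separate tokens, contractions are split apart (What's becomes What and 's), and quotation marks are kept.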

from nltk import word_tokenize
# load the text
with open('Metamorphosis.txt', 'r') as f:
    text = f.read()
# split into tokens
words = word_tokenize(text)
print(words[:100])

The result is:

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', 'What', "'s", 'happened', 'to', 'me']
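
The way contractions and punctuation are split off can be checked on a single sentence. This sketch uses NLTK's TreebankWordTokenizer class, which applies Penn Treebank style rules and needs no downloaded data. It is used here as a convenient stand-in and may differ from word_tokenize() in some details.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("What's happened to me?")
print(tokens)  # ['What', "'s", 'happened', 'to', 'me', '?']
```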

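We can see that punctuation marks have become tokens of their own. Next, we filter them out.

2.5.4 Filtering Out Punctuation

We can filter out every token we are not interested in, such as standalone punctuation. This can be done by iterating over all tokens and keeping only those that consist entirely of alphabetic characters. Python's built-in str.isalpha() method performs that check.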

from nltk import word_tokenize
# load the text
with open('Metamorphosis.txt', 'r') as f:
    text = f.read()
# split into tokens
words = word_tokenize(text)
# keep only tokens made up entirely of letters
words = [word for word in words if word.isalpha()]
print(words[:100])

The result is:

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']

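Compared with the tokenization example, more than the punctuation is gone: tokens that mix letters with other characters, such as the hyphenated word armour-like and the contraction token 's, have been removed as well.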
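
The effect of str.isalpha() on mixed tokens can be seen on a handful of tokens taken from the output above. The sketch also shows one possible alternative: deleting punctuation characters from each token before the check, so that hyphenated words survive in merged form. strip_punctuation is a hypothetical helper name, and whether a merged form such as armourlike is desirable depends on the task.

```python
import string

tokens = ["his", "armour-like", "back", ",", "What", "'s", "happened", "."]

# Plain isalpha() filter: any token containing a non-letter is dropped.
alpha_only = [t for t in tokens if t.isalpha()]
print(alpha_only)  # ['his', 'back', 'What', 'happened']

# Alternative: delete punctuation characters first, then filter.
table = str.maketrans("", "", string.punctuation)


def strip_punctuation(token):
    """Remove every ASCII punctuation character from a token."""
    return token.translate(table)


stripped = [strip_punctuation(t) for t in tokens]
kept = [t for t in stripped if t.isalpha()]
print(kept)  # ['his', 'armourlike', 'back', 'What', 's', 'happened']
```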

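2.5.5 Filtering Out Stop Words (and Building a Pipeline)

Stop words are the most common words in a language, and they contribute little to the deeper meaning of a phrase, for example a and is. For some problems it makes sense to remove them. NLTK provides stop word lists for a variety of languages. The English list can be loaded as follows: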

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

The output is:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

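Note that they are all lowercase. We can compare our tokens against the stop words and filter them out. We again use Metamorphosis.txt for the demonstration, with the following steps:

Load the text
Split it into tokens
Convert the tokens to lowercase
Remove punctuation from each token
Filter out remaining tokens that are not alphabetic
Filter out tokens that are stop words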

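import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# load the text
with open('Metamorphosis.txt', 'r') as f:
    text = f.read()
# split into tokens
tokens = word_tokenize(text)
# convert to lowercase
tokens = [token.lower() for token in tokens]
# replace punctuation inside each token with a space;
# tokens that contained punctuation then fail the isalpha() check below
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stripped = [re_punc.sub(' ', w) for w in tokens]
# keep only alphabetic tokens
words = [word for word in stripped if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
print(words[:100])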

The result:

['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer', 'gregor', 'turned']

We can see that, in addition to the other transformations, stop words such as a and to have been removed.
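
The steps above can be wrapped in a single reusable function. The following is a minimal sketch, not the code that produced the output above. clean_tokens is a hypothetical name, and SAMPLE_STOP_WORDS is a small hand-picked subset of the English list, used here so that the example runs without NLTK data. In practice you would pass set(stopwords.words('english')).

```python
import re
import string

# Assumption: a tiny subset of NLTK's English stop word list, for illustration.
SAMPLE_STOP_WORDS = {"a", "he", "his", "in", "into", "me", "on", "s", "to", "what"}

RE_PUNC = re.compile("[%s]" % re.escape(string.punctuation))


def clean_tokens(tokens, stop_words):
    """Lowercase, delete punctuation, keep alphabetic tokens, drop stop words."""
    lowered = (token.lower() for token in tokens)
    stripped = (RE_PUNC.sub("", token) for token in lowered)
    return [w for w in stripped if w.isalpha() and w not in stop_words]


raw_tokens = ["He", "lay", "on", "his", "armour-like", "back", ".",
              "What", "'s", "happened", "to", "me", "?"]
cleaned = clean_tokens(raw_tokens, SAMPLE_STOP_WORDS)
print(cleaned)  # ['lay', 'armourlike', 'back', 'happened']
```

Here punctuation is deleted rather than replaced with a space, so armour-like is kept as armourlike instead of being dropped by the isalpha() check. Which behaviour is preferable depends on the task.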

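2.5.6 Stemming

Stemming is the process of reducing each word to its root or base form. For example, a stemmer might reduce fishing, fished, and fisher to the stem fish. Some applications, such as document classification, can benefit from stemming, both to shrink the vocabulary and to focus on the sense or sentiment of a document rather than its deeper meaning. There are many stemming algorithms, but a widely used one is the Porter stemming algorithm, available in NLTK through the PorterStemmer class. For example: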

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
# load the text
with open('Metamorphosis.txt', 'r') as f:
    text = f.read()
# split into tokens
tokens = word_tokenize(text)
# stem each token
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

The result is:

['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as', 'he', 'look', '.', 'what', "'s", 'happen', 'to', 'me']

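Running the example shows that words have been reduced to their stems, for example troubled has become troubl. The stemming implementation has also converted the tokens to lowercase.
NLTK offers a good selection of stemming and lemmatization algorithms to choose from, if reducing words to their root is something your project needs.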
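
Stemming individual words makes the behaviour easier to inspect than a full token list. This sketch runs PorterStemmer, which needs no downloaded data, on a few words. The word list is chosen only for illustration.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
words = ["morning", "troubled", "dreams", "transformed", "fishing", "fished"]

# Map each word to its stem so related forms can be compared side by side.
stems = {word: porter.stem(word) for word in words}
for word, stem in stems.items():
    print(word, "->", stem)
```

Stems such as troubl are not dictionary words. Stemming only aims to map related word forms to the same string, which is why fishing and fished both end up as fish.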

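2.6 Additional Text Cleaning Considerations

The text used in this tutorial is very clean, so we were able to skip many issues that you may need to handle when cleaning your own data. Here is a short list of additional considerations when working with other text: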

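Handling large documents and large collections of documents that do not fit into memory
Extracting text from markup such as HTML, PDF, or other structured document formats
Transliterating characters from other languages into English
Decoding text into a consistent encoding such as UTF-8 and normalizing Unicode characters
Handling domain-specific words, phrases, and acronyms
Handling or removing numbers, such as amounts and dates
Finding and correcting spelling errors
And more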
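
A few of the items above can be sketched with the standard library alone: pulling text out of HTML, normalizing Unicode, and removing numbers. This is a minimal illustration under simple assumptions (well-formed markup, numbers written with digits). strip_html, normalize_unicode, and drop_numbers are hypothetical helper names, and real documents usually need a dedicated parser or more careful rules.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def normalize_unicode(text):
    """Apply NFKC normalization, e.g. the 'fi' ligature becomes 'f' + 'i'."""
    return unicodedata.normalize("NFKC", text)


def drop_numbers(text):
    """Replace digit runs such as 12 or 12.50 with a space."""
    return re.sub(r"\d+(?:[.,]\d+)*", " ", text)


raw = "<p>The \ufb01sh cost <b>12.50</b> dollars</p>"
clean = drop_numbers(normalize_unicode(strip_html(raw)))
clean = " ".join(clean.split())  # collapse repeated whitespace
print(clean)  # The fish cost dollars
```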

The list could go on. Hopefully you can apply these ideas and end up with clean text.