Chapter 6: Preparing Text Data for Sentiment Analysis of Movie Reviews

Text data differs from problem to problem. Preparation starts with simple steps, such as loading the data, but the cleaning work gets harder as the task proceeds. Let's walk step by step through how to prepare text data for sentiment analysis of movie reviews:

  1. Load the text data and clean it
  2. Develop a vocabulary, tailor it, and save it to a file
  3. Prepare the movie reviews using the cleaning steps and the predefined vocabulary, and save them to files ready for modeling

6.1 Overview

We will cover the following parts:

  1. The movie review dataset
  2. Loading the text data
  3. Cleaning the text data
  4. Developing a vocabulary
  5. Saving the prepared data

6.2 The Movie Review Dataset

Dataset download:
http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Characteristics of the dataset:
- English-only reviews
- All lowercase
- Whitespace around punctuation
- One sentence per line

6.3 Loading Text Data

In this section we load a single text file.

import os

file_path = r"F:\5-model data\aclImdb\aclImdb\train\neg"
file_data = os.path.join(file_path, '7_3.txt')
with open(file_data, 'r') as f:
    text = f.read()
print(text)

This opens the file 7_3.txt in the given directory:

This is really a new low in entertainment. Even though there are a lot worse movies out.<br /><br />In the Gangster / Drug scene genre it is hard to have a convincing storyline (this movies does not, i mean Sebastians motives for example couldn't be more far fetched and worn out clich茅.) Then you would also need a setting of character relationships that is believable (this movie does not.) <br /><br />Sure Tristan is drawn away from his family but why was that again? what's the deal with his father again that he has to ask permission to go out at his age? interesting picture though to ask about the lack and need of rebellious behavior of kids in upper class family. But this movie does not go in this direction. Even though there would be the potential judging by the random Backflashes. Wasn't he already down and out, why does he do it again? <br /><br />So there are some interesting questions brought up here for a solid socially critic drama (but then again, this movie is just not, because of focusing on "cool" production techniques and special effects an not giving the characters a moment to reflect and most of all forcing the story along the path where they want it to be and not paying attention to let the story breath and naturally evolve.) <br /><br />It wants to be a drama to not glorify abuse of substances and violence (would be political incorrect these days, wouldn't it?) but on the other hand it is nothing more then a cheap action movie (like there are so so many out there) with an average set of actors and a Vinnie Jones who is managing to not totally ruin what's left of his reputation by doing what he always does.<br /><br />So all in all i .. just ... can't recommend it.<br /><br />1 for Vinnie and 2 for the editing.

We can also wrap this in a function:

import os

def load_doc(file_data):
    with open(file_data, 'r') as f:
        text = f.read()
    return text

file_path = r"F:\5-model data\aclImdb\aclImdb\train\neg"
file_data = os.path.join(file_path, '7_3.txt')
text_data = load_doc(file_data)
print(text_data)

The result is the same review text as printed above.

Under the data folder aclImdb\train\ there are two subfolders. We can use the listdir() function to get the directory listing and then load each file in turn. Here, alongside load_doc, we define a path_to_file_data function that reads out all the text documents under a directory:

def load_doc(file_data):
    with open(file_data, 'r') as f:
        text = f.read()
    return text

def path_to_file_data(target_path):
    # collect the text of every .txt file under target_path
    texts = []
    for file_name in os.listdir(target_path):
        if file_name.endswith('.txt'):
            texts.append(load_doc(os.path.join(target_path, file_name)))
    return texts

Now that we know how to load the data, let's look at how to clean it.

6.4 Cleaning the Text Data

In this section we look at what cleaning the movie review data may need, assuming we will use a bag-of-words model or a word embedding model that does not require much preparation.

6.4.1 Tokenization

First we load a document and split it into tokens on whitespace, using the load_doc function from the previous section to load the file and split() to tokenize it:

def load_doc(file_data):
    with open(file_data, 'r') as f:
        text = f.read()
    return text

text = load_doc(file_data)
tokens = text.split()
print(tokens)

The result is a long list:

['This', 'is', 'really', 'a', 'new', 'low', 'in', 'entertainment.', 'Even', 'though', 'there', 'are', 'a', 'lot', 'worse', 'movies', 'out.<br', '/><br', '/>In', 'the', 'Gangster', '/', 'Drug', 'scene', 'genre', 'it', 'is', 'hard', 'to', 'have', 'a', 'convincing', 'storyline', '(this', 'movies', 'does', 'not,', 'i', 'mean', 'Sebastians', 'motives', 'for', 'example', "couldn't", 'be', 'more', 'far', 'fetched', 'and', 'worn', 'out', 'clich茅.)', 'Then', 'you', 'would', 'also', 'need', 'a', 'setting', 'of', 'character', 'relationships', 'that', 'is', 'believable', '(this', 'movie', 'does', 'not.)', '<br', '/><br', '/>Sure', 'Tristan', 'is', 'drawn', 'away', 'from', 'his', 'family', 'but', 'why', 'was', 'that', 'again?', "what's", 'the', 'deal', 'with', 'his', 'father', 'again', 'that', 'he', 'has', 'to', 'ask', 'permission', 'to', 'go', 'out', 'at', 'his', 'age?', 'interesting', 'picture', 'though', 'to', 'ask', 'about', 'the', 'lack', 'and', 'need', 'of', 'rebellious', 'behavior', 'of', 'kids', 'in', 'upper', 'class', 'family.', 'But', 'this', 'movie', 'does', 'not', 'go', 'in', 'this', 'direction.', 'Even', 'though', 'there', 'would', 'be', 'the', 'potential', 'judging', 'by', 'the', 'random', 'Backflashes.', "Wasn't", 'he', 'already', 'down', 'and', 'out,', 'why', 'does', 'he', 'do', 'it', 'again?', '<br', '/><br', '/>So', 'there', 'are', 'some', 'interesting', 'questions', 'brought', 'up', 'here', 'for', 'a', 'solid', 'socially', 'critic', 'drama', '(but', 'then', 'again,', 'this', 'movie', 'is', 'just', 'not,', 'because', 'of', 'focusing', 'on', '"cool"', 'production', 'techniques', 'and', 'special', 'effects', 'an', 'not', 'giving', 'the', 'characters', 'a', 'moment', 'to', 'reflect', 'and', 'most', 'of', 'all', 'forcing', 'the', 'story', 'along', 'the', 'path', 'where', 'they', 'want', 'it', 'to', 'be', 'and', 'not', 'paying', 'attention', 'to', 'let', 'the', 'story', 'breath', 'and', 'naturally', 'evolve.)', '<br', '/><br', '/>It', 'wants', 'to', 'be', 'a', 
'drama', 'to', 'not', 'glorify', 'abuse', 'of', 'substances', 'and', 'violence', '(would', 'be', 'political', 'incorrect', 'these', 'days,', "wouldn't", 'it?)', 'but', 'on', 'the', 'other', 'hand', 'it', 'is', 'nothing', 'more', 'then', 'a', 'cheap', 'action', 'movie', '(like', 'there', 'are', 'so', 'so', 'many', 'out', 'there)', 'with', 'an', 'average', 'set', 'of', 'actors', 'and', 'a', 'Vinnie', 'Jones', 'who', 'is', 'managing', 'to', 'not', 'totally', 'ruin', "what's", 'left', 'of', 'his', 'reputation', 'by', 'doing', 'what', 'he', 'always', 'does.<br', '/><br', '/>So', 'all', 'in', 'all', 'i', '..', 'just', '...', "can't", 'recommend', 'it.<br', '/><br', '/>1', 'for', 'Vinnie', 'and', '2', 'for', 'the', 'editing.']

Just looking at the raw tokens gives us plenty of ideas, for example:
- Remove punctuation from words, e.g. "Wasn't"
- Convert all words to lowercase, e.g. 'This'
- Remove single-character tokens, e.g. 'i'
- Remove tokens that carry little meaning, e.g. 'and'
- Remove meaningless fragments stuck to tokens, e.g. 'it.<br'

We can implement these ideas as follows: use a regular expression to strip punctuation from the tokens, use isalpha() to drop tokens that are pure punctuation or contain digits, use NLTK to remove English stop words, and filter out short tokens by limiting word length:

import os
import string, re
from nltk.corpus import stopwords

def load_doc(file_data):
    with open(file_data, 'r') as f:
        text = f.read()
    return text

file_path = r"F:\5-model data\aclImdb\aclImdb\train\neg"
file_data = os.path.join(file_path, '7_3.txt')
text = load_doc(file_data)
tokens = text.split()
print(tokens)
# prepare the punctuation filter
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each token
tokens = [re_punc.sub('', w) for w in tokens]
# remove remaining non-alphabetic tokens
tokens = [word for word in tokens if word.isalpha()]
# filter out English stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)

The cleaned result (compare with the raw token list printed above):
['This', 'really', 'new', 'low', 'entertainment', 'Even', 'though', 'lot', 'worse', 'movies', 'outbr', 'br', 'In', 'Gangster', 'Drug', 'scene', 'genre', 'hard', 'convincing', 'storyline', 'movies', 'mean', 'Sebastians', 'motives', 'example', 'couldnt', 'far', 'fetched', 'worn', 'clich茅', 'Then', 'would', 'also', 'need', 'setting', 'character', 'relationships', 'believable', 'movie', 'br', 'br', 'Sure', 'Tristan', 'drawn', 'away', 'family', 'whats', 'deal', 'father', 'ask', 'permission', 'go', 'age', 'interesting', 'picture', 'though', 'ask', 'lack', 'need', 'rebellious', 'behavior', 'kids', 'upper', 'class', 'family', 'But', 'movie', 'go', 'direction', 'Even', 'though', 'would', 'potential', 'judging', 'random', 'Backflashes', 'Wasnt', 'already', 'br', 'br', 'So', 'interesting', 'questions', 'brought', 'solid', 'socially', 'critic', 'drama', 'movie', 'focusing', 'cool', 'production', 'techniques', 'special', 'effects', 'giving', 'characters', 'moment', 'reflect', 'forcing', 'story', 'along', 'path', 'want', 'paying', 'attention', 'let', 'story', 'breath', 'naturally', 'evolve', 'br', 'br', 'It', 'wants', 'drama', 'glorify', 'abuse', 'substances', 'violence', 'would', 'political', 'incorrect', 'days', 'wouldnt', 'hand', 'nothing', 'cheap', 'action', 'movie', 'like', 'many', 'average', 'set', 'actors', 'Vinnie', 'Jones', 'managing', 'totally', 'ruin', 'whats', 'left', 'reputation', 'always', 'doesbr', 'br', 'So', 'cant', 'recommend', 'itbr', 'br', 'Vinnie', 'editing']

We can see that the result below is already much better, but there is still a small problem: for example, 'clich茅' contains a stray CJK character, mojibake left over from the accented 'é' in 'cliché'. This needs further handling, which we leave as an exercise for the reader.
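One possible way to handle such mojibake, since the text leaves it as an exercise, is simply to drop non-ASCII characters from each token. This is only a sketch with made-up sample tokens, not part of the chapter's own code:

```python
# Sketch: strip non-ASCII characters from each token, one way to deal
# with mojibake such as 'clich茅' (a corrupted 'cliché').
tokens = ['clich茅', 'movie', 'drama']
cleaned = [w.encode('ascii', errors='ignore').decode('ascii') for w in tokens]
print(cleaned)  # ['clich', 'movie', 'drama']
```

An alternative is to open the files with the correct encoding in the first place, which avoids the mojibake entirely.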
Now let's package the processing above into a clear_doc function:

from nltk.corpus import stopwords
import string, re
import os

def clear_doc(doc):
    # split into tokens on whitespace
    tokens = doc.split()
    # prepare the punctuation filter
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # filter out (English) stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    # lowercase everything
    tokens = [w.lower() for w in tokens]
    return tokens

def load_doc(file_data):
    with open(file_data, 'r') as f:
        text = f.read()
    return text

file_path = r"F:\5-model data\aclImdb\aclImdb\train\neg"
file_data = os.path.join(file_path, '7_3.txt')
text = load_doc(file_data)
tokens = clear_doc(text)
print(tokens)

The result:

['this', 'really', 'new', 'low', 'entertainment', 'even', 'though', 'lot', 'worse', 'movies', 'outbr', 'br', 'in', 'gangster', 'drug', 'scene', 'genre', 'hard', 'convincing', 'storyline', 'movies', 'mean', 'sebastians', 'motives', 'example', 'couldnt', 'far', 'fetched', 'worn', 'clich茅', 'then', 'would', 'also', 'need', 'setting', 'character', 'relationships', 'believable', 'movie', 'br', 'br', 'sure', 'tristan', 'drawn', 'away', 'family', 'whats', 'deal', 'father', 'ask', 'permission', 'go', 'age', 'interesting', 'picture', 'though', 'ask', 'lack', 'need', 'rebellious', 'behavior', 'kids', 'upper', 'class', 'family', 'but', 'movie', 'go', 'direction', 'even', 'though', 'would', 'potential', 'judging', 'random', 'backflashes', 'wasnt', 'already', 'br', 'br', 'so', 'interesting', 'questions', 'brought', 'solid', 'socially', 'critic', 'drama', 'movie', 'focusing', 'cool', 'production', 'techniques', 'special', 'effects', 'giving', 'characters', 'moment', 'reflect', 'forcing', 'story', 'along', 'path', 'want', 'paying', 'attention', 'let', 'story', 'breath', 'naturally', 'evolve', 'br', 'br', 'it', 'wants', 'drama', 'glorify', 'abuse', 'substances', 'violence', 'would', 'political', 'incorrect', 'days', 'wouldnt', 'hand', 'nothing', 'cheap', 'action', 'movie', 'like', 'many', 'average', 'set', 'actors', 'vinnie', 'jones', 'managing', 'totally', 'ruin', 'whats', 'left', 'reputation', 'always', 'doesbr', 'br', 'so', 'cant', 'recommend', 'itbr', 'br', 'vinnie', 'editing']

6.5 Developing the Vocabulary

When developing a predictive model of text, such as a bag-of-words model, there is pressure to keep the vocabulary small: the larger the vocabulary, the sparser the representation of each word or document. Part of preparing text for sentiment analysis is defining and tailoring the vocabulary of words that the model will support. We do this by loading all documents in the dataset and building up the set of words. We may keep all of them, or perhaps only a subset, and then save the final chosen vocabulary to a file for later use.
We use the Counter class to manage and process the vocabulary. A Counter is a dictionary mapping each word to its count. We then define some helper functions to work with the vocabulary more conveniently. First, a function that processes a document and adds it to the vocabulary: it loads the document with load_doc, cleans it with the clear_doc function defined earlier, then adds the cleaned tokens to the Counter and updates the counts. This last step is done with the Counter's update function. Below we define a function named add_doc_to_vocab that takes a document filename and the Counter vocabulary as arguments.
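The Counter behaviour described above can be seen in isolation; the tokens below are made up purely for illustration:

```python
from collections import Counter

vocab = Counter()
vocab.update(['movie', 'good', 'movie'])  # movie -> 2, good -> 1
vocab.update(['good', 'plot'])            # counts accumulate across calls
print(vocab['movie'], vocab['good'], vocab['plot'])  # 2 2 1
```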

def add_doc_to_vocab(file_name, vocab):
    doc = load_doc(file_name)
    tokens = clear_doc(doc)
    vocab.update(tokens)

Finally, we wrap the processing of all files in a directory into a process_docs function, which calls add_doc_to_vocab on each file to update the vocabulary:

def process_docs(directory, vocab):
    for file_name in os.listdir(directory):
        # skip anything that is not a .txt review file
        if not file_name.endswith('.txt'):
            continue
        path = os.path.join(directory, file_name)
        add_doc_to_vocab(path, vocab)

Now let's put all the code together:

from nltk.corpus import stopwords
from collections import Counter
import string, re
import os

def clear_doc(doc):
    # split into tokens on whitespace
    tokens = doc.split()
    # prepare the punctuation filter
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove Chinese (mojibake) characters
    reg = re.compile('[\u4e00-\u9fa5]')
    tokens = [reg.sub('', w) for w in tokens]
    # remove remaining non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # filter out English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    # lowercase everything
    tokens = [w.lower() for w in tokens]
    return tokens

def load_doc(file_data):
    with open(file_data, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

def add_doc_to_vocab(file_name, vocab):
    doc = load_doc(file_name)
    tokens = clear_doc(doc)
    vocab.update(tokens)

def process_docs(directory, vocab):
    for file_name in os.listdir(directory):
        if not file_name.endswith('.txt'):
            continue
        path = os.path.join(directory, file_name)
        add_doc_to_vocab(path, vocab)

vocab = Counter()
file_path_neg = r"F:\5-model data\aclImdb\aclImdb\train\neg"
process_docs(file_path_neg, vocab)
file_path_pos = r"F:\5-model data\aclImdb\aclImdb\train\pos"
process_docs(file_path_pos, vocab)
print(len(vocab))
print(vocab.most_common(50))

The result:

117361
[('br', 57143), ('the', 45829), ('movie', 41807), ('film', 37455), ('one', 25508), ('like', 19641), ('this', 14984), ('good', 14555), ('even', 12503), ('it', 12265), ('would', 12135), ('time', 11779), ('really', 11663), ('story', 11454), ('see', 11223), ('much', 9584), ('well', 9372), ('get', 9212), ('also', 9073), ('people', 8951), ('bad', 8912), ('great', 8894), ('first', 8857), ('dont', 8473), ('made', 7990), ('movies', 7788), ('make', 7729), ('films', 7727), ('could', 7713), ('way', 7685), ('but', 7323), ('characters', 7290), ('think', 7229), ('and', 7045), ('watch', 6777), ('its', 6773), ('two', 6643), ('many', 6640), ('seen', 6529), ('character', 6514), ('never', 6425), ('little', 6387), ('acting', 6291), ('plot', 6275), ('best', 6263), ('love', 6214), ('in', 6044), ('know', 6038), ('life', 5988), ('show', 5967)]

This example builds a vocabulary covering every document in the dataset, both positive and negative reviews. We can see there are 117,361 words across all the reviews, over a hundred thousand; the five most common are ('br', 57143), ('the', 45829), ('movie', 41807), ('film', 37455), ('one', 25508).
The least common words, those appearing only once across all reviews, contribute almost nothing to prediction, and some of the most common words may also be useless for sentiment analysis; the right trade-off depends on the predictive model being used. In general, words that appear only once or a handful of times across the 2000 reviews contribute negligibly to the analysis and can be removed from the vocabulary, greatly reducing the tokens we need to model. We can do this by stepping through the words and their counts, keeping only those whose count meets a threshold. Here we choose 5 as the threshold.

min_occurrence = 5
tokens = [k for k, c in vocab.items() if c >= min_occurrence]
print(len(tokens))

This gives:

31524
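How the threshold affects vocabulary size can be explored by sweeping several values over the same Counter. Here is a sketch with a toy vocabulary standing in for the real one (the words and counts are invented):

```python
from collections import Counter

# toy stand-in for the 117,361-word Counter built from the reviews
vocab = Counter({'movie': 100, 'plot': 7, 'vinnie': 5, 'necroborgs': 1})

for min_occurrence in (1, 2, 5, 10):
    kept = [k for k, c in vocab.items() if c >= min_occurrence]
    # vocabulary size shrinks as the threshold grows
    print(min_occurrence, len(kept))
```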

Down from about 117,000 words to about 31,000. A threshold of 5 may be too aggressive; you can try other values. Next we save the chosen words to a file. Below we define a save_list function for saving the vocabulary list:

def save_list(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(data)

The complete code:

from nltk.corpus import stopwords
from collections import Counter
import string, re
import os

def clear_doc(doc):
    # split into tokens on whitespace
    tokens = doc.split()
    # prepare the punctuation filter
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove Chinese (mojibake) characters
    reg = re.compile('[\u4e00-\u9fa5]')
    tokens = [reg.sub('', w) for w in tokens]
    # remove remaining non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # filter out English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    # lowercase everything
    tokens = [w.lower() for w in tokens]
    return tokens

def load_doc(file_data):
    with open(file_data, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

def add_doc_to_vocab(file_name, vocab):
    doc = load_doc(file_name)
    tokens = clear_doc(doc)
    vocab.update(tokens)

def process_docs(directory, vocab):
    for file_name in os.listdir(directory):
        if not file_name.endswith('.txt'):
            continue
        path = os.path.join(directory, file_name)
        add_doc_to_vocab(path, vocab)

def save_list(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(data)

vocab = Counter()
file_path_neg = r"F:\5-model data\aclImdb\aclImdb\train\neg"
process_docs(file_path_neg, vocab)
file_path_pos = r"F:\5-model data\aclImdb\aclImdb\train\pos"
process_docs(file_path_pos, vocab)
print(len(vocab))
# keep only tokens that appear at least min_occurrence times
min_occurrence = 5
tokens = [k for k, c in vocab.items() if c >= min_occurrence]
print(len(tokens))
print(tokens[:20])
save_list(tokens, 'vocab.txt')

Running this code creates a vocab.txt file in the local folder; opening it shows the vocabulary, one word per line.

6.6 Saving the Prepared Data

We can use the data cleaning and the chosen vocabulary to process each movie review and save it, ready for the subsequent model building. This separates data preparation from modeling, so that if you have new ideas about either, you can dig into the data or the modeling independently. We start by loading the vocab.txt file.

import os

def load_doc(file_data):
    with open(file_data, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

file_path = r"E:\1- data"
data_file = os.path.join(file_path, 'vocab.txt')
vocab = load_doc(data_file)
print(vocab)
vocab = vocab.split()
print(vocab)
vocab = set(vocab)
print(vocab)

The full output is too large to show in its entirety, so only a portion is excerpted below:

necroborgs
genma
cheang
catiii
yokai
tadashis
elkaim
rideau
presque
lifshitz
cédric
paulies
agi
santamarina
peaches
yaara
letty
jarada
thornway
rawhide
erendira
['story', 'man', 'unnatural', 'feelings', 'pig', 'starts', 'opening', 'scene', 'terrific', 'example', 'absurd', 'comedy', 'formal', 'orchestra', 'audience', 'turned', 'insane', 'violent', 'mob', 'crazy', 'singers', 'unfortunately', 'stays', 'whole', 'time', 'general', 'narrative', 'eventually', 'making', 'putting', 'even', 'era', 'the', 'cryptic', 'dialogue', 'would', 'make', 'shakespeare', 'seem', 'easy', 'third', 'grader', 'on', 'technical', 'level', 'better', 'might', 'think', 'good', 'cinematography', 'future', 'great', 'stars', 'sally', 'kirkland', 'frederic', 'forrest', 'seen', 'briefly', 'airport', 'brand', 'new', 'luxury', 'plane', 'loaded', 'valuable', 'paintings', 'belonging', 'rich', 'businessman', 'philip', 'stevens', 'james', 'stewart', 'flying', 'bunch', 'estate', 'preparation', 'opened', 'public', 'museum', 'also', 'board', 'daughter', 'julie', 'kathleen', 'son', 'takes', 'planned', 'midair', 'hijacked', 'chambers', 'robert', 'foxworth', 'two', 'accomplices', 'banker', 'monte', 'markham', 'wilson', 'michael', 'pataki', 'knock', 'passengers', 'crew', 'sleeping', 'gas', 'plan', 'steal', 'cargo', 'land', 'disused', 'strip', 'isolated', 'island', 'descent', 'almost', 'hits', 'oil', 'rig', 'ocean', 'loses', 'control', 'sending', 'crashing', 'sea', 'sinks', 'bottom', 'right', 'bang', 'middle', 'triangle', 'with', 'air', 'short', 'supply', 'water', 'flown', 'miles', 'course', 'problems', 'mount', 'survivors', 'await', 'help', 'fast', 'running', 'outbr', 'br', 'known', 'slightly', 'different', 'second', 'sequel', 'disaster', 'thriller', 'directed', 'jerry', 'jameson', 'like', 'predecessors', 'cant', 'say', 'sort', 'forgotten', 'classic', 'entertaining', 'although', 'necessarily', 'reasons', 'out', 'three', 'films', 'far', 'actually', 'liked', 'one', 'best', 'it', 'favourite', 'plot', 'nice', 'hijacking', 'didnt', 'see', 'sinking', 'maybe', 'makers', 'trying', 'cross', 'original', 'another', 'popular', 'flick', 'period', 'adventure', 'submerged', 'end', 
'stark', 'dilemma', 'facing', 'trapped', 'inside', 'either', 'runs', 'drown', 'floods', 'doors', 'decent', 'idea', 'could', 'made', 'little', 'bad', 'unsympathetic', 'characters', 'dull', 'lethargic', 'setpieces', 'real', 'lack', 'danger', 'suspense', 'tension', 'means', 'missed', 'opportunity', 'while']
{'dat', 'capacity', 'wwe', 'transfixed', 'whatever', 'cantonese', 'capote', 'cultclassic', 'shotsbr', 'dictators', 'brunt', 'successor', 'harbors', 'cannavale', 'posturing', 'surviving', 'raped', 'steeles', 'fallout', 'knee', 'macready', 'fawcett', 'group', 'ovation', 'likes', 'charter', 'remained', 'sneers', 'hottest', 'extremebr', 'sexiest', 'murdock', 'rewatching', 'clichéridden', 'himselfbr', 'mortal', 'sufficient', 'fronts', 'celebrated', 'mantra', 'snipes', 'months', 'peacock', 'dh', 'doorbr', 'dolphins', 'rosy', 'debatable', 'babe', 'capitalists', 'missiles', 'firm', 'andys', 'dosent', 'enterprise', 'meanspirited', 'digested', 'wasted', 'creepiness', 'rowdy', 'graduate', 'corrected', 'uncalled', 'merchant', 'fox', 'misfit', 'helms', 'ankush', 'repertoire', 'laid', 'titanics', 'joins', 'onboard', 'em', 'hows', 'mcenroe', 'egyptologist', 'institute', 'doers', 'announce', 'lumps', 'violation', 'lifethe', 'largely', 'carrell', 'emotions', 'adrianne', 'notbr', 'roth', 'sith', 'sickeningly', 'bags', 'burial', 'joycelyn', 'geico', 'copycat', 'donning', 'woodard', 'viewable', 'peaks', 'adopted', 'banking', 'duncan', 'charo', 'manchu', 'rational', 'palpable', 'coup', 'joints', 'cling', 'explicitly', 'girly', 'hopes', 'ditches', 'plot', 'mic', 'mixedup', 'cinemagic', 'ransom', 'holbrook', 'attributes', 'remarkable', 'formulas', 'impostor', 'suspense', 'morrison', 'blondes', 'propels', 'joker', 'judson', 'vierde', 'fittingly', 'happinessbr', 'glamour', 'thirtysomething'}

Next, we can clean each movie review, use the loaded vocabulary to filter out unwanted tokens, and save the clean reviews to a new file. One approach is to save all positive reviews in one file and all negative reviews in another, with each filtered review on its own line and tokens separated by spaces. First, we define a function that processes a review document: clean it, filter it, then rejoin the review as a single line for saving to file. The doc_to_line function below does this, taking the filename and the vocabulary as arguments; it calls the previously defined load_doc function to load the document and the clear_doc function to clean it:

def doc_to_line(filename, vocab):
    doc = load_doc(filename)
    tokens = clear_doc(doc)
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

Next we define a new version of process_docs that steps through all the reviews in a folder, converts each document to a line via doc_to_line, and returns the list of lines:

import os

def process_docs(directory, vocab):
    lines = list()
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            files = os.path.join(directory, filename)
            line = doc_to_line(files, vocab)
            lines.append(line)
    return lines

Then we call process_docs on the directories of positive and negative reviews, and use save_list from the previous section to save each set of processed reviews to a file.
The complete process is listed below:

import os, string, re
from nltk.corpus import stopwords

def load_doc(file_data):
    with open(file_data, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

def clear_doc(doc):
    # split into tokens on whitespace
    tokens = doc.split()
    # prepare the punctuation filter
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove Chinese (mojibake) characters
    reg = re.compile('[\u4e00-\u9fa5]')
    tokens = [reg.sub('', w) for w in tokens]
    # remove remaining non-alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # filter out English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    # lowercase everything
    tokens = [w.lower() for w in tokens]
    return tokens

def save_list(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(data)

def doc_to_line(filename, vocab):
    doc = load_doc(filename)
    tokens = clear_doc(doc)
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

def process_docs(directory, vocab):
    lines = list()
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            files = os.path.join(directory, filename)
            line = doc_to_line(files, vocab)
            lines.append(line)
    return lines

vocab_path = r"E:\1- data"
vocab_name = "vocab.txt"
vocab_data = os.path.join(vocab_path, vocab_name)
vocab = load_doc(vocab_data)
vocab = vocab.split()
vocab = set(vocab)
# negative reviews
neg_path = r"F:\5-model and data\aclImdb\aclImdb\train\neg"
neg_lines = process_docs(neg_path, vocab)
save_list(neg_lines, 'neg.txt')
# positive reviews
pos_path = r"F:\5-model and data\aclImdb\aclImdb\train\pos"
pos_lines = process_docs(pos_path, vocab)
save_list(pos_lines, 'pos.txt')

This example saves two files, neg.txt and pos.txt, containing the negative and positive reviews respectively. The data is now ready for use in a bag-of-words or even a word embedding model.
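As a glimpse of the modeling step, the saved one-review-per-line files can feed a bag-of-words encoding directly. This sketch uses two made-up cleaned review lines in place of the real contents of neg.txt and pos.txt:

```python
from collections import Counter

# stand-ins for lines read from neg.txt / pos.txt (one cleaned review per line)
lines = ['cheap action movie bad acting', 'great story great acting']

# a sorted vocabulary fixes the ordering of the feature vector
vocab = sorted({w for line in lines for w in line.split()})

def bag_of_words(line, vocab):
    # count each token, then read the counts out in vocabulary order
    counts = Counter(line.split())
    return [counts[w] for w in vocab]

print(vocab)
print([bag_of_words(line, vocab) for line in lines])
```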
