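During preprocessing, English contractions should be expanded to their full forms: for example, "it's" is equivalent to "it is", and "won't" is equivalent to "will not".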

text = "The story loses its bite in a last-minute happy ending that's even less plausible than the rest of the picture ."
text.replace("that's", "that is")

'The story loses its bite in a last-minute happy ending that is even less plausible than the rest of the picture .'
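
Chaining one replace() call per contraction becomes unwieldy as the list grows. A minimal sketch of a table-driven alternative follows. The CONTRACTIONS table and the expand_contractions helper are illustrative names, not part of any library, and the table is deliberately incomplete.

```python
import re

# Illustrative, incomplete table (an assumption); extend it for real data.
CONTRACTIONS = {
    "it's": "it is",
    "that's": "that is",
    "won't": "will not",
    "can't": "can not",
    "don't": "do not",
}

# One alternation pattern with word boundaries, so a contraction is only
# matched as a whole token and not as a fragment of a longer string.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its (lowercase) expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("It's a happy ending that's hard to believe, but it won't matter."))
```

The expansion is always lowercase, which is harmless here because the text is lowercased in the next step anyway.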

text = "This is a film well worth seeing , talking and singing heads and all ."
text.lower()

'this is a film well worth seeing , talking and singing heads and all .'

import re

text = "disney has always been hit-or-miss when bringing beloved kids' books to the screen . . . tuck everlasting is a little of both ."
# Replace every non-letter character with a space
text = re.sub("[^a-zA-Z]", " ", text)
# Collapse redundant whitespace
' '.join(text.split())

'disney has always been hit or miss when bringing beloved kids books to the screen tuck everlasting is a little of both'

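
The character filter above also removes apostrophes, so the order of the steps matters: contractions should be expanded before non-letters are stripped. The sketch below illustrates this with a small helper (normalize is an illustrative name, not a library function).

```python
import re


def normalize(text):
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub("[^a-z]", " ", text)
    return " ".join(text.split())


sentence = "That's even less plausible ."

# Filtering first turns the apostrophe into a space and leaves a stray "s".
print(normalize(sentence))

# Expanding the contraction first avoids the stray token.
print(normalize(sentence.lower().replace("that's", "that is")))
```
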
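English text is tokenized differently from Chinese text, and the method can be chosen to suit the input. If words are separated from punctuation and other characters by spaces, as in "a little of both .", the built-in split() method is enough. If they are not, as in "a little of both.", use word_tokenize() from the nltk library. nltk is easy to install: on Windows, run pip install nltk. Note that word_tokenize() also relies on NLTK's Punkt tokenizer data, which has to be downloaded separately with nltk.download().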

# Words and punctuation are separated by spaces
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness ."
text.split()

['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']

# Words and punctuation are not separated by spaces
from nltk.tokenize import word_tokenize
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness."
word_tokenize(text)

['part', 'of', 'the', 'charm', 'of', 'satin', 'rouge', 'is', 'that', 'it', 'avoids', 'the', 'obvious', 'with', 'humour', 'and', 'lightness', '.']
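
When nltk is not available, a regular expression can serve as a rough stand-in for simple inputs. This is only a sketch: simple_tokenize is an illustrative helper, and it does not handle contractions or abbreviations the way word_tokenize() does.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Both spellings of the example sentence yield the same tokens.
print(simple_tokenize("a little of both."))
print(simple_tokenize("a little of both ."))
```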

from enchant.checker import SpellChecker
chkr = SpellChecker("en_US")
chkr.set_text("Many peope likee to watch in the Name of People.")
for err in chkr:
    print("ERROR:", err.word)

ERROR: peope
ERROR: likee
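
The checker above only reports misspelled words. As a rough illustration of how a correction could then be chosen, the sketch below uses the standard-library difflib module against a tiny hand-written vocabulary. Both VOCABULARY and suggest are assumptions made for this example; a real system would use a full dictionary.

```python
import difflib

# Toy vocabulary (an assumption); a real checker uses a full dictionary.
VOCABULARY = ["many", "people", "like", "to", "watch", "in", "the", "name", "of"]


def suggest(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest vocabulary word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for wrong in ["peope", "likee"]:
    print(wrong, "->", suggest(wrong))
```
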
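Stemming and lemmatization are distinctive steps in English text preprocessing. Both aim to reduce a word to its base form. Stemming is the more aggressive of the two: the stem it produces is not necessarily a real word. For example, the stem of "imaging" may come out as "imag", which is not a word. Lemmatization is more conservative and generally only reduces a word to a form that is itself a valid word. In nltk, the available stemmers include PorterStemmer, LancasterStemmer, and SnowballStemmer. SnowballStemmer is the recommended choice; it supports many languages, though not Chinese.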

from nltk.stem.porter import PorterStemmer
stem_porter = PorterStemmer()
stem_porter.stem('countries')  # output: 'countri'

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("countries")  # output: 'countri'

from nltk.stem.wordnet import WordNetLemmatizer
stem_wordnet = WordNetLemmatizer()
stem_wordnet.lemmatize('countries')  # output: 'country'
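
To see why a stemmer can return a non-word such as "countri", it helps to look at one group of suffix rules in isolation. The sketch below implements only the plural-stripping rules from step 1a of the original Porter algorithm (SSES to SS, IES to I, SS to SS, S to nothing). strip_plural is an illustrative name; the real stemmers apply many further steps, and library implementations may add refinements.

```python
def strip_plural(word):
    """Apply the Porter step 1a suffix rules; the first matching rule wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # countries -> countri
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["countries", "caresses", "caress", "cats"]:
    print(w, "->", strip_plural(w))
```

The rule for "ies" is purely mechanical, which is why the result is "countri" and not the dictionary word "country" that a lemmatizer returns.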


from nltk.corpus import stopwords
stop_words = stopwords.words("english")
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness"
words = [w for w in text.split() if w not in stop_words]
' '.join(words)

'part charm satin rouge avoids obvious humour lightness'
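
stopwords.words() returns a list, so every membership test scans it linearly. Converting the stop words to a set gives constant-time lookups, which matters on large corpora. The sketch below stays self-contained by using a small hand-written set; STOP_WORDS and remove_stop_words are assumptions for illustration, not nltk's actual list or API.

```python
# Hand-written subset (an assumption); in practice use set(stopwords.words("english")).
STOP_WORDS = {"of", "the", "is", "that", "it", "with", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens that appear in the stop word set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness"
print(remove_stop_words(text))
```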

import re
import time
import warnings

import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')

path_neg = './polarity_data/rt-polarity.neg'
path_pos = './polarity_data/rt-polarity.pos'

# Contractions and their expansions (applied in this order, after lowercasing)
CONTRACTIONS = {
    "it's": "it is", "i'm": "i am", "he's": "he is", "she's": "she is",
    "we're": "we are", "they're": "they are", "you're": "you are", "that's": "that is",
    "this's": "this is", "can't": "can not", "don't": "do not", "doesn't": "does not",
    "we've": "we have", "i've": "i have", "isn't": "is not", "won't": "will not",
    "hasn't": "has not", "wasn't": "was not", "weren't": "were not", "let's": "let us",
}


# Text preprocessing
def text_preprocessing(path):
    # Read the data, one review per line
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        text = [line.strip() for line in f]

    # Expand English contractions
    text_abbreviation = []
    for item in text:
        item = item.lower()
        for short, full in CONTRACTIONS.items():
            item = item.replace(short, full)
        text_abbreviation.append(item)

    # Remove punctuation, digits and other non-letter characters
    text_clear_str = []
    for item in text_abbreviation:
        item = re.sub("[^a-zA-Z]", " ", item)
        text_clear_str.append(' '.join(item.split()))

    text_clear_str_stem_del_stopwords = []
    stem_porter = PorterStemmer()  # stemmer
    stop_words = stopwords.words("english")  # stop words
    # Tokenize, remove stop words, then stem
    for item in text_clear_str:
        words_token = word_tokenize(item)  # tokenization
        words = [stem_porter.stem(w) for w in words_token if w not in stop_words]
        text_clear_str_stem_del_stopwords.append(' '.join(words))
    return text_clear_str_stem_del_stopwords


# time.clock() was removed in Python 3.8; time.perf_counter() replaces it
start_time1 = time.perf_counter()
text_neg = text_preprocessing(path_neg)
text_pos = text_preprocessing(path_pos)
end_time1 = time.perf_counter()
print("the time of text preprocessing is %.2f s" % (end_time1 - start_time1))

text = text_neg + text_pos


# Feature extraction
def features_extraction(text):
    vectors = TfidfVectorizer()
    # toarray() returns a dense ndarray (todense() would return np.matrix)
    features = vectors.fit_transform(text).toarray()
    return features


start_time2 = time.perf_counter()
features = features_extraction(text)
end_time2 = time.perf_counter()
print("the time of features extracting is %.2f s" % (end_time2 - start_time2))
print('.' * 40)

m = len(text_neg)
n = len(text_pos)
# Labels: 0 for negative reviews, 1 for positive reviews
labels = np.vstack((np.zeros((m, 1)), np.ones((n, 1))))
# Data set: features with the label as the last column
data = np.hstack((features, labels))
# Shuffle the data set
np.random.shuffle(data)
# Sample features
features = data[:, :-1]
# Sample labels
labels = data[:, -1]
print(features.shape)
print(labels.shape)

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Logistic regression
start_time3 = time.perf_counter()
lr = LogisticRegression().fit(x_train, y_train)
end_time3 = time.perf_counter()
# Training time
print("the time of LogisticRegression training is %.2f s" % (end_time3 - start_time3))

y_pred = lr.predict(x_test)
print("the f1_score of LogisticRegression is %.4f" % f1_score(y_test, y_pred))
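
The feature extraction step relies on TF-IDF weighting. To make the numbers less of a black box, here is a standard-library sketch of the weighting that TfidfVectorizer applies with its default settings as I understand them: raw term counts, a smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The tfidf function is an illustrative helper; it splits on whitespace, whereas the vectorizer's own default tokenizer ignores single-character tokens.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); rows[i][j] is the weight of term j in doc i."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[w] * idf[w] for w in vocab]
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


vocab, rows = tfidf(["good film", "bad film", "good good plot"])
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

Terms that occur in fewer documents receive a larger weight, so in the second document "bad" outweighs "film".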


import re

text = '最后是12月12日《篮球先锋报》的新闻报道“湖”涂开营。到底是保罗还是霍华德,\
于是,湖人放弃追逐保罗,又把奥多姆送去小牛,目的就是抢魔兽。这一次,他们能如愿吗?标签:$LOTOzf$'
# Remove digits, punctuation, Latin letters and other special symbols
text = re.sub("[0-9《》“”‘’。、,?!——$¥#@%……&*^()a-zA-Z<>;:/]", "", text)
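
Listing every unwanted character is easy to get wrong, because any symbol missing from the list survives. An alternative is to whitelist Chinese characters instead. The sketch below assumes the range \u4e00-\u9fa5, which covers the commonly used CJK Unified Ideographs; rarer characters in extension blocks fall outside it. keep_chinese is an illustrative name.

```python
import re


def keep_chinese(text):
    """Drop every character outside the common CJK ideograph range."""
    return re.sub(r"[^\u4e00-\u9fa5]", "", text)


print(keep_chinese("标签:$LOTOzf$ 12月12日《篮球先锋报》"))
```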


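Segmentation is done with jieba.cut and jieba.cut_for_search. Both return an iterable generator, so a for loop yields each segmented word (unicode); alternatively, jieba.lcut and jieba.lcut_for_search return a list directly.

jieba.cut and jieba.lcut accept 3 parameters:
sentence: the string to segment (a unicode or UTF-8 string, or a GBK string)
cut_all: whether to use full mode; defaults to False
HMM: whether to use the HMM model; defaults to True

jieba.cut_for_search and jieba.lcut_for_search accept 2 parameters:
sentence: the string to segment (a unicode or UTF-8 string, or a GBK string)
HMM: whether to use the HMM model; defaults to True
Avoid GBK strings where possible; they may be unpredictably mis-decoded as UTF-8.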

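jieba is one of the most widely used Chinese word segmentation libraries for Python. It offers three segmentation modes: accurate mode, full mode, and search engine mode.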

import jieba

sentence = '他来到北京大学参加暑期夏令营'
# Accurate mode
print(list(jieba.cut(sentence, cut_all=False)))

['他', '来到', '北京大学', '参加', '暑期', '夏令营']


sentence = '他来到北京大学参加暑期夏令营'
# Full mode
print(list(jieba.cut(sentence, cut_all=True)))

['他', '来到', '北京', '北京大学', '大学', '参加', '暑期', '夏令', '夏令营']


sentence = '他毕业于北京大学机电系,后来在一机部上海电器科学研究所工作'
# Search engine mode
print(list(jieba.cut_for_search(sentence)))

['他', '毕业', '于', '北京', '大学', '北京大学', '机电', '系', ',', '后来', '在', '一机部', '上海', '电器', '科学', '研究', '研究所', '工作']
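
The difference between accurate mode and full mode can be illustrated without jieba by a toy dictionary lookup. This is only a sketch to build intuition, not jieba's algorithm: jieba relies on a large frequency dictionary and, for unknown words, an HMM, whereas the code below uses greedy forward maximum matching. DICTIONARY, cut_longest and cut_full are hypothetical names.

```python
# Toy dictionary (an assumption); jieba ships a far larger one with frequencies.
DICTIONARY = {"他", "来到", "北京", "北京大学", "大学", "参加", "暑期", "夏令", "夏令营"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)


def cut_full(sentence):
    """Full-mode flavour: list every dictionary word found at every position."""
    words = []
    for start in range(len(sentence)):
        last_end = min(start + MAX_WORD_LEN, len(sentence))
        for end in range(start + 1, last_end + 1):
            if sentence[start:end] in DICTIONARY:
                words.append(sentence[start:end])
    return words


def cut_longest(sentence):
    """Accurate-mode flavour: greedy forward maximum matching, no overlaps."""
    words, start = [], 0
    while start < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - start), 0, -1):
            piece = sentence[start:start + length]
            # Fall back to a single character when no dictionary word matches.
            if piece in DICTIONARY or length == 1:
                words.append(piece)
                start += length
                break
    return words


sentence = "他来到北京大学参加暑期夏令营"
print(cut_longest(sentence))
print(cut_full(sentence))
```

The first call produces one non-overlapping segmentation, while the second also emits the shorter words nested inside longer ones, which is the behaviour full mode trades precision for.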


import jieba

# Stop words
stop_words = ['于', '后来', '在']
sentence = '他毕业于北京大学机电系后来在一机部上海电器科学研究所工作'
# Segment the sentence
sent_cut = list(jieba.cut(sentence))
# Remove stop words
text = [w for w in sent_cut if w not in stop_words]
print('Segmented:', sent_cut)
print('Stop words removed:', text)

Segmented: ['他', '毕业', '于', '北京大学', '机电', '系', '后来', '在', '一机部', '上海', '电器', '科学', '研究所', '工作']
Stop words removed: ['他', '毕业', '北京大学', '机电', '系', '一机部', '上海', '电器', '科学', '研究所', '工作']
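
In practice the stop word list is usually kept in a text file with one word per line. The sketch below shows one way to load such a file into a set; load_stop_words and the file name mentioned in the comment are hypothetical, and io.StringIO stands in for a real file so that the example is self-contained.

```python
import io


def load_stop_words(lines):
    """Build a set of stop words from an iterable of lines, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


# io.StringIO stands in for a file opened with
# open("stop_words.txt", encoding="utf-8"); that file name is hypothetical.
stop_words = load_stop_words(io.StringIO("于\n后来\n\n在\n"))

tokens = ["他", "毕业", "于", "北京大学", "机电", "系", "后来", "在", "一机部"]
print([w for w in tokens if w not in stop_words])
```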


