Training a Chinese Wiki Model

1. Install the dependencies

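numpy: a package for computing with multi-dimensional arrays. For the basic operations, see: https://blog.csdn.net/cxmscb/article/details/54583415

scipy: a scientific computing package built on numpy. Its statistics functions cover many common statistical routines as well as both continuous and discrete random variables. Install it after numpy.

gensim: a Python natural language processing library that can turn documents into vectors using models such as TF-IDF, LDA, and LSI. Install it after scipy.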
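Before moving on, it can help to confirm that all three packages are importable. The following is a minimal sketch, not part of the original walkthrough; the helper name check_dependencies is made up for illustration, and it only checks that each package can be found, not which version is installed.

```python
import importlib.util

# Packages in the order the text installs them.
REQUIRED = ["numpy", "scipy", "gensim"]


def check_dependencies(names=REQUIRED):
    """Return {package name: True if the package can be found, else False}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(name, "OK" if ok else "missing (try: pip install %s)" % name)
```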

Download the Chinese corpus

Download link: download page for the Chinese corpus data

Convert the Chinese corpus (XML) to txt

from gensim.corpora import WikiCorpus

# Convert the training set (XML) to txt
# Parameters: path of the wiki training set, path of the output txt file
def translateTheText(xml_path, txt_path):
    path_to_wiki_dump = xml_path
    wiki_corpus = WikiCorpus(path_to_wiki_dump, dictionary={})
    num = 0
    with open(txt_path, 'w', encoding='utf-8') as output:
        # get_texts() yields one wiki article at a time; each is written as one line of text
        for text in wiki_corpus.get_texts():
            output.write(' '.join(text) + '\n')
            num += 1
            if num % 10000 == 0:
                print('Processed %d articles' % num)
    print('Wiki corpus conversion finished')

If you see this warning:
UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
you can add the following code before importing gensim:

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

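This step takes quite a while, so be patient and go watch a TV show or something...

Once the conversion finishes, you will see the output file wiki_text.txt.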
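The converted file is large, so opening it in an editor is not always practical. A minimal sketch for looking at just the first few lines is shown here; peek_lines is a hypothetical helper, not something from gensim, and it assumes the file was written as UTF-8 with one article per line as in the function above.

```python
from itertools import islice


def peek_lines(txt_path, n=3, width=80):
    """Return the first n lines of txt_path, each cut to at most width characters."""
    with open(txt_path, 'r', encoding='utf-8') as f:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip('\n')[:width] for line in islice(f, n)]
```

For example, peek_lines('wiki_text.txt') returns the start of the first three articles, which is enough to see whether the text is still in Traditional Chinese.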

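Convert Traditional Chinese in the txt file to Simplified Chinese

Download the opencc tool: opencc download page

Unzip it, then go to the directory that contains the wiki_text.txt file from the previous step.
Open cmd and run: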

(path to opencc)\opencc.exe -i (path to the txt file)\wiki_text.txt -o (path to the txt file)\wiki_text2.txt -c (path to opencc)\t2s.json
It finishes after a short while.
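The same command can also be launched from Python, which keeps the whole pipeline in one script. This is only a sketch under assumptions: the two helper names are made up, the directories passed in are placeholders you must supply, and the -i, -o, and -c arguments are simply the ones from the command above.

```python
import subprocess
from pathlib import PureWindowsPath


def build_opencc_command(opencc_dir, txt_dir,
                         src="wiki_text.txt", dst="wiki_text2.txt"):
    """Build the opencc command line from the walkthrough as an argument list."""
    opencc_dir = PureWindowsPath(opencc_dir)
    txt_dir = PureWindowsPath(txt_dir)
    return [
        str(opencc_dir / "opencc.exe"),
        "-i", str(txt_dir / src),             # input: Traditional Chinese text
        "-o", str(txt_dir / dst),             # output: Simplified Chinese text
        "-c", str(opencc_dir / "t2s.json"),   # Traditional-to-Simplified config
    ]


def convert_to_simplified(opencc_dir, txt_dir):
    """Run opencc; check=True raises CalledProcessError if it exits with an error."""
    subprocess.run(build_opencc_command(opencc_dir, txt_dir), check=True)
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.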

Segment the txt file into words

import jieba

# Segment the sentences in the txt file into words
# Parameters: path of the txt file, path of the segmented output file
def getCutWords(txt_path, seg_txt):
    stopword_set = set()  # left empty here, so no words are filtered out
    with open(txt_path, 'r', encoding='utf-8') as content, \
            open(seg_txt, 'w', encoding='utf-8') as output:
        for texts_num, line in enumerate(content):  # enumerate numbers each line
            line = line.strip('\n')
            words = jieba.cut(line, cut_all=False)
            for word in words:
                if word not in stopword_set:
                    output.write(word + ' ')
            output.write('\n')
            if (texts_num + 1) % 10000 == 0:
                print("Segmented %d lines" % (texts_num + 1))
    print('Text segmentation finished')

This step also takes a while, so be patient; you could watch a livestream in the meantime...
When it is done, you will have a wiki_seg.txt file.
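Note that stopword_set in getCutWords starts out empty, so the stop word check never removes anything. If you do want to filter stop words, one possible approach is sketched here. It assumes a UTF-8 file with one stop word per line; the helper names and any file name such as stopwords.txt are hypothetical and not part of the original post.

```python
def load_stopwords(stopword_path):
    """Read one stop word per line and return them as a set, skipping blank lines."""
    with open(stopword_path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def filter_words(words, stopword_set):
    """Keep the words that are neither stop words nor pure whitespace."""
    return [w for w in words if w.strip() and w not in stopword_set]
```

Inside getCutWords, the empty set could then be replaced with stopword_set = load_stopwords('stopwords.txt').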

Train the model

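from gensim.models import word2vec

def getWordsNumber(seg_path, model_path):
    sentences = word2vec.LineSentence(seg_path)
    # vector_size sets the dimensionality of the word vectors, not the number of network layers
    # (in gensim versions before 4.0 this parameter is named size)
    model = word2vec.Word2Vec(sentences, vector_size=250, min_count=5)
    model.save(model_path)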

This step takes a while as well, so wait a bit.
When it finishes, three files will appear.
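To make the two inputs to training more concrete: LineSentence treats each line of the segmented file as one sentence of whitespace-separated tokens, and min_count=5 drops every word that occurs fewer than 5 times in the whole corpus. The following pure-Python sketch imitates that behaviour for illustration only. It is not gensim's implementation, the function names are made up, and it leaves out details such as how LineSentence handles very long lines.

```python
from collections import Counter


def iter_sentences(seg_path):
    """Yield each line of seg_path as a list of whitespace-separated tokens."""
    with open(seg_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield line.split()


def vocabulary(sentences, min_count=5):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}
```

Running vocabulary(iter_sentences('wiki_seg.txt')) gives a rough idea of how many distinct words would survive the min_count threshold before you commit to a long training run.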

With that, the model training is complete.