NLP Basics: A Code Implementation of Spelling Correction

# Step 1: Build the vocabulary. vocab.txt can be found online or scraped yourself.
with open('./vocab.txt') as f:
    vocab = {line.rstrip() for line in f}
vocab

Output:
{'widths', 'truer', …}

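# Step 2: Generate the valid words at edit distance 1
# Define a function that generates every candidate word at edit distance 1
def generate_candidates(word):
    """
    word: the given (misspelled) input
    Returns the list of all valid candidates, i.e. those found in vocab.
    """
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # delete: drop the first character of the right part (skip the empty right part)
    deletes = [l + r[1:] for l, r in splits if r]
    # insert: put one letter between the left and right parts
    inserts = [l + c + r for l, r in splits for c in letters]
    # replace: swap the first character of the right part for another letter
    # (replacing a letter with itself reproduces the input word)
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    words = set(replaces + inserts + deletes)
    # keep only the strings that are real words
    candidates = [w for w in words if w in vocab]
    return candidates

generate_candidates('apple')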

Output (the order may vary, because the candidates are collected from a set):
['apples', 'apply', 'apple', 'ample']
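The correction step later skips a word when no distance-1 candidate exists and notes that distance-2 candidates would be a better fallback. A minimal sketch of that fallback follows. The names `edits1`, `generate_candidates_within_2`, and `toy_vocab` are hypothetical, and the vocabulary is passed as a parameter here rather than read from the global `vocab`.

```python
import string

def edits1(word):
    """Return every string one delete, insert, or replace away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in letters]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    return set(deletes + inserts + replaces)

def generate_candidates_within_2(word, vocab):
    """Prefer distance-1 candidates; fall back to distance 2 if none are valid."""
    near = edits1(word) & vocab
    if near:
        return near
    # Apply edits1 to every distance-1 string to reach distance 2.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)} & vocab

# Hypothetical stand-in for vocab.txt
toy_vocab = {"apple", "apply", "ample", "maple"}

print(generate_candidates_within_2("appl", toy_vocab))   # found at distance 1
print(generate_candidates_within_2("aplpe", toy_vocab))  # needs the distance-2 fallback
```

The distance-2 set grows quickly with word length, so intersecting with the vocabulary immediately, as done here, keeps the result small.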

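# Step 3: Build a language model (bigram) from a corpus
# Requires the Reuters corpus; run nltk.download('reuters') once if it is missing.
from nltk.corpus import reuters
categories = reuters.categories()
corpus = reuters.sents(categories=categories)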


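term_count = {}
bigram_count = {}
for doc in corpus:
    doc = ['<s>'] + doc  # mark the start of each sentence
    for i in range(0, len(doc) - 1):
        term = doc[i]
        bigram = doc[i:i + 2]
        # count the word as the first element of a bigram
        if term in term_count:
            term_count[term] += 1
        else:
            term_count[term] = 1
        # the bigram key is the two words concatenated without a separator
        bigram = ''.join(bigram)
        if bigram in bigram_count:
            bigram_count[bigram] += 1
        else:
            bigram_count[bigram] = 1
print(bigram)  # the last bigram key that was counted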

6mln
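Joining the two words with no separator means that different word pairs can share one key, for example ("a", "bc") and ("ab", "c"). One way around this, sketched below under my own assumptions, is to key the counts by tuples and use `collections.Counter`. The helper names `count_ngrams` and `bigram_log_prob` and the toy corpus are hypothetical; the smoothing is ordinary add-one smoothing.

```python
import math
from collections import Counter

def count_ngrams(sentences):
    """Count context words and (prev, cur) bigrams over tokenized sentences."""
    term_count, bigram_count = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + list(sent)
        term_count.update(tokens[:-1])                # each word as a bigram context
        bigram_count.update(zip(tokens, tokens[1:]))  # tuple keys cannot collide
    return term_count, bigram_count

def bigram_log_prob(prev, cur, term_count, bigram_count):
    """Add-one smoothed log P(cur | prev)."""
    V = len(term_count)
    return math.log((bigram_count[(prev, cur)] + 1) / (term_count[prev] + V))

# Hypothetical toy corpus standing in for reuters.sents()
toy_corpus = [["i", "like", "apples"], ["i", "like", "pears"]]
term_count, bigram_count = count_ngrams(toy_corpus)

# String keys are ambiguous, tuple keys are not.
print("".join(["a", "bc"]) == "".join(["ab", "c"]))  # True: same string key
print(math.exp(bigram_log_prob("i", "like", term_count, bigram_count)))
print(math.exp(bigram_log_prob("i", "pears", term_count, bigram_count)))
```

A `Counter` returns 0 for a missing key, so unseen bigrams need no separate branch.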

# Step 4: Estimate the probability of each user typo.
# (In practice, user logs are used to count which misspellings occur for each
# correct word, and how often, which gives the probabilities
# P(mistake1|correct), P(mistake2|correct), ...)
# {'raining': {'rainning': 0.5, 'raning': 0.5}, ...}
# This project assumes that all misspellings of a word are equally likely.
channel_prob = {}
with open('spell-errors.txt') as f:
    for line in f:
        item = line.split(':')
        correct = item[0].strip()
        mistakes = [misword.strip() for misword in item[1].strip().split(',')]
        channel_prob[correct] = {}
        for mis in mistakes:
            channel_prob[correct][mis] = 1 / len(mistakes)

Output (excerpt of channel_prob):
{'raining': {'rainning': 0.5, 'raning': 0.5}, 'writings': {'writtings': 1.0}, 'disparagingly': {'disparingly': 1.0}, 'yellow': {'yello': 1.0}, 'four': {'forer': 0.2, 'fours': 0.2, 'fuore': 0.2, 'fore*5': 0.2, 'for*4': 0.2}, 'woods': {'woodes': 1.0}, 'hanging': {'haing': 1.0}, 'aggression': {'agression': 1.0}, 'looking': {'loking': 0.1, 'begining': 0.1, 'luing': 0.1, 'look*2': 0.1, 'locking': 0.1, 'lucking': 0.1, 'louk': 0.1, 'looing': 0.1, 'lookin': 0.1, 'liking': 0.1}, 'misdemeanors': {'misdameanors': 0.5, 'misdemenors': 0.5}, ...}
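Keys such as 'fore*5' and 'for*4' in the output suggest that some entries in spell-errors.txt carry a `*N` suffix, which the code above keeps as part of the misspelling. If that suffix is an occurrence count (an assumption, not something the source states), the counts could weight the channel probabilities instead of the uniform assumption. A minimal sketch follows; the function name `parse_spell_errors_line` is hypothetical and the sample line is reconstructed from the output above.

```python
def parse_spell_errors_line(line):
    """Parse 'correct: mis1, mis2*N, ...' into (correct, {mistake: probability}).

    Assumption: a '*N' suffix means the misspelling was observed N times;
    entries without a suffix count once.
    """
    correct, _, rest = line.partition(":")
    counts = {}
    for entry in rest.split(","):
        entry = entry.strip()
        if not entry:
            continue
        mistake, star, n = entry.partition("*")
        counts[mistake] = counts.get(mistake, 0) + (int(n) if star else 1)
    total = sum(counts.values())
    return correct.strip(), {m: c / total for m, c in counts.items()}

# Sample line reconstructed from the 'four' entry in the output above
correct, probs = parse_spell_errors_line("four: forer, fours, fuore, fore*5, for*4")
print(correct, probs)
```

With this parsing, a lookup for the typo "fore" would hit the key 'fore', whereas the uniform version stores it under 'fore*5' and falls back to the default probability.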

# Step 5: Correct the misspelled words in the test data.
import numpy as np

V = len(term_count)  # number of distinct words in the corpus

def bigram_log_prob(prev_word, word):
    # Add-one smoothed log P(word | prev_word)
    #   = log((count(prev_word, word) + 1) / (count(prev_word) + V))
    bigram = ''.join([prev_word, word])
    return np.log((bigram_count.get(bigram, 0) + 1) / (term_count.get(prev_word, 0) + V))

with open("testdata.txt", 'r') as file:
    for line in file:
        items = line.rstrip().split('\t')
        line = items[2].rstrip('.').split()  # line = ["I", "like", "playing"]
        for idx, word in enumerate(line):
            # Words missing from the vocabulary are treated as misspellings
            if word not in vocab:
                # Step 1: generate all valid candidates
                candidates = generate_candidates(word)
                if len(candidates) < 1:
                    # Ideally we would fall back to candidates at edit distance 2;
                    # here a word with no candidates is simply skipped.
                    continue
                probs = []
                # Score every candidate in log space:
                #   score = log p(mistake|correct) + log p(correct)
                # where p(correct) comes from the bigram language model.
                # The candidate with the highest score is returned.
                for candi in candidates:
                    prob = 0
                    # a. channel probability
                    if candi in channel_prob and word in channel_prob[candi]:
                        prob += np.log(channel_prob[candi][word])
                    else:
                        prob += np.log(0.0001)
                    # b. language model probability
                    # log P(candi | previous word); use '<s>' at the start of a sentence
                    prev_word = line[idx - 1] if idx > 0 else '<s>'
                    prob += bigram_log_prob(prev_word, candi)
                    # log P(next word | candi); skipped when the misspelling is the last word
                    if idx + 1 < len(line):
                        prob += bigram_log_prob(candi, line[idx + 1])
                    # probs[i] is the log score of candidates[i]
                    probs.append(prob)
                max_idx = probs.index(max(probs))
                print(word, candidates[max_idx])
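For reference, the score computed in Step 5 follows the usual noisy-channel form with an add-one smoothed bigram language model. It can be written as below, where the symbols are my own notation rather than the author's: w is the misspelled word, c a candidate, w_prev and w_next the neighbouring words, and V the number of distinct words in the corpus.

```latex
\hat{c} = \arg\max_{c \,\in\, \mathrm{candidates}(w)}
  \Big[ \log P(w \mid c) + \log P(c \mid w_{\mathrm{prev}}) + \log P(w_{\mathrm{next}} \mid c) \Big],
\qquad
P(b \mid a) \approx \frac{\mathrm{count}(a, b) + 1}{\mathrm{count}(a) + V}
```

The first term is the channel probability from Step 4, and the other two are the bigram terms from Step 3. The last term is dropped when w is the final word of the sentence.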
