English Text Keyword Extraction: Keyword Extraction with NLTK

A record of the code, which extracts multi-word noun phrases from English text as keyword candidates:

"""
__author__:shuangrui Guo
__description__:
"""
import sys
import nltk
import json
from tqdm import tqdm
# multiprocessing package
import multiprocessing
import argparse
import os
import re
SUFFIX_NLTK = '__nltk.json'
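# Clean the text
def clean_text(text):
    # Replace every run of non-ASCII characters with a space
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # Pad punctuation with spaces so it is tokenized separately
    text = re.sub(r"([.,!:?()])", r" \1 ", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s{2,}", " ", text)
    # Split hyphenated words
    text = text.replace("-", " ")
    return text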
# Count the lines in a file
def get_line_count(inFile):
    lines = 0
    with open(inFile, 'r') as f:
        while f.readline():
            lines += 1
    return lines
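# Collect the NP chunks from the parse tree.
# Single-word phrases are skipped by default (skip_single_word=True).
def get_nps_from_tree(tree, words_original, attachNP=False, skip_single_word=True):
    nps = []
    st = 0  # token offset of the current subtree in words_original
    for subtree in tree:
        if isinstance(subtree, nltk.tree.Tree):
            if subtree.label() == 'NP':
                np = subtree.leaves()
                ed = st + len(np)
                if not skip_single_word or len(np) > 1:
                    nps.append({
                        'st': st,
                        'ed': ed,
                        'text': ' '.join(words_original[st:ed])
                    })
                    if attachNP:
                        nps[-1]['np'] = np
            st += len(subtree.leaves())
        else:
            # A plain (word, tag) leaf outside any chunk
            st += 1
    return nps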
# Check that each NP's text matches the token span given by its offsets
def validate_nps(nps, words_original):
    validated_nps = []
    for np in sorted(nps, key=lambda x: x['st']):
        st = np['st']
        ed = np['ed']
        token_span = words_original[st:ed]
        # e.g. 'A polynomial time algorithm for the Lambek calculus with brackets of  bounded order'
        if ' '.join(token_span).strip() != np['text'].strip():
            # Print the mismatch and stop, returning what has been validated so far
            print(' '.join(token_span))
            print(np)
            return validated_nps
        validated_nps.append(np)
    return validated_nps
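# Predefined chunk grammar, applied in two stages.
# NBAR is any run of nouns and adjectives that ends in a noun.
# NP is intended to be an NBAR, or two NBARs joined by a preposition (tag IN).
GRAMMAR = r"""
NBAR:
    {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
NP:
    {<NBAR>}
    {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
# Chunk parser built from the grammar
_PARSER = nltk.RegexpParser(GRAMMAR)

def get_nps_nltk_raw(doc):
    doc = clean_text(doc)
    # Tokenize with NLTK (the tokenizer and POS tagger data must be
    # downloaded beforehand with nltk.download)
    words_original = nltk.word_tokenize(doc)
    #words_original = doc.split(' ')
    try:
        parse_tree = _PARSER.parse(nltk.pos_tag(words_original))
    except Exception as e:
        # Report the failure and return no phrases for this document
        print('chunk parsing failed: %s' % e, file=sys.stderr)
        return []
    nps = get_nps_from_tree(parse_tree, words_original)
    return nps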
# Read the input file and write the results, one JSON list per input line
def writeToJson(inFile, outFile):
    # Open the input file for reading and the output file for writing
    with open(inFile, 'r') as fin, open(outFile, 'w') as fout:
        total = get_line_count(inFile)
        for line in tqdm(fin, total=total):
            doc = line.strip('\r\n')
            # Process each line; an empty line yields an empty list
            if doc:
                nps = get_nps_nltk_raw(doc)
            else:
                nps = []
            fout.write(json.dumps(nps))
            fout.write('\n')

if __name__ == '__main__':
    inFile = "./patent_abstract.txt"
    outFile = inFile + SUFFIX_NLTK
    writeToJson(inFile, outFile)
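
To see what the cleaning step does before tokenization, here is a minimal sketch that runs the same four substitutions as clean_text on a single sentence. The sample sentence and the names sample and cleaned are made up for illustration; they are not from the original script.

```python
import re

def clean_text(text):
    # Same substitutions, in the same order, as the script's clean_text
    text = re.sub(r"[^\x00-\x7F]+", " ", text)    # non-ASCII runs -> space
    text = re.sub(r"([.,!:?()])", r" \1 ", text)  # pad punctuation
    text = re.sub(r"\s{2,}", " ", text)           # collapse whitespace
    text = text.replace("-", " ")                 # split hyphenated words
    return text

# Hypothetical input with curly quotes, a hyphen and parentheses
sample = "A \u201csmart\u201d high-speed sensor (v2) is disclosed."
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that the hyphen replacement runs after the whitespace collapse, so a free-standing " - " in the input leaves three consecutive spaces behind. That is harmless when the text is later passed to nltk.word_tokenize, but it would matter if the text were split on single spaces instead.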

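The chunk grammar is the least obvious part of the script, so the following sketch applies it to a sentence whose part-of-speech tags are written by hand. This avoids the tokenizer and tagger data downloads and isolates the chunking step. The tagged list and the names parser and phrases are assumptions made for the example; in the script the tags come from nltk.pos_tag.

```python
import nltk
from nltk.tree import Tree

# Same two-stage grammar as in the script
GRAMMAR = r"""
NBAR:
    {<NN.*|JJ>*<NN.*>}
NP:
    {<NBAR>}
    {<NBAR><IN><NBAR>}
"""

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output
tagged = [
    ("a", "DT"), ("polynomial", "JJ"), ("time", "NN"), ("algorithm", "NN"),
    ("is", "VBZ"), ("presented", "VBN"),
]

parser = nltk.RegexpParser(GRAMMAR)
tree = parser.parse(tagged)

# Keep only top-level NP chunks with more than one word,
# mirroring skip_single_word=True in get_nps_from_tree
phrases = []
for subtree in tree:
    if isinstance(subtree, Tree) and subtree.label() == "NP":
        words = [word for word, tag in subtree.leaves()]
        if len(words) > 1:
            phrases.append(" ".join(words))

print(phrases)
```

With these tags the determiner "a" and the verbs fall outside the NBAR pattern, so the loop should collect the single phrase "polynomial time algorithm".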