Named Entity Recognition Study Notes (spaCy/OpenNLP..)

  • spaCy
    • Environment
    • Implementation
  • NLTK
    • Environment
    • Implementation
  • Stanford NLP
    • Environment
    • Implementation
  • NER works
    • Spacy
      • Install
      • Run
      • Results
      • Entity types
    • NLTK
      • Install
      • Run
      • Results
      • Entity types
    • [Stanford NLP](https://nlp.stanford.edu/software/CRF-NER.shtml)
      • Install
      • Run
      • Results
      • Entity types
    • BERT-NER
      • Install
      • Run
      • Results
      • Entity types

spaCy

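API documentation

Environment

Only commands that a quick search does not turn up are listed here:

  1. Installing en_core_web_sm: the only method that worked for me was to download the package locally and run pip install + local path. (conda reported a successful install, but it did not work.)
  2. Installing textacy: python -m pip install textacy
    The following line then failed with TypeError: 'module' object is not callable:
    verb_phrases = textacy.extract.matches(doc, patterns=patterns)
    The cause is an API difference in the newer textacy release: as the error message says, textacy.extract.matches is now a module rather than a function. After reading the library source, changing the old call to the new one fixed it.
    Old: verb_phrases = textacy.extract.matches(doc, patterns=patterns)
    New: verb_phrases = textacy.corpus.extract.matches.token_matches(doclike=doc, patterns=patterns)

Implementation

Following the reference blog post, sections 2.4-2.8 ran successfully, including noun and verb recognition.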

NLTK

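Environment

  1. Error: NLTK: Resource punkt not found. Please use the NLTK Downloader to obtain the resource
    Fix: download the packages folder from gitee, and remember to unzip each zip file into a directory.

Implementation

Code for NLTK + Stanford NLP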

Stanford NLP

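Environment

Download the files as described in the article and change the paths to your local ones.

Implementation

Code for NLTK + Stanford NLP
NER tagging turned out to be very slow. Improvement: use sn.tag_sents() (see the referenced post).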
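The speed-up from sn.tag_sents() can be sketched as follows. In NLTK, every StanfordNERTagger.tag() call launches its own Java process, while tag_sents() tags a whole list of tokenized sentences in a single call. The helper name tag_all, the whitespace tokenization, and the FakeTagger stand-in are assumptions for illustration only; FakeTagger exists so the sketch runs without Java or the Stanford jar.

```python
from typing import List, Sequence, Tuple

Tagged = List[Tuple[str, str]]


def tag_all(tagger, texts: Sequence[str]) -> List[Tagged]:
    """Tag every text with one tag_sents() call instead of one tag() call per text."""
    # Whitespace splitting is a simplification; use a real tokenizer in practice.
    tokenized = [text.split() for text in texts]
    return tagger.tag_sents(tokenized)


class FakeTagger:
    """Hypothetical stand-in exposing the same tag_sents() interface."""

    def __init__(self):
        self.calls = 0  # counts how many times the "Java process" would start

    def tag_sents(self, sentences):
        self.calls += 1
        return [[(tok, "PERSON" if tok == "Obama" else "O") for tok in sent]
                for sent in sentences]


fake = FakeTagger()
result = tag_all(fake, ["Obama visited Paris", "It rained"])
print(result[0][0], fake.calls)
```

With the real tagger, tag_all(sn, texts) would replace a loop of sn.tag(...) calls, assuming sn is the StanfordNERTagger instance loaded in the Stanford NLP section.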

NER works

Most of the modules can be installed with pip install + package name.

Spacy

Install

  • spacy

  • pandas

  • en_core_web_sm: neither pip nor conda worked for me. Download the latest release package manually and run pip install + local path

Run

python spacy_NER.py

Results

After the Run step, the NER results are written to spacy_NER_result.csv.

Entity types

spaCy defines 18 entity types, but I only use 11 of them since they are more relevant to our program.

type_list = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP', 'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']

| ENT_TYPE (18 in total) | DESCRIPTION |
| --- | --- |
| CARDINAL | Numerals that do not fall under another type. |
| DATE | Absolute or relative dates or periods. |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| FAC | Buildings, airports, highways, bridges, etc. |
| GPE | Geopolitical entity, i.e. countries, cities, states. |
| LANGUAGE | Any named language. |
| LAW | Named documents made into laws. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| MONEY | Monetary values, including unit. |
| NORP | Nationalities or religious or political groups. |
| ORDINAL | "first", "second", etc. |
| ORG | Companies, agencies, institutions. |
| PERCENT | Percentage, including "%". |
| PERSON | People, including fictional. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| QUANTITY | Measurements, as of weight or distance. |
| TIME | Times smaller than a day. |
| WORK_OF_ART | Titles of books, songs, etc. |
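As a minimal sketch of how the 11 selected labels might be applied, the snippet below filters (text, label) pairs against type_list. The function names keep_relevant and spacy_entities are hypothetical and not taken from spacy_NER.py; spacy_entities assumes spaCy and en_core_web_sm are installed and is not called in the demonstration.

```python
from typing import Iterable, List, Tuple

# The 11 labels kept in this post.
type_list = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP', 'ORG',
             'PERSON', 'PRODUCT', 'WORK_OF_ART']


def keep_relevant(entities: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (text, label) pairs whose label is in type_list."""
    allowed = set(type_list)
    return [(text, label) for text, label in entities if label in allowed]


def spacy_entities(text: str) -> List[Tuple[str, str]]:
    """Run spaCy NER on text; needs spacy and en_core_web_sm installed."""
    import spacy  # imported lazily so keep_relevant works without spaCy
    nlp = spacy.load("en_core_web_sm")
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


# Hand-written sample pairs, so the filter can be tried without a model.
sample = [("Apple", "ORG"), ("2019", "DATE"), ("Paris", "GPE"), ("$5", "MONEY")]
filtered = keep_relevant(sample)
print(filtered)
```

On real text the two pieces would be combined as keep_relevant(spacy_entities(some_text)).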

NLTK

Install

  • pandas
  • re
  • nltk: if you hit the error Resource punkt not found. Please use the NLTK Downloader to obtain the resource, do the following:
    • download the packages folder from GitHub or gitee, and rename it to nltk_data
    • the error message in the terminal lists the paths NLTK searched; pick one of them and place the unzipped nltk_data folder there, e.g. 'D:\nltk_data'
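To check that the folder ended up in the right place, a small standard-library sketch like the one below can help. It assumes the layout nltk_data/tokenizers/punkt after unzipping; the function name find_punkt is hypothetical, and a temporary directory stands in for a real location such as D:\nltk_data. With NLTK installed, the real search paths are available as nltk.data.path.

```python
import os
import tempfile
from typing import Iterable, Optional


def find_punkt(search_paths: Iterable[str]) -> Optional[str]:
    """Return the first nltk_data directory containing tokenizers/punkt, else None."""
    for base in search_paths:
        if os.path.isdir(os.path.join(base, "tokenizers", "punkt")):
            return base
    return None


# Demonstration: a temporary directory plays the role of an nltk_data folder.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "nltk_data")
    os.makedirs(os.path.join(good, "tokenizers", "punkt"))
    missing = os.path.join(tmp, "elsewhere")
    found_ok = find_punkt([missing, good]) == good
    none_ok = find_punkt([missing]) is None

print(found_ok, none_ok)
```

If find_punkt(nltk.data.path) returns None, the folder is either in a location NLTK does not search or the zip files inside it were not unpacked.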

Run

python NLTK_NER.py

Results

After the Run step, the NER results are written to NLTK_NER_result.csv.

Entity types

There are 9 types in NLTK but I only use 5 of them since they’re more related to our program.

type_list = ['ORGANIZATION', 'PERSON', 'LOCATION', 'FACILITY', 'GPE']

| ENT_TYPE (9 in total) | EXAMPLES |
| --- | --- |
| ORGANIZATION | Georgia-Pacific Corp., WHO |
| PERSON | Eddy Bonte, President Obama |
| LOCATION | Murray River, Mount Everest |
| DATE | June, 2008-06-29 |
| TIME | two fifty am, 1:30 p.m |
| MONEY | 175 million Canadian Dollars, GBP 10.40 |
| PERCENT | twenty pct, 18.75 % |
| FACILITY | Washington Monument, Stonehenge |
| GPE | Geopolitical entity: South East Asia, Midlothian |

Stanford NLP

Install

  • re

  • nltk.tag

  • os

  • pandas

  • Stanford NER: download it from the Download index and unzip it to a local path, e.g. 'D://stanford-ner-2020-11-17'

  • java: use your local Java path

Make sure your local paths are correct, because they are used to load the NER model.

import os
from nltk.tag import StanfordNERTagger

# point NLTK at the local Java executable via the JAVAHOME environment variable
java_path = r'C:\Program Files\Java\jdk1.8.0_261\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load the Stanford NER 7-class model and jar from the unzipped download
sn = StanfordNERTagger('D://stanford-ner-2020-11-17/classifiers/english.muc.7class.distsim.crf.ser.gz', path_to_jar='D://stanford-ner-2020-11-17/stanford-ner.jar')

Run

python Stanford_NLP_NER.py

Results

After the Run step, the NER results are written to Stanford_NLP_NER_result.csv.

Entity types

There are 7 types in Stanford NLP but I only use 3 of them since they’re more related to our program.

type_list = ['LOCATION', 'PERSON', 'ORGANIZATION']

| ENT_TYPE (7 in total; no FACILITY or GPE) | EXAMPLES |
| --- | --- |
| Location | Murray River, Mount Everest |
| Person | Eddy Bonte, President Obama |
| Organization | Georgia-Pacific Corp., WHO |
| Money | 175 million Canadian Dollars, GBP 10.40 |
| Percent | twenty pct, 18.75 % |
| Date | June, 2008-06-29 |
| Time | two fifty am, 1:30 p.m |
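The tagger returns one (token, tag) pair per token, with O for tokens outside any entity, so multi-word names such as "Mount Everest" come back as separate tokens. A minimal sketch of merging them is shown below; the function name group_entities and the sample tags are illustrative assumptions, not code from Stanford_NLP_NER.py.

```python
from itertools import groupby
from typing import List, Tuple

type_list = ['LOCATION', 'PERSON', 'ORGANIZATION']


def group_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a tag, keeping only tags in type_list."""
    entities = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag in type_list:
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


# Hand-written sample in the (token, tag) shape the tagger produces.
tagged = [("Barack", "PERSON"), ("Obama", "PERSON"), ("visited", "O"),
          ("Mount", "LOCATION"), ("Everest", "LOCATION"), ("in", "O"),
          ("June", "DATE")]
entities = group_entities(tagged)
print(entities)
```

One limitation of this approach: because the tags carry no B-/I- prefix, two different entities of the same type that appear directly next to each other are merged into one.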

BERT-NER

Install

# Kaggle
!git clone -b dev https://github.com/kamalkraj/BERT-NER.git
!pip3 install -r /kaggle/working/BERT-NER/requirements.txt
# Local
git clone -b dev https://github.com/kamalkraj/BERT-NER.git
pip3 install -r BERT-NER/requirements.txt

Run

# Kaggle
!python /kaggle/working/BERT-NER/run_ner.py --data_dir=/kaggle/working/BERT-NER/data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1
# Local
python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out_base --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1

Since I don't have a GPU, I ran the training on Kaggle and obtained out_base successfully.


If you use the default parameters, you can simply download the pretrained BERT_BASE and BERT_LARGE models instead of training them yourself.

Then define a model and get NER outputs.

# BERT_NER.py
from bert import Ner  # bert.py from the BERT-NER repository

model_large = Ner("D:/pythonProject/BERT-NER-dev/out_large/") # local path

python BERT_NER.py

Results

After the Run step, the NER results are written to BERT_NER_result.csv.

Entity types

BERT-NER uses 11 labels:

["O", "B-MISC", "I-MISC",  "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]", "[SEP]"]
  1. PER: person
  2. LOC: location
  3. ORG: organization
  4. MISC: miscellaneous (consisting of diverse things or members)

BIO labels:

  • O: outside any entity
  • B-X: beginning of an X phrase
  • I-X: inside (continuation of) an X phrase

B-PER: "a person name begins here"

I-PER: "a person name continues"

O: "no name here"

  • NP: Noun Phrase
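
The BIO scheme above can be turned back into whole entities with a short decoder. This is a sketch under stated assumptions: the model output has already been split into parallel token and tag lists, and decode_bio is a hypothetical helper, not part of the BERT-NER repository.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and current_type == etype:
            # I-X continues the entity of type X that is currently open.
            current_tokens.append(token)
            continue
        if current_tokens:
            # Any other tag closes the open entity.
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I") and etype:
            # B-X (or a stray I-X with nothing open) starts a new entity.
            current_tokens, current_type = [token], etype
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Steve", "Jobs", "went", "to", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
spans = decode_bio(tokens, tags)
print(spans)
```

Tags without a B- or I- prefix, such as O, [CLS], and [SEP], never open an entity, so they are skipped.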
