The Porter Stemming Algorithm

This is the ‘official’ home page for distribution of the Porter Stemming Algorithm, written and maintained by its author, Martin Porter.

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
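As a rough illustration of what removing an inflexional ending looks like in practice, the sketch below implements only Step 1a of the 1980 paper (the plural rules SSES -> SS, IES -> I, SS -> SS, S -> nothing), applied to lowercased, whitespace-separated tokens. It is not the full algorithm and not the author's code; the names `step_1a` and `normalise` are hypothetical, and the tokenisation is an assumption made for the example.

```python
def step_1a(word: str) -> str:
    """Apply only Step 1a of the 1980 paper (plural endings).

    The rules are tried in order and the first suffix that matches wins.
    This is an illustrative fragment, not a complete Porter stemmer.
    """
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]  # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word       # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]  # S    -> ""   (cats -> cat)
    return word


def normalise(text: str) -> list[str]:
    """Hypothetical term normalisation: lowercase, split on whitespace, stem."""
    return [step_1a(token) for token in text.lower().split()]


print(normalise("Caresses ponies caress cats"))
# prints ['caress', 'poni', 'caress', 'cat']
```

In an Information Retrieval setting the same normalisation would be applied both to indexed documents and to queries, so that "ponies" in a query can match "pony"-derived forms that were reduced to the same stem at indexing time. For real use, one of the downloadable versions mentioned on this page should be preferred over a fragment like this one.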

The algorithm was originally described in Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3): 130-137. It has since been reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4.

The algorithm has been widely used, quoted, and adapted over the past 20 years. Unfortunately, variants of it abound that claim to be true implementations, and this can cause confusion. This page contains a demonstration of the stemmer, and downloadable versions of it in ANSI C, Java, Perl and other languages.

The original stemmer was coded up in BCPL, a language no longer in vogue. In its final surviving form, this BCPL version has three minor points of difference from the published algorithm, and these are clearly marked in the downloadable ANSI C version. They are discussed further below.

The ANSI C, Java and Perl versions are exactly equivalent to the original BCPL version, having been tested on a large corpus of English text. The original paper was, I hope, unambiguous, despite a couple of irritating typos, but even so the ANSI C version nowadays acts as a better definition of the algorithm than the original published paper.

Posted on 2013-01-24 17:33 by lexus

Reposted from: https://www.cnblogs.com/lexus/archive/2013/01/24/2875389.html
