The English (Porter2) stemming algorithm

Links to resources

Snowball main page
The stemmer in Snowball
The ANSI C stemmer
— and its header
Sample English vocabulary
Its stemmed equivalent
Vocabulary + stemmed equivalent
Tar-gzipped file of all of the above

A stop word list

Here is a sample of vocabulary, with the stemmed forms that will be generated with the algorithm.

word             stem          word           stem
consign       => consign       knack       => knack
consigned     => consign       knackeries  => knackeri
consigning    => consign       knacks      => knack
consignment   => consign       knag        => knag
consist       => consist       knave       => knave
consisted     => consist       knaves      => knave
consistency   => consist       knavish     => knavish
consistent    => consist       kneaded     => knead
consistently  => consist       kneading    => knead
consisting    => consist       knee        => knee
consists      => consist       kneel       => kneel
consolation   => consol        kneeled     => kneel
consolations  => consol        kneeling    => kneel
consolatory   => consolatori   kneels      => kneel
console       => consol        knees       => knee
consoled      => consol        knell       => knell
consoles      => consol        knelt       => knelt
consolidate   => consolid      knew        => knew
consolidated  => consolid      knick       => knick
consolidating => consolid      knif        => knif
consoling     => consol        knife       => knife
consolingly   => consol        knight      => knight
consols       => consol        knightly    => knight
consonant     => conson        knights     => knight
consort       => consort       knit        => knit
consorted     => consort       knits       => knit
consorting    => consort       knitted     => knit
conspicuous   => conspicu      knitting    => knit
conspicuously => conspicu      knives      => knive
conspiracy    => conspiraci    knob        => knob
conspirator   => conspir       knobs       => knob
conspirators  => conspir       knock       => knock
conspire      => conspir       knocked     => knock
conspired     => conspir       knocker     => knocker
conspiring    => conspir       knockers    => knocker
constable     => constabl      knocking    => knock
constables    => constabl      knocks      => knock
constance     => constanc      knopp       => knopp
constancy     => constanc      knot        => knot
constant      => constant      knots       => knot

Developing the English stemmer

(Revised slightly, December 2001)
(Further revised, September 2002)

I have made more than one attempt to improve the structure of the Porter algorithm by making it follow the pattern of ending removal of the Romance language stemmers. It is not hard to see why one should want to do this: step 1b of the Porter stemmer removes ed and ing, which are i-suffixes (*) attached to verbs. If these suffixes are removed, there should be no need to remove d-suffixes which are not verbal, although it will try to do so. This seems to be a deficiency in the Porter stemmer, not shared by the Romance stemmers. Again, the divisions between steps 2, 3 and 4 seem rather arbitrary, and are not found in the Romance stemmers.

Nevertheless, these attempts at improvement have been abandoned. They seem to lead to a more complicated algorithm with no very obvious improvements. A reason for not taking note of the outcome of step 1b may be that English endings do not determine word categories quite as strongly as endings in the Romance languages. For example, condition and position in French have to be nouns, but in English they can be verbs as well as nouns:

We are all conditioned by advertising
They are positioning themselves differently today

A possible reason for having separate steps 2, 3 and 4 is that d-suffix combinations in English are quite complex, a point which has been made elsewhere.

But it is hardly surprising that after twenty years of use of the Porter stemmer, certain improvements did suggest themselves, and a new algorithm for English is therefore offered here. (It could be called the ‘Porter2’ stemmer to distinguish it from the Porter stemmer, from which it derives.) The changes are not so very extensive: (1) terminating y is changed to i rather less often, (2) suffix us does not lose its s, (3) a few additional suffixes are included for removal, including (4) suffix ly. In addition, a small list of exceptional forms is included. In December 2001 there were two further adjustments: (5) Steps 5a and 5b of the old Porter stemmer were combined into a single step. This means that undoubling final ll is not done with removal of final e. (6) In Step 3 ative is removed only when in region R2. (7) In July 2005 a small adjustment was made (including a new step 0) to handle apostrophe.

To begin with, here is the basic algorithm without reference to the exceptional forms. An exact comparison with the Porter algorithm needs to be done quite carefully if done at all. Here we indicate by * points of departure, and by + additional features. In the sample vocabulary, Porter and Porter2 stem slightly under 5% of words to different forms.

Definition of the English stemmer

Define a vowel as one of

a   e   i   o   u   y

Define a double as one of

bb   dd   ff   gg   mm   nn   pp   rr   tt

Define a valid li-ending as one of

c   d   e   g   h   k   m   n   r   t

R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. (This definition may be modified for certain exceptional words — see below.)

R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel. (See note on R1 and R2.)
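
The two region definitions can be sketched in Python as follows. This is only an illustration of the prose above, not the Snowball implementation: the names `region_start`, `r1_r2` and `VOWELS` are assumptions made for the example, and the exceptional prefixes discussed later are ignored here.

```python
VOWELS = "aeiouy"  # uppercase 'Y' (a marked consonant y) is deliberately not a vowel


def region_start(word, begin=0):
    """Index just after the first non-vowel that follows a vowel at or after `begin`.

    Returns len(word), i.e. an empty region, when no such non-vowel exists.
    """
    for i in range(begin + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the substrings R1 and R2 of `word` (hypothetical helper)."""
    p1 = region_start(word)
    p2 = region_start(word, p1)  # R2 applies the same rule inside R1
    return word[p1:], word[p2:]


print(r1_r2("beautiful"))  # ('iful', 'ul')
print(r1_r2("beauty"))     # ('y', '')
print(r1_r2("beau"))       # ('', '')
```

Working with start indices rather than substrings makes the later "in R1" and "in R2" tests a simple comparison against the position where a suffix begins.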

Define a short syllable in a word as either (a) a vowel followed by a non-vowel other than w, x or Y and preceded by a non-vowel, or * (b) a vowel at the beginning of the word followed by a non-vowel.

So rap, trap, entrap end with a short syllable, and ow, on, at are classed as short syllables. But uproot, bestow, disturb do not end with a short syllable.

A word is called short if it ends in a short syllable, and if R1 is null.

So bed, shed and shred are short words, bead, embed, beds are not short words.
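
The short-syllable and short-word tests can be written down directly from these definitions. The sketch below is an assumption-laden illustration in Python (the helper names are invented for the example); it expects a word in which consonant y has already been marked as Y.

```python
VOWELS = "aeiouy"  # 'Y' counts as a non-vowel


def r1_start(word):
    """Start index of R1: just after the first non-vowel following a vowel."""
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def ends_with_short_syllable(word):
    """True if `word` ends with a short syllable, cases (a) and (b) above."""
    if len(word) >= 3:
        before, vowel, after = word[-3], word[-2], word[-1]
        # (a) non-vowel, vowel, non-vowel other than w, x or Y
        return (before not in VOWELS and vowel in VOWELS
                and after not in VOWELS and after not in "wxY")
    if len(word) == 2:
        # (b) a vowel at the beginning of the word followed by a non-vowel
        return word[0] in VOWELS and word[1] not in VOWELS
    return False


def is_short(word):
    """A word is short if it ends in a short syllable and R1 is null."""
    return ends_with_short_syllable(word) and r1_start(word) == len(word)


for w in ("bed", "shed", "shred", "bead", "embed", "beds"):
    print(w, is_short(w))
```

The loop prints True for bed, shed and shred, and False for bead, embed and beds, matching the examples in the text.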

An apostrophe (') may be regarded as a letter. (See note on apostrophes in English.)

If the word has two letters or less, leave it as it is.

Otherwise, do each of the following operations,

Remove initial ', if present. + Then,

Set initial y, or y after a vowel, to Y, and then establish the regions R1 and R2. (See note on vowel marking.)
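
A minimal Python sketch of this preparation step, assuming a lower-case input word; the function name `prelude` is borrowed from the Snowball script further down, but the code is an illustration rather than a translation of it.

```python
VOWELS = "aeiouy"


def prelude(word):
    """Drop an initial apostrophe and mark consonant y as 'Y'."""
    if word.startswith("'"):
        word = word[1:]
    chars = list(word)
    for i, ch in enumerate(chars):
        # y is treated as a consonant at the start of the word or after a vowel
        if ch == "y" and (i == 0 or chars[i - 1] in VOWELS):
            chars[i] = "Y"
    return "".join(chars)


print(prelude("youth"))   # Youth
print(prelude("boyish"))  # boYish
print(prelude("fly"))     # fly
```

Because Y is outside the vowel set, every later test on vowels automatically treats a marked y as a consonant.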

Step 0: +

Search for the longest among the suffixes,

' 's 's'
and remove if found.

Step 1a:

Search for the longest among the following suffixes, and perform the action indicated.

sses
replace by ss

ied+   ies*
replace by i if preceded by more than one letter, otherwise by ie (so ties -> tie, cries -> cri)

s
delete if the preceding word part contains a vowel not immediately before the s (so gas and this retain the s, gaps and kiwis lose it)

us+   ss
do nothing
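
Steps 0 and 1a can be sketched as follows. This is a hedged Python illustration of the rules as stated above, with invented function names; it assumes the word has already been through the vowel-marking step.

```python
VOWELS = "aeiouy"


def step_0(word):
    """Remove the longest of the suffixes 's' , 's and ' (apostrophe forms)."""
    for suffix in ("'s'", "'s", "'"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def step_1a(word):
    """Apply the longest matching Step 1a suffix rule."""
    if word.endswith("sses"):
        return word[:-2]                      # sses -> ss
    if word.endswith(("ied", "ies")):
        # more than one letter before the suffix -> i, otherwise -> ie
        return word[:-2] if len(word) > 4 else word[:-1]
    if word.endswith(("us", "ss")):
        return word                           # do nothing
    if word.endswith("s"):
        # delete s only if a vowel occurs before the letter preceding the s
        if any(ch in VOWELS for ch in word[:-2]):
            return word[:-1]
    return word


for w in ("ties", "cries", "gas", "this", "gaps", "kiwis"):
    print(w, "->", step_1a(w))
```

Checking `us` and `ss` before the bare `s` rule is what implements "longest among the suffixes": a word ending in us never reaches the s-deletion branch.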

Step 1b:

Search for the longest among the following suffixes, and perform the action indicated.

eed   eedly+
replace by ee if in R1

ed   edly+   ing   ingly+
delete if the preceding word part contains a vowel, and after the deletion:
if the word ends at, bl or iz add e (so luxuriat -> luxuriate), or
if the word ends with a double remove the last letter (so hopp -> hop), or
if the word is short, add e (so hop -> hope)
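
The following Python sketch illustrates Step 1b under the definitions given earlier. The helper names are assumptions made for the example, and "short" is evaluated on the word as it stands after the suffix has been deleted.

```python
VOWELS = "aeiouy"
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def r1_start(word):
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def ends_with_short_syllable(word):
    if len(word) >= 3:
        a, b, c = word[-3], word[-2], word[-1]
        return a not in VOWELS and b in VOWELS and c not in VOWELS and c not in "wxY"
    return len(word) == 2 and word[0] in VOWELS and word[1] not in VOWELS


def is_short(word):
    return ends_with_short_syllable(word) and r1_start(word) == len(word)


def step_1b(word):
    for suffix in ("eedly", "eed"):           # longest first
        if word.endswith(suffix):
            start = len(word) - len(suffix)
            # replace by ee only when the suffix lies inside R1
            return word[:start] + "ee" if start >= r1_start(word) else word
    for suffix in ("ingly", "edly", "ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if not any(ch in VOWELS for ch in stem):
                return word                   # no vowel before the suffix
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"
            if stem.endswith(DOUBLES):
                return stem[:-1]
            if is_short(stem):
                return stem + "e"
            return stem
    return word


for w in ("luxuriated", "hopped", "hoping", "agreed", "feed", "sing"):
    print(w, "->", step_1b(w))
```

The three tidy-up rules are mutually exclusive alternatives tried in order, which is why each branch returns immediately.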

Step 1c: *

replace suffix y or Y by i if preceded by a non-vowel which is not the first letter of the word (so cry -> cri, by -> by, say -> say)
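
As a small illustration, Step 1c reduces to a single test in Python. The function name is an assumption for this sketch, and the input is expected to carry the Y marking from the preparation step.

```python
VOWELS = "aeiouy"


def step_1c(word):
    """Replace final y or Y by i after a non-vowel that is not the first letter."""
    if len(word) > 2 and word[-1] in "yY" and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word


print(step_1c("cry"))  # cri
print(step_1c("by"))   # by
print(step_1c("saY"))  # saY (the y of say follows a vowel, so it was marked Y)
```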

Step 2:

Search for the longest among the following suffixes, and, if found and in R1, perform the action indicated.

tional:   replace by tion
enci:   replace by ence
anci:   replace by ance
abli:   replace by able
entli:   replace by ent
izer   ization:   replace by ize
ational   ation   ator:   replace by ate
alism   aliti   alli:   replace by al
fulness:   replace by ful
ousli   ousness:   replace by ous
iveness   iviti:   replace by ive
biliti   bli+:   replace by ble
ogi+:   replace by og if preceded by l
fulli+:   replace by ful
lessli+:   replace by less
li+:   delete if preceded by a valid li-ending
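
Step 2 lends itself to a table-driven implementation. The Python sketch below is one possible reading of the rules, with invented names (`REPLACEMENTS`, `step_2`); note that only the longest matching suffix is considered, and if it fails its condition no shorter suffix is tried.

```python
VOWELS = "aeiouy"
VALID_LI = "cdeghkmnrt"

# suffix -> replacement, copied from the Step 2 rules
REPLACEMENTS = {
    "tional": "tion", "enci": "ence", "anci": "ance", "abli": "able",
    "entli": "ent", "izer": "ize", "ization": "ize", "ational": "ate",
    "ation": "ate", "ator": "ate", "alism": "al", "aliti": "al",
    "alli": "al", "fulness": "ful", "ousli": "ous", "ousness": "ous",
    "iveness": "ive", "iviti": "ive", "biliti": "ble", "bli": "ble",
    "fulli": "ful", "lessli": "less",
}
# 'ogi' and 'li' carry extra conditions and are handled separately below
SUFFIXES = sorted(list(REPLACEMENTS) + ["ogi", "li"], key=len, reverse=True)


def r1_start(word):
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def step_2(word):
    for suffix in SUFFIXES:                   # longest suffix first
        if word.endswith(suffix):
            start = len(word) - len(suffix)
            if start < r1_start(word):
                return word                   # suffix is not in R1
            stem = word[:start]
            if suffix == "ogi":
                return stem + "og" if stem.endswith("l") else word
            if suffix == "li":
                return stem if stem and stem[-1] in VALID_LI else word
            return stem + REPLACEMENTS[suffix]
    return word


for w in ("relational", "rational", "sensibiliti", "analogi", "quickli", "happili"):
    print(w, "->", step_2(w))
```

Steps 3 and 4 follow the same pattern with different tables and region tests.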

Step 3:

Search for the longest among the following suffixes, and, if found and in R1, perform the action indicated.

tional+:   replace by tion
ational+:   replace by ate
alize:   replace by al
icate   iciti   ical:   replace by ic
ful   ness:   delete
ative*:   delete if in R2

Step 4:

Search for the longest among the following suffixes, and, if found and in R2, perform the action indicated.

al   ance   ence   er   ic   able   ible   ant   ement   ment   ent   ism   ate   iti   ous   ive   ize
delete
ion
delete if preceded by s or t
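
A possible Python rendering of Step 4 is shown below. It is a sketch under the stated rules, with hypothetical names, and it tests the start of the suffix against the start of R2.

```python
VOWELS = "aeiouy"
STEP_4_SUFFIXES = sorted(
    ["al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
     "ment", "ent", "ism", "ate", "iti", "ous", "ive", "ize", "ion"],
    key=len, reverse=True)


def region_start(word, begin=0):
    for i in range(begin + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def step_4(word):
    p2 = region_start(word, region_start(word))   # start of R2
    for suffix in STEP_4_SUFFIXES:                # longest suffix first
        if word.endswith(suffix):
            start = len(word) - len(suffix)
            if start < p2:
                return word                       # suffix is not in R2
            if suffix == "ion" and word[start - 1] not in "st":
                return word                       # ion needs a preceding s or t
            return word[:start]
    return word


for w in ("adjustment", "replacement", "adoption", "opinion", "cement"):
    print(w, "->", step_4(w))
```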

Step 5: *

Search for the following suffixes, and, if found, perform the action indicated.

e
delete if in R2, or in R1 and not preceded by a short syllable
l
delete if in R2 and preceded by l
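
Step 5 combines both region tests with the short-syllable test. The Python sketch below is an illustration only; all helper names are assumptions made for the example.

```python
VOWELS = "aeiouy"


def region_start(word, begin=0):
    for i in range(begin + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def ends_with_short_syllable(word):
    if len(word) >= 3:
        a, b, c = word[-3], word[-2], word[-1]
        return a not in VOWELS and b in VOWELS and c not in VOWELS and c not in "wxY"
    return len(word) == 2 and word[0] in VOWELS and word[1] not in VOWELS


def step_5(word):
    p1 = region_start(word)
    p2 = region_start(word, p1)
    last = len(word) - 1                      # position of the final letter
    if word.endswith("e"):
        if last >= p2:
            return word[:-1]                  # e in R2
        if last >= p1 and not ends_with_short_syllable(word[:-1]):
            return word[:-1]                  # e in R1, no short syllable before it
    elif word.endswith("l"):
        if last >= p2 and word[-2:-1] == "l":
            return word[:-1]                  # ll in R2 -> l
    return word


for w in ("rate", "cease", "debate", "controll", "roll"):
    print(w, "->", step_5(w))
```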

Finally, turn any remaining Y letters in the word back into lower case.

Exceptional forms in general

It is quite easy to expand a Snowball script so that certain exceptional word forms get special treatment. The standard case is that certain words W1,  W2  ..., instead of passing through the stemming process, are mapped to the forms  X1,  X2  ... respectively. If the script does the stemming by means of the call

    define stem as C

where  C  is a command, the exceptional cases can be dealt with by extending this to

    define stem as ( exception or C )

and putting in a routine  exception:

    define exception as (
        [substring] atlimit among(
            'W1'  ( <- 'X1' )
            'W2'  ( <- 'X2' )
            ...
        )
    )

atlimit  causes the whole string to be tested for equality with one of the  Wi, and if a match is found, the string is replaced with Xi.

More precisely we might have a group of words  W11,  W12  ... that need to be mapped to  X1, another group  W21,  W22 ... that need to be mapped to  X2, and so on, and a list of words V1,  V2  ...  Vk  that are to remain invariant. The exception  routine may then be written as follows:

    among( 'W11' 'W12' ... (<- 'X1')
           'W21' 'W22' ... (<- 'X2')
           ...
           'Wn1' 'Wn2' ... (<- 'Xn')
           'V1' 'V2' ... 'Vk'
    )

And indeed the  exception1  routine for the English stemmer has just that shape:

    define exception1 as (

        [substring] atlimit among(

            /* special changes: */

            'skis'      (<-'ski')
            'skies'     (<-'sky')
            'dying'     (<-'die')
            'lying'     (<-'lie')
            'tying'     (<-'tie')

            /* special -LY cases */

            'idly'      (<-'idl')
            'gently'    (<-'gentl')
            'ugly'      (<-'ugli')
            'early'     (<-'earli')
            'only'      (<-'onli')
            'singly'    (<-'singl')

            // ... extensions possible here ...

            /* invariant forms: */

            'sky'
            'news'
            'howe'

            'atlas' 'cosmos' 'bias' 'andes' // not plural forms

            // ... extensions possible here ...
        )
    )

(More will be said about the words that appear here shortly.)

Here we see words being treated exceptionally before stemming is done, but equally we could treat stems exceptionally after stemming is done, and so, if we wish, map absorpt to absorb, reduct to reduc etc., as in the Lovins stemmer. But more generally, throughout the algorithm, each significant step may have recognised exceptions, and a suitably placed  among  will take care of them. For example, a point made at least twice in the literature is that words beginning gener are overstemmed by the Porter stemmer:

generate      ->   gener
generates     ->   gener
generated     ->   gener
generating    ->   gener
general       ->   gener
generally     ->   gener
generic       ->   gener
generically   ->   gener
generous      ->   gener
generously    ->   gener

To fix this over-stemming, we make an exception to the usual setting of p1, the left point of R1, and therefore replace

    gopast v  gopast non-v  setmark p1

with

    among (
        'gener'
        // ... and other stems may be included here ...
    ) or (gopast v  gopast non-v)
    setmark p1

after which the words beginning gener stem as follows:

generate      ->   generat
generates     ->   generat
generated     ->   generat
generating    ->   generat
general       ->   general
generally     ->   general
generic       ->   generic
generically   ->   generic
generous      ->   generous
generously    ->   generous

Another example is given by the  exception2  routine, which is similar to  exception1, but placed after the call of  Step_1a, which may have removed terminal s,

    define exception2 as (

        [substring] atlimit among(
            'inning' 'outing' 'canning' 'herring'
            'proceed' 'exceed' 'succeed'

            // ... extensions possible here ...
        )
    )

Snowball makes it easy therefore to add in lists of exceptions. But deciding what the lists of exceptions should be is far from easy. Essentially there are two lines of attack, the systematic and the piecemeal. One might systematically treat as exceptions the stem changes of irregular verbs, for example. The piecemeal approach is to add in exceptions as people notice them — like gener above. The problem with the systematic approach is that it should be done by investigating the entire language vocabulary, and that is more than most people are prepared to do. The problem with the piecemeal approach is that it is arbitrary, and usually yields little.

The exception lists in the English stemmer are meant to be illustrative (‘this is how it is done if you want to do it’), and were derived piecemeal.

a) The new stemmer improves on the Porter stemmer in handling short words ending e and y. There is however a mishandling of the four forms sky, skies, ski, skis, which is easily corrected by treating three of these words as special cases.

b) Similarly there is a problem with the ing form of three letter verbs ending ie. There are only three such verbs: die, lie and tie, so a special case is made for dying, lying and tying.

c) One has to be a little careful of certain ing forms: inning, outing and canning, for example, should not be stemmed to in, out and can.

d) The removal of suffix ly, which is not in the Porter stemmer, has a number of exceptions. Certain short-word exceptions are idly, gently, ugly, early, only, singly. Rarer words (bristly, burly, curly, surly ...) are not included.

e) The remaining words were included following complaints from users of the Porter algorithm. news is not the plural of new (noticed when IR systems were being set up for Reuters). Howe is a surname, and needs to be separated from how (noticed when doing a search for ‘Sir Geoffrey Howe’ in a demonstration at the House of Commons). succeed etc are not past participles, so the ed should not be removed (pointed out to me in an email from India). herring should not stem to her (another email from Russia).

f) Finally, a few non-plural words ending s have been added.

Incidentally, this illustrates how much feedback to expect from the real users of a stemming algorithm: seven or eight words in twenty years!

The definition of the English stemmer above is therefore supplemented by the following:

Exceptional forms in the English stemmer

If the word begins gener, commun or arsen, set R1 to be the remainder of the word.
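
In a hand-written implementation this exception amounts to a prefix check placed before the ordinary R1 search. The Python sketch below illustrates the idea; the names `R1_PREFIXES` and `r1_start` are assumptions for the example.

```python
VOWELS = "aeiouy"
R1_PREFIXES = ("gener", "commun", "arsen")


def r1_start(word):
    """Start index of R1, honouring the exceptional prefixes."""
    for prefix in R1_PREFIXES:
        if word.startswith(prefix):
            return len(prefix)                # R1 is the remainder of the word
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


for w in ("generate", "general", "communism", "arsenal", "gentle"):
    print(w, "-> R1 =", repr(w[r1_start(w):]))
```

Without the prefix check, R1 of generate would be erate, which is what allows the later steps to strip the word down to gener.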

Stem certain special words as follows,

skis     ->   ski
skies    ->   sky
dying    ->   die
lying    ->   lie
tying    ->   tie
idly     ->   idl
gently   ->   gentl
ugly     ->   ugli
early    ->   earli
only     ->   onli
singly   ->   singl

If one of the following is found, leave it invariant,

sky
news
howe
atlas       cosmos       bias       andes

Following step 1a, leave the following invariant,

inning       outing       canning       herring       earring
proceed       exceed       succeed
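
Outside Snowball, these lists map naturally onto a dictionary and two sets. The Python sketch below is an assumption-based illustration: the function names mirror the routines `exception1` and `exception2`, but the return conventions are invented for the example.

```python
SPECIAL_CHANGES = {
    "skis": "ski", "skies": "sky",
    "dying": "die", "lying": "lie", "tying": "tie",
    "idly": "idl", "gently": "gentl", "ugly": "ugli",
    "early": "earli", "only": "onli", "singly": "singl",
}
INVARIANT = frozenset(
    ["sky", "news", "howe", "atlas", "cosmos", "bias", "andes"])
INVARIANT_AFTER_1A = frozenset(
    ["inning", "outing", "canning", "herring", "earring",
     "proceed", "exceed", "succeed"])


def exception1(word):
    """Return the final stem for an exceptional word, or None to keep stemming."""
    if word in SPECIAL_CHANGES:
        return SPECIAL_CHANGES[word]
    if word in INVARIANT:
        return word
    return None


def exception2(word):
    """True if `word`, as it stands after Step 1a, must be left unchanged."""
    return word in INVARIANT_AFTER_1A


print(exception1("skies"))    # sky
print(exception1("news"))     # news
print(exception1("running"))  # None
print(exception2("herring"))  # True
```

A driver would call `exception1` on the whole word before any other processing, and `exception2` immediately after Step 1a.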

The full algorithm in Snowball

    integers ( p1 p2 )
    booleans ( Y_found )

    routines (
        prelude postlude
        mark_regions
        shortv
        R1 R2
        Step_1a Step_1b Step_1c Step_2 Step_3 Step_4 Step_5
        exception1
        exception2
    )

    externals ( stem )

    groupings ( v v_WXY valid_LI )

    stringescapes {}

    define v        'aeiouy'
    define v_WXY    v + 'wxY'

    define valid_LI 'cdeghkmnrt'

    define prelude as (
        unset Y_found
        do ( ['{'}'] delete)
        do ( ['y'] <-'Y' set Y_found)
        do repeat(goto (v ['y']) <-'Y' set Y_found)
    )

    define mark_regions as (
        $p1 = limit
        $p2 = limit
        do(
            among (
                'gener'
                'commun'  //  added May 2005
                'arsen'   //  added Nov 2006 (arsenic/arsenal)
                // ... extensions possible here ...
            ) or (gopast v  gopast non-v)
            setmark p1
            gopast v  gopast non-v  setmark p2
        )
    )

    backwardmode (

        define shortv as (
            ( non-v_WXY v non-v )
            or
            ( non-v v atlimit )
        )

        define R1 as $p1 <= cursor
        define R2 as $p2 <= cursor

        define Step_1a as (
            try (
                [substring] among (
                    '{'}' '{'}s' '{'}s{'}'
                           (delete)
                )
            )
            [substring] among (
                'sses' (<-'ss')
                'ied' 'ies'
                       ((hop 2 <-'i') or <-'ie')
                's'    (next gopast v delete)
                'us' 'ss'
            )
        )

        define Step_1b as (
            [substring] among (
                'eed' 'eedly'
                    (R1 <-'ee')
                'ed' 'edly' 'ing' 'ingly'
                    (
                    test gopast v  delete
                    test substring among(
                        'at' 'bl' 'iz'
                             (<+ 'e')
                        'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
                        // ignoring double c, h, j, k, q, v, w, and x
                             ([next]  delete)
                        ''   (atmark p1  test shortv  <+ 'e')
                    )
                )
            )
        )

        define Step_1c as (
            ['y' or 'Y']
            non-v not atlimit
            <-'i'
        )

        define Step_2 as (
            [substring] R1 among (
                'tional'  (<-'tion')
                'enci'    (<-'ence')
                'anci'    (<-'ance')
                'abli'    (<-'able')
                'entli'   (<-'ent')
                'izer' 'ization'
                          (<-'ize')
                'ational' 'ation' 'ator'
                          (<-'ate')
                'alism' 'aliti' 'alli'
                          (<-'al')
                'fulness' (<-'ful')
                'ousli' 'ousness'
                          (<-'ous')
                'iveness' 'iviti'
                          (<-'ive')
                'biliti' 'bli'
                          (<-'ble')
                'ogi'     ('l' <-'og')
                'fulli'   (<-'ful')
                'lessli'  (<-'less')
                'li'      (valid_LI delete)
            )
        )

        define Step_3 as (
            [substring] R1 among (
                'tional'  (<- 'tion')
                'ational' (<- 'ate')
                'alize'   (<-'al')
                'icate' 'iciti' 'ical'
                          (<-'ic')
                'ful' 'ness'
                          (delete)
                'ative'
                          (R2 delete)  // 'R2' added Dec 2001
            )
        )

        define Step_4 as (
            [substring] R2 among (
                'al' 'ance' 'ence' 'er' 'ic' 'able' 'ible' 'ant' 'ement'
                'ment' 'ent' 'ism' 'ate' 'iti' 'ous' 'ive' 'ize'
                          (delete)
                'ion'     ('s' or 't' delete)
            )
        )

        define Step_5 as (
            [substring] among (
                'e' (R2 or (R1 not shortv) delete)
                'l' (R2 'l' delete)
            )
        )

        define exception2 as (

            [substring] atlimit among(
                'inning' 'outing' 'canning' 'herring' 'earring'
                'proceed' 'exceed' 'succeed'

                // ... extensions possible here ...
            )
        )
    )

    define exception1 as (

        [substring] atlimit among(

            /* special changes: */

            'skis'      (<-'ski')
            'skies'     (<-'sky')
            'dying'     (<-'die')
            'lying'     (<-'lie')
            'tying'     (<-'tie')

            /* special -LY cases */

            'idly'      (<-'idl')
            'gently'    (<-'gentl')
            'ugly'      (<-'ugli')
            'early'     (<-'earli')
            'only'      (<-'onli')
            'singly'    (<-'singl')

            // ... extensions possible here ...

            /* invariant forms: */

            'sky'
            'news'
            'howe'

            'atlas' 'cosmos' 'bias' 'andes' // not plural forms

            // ... extensions possible here ...
        )
    )

    define postlude as (Y_found  repeat(goto (['Y']) <-'y'))

    define stem as (

        exception1 or
        not hop 3 or (
            do prelude
            do mark_regions
            backwards (

                do Step_1a

                exception2 or (

                    do Step_1b
                    do Step_1c

                    do Step_2
                    do Step_3
                    do Step_4

                    do Step_5
                )
            )
            do postlude
        )
    )

Posted on 2013-01-24 15:16 by lexus

Reposted from: https://www.cnblogs.com/lexus/archive/2013/01/24/2875055.html
