Python: English Word Tokenization and a Word-Level Inverted Index

【1. General Multi-Word Query】
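
As a compact illustration of the data structure the script in this section builds, here is a minimal Python 3 sketch of a multi-document positional inverted index with an AND query. The names `tokenize`, `build_index`, and `search_all`, the regular-expression tokenizer, and the two sample strings are assumptions made for the example, not part of the original script; the sketch also skips stop-word and minimum-length filtering.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return (position, word) pairs for the lowercased alphanumeric tokens."""
    return list(enumerate(re.findall(r"[a-z0-9]+", text.lower())))


def build_index(documents):
    """Build {word: {doc_id: [positions]}} from {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in tokenize(text):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search_all(index, query):
    """Return the ids of the documents that contain every query word."""
    words = [word for _, word in tokenize(query)]
    if not words:
        return set()
    # A word missing from the index contributes an empty set,
    # so the intersection is empty as well.
    doc_sets = [set(index.get(word, {})) for word in words]
    return set.intersection(*doc_sets)


docs = {
    "doc1": "Smith now will work on a week-to-week basis",
    "doc2": "rolled into San Francisco's Fort Mason last week",
}
index = build_index(docs)
print(sorted(search_all(index, "week")))        # ['doc1', 'doc2']
print(sorted(search_all(index, "Mason week")))  # ['doc2']
```

Storing word positions per document is slightly more than an AND query needs, but the phrase and proximity queries in the later sections rely on exactly that information.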

# -*- coding: utf-8 -*-
'''
Created on 2015-11-18
'''
# List Of English Stop Words
# http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
_WORD_MIN_LENGTH = 3
_STOP_WORDS = frozenset([
'a', 'about', 'above', 'above', 'across', 'after', 'afterwards', 'again',
'against', 'all', 'almost', 'alone', 'along', 'already', 'also','although',
'always','am','among', 'amongst', 'amoungst', 'amount',  'an', 'and', 'another',
'any','anyhow','anyone','anything','anyway', 'anywhere', 'are', 'around', 'as',
'at', 'back','be','became', 'because','become','becomes', 'becoming', 'been',
'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides',
'between', 'beyond', 'bill', 'both', 'bottom','but', 'by', 'call', 'can',
'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe',
'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight',
'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even',
'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few',
'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former',
'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here',
'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him',
'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', 'in', 'inc',
'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last',
'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me',
'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly',
'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never',
'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only',
'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out',
'over', 'own','part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same',
'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she',
'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere',
'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their',
'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',
'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third',
'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus',
'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two',
'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter',
'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which',
'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will',
'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself',
'yourselves'])


def word_split_out(text):
    """Split a text into words. Returns the plain list of words."""
    word_list = []
    wcurrent = []
    for c in text:
        if c.isalnum():
            wcurrent.append(c)
        elif wcurrent:
            word_list.append(u''.join(wcurrent))
            wcurrent = []
    if wcurrent:
        word_list.append(u''.join(wcurrent))
    return word_list


def word_split(text):
    """Split a text into words.

    Returns a list of (index, word) tuples, where index is the
    zero-based position of the word in the text.
    """
    return list(enumerate(word_split_out(text)))


def words_cleanup(words):
    """Remove stop words and words shorter than the minimum length."""
    cleaned_words = []
    for index, word in words:
        if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
            continue
        cleaned_words.append((index, word))
    return cleaned_words


def words_normalize(words):
    """Normalize the words.

    Here this is just lower-casing, but you could also strip accents,
    convert plurals to singular, and so on.
    """
    return [(index, word.lower()) for index, word in words]


def word_index(text):
    """Helper that processes a text: split, normalize, then clean up."""
    words = word_split(text)
    words = words_normalize(words)
    words = words_cleanup(words)
    return words


def inverted_index(text):
    """Create an inverted index of the given text document.

    {word: [locations]}
    """
    inverted = {}
    for index, word in word_index(text):
        locations = inverted.setdefault(word, [])
        locations.append(index)
    return inverted


def inverted_index_add(inverted, doc_id, doc_index):
    """Add the inverted index doc_index of document doc_id to the
    multi-document inverted index (inverted), using doc_id as the
    document identifier.

    {word: {doc_id: [locations]}}
    """
    # items() works on both Python 2 and Python 3
    for word, locations in doc_index.items():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted


def search(inverted, query):
    """Return the set of ids of the documents that contain all the words in the query."""
    words = [word for _, word in word_index(query)]
    # A query word that is not indexed cannot match any document
    if not words or any(word not in inverted for word in words):
        return set()
    results = [set(inverted[word]) for word in words]
    return set.intersection(*results)


if __name__ == '__main__':
    doc1 = """
Niners head coach Mike Singletary will let Alex Smith remain his starting
quarterback, but his vote of confidence is anything but a long-term mandate.
Smith now will work on a week-to-week basis, because Singletary has voided
his year-long lease on the job.
"I think from this point on, you have to do what's best for the football team,"
Singletary said Monday, one day after threatening to bench Smith during a
27-24 loss to the visiting Eagles.
"""
    doc2 = """
The fifth edition of West Coast Green, a conference focusing on "green" home
innovations and products, rolled into San Francisco's Fort Mason last week
intent, per usual, on making our living spaces more environmentally friendly
- one used-tire house at a time.
To that end, there were presentations on topics such as water efficiency and
the burgeoning future of Net Zero-rated buildings that consume no energy and
produce no carbon emissions.
"""

    # Build Inverted-Index for documents
    inverted = {}
    documents = {'doc1': doc1, 'doc2': doc2}
    for doc_id, text in documents.items():
        doc_index = inverted_index(text)
        inverted_index_add(inverted, doc_id, doc_index)

    # Print Inverted-Index
    for word, doc_locations in inverted.items():
        print("%s %r" % (word, doc_locations))

    def extract_text(doc, index):
        """Return up to four words of the document, starting at word position index."""
        word_list = word_split_out(documents[doc])
        # Slicing avoids an IndexError when the match is near the end of the text
        return " ".join(word_list[index:index + 4])

    # Search something and print results
    queries = ['Week', 'Niners week', 'West-coast Week']
    for query in queries:
        result_docs = search(inverted, query)
        print("Search for '%s': %r" % (query, result_docs))
        for _, word in word_index(query):
            for doc in result_docs:
                for index in inverted[word][doc]:
                    print('   - %s...' % extract_text(doc, index))
        print("")

【2. Phrase Query】
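
The core of a phrase query is a position check: a phrase matches where each of its words appears at the expected offset from the first one. The following Python 3 sketch shows that check on a single text. The names `positions` and `phrase_starts`, the regular-expression tokenizer, and the sample sentence are assumptions made for the example and do not appear in the script of this section; stop-word filtering is left out.

```python
import re


def positions(text):
    """Map each lowercased word to the list of its word positions."""
    table = {}
    for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        table.setdefault(word, []).append(position)
    return table


def phrase_starts(table, phrase):
    """Return the positions where the words of `phrase` occur consecutively."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return []
    starts = table.get(words[0], [])
    for offset, word in enumerate(words[1:], start=1):
        later = set(table.get(word, []))
        # Keep a start only if this word sits exactly `offset` words after it
        starts = [s for s in starts if s + offset in later]
    return starts


text = "Singletary said Monday, one day after Singletary voided the lease"
table = positions(text)
print(phrase_starts(table, "Singletary said Monday"))  # [0]
print(phrase_starts(table, "Singletary voided"))       # [6]
print(phrase_starts(table, "said Singletary"))         # []
```

Filtering the list of candidate start positions word by word keeps the check simple: once the list is empty, the phrase cannot match and later words leave it empty.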

# -*- coding: utf-8 -*-
'''
Created on 2015-11-18
'''
# List Of English Stop Words
# http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
_WORD_MIN_LENGTH = 3
_STOP_WORDS = frozenset([
'a', 'about', 'above', 'above', 'across', 'after', 'afterwards', 'again',
'against', 'all', 'almost', 'alone', 'along', 'already', 'also','although',
'always','am','among', 'amongst', 'amoungst', 'amount',  'an', 'and', 'another',
'any','anyhow','anyone','anything','anyway', 'anywhere', 'are', 'around', 'as',
'at', 'back','be','became', 'because','become','becomes', 'becoming', 'been',
'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides',
'between', 'beyond', 'bill', 'both', 'bottom','but', 'by', 'call', 'can',
'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe',
'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight',
'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even',
'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few',
'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former',
'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here',
'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him',
'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', 'in', 'inc',
'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last',
'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me',
'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly',
'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never',
'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only',
'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out',
'over', 'own','part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same',
'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she',
'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere',
'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their',
'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',
'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third',
'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus',
'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two',
'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter',
'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which',
'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will',
'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself',
'yourselves'])


def word_split_out(text):
    """Split a text into words. Returns the plain list of words."""
    word_list = []
    wcurrent = []
    for c in text:
        if c.isalnum():
            wcurrent.append(c)
        elif wcurrent:
            word_list.append(u''.join(wcurrent))
            wcurrent = []
    if wcurrent:
        word_list.append(u''.join(wcurrent))
    return word_list


def word_split(text):
    """Split a text into words.

    Returns a list of (index, word) tuples, where index is the
    zero-based position of the word in the text.
    """
    return list(enumerate(word_split_out(text)))


def words_cleanup(words):
    """Remove stop words and words shorter than the minimum length."""
    cleaned_words = []
    for index, word in words:
        if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
            continue
        cleaned_words.append((index, word))
    return cleaned_words


def words_normalize(words):
    """Normalize the words.

    Here this is just lower-casing, but you could also strip accents,
    convert plurals to singular, and so on.
    """
    return [(index, word.lower()) for index, word in words]


def word_index(text):
    """Helper that processes a text: split, normalize, then clean up."""
    words = word_split(text)
    words = words_normalize(words)
    words = words_cleanup(words)
    return words


def inverted_index(text):
    """Create an inverted index of the given text document.

    {word: [locations]}
    """
    inverted = {}
    for index, word in word_index(text):
        locations = inverted.setdefault(word, [])
        locations.append(index)
    return inverted


def inverted_index_add(inverted, doc_id, doc_index):
    """Add the inverted index doc_index of document doc_id to the
    multi-document inverted index (inverted), using doc_id as the
    document identifier.

    {word: {doc_id: [locations]}}
    """
    # items() works on both Python 2 and Python 3
    for word, locations in doc_index.items():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted


def search(inverted, query):
    """Return the set of ids of the documents that contain all the words in the query."""
    words = [word for _, word in word_index(query)]
    # A query word that is not indexed cannot match any document
    if not words or any(word not in inverted for word in words):
        return set()
    results = [set(inverted[word]) for word in words]
    return set.intersection(*results)


def distance_between_word(word_index_1, word_index_2, distance):
    """Return the positions in word_index_1 that are followed, exactly
    `distance` words later, by a position in word_index_2."""
    positions_2 = set(word_index_2)
    return [index_1 for index_1 in word_index_1
            if index_1 + distance in positions_2]


def extract_text(doc, index):
    """Return up to four words of the document, starting at word position index."""
    word_list = word_split_out(documents[doc])
    # Slicing avoids an IndexError when the match is near the end of the text
    return " ".join(word_list[index:index + 4])


if __name__ == '__main__':
    doc1 = """
Niners head coach Mike Singletary will let Alex Smith remain his starting
quarterback, but his vote of confidence is anything but a long-term mandate.
Smith now will work on a week-to-week basis, because Singletary has voided
his year-long lease on the job.
"I think from this point on, you have to do what's best for the football team,"
Singletary said Monday, one day after threatening to bench Smith during a
27-24 loss to the visiting Eagles.
"""
    doc2 = """
The fifth edition of West Coast Green, a conference focusing on "green" home
innovations and products, rolled into San Francisco's Fort Mason last week
intent, per usual, on making our living spaces more environmentally friendly
- one used-tire house at a time.
To that end, there were presentations on topics such as water efficiency and
the burgeoning future of Net Zero-rated buildings that consume no energy and
produce no carbon emissions.
"""

    # Build Inverted-Index for documents
    inverted = {}
    documents = {'doc1': doc1, 'doc2': doc2}
    for doc_id, text in documents.items():
        doc_index = inverted_index(text)
        inverted_index_add(inverted, doc_id, doc_index)

    # Print Inverted-Index
    for word, doc_locations in inverted.items():
        print("%s %r" % (word, doc_locations))

    # Search something and print results
    queries = ['Week', 'water efficiency', 'Singletary said Monday']
    for query in queries:
        result_docs = search(inverted, query)
        print("Search for '%s': %r" % (query, result_docs))
        query_word_list = word_index(query)
        for doc in result_docs:
            # Candidate start positions of the phrase; None until the
            # first query word has been processed.
            index_first = None
            first_pos = 0
            for pos, word in query_word_list:
                index_second = inverted[word][doc]
                if index_first is None:
                    index_first = index_second
                    first_pos = pos
                else:
                    # Keep only the starts followed by this word at the
                    # same offset it has in the query.
                    index_first = distance_between_word(
                        index_first, index_second, pos - first_pos)
            for index in index_first or []:
                print('   - %s...' % extract_text(doc, index))
        print("")

【3. Proximity Query】
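
A proximity query relaxes the phrase check: the second word only has to appear within a given number of words after the first. The Python 3 sketch below shows this for two words on a single text. The names `positions` and `near`, the `max_gap` parameter, and the sample sentence are assumptions made for the example rather than parts of the script in this section, and stop-word filtering is omitted.

```python
import re


def positions(text):
    """Map each lowercased word to the list of its word positions."""
    table = {}
    for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        table.setdefault(word, []).append(position)
    return table


def near(table, first, second, max_gap):
    """Return the positions of `first` that are followed by `second`
    within at most `max_gap` words (order-sensitive)."""
    later = table.get(second.lower(), [])
    return [p for p in table.get(first.lower(), [])
            if any(0 < q - p <= max_gap for q in later)]


text = ("Net Zero-rated buildings that consume no energy "
        "and produce no carbon emissions")
table = positions(text)
print(near(table, "buildings", "consume", 3))  # [3]
print(near(table, "buildings", "consume", 1))  # []
print(near(table, "no", "carbon", 3))          # [10]
```

Using `any(...)` records each position of the first word at most once, even when several occurrences of the second word fall inside the window.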

# -*- coding: utf-8 -*-
'''
Created on 2015-11-18
'''
# List Of English Stop Words
# http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
_WORD_MIN_LENGTH = 3
_STOP_WORDS = frozenset([
'a', 'about', 'above', 'above', 'across', 'after', 'afterwards', 'again',
'against', 'all', 'almost', 'alone', 'along', 'already', 'also','although',
'always','am','among', 'amongst', 'amoungst', 'amount',  'an', 'and', 'another',
'any','anyhow','anyone','anything','anyway', 'anywhere', 'are', 'around', 'as',
'at', 'back','be','became', 'because','become','becomes', 'becoming', 'been',
'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides',
'between', 'beyond', 'bill', 'both', 'bottom','but', 'by', 'call', 'can',
'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe',
'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight',
'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even',
'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few',
'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former',
'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here',
'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him',
'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', 'in', 'inc',
'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last',
'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me',
'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly',
'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never',
'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only',
'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out',
'over', 'own','part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same',
'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she',
'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere',
'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their',
'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',
'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third',
'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus',
'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two',
'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter',
'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which',
'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will',
'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself',
'yourselves'])


def word_split_out(text):
    """Split a text into words. Returns the plain list of words."""
    word_list = []
    wcurrent = []
    for c in text:
        if c.isalnum():
            wcurrent.append(c)
        elif wcurrent:
            word_list.append(u''.join(wcurrent))
            wcurrent = []
    if wcurrent:
        word_list.append(u''.join(wcurrent))
    return word_list


def word_split(text):
    """Split a text into words.

    Returns a list of (index, word) tuples, where index is the
    zero-based position of the word in the text.
    """
    return list(enumerate(word_split_out(text)))


def words_cleanup(words):
    """Remove stop words and words shorter than the minimum length."""
    cleaned_words = []
    for index, word in words:
        if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
            continue
        cleaned_words.append((index, word))
    return cleaned_words


def words_normalize(words):
    """Normalize the words.

    Here this is just lower-casing, but you could also strip accents,
    convert plurals to singular, and so on.
    """
    return [(index, word.lower()) for index, word in words]


def word_index(text):
    """Helper that processes a text: split, normalize, then clean up."""
    words = word_split(text)
    words = words_normalize(words)
    words = words_cleanup(words)
    return words


def inverted_index(text):
    """Create an inverted index of the given text document.

    {word: [locations]}
    """
    inverted = {}
    for index, word in word_index(text):
        locations = inverted.setdefault(word, [])
        locations.append(index)
    return inverted


def inverted_index_add(inverted, doc_id, doc_index):
    """Add the inverted index doc_index of document doc_id to the
    multi-document inverted index (inverted), using doc_id as the
    document identifier.

    {word: {doc_id: [locations]}}
    """
    # items() works on both Python 2 and Python 3
    for word, locations in doc_index.items():
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted


def search(inverted, query):
    """Return the set of ids of the documents that contain all the words in the query."""
    words = [word for _, word in word_index(query)]
    # A query word that is not indexed cannot match any document
    if not words or any(word not in inverted for word in words):
        return set()
    results = [set(inverted[word]) for word in words]
    return set.intersection(*results)


def distance_between_word(word_index_1, word_index_2, distance):
    """Return the positions in word_index_1 that are followed by a
    position in word_index_2 at most `distance` words later."""
    distance_list = []
    for index_1 in word_index_1:
        for index_2 in word_index_2:
            if 0 < index_2 - index_1 <= distance:
                distance_list.append(index_1)
                break  # record each start position only once
    return distance_list


def extract_text(doc, index):
    """Return up to seven words of the document, starting at word position index."""
    word_list = word_split_out(documents[doc])
    # Slicing avoids an IndexError when the match is near the end of the text
    return " ".join(word_list[index:index + 7])


if __name__ == '__main__':
    doc1 = """
Niners head coach Mike Singletary will let Alex Smith remain his starting
quarterback, but his vote of confidence is anything but a long-term mandate.
Smith now will work on a week-to-week basis, because Singletary has voided
his year-long lease on the job.
"I think from this point on, you have to do what's best for the football team,"
Singletary said Monday, one day after threatening to bench Smith during a
27-24 loss to the visiting Eagles.
"""
    doc2 = """
The fifth edition of West Coast Green, a conference focusing on "green" home
innovations and products, rolled into San Francisco's Fort Mason last week
intent, per usual, on making our living spaces more environmentally friendly
- one used-tire house at a time.
To that end, there were presentations on topics such as water efficiency and
the burgeoning future of Net Zero-rated buildings that consume no energy and
produce no carbon emissions.
"""

    # Build Inverted-Index for documents
    inverted = {}
    documents = {'doc1': doc1, 'doc2': doc2}
    for doc_id, text in documents.items():
        doc_index = inverted_index(text)
        inverted_index_add(inverted, doc_id, doc_index)

    # Print Inverted-Index
    for word, doc_locations in inverted.items():
        print("%s %r" % (word, doc_locations))

    # Search something and print results
    queries = ['Week', 'buildings consume', 'Alex remain quarterback']
    for query in queries:
        result_docs = search(inverted, query)
        print("Search for '%s': %r" % (query, result_docs))
        query_word_list = word_index(query)
        for doc in result_docs:
            # Candidate start positions; None until the first query
            # word has been processed.
            index_first = None
            step = 3      # extra distance allowed for each further query word
            distance = 3  # maximum distance from the first query word
            for _, word in query_word_list:
                index_second = inverted[word][doc]
                if index_first is None:
                    index_first = index_second
                else:
                    index_first = distance_between_word(
                        index_first, index_second, distance)
                    distance += step
            for index in index_first or []:
                print('   - %s...' % extract_text(doc, index))
        print("")
