Which Python libraries can extract text summaries?


1. google goose

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg

2. python SnowNLP

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
s.words         # word segmentation:
                # [u'这个', u'东西', u'真心',
                #  u'很', u'赞']
s.tags          # part-of-speech tags:
                # [(u'这个', u'r'), (u'东西', u'n'),
                #  (u'真心', u'd'), (u'很', u'd'),
                #  (u'赞', u'Vg')]
s.sentiments    # 0.9769663402895832, the probability that the text is positive
s.pinyin        # [u'zhe', u'ge', u'dong', u'xi',
                #  u'zhen', u'xin', u'hen', u'zan']

# Traditional Chinese input, converted to Simplified Chinese by s.han
s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')
s.han           # u'「繁体字」「繁体中文」的叫法
                #  在台湾亦很常见。'

text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，
所以它与语言学的研究有着密切的联系，但又有重要的区别。
自然语言处理并不是一般地研究自然语言，
而在于研制能有效地实现自然语言通信的计算机系统，
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)
s.keywords(3)   # top 3 keywords: [u'语言', u'自然', u'计算机']
s.summary(3)    # top 3 summary sentences:
                # [u'因而它是计算机科学的一部分',
                #  u'自然语言处理是一门融语言学、计算机科学、
                #    数学于一体的科学',
                #  u'自然语言处理是计算机科学领域与人工智能
                #    领域中的一个重要方向']
s.sentences     # the text split into sentences

# A corpus given as a list of already-tokenized documents
s = SnowNLP([[u'这篇', u'文章'],
             [u'那篇', u'论文'],
             [u'这个']])
s.tf            # term frequencies
s.idf           # inverse document frequencies
s.sim([u'文章'])  # similarity of the query to each document:
                # [0.3756070762985226, 0, 0]
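The tf and idf attributes above are shown without output. As a rough illustration of what such values represent, here is a minimal sketch that computes raw term counts and the textbook log(N / df) inverse document frequency for the same three tokenized documents. The function names are hypothetical, and this is an assumption-level sketch rather than SnowNLP's implementation, so SnowNLP's own numbers may differ.

```python
import math
from collections import Counter

# The same toy corpus as in the SnowNLP example: one token list per document.
docs = [["这篇", "文章"], ["那篇", "论文"], ["这个"]]


def term_frequencies(documents):
    """Return one {term: count} dict per document (raw counts)."""
    return [dict(Counter(doc)) for doc in documents]


def inverse_document_frequencies(documents):
    """Return {term: log(N / df)}, where df is the number of documents containing the term.

    This is the plain textbook formula, chosen for illustration only.
    """
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


tf = term_frequencies(docs)
idf = inverse_document_frequencies(docs)

for term, weight in idf.items():
    print(term, weight)
```

Every term in this corpus occurs in exactly one document, so all terms get the same idf weight here; a larger corpus would separate common terms from rare ones.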

3. python TextTeaser

#!/usr/bin/python
# -*- coding: utf-8 -*-

from textteaser import TextTeaser

# article source: https://blogs.dropbox.com/developers/2015/03/limitations-of-the-get-method-in-http/
title = "Limitations of the GET method in HTTP"

# The article body, written as two adjacent string literals that Python joins into one string
text = (
    "We spend a lot of time thinking about web API design, and we learn a lot from other APIs and discussion with their authors. In the hopes that it helps others, we want to share some thoughts of our own. In this post, we’ll discuss the limitations of the HTTP GET method and what we decided to do about it in our own API. As a rule, HTTP GET requests should not modify server state. This rule is useful because it lets intermediaries infer something about the request just by looking at the HTTP method. For example, a browser doesn’t know exactly what a particular HTML form does, but if the form is submitted via HTTP GET, the browser knows it’s safe to automatically retry the submission if there’s a network error. For forms that use HTTP POST, it may not be safe to retry so the browser asks the user for confirmation first. HTTP-based APIs take advantage of this by using GET for API calls that don’t modify server state. So if an app makes an API call using GET and the network request fails, the app’s HTTP client library might decide to retry the request. The library doesn’t need to understand the specifics of the API call. The Dropbox API tries to use GET for calls that don’t modify server state, but unfortunately this isn’t always possible. GET requests don’t have a request body, so all parameters must appear in the URL or in a header. While the HTTP standard doesn’t define a limit for how long URLs or headers can be, most HTTP clients and servers have a practical limit somewhere between 2 kB and 8 kB. This is rarely a problem, but we ran up against this constraint when creating the /delta API call. Though it doesn’t modify server state, its parameters are sometimes too long to fit in the URL or an HTTP header. The problem is that, in HTTP, the property of modifying server state is coupled with the property of having a request body. "
    "We could have somehow contorted /delta to mesh better with the HTTP worldview, but there are other things to consider when designing an API, like performance, simplicity, and developer ergonomics. In the end, we decided the benefits of making /delta more HTTP-like weren’t worth the costs and just switched it to HTTP POST. HTTP was developed for a specific hierarchical document storage and retrieval use case, so it’s no surprise that it doesn’t fit every API perfectly. Maybe we shouldn’t let HTTP’s restrictions influence our API design too much. For example, independent of HTTP, we can have each API function define whether it modifies server state. Then, our server can accept GET requests for API functions that don’t modify server state and don’t have large parameters, but still accept POST requests to handle the general case. This way, we’re opportunistically taking advantage of HTTP without tying ourselves to it."
)

tt = TextTeaser()

# summarize() returns the sentences selected for the summary
sentences = tt.summarize(title, text)

for sentence in sentences:
    print(sentence)

4. python sumy

# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

LANGUAGE = "czech"
SENTENCES_COUNT = 10

if __name__ == "__main__":
    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
    # Download the page and parse it with a tokenizer for the chosen language
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    # LSA summarizer with language-specific stemming and stop words
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    # Print the SENTENCES_COUNT sentences selected for the summary
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
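
The summarizers listed above return sentences picked from the input text. As a rough illustration of that idea, and not the algorithm of any library listed here, the sketch below scores each sentence by the average frequency of its words and keeps the highest-scoring ones in their original order. It uses only the standard library; the function names split_sentences and summarize, the regular expressions, and the sample text are all assumptions made for this example, and the word pattern only handles lowercase ASCII English.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a naive rule for English)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def summarize(text, sentences_count=2):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    words_per_sentence = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Word frequencies over the whole text.
    freq = Counter(word for words in words_per_sentence for word in words)

    def score(words):
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(words_per_sentence[i]),
                    reverse=True)
    chosen = sorted(ranked[:sentences_count])
    return [sentences[i] for i in chosen]


sample = "Cats sleep. Cats eat fish. Dogs bark loudly at night."
for sentence in summarize(sample, sentences_count=2):
    print(sentence)
```

Real libraries add stop-word removal, stemming, and stronger ranking models on top of this basic select-and-rank structure, which is why the sumy example configures a stemmer and stop words before summarizing.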
