

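Automatic text summarization attracted attention as early as the 1950s. In the late 1950s, Hans Peter Luhn published a research paper titled "The Automatic Creation of Literature Abstracts", which used features such as word frequency and phrase frequency to extract the important sentences from a text for the summary.
Another important piece of research was done by Harold P. Edmundson in the late 1960s. It used cues such as the presence of cue words, words that appear in the document title, and sentence position to extract significant sentences for the summary. Since then, many important and exciting studies have been published to address the challenge of automatic text summarization.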









import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')  # one-time download; newer NLTK releases may also need nltk.download('punkt_tab')
import re


df = pd.read_csv("tennis_articles_v4.csv")


from nltk.tokenize import sent_tokenize

# split every article into sentences
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

# flatten the list of lists into a single list of sentences
sentences = [y for x in sentences for y in x]


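GloVe word embeddings are vector representations of words. These embeddings will be used to create vectors for our sentences. We could also use bag-of-words or TF-IDF to create features for the sentences, but those methods ignore word order (and the number of features is usually quite large). We will use the pre-trained Wikipedia 2014 + Gigaword 5 GloVe vectors; the download for these word embeddings is 822 MB.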
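To make the point about word order concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are made up for illustration and are not part of the tutorial's pipeline. Two sentences with opposite meanings end up with identical bag-of-words counts.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each lowercase token; word positions are discarded."""
    return Counter(sentence.lower().split())

# Hypothetical example sentences: same words, different order and meaning.
a = bag_of_words("Federer beat Nadal")
b = bag_of_words("Nadal beat Federer")

print(a == b)  # True: the two representations cannot be told apart
```

Keep in mind that the averaged word vectors built later in this tutorial are also insensitive to word order. What the embeddings add is a compact, dense representation (100 dimensions here) instead of one feature per vocabulary word.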

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip


# Extract word vectors
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # first token on the line is the word itself
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is its 100-d vector
        word_embeddings[word] = coefs



# remove punctuation, numbers and special characters
# (regex=True is required on recent pandas versions, where the default is a literal match)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# make all letters lowercase
clean_sentences = [s.lower() for s in clean_sentences]

nltk.download('stopwords')  # one-time download
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]


# sentence vector = average of the word vectors of the words in the sentence
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()]) / (len(i.split()) + 0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)


from sklearn.metrics.pairwise import cosine_similarity

# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 100), sentence_vectors[j].reshape(1, 100))[0, 0]
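The double loop calls `cosine_similarity` once per sentence pair, which gets slow as the number of sentences grows. As a rough sketch of an alternative (not the original author's code), the same matrix can be computed in one step with NumPy by normalizing each vector to unit length and taking dot products. The function name `cosine_similarity_matrix`, the `eps` guard, and the toy vectors are assumptions made for this illustration.

```python
import numpy as np

def cosine_similarity_matrix(vectors, eps=1e-10):
    """Pairwise cosine similarity with the diagonal set to zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # eps keeps all-zero vectors (empty sentences) from dividing by zero
    X_unit = X / np.maximum(norms, eps)
    sim = X_unit @ X_unit.T
    np.fill_diagonal(sim, 0.0)  # a sentence is not compared with itself
    return sim

# Toy 2-d vectors standing in for the 100-d sentence vectors.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_sim, 3))
```

In the tutorial's pipeline this would be called as `cosine_similarity_matrix(sentence_vectors)`. The first two toy vectors are orthogonal, so their similarity is 0, while each has a similarity of about 0.707 with the third.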



import networkx as nx

# build a graph from the similarity matrix and score each sentence with PageRank
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

# Summary Extraction
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

# Extract top 10 sentences as the summary
for i in range(10):
    print(ranked_sentences[i][1])


When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.
So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London next month.
He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the win on his first match point.
The Spaniard broke Anderson twice in the second but didn't get another chance on the South African's serve in the final set.
"We also had the impression that at this stage it might be better to play matches than to train.
The competition is set to feature 18 countries in the November 18-24 finals in Madrid next year, and will replace the classic home-and-away ties played four times per year for decades.
Federer said earlier this month in Shanghai in that his chances of playing the Davis Cup were all but non-existent.
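For readers curious about what the ranking step does, here is a simplified power-iteration sketch of weighted PageRank on a similarity matrix. It illustrates the idea only and is not networkx's actual implementation. The function name `pagerank_scores`, the damping value 0.85, the stopping rule, and the toy matrix are assumptions chosen for this example.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, tol=1e-8, max_iter=200):
    """Simplified weighted PageRank by power iteration (illustrative only)."""
    W = np.asarray(sim, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalize the weights; a row with no links spreads its score uniformly.
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1.0 - damping) / n + damping * (P.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Toy graph: sentence 0 is similar to both others, which are not similar to each other.
toy_sim_mat = np.array([[0.0, 1.0, 1.0],
                        [1.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0]])
toy_scores = pagerank_scores(toy_sim_mat)
print(toy_scores.argmax())  # 0: the best-connected sentence ranks first
```

A sentence scores highly when it is similar to many other sentences, and especially to other high-scoring ones. That is why the summary above tends to pick sentences that are representative of the articles as a whole.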





