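Experiment requirements
- Experiment content
- Experiment objective
- Experiment environment
Background and principles in detail
- Analysis of the principles:
  - 1. TF-IDF in detail
  - 2. The ranked retrieval model
  - 3. A related example
Experimental results
Dataset processing
Explanation of key code
- Similarity score computation
- Normalization
Code
Experimental data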

Experiment requirements

Experiment content

– Implement a basic ranked retrieval model on top of Experiment 1

• Input: a query (like Ron Weasley birthday)

• Output: Return the top K (e.g., K = 100) relevant tweets.

• Use SMART notation: lnc.ltn

• Document: logarithmic tf (l as first character), no idf and cosine normalization

• Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization

• Improve the inverted index

• Store each term's DF in the dictionary

• Store each term's TF per document in the posting list, as (docID, tf) pairs

• Optional

• Support all SMART notations
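As a rough illustration of the lnc.ltn scheme listed above, the sketch below weights one toy document and one toy query by hand and takes their dot product. The function names (`lnc_weights`, `ltn_weights`, `score`), the df table, and the collection size are invented for illustration, and the base-10 logarithm is an assumption; any fixed base produces a valid weighting.

```python
import math
from collections import Counter

def lnc_weights(doc_terms):
    """Document side (lnc): 1 + log10(tf), no idf, cosine normalization."""
    raw = {t: 1 + math.log10(tf) for t, tf in Counter(doc_terms).items()}
    length = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / length for t, w in raw.items()}

def ltn_weights(query_terms, df, n_docs):
    """Query side (ltn): (1 + log10(tf)) * log10(N / df), no normalization."""
    return {
        t: (1 + math.log10(tf)) * math.log10(n_docs / df[t])
        for t, tf in Counter(query_terms).items()
        if df.get(t, 0) > 0  # terms absent from the collection contribute nothing
    }

def score(doc_terms, query_terms, df, n_docs):
    """Dot product of the query vector and the document vector."""
    d = lnc_weights(doc_terms)
    q = ltn_weights(query_terms, df, n_docs)
    return sum(w * d.get(t, 0.0) for t, w in q.items())

# Hypothetical collection statistics, for illustration only.
df = {"ron": 10, "weasley": 10, "birthday": 100}
n_docs = 1000
doc = ["ron", "weasley", "birthday", "birthday"]
s = score(doc, ["ron", "weasley", "birthday"], df, n_docs)
print(s)
```

Because only the document side is normalized, the scores of different documents for the same query remain comparable, which is all that ranking needs.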

Experiment objective

Implement a basic ranked retrieval model.

Experiment environment

Windows 10, Spyder (Anaconda3)

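Background and principles in detail

Analysis of the principles:

1. TF-IDF in detail

TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF is the term frequency and IDF is the inverse document frequency. The main idea of TF-IDF: if a word or phrase occurs with high frequency (TF) in one document and rarely in other documents, it discriminates well between documents and is well suited for classification.

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

Here n_{i,j} is the number of times term i occurs in document j, and the denominator is the total number of occurrences of all terms in document j.
This formula gives the term frequency.

$$idf_{i} = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}$$

Here |D| is the total number of documents and the denominator is the number of documents that contain term i. This formula gives the inverse document frequency idf, and tf-idf = tf × idf. With tf and idf understood, the experiment is implemented in Python. Note that rare terms carry more information than common terms, so they receive a larger weight.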
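To make the two quantities concrete, here is a small sketch that computes tf, idf, and tf-idf over a toy collection. The three documents and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration, not part of the experiment code.

```python
import math
from collections import Counter

# Toy collection: docID -> list of tokens (hypothetical data).
docs = {
    "d1": ["ron", "weasley", "birthday", "ron"],
    "d2": ["harry", "potter", "birthday"],
    "d3": ["ron", "harry"],
}

def tf(term, doc_terms):
    """Occurrences of the term divided by the total number of terms in the document."""
    return Counter(doc_terms)[term] / len(doc_terms)

def idf(term, docs):
    """log(N / df); a term that occurs in no document gets 0.0 here by convention."""
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc_id, docs):
    return tf(term, docs[doc_id]) * idf(term, docs)

print(tf("ron", docs["d1"]))        # share of "ron" among the terms of d1
print(idf("weasley", docs))         # rare term: appears in one document
print(idf("ron", docs))             # more common term: appears in two documents
print(tf_idf("weasley", "d1", docs))
```

In this toy collection "weasley" occurs in one document and "ron" in two, so "weasley" receives the larger idf, matching the remark that rare terms carry more weight.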

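2. The ranked retrieval model

Now back to the task itself, implementing ranked retrieval. Why use ranked retrieval? Because in a ranked retrieval model:

• The system returns documents from the collection ordered by their relevance to the query, rather than simply returning the set of all documents that satisfy the query

• Free-text queries: the user's query is one or more words in natural language, not an expression built with a query language

• In principle, ranked retrieval can be combined with either Boolean queries or free-text queries, but in practice ranked retrieval models are almost always associated with free-text queries, and vice versa

The basis of ranked retrieval: scoring

• We want to return documents in order of their usefulness to the searcher

• How do we rank documents for a given query?

• Assign each query-document pair a score in [0,1]

• This score measures how well the document matches the query

Jaccard coefficient: a commonly used measure of the overlap of two sets. Jaccard(A,B) = |A ∩ B| / |A ∪ B|, Jaccard(A,A) = 1, and Jaccard(A,B) = 0 if A ∩ B = ∅. The sets A and B do not need to be the same size, and Jaccard(A,B) always lies in [0,1]. Here it can be used to score the similarity between the set of query terms and the set of document terms. Jaccard has shortcomings, however: it ignores how often a term occurs in a document, and it does not take into account that rare terms are more informative than common ones.

Finally, the final scores are normalized.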
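A minimal sketch of the Jaccard coefficient on token lists follows. The function name `jaccard` and the sample query and documents are illustrative, and returning 0.0 for two empty sets is a convention chosen here, since the ratio is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: treat two empty sets as having no overlap
    return len(a & b) / len(a | b)

query = ["ides", "of", "march"]
doc1 = ["caesar", "died", "in", "march"]
doc2 = ["caesar", "died", "in", "march", "march", "march"]

print(jaccard(query, doc1))
print(jaccard(query, doc2))  # same value: repeated terms do not change the sets
```

The two documents receive the same score even though the second mentions "march" three times, which illustrates why a set-based measure is too coarse for ranking.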

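3. A related example

Quoting a slide from the lecture: when the final score is computed with the cosine method, the situation looks like the figure below

One such example is the tf-idf example: lnc.ltc

Experimental results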

Dataset processing

• Store each term's DF in the dictionary

• Store each term's TF per document in the posting list, as (docID, tf) pairs
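The two bullets above can be sketched as follows. This is not the code used in the experiment (which stores normalized weights rather than raw tf); `build_index` and the toy documents are hypothetical names and data used only to show the data structure.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index from docs, a dict mapping docID -> list of tokens.

    Returns (dictionary, postings):
      dictionary: term -> DF (number of documents containing the term)
      postings:   term -> list of (docID, tf) pairs, sorted by docID
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, tf))
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)

toy_docs = {
    1: ["ron", "ron", "birthday"],
    2: ["birthday"],
}
dictionary, postings = build_index(toy_docs)
print(dictionary)
print(postings)
```

Storing DF in the dictionary means idf can be computed at query time without scanning the posting list, and keeping tf next to each docID avoids revisiting the original documents.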

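Explanation of key code

Similarity score computation

def similarity_NLP(query):
    # Query side of lnc.ltn: logarithmic tf times idf, no normalization
    t_num = {}
    for te in query:
        t_num[te] = t_num.get(te, 0) + 1
    for te in t_num:
        # df is the length of the posting list; unseen terms fall back to
        # doc_nums, which gives idf = log(1) = 0
        d_fre = len(postings.get(te, {})) or doc_nums
        t_num[te] = (math.log(t_num[te]) + 1) * math.log(doc_nums / d_fre)
    # Accumulate the dot product once per distinct query term,
    # starting from empty scores for every query
    score_tid = {}
    for te in t_num:
        if te in postings:
            for tid in postings[te]:
                score_tid[tid] = score_tid.get(tid, 0.0) + postings[te][tid] * t_num[te]
    # Sort in descending order so the best-matching documents come first
    similarity = sorted(score_tid.items(), key=lambda x: x[1], reverse=True)
    return similarity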

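Normalization

Normalization removes the outsized influence that some features would otherwise have, for example long documents with many terms. The core idea is to scale every weight by the reciprocal of the Euclidean length of the document vector:

$$nor = \frac{1}{\sqrt{\sum_{t} d\_num[t]^{2}}}$$

# Cosine normalization
nor = 0
for te in d_num.keys():
    nor = nor + d_num[te] ** 2
nor = 1.0 / math.sqrt(nor)
for te in d_num.keys():
    d_num[te] = d_num[te] * nor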
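For reference, a minimal standalone sketch of cosine normalization, assuming a document is represented as a dict of term weights. The helper name `cosine_normalize` and the sample weights are hypothetical. It divides each weight by the Euclidean length of the vector (the square root of the sum of squared weights), so the result has length 1.

```python
import math

def cosine_normalize(weights):
    """Scale a term -> weight dict so that its Euclidean length is 1."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)  # empty or all-zero vector: nothing to scale
    return {term: w / length for term, w in weights.items()}

vec = {"ron": 3.0, "birthday": 4.0}
unit = cosine_normalize(vec)
print(unit)
```

After normalization, the dot product of two such vectors equals the cosine of the angle between them, which is why document length no longer dominates the score.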

Code

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 16 08:02:19 2020

@author: bad-kids
"""
import sys
import math
from functools import reduce
from textblob import TextBlob
from textblob import Word
from collections import defaultdict

uselessTerm = ["username", "text", "tweetid"]
# term -> {tweetid: normalized log-tf weight}; df of a term is len(postings[term])
postings = defaultdict(dict)
doc_nums = 0  # number of documents


def tokenize_tweet(document):
    document = document.lower()  # lowercase the whole record
    a = document.index("username")  # position of each field name
    b = document.index("clusterno")
    c = document.rindex("tweetid") - 1
    d = document.rindex("errorcode")
    e = document.index("text")
    f = document.index("timestr") - 3  # the text field ends just before timestr
    # Keep the three main pieces of information: tweetid, username, tweet text
    document = document[c:d] + document[a:b] + document[e:f]
    terms = TextBlob(document).words.singularize()  # nouns to singular form
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")  # verbs to base form
        if expected_str not in uselessTerm:
            result.append(expected_str)
    return result


def token(doc):
    doc = doc.lower()
    terms = TextBlob(doc).words.singularize()
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result


def get_postings():
    global postings, doc_nums
    with open(r"E:\myClass2\data-mining\expriment\lab1\tweets.txt") as f:
        lines = f.readlines()  # read the whole file
    for line in lines:
        doc_nums += 1
        line = tokenize_tweet(line)
        tweetid = line[0]  # the first token is the tweetid; record it, then pop it
        line.pop(0)
        d_num = {}
        for te in line:
            d_num[te] = d_num.get(te, 0) + 1
        for te in d_num:
            d_num[te] = math.log(d_num[te]) + 1  # logarithmic tf
        # Cosine normalization
        nor = 0
        for te in d_num:
            nor = nor + d_num[te] ** 2
        if nor > 0:
            nor = 1.0 / math.sqrt(nor)
        for te in d_num:
            d_num[te] = d_num[te] * nor
        for te in d_num:
            postings[te][tweetid] = d_num[te]


def do_search():
    query = token(input("please input search query >> "))
    result = []  # ranked list of tweetids for the query
    if query == []:
        sys.exit()
    unique_query = set(query)
    # Use .get so that unseen query terms do not create empty posting lists
    relevant_tweetids = Union([set(postings.get(term, {}).keys()) for term in unique_query])
    print("score(NLP): Tweetid")
    print("NLP found " + str(len(relevant_tweetids)) + " relevant tweets in total!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    else:
        print("the top 50 tweets are:")
        # Rank the matching tweets by their lnc.ltn score
        scores = similarity_NLP(query)
        i = 1
        for (tid, score) in scores:
            if i <= 50:  # return the first n results
                result.append(tid)
                print(str(score) + ": " + tid)
                i = i + 1
            else:
                break
        print("finished")


# Compute the similarity scores
def similarity_NLP(query):
    t_num = {}
    for te in query:
        t_num[te] = t_num.get(te, 0) + 1
    for te in t_num:
        d_fre = len(postings.get(te, {})) or doc_nums
        t_num[te] = (math.log(t_num[te]) + 1) * math.log(doc_nums / d_fre)
    score_tid = {}  # reset for every query
    for te in t_num:
        if te in postings:
            for tid in postings[te]:
                score_tid[tid] = score_tid.get(tid, 0.0) + postings[te][tid] * t_num[te]
    similarity = sorted(score_tid.items(), key=lambda x: x[1], reverse=True)
    return similarity


def Union(sets):
    return reduce(set.union, [s for s in sets])


def main():
    get_postings()
    while True:
        do_search()


if __name__ == "__main__":
    main()

Experimental data

Experimental data: tweets

Information Retrieval Experiment 2: Ranked retrieval model