Hadoop Inverted Index

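An inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to the places where it occurs in a document or a set of documents, which provides a way to look up documents by their content. Because it does not go from a document to the content it contains, but performs the reverse operation, it is called an inverted index.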
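To make the word-to-document mapping concrete, here is a minimal in-memory sketch in plain Java, with no Hadoop involved. The class name `InMemoryInvertedIndex`, the method `build`, and the two sample documents are made up for illustration; they are not taken from the original post.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory illustration only; the file names and contents below are invented.
final class InMemoryInvertedIndex {

    // Builds: word -> (document name -> number of occurrences in that document).
    static Map<String, Map<String, Integer>> build(Map<String, String> documents) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            StringTokenizer tokens = new StringTokenizer(doc.getValue());
            while (tokens.hasMoreTokens()) {
                index.computeIfAbsent(tokens.nextToken(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new TreeMap<>();
        documents.put("file1.txt", "MapReduce is simple");
        documents.put("file2.txt", "MapReduce is powerful is simple");
        // Look up by word and get back the documents that contain it.
        System.out.println(build(documents).get("is")); // prints {file1.txt=1, file2.txt=2}
    }
}
```

The lookup goes from a word to documents, which is the reverse of reading a document to find its words.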

I. Example Description

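Put simply, an inverted index takes a word and returns the files in which it appears, together with how often it appears in each. This is similar to searching on Baidu: you enter a keyword, and the Baidu engine quickly finds the files on its servers that contain it, then returns results ranked by frequency and other strategies (such as page click voting). The inverted index plays a key role in this process.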

Sample input:

Sample output:

II. Design Approach

Building the inverted index involves three phases: Map, Combine, and Reduce.

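Map phase:

After the documents to be processed have been uploaded to HDFS, the default TextInputFormat class reads each input file and produces one key-value pair per line, <byte offset, line content>, which becomes the input of map. When overriding the map function, we have to decide how to design the key and the value so that they fit the MapReduce framework and produce the correct result. We need three pieces of information (the word, the URL of the document it belongs to, and the word frequency), but a <key, value> pair holds only two values, so two of them must be merged. Here we choose key = word + URL and value = frequency, so the map output is <word + URL, frequency>. (In the code, the file name stands in for the URL and each occurrence is emitted with a frequency of 1.) Using word + URL as the key takes advantage of the map-side sorting built into the MapReduce framework, which brings all occurrences of the same word in the same document together.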
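The key design described above can be tried out without a cluster. The following is a plain-Java stand-in for the map step, assuming a tab character as the separator between word and file name (as the job code does). `MapStepSketch` and its `map` method are hypothetical names, and no Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; names and sample data are hypothetical.
final class MapStepSketch {

    // Returns {key, value} pairs where key = word + TAB + fileName and value = "1".
    static List<String[]> map(String fileName, String line) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new String[] {tokens.nextToken() + "\t" + fileName, "1"});
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each emitted pair, showing the TAB separator as '+' for readability.
        for (String[] kv : map("file1.txt", "MapReduce is simple is")) {
            System.out.println("<" + kv[0].replace('\t', '+') + ", " + kv[1] + ">");
        }
    }
}
```

Each occurrence of a word yields its own pair, so a word that appears twice in a line is emitted twice; the counting is left to the next phase.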

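Combine phase:

The Combine phase adds up the values that share the same key, which gives the frequency of a word in one document. It then changes the key: so that all entries for the same word are handled by the same reduce call, its output is designed as key = word and value = URL + frequency. Note that Hadoop treats the combiner as an optional optimization and may run it zero, one, or several times, so this re-keying step relies on the combiner running exactly once on each map output.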
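As a rough illustration of the summing and re-keying, the sketch below applies the combine step to records that have already been grouped by their composite key. It is plain Java with hypothetical names (`CombineStepSketch`, `combine`) and invented sample data, and it assumes every key has the form word + TAB + fileName.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the combine step; names and sample data are hypothetical.
final class CombineStepSketch {

    // Input: composite key (word + TAB + fileName) -> the "1" values emitted for it.
    // Output: {word, "fileName:count"} pairs, i.e. the re-keyed records.
    static List<String[]> combine(Map<String, List<String>> grouped) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String[] parts = e.getKey().split("\t");
            long count = 0;
            for (String v : e.getValue()) {
                count += Long.parseLong(v); // sum the partial counts
            }
            out.add(new String[] {parts[0], parts[1] + ":" + count});
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = new TreeMap<>();
        grouped.put("is\tfile1.txt", Arrays.asList("1", "1"));
        grouped.put("simple\tfile1.txt", Arrays.asList("1"));
        for (String[] kv : combine(grouped)) {
            System.out.println("<" + kv[0] + ", " + kv[1] + ">");
        }
    }
}
```

Summing the parsed values, rather than counting how many values arrive, keeps the result correct even if some of the inputs are already partial sums.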

Reduce phase:

The Reduce phase is simply a merge: the values that share the same key are joined into the format the inverted index requires.
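The merge can be sketched in a few lines of plain Java. The semicolon separator follows the job code in this post; the names `ReduceStepSketch` and `reduce` and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the reduce step; names and sample data are hypothetical.
final class ReduceStepSketch {

    // Joins all "fileName:count" values of one word into a single posting list,
    // terminating every entry with ';'.
    static String reduce(List<String> postings) {
        StringBuilder sb = new StringBuilder();
        for (String p : postings) {
            sb.append(p).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> postings = Arrays.asList("file1.txt:2", "file2.txt:1");
        // One output line per word: the word, a TAB, then its posting list.
        System.out.println("is" + "\t" + reduce(postings));
    }
}
```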

III. Program Code

The program code is as follows:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    // Map phase: for every token in a line, emit <word + TAB + fileName, "1">.
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private final Text word = new Text();
        private final Text one = new Text("1");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Name of the file that the current input split belongs to.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer st = new StringTokenizer(value.toString());
            while (st.hasMoreTokens()) {
                word.set(st.nextToken() + "\t" + fileName);
                context.write(word, one);
            }
        }
    }

    // Combine phase: sum the counts for one <word, fileName> pair, then re-key the
    // record as <word, fileName:count>. Caution: this design assumes the combiner runs
    // exactly once on each map output; Hadoop treats combiners as optional and does
    // not guarantee that.
    public static class Combine extends Reducer<Text, Text, Text, Text> {
        private final Text word = new Text();
        private final Text index = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] splits = key.toString().split("\t");
            if (splits.length != 2) {
                // Not a word + TAB + fileName key (for example, already re-keyed):
                // pass the records through unchanged instead of dropping them.
                for (Text v : values) {
                    context.write(key, v);
                }
                return;
            }
            long count = 0;
            for (Text v : values) {
                count += Long.parseLong(v.toString());
            }
            word.set(splits[0]);
            index.set(splits[1] + ":" + count);
            context.write(word, index);
        }
    }

    // Reduce phase: join all fileName:count values of a word into one posting list.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        private final Text index = new Text();

        @Override
        protected void reduce(Text word, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sub = new StringBuilder(256);
            for (Text v : values) {
                sub.append(v.toString()).append(";");
            }
            index.set(sub.toString());
            context.write(word, index);
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: InvertedIndex <in> <out>");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated Job constructor (Hadoop 2.x and later).
        Job job = Job.getInstance(conf, "Inverted Index");
        job.setJarByClass(InvertedIndex.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Use the arguments left over after the generic Hadoop options have been parsed.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
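One point worth keeping in mind when running the job above with more than one reduce task: it sets no partitioner, so Hadoop's default hash partitioning is applied to the whole map output key (word + TAB + fileName), and this happens before the combiner re-keys the record. With a single reduce task, which is the default, every word still arrives at the same reducer, but with several reduce tasks the entries for one word could be spread across them. A possible remedy, sketched here in plain Java, is to derive the partition from the word part of the key only. `WordPartitionSketch` and `partitionFor` are hypothetical names; in a real job the same logic would go into a subclass of `org.apache.hadoop.mapreduce.Partitioner` registered with `job.setPartitionerClass`, and it would hash the `Text` key rather than a `String`.

```java
// Plain-Java sketch of partitioning by word only; names are hypothetical.
final class WordPartitionSketch {

    // Same arithmetic as hash partitioning, (hash & Integer.MAX_VALUE) % numPartitions,
    // but applied to the word part of the composite key instead of the whole key.
    static int partitionFor(String compositeKey, int numPartitions) {
        int tab = compositeKey.indexOf('\t');
        String word = tab >= 0 ? compositeKey.substring(0, tab) : compositeKey;
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same word from two different files is assigned to the same partition.
        System.out.println(partitionFor("is\tfile1.txt", 4) == partitionFor("is\tfile2.txt", 4));
    }
}
```

Because the partition no longer depends on the file name, all records for a word are routed to the same reduce task regardless of which document they came from.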

Reposted from: https://www.cnblogs.com/xiaoyh/p/9361356.html
