MapReduce Programming: Word Deduplication

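In MR programming, the most typical tasks are implementing sum, max, min, avg, distinct, group by, and join. Whatever the task, the MapReduce framework dictates that the key-value pairs produced in the mapper phase are grouped by key. The reduce method of a reduce task therefore receives one group of key-value pairs that share the same key, and aggregates that group according to the business logic. Take distinct as an example: to deduplicate on a field, use that field as the key, and in the reducer phase emit exactly one key-value pair per group.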
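The grouping idea can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class `DistinctSketch` and its methods are hypothetical names, a `TreeMap` stands in for the framework's sort-and-group step, and the group size stands in for the list of values a real reducer would receive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map -> group-by-key -> reduce flow; no Hadoop involved.
final class DistinctSketch {

    // "Map" step: emit every word of a line as a key (the value carries no information).
    static List<String> map(String line) {
        List<String> keys = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                keys.add(word);
            }
        }
        return keys;
    }

    // "Shuffle" step: collect equal keys into one group; TreeMap keeps the keys sorted.
    // The Integer is the group size, a stand-in for the values of that group.
    static Map<String, Integer> groupByKey(List<String> lines) {
        Map<String, Integer> groups = new TreeMap<>();
        for (String line : lines) {
            for (String key : map(line)) {
                groups.merge(key, 1, Integer::sum);
            }
        }
        return groups;
    }

    // "Reduce" step: each group is visited once, and its key is emitted once.
    static List<String> distinct(List<String> lines) {
        return new ArrayList<>(groupByKey(lines).keySet());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello huangbo", "hello xuzheng", "hello hi");
        System.out.println(distinct(lines)); // prints [hello, hi, huangbo, xuzheng]
    }
}
```

However many times a word occurs, it ends up in a single group, so emitting the key once per group is all that deduplication needs.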

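A simple word deduplication job serves as the example.

Here is the source code. Part of the explanation is in the comments, so read them closely: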

package com.ghgj.mazh.mapreduce.distinct;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Author: 马中华: http://blog.csdn.net/zhongqi2513
 * Date: 2017-10-25, 12:34:25 PM
 *
 * Description: word deduplication
 */
public class DistinctWordMR {

    public static void main(String[] args) throws Exception {
        // Set the HDFS-related parameters
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop06:9000");
        System.setProperty("HADOOP_USER_NAME", "hadoop");

        Job job = Job.getInstance(conf);

        // Set the class used to locate the job jar
        job.setJarByClass(DistinctWordMR.class);

        // Set the mapper and reducer classes
        job.setMapperClass(DistinctWordMRMapper.class);
        job.setReducerClass(DistinctWordMRReducer.class);

        // Set the output types of the map task
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // Set the output types of the reduce task
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Set the input and output paths of this MapReduce job
        // Path inputPath = new Path("d:/wordcount/input");
        // Path outputPath = new Path("d:/wordcount/output");
        Path inputPath = new Path("/wc/input");
        Path outputPath = new Path("/wc/output");

        // Delete the output directory if it already exists; otherwise the job fails
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        // Finally, submit the job and wait for it to finish
        boolean waitForCompletion = job.waitForCompletion(true);
        System.exit(waitForCompletion ? 0 : 1);
    }

    /**
     * Author: 马中华: http://blog.csdn.net/zhongqi2513
     * Date: 2017-10-25, 12:39:34 PM
     *
     * Description: the mapper of the word deduplication job. It reads the file and splits each line into words.
     */
    private static class DistinctWordMRMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private Text outkey = new Text();

        /**
         * For word deduplication it is enough to emit the word as the key; no value is needed.
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] split = value.toString().split(" ");
            for (String word : split) {
                outkey.set(word);
                context.write(outkey, NullWritable.get());
            }
        }
    }

    /**
     * Author: 马中华: http://blog.csdn.net/zhongqi2513
     * Date: 2017-10-25, 12:39:20 PM
     *
     * Description: the reducer of the word deduplication job
     */
    private static class DistinctWordMRReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            /**
             * Each call to reduce receives one group of identical words. Because the task is
             * deduplication, writing the key once is enough: one word is kept per group,
             * which is exactly the deduplicated result.
             */
            context.write(key, NullWritable.get());
        }
    }
}

Here is the input data the program receives:

hello huangbo
hello xuzheng
hello wangbaoqiang
one two three four five
one two three four
one two three
one two
hello hi

Here is the result data the program outputs:

five
four
hello
hi
huangbo
one
three
two
wangbaoqiang
xuzheng

The output above leads to the following conclusions:

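1. The MapReduce framework sorts the key-value pairs output by the mapper phase. It sorts by key only, never by value, and by default uses the key's natural order (for Text keys, byte-wise lexicographic order).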

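2. If a MapReduce job has no reducer phase (the number of reduce tasks is set to 0), the shuffle between mapper and reducer does not take place, so nothing is sorted. In other words, whenever an MR job has a reducer phase, it sorts by key.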
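The two conclusions can be contrasted with a small plain-Java model. This is only a sketch of the observable ordering, not Hadoop code: `SortOrderSketch`, `mapOnly`, and `withReduce` are hypothetical names, a `TreeSet` imitates sort-and-group, and the input is assumed to be a single split. In a real job, the map-only case corresponds to calling `job.setNumReduceTasks(0)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Plain-Java model of the two cases; it imitates the resulting order only, not Hadoop itself.
final class SortOrderSketch {

    // Map-only case: every emitted key is written as-is, in input order, duplicates included.
    static List<String> mapOnly(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.addAll(Arrays.asList(line.split(" ")));
        }
        return out;
    }

    // With a reduce phase: keys are sorted and grouped first, then each group emits its key once.
    static List<String> withReduce(List<String> lines) {
        return new ArrayList<>(new TreeSet<>(mapOnly(lines)));
    }

    public static void main(String[] args) {
        // The same eight lines as the sample input shown earlier.
        List<String> lines = Arrays.asList(
                "hello huangbo", "hello xuzheng", "hello wangbaoqiang",
                "one two three four five", "one two three four",
                "one two three", "one two", "hello hi");
        System.out.println(mapOnly(lines));    // unsorted, with duplicates
        System.out.println(withReduce(lines)); // sorted and deduplicated
    }
}
```

With the sample input, `withReduce` returns the ten words in the same order as the job output listed above, while `mapOnly` simply replays all 22 words in the order they were read.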

Question: what if the field to sort on is in the value? Since the MR programming model sorts only by key, how can that be done?
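One common answer, offered here as a sketch rather than as the author's solution, is to move the field into the key so that the framework's key sort does the work. The plain-Java example below shows only the comparison logic: `SortByValueSketch` and `WordCountKey` are hypothetical names and the (word, count) records are illustrative. In a real Hadoop job the key class would additionally implement `WritableComparable` so that it can be serialized; that part is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SortByValueSketch {

    // Hypothetical composite key: the field to sort on (count) is carried inside the key.
    static final class WordCountKey implements Comparable<WordCountKey> {
        final String word;
        final int count;

        WordCountKey(String word, int count) {
            this.word = word;
            this.count = count;
        }

        // Order by count descending; break ties by word ascending.
        @Override
        public int compareTo(WordCountKey other) {
            int byCount = Integer.compare(other.count, this.count);
            return byCount != 0 ? byCount : this.word.compareTo(other.word);
        }

        @Override
        public String toString() {
            return word + "\t" + count;
        }
    }

    // Stand-in for the framework's key sort: returns a sorted copy of the keys.
    static List<WordCountKey> sortedKeys(List<WordCountKey> keys) {
        List<WordCountKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<WordCountKey> keys = Arrays.asList(
                new WordCountKey("five", 1),
                new WordCountKey("one", 4),
                new WordCountKey("three", 3),
                new WordCountKey("hello", 4));
        for (WordCountKey key : sortedKeys(keys)) {
            System.out.println(key);
        }
    }
}
```

Because the sort is driven entirely by `compareTo`, changing the order (ascending, or by several fields) only requires changing that one method.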
