MapReduce: given a child-parent table, output the grandchild-grandparent table

A single-table join example using MapReduce on Hadoop:

Given a child-parent table, the task is to output the grandchild-grandparent table.

Input table (each line holds a child and one of its parents, separated by a space):

Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack Jesse
Terry Alice
Terry Jesse
Philip Terry
Philip Alma
Mark Terry
Mark Alma

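Expected result: every (grandchild, grandparent) pair that can be derived from this table.

Design idea: treat the single table as two copies of itself, a left table and a right table, each with the columns child and parent.

Join the parent column of the left table with the child column of the right table. Dropping the two joined columns from the join result leaves exactly the required grandchild-grandparent table.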
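The join described above can be sketched in plain Java, without Hadoop, to make the join condition concrete. This is only an in-memory illustration; the class name SelfJoinSketch and the method join are hypothetical and not part of the original job.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of the self-join; no Hadoop involved.
class SelfJoinSketch {

    // Each row is {child, parent}. The same rows act as both the left and the right table.
    static List<String> join(String[][] rows) {
        List<String> result = new ArrayList<>();
        for (String[] left : rows) {        // left table row: (child, parent)
            for (String[] right : rows) {   // right table row: (child, parent)
                // Join condition: left.parent == right.child
                if (left[1].equals(right[0])) {
                    // Drop the two joined columns; keep grandchild and grandparent.
                    result.add(left[0] + " " + right[1]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"Tom", "Lucy"}, {"Tom", "Jack"}, {"Jone", "Lucy"}, {"Jone", "Jack"},
            {"Lucy", "Mary"}, {"Lucy", "Ben"}, {"Jack", "Alice"}, {"Jack", "Jesse"},
            {"Terry", "Alice"}, {"Terry", "Jesse"}, {"Philip", "Terry"}, {"Philip", "Alma"},
            {"Mark", "Terry"}, {"Mark", "Alma"}
        };
        for (String pair : join(rows)) {
            System.out.println(pair);
        }
    }
}
```

On the sample table this nested loop finds 12 pairs (four each for Tom and Jone, two each for Philip and Mark). The MapReduce version produces the same pairs but lets the shuffle do the matching instead of comparing every row with every other row.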

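Because the shuffle phase of MapReduce groups all records that share a key, the map phase splits each input line into child and parent and emits it twice:

Left table: the parent is the key and the child is the value.

Right table: the child is the key and the parent is the value.

To tell the two tables apart later, each output value also carries a table tag: for example, the character 1 at the start of the value marks the left table and the character 2 marks the right table.

The map output therefore contains both the left and the right table, and the shuffle phase performs the join.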
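The tagging step can be sketched as a plain Java function with no Hadoop types. It assumes the value layout tag+child+parent used by the job's mapper; TagSketch and mapLine are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the map-side tagging, written without Hadoop types.
class TagSketch {

    // Returns {key, value} pairs for one input line of the form "child parent".
    static List<String[]> mapLine(String line) {
        List<String[]> out = new ArrayList<>();
        String[] words = line.trim().split("\\s+");
        if (words.length < 2) {
            return out; // blank or malformed line: emit nothing
        }
        String child = words[0];
        String parent = words[1];
        // Left table: key = parent, value tagged with 1
        out.add(new String[] {parent, "1+" + child + "+" + parent});
        // Right table: key = child, value tagged with 2
        out.add(new String[] {child, "2+" + child + "+" + parent});
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : mapLine("Tom Lucy")) {
            System.out.println("<" + kv[0] + ", " + kv[1] + ">");
        }
    }
}
```

For the line Tom Lucy this yields the left-table record keyed by Lucy and the right-table record keyed by Tom, so every person ends up as a key for both the rows where they are a parent and the rows where they are a child.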

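For each key, reduce receives the joined values and iterates over them. Children from the left table go into a grandchild list and parents from the right table go into a grandparent list; the Cartesian product of the two lists is the final result for that key.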
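As a rough illustration of that reduce step, the following plain-Java sketch processes the values of a single key. It assumes values of the form tag+child+parent; ReduceSketch and its reduce method are hypothetical names, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reduce-side logic for the values of one key.
class ReduceSketch {

    static List<String> reduce(List<String> values) {
        List<String> grandchildren = new ArrayList<>();
        List<String> grandparents = new ArrayList<>();
        for (String value : values) {
            // "+" is a regex metacharacter, so it is wrapped in a character class.
            String[] parts = value.split("[+]");
            if (parts[0].equals("1")) {
                grandchildren.add(parts[1]);   // left table: keep the child
            } else if (parts[0].equals("2")) {
                grandparents.add(parts[2]);    // right table: keep the parent
            }
        }
        // Cartesian product; empty if either list is empty.
        List<String> out = new ArrayList<>();
        for (String gc : grandchildren) {
            for (String gp : grandparents) {
                out.add(gc + " " + gp);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> jack = List.of("1+Tom+Jack", "1+Jone+Jack", "2+Jack+Alice", "2+Jack+Jesse");
        reduce(jack).forEach(System.out::println);
    }
}
```

A key whose values all carry the same tag, such as Tom with only right-table records, produces an empty product and therefore no output.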

The code is as follows (it is not guaranteed to be fully correct; corrections are welcome if you spot a mistake):

package com;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TestParents {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration config = new Configuration();
        config.set("fs.defaultFS", "hdfs://192.168.0.100:9000");
        config.set("yarn.resourcemanager.hostname", "192.168.0.100");
        FileSystem fs = FileSystem.get(config);

        Job job = Job.getInstance(config);
        job.setJarByClass(TestParents.class);

        // Mapper class and its output types
        job.setMapperClass(myMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Reducer class and its output types
        job.setReducerClass(myReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input path; the output directory is deleted first if it already exists
        FileInputFormat.addInputPath(job, new Path("/input/parent.txt"));
        Path path = new Path("/output3/");
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        // Submit the job and wait for it to finish
        boolean completion = job.waitForCompletion(true);
        if (completion) {
            System.out.println("Job Success!");
        }
    }

    public static class myMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Input line, e.g. "Tom Lucy" (child, then parent)
            String[] words = value.toString().trim().split("\\s+");
            if (words.length < 2) {
                return; // skip blank or malformed lines
            }
            // Left table, tag 1: key = parent, e.g. (Lucy, 1+Tom+Lucy)
            context.write(new Text(words[1]), new Text("1+" + words[0] + "+" + words[1]));
            // Right table, tag 2: key = child, e.g. (Tom, 2+Tom+Lucy)
            context.write(new Text(words[0]), new Text("2+" + words[0] + "+" + words[1]));
        }
    }

    public static class myReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            List<String> grandchild = new ArrayList<String>();
            List<String> grandparent = new ArrayList<String>();
            for (Text value : values) {
                // "1+Tom+Lucy" -> ["1", "Tom", "Lucy"]. Characters such as +, * and |
                // are regex metacharacters, so wrap them in [] or escape them with \\.
                String[] words = value.toString().split("[+]");
                if ("1".equals(words[0])) {
                    grandchild.add(words[1]);   // left table: child
                } else if ("2".equals(words[0])) {
                    grandparent.add(words[2]);  // right table: parent
                }
            }
            // Cartesian product of the two lists
            for (String gc : grandchild) {
                for (String gp : grandparent) {
                    context.write(new Text(gc), new Text(gp));
                }
            }
        }
    }
}

Execution walkthrough:

(1) Map phase:

Input line      Map output
Tom Lucy        <Lucy, 1+Tom+Lucy> <Tom, 2+Tom+Lucy>
Tom Jack        <Jack, 1+Tom+Jack> <Tom, 2+Tom+Jack>
Jone Lucy       <Lucy, 1+Jone+Lucy> <Jone, 2+Jone+Lucy>
Jone Jack       <Jack, 1+Jone+Jack> <Jone, 2+Jone+Jack>
Lucy Mary       <Mary, 1+Lucy+Mary> <Lucy, 2+Lucy+Mary>
Lucy Ben        <Ben, 1+Lucy+Ben> <Lucy, 2+Lucy+Ben>
Jack Alice      <Alice, 1+Jack+Alice> <Jack, 2+Jack+Alice>
Jack Jesse      <Jesse, 1+Jack+Jesse> <Jack, 2+Jack+Jesse>
Terry Alice     <Alice, 1+Terry+Alice> <Terry, 2+Terry+Alice>
Terry Jesse     <Jesse, 1+Terry+Jesse> <Terry, 2+Terry+Jesse>
Philip Terry    <Terry, 1+Philip+Terry> <Philip, 2+Philip+Terry>
Philip Alma     <Alma, 1+Philip+Alma> <Philip, 2+Philip+Alma>
Mark Terry      <Terry, 1+Mark+Terry> <Mark, 2+Mark+Terry>
Mark Alma       <Alma, 1+Mark+Alma> <Mark, 2+Mark+Alma>

(2) Shuffle phase:

Map output:

<Lucy, 1+Tom+Lucy>
<Tom, 2+Tom+Lucy>
<Jack, 1+Tom+Jack>
<Tom, 2+Tom+Jack>
<Lucy, 1+Jone+Lucy>
<Jone, 2+Jone+Lucy>
<Jack, 1+Jone+Jack>
<Jone, 2+Jone+Jack>
<Mary, 1+Lucy+Mary>
<Lucy, 2+Lucy+Mary>
<Ben, 1+Lucy+Ben>
<Lucy, 2+Lucy+Ben>
<Alice, 1+Jack+Alice>
<Jack, 2+Jack+Alice>
<Jesse, 1+Jack+Jesse>
<Jack, 2+Jack+Jesse>
<Alice, 1+Terry+Alice>
<Terry, 2+Terry+Alice>
<Jesse, 1+Terry+Jesse>
<Terry, 2+Terry+Jesse>
<Terry, 1+Philip+Terry>
<Philip, 2+Philip+Terry>
<Alma, 1+Philip+Alma>
<Philip, 2+Philip+Alma>
<Terry, 1+Mark+Terry>
<Mark, 2+Mark+Terry>
<Alma, 1+Mark+Alma>
<Mark, 2+Mark+Alma>

Sorted by key:

<Alice, 1+Jack+Alice>
<Alice, 1+Terry+Alice>
<Alma, 1+Philip+Alma>
<Alma, 1+Mark+Alma>
<Ben, 1+Lucy+Ben>
<Jack, 1+Tom+Jack>
<Jack, 1+Jone+Jack>
<Jack, 2+Jack+Alice>
<Jack, 2+Jack+Jesse>
<Jesse, 1+Jack+Jesse>
<Jesse, 1+Terry+Jesse>
<Jone, 2+Jone+Lucy>
<Jone, 2+Jone+Jack>
<Lucy, 1+Tom+Lucy>
<Lucy, 1+Jone+Lucy>
<Lucy, 2+Lucy+Mary>
<Lucy, 2+Lucy+Ben>
<Mark, 2+Mark+Terry>
<Mark, 2+Mark+Alma>
<Mary, 1+Lucy+Mary>
<Philip, 2+Philip+Terry>
<Philip, 2+Philip+Alma>
<Terry, 2+Terry+Alice>
<Terry, 2+Terry+Jesse>
<Terry, 1+Philip+Terry>
<Terry, 1+Mark+Terry>
<Tom, 2+Tom+Lucy>
<Tom, 2+Tom+Jack>

Grouped by key (the shuffle join):

<Alice, 1+Jack+Alice, 1+Terry+Alice>
<Alma, 1+Philip+Alma, 1+Mark+Alma>
<Ben, 1+Lucy+Ben>
<Jack, 1+Tom+Jack, 1+Jone+Jack, 2+Jack+Alice, 2+Jack+Jesse>
<Jesse, 1+Jack+Jesse, 1+Terry+Jesse>
<Jone, 2+Jone+Lucy, 2+Jone+Jack>
<Lucy, 1+Tom+Lucy, 1+Jone+Lucy, 2+Lucy+Mary, 2+Lucy+Ben>
<Mark, 2+Mark+Terry, 2+Mark+Alma>
<Mary, 1+Lucy+Mary>
<Philip, 2+Philip+Terry, 2+Philip+Alma>
<Terry, 2+Terry+Alice, 2+Terry+Jesse, 1+Philip+Terry, 1+Mark+Terry>
<Tom, 2+Tom+Lucy, 2+Tom+Jack>
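The sort-and-group behaviour of the shuffle can be imitated in memory with a TreeMap. This is only a sketch: the real Hadoop shuffle is distributed across partitions and makes no promise about the order of values within a key. ShuffleSketch and shuffle are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of "sort by key, then group values per key".
class ShuffleSketch {

    // Each element of mapOutput is a {key, value} pair as emitted by the map phase.
    static Map<String, List<String>> shuffle(List<String[]> mapOutput) {
        // TreeMap keeps keys sorted, mirroring the sort step of the shuffle.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[] {"Lucy", "1+Tom+Lucy"});
        mapOutput.add(new String[] {"Tom", "2+Tom+Lucy"});
        mapOutput.add(new String[] {"Mary", "1+Lucy+Mary"});
        mapOutput.add(new String[] {"Lucy", "2+Lucy+Mary"});
        shuffle(mapOutput).forEach((key, values) -> System.out.println("<" + key + ", " + values + ">"));
    }
}
```

Each entry of the resulting map corresponds to one reduce call: the key plus all tagged values that were emitted under it.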

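(3) Reduce phase:

Take the values for the key Jack: (Jack, {1+Tom+Jack}, {1+Jone+Jack}, {2+Jack+Alice}, {2+Jack+Jesse}). Iterating over them yields one value at a time, for example 1+Tom+Jack.

Depending on whether the tag is 1 or 2, the name is added to the grandchild list or to the grandparent list.

Finally, the nested loop

for (String gc : grandchild) {
    for (String gp : grandparent) {
        context.write(new Text(gc), new Text(gp));
    }
}

writes the output. If either the grandchild list or the grandparent list is empty, the loop body never runs and nothing is written for that key. This rule discards the shuffle groups that contain no real join (here every key except Jack, Lucy and Terry) and leaves exactly the final result.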

Reposted from: https://www.cnblogs.com/supiaopiao/p/7244007.html
