Data set download: ratings and trust

Assignment requirements

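ratings_data: users rate items
Fields:
Field 1: user id
Field 2: item id
Field 3: the score the user gave the item (level of interest)
1. For each item, list the users who follow it
2. For each item, compute its total score
trust_data: users rate other users
Field 1: source user id
Field 2: target user id
Field 3: the score the first user gave the second user (trust level)
1. Compute each user's score
2. Sort by score and find the user with the highest score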

Problem 1

Part 1

The core idea is still MapReduce's first example, WordCount; only the mapper and reducer need to change.
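To make the idea concrete before the Hadoop code, here is a minimal local sketch of the same grouping logic in plain Java, with no Hadoop involved. The class name `ItemFollowersSketch`, the method `followersByItem`, and the sample records are made up for illustration; it assumes each line holds three whitespace-separated integer fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local, Hadoop-free sketch of the Part 1 logic (hypothetical class name).
class ItemFollowersSketch {

    // "Map" step: each "userId itemId score" line yields the pair (itemId, userId).
    // "Reduce" step: all user ids that share an item id are collected into one list.
    static Map<String, List<Integer>> followersByItem(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                continue; // skip blank or malformed lines
            }
            int userId = Integer.parseInt(fields[0]);
            String itemId = fields[1];
            grouped.computeIfAbsent(itemId, k -> new ArrayList<>()).add(userId);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the real data set.
        List<String> sample = Arrays.asList("1 100 4", "2 100 5", "2 200 3");
        followersByItem(sample).forEach((item, users) ->
                System.out.println(item + ":" + users + " " + users.size()));
    }
}
```

In the real job, the shuffle phase does the grouping that the `TreeMap` does here, so the reducer only has to join the user ids and count them.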

job


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RTJob {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job jsip = Job.getInstance(conf);
        jsip.setJarByClass(RTJob.class);        /* job class */
        jsip.setMapperClass(RTMapper.class);    /* mapper class */
        jsip.setReducerClass(RTReducer.class);  /* reducer class */
        /* map out */
        jsip.setMapOutputKeyClass(Text.class);           /* Key */
        jsip.setMapOutputValueClass(IntWritable.class);  /* Value */
        /* out */
        jsip.setOutputKeyClass(Text.class);
        jsip.setOutputValueClass(IntWritable.class);
        /* data file path */
        FileInputFormat.setInputPaths(jsip, args[0]);             /* input data path */
        FileOutputFormat.setOutputPath(jsip, new Path(args[1]));  /* output data path */
        jsip.waitForCompletion(true);
    }
}

mapper

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RTMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // One record per line: "userId itemId score", separated by single spaces
    private static final Pattern RECORD = Pattern.compile("([\\d.]+) (\\d+) (\\d+)");

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Matcher matcher = RECORD.matcher(value.toString());
        if (matcher.find()) {
            String item = matcher.group(2);                 // item id becomes the output key
            int user = Integer.parseInt(matcher.group(1));  // user id becomes the output value
            context.write(new Text(item + ":"), new IntWritable(user));
        }
    }
}

reducer


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
 * KEYIN: the key type emitted by the mapper
 * VALUEIN: the value type emitted by the mapper
 * KEYOUT: the key type of the key-value pairs the reducer emits
 * VALUEOUT: the value type of the key-value pairs the reducer emits
 */
public class RTReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    /*
     * reduce() is called by the reduce task process.
     *
     * The reduce task aggregates the many key-value pairs delivered by the shuffle phase,
     * grouping pairs that share the same key, and then calls our reduce() once per group.
     * For example, given <hello,1><hello,1><hello,1><tom,1><tom,1><tom,1>,
     * reduce() is called once for the hello group and once for the tom group.
     * Arguments:
     *   key: the key shared by the group
     *   values: an iterable over all values in the group
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Walk the iterable to visit every value in this group
        String ratings = "[";
        int count = 0;
        for (IntWritable value : values) {
            count++;
            if (count == 1) {
                ratings = ratings + value.get();
            } else {
                ratings = ratings + "," + value.get();
            }
        }
        ratings = key + ratings + "]";
        // Emit key:[value1,value2,...,valueN] count
        context.write(new Text(ratings), new IntWritable(count));
    }
}

Here is part of the output:

100036:[4295] 1
100037:[4295] 1
100038:[4295] 1
100039:[4295] 1
10003:[115] 1
100040:[4295] 1
100041:[4295] 1
100042:[4295] 1
100043:[4296,6596] 2
100044:[4297] 1
100045:[7585,7362,12497,4297] 4
100046:[4299] 1
100047:[6192,15536,4300,12207,4686,10058] 6
100048:[4300] 1
100049:[13452,42119,24010,34686,24249,25568,24504,16909,12497,10262,4300] 11
10004:[6760,115,23529,776,1795] 5
100050:[4301] 1
100051:[4301] 1
100052:[4586,30962,15266,4301,23094] 5
100053:[31816,14562,28782,8806,21679,26823,10205,44326,4301] 9

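Part 2

Part 2 is a simple aggregation, just like counting words.
The job and reducer from WordCount can be reused as they are, so I won't paste their source here; see my earlier post "MapReduce Programming Example: WordCount" for the code.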
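As a rough illustration of what the reused WordCount-style reducer computes here, the sketch below sums the scores per item in plain Java. It is a local stand-in, not the Hadoop code; `ItemScoreSketch`, `totalScoreByItem`, and the sample records are hypothetical, and it assumes the reducer adds up its values the way WordCount does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local, Hadoop-free sketch of the Part 2 logic (hypothetical class name).
class ItemScoreSketch {

    // Emits (itemId, score) for every line and sums the scores per item id,
    // which is what a summing reducer would do with the grouped values.
    static Map<String, Integer> totalScoreByItem(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                continue; // skip blank or malformed lines
            }
            totals.merge(fields[1], Integer.parseInt(fields[2]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the real data set.
        List<String> sample = Arrays.asList("1 100 4", "2 100 5", "2 200 3");
        totalScoreByItem(sample).forEach((item, total) ->
                System.out.println(item + " " + total));
    }
}
```

The only difference from Part 1 is the value that is emitted: the score (field 3) instead of the user id (field 1).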
mapper

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RTMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // One record per line: "userId itemId score", separated by single spaces
    private static final Pattern RECORD = Pattern.compile("([\\d.]+) (\\d+) (\\d+)");

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Matcher matcher = RECORD.matcher(value.toString());
        if (matcher.find()) {
            String item = matcher.group(2);                  // item id becomes the output key
            int score = Integer.parseInt(matcher.group(3));  // score becomes the output value
            context.write(new Text(item), new IntWritable(score));
        }
    }
}

Result:

1 52
10 3
100 33
1000 15
10000 3
100000 15
100001 4
100002 3
100003 4
100004 4
100005 4
100006 4
100007 4
100008 5
100009 5
10001 7
100010 3
100011 5
100012 4
100013 4
100014 9
100015 4
100016 5
100017 10

Problem 2

Part 1

The job and reducer are again the ones from WordCount; only the mapper needs to change. My mapper splits each line with a regular expression.
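For readers unfamiliar with capture groups, here is a small standalone sketch of how that regular expression pulls the fields out of one line. The pattern is the one used in the mapper; the class name `TrustLineRegexSketch`, the helper `targetUser`, and the sample lines are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch of the regex-based field extraction (hypothetical class name).
class TrustLineRegexSketch {
    // Same pattern as the mapper: three numeric fields separated by single spaces.
    // Group 1 = source user id, group 2 = target user id, group 3 = trust score.
    static final Pattern RECORD = Pattern.compile("([\\d.]+) (\\d+) (\\d+)");

    // Returns the target user id (group 2), or null when the line does not match.
    static String targetUser(String line) {
        Matcher m = RECORD.matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(targetUser("7 495 1"));       // space-separated sample line
        System.out.println(targetUser("7\t495\t1"));     // tab-separated: no match
    }
}
```

Note that the pattern expects a literal space between fields, so a tab-separated file would produce no matches; splitting on `\\s+` would be a more tolerant alternative if the separator is not known in advance.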
mapper

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class trustmapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // One record per line: "sourceUserId targetUserId trust", separated by single spaces
    private static final Pattern RECORD = Pattern.compile("([\\d.]+) (\\d+) (\\d+)");

    /**
     * @author mshing
     * @throws InterruptedException
     * @throws IOException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Matcher matcher = RECORD.matcher(value.toString());
        if (matcher.find()) {
            // Emit (targetUserId, 1) so the reducer can total the entries per target user
            context.write(new Text(matcher.group(2)), new IntWritable(1));
        }
    }
}

Output:

1 402
10 69
100 4
1000 66
10000 2
10001 1
10002 1
10003 6
10004 5
10005 1
10006 1
10007 2
10008 1
10009 3
1001 14

Part 2

Part 2 calls for a secondary sort, but I ran out of time and simply sorted the output in Excel. The result is:

495 2589

Excel was workable for Part 2 only because the data set is small; in general, sorting with Excel is not recommended.
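As a programmatic alternative to Excel, the sketch below sorts "userId score" lines by score in plain Java. It is not the secondary-sort MapReduce job the assignment has in mind, only a local stand-in that works when the Part 1 output fits in memory. The class name `TopUserSketch` and the method `sortByScoreDescending` are hypothetical; the sample lines are the first four rows of the Part 1 output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Local sketch for ranking users by score (hypothetical class name).
class TopUserSketch {

    // Sorts "userId score" lines by the numeric score, highest first.
    // Fields may be separated by spaces or tabs.
    static List<String> sortByScoreDescending(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt((String line) ->
                Integer.parseInt(line.trim().split("\\s+")[1])).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // First rows of the Part 1 output shown earlier in this post.
        List<String> sample = Arrays.asList("1 402", "10 69", "100 4", "1000 66");
        List<String> ranked = sortByScoreDescending(sample);
        System.out.println("Top user: " + ranked.get(0));
    }
}
```

On a data set too large for one machine, the same ranking would need a second MapReduce pass that uses the score as the sort key, which this sketch does not show.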
