题目要求

有如此三份数据：
1、users.dat 数据格式为： 2::M::56::16::70072
对应字段为：UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释：用户id，性别，年龄，职业，邮政编码

2、movies.dat 数据格式为： 2::Jumanji (1995)::Adventure|Children's|Fantasy
对应字段为：MovieID BigInt, Title String, Genres String
对应字段中文解释：电影ID，电影名字，电影类型

3、ratings.dat 数据格式为： 1::1193::5::978300760
对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释：用户ID，电影ID，评分，评分时间戳

题目要求：

数据要求：
（1）写shell脚本清洗数据。（hive不支持解析多字节的分隔符，也就是说hive只能解析':', 不支持解析'::'，所以用普通方式建表来使用是行不通的，要求对数据做一次简单清洗）
（2）使用Hive能解析的方式进行

Hive要求：
1、正确建表，导入数据（三张表，三份数据），并验证是否正确

2、求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）

3、分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）

4、求movieid = 2116这部电影各年龄段（因为年龄就只有7个，
就按这个7个分就好了）的平均影评（年龄段，影评分）

5、求最喜欢看电影（影评次数最多）的那位女性评最高分的10部电影的
平均影评分（观影者，电影名，影评分）

6、求好片（评分>=4.0）最多的那个年份的最好看的10部电影

7、求1997年上映的电影中，评分最高的10部Comedy类电影

8、该影评库中各种类型电影中评价最高的5部电影（类型，电影名，平均影评分）

9、各年评分最高的电影类型（年份，类型，影评分）

10、每个地区（邮政编码）最高评分的电影名，把结果存入HDFS（地区，电影名，影评分）

一、数据清洗、建表

因为元数据的字段之间用：：分割，所以我们使用shell进行一下清洗,将：：都转换成逗号

vi change1.sh#!/bin/bash
sed "s/::/,/g" /zgm/movies.dat>/zgm/movies2.dat
sed "s/::/,/g" /zgm/ratings.dat>/zgm/ratings2.dat
sed "s/::/,/g" /zgm/users.dat>/zgm/users2.datchmod +x change1.sh/bin/sh change1.sh

或者在建表时：

 create external table if not exists ratings(userid bigint,movieid bigint,rating double,timestamped string)partitioned by (dt string,city string)row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'with serdeProperties ('input.regex'='([0-9]*)::([0-9]*)::([0-9]*)::([0-9]*)')location '/zgm/ratings';

建表、load数据的完整shell

#! /bin/bash
yesterday=$(date -d 'yesterday' +%Y%m%d)
logdir=/Log/movieLog/$yesterdayif(!-d "$logdir");then
mkdir "$logdir"
fihql1="use myhive;drop table users;create external table if not exists users(userid bigint,gender string,age int,occupation string,zipcode string)partitioned by (dt string,city string)row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'with serdeProperties ('input.regex'='([0-9]*)::([A-Z]*)::([0-9]*)::([0-9]*)::([0-9]*)')location '/zgm/users';load data local inpath '/zgm/users.dat' overwrite into table users partition(dt='${yesterday}',city='shandong');"$HIVE_HOME/bin/hive --database myhive -e "${hql1}"hql2="drop table movies;create external table if not exists movies(movieid bigint,title string,genres string)partitioned by (dt string,city string)row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'with serdeProperties ('input.regex'='([0-9]*)::([^@]*)::([^ ]*)')location '/zgm/movies';load data local inpath '/zgm/movies.dat' overwrite into table movies partition(dt='${yesterday}',city='shandong');"$HIVE_HOME/bin/hive --database myhive -e "${hql2}"
hql3="drop table ratings;create external table if not exists ratings(userid bigint,movieid bigint,rating double,timestamped string)partitioned by (dt string,city string)row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'with serdeProperties ('input.regex'='([0-9]*)::([0-9]*)::([0-9]*)::([0-9]*)')location '/zgm/ratings';load data local inpath '/zgm/ratings.dat' overwrite into table ratings partition(dt='${yesterday}',city='shandong');"$HIVE_HOME/bin/hive --database myhive -e "${hql3}"

二、实现分析hql