Hive 和 Hbase 整合

1.整合配置

需要 jar 包（hive 自带） hive-hbase-handler-x.y.z.jar
在 hive 的服务端 node2 和 node3 上修改 hive-site.xml 文件：

<property> <name>hbase.zookeeper.quorum</name> <value>node2,node3,node4</value>
</property>

然后 node2 和 node3 启动 hive：hiveserver2 启动客户端 CLI

[root@node4 ~]# beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://node2,node3,node4/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2_zk root 123

2.创建表

要在 hive 中操作 hbase 的表，需要对列进行映射。

0: jdbc:hive2://node2,node3,node4/> create table hbasetbl(key int,value string)
0: jdbc:hive2://node2,node3,node4/> stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
0: jdbc:hive2://node2,node3,node4/> with serdeproperties ("hbase.columns.mapping" = ":key,cf1:val")
0: jdbc:hive2://node2,node3,node4/> tblproperties ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");
No rows affected (19.602 seconds)

必须指定 hbase.columns.mapping 属性。 hbase.table.name 属性可选，用于指定 hbase 中对应的表名，允许在 hive 表中使用不同的表名。上例中，hive 中表名为 hbase_table_1，hbase 中表名为 xyz。如果不指定，hive 中的表名与 hbase 中的表名一致。 hbase.mapred.output.outputtable 属性可选，向表中插入数据的时候是必须的。该属性的值传递给了 hbase.mapreduce.TableOutputFormat 使用

在 hive 表定义中的映射 hbase.columns.mapping 中的 cf1:val 在创建完表之后， hbase 中只显示 cf1，并不显示 val，因为 val 是行级别的，cf1 才是 hbase 中表级别的元数据。

3.内部表创建具体操作

Node4 的 hive 客户端上创建一个表：

0: jdbc:hive2://node2,node3,node4/> create table hbasetbl(key int,value string)
0: jdbc:hive2://node2,node3,node4/> stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
0: jdbc:hive2://node2,node3,node4/> with serdeproperties ("hbase.columns.mapping" = ":key,cf1:val")
0: jdbc:hive2://node2,node3,node4/> tblproperties ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");
No rows affected (19.602 seconds)

node1 上连接 hbase shell:

hbase(main):001:0> list
TABLE
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hbase-0.98/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2021-11-24 17:53:59,363 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
eventlog
neusoft:phone_log
phone
sentence
tb_region1
test
test1
testtable
wordcount
xyz

hbase 操作：

hbase(main):007:0> put 'xyz','1111','cf1:name','zhangsan'
0 row(s) in 0.1240 seconds
hbase(main):009:0> scan 'xyz'
ROW COLUMN+CELL 1111 column=cf1:name, timestamp=1591109665882, value=zhangsan
1 row(s) in 0.0290 seconds

hive 操作：

0: jdbc:hive2://node2,node3,node4/> select *from hbasetbl;
+---------------+-----------------+--+
| hbasetbl.key  | hbasetbl.value  |
+---------------+-----------------+--+
+---------------+-----------------+--+

hbase 操作：

hbase(main):010:0> put 'xyz','1111','cf1:val','JAVA'
0 row(s) in 0.0140 seconds

hbase(main):008:0> scan "xyz"
ROW                                         COLUMN+CELL                                                                                                                    1111                                       column=cf1:, timestamp=1637747993440, value=zhangsahan                                                                         1111                                       column=cf1:name, timestamp=1637748161532, value=yqq
1 row(s) in 0.0790 seconds

hive 操作：

0: jdbc:hive2://node2,node3,node4/> select *from hbasetbl;
+---------------+-----------------+--+
| hbasetbl.key  | hbasetbl.value  |
+---------------+-----------------+--+
| 1111          | JAVA            |
+---------------+-----------------+--+

4.外部表创建

Hive 建立外部表要求 hbase 中必须有表对应,否则抛错
HBase操作

hbase(main):013:0> create "t_order","order"
0 row(s) in 0.7590 seconds=> Hbase::Table - t_order

Hive操作

0: jdbc:hive2://node2,node3,node4/> create external table tmp_order(key string,id string,user_id string)
0: jdbc:hive2://node2,node3,node4/> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
0: jdbc:hive2://node2,node3,node4/> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,order:order_id,order:user_id")
0: jdbc:hive2://node2,node3,node4/> TBLPROPERTIES ("hbase.table.name" = "t_order");
No rows affected (2.968 seconds)

hbase 操作：

hbase(main):003:0> put 't_order','1','order:order_id','1'
0 row(s) in 0.1480 seconds
hbase(main):004:0> put 't_order','1','order:user_id','101'
0 row(s) in

Hive 操作：

0: jdbc:hive2://node2,node3,node4/> select * from tmp_order;
+----------------+---------------+--------------------+--+
| tmp_order.key  | tmp_order.id  | tmp_order.user_id  |
+----------------+---------------+--------------------+--+
| 1              | 1             | 101                |
+----------------+---------------+--------------------+--+

Hive 操作：

0: jdbc:hive2://node2,node3,node4/> insert into tmp_order values(2,2,102)
0: jdbc:hive2://node2,node3,node4/> ;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1637722681714_0002
INFO  : The url to track the job: http://node4:8088/proxy/application_1637722681714_0002/
INFO  : Starting Job = job_1637722681714_0002, Tracking URL = http://node4:8088/proxy/application_1637722681714_0002/
INFO  : Kill Command = /opt/hadoop-2.6.5/bin/hadoop job  -kill job_1637722681714_0002
INFO  : Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
INFO  : 2021-11-24 20:30:40,694 Stage-0 map = 0%,  reduce = 0%
INFO  : 2021-11-24 20:31:40,255 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 4.48 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 480 msec
INFO  : Ended Job = job_1637722681714_0002
No rows affected (167.502 seconds)
0: jdbc:hive2://node2,node3,node4/> select * from tmp_order;
+----------------+---------------+--------------------+--+
| tmp_order.key  | tmp_order.id  | tmp_order.user_id  |
+----------------+---------------+--------------------+--+
| 1              | 1             | 101                |
| 2              | 2             | 102                |
+----------------+---------------+--------------------+--+
2 rows selected (5 seconds)

hbase 操作：


hbase(main):017:0> scan "t_order"
ROW                                         COLUMN+CELL                                                                                                                    1                                          column=order:order_id, timestamp=1637756806411, value=1                                                                        1                                          column=order:user_id, timestamp=1637756850480, value=101                                                                       2                                          column=order:order_id, timestamp=1637757099448, value=2                                                                        2                                          column=order:user_id, timestamp=1637757099448, value=102
2 row(s) in 0.0830 seconds

总结：
1、创建 hive 的内部表，要求 hbase 中不能有对应的表
2、创建 hive 的外部表，要求 hbase 中一定要有对应的表
3、映射关系通过 WITH SERDEPROPERTIES (“hbase.columns.mapping” = “:key,cf:id,cf:username,cf:age”) with serdeproperties
4、stored by 指定 hive 中存储数据的时候，由该类来处理，该类会将数据放到 hbase 的存储中，同时在 hive 读取数据的时候，由该类负责处理 hbase 的数据和 hive 的对应关系 STORED BY ‘org.apache.hadoop.hive.hbase.HBaseStorageHandler’
5、指定 hive 表和 hbase 中的哪张表对应，outputtable 负责当 hive insert 数据的时候将数据写到 hbase 的哪张表。 tblproperties (“hbase.table.name” = “my_table”, “hbase.mapred.output.outputtable” = “my_table”);
6、如果 hbase 中的表名和 hive 中的表名一致，则可以不指定 tblproperties。