spark-shell整合

安装hive

https://blog.csdn.net/qq_43659234/article/details/111632216

配置信息

把配置好的hive-site.xml文件copy到spark的安装目录的conf下

[root@spark01 conf]# cp hive-site.xml /usr/spark/spark-3.0.0-bin-hadoop3.2/conf/

在spark的conf下的spark-env.sh中添加hive配置

将mysql的驱动包，加入到spark的jars下

启动spark

[root@spark01 jars]# start-master.sh
[root@spark01 jars]# start-slaves.sh

测试

idea中spark整合

windows下搭建hadoop

下载Hadoop的winutils

在github上一搜就有
!!!注意版本
下载后将其hadoop.dll拷贝至C:\Windows\System32下

配置环境变量

HADOOP_HOME指定为刚刚下载Hadoop的winutils

在PATH中加上

到此windows下的hadoop就算配置好了

添加文件

log4j.properties文件
添加后只输出warning以上的信息

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
## Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

idea连接虚拟机

连接文件

需要什么文件直接拖拉过就行

连接虚拟机

将虚拟机中的hive-site.xml,hdfs-site.xml(可有可无),core-site.xml文件加入到resource文件夹下

最主要的就是hive-site.xml文件
hive-site.xml文件内容如下

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration><!-- jdbc 连接的 URL --><property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://spark01:3306/hive</value></property><!-- jdbc 连接的 Driver--><property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.cj.jdbc.Driver</value></property><!-- jdbc 连接的 username--><property><name>javax.jdo.option.ConnectionUserName</name><value>root</value></property><!-- jdbc 连接的 password --><property><name>javax.jdo.option.ConnectionPassword</name><value>123456789</value></property><!-- Hive 元数据存储版本的验证 --><property><name>hive.metastore.schema.verification</name><value>false</value></property><!--元数据存储授权--><property><name>hive.metastore.event.db.notification.api.auth</name><value>false</value></property><!-- Hive 默认在 HDFS 的工作目录 --><property><name>hive.metastore.warehouse.dir</name><value>/user/hive/warehouse</value></property><!-- 指定存储元数据要连接的地址 --><property><name>hive.metastore.uris</name><value>thrift://spark01:9083</value></property><!-- 指定 hiveserver2 连接的 host --><property><name>hive.server2.thrift.bind.host</name><value>spark01</value></property><!-- 指定 hiveserver2 连接的端口号 --><property><name>hive.server2.thrift.port</name><value>10000</value></property>
</configuration>

再将mysql-connector-java-5.1.23-bin.jar和spark hive相关包加入mevan中

!!!启动hive服务

必须启动

注意: 启动后窗口不能再操作，需打开一个新的shell窗口做别的操作
启动metastore

[root@spark01 bin]# hive --service metastore

启动 hiveserver2

[root@spark01 hive]# hive --service hiveserver2

hive的启动脚本

#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
if [ ! -d $HIVE_LOG_DIR ]
thenmkdir -p $HIVE_LOG_DIR
fi
#检查进程是否运行正常，参数 1 为进程名，参数 2 为进程端口
function check_process()
{pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)echo $pid[[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}function hive_start()
{metapid=$(check_process HiveMetastore 9083)cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"[ -z "$metapid" ] && eval $cmd || echo "Metastroe 服务已启动" server2pid=$(check_process HiveServer2 10000)cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"[ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 服务已启动"
}function hive_stop()
{metapid=$(check_process HiveMetastore 9083)[ "$metapid" ] && kill $metapid || echo "Metastore 服务未启动" server2pid=$(check_process HiveServer2 10000)[ "$server2pid" ] && kill $server2pid || echo "HiveServer2 服务未启动"
}case $1 in
"start")hive_start;;
"stop")hive_stop;;
"restart")hive_stopsleep 2hive_start;;
"status")check_process HiveMetastore 9083 >/dev/null && echo "Metastore 服务运行正常" || echo "Metastore 服务运行异常"check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 服务运行正常" || echo "HiveServer2 服务运行异常";;
*)echo Invalid Args!echo 'Usage: '$(basename $0)' start|stop|restart|status';;
esac

将其放入hive的bin目录下并给其权限chmod -R 777 脚本的名字就可以启动
启动的命令脚本名字 start/stop/status

测试

import org.apache.spark.sql.SparkSession
/*** @author 公羽* @time : 2021/6/2 15:52* @File : spark_hive.java*/
object spark_hive {def main(args: Array[String]): Unit = {val spark = new SparkSession.Builder().enableHiveSupport().appName("spark_hive").master("local[5]").config("spark.driver.host","localhost").getOrCreate()import spark.implicits._spark.sql("show databases").show()}
}

！！！！

windows下的hosts文件得添加ip

要是报启动连接错误先看看集群有没有启动,hive有没有启动

spark整合hive相关推荐

Spark SQL实战(08)-整合Hive
1 整合原理及使用 Apache Spark 是一个快速.可扩展的分布式计算引擎,而 Hive 则是一个数据仓库工具,它提供了数据存储和查询功能.在 Spark 中使用 Hive 可以提高数据处理和查 ...
Spark SQL整合Hive
Spark SQL官方释义 Spark SQL is Apache Spark's module for working with structured data. 一.使用Spark SQL访问Hi ...
Spark对接Hive：整合Hive操作及函数
1.拷贝hive-site.xml文件到spark的conf目录下 2.[hadoop@hadoop002 bin]$ ./spark-shell --master local[2] --jars ~ ...
phoenix+hbase+Spark整合，Spark处理数据操作phoenix入hbase，Spring Cloud整合phoenix
1 版本要求 Spark版本:spark-2.3.0-bin-hadoop2.7 Phoenix版本:apache-phoenix-4.14.1-HBase-1.4-bin HBASE版本:hbase ...
sparksql整合hive
sparksql整合hive 步骤 1.需要把hive安装目录下的配置文件hive-site.xml拷贝到每一个spark安装目录下对应的conf文件夹中 2.需要一个连接mysql驱动的jar包拷贝 ...
Spark 整合ElasticSearch
Spark 整合ElasticSearch 因为做资料搜索用到了ElasticSearch,最近又了解一下 Spark ML,先来演示一个Spark 读取/写入 ElasticSearch 简单示例. ...
漫谈大数据 - Spark on Hive Hive on Spark
目录 Spark on hive 与 Hive on Spark 的区别 Hive查询流程及原理 Hive将SQL转成MapReduce执行速度慢 Hive On Spark优化 Hive元数据库的功 ...
Apache+Hudi入门指南: Spark+Hudi+Hive+Presto
一.整合 hive集成hudi方法:将hudi jar复制到hive lib下 cp ./packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-b ...
sparksql hive mysql_SparkSql 整合 Hive
SparkSql整合Hive 需要Hive的元数据,hive的元数据存储在Mysql里,sparkSql替换了yarn,不需要启动yarn,需要启动hdfs 首先你得有hive,然后你得有spark, ...

spark整合hive

目录