Previously, I had already set up Hive on Spark, or more accurately Spark on Hive: I could happily query Hive from within Spark, and that met my needs at the time.
However, when connecting through the Hive client and letting Hive use the Spark engine, I hit an error I could not resolve.
So I had no choice but to rebuild Hive on Spark following the official guide: Hive on Spark: Getting Started.
The docs say you must build a Spark distribution that does not include Hive, whereas the pre-built Spark downloads generally do include it. So, time to compile Spark by hand:
Note that you must have a version of Spark which does not include the Hive jars. Meaning one which was not built with the Hive profile.
Environment preparation: environment variables, hosts/hostname, JDK, passwordless SSH login, and disabling the firewall; see my earlier posts for details.
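For reference, a minimal sketch of those prerequisites (assuming CentOS 7 and the root user, as in the rest of this walkthrough):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa                   # key pair for passwordless SSH
ssh-copy-id root@master                                    # repeat for every node in the cluster
systemctl stop firewalld && systemctl disable firewalld    # turn the firewall off (CentOS 7)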
1. Hadoop environment setup
core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/data/hadoop/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <!-- HDFS trash -->
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
    <property>
        <name>fs.trash.checkpoint.interval</name>
        <value>60</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
</configuration>
hadoop-env.sh
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/data/hadoop/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/data/hadoop/datanode</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <!-- reserve 20 GB of disk space -->
    <property>
        <name>dfs.datanode.du.reserved</name>
        <value>21474836480</value>
    </property>
</configuration>
mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
slaves
localhost
yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
    <property>
        <description>Whether to enable log aggregation</description>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://master:19888/jobhistory/job</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>86400</value>
    </property>
</configuration>
Before starting Hadoop, format the NameNode first, or you are likely to run into errors:
hadoop namenode -format
Start Hadoop:
./sbin/start-all.sh
Visit ports 50070 and 8088 to check whether everything started successfully:
master:50070
master:8088
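Alternatively, a quick sanity check from the shell (jps ships with the JDK; the exact process list below assumes this single-node layout):

jps
# Expect to see something like: NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager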
2. Download and compile Spark
| Hive Version | Spark Version |
|---|---|
| master | 2.3.0 |
| 3.0.x | 2.3.0 |
| 2.3.x | 2.0.0 |
2.1 Following the official recommendation, I chose Hive 3.0.0 with Spark 2.3.0:
wget http://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0.tgz
2.2 Extract Spark and download the Maven version specified in its pom.xml
wget http://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xzvf apache-maven-3.3.9-bin.tar.gz
2.3 Add Maven to the environment variables
export PATH=/opt/apache-maven-3.3.9/bin:${PATH}
source /etc/profile
2.4 Compile Spark
cd /opt/hadoop/spark/spark-2.3.0
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
2.5 The build finished after about 40 minutes
It produced spark-2.3.0-bin-hadoop2-without-hive.tgz,
which I extracted to /opt/hadoop/spark-2.3.0-bin-hadoop2.
2.6 Download Scala
The Spark source pom.xml shows that Scala 2.11.8 is required.
wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
Extract it to /opt/hadoop/scala-2.11.8.
2.7 Add Scala and Spark environment variables
#Java
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
#hadoop
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export SCALA_HOME=/opt/hadoop/scala-2.11.8
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${SCALA_HOME}/bin:${JAVA_HOME}/bin:${HIVE_HOME}/bin
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
#Maven
export PATH=/opt/apache-maven-3.3.9/bin:${PATH}
3. Configure Spark
slaves
cd /opt/hadoop/spark-2.3.0-bin-hadoop2/conf
cp slaves.template slaves
localhost
spark-defaults.conf
spark.master                        yarn
#spark.submit.deployMode            cluster
spark.executor.cores                5
spark.num.executors                 5
spark.eventLog.enabled              true
spark.eventLog.compress             true
spark.eventLog.dir                  hdfs://master:9000/tmp/logs/root/logs
spark.history.fs.logDirectory       hdfs://master:9000/tmp/logs/root/logs
spark.yarn.historyServer.address    http://master:18080
spark.sql.parquet.writeLegacyFormat true
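The event-log directory configured above has to exist on HDFS before jobs start writing to it; a sketch of creating it (path taken from the config above):

hdfs dfs -mkdir -p /tmp/logs/root/logs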
spark-env.sh
export SCALA_HOME=/opt/hadoop/scala-2.11.8
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export SPARK_MASTER_IP=master
#export SPARK_LOCAL_IP=master
export SPARK_MASTER_HOST=master
export SPARK_MASTER_PORT=7077
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_YARN_USER_ENV=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_WORKER_DIR=/opt/hadoop/data/spark/work/
4. Download and configure Hive
4.1 Install MySQL
Create a database named "hive" to hold the Hive metastore.
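A minimal sketch of creating that database, assuming MySQL 5.x and the root/123456 credentials configured in hive-site.xml later on:

mysql -uroot -p123456
mysql> CREATE DATABASE hive;
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'root'@'%' IDENTIFIED BY '123456';
mysql> FLUSH PRIVILEGES;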
4.2 Download Hive and extract it to /opt/hadoop/apache-hive-3.0.0-bin
wget http://archive.apache.org/dist/hive/hive-3.0.0/apache-hive-3.0.0-bin.tar.gz
4.3 Configure Hive
hive-site.xml
Find every occurrence of variable paths of the form ${system:java.io.tmpdir}/${system:user.name}
and replace them all with /opt/hadoop/data/hive/iotmp.
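One way to do the replacement in bulk is a sed one-liner like the following (a sketch assuming GNU sed; back up hive-site.xml first, and create the target directory). Other ${system:...} paths in the file may need similar treatment.

cd /opt/hadoop/apache-hive-3.0.0-bin/conf
cp hive-site.xml hive-site.xml.bak
sed -i 's#${system:java.io.tmpdir}/${system:user.name}#/opt/hadoop/data/hive/iotmp#g' hive-site.xml
mkdir -p /opt/hadoop/data/hive/iotmp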
Then append the following properties:
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>10.10.22.133</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://master:9083</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive</value>
    <description/>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description/>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description/>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description/>
</property>
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
<property>
    <name>spark.home</name>
    <value>/opt/hadoop/spark-2.3.0-bin-hadoop2</value>
</property>
<property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
    <name>spark.master</name>
    <value>yarn</value>
</property>
<property>
    <name>spark.sql.parquet.writeLegacyFormat</name>
    <value>true</value>
</property>
<property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
</property>
<property>
    <name>hive.server2.active.passive.ha.enable</name>
    <value>true</value>
</property>
Note: also copy hive-site.xml into spark/conf.
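For example (paths from this walkthrough):

cp /opt/hadoop/apache-hive-3.0.0-bin/conf/hive-site.xml /opt/hadoop/spark-2.3.0-bin-hadoop2/conf/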
hive-env.sh
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export HIVE_CONF_DIR=/opt/hadoop/apache-hive-3.0.0-bin/conf
4.4 Initialize the Hive metastore database
./schematool -dbType mysql -initSchema
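Note that schematool can only reach MySQL if the JDBC connector jar is on Hive's classpath; if it complains about a missing com.mysql.jdbc.Driver, copy the connector in first (the jar name below is an assumption, use whatever connector version you have):

cp mysql-connector-java-5.1.47.jar /opt/hadoop/apache-hive-3.0.0-bin/lib/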
5. Start and test
5.1 Start Spark
/opt/hadoop/spark-2.3.0-bin-hadoop2.7/sbin/start-all.sh
But it failed with an error:
starting org.apache.spark.deploy.master.Master, logging to /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
failed to launch: nice -n 0 /opt/hadoop/spark-2.3.0-bin-hadoop2.7/bin/spark-class org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
Spark Command: /opt/hadoop/jdk1.8.0_77/bin/java -cp /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/opt/hadoop/spark-2.3.0-bin-hadoop2.7/jars/*:/opt/hadoop/hadoop-2.7.7/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
========================================
full log in /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
I downloaded a pre-built Spark of the same version (with Hadoop) from the official site and copied all of its jars into the jars directory of the Spark I had compiled, and the problem went away. That, however, also copied in the Hive jars; I decided to deal with any fallout later. It also made me wonder: if the only difference between my own build and the official download is the set of jars, why compile at all? Couldn't I just download the official build and delete its Hive jars? I planned to try that later.
5.2 Start Hive
nohup hive --service metastore &
nohup hive --service hiveserver2 &
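To check that HiveServer2 actually came up, one can connect with beeline (a sketch; 10000 is the default HiveServer2 Thrift port, and the -n root user is an assumption matching this root-based setup):

beeline -u jdbc:hive2://master:10000 -n root
0: jdbc:hive2://master:10000> show databases;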
5.3 Test Hive
Since the Spark engine is already configured in hive-site.xml, we can test directly:
./hive
hive> create table test(ts BIGINT, line STRING);
hive> select count(*) from test;
And here comes the problem, an error:
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf)'
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf
Go into spark/jars and delete all the Hive-related jars, i.e. the ones I copied in during step 5.1.
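Concretely, something along these lines (a sketch; the path is assumed to be the Spark directory that Hive points at via spark.home, so double-check what you are deleting):

cd /opt/hadoop/spark-2.3.0-bin-hadoop2/jars
rm -f hive-*.jar spark-hive*.jar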
After deleting them, try again:
hive> create table test(ts BIGINT, line STRING);
hive> select count(*) from test;
Query ID = root_20190118163315_8a679820-288e-46f7-b464-f8b7fceb6abd
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1547172099098_0075
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0075
Hive on Spark Session Web UI URL: http://slave2:49196
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      2          2        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 10.20 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 10.20 second(s)
OK
591285
Time taken: 39.154 seconds, Fetched: 1 row(s)
At this point it clicked: the Spark I compiled myself was missing a lot of jars, and copying in the jars from the official build brought Hive back with them. Either way, the Spark that Hive runs on must not contain the Hive jars.
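A quick way to confirm that the Spark used by Hive is Hive-free (a sketch):

ls /opt/hadoop/spark-2.3.0-bin-hadoop2/jars | grep -i hive
# ideally prints nothing: no Hive jars left in this Spark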
With that, the Hive on Spark setup is complete.
=======================================================================
But I like to push things further, so I wanted to verify it: just download the official pre-built Spark (which normally includes Hive) and delete the Hive jars inside it; is that enough?
Using the officially downloaded Spark:
[root@master jars]# rm -rf spark-hive*
[root@master jars]# rm -rf hive-*
./sbin/start-all.sh
hive> select count(1) from subject_total_score;
Query ID = root_20190118164946_88709ec2-a5e1-4099-88eb-f98d24de6e88
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1547172099098_0076
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0076
Hive on Spark Session Web UI URL: http://master:40695
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      2          2        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 9.16 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 9.16 second(s)
OK
591285
Time taken: 39.291 seconds, Fetched: 1 row(s)
Sure enough: all the earlier work was unnecessary. Hive on Spark works right away if you simply delete every Hive-related jar from the official build; compiling Spark yourself is not required.
However:
spark-shell --master yarn
scala>spark.sql("show tables").show
It can no longer see the Hive tables. My earlier Spark programs were all built against this Hive database, but once you set up Hive on Spark you lose Spark on Hive, so you would need two separate Spark installations.
Either Hive drives Spark (Hive on Spark), or the other way around, Spark operates on Hive directly (Spark on Hive); with a single Spark build you cannot have it both ways.
So the right approach is to stop using the Hive Thrift server (HiveServer2) and instead use the Spark Thrift Server to expose Hive to external clients.
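A minimal sketch of that idea, under the assumption that a second, Hive-enabled Spark build is used for it (port 10016 below is an arbitrary choice to avoid clashing with HiveServer2 on 10000):

# start the Spark Thrift Server (it speaks the HiveServer2 protocol) on YARN
$SPARK_HOME/sbin/start-thriftserver.sh \
    --master yarn \
    --hiveconf hive.server2.thrift.port=10016

# then connect to it exactly as you would to HiveServer2
beeline -u jdbc:hive2://master:10016 -n root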