Setting up Hive on Spark (building Spark from source, the official way)

2020-03-06 16:25:47

Previously, I had already set up Hive on Spark, or more accurately Spark on Hive: I could happily work with Hive from inside Spark, which matched my needs at the time (see: Hive on Spark cluster setup).

However, when connecting through the Hive client with Spark as the execution engine, I hit an error I could not resolve (see: troubleshooting the Hive on Spark error "Failed to create Spark client for Spark session").

So I had to rebuild everything following the official guide: Hive on Spark: Getting Started.

The docs say you need a Spark build that does not include Hive, while the prebuilt Spark downloads generally do include it. So I had to compile Spark myself:

Note that you must have a version of Spark which does not include the Hive jars. Meaning one which was not built with the Hive profile. 


Prerequisites: environment variables, host/hostname setup, JDK, passwordless SSH, and a disabled firewall; see my earlier posts.


1. Hadoop setup


core-site.xml

<configuration>
    <property>
       <name>hadoop.tmp.dir</name>
       <value>file:/opt/hadoop/data/hadoop/tmp</value>
    </property>
    <property>
       <name>io.file.buffer.size</name>
       <value>131072</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <!-- HDFS trash -->
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
    <property>
        <name>fs.trash.checkpoint.interval</name>
        <value>60</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
       <name>hadoop.proxyuser.root.groups</name>
       <value>*</value>
    </property>
</configuration>
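For reference, `fs.trash.interval` is expressed in minutes, so the value above keeps deleted files for a week, with `fs.trash.checkpoint.interval=60` creating a trash checkpoint every hour. A quick sanity check on the arithmetic:

```shell
# fs.trash.interval=10080 is in minutes; convert to days of retention.
days=$(( 10080 / 60 / 24 ))
echo "trash retained for $days days"   # -> trash retained for 7 days
```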

hadoop-env.sh

export JAVA_HOME=/opt/hadoop/jdk1.8.0_77

hdfs-site.xml

<configuration>
   <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/data/hadoop/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/data/hadoop/datanode</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <!-- reserve 20 GiB of disk space for non-HDFS use -->
    <property>
        <name>dfs.datanode.du.reserved</name>
        <value>21474836480</value>
    </property>
</configuration>
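The `dfs.datanode.du.reserved` value is in bytes; the comment claims 20 GiB, and the arithmetic checks out:

```shell
# 21474836480 bytes reserved per DataNode volume; convert to GiB.
gib=$(( 21474836480 / 1024 / 1024 / 1024 ))
echo "reserved: ${gib} GiB"   # -> reserved: 20 GiB
```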

mapred-site.xml

<configuration>
    <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
    </property>
    <property>
         <name>mapreduce.jobhistory.address</name>
         <value>master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>

slaves

localhost

yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
    </property>
    <property>
       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
       <name>yarn.resourcemanager.scheduler.class</name>
       <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
    <property>
       <description>Whether to enable log aggregation</description>
       <name>yarn.log-aggregation-enable</name>
       <value>true</value>
    </property>
    <property>
       <name>yarn.resourcemanager.hostname</name>
       <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://master:19888/jobhistory/job</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>86400</value>
    </property>
</configuration>


Before the first start, format the NameNode, otherwise startup errors are likely:

hadoop namenode -format   # deprecated spelling; `hdfs namenode -format` is the current form


Start Hadoop:

./sbin/start-all.sh


Check the web UIs on ports 50070 and 8088 to confirm everything started:

master:50070

(screenshot: HDFS NameNode web UI)


master:8088

(screenshot: YARN ResourceManager web UI)


2. Download and build Spark

Hive Version    Spark Version
master          2.3.0
3.0.x           2.3.0
2.3.x           2.0.0

2.1 Following the official compatibility table, I chose Hive 3.0.0 with Spark 2.3.0

wget http://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0.tgz

2.2 Extract Spark and download the Maven version required by its pom

wget http://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xzvf apache-maven-3.3.9-bin.tar.gz

2.3 Add Maven to the PATH

export PATH=/opt/apache-maven-3.3.9/bin:${PATH}
source /etc/profile

2.4 Build Spark

cd /opt/hadoop/spark/spark-2.3.0
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"

2.5 About 40 minutes later the build completed

It produced spark-2.3.0-bin-hadoop2-without-hive.tgz.

Extract it to /opt/hadoop/spark-2.3.0-bin-hadoop2
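A quick way to confirm the build really excludes Hive is to grep the tarball listing for Hive jars. The check is demonstrated below on a mock listing; on the real build you would pipe `tar tzf spark-2.3.0-bin-hadoop2-without-hive.tgz` into the same grep instead of the printf.

```shell
# Mock jar listing standing in for `tar tzf spark-2.3.0-bin-hadoop2-without-hive.tgz`.
jar_list='jars/spark-core_2.11-2.3.0.jar
jars/spark-sql_2.11-2.3.0.jar
jars/spark-yarn_2.11-2.3.0.jar'
hits=$(printf '%s\n' "$jar_list" | grep -c 'hive' || true)
echo "hive jars found: $hits"   # a clean "without-hive" build prints 0
```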


2.6 Download Scala

The Spark source pom specifies Scala 2.11.8

wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz

Extract it to /opt/hadoop/scala-2.11.8


2.7 Add Scala and Spark environment variables

#Java
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
#hadoop
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export SCALA_HOME=/opt/hadoop/scala-2.11.8
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${JAVA_HOME}/bin:${HIVE_HOME}/bin
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
#Maven
export PATH=/opt/apache-maven-3.3.9/bin:${PATH}


3. Configure Spark

slaves

cd /opt/hadoop/spark-2.3.0-bin-hadoop2/conf
cp slaves.template slaves
# slaves contains the single line:
localhost

spark-defaults.conf

spark.master                               yarn
#spark.submit.deployMode                   cluster
spark.executor.cores                       5
# note: the valid Spark property is spark.executor.instances, not spark.num.executors
spark.executor.instances                   5
spark.eventLog.enabled                     true
spark.eventLog.compress                    true
spark.eventLog.dir                         hdfs://master:9000/tmp/logs/root/logs
spark.history.fs.logDirectory              hdfs://master:9000/tmp/logs/root/logs
spark.yarn.historyServer.address           http://master:18080
spark.sql.parquet.writeLegacyFormat        true


spark-env.sh

export SCALA_HOME=/opt/hadoop/scala-2.11.8
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export SPARK_MASTER_IP=master   # deprecated alias; SPARK_MASTER_HOST below is the Spark 2.x name
#export SPARK_LOCAL_IP=master
export SPARK_MASTER_HOST=master
export SPARK_MASTER_PORT=7077
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_YARN_USER_ENV=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_WORKER_DIR=/opt/hadoop/data/spark/work/


4. Download and configure Hive

4.1 Install MySQL

Install MySQL via yum on CentOS (see my earlier post), then create a database named "hive" for the Hive metastore.

4.2 Download Hive and extract it to /opt/hadoop/apache-hive-3.0.0-bin

wget http://archive.apache.org/dist/hive/hive-3.0.0/apache-hive-3.0.0-bin.tar.gz

4.3 Configure Hive

hive-site.xml

Find every occurrence of the templated path ${system:java.io.tmpdir}/${system:user.name}

and replace it with /opt/hadoop/data/hive/iotmp (the unresolved variables commonly cause "Relative path in absolute URI" errors at startup).
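That find-and-replace can be done with sed. The sketch below demonstrates it on a throwaway file; for the real thing, point it at hive-site.xml (and back the file up first). The `#` delimiter avoids escaping the slashes in the paths, and single quotes keep the shell from expanding `${...}`.

```shell
# Demo on a temp file standing in for hive-site.xml.
demo=$(mktemp)
printf '<value>${system:java.io.tmpdir}/${system:user.name}</value>\n' > "$demo"
sed -i 's#${system:java.io.tmpdir}/${system:user.name}#/opt/hadoop/data/hive/iotmp#g' "$demo"
cat "$demo"   # -> <value>/opt/hadoop/data/hive/iotmp</value>
# Remember to create the directory the new path points at:
# mkdir -p /opt/hadoop/data/hive/iotmp
```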

Then append the following properties:

<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>10.10.22.133</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
  </property>
  <property>
       <name>hive.metastore.uris</name>
       <value>thrift://master:9083</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive</value>
    <description/>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description/>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description/>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description/>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
 <property>
    <name>spark.home</name>
    <value>/opt/hadoop/spark-2.3.0-bin-hadoop2</value>
  </property>
  <property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
  </property>
  <property>
    <name>spark.master</name>
    <value>yarn</value>
  </property>
  <property>
    <name>spark.sql.parquet.writeLegacyFormat</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.active.passive.ha.enable</name>
    <value>true</value>
  </property>

Note: copy hive-site.xml into spark/conf as well.


hive-env.sh

export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export HIVE_CONF_DIR=/opt/hadoop/apache-hive-3.0.0-bin/conf


4.4 Initialize the Hive metastore schema

./schematool -dbType mysql -initSchema


5. Start-up and testing

5.1 Start Spark

/opt/hadoop/spark-2.3.0-bin-hadoop2.7/sbin/start-all.sh

But it failed:

starting org.apache.spark.deploy.master.Master, logging to /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out
failed to launch: nice -n 0 /opt/hadoop/spark-2.3.0-bin-hadoop2.7/bin/spark-class org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
  Spark Command: /opt/hadoop/jdk1.8.0_77/bin/java -cp /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf/:/opt/hadoop/spark-2.3.0-bin-hadoop2.7/jars/*:/opt/hadoop/hadoop-2.7.7/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
  ========================================
full log in /opt/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master.out

I downloaded the same-version Spark with Hadoop bundled from the official site and copied all of its jars into my compiled build's jars directory, which fixed the startup. But that also copied in the Hive jars; I decided to deal with any fallout later. It also made me wonder: if the only difference between my build and the official download is the jars, why compile at all? Couldn't I just download the official package and delete its Hive jars? I planned to try that shortly.

5.2 Start Hive

nohup hive --service metastore &

nohup  hive --service hiveserver2 &


5.3 Test Hive

hive-site.xml already sets the Spark execution engine, so we can test directly:

./hive
hive>create table test(ts BIGINT,line STRING); 
hive>select count(*) from test;

And here comes the error:

Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf)'
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session e4aae433-e79b-48c2-8edf-9d04796da7cf

Go into spark/jars and delete every Hive-related jar; these are the ones I copied in back in step 5.1.
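The cleanup can be sketched as follows. It is demonstrated on a mock directory; on the real cluster you would point JARS_DIR at $SPARK_HOME/jars (/opt/hadoop/spark-2.3.0-bin-hadoop2/jars in this setup) instead.

```shell
# Mock jars directory standing in for $SPARK_HOME/jars.
JARS_DIR=$(mktemp -d)
touch "$JARS_DIR/spark-core_2.11-2.3.0.jar" \
      "$JARS_DIR/hive-exec-3.0.0.jar" \
      "$JARS_DIR/spark-hive_2.11-2.3.0.jar"
# Delete everything Hive-related: hive-*.jar and spark-hive*.jar.
rm -f "$JARS_DIR"/hive-* "$JARS_DIR"/spark-hive*
ls "$JARS_DIR"   # -> spark-core_2.11-2.3.0.jar
```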

After deleting them, try again:

hive>create table test(ts BIGINT,line STRING);
hive>select count(*) from test;
Query ID = root_20190118163315_8a679820-288e-46f7-b464-f8b7fceb6abd
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1547172099098_0075
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0075
Hive on Spark Session Web UI URL: http://slave2:49196
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      2          2        0        0       0  
Stage-1 ........         0      FINISHED      1          1        0        0       0  
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 10.20 s    
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 10.20 second(s)
OK
591285
Time taken: 39.154 seconds, Fetched: 1 row(s)

Now it made sense: my compiled Spark was missing many jars, and copying in the official jars brought Hive along with them. The bottom line is that Spark's jars directory must not contain Hive.


With that, the Hive on Spark setup is complete.


=======================================================================

But I like to experiment, so I had to test the shortcut: download the official prebuilt Spark (which generally bundles Hive) and simply delete the Hive jars from it.

Using the official prebuilt Spark:

[root@master jars]# rm -rf spark-hive*
[root@master jars]# rm -rf hive-*
./sbin/start-all.sh
hive> select count(1) from subject_total_score;
Query ID = root_20190118164946_88709ec2-a5e1-4099-88eb-f98d24de6e88
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1547172099098_0076
Kill Command = /opt/hadoop/hadoop-2.7.7/bin/yarn application -kill application_1547172099098_0076
Hive on Spark Session Web UI URL: http://master:40695
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      2          2        0        0       0  
Stage-1 ........         0      FINISHED      1          1        0        0       0  
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 9.16 s     
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 9.16 second(s)
OK
591285
Time taken: 39.291 seconds, Fetched: 1 row(s)


Sure enough, all the compiling was unnecessary: Hive on Spark works as soon as every Hive-related jar is deleted from Spark's jars directory, with no custom build required.


However:

spark-shell --master yarn

scala>spark.sql("show tables").show

The Hive tables can no longer be queried from spark-shell. My existing Spark programs all read this Hive warehouse, so once Hive on Spark is in place, Spark on Hive no longer works from the same installation; you would need two Spark installs.

In other words, either Hive drives Spark (Hive on Spark) or Spark talks to Hive directly (Spark on Hive); a single Spark installation cannot do both.


So the better approach is to skip HiveServer2 and expose Hive to external clients through the Spark Thrift Server instead.
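A sketch of that approach, assuming the same host names as above: the Spark Thrift Server speaks the HiveServer2 JDBC protocol (default port 10000), so clients like beeline can connect to it unchanged. Stop HiveServer2 first, or move one of the two to a different port, since both default to 10000.

```shell
# Start the Spark Thrift Server on YARN (same JDBC protocol as HiveServer2).
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn

# Connect with beeline; user and host assumed from this setup.
beeline -u jdbc:hive2://master:10000 -n root
```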


