Setting Up a Hive on Spark Cluster Environment

2020-03-06 16:25:47

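Setting up a Hive on Spark cluster is straightforward. There is no need to build Spark yourself, with all the compiling, dependency wrangling, and packaging that involves; just download an ordinary stable release and keep things simple.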

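The environment is plain Linux (CentOS 7), set up from scratch. All programs live under /opt/hadoop. We install a single node first and then add further nodes dynamically, one at a time.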


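[Contents]

1. Download the packages

2. Environment configuration

3. Install and configure Hadoop

4. Install and configure Spark

5. Spark history log configuration

6. Install and configure Hive

7. Test Hive on Spark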



1. Download the packages

1.1 Before downloading, work out which versions you need

Official Hive on Spark documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Spark and Hive version compatibility:

Hive Version    Spark Version
1.1.x           1.2.0
1.2.x           1.3.1
2.0.x           1.5.0
2.1.x           1.6.0
2.2.x           1.6.0
2.3.x           2.0.0
3.0.x           2.3.0
master          2.3.0
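
As a small illustration, the compatibility table can be turned into a lookup helper. This is only a sketch: the function name spark_for_hive is made up for this example, and it knows nothing beyond the rows listed in the table, so any other Hive release line is reported as unknown.

```shell
#!/bin/sh
# Print the Spark version paired with a Hive release line,
# using only the rows of the compatibility table in this post.
# spark_for_hive is a hypothetical helper name, not part of Hive or Spark.
spark_for_hive() {
    case "$1" in
        1.1.*)        echo "1.2.0" ;;
        1.2.*)        echo "1.3.1" ;;
        2.0.*)        echo "1.5.0" ;;
        2.1.*|2.2.*)  echo "1.6.0" ;;
        2.3.*)        echo "2.0.0" ;;
        3.0.*|master) echo "2.3.0" ;;
        *)            echo "unknown"; return 1 ;;
    esac
}

spark_for_hive 3.0.0   # prints 2.3.0
```

A release line that is not in the table (for example 3.1.x) prints "unknown" and returns a non-zero status, which is a hint to check the official page before downloading.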


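1.2 Choose the Spark and Hive versions according to the recommendations in that document, then download from the Tsinghua mirror:

hadoop: wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz

spark: wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz

hive: wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz

scala: wget https://downloads.lightbend.com/scala/2.12.2/scala-2.12.2.tgz

jdk8: https://www.oracle.com/technetwork/cn/java/javase/downloads/jdk8-downloads-2133151-zhs.html


Extract everything into /opt/hadoop/.

Note: the paths used in the rest of this post are spark-2.3.0-bin-hadoop2.7 and apache-hive-3.0.0-bin, while the links above fetch Spark 2.3.1 and Hive 3.1.1. Adjust either the download version or the paths below so that they match the directories you actually unpack.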
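
The extraction step can be scripted. The following is a minimal sketch, assuming the archives were saved into one download directory; the function name extract_all and the example download path are assumptions made for illustration.

```shell
#!/bin/sh
# Unpack every .tar.gz / .tgz archive found in a download directory
# into a target directory. extract_all is a hypothetical helper name.
extract_all() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for archive in "$src_dir"/*.tar.gz "$src_dir"/*.tgz; do
        [ -f "$archive" ] || continue   # skip a glob that matched nothing
        echo "extracting $archive"
        tar -xzf "$archive" -C "$dest_dir" || return 1
    done
}

# Example (the download directory is an assumption):
# extract_all /opt/hadoop/downloads /opt/hadoop
```

After it runs, listing /opt/hadoop shows the real directory names, which are the ones the environment variables in section 2.3 have to point at.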


2. Environment configuration

2.1 hosts file on all machines:

vi /etc/hosts

11.101.22.133 master

11.101.22.134 slave1


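2.2 hostname on all machines:

Set each machine's hostname to match the hosts file. See: Making a hostname change permanent on Linux

This step is mandatory; skipping it causes problems later.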


2.3 Environment variables on all machines:

vi /etc/profile

export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2.7
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export SCALA_HOME=/opt/hadoop/scala-2.12.2
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${JAVA_HOME}/bin:${HIVE_HOME}/bin
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
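
Because the directory names in these variables embed version numbers, a quick sanity check after editing /etc/profile can save time later. The sketch below is one possible way to do it; check_homes is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# For each variable name given, report whether it is set and whether
# the directory it points to exists. Returns non-zero if any check fails.
check_homes() {
    status=0
    for name in "$@"; do
        eval "dir=\${$name:-}"          # read the variable whose name is in $name
        if [ -z "$dir" ]; then
            echo "$name is not set"; status=1
        elif [ ! -d "$dir" ]; then
            echo "$name points to a missing directory: $dir"; status=1
        else
            echo "$name ok: $dir"
        fi
    done
    return $status
}

# After `source /etc/profile`:
# check_homes JAVA_HOME HADOOP_HOME SPARK_HOME SCALA_HOME HIVE_HOME
```

A "missing directory" line usually means the version in the path differs from the version that was actually unpacked.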

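2.4 Passwordless SSH between cluster machines

Passwordless SSH is needed from master to itself and between master and every other machine.

See: Configuring passwordless SSH access between Linux servers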
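
For readers who do not want to follow the linked post, the usual approach is ssh-keygen followed by ssh-copy-id. The sketch below assumes the root account and the host names master and slave1 from section 2.1; the helper names run and setup_ssh are made up for this example. By default it only prints the commands (DRY_RUN=1), so nothing is changed until you set DRY_RUN=0.

```shell
#!/bin/sh
# Passwordless SSH from the current machine to each listed host.
# DRY_RUN defaults to 1: commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_ssh() {
    key="$HOME/.ssh/id_rsa"
    # Generate a key pair with an empty passphrase unless one already exists.
    [ -f "$key" ] || run ssh-keygen -t rsa -N "" -f "$key"
    # Install the public key on every host, including master itself.
    for host in "$@"; do
        run ssh-copy-id -i "$key.pub" "root@$host"   # root is an assumption
    done
}

setup_ssh master slave1
```

Run it on master with DRY_RUN=0 once the printed commands look right; ssh-copy-id will ask for each host's password one last time.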


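3. Install and configure Hadoop

The environment is already configured above. Hadoop's configuration files are all in /opt/hadoop/hadoop-2.7.7/etc/hadoop.

Configure YARN right away so that later steps can use it for resource scheduling.

To repeat: we start with a single node here and add further nodes dynamically, one at a time, later.

3.1 Configuration files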

core-site.xml

<configuration>
    <property>
       <name>hadoop.tmp.dir</name>
       <value>file:/opt/hadoop/data/hadoop/tmp</value>
    </property>
    <property>
       <name>io.file.buffer.size</name>
       <value>131072</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <!-- HDFS trash: keep deleted files for 10080 minutes (7 days) -->
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
        <property>
        <name>fs.trash.checkpoint.interval</name>
        <value>60</value>
    </property>
</configuration>

hadoop-env.sh

export JAVA_HOME=/opt/hadoop/jdk1.8.0_77

hdfs-site.xml

<configuration>
   <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/data/hadoop/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/data/hadoop/datanode</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <!-- Reserve 20 GB of disk space (21474836480 bytes) for non-HDFS use -->
    <property>
        <name>dfs.datanode.du.reserved</name>
        <value>21474836480</value>
    </property>
</configuration>

mapred-site.xml 

cp mapred-site.xml.template mapred-site.xml
<configuration>
    <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
    </property>
</configuration>

slaves

localhost

Further nodes can be added here by host name (as defined in hosts) or by IP address, one per line.

yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
    </property>
    <property>
       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
       <name>yarn.resourcemanager.scheduler.class</name>
       <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
    <property>
       <description>Whether to enable log aggregation</description>
       <name>yarn.log-aggregation-enable</name>
       <value>true</value>
    </property>
    <property>
       <name>yarn.resourcemanager.hostname</name>
       <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://master:19888/jobhistory/logs</value>
    </property>
</configuration>

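3.2 Start Hadoop

If startup fails because the NameNode has never been formatted, format it first (this wipes existing HDFS metadata, so only do it on a fresh cluster): hdfs namenode -format

Then start all daemons from /opt/hadoop/hadoop-2.7.7: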

./sbin/start-all.sh
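
One way to confirm that start-all.sh brought everything up is to inspect the output of jps, the process lister shipped with the JDK. The sketch below assumes the single-node layout of this post, where master runs all five daemons; missing_daemons is a hypothetical helper name.

```shell
#!/bin/sh
# Read `jps` output on stdin and print the expected Hadoop daemons
# that are NOT running. Returns 0 only when nothing is missing.
missing_daemons() {
    running=$(awk '{print $2}')      # second column of jps output is the class name
    missing=""
    for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -x matches the whole line, so NameNode does not match SecondaryNameNode
        echo "$running" | grep -qx "$d" || missing="$missing $d"
    done
    echo "${missing# }"
    [ -z "$missing" ]
}

# Usage on the master node:
# jps | missing_daemons
```

An empty line of output means all five daemons were found; otherwise the names printed point at the log files to look at under the Hadoop logs directory.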

3.3 Verify Hadoop

3.3.1 Open the NameNode web UI at http://master:50070/

[Screenshot: 1.png]


3.3.2 Run a hadoop command on master

hadoop fs -ls /
18/12/23 11:05:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwx-wx-wx   - root supergroup          0 2018-12-21 16:36 /tmp

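As long as there is no error, you are fine. The NativeCodeLoader warning above is harmless; it will keep appearing, and there is no rush to fix it.

Hadoop is now set up.


3.3.3 Check the YARN web UI (by default http://master:8088/)

[Screenshot: 2.png]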


4. Install and configure Spark

4.1 Configuration files

The configuration files are in /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf.

As with Hadoop, start with a single node and add more later.

4.1.1 slaves

cp slaves.template slaves
vi slaves
localhost

When there are other nodes, add them here, one per line.

4.1.2 spark-env.sh

cp spark-env.sh.template spark-env.sh
vi spark-env.sh
export SCALA_HOME=/opt/hadoop/scala-2.12.2
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2.7
export SPARK_MASTER_IP=master
export SPARK_LOCAL_IP=master
export SPARK_MASTER_HOST=master
export SPARK_MASTER_PORT=7077
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_YARN_USER_ENV=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4G

4.1.3 spark-defaults.conf

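spark-defaults.conf holds the default parameters for the spark-submit client, which accepts command-line arguments as well as this file. On the cluster nodes themselves the file is of little use, but I configure it in one place anyway so that it can be copied from here to wherever it is needed.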

cp spark-defaults.conf.template  spark-defaults.conf
spark.master                     spark://master:7077
spark.submit.deployMode                cluster
# History log settings; see section 5 for details
spark.eventLog.enabled                     true
spark.eventLog.compress                    true
spark.eventLog.dir                         hdfs://master:9000/tmp/logs/root/logs
spark.history.fs.logDirectory              hdfs://master:9000/tmp/logs/root/logs
spark.yarn.historyServer.address           master:18080


4.2 Start Spark (in YARN mode the Spark standalone daemons do not need to be started, since YARN does the scheduling)

./sbin/start-all.sh


4.3 Test Spark

Open the web UI at http://master:8080/

[Screenshot: 3.png]

Run the example program:

cd /opt/hadoop/spark-2.3.0-bin-hadoop2.7/bin/
./spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 --deploy-mode cluster ../examples/jars/spark-examples_2.11-2.3.0.jar 1000

Check the web UI again at http://master:8080/

[Screenshot: 5.png]

The job ran successfully.


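5. Spark history log configuration

To browse the history of Spark runs you need to configure event logging and start the history server. See this separate post for the details:

Spark on YARN log configuration and viewing logs in the web UI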


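6. Install and configure Hive

Hive only needs to be installed on master, but hive-site.xml must be copied into Spark's conf directory on every node.

6.1 Install MySQL

If MySQL is already available, skip this step; otherwise follow one of these posts:


Installing MySQL on Linux CentOS and enabling it at boot


Installing MySQL on Linux CentOS with yum


6.2 Copy mysql-connector-java-5.1.25.jar into Hive's lib and Spark's jars directories

Download the MySQL driver: http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.25/mysql-connector-java-5.1.25.jar

Remember to put it in both Spark and Hive. Pick the driver version that matches your MySQL server.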


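6.3 Hive configuration

hive-site.xml must be copied into Spark's conf directory on every node.

hive-site.xml

cp hive-default.xml.template hive-site.xml

Add the following properties at the end, inside the <configuration> element: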

<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <property>
       <name>hive.metastore.uris</name>
       <value>thrift://master:9083</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hivemetastore</value>
    <description/>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description/>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description/>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description/>
  </property>
  <property>
    <name>hive.enable.spark.execution.engine</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.home</name>
    <value>/opt/hadoop/spark-2.3.0-bin-hadoop2.7</value>
  </property>
  <property>
    <name>spark.master</name>
    <value>yarn-cluster</value>
  </property>

Also find every path that references these variables: ${system:java.io.tmpdir}/${system:user.name}

Replace them all with /opt/hadoop/data/hive/iotmp

Otherwise Hive fails with a path-not-found error.
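
Doing this replacement by hand is tedious, so here is a minimal sketch using sed. The function name fix_hive_tmpdir is made up for this example; it rewrites only the exact ${system:java.io.tmpdir}/${system:user.name} pattern described above and leaves a .bak copy of the original file next to it.

```shell
#!/bin/sh
# Replace every ${system:java.io.tmpdir}/${system:user.name} reference
# in a hive-site.xml with a fixed local directory.
# The target directory must not contain the characters | or &.
fix_hive_tmpdir() {
    site_file=$1
    tmp_dir=${2:-/opt/hadoop/data/hive/iotmp}
    # Bracket expressions make $ and . literal in the sed pattern.
    pattern='[$]{system:java[.]io[.]tmpdir}/[$]{system:user[.]name}'
    sed -i.bak "s|$pattern|$tmp_dir|g" "$site_file"
}

# Example:
# fix_hive_tmpdir /opt/hadoop/apache-hive-3.0.0-bin/conf/hive-site.xml
```

Creating the target directory beforehand (mkdir -p /opt/hadoop/data/hive/iotmp) is a reasonable precaution, although the original steps do not mention it.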


hive-env.sh

cp hive-env.sh.template hive-env.sh

export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export HIVE_CONF_DIR=/opt/hadoop/apache-hive-3.0.0-bin/conf
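
Section 6.3 requires hive-site.xml to be present in Spark's conf directory on every node. A loop over scp is one way to do that. The sketch below assumes the root account, the same install paths on all nodes, and the host names from section 2.1; push_hive_site is a hypothetical helper name. It prints the commands by default and only copies when DRY_RUN=0.

```shell
#!/bin/sh
# Copy hive-site.xml into Spark's conf directory on each listed host.
# DRY_RUN defaults to 1: commands are printed, not executed.
HIVE_HOME=${HIVE_HOME:-/opt/hadoop/apache-hive-3.0.0-bin}
SPARK_HOME=${SPARK_HOME:-/opt/hadoop/spark-2.3.0-bin-hadoop2.7}

push_hive_site() {
    src="$HIVE_HOME/conf/hive-site.xml"
    for host in "$@"; do
        dest="root@$host:$SPARK_HOME/conf/"   # root and identical paths are assumptions
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "+ scp $src $dest"
        else
            scp "$src" "$dest" || return 1
        fi
    done
}

push_hive_site master slave1
```

Rerun it whenever hive-site.xml changes, so that Spark on every node sees the same metastore settings.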


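6.4 Initialize the metastore database

First create the database named in the hive-site.xml connection URL (hivemetastore here) in MySQL, then initialize the schema: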

cd /opt/hadoop/apache-hive-3.0.0-bin/bin
./schematool -dbType mysql -initSchema

Initialization script completed

schemaTool completed


6.5 Start the Hive metastore service

Clients connect to Hive through the metastore service.

cd /opt/hadoop/apache-hive-3.0.0-bin/bin
nohup hive --service metastore -p 9083 &

Start hiveserver2 as well if you want third-party clients to connect to Hive the way they would connect to MySQL:

nohup  hive --service hiveserver2 &

Once hiveserver2 is up, test the connection with the bundled beeline client:

./beeline -u jdbc:hive2://master:10000


7. Test Hive on Spark

Start the spark-shell client, which picks up the hive-site.xml copied into Spark's conf directory:

[root@master bin]# spark-shell --master spark://master:7077
2018-12-23 13:10:18 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://master:7077, app id = app-20181223131026-0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

Test creating a Hive table:

scala> spark.sql("create table test(id int)").show
++
||
++
++
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|     test|      false|
+--------+---------+-----------+
scala>


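That completes the Hive on Spark cluster setup!


Additionally:

To add a node to Spark, list the new node in slaves and restart.

To add a node to Hadoop, see: Dynamically adding nodes to scale out Hadoop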




