Setting up a Hive on Spark cluster environment is very simple. Don't bother building Spark yourself — compiling, sorting out dependencies, and packaging is a lot of hassle; just download the normal, stable release and keep it simple.
The environment is built on Linux CentOS 7 only, with everything placed under /opt/hadoop. Install a single node first, then dynamically add nodes one by one.
1.1 Before downloading, know which versions you need
Hive on Spark official documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Spark and Hive version compatibility:
| Hive Version | Spark Version |
|---|---|
| 1.1.x | 1.2.0 |
| 1.2.x | 1.3.1 |
| 2.0.x | 1.5.0 |
| 2.1.x | 1.6.0 |
| 2.2.x | 1.6.0 |
| 2.3.x | 2.0.0 |
| 3.0.x | 2.3.0 |
| master | 2.3.0 |
1.2 Pick the Spark and Hive versions according to the table above and download them from the Tsinghua mirror:
hadoop: wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz
spark: wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
hive: wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
scala:wget https://downloads.lightbend.com/scala/2.12.2/scala-2.12.2.tgz
jdk8:https://www.oracle.com/technetwork/cn/java/javase/downloads/jdk8-downloads-2133151-zhs.html
Extract everything to /opt/hadoop/. Note that the environment variables and paths later in this post assume spark-2.3.0-bin-hadoop2.7 and apache-hive-3.0.0-bin; adjust the paths to whatever versions you actually downloaded.
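For reference, a rough sketch of the extraction step, assuming the archives were downloaded into /opt/hadoop (the exact JDK archive name depends on what you downloaded from Oracle):
cd /opt/hadoop
tar -zxvf hadoop-2.7.7.tar.gz
tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
tar -zxvf apache-hive-3.1.1-bin.tar.gz
tar -zxvf scala-2.12.2.tgz
tar -zxvf jdk-8u77-linux-x64.tar.gz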
2.1 hosts on all machines:
vi /etc/hosts
11.101.22.133 master
11.101.22.134 slave1
2.2 hostname on all machines:
Set each machine's hostname to match the hosts file.
This must be done, otherwise there will be problems later.
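For example, on CentOS 7 the hostname can be set like this (run on each machine with its own name from the hosts file):
hostnamectl set-hostname master   # on the master machine
hostnamectl set-hostname slave1   # on the slave1 machine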
2.3 Environment variables on all machines:
vi /etc/profile
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2.7
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export SCALA_HOME=/opt/hadoop/scala-2.12.2
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${JAVA_HOME}/bin:${HIVE_HOME}/bin
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
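After saving, apply the variables to the current shell (a standard step, not shown above):
source /etc/profile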
2.4 Configure passwordless SSH across the cluster
Passwordless SSH is needed from master to itself and between master and every other machine.
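A minimal sketch, assuming everything runs as root and only master and slave1 exist so far:
# on master: generate a key pair (press Enter at every prompt)
ssh-keygen -t rsa
# authorize the key for master itself and for every other node
ssh-copy-id root@master
ssh-copy-id root@slave1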
The environment variables are already configured above. Hadoop's configuration files all live in /opt/hadoop/hadoop-2.7.7/etc/hadoop.
Configure YARN right away so that resource scheduling can later go through YARN.
To repeat: we start with a single node and dynamically add more nodes one by one later.
3.1 Configuration files
core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/data/hadoop/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Trash retention -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
  <property>
    <name>fs.trash.checkpoint.interval</name>
    <value>60</value>
  </property>
</configuration>
hadoop-env.sh
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/data/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <!-- Reserve 20 GB of disk space -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>21474836480</value>
  </property>
</configuration>
mapred-site.xml
cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
slaves
localhost
Other nodes or IPs can be listed here according to the hosts file, one per line.
yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <description>Whether to enable log aggregation</description>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://master:19888/jobhistory/logs</value>
  </property>
</configuration>
3.2 Start Hadoop
For the very first start (or if startup fails because HDFS is not formatted), format the NameNode first — note this wipes any existing HDFS metadata: hadoop namenode -format
cd /opt/hadoop/hadoop-2.7.7
./sbin/start-all.sh
3.3 Test Hadoop
3.3.1 Open the HDFS web UI: http://master:50070/
3.3.2 Run a hadoop command on master
hadoop fs -ls /
18/12/23 11:05:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwx-wx-wx   - root supergroup          0 2018-12-21 16:36 /tmp
As long as there is no error, you're fine. The warning above is harmless; it will keep following you around, so don't rush to fix it.
OK, Hadoop is done.
3.3.3 Check the YARN web UI (http://master:8088/ by default)
4.1 Configuration files
The Spark configuration files are in /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf
Same approach as before: start Spark with a single node and add more nodes later.
4.1.1 slaves
cp slaves.template slaves
vi slaves
localhost
When there are other nodes, add them here, one per line.
4.1.2 spark-env.sh
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
export SCALA_HOME=/opt/hadoop/scala-2.12.2
export JAVA_HOME=/opt/hadoop/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2.7
export SPARK_MASTER_IP=master
export SPARK_LOCAL_IP=master
export SPARK_MASTER_HOST=master
export SPARK_MASTER_PORT=7077
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_YARN_USER_ENV=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4G
4.1.3 spark-defaults.conf
spark-defaults.conf holds the default parameters for the spark-submit client. The client accepts both command-line arguments and spark-defaults.conf. On the cluster side this file is not of much use, but I prefer to keep one unified configuration here and copy it wherever it is needed.
cp spark-defaults.conf.template spark-defaults.conf
spark.master                     spark://master:7077
spark.submit.deployMode          cluster
# History/event log settings; see Section 5 for details
spark.eventLog.enabled           true
spark.eventLog.compress          true
spark.eventLog.dir               hdfs://master:9000/tmp/logs/root/logs
spark.history.fs.logDirectory    hdfs://master:9000/tmp/logs/root/logs
spark.yarn.historyServer.address master:18080
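Note that spark.eventLog.dir must already exist in HDFS before jobs try to write to it; a quick sketch of creating the directory used above:
hadoop fs -mkdir -p /tmp/logs/root/logs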
4.2 Start Spark (in YARN mode Spark does not have to be started; YARN handles the scheduling)
cd /opt/hadoop/spark-2.3.0-bin-hadoop2.7
./sbin/start-all.sh
4.3 Test Spark
Check the web UI at http://master:8080/
Run a test job:
cd /opt/hadoop/spark-2.3.0-bin-hadoop2.7/bin/
./spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 --deploy-mode cluster ../examples/jars/spark-examples_2.11-2.3.0.jar 1000
Check the web UI at http://master:8080/ again.
The job runs successfully.
To browse Spark's historical job records you need to configure event logging and start the history server; that is covered in a separate post.
Hive only needs to be installed on master, but hive-site.xml must be copied into the Spark conf directory on every node.
6.1 Install MySQL
If you already have MySQL, skip this step; otherwise install it first by following a MySQL installation guide.
6.2 Copy mysql-connector-java-5.1.25.jar into Hive's lib and Spark's jars directories
Download the MySQL JDBC driver: http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.25/mysql-connector-java-5.1.25.jar
Remember to put it into both Spark and Hive; pick whatever driver version matches your MySQL.
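A sketch of the copy, assuming the jar sits in the current directory and the install paths used throughout this post:
cp mysql-connector-java-5.1.25.jar /opt/hadoop/apache-hive-3.0.0-bin/lib/
cp mysql-connector-java-5.1.25.jar /opt/hadoop/spark-2.3.0-bin-hadoop2.7/jars/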
6.3 Hive configuration
Be sure to copy hive-site.xml into the Spark conf directory on every node (a copy sketch follows the hive-env.sh settings below).
hive-site.xml
cp hive-default.xml.template hive-site.xml
Append the following inside the <configuration> element at the end:
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://master:9083</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://master:3306/hivemetastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>
<property>
  <name>hive.enable.spark.execution.engine</name>
  <value>true</value>
</property>
<property>
  <name>spark.home</name>
  <value>/opt/hadoop/spark-2.3.0-bin-hadoop2.7</value>
</property>
<property>
  <name>spark.master</name>
  <value>yarn-cluster</value>
</property>
Also search hive-site.xml for every occurrence of variable paths like ${system:java.io.tmpdir}/${system:user.name} and replace them all with /opt/hadoop/data/hive/iotmp, otherwise Hive will fail with path-not-found errors.
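One way to do the replacement in bulk (a sketch; back up hive-site.xml first, and keep the single quotes so the shell does not try to expand the ${...} references):
mkdir -p /opt/hadoop/data/hive/iotmp
sed -i 's#${system:java.io.tmpdir}/${system:user.name}#/opt/hadoop/data/hive/iotmp#g' hive-site.xml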
hive-env.sh
cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.7
export SPARK_HOME=/opt/hadoop/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/opt/hadoop/apache-hive-3.0.0-bin
export HIVE_CONF_DIR=/opt/hadoop/apache-hive-3.0.0-bin/conf
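As stressed above, hive-site.xml has to end up in every node's Spark conf directory. A sketch, with slave1 as an example taken from the hosts file (assuming Spark is installed at the same path there):
cp /opt/hadoop/apache-hive-3.0.0-bin/conf/hive-site.xml /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf/
scp /opt/hadoop/apache-hive-3.0.0-bin/conf/hive-site.xml root@slave1:/opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf/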
6.4 Initialize the metastore database
First create the database in MySQL as configured in hive-site.xml.
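For example, using the connection settings from hive-site.xml above (user root, password 123456, database hivemetastore):
mysql -uroot -p123456 -e "CREATE DATABASE IF NOT EXISTS hivemetastore;"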
cd /opt/hadoop/apache-hive-3.0.0-bin/bin
./schematool -dbType mysql -initSchema
Initialization script completed
schemaTool completed
6.5 Start the Hive metastore service
Clients connect to Hive through the metastore.
cd /opt/hadoop/apache-hive-3.0.0-bin/bin
nohup hive --service metastore -p 9083 &
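To confirm the metastore is up, check that port 9083 is listening, for example:
ss -lnt | grep 9083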
Start HiveServer2 if you want to connect to Hive from third-party clients, the same way you would connect to MySQL:
nohup hive --service hiveserver2 &
Once HiveServer2 is up, you can test the connection with the bundled Beeline client:
./beeline -u jdbc:hive2://master:10000
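Once connected, running a query that launches an actual job is a quick way to confirm the Spark engine is wired up — the job should appear in the YARN UI. A hypothetical Beeline session (the table name demo is made up):
0: jdbc:hive2://master:10000> create table demo(id int);
0: jdbc:hive2://master:10000> insert into demo values (1);
0: jdbc:hive2://master:10000> select count(*) from demo;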
Start the spark-shell client:
[root@master bin]# spark-shell --master spark://master:7077
2018-12-23 13:10:18 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = spark://master:7077, app id = app-20181223131026-0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Test creating a Hive table:
scala> spark.sql("create table test(id int)").show
++
||
++
++

scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|     test|      false|
+--------+---------+-----------+

scala>
At this point the Hive on Spark cluster environment is up and running!
In addition:
To add a new Spark worker node, add it to the slaves file and restart Spark, for example:
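The following sketch adds slave1 (from the hosts file) as a Spark worker; it assumes Spark is already installed at the same path on slave1 and passwordless SSH is set up:
# on master
echo "slave1" >> /opt/hadoop/spark-2.3.0-bin-hadoop2.7/conf/slaves
cd /opt/hadoop/spark-2.3.0-bin-hadoop2.7
./sbin/stop-all.sh
./sbin/start-all.sh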
Adding a new Hadoop node is covered in a separate post.