Setting up Spark on Linux requires Java, Scala, Hadoop, and Spark.
1. Install Java on Linux and configure environment variables
Installing the Java JDK on CentOS and configuring its environment variables is covered in a separate article.
2. Download Hadoop, Scala, and Spark
Hadoop download:
https://archive.apache.org/dist/hadoop/common/
I downloaded hadoop-2.7.7.tar.gz.
Scala download:
https://downloads.lightbend.com/scala/2.12.2/scala-2.12.2.tgz
Spark 2.1.0 download:
https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
After downloading, extract all three archives into the /opt directory.
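A minimal sketch of the download and extraction commands, assuming wget is available and the current user can write to /opt (the hadoop-2.7.7 subdirectory under the archive URL above is the usual layout):
# download the three archives
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
wget https://downloads.lightbend.com/scala/2.12.2/scala-2.12.2.tgz
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
# extract them into /opt
tar -zxf hadoop-2.7.7.tar.gz -C /opt
tar -zxf scala-2.12.2.tgz -C /opt
tar -zxf spark-2.1.0-bin-hadoop2.7.tgz -C /opt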
3. Configure environment variables
Every server needs the same configuration (typically appended to /etc/profile or ~/.bashrc and then reloaded with source):
export HADOOP_HOME=/opt/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7
export JAVA_HOME=/usr/java/jdk1.8.0_77
export SCALA_HOME=/opt/scala-2.12.2
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${JAVA_HOME}/bin
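Assuming the block above was appended to /etc/profile, a quick way to load it and verify that each tool is on the PATH (these are the standard version commands shipped with each package):
source /etc/profile
java -version
scala -version
hadoop version
spark-submit --version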
4. Configure hosts
Add the following two lines to /etc/hosts; if there are more servers, add an entry for each additional slave.
10.10.22.122 master
10.10.22.123 slave1
………………… slave2
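A quick sanity check, run on every machine, that the names resolve to the example addresses above:
ping -c 1 master
ping -c 1 slave1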
5. Configure Hadoop
The Hadoop configuration files live in /opt/hadoop-2.7.7/etc/hadoop.
core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop-2.7.7/hdfs/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- HDFS trash (recycle bin); fs.trash.interval is in minutes -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
  <property>
    <name>fs.trash.checkpoint.interval</name>
    <value>60</value>
  </property>
</configuration>
hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_77
slaves (each line corresponds to a slave entry configured in /etc/hosts, here slave1)
slave1
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop-2.7.7/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop-2.7.7/data/datanode</value>
  </property>
  <!-- Reserve 20 GB of disk space -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>21474836480</value>
  </property>
</configuration>
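Before the first start, create the local directories referenced in core-site.xml and hdfs-site.xml, copy the configured Hadoop directory to each slave, and format the NameNode once on the master. A sketch under those assumptions (the scp step is easier once the passwordless SSH from step 7 is in place):
# directories referenced in the configs above
mkdir -p /opt/hadoop-2.7.7/hdfs/tmp /opt/hadoop-2.7.7/data/namenode /opt/hadoop-2.7.7/data/datanode
# same installation path on every machine
scp -r /opt/hadoop-2.7.7 slave1:/opt/
# format HDFS once, on the master only
hdfs namenode -format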
6. Configure Spark
spark-defaults.conf
# number of cores for the driver
spark.driver.cores 2
# driver memory
spark.driver.memory 2g
# number of executors
spark.executor.instances 4
# cores per executor
spark.executor.cores 2
# memory per executor
spark.executor.memory 4g
# default number of tasks; 2-3x (num-executors * executor-cores) is usually appropriate
spark.default.parallelism 64
slaves
The slave1 here corresponds to the slave1 entry configured in /etc/hosts.
slave1
spark-env.sh
export SCALA_HOME=/opt/scala-2.12.2
export JAVA_HOME=/usr/java/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7
export SPARK_MASTER_IP=master
export SPARK_LOCAL_IP=master
export SPARK_MASTER_HOST=master
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/data/spark/work
# number of worker instances to start on this machine; logically the same as adding nodes to the cluster
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=4G
# enable automatic cleanup; cleanup interval (seconds); how long finished application data is kept (seconds)
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=86400"
On each slave, set export SPARK_LOCAL_IP to that slave's own name (e.g. slave1), following the /etc/hosts entries.
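A sketch of pushing Spark and Scala to a slave and making that per-host change (paths follow step 3; sed is just one way to do the edit):
scp -r /opt/scala-2.12.2 /opt/spark-2.1.0-bin-hadoop2.7 slave1:/opt/
# on slave1, point SPARK_LOCAL_IP at the slave's own hosts entry
ssh slave1 "sed -i 's/^export SPARK_LOCAL_IP=master$/export SPARK_LOCAL_IP=slave1/' /opt/spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh"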
7. Passwordless SSH between the servers and the firewall
Passwordless SSH setup: /article/79.html
Disabling the firewall: /article/28.html
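The linked articles cover the details; on CentOS 7 (firewalld assumed) a minimal version of both steps looks like:
# generate a key on the master and copy it to every node, including the master itself
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id master
ssh-copy-id slave1
# stop and disable the firewall on every machine
systemctl stop firewalld
systemctl disable firewalld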
8. Start Hadoop and Spark
Run start-all.sh from the Hadoop sbin directory and then from the Spark sbin directory; the master and all slave daemons will come up.
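Using the paths from step 3 (both distributions keep start-all.sh under sbin), starting and checking the cluster looks roughly like this; jps should list NameNode, SecondaryNameNode, ResourceManager and Master on the master, and DataNode, NodeManager and Worker on slave1:
/opt/hadoop-2.7.7/sbin/start-all.sh
/opt/spark-2.1.0-bin-hadoop2.7/sbin/start-all.sh
# verify the running Java daemons on each machine
jps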
Hadoop web UI: http://masterIP:50070
Spark web UI: http://masterIP:8080
From either UI you can click a worker node to open that node's own page.
9. Test Spark and check the Spark web UI
From the Spark bin directory, run:
./spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master:7077 \
  --executor-memory 1024M \
  --total-executor-cores 5 \
  ../examples/jars/spark-examples_2.11-2.1.0.jar 1000
The job status can be checked at http://masterIP:8080, and while it is running you can also follow its progress at http://masterIP:4040.
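As one more quick sanity check (a hypothetical one-liner, not from the original write-up), a small job can be piped into spark-shell against the standalone master:
echo 'println(sc.parallelize(1 to 1000).sum())' | /opt/spark-2.1.0-bin-hadoop2.7/bin/spark-shell --master spark://master:7077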