Setting up a Hadoop and Spark environment on Linux

2020-03-06 16:25:47

Setting up Spark on Linux requires Java, Scala, Hadoop, and Spark.


1. Install Java on Linux and configure environment variables

Reference: installing the Java JDK and configuring environment variables on CentOS


2. Download Hadoop, Scala, and Spark

Hadoop download:

https://archive.apache.org/dist/hadoop/common/

I downloaded hadoop-2.7.7.tar.gz.


Scala download:

https://downloads.lightbend.com/scala/2.12.2/scala-2.12.2.tgz


Spark 2.1.0 download:

https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz


After downloading, extract all three archives into the /opt directory.
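The extraction can be done in one pass; a minimal sketch, assuming the three archives sit in the current directory and you have write access to /opt (otherwise prefix with sudo):

```shell
# Extract each downloaded archive into /opt
tar -xzf hadoop-2.7.7.tar.gz -C /opt
tar -xzf scala-2.12.2.tgz -C /opt
tar -xzf spark-2.1.0-bin-hadoop2.7.tgz -C /opt
```

Afterwards /opt should contain hadoop-2.7.7, scala-2.12.2, and spark-2.1.0-bin-hadoop2.7.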


3. Configure environment variables

Every server needs the same configuration; these exports typically go into /etc/profile or ~/.bashrc and take effect after sourcing that file.

export HADOOP_HOME=/opt/hadoop-2.7.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7
export JAVA_HOME=/usr/java/jdk1.8.0_77
export SCALA_HOME=/opt/scala-2.12.2
export PATH=$PATH:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${JAVA_HOME}/bin


4. Configure hosts

Add the following lines to /etc/hosts; if you have more servers, add one slave entry per machine:

10.10.22.122 master

10.10.22.123 slave1

(further IPs) slave2, slave3, …


5. Configure Hadoop

The Hadoop configuration files live in /opt/hadoop-2.7.7/etc/hadoop.


core-site.xml

<configuration>
    <property>
       <name>hadoop.tmp.dir</name>
       <value>file:/opt/hadoop-2.7.7/hdfs/tmp</value>
    </property>
    <property>
       <name>io.file.buffer.size</name>
       <value>131072</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
<!-- HDFS trash (deleted-file retention) settings -->
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>
    <property>
        <name>fs.trash.checkpoint.interval</name>
        <value>60</value>
    </property>
</configuration>
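Both trash values are in minutes: deleted files are kept for 10080 minutes (7 days) and trash checkpoints run every 60 minutes. A quick sanity check of the retention period:

```shell
# fs.trash.interval is in minutes; convert 10080 minutes to days
echo $(( 10080 / 60 / 24 ))   # 7
```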


hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_77


slaves — each entry corresponds to a slave configured in /etc/hosts

slave1


mapred-site.xml

<configuration>
    <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
    </property>
</configuration>


yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
    </property>
    <property>
       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>


hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop-2.7.7/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop-2.7.7/data/datanode</value>
    </property>
    <!-- Reserve 20 GB of disk space per volume for non-HDFS use -->
    <property>
        <name>dfs.datanode.du.reserved</name>
        <value>21474836480</value>
    </property>
</configuration>
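dfs.datanode.du.reserved is specified in bytes; the value above is exactly 20 GB:

```shell
# 20 GB expressed in bytes, as required by dfs.datanode.du.reserved
echo $(( 20 * 1024 * 1024 * 1024 ))   # 21474836480
```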


6. Configure Spark

spark-defaults.conf

# Number of cores for the driver
spark.driver.cores 2
# Driver memory
spark.driver.memory 2g
# Number of executors (note: spark.num.executors is not a standard Spark
# property; on YARN the equivalent is spark.executor.instances)
spark.num.executors 4
# Cores per executor
spark.executor.cores 2
# Memory per executor
spark.executor.memory 4g
# Default task parallelism; 2-3x (executors x cores per executor) is a common rule of thumb
spark.default.parallelism 64
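By that rule of thumb, 4 executors with 2 cores each would suggest a parallelism of roughly 16-24; the 64 used here is a deliberately higher setting. The arithmetic:

```shell
EXECUTORS=4
CORES=2
# 2-3x the total core count is the usual recommendation
echo $(( EXECUTORS * CORES * 2 ))   # 16
echo $(( EXECUTORS * CORES * 3 ))   # 24
```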


slaves

The slave1 here corresponds to the slave1 configured in /etc/hosts.

slave1


spark-env.sh

export SCALA_HOME=/opt/scala-2.12.2
export JAVA_HOME=/usr/java/jdk1.8.0_77
export HADOOP_HOME=/opt/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7
export SPARK_MASTER_IP=master
export SPARK_LOCAL_IP=master
export SPARK_MASTER_HOST=master
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/data/spark/work
# Number of worker instances: how many workers to start on this machine;
# logically the same effect as adding nodes to the cluster
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=4G
# Enable automatic cleanup of worker dirs: run every 3600 s, keep the last 86400 s of app data
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=86400"

On each slave, set export SPARK_LOCAL_IP=slave1 (or slave2, …) to match that machine's hostname in /etc/hosts.
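Since every node needs the same files, one way to distribute the finished configuration is scp from master; a sketch (the root user and paths here are assumptions, adjust to your setup):

```shell
# Push the Hadoop and Spark config directories from master to slave1
scp -r /opt/hadoop-2.7.7/etc/hadoop root@slave1:/opt/hadoop-2.7.7/etc/
scp -r /opt/spark-2.1.0-bin-hadoop2.7/conf root@slave1:/opt/spark-2.1.0-bin-hadoop2.7/
```

Repeat for each additional slave.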


7. Passwordless SSH between servers, and the firewall

Passwordless SSH setup: /article/79.html

Disabling the firewall: /article/28.html


8. Start Hadoop and Spark

Run start-all.sh from the sbin directory of Hadoop, then from the sbin directory of Spark; the master and all slave daemons will come up.
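Before the very first start, HDFS also needs its NameNode formatted. A sketch of the full sequence on master, assuming the environment variables from step 3 are in effect:

```shell
# One-time only: initialize the NameNode metadata directory
hdfs namenode -format
# Start HDFS and YARN
$HADOOP_HOME/sbin/start-all.sh
# Start the Spark master and workers
$SPARK_HOME/sbin/start-all.sh
# jps should now list NameNode, SecondaryNameNode, ResourceManager, Master, etc.
jps
```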

Hadoop web UI: http://masterIP:50070

Spark web UI: http://masterIP:8080

You can click a worker node to open that node's own console.


9. Test Spark and check the Spark UI

From the Spark bin directory, run:

./spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 --executor-memory 1024M --total-executor-cores 5 ../examples/jars/spark-examples_2.11-2.1.0.jar 1000

Execution status can be watched at http://masterIP:8080, and while the job is running, at the driver UI http://masterIP:4040. When it finishes, the driver log prints a line like "Pi is roughly 3.14…".







