1. The exception
Spark jobs used to submit and run without issue, but recently they keep failing with BindException: Address already in use.
The Spark UI shows:
HTTP ERROR 500
Problem accessing /proxy/application_1588486936385_2884/. Reason:
Address already in use
Caused by:
java.net.BindException: Address already in use
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
at java.net.Socket.bind(Socket.java:644)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:120)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:200)
at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:387)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
The exception in the yarn application logs:
20/12/21 12:55:18 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on a random free port. You may check whether configuring an appropriate binding address.
20/12/21 12:55:18 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
java.net.BindException: Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
End of LogType:stderr
2. Analysis
Spark retried 100 times and still could not bind a random free port, which means the ports are effectively all taken.
Check the socket totals with ss -s:
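As an aside, the retry count comes from spark.port.maxRetries (Spark's default is 16, so this cluster has presumably raised it). Widening it further only masks a port leak; shown for completeness, with hypothetical job arguments:

```shell
# Not a real fix -- just lets Spark search more ports before giving up
# (--class and jar name are placeholders)
spark-submit --conf spark.port.maxRetries=128 --class com.example.MyJob my-job.jar
```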
Total: 105626 (kernel 109563)
TCP: 105277 (estab 196, closed 79, orphaned 0, synrecv 0, timewait 77/0), ports 0
Transport Total IP IPv6
* 109563 - -
RAW 1 0 1
UDP 15 8 7
TCP 105198 104809 389
INET 105214 104817 397
FRAG 0 0 0
Over 100,000 TCP sockets are open; the local port range is essentially exhausted.
ss also shows a huge number of connections stuck in CLOSE-WAIT. CLOSE-WAIT means the peer has closed its side but the local process has never called close(), so each of these sockets pins a local port indefinitely:
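For context, each outbound connection consumes a port from the kernel's ephemeral port range, so one host can only hold a bounded number of client-side connections. A quick way to see that range on Linux (on recent kernels the default is roughly 32768-60999, about 28k ports):

```shell
# Show the ephemeral (client-side) port range the kernel hands out
cat /proc/sys/net/ipv4/ip_local_port_range
```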
ESTAB 0 0 [::ffff:192.168.827]:44693 [::ffff:192.168.860]:37528
CLOSE-WAIT 1 0 [::ffff:192.168.827]:58473 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:55800 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:37749 [::ffff:192.168.860]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:54642 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0 [::ffff:192.168.827]:39578 [::ffff:192.168.827]:50010
CLOSE-WAIT 1 0
...
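Rather than eyeballing the listing, the states can be tallied directly (a sketch; ss -tan prints the TCP state in column 1, with one header line):

```shell
# Count TCP sockets per state; NR > 1 skips the header line
ss -tan | awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort -k2 -rn
```

On a leaking host like this one, CLOSE-WAIT should dominate the output.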
Pick one of these ports at random and dig in.
Check the port's state:
[root@master ~]# netstat -anp | grep 54889
tcp 1 0 192.168.1.827:54889 192.168.1.803:50010 CLOSE_WAIT 19870/java
tcp6 1 0 192.168.1.827:54889 192.168.1.827:50010 CLOSE_WAIT 44212/java
Check the owning process:
[root@master ~]# ps -ef|grep 19870
root 17678 45042 0 16:34 pts/0 00:00:00 grep --color=auto 19870
root 19870 1 0 May04 ? 1-10:31:14 /opt/hadoop/jdk1.8.0_77/bin/java -Xmx16384m -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dproc_hiveserver2 -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/hadoop/apache-hive-3.0.0-bin/conf/parquet-logging.properties -Djline.terminal=jline.UnsupportedTerminal -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/hadoop/apache-hive-3.0.0-bin/lib/hive-service-3.0.0.jar org.apache.hive.service.server.HiveServer2
List the Java processes:
[root@master ~]# jps
28966 HQuorumPeer
11113 SecondaryNameNode
28457 HRegionServer
10858 DataNode
15722 NodeManager
11403 ResourceManager
10707 NameNode
44212 ApplicationMaster
2839 Jps
19672 RunJar
28217 HMaster
19870 RunJar
42814 RunJar
PID 19870 shows up as RunJar in jps, and its command line ends in org.apache.hive.service.server.HiveServer2, so this RunJar is HiveServer2.
Now check the remote end of these connections, port 50010, on the peer host:
[root@slave3 ~]# netstat -anp|grep 50010
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 11430/java
tcp 0 0 192.168.1.859:50010 192.168.1.860:34366 ESTABLISHED 11430/java
tcp 0 0 192.168.1.859:38024 192.168.1.803:50010 ESTABLISHED 11430/java
tcp 0 0 192.168.1.859:50010 192.168.1.859:47796 ESTABLISHED 11430/java
tcp 0 0 192.168.1.859:38022 192.168.1.803:50010 ESTABLISHED 11430/java
tcp6 0 0 192.168.1.859:47796 192.168.1.859:50010 ESTABLISHED 29418/java
[root@slave3 ~]# ps -ef|grep 11430
root 11430 1 0 May03 ? 2-03:40:55 /opt/hadoop/jdk1.8.0_77/bin/java -Dproc_datanode -Xmx16384m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop-root-datanode-slave3.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
root 11571 1179 0 16:46 pts/0 00:00:00 grep --color=auto 11430
[root@slave3 ~]#
[root@slave3 ~]#
[root@slave3 ~]# jps
15488 HQuorumPeer
11665 Jps
5749 NodeManager
11430 DataNode
29418 HRegionServer
[root@slave3 ~]#
So the remote end is the DataNode (50010 is the DataNode data-transfer port).
Conclusion: a large number of connections between HiveServer2 and the DataNodes are never being closed.
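Instead of tracing ports one by one, the leak can be attributed in a single pass by grouping CLOSE-WAIT sockets by their owning PID (a sketch; ss -p needs root to resolve owners, and the users:(...,pid=N,...) field format can vary slightly across ss versions):

```shell
# Count CLOSE-WAIT sockets per owning PID; the regex pulls N out of "pid=N"
ss -tanp state close-wait |
  awk 'NR > 1 && match($0, /pid=[0-9]+/) { count[substr($0, RSTART + 4, RLENGTH - 4)]++ }
       END { for (p in count) print p, count[p] }' |
  sort -k2 -rn
```

Here the top PIDs would be 19870 (HiveServer2) and 44212 (the ApplicationMaster).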
3. The fix
First, kill the HiveServer2 process. ss -s immediately shows far fewer ports in use, and netstat gets much faster. (With this many sockets netstat crawls; ss does not.)
[root@master ~]# ss -s
Total: 983 (kernel 14276)
TCP: 640 (estab 199, closed 84, orphaned 0, synrecv 0, timewait 82/0), ports 0
Transport Total IP IPv6
* 14276 - -
RAW 1 0 1
UDP 15 8 7
TCP 556 155 401
INET 572 163 409
FRAG 0 0 0
Hypothesis 1: I have connected to HiveServer2 from a client and simply closed the client when a query was slow, leaving HiveServer2's connections to the DataNodes open. But I only run a handful of such queries a year, which cannot account for this many leaked connections, so this is very unlikely.
Hypothesis 2: some application code implicitly connects to HiveServer2 and never closes the connection. Also unlikely: after shutting down HiveServer2, my applications still run normally.
Next step: keep HiveServer2 stopped and continue to monitor the port usage.
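For that monitoring, a minimal sketch that appends a timestamped CLOSE-WAIT count to a log every 5 minutes (the interval and log path are arbitrary choices):

```shell
# Log the CLOSE-WAIT socket count every 5 minutes; tail -n +2 drops the ss header
while true; do
  n=$(ss -tan state close-wait | tail -n +2 | wc -l)
  echo "$(date '+%F %T') close-wait=$n" >> /var/log/close-wait-watch.log
  sleep 300
done
```

If the count climbs again with HiveServer2 down, the leak has another source; if it stays flat, HiveServer2 was the culprit.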