Spark cluster ports exhausted - BindException: Address already in use

spark | 2020-12-22 09:06:51

1. Exception messages

Submitting Spark jobs to this cluster used to work fine, but recently jobs keep failing with: BindException: Address already in use

The exception shown in the Spark UI:

HTTP ERROR 500
Problem accessing /proxy/application_1588486936385_2884/. Reason:

    Address already in use
Caused by:
java.net.BindException: Address already in use
	at java.net.PlainSocketImpl.socketBind(Native Method)
	at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
	at java.net.Socket.bind(Socket.java:644)
	at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:120)
	at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
	at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
	at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:200)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:387)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)

Note that this BindException is thrown on an outgoing request: the YARN web proxy fails while binding a local port for its connection to the application (PlainSocketFactory.connectSocket), a first hint that free local ports are scarce. The YARN application logs show the same picture:

20/12/21 12:55:18 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on a random free port. You may check whether configuring an appropriate binding address.
20/12/21 12:55:18 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
java.net.BindException: Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:433)
	at sun.nio.ch.Net.bind(Net.java:425)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:745)
End of LogType:stderr
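The log suggests pinning the service's binding address and shows that it gave up after 100 bind retries. For reference, here is a hedged sketch of setting these knobs at submit time (the property names are standard Spark configuration keys; the address, port, class, and jar are placeholders, and as the analysis below shows, the real culprit here was port exhaustion, which no bind setting can fix):

# Sketch only: adjust the address and base port to your cluster
spark-submit \
  --conf spark.driver.bindAddress=192.168.1.827 \
  --conf spark.blockManager.port=40000 \
  --conf spark.port.maxRetries=100 \
  --class com.example.MyApp my-app.jar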


2. Exception analysis

If a service cannot find a free port after 100 retries on random ports, the local ports must be all but exhausted.
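Those "random free ports" come from the kernel's ephemeral port range, so it helps to know how big that range is (a standard Linux sysctl; on most distributions the default is 32768-60999, roughly 28,000 ports):

cat /proc/sys/net/ipv4/ip_local_port_range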

Check the overall socket counts with ss -s:

Total: 105626 (kernel 109563)
TCP:   105277 (estab 196, closed 79, orphaned 0, synrecv 0, timewait 77/0), ports 0

Transport Total     IP        IPv6
*	  109563    -         -        
RAW	  1         0         1        
UDP	  15        8         7        
TCP	  105198    104809    389      
INET	  105214    104817    397      
FRAG	  0         0         0   

Over a hundred thousand TCP sockets are open: the usable local port range is completely exhausted.

Running ss shows a huge number of connections stuck in CLOSE-WAIT. A socket in CLOSE-WAIT means the remote side has closed the connection but the local process has never called close() on it, i.e. the local application is leaking connections:

ESTAB      0   0   [::ffff:192.168.827]:44693   [::ffff:192.168.860]:37528
CLOSE-WAIT 1   0   [::ffff:192.168.827]:58473   [::ffff:192.168.827]:50010
CLOSE-WAIT 1   0   [::ffff:192.168.827]:55800   [::ffff:192.168.827]:50010
CLOSE-WAIT 1   0   [::ffff:192.168.827]:37749   [::ffff:192.168.860]:50010
CLOSE-WAIT 1   0   [::ffff:192.168.827]:54642   [::ffff:192.168.827]:50010
CLOSE-WAIT 1   0   [::ffff:192.168.827]:39578   [::ffff:192.168.827]:50010
CLOSE-WAIT 1   0   ...
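To tally the sockets by state in one shot (a sketch; the first column of ss -tan output is the TCP state):

ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn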

Pick one of these ports at random and analyze it.

Check the port's state:

[root@master ~]# netstat -anp | grep 54889
tcp        1      0 192.168.1.827:54889      192.168.1.803:50010      CLOSE_WAIT  19870/java          
tcp6       1      0 192.168.1.827:54889      192.168.1.827:50010      CLOSE_WAIT  44212/java 
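Both PIDs are worth a look, but first it is useful to check whether they are isolated cases. A sketch that aggregates the CLOSE_WAIT sockets by owning process (the last column of netstat -anp is PID/program name):

netstat -anp | grep CLOSE_WAIT | awk '{print $NF}' | sort | uniq -c | sort -rn | head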

Look up the owning process:

[root@master ~]# ps -ef|grep 19870
root     17678 45042  0 16:34 pts/0    00:00:00 grep --color=auto 19870
root     19870     1  0 May04 ?        1-10:31:14 /opt/hadoop/jdk1.8.0_77/bin/java -Xmx16384m -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dproc_hiveserver2 -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/hadoop/apache-hive-3.0.0-bin/conf/parquet-logging.properties -Djline.terminal=jline.UnsupportedTerminal -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/hadoop/apache-hive-3.0.0-bin/lib/hive-service-3.0.0.jar org.apache.hive.service.server.HiveServer2

List the Java processes:

[root@master ~]# jps
28966 HQuorumPeer
11113 SecondaryNameNode
28457 HRegionServer
10858 DataNode
15722 NodeManager
11403 ResourceManager
10707 NameNode
44212 ApplicationMaster
2839 Jps
19672 RunJar
28217 HMaster
19870 RunJar
42814 RunJar

Judging from the ps output, this RunJar process (PID 19870) is HiveServer2.
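jps alone shows three RunJar processes, so it is worth confirming which one is HiveServer2. jps -lm prints each process's main class and arguments (the grep pattern is just illustrative):

jps -lm | grep -i hive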

Next, check the other end of these connections: port 50010 on the remote host, which is the default HDFS DataNode data-transfer port:

[root@slave3 ~]# netstat -anp|grep 50010
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      11430/java          
tcp        0      0 192.168.1.859:50010      192.168.1.860:34366      ESTABLISHED 11430/java          
tcp        0      0 192.168.1.859:38024      192.168.1.803:50010      ESTABLISHED 11430/java          
tcp        0      0 192.168.1.859:50010      192.168.1.859:47796      ESTABLISHED 11430/java          
tcp        0      0 192.168.1.859:38022      192.168.1.803:50010      ESTABLISHED 11430/java          
tcp6       0      0 192.168.1.859:47796      192.168.1.859:50010      ESTABLISHED 29418/java          
[root@slave3 ~]# ps -ef|grep 11430
root     11430     1  0 May03 ?        2-03:40:55 /opt/hadoop/jdk1.8.0_77/bin/java -Dproc_datanode -Xmx16384m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop/hadoop-2.7.7/logs -Dhadoop.log.file=hadoop-root-datanode-slave3.log -Dhadoop.home.dir=/opt/hadoop/hadoop-2.7.7 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop/hadoop-2.7.7/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
root     11571  1179  0 16:46 pts/0    00:00:00 grep --color=auto 11430
[root@slave3 ~]# 
[root@slave3 ~]# 
[root@slave3 ~]# jps
15488 HQuorumPeer
11665 Jps
5749 NodeManager
11430 DataNode
29418 HRegionServer
[root@slave3 ~]# 

Sure enough, the peer process is the HDFS DataNode.

Conclusion: there are a large number of connections between HiveServer2 and the DataNodes that were never closed.
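To put a number on the leak, the CLOSE-WAIT connections pointing at the DataNode port can be counted directly (a sketch; grep also skips the ss header line):

ss -tan state close-wait | grep -c ':50010'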


3. Resolving the exception

First, kill the HiveServer2 process. Right away, ss -s shows far fewer sockets in use, and netstat responds much faster. (With that many sockets netstat all but hangs; ss stays fast.)

[root@master ~]# ss -s
Total: 983 (kernel 14276)
TCP:   640 (estab 199, closed 84, orphaned 0, synrecv 0, timewait 82/0), ports 0

Transport Total     IP        IPv6
*	  14276     -         -        
RAW	  1         0         1        
UDP	  15        8         7        
TCP	  556       155       401      
INET	  572       163       409      
FRAG	  0         0         0 

Hypothesis 1: I had connected to HiveServer2 from a client and, when queries were slow, simply killed the client, leaving the HiveServer2-to-DataNode connections open. But I run such queries only a few times a year, which could never account for this many leaked connections. Very unlikely.

Hypothesis 2: some application code silently connects to HiveServer2 and never closes the connection. Also unlikely: after HiveServer2 was killed, my applications kept running as usual.

The plan for now is to leave HiveServer2 stopped and keep observing the port usage.
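A crude way to do that, sketched below, is to append a timestamped TCP summary to a log file every five minutes and watch whether the counts creep back up (the log path and interval are arbitrary choices):

while true; do
  echo "$(date '+%F %T') $(ss -s | grep '^TCP:')" >> /tmp/socket_usage.log
  sleep 300
done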

