1. Exception information
Recently several HBase RegionServer nodes have been dying from time to time. Checking the log of one of the dead nodes, hbase-root-regionserver-slave3.log, shows the following errors:
2019-07-11 02:58:34,320 WARN [regionserver/slave3/192.168.202.19:16020-longCompactions-1562643385477-SendThread(slave3:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 88158ms for sessionid 0x369e6926a042130
2019-07-11 02:58:34,320 WARN [main-SendThread(slave4:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 91643ms for sessionid 0x16bc2ec8e9e0177
2019-07-11 02:58:34,320 INFO [regionserver/slave3/192.168.22.159:16020-shortCompactions-1562643620766-SendThread(slave2:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 84959ms for sessionid 0x16bc2ec8e9e017b, closing socket connection and attempting reconnect
2019-07-11 02:58:34,320 INFO [main-SendThread(slave4:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 91643ms for sessionid 0x16bc2ec8e9e0177, closing socket connection and attempting reconnect
2019-07-11 02:58:34,321 WARN [regionserver/slave3/192.168.22.159:16020-SendThread(slave3:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 101972ms for sessionid 0x16bc2ec8e9e0178
2019-07-11 02:58:34,321 INFO [regionserver/slave3/192.168.22.159:16020-SendThread(slave3:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 101972ms for sessionid 0x16bc2ec8e9e0178, closing socket connection and attempting reconnect
2019-07-11 02:58:34,320 INFO [regionserver/slave3/192.168.202.19:16020-longCompactions-1562643385477-SendThread(slave3:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 88158ms for sessionid 0x369e6926a042130, closing socket connection and attempting reconnect
2019-07-11 02:58:34,321 WARN [RS_OPEN_REGION-slave3:16020-0-SendThread(slave3:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 88092ms for sessionid 0x369e6926a04212d
2019-07-11 02:58:34,321 INFO [RS_OPEN_REGION-slave3:16020-0-SendThread(slave3:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 88092ms for sessionid 0x369e6926a04212d, closing socket connection and attempting reconnect
2019-07-11 02:58:34,322 INFO [slave3,16020,1562643381590_ChoreService_1] hbase.ScheduledChore: Chore: slave3,16020,1562643381590-MemstoreFlusherChore missed its start time
2019-07-11 02:58:34,323 INFO [slave3,16020,1562643381590_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2019-07-11 02:58:34,324 WARN [ResponseProcessor for block BP-1486468690-127.0.0.1-1550458385182:blk_1076360281_2619665] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1486468690-127.0.0.1-1550458385182:blk_1076360281_2619665
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2294)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:847)
2019-07-11 02:58:34,423 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 78980ms
GC pool 'ParNew' had collection(s): count=1 time=3603ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=75581ms
2019-07-11 02:58:34,337 WARN [regionserver/slave3/192.168.202.19:16020] util.Sleeper: We slept 81967ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2019-07-11 02:58:34,329 WARN [DataStreamer for file /hbase-1.4.9-data/data/default/DATACENTER.QUEST_SCORE_MODEL/ee69c3e031c6acc061e50f231f828fdb/.tmp/757f314d0b02456081bf101595e81213 block BP-1486468690-127.0.0.1-1550458385182:blk_1076360262_2619646] hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.hdfs.DFSPacket.writeTo(DFSPacket.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:613)
2019-07-11 02:58:34,329 WARN [ResponseProcessor for block BP-1486468690-127.0.0.1-1550458385182:blk_1076359688_2619061] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1486468690-127.0.0.1-1550458385182:blk_1076359688_2619061
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2294)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:847)
2019-07-11 02:58:34,328 INFO [slave3,16020,1562643381590_ChoreService_3] hbase.ScheduledChore: Chore: slave3,16020,1562643381590-HeapMemoryTunerChore missed its start time
2019-07-11 02:58:34,423 WARN [ResponseProcessor for block BP-1486468690-127.0.0.1-1550458385182:blk_1076360276_2619660] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1486468690-127.0.0.1-1550458385182:blk_1076360276_2619660
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2294)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:847)
2019-07-11 02:58:34,431 WARN [sync.4] hdfs.DFSClient: Error while syncing
java.io.IOException: All datanodes DatanodeInfoWithStorage[192.168.202.19:50010,DS-796e0cf8-31c6-4fb3-8017-70fcf2f09135,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1224)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:990)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:507)
...
2019-07-11 02:58:34,440 INFO [sync.0] wal.FSHLog: Slow sync cost: 105 ms, current pipeline: [DatanodeInfoWithStorage[192.168.202.19:50010,DS-796e0cf8-31c6-4fb3-8017-70fcf2f09135,DISK]]
2019-07-11 02:58:34,440 INFO [sync.2] wal.FSHLog: Slow sync cost: 103 ms, current pipeline: [DatanodeInfoWithStorage[192.168.202.19:50010,DS-796e0cf8-31c6-4fb3-8017-70fcf2f09135,DISK]]
2019-07-11 02:58:34,598 INFO [main-SendThread(slave2:2181)] zookeeper.ClientCnxn: Opening socket connection to server slave2/192.168.202.19.155:2181. Will not attempt to authenticate using SASL (unknown error)
2019-07-11 02:58:34,599 INFO [main-SendThread(slave2:2181)] zookeeper.ClientCnxn: Socket connection established to slave2/192.168.202.19.155:2181, initiating session
2019-07-11 02:58:34,610 WARN [main-SendThread(slave2:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x16bc2ec8e9e0177 has expired
2019-07-11 02:58:34,611 INFO [main-SendThread(slave2:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x16bc2ec8e9e0177 has expired, closing socket connection
2019-07-11 02:58:34,611 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server slave3,16020,1562643381590: regionserver:16020-0x16bc2ec8e9e0177, quorum=master:2181,slave1:2181,slave2:2181,slave3:2181,slave4:2181, baseZNode=/hbase regionserver:16020-0x16bc2ec8e9e0177 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:692)
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:624)
    at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2019-07-11 02:58:34,611 FATAL [main-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2019-07-11 02:58:34,705 INFO [regionserver/slave3/192.168.202.19:16020-longCompactions-1562643385477-SendThread(master:2181)] zookeeper.ClientCnxn: Opening socket connection to server master/10.10.22.127:2181. Will not attempt to authenticate using SASL (unknown error)
2019-07-11 02:58:34,706 INFO [regionserver/slave3/192.168.202.19:16020-longCompactions-1562643385477-SendThread(master:2181)] zookeeper.ClientCnxn: Socket connection established to master/10.10.22.127:2181, initiating session
2019-07-11 02:58:34,780 INFO [regionserver/slave3/192.168.202.19:16020-longCompactions-1562643385477-SendThread(master:2181)] zookeeper.ClientCnxn: Session establishment complete on server master/10.10.22.127:2181, sessionid = 0x369e6926a042130, negotiated timeout = 90000
2019-07-11 02:58:34,847 WARN [regionserver/slave3/192.168.202.19:16020-shortCompactions-1562643620766] compactions.DefaultCompactor: Failed to close the writer after an unfinished compaction.
java.io.IOException: All datanodes DatanodeInfoWithStorage[192.168.202.19:50010,DS-796e0cf8-31c6-4fb3-8017-70fcf2f09135,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1224)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:990)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:507)
2019-07-11 02:58:34,985 FATAL [regionserver/slave3/192.168.202.19:16020.logRoller] regionserver.LogRoller: Aborting
java.io.IOException: cannot get log writer
    at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:395)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:742)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:707)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:173)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase-1.4.9-data/WALs/slave3,16020,1562643381590
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1776)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2547)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2479)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2363)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:624)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:398)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1839)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1658)
    at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:493)
    at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:489)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:504)
    at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1115)
    at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1091)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.init(ProtobufLogWriter.java:90)
    at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:384)
    ... 4 more
...
2. Exception analysis
Looking at the exceptions, the log first shows a large amount of compaction activity, then hdfs.DFSClient warnings,
immediately followed by:
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2294)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:847)
So the failure was evidently triggered from the HDFS side; the quoted log also shows a JVM pause of roughly 79 seconds, after which the ZooKeeper session expired and the RegionServer aborted.
So I went over to Hadoop, checked the DataNode logs, and searched for ERROR entries, using a search along the lines shown below.
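The exact DataNode log path is an assumption here, following the same naming pattern as the RegionServer log above; adjust it to your Hadoop log directory:

# list all ERROR lines in the DataNode log (path assumed)
grep -n "ERROR" /var/log/hadoop/hadoop-root-datanode-slave2.log
# rough count of the failed write operations
grep -c "DataXceiver error processing WRITE_BLOCK" /var/log/hadoop/hadoop-root-datanode-slave2.log

The search turned up entries like these: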
2019-07-11 02:51:18,969 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: The downstream error might be due to congestion in upstream including this node. Propagating the error:
2019-07-11 02:51:18,989 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1:42028 dst: /192.16.7.1.155:50010
2019-07-11 02:51:19,538 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.103:51408 dst: /192.16.7.1.155:50010
2019-07-11 02:54:39,243 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.159:60682 dst: /192.16.7.1.155:50010
2019-07-11 02:54:45,280 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.159:60788 dst: /192.16.7.1.155:50010
2019-07-11 02:54:45,289 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1:42302 dst: /192.16.7.1.155:50010
2019-07-11 02:54:45,295 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: The downstream error might be due to congestion in upstream including this node. Propagating the error:
2019-07-11 02:54:45,297 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1:42320 dst: /192.16.7.1.155:50010
2019-07-11 03:06:59,338 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.127:52564 dst: /192.16.7.1.155:50010
2019-07-11 03:06:59,390 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.159:33306 dst: /192.16.7.1.155:50010
2019-07-11 03:06:59,392 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.103:53360 dst: /192.16.7.1.155:50010
2019-07-11 04:39:09,382 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.155:44490 dst: /192.16.7.1.155:50010
2019-07-11 04:39:22,059 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: slave2:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.16.7.1.155:46914 dst: /192.16.7.1.155:50010
Literally, this means the DataXceiver threads hit errors while processing WRITE_BLOCK (write) operations.
Heavy HBase compaction activity led to HDFS data-write errors, probably because HDFS was under too much pressure, for example the DataNode running out of DataXceiver (transfer) threads, which is what the fix below addresses.
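If you want to confirm that a DataNode is actually exhausting its DataXceiver threads, one rough check (a sketch only, assuming jstack is available and a single DataNode process per host) is to count those threads in a thread dump and compare the number with dfs.datanode.max.transfer.threads:

# run on a DataNode host; the PID lookup assumes the standard DataNode main class name
DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode)
# DataXceiver worker threads show up by name in the thread dump
jstack "$DN_PID" | grep -c DataXceiver

If the count sits close to the configured maximum while compactions are running, the transfer-thread pool is the bottleneck.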
3. Solution
Increase the number of data-transfer threads (i.e. the number of files a DataNode can serve concurrently) and the handler thread counts.
Many posts online say to change dfs.datanode.max.xcievers, but that is the parameter name from older releases; the versions in use today all use dfs.datanode.max.transfer.threads instead.
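You can check which value is currently in effect on a node with hdfs getconf, which reads the configuration files visible to the client on that node:

hdfs getconf -confKey dfs.datanode.max.transfer.threads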
hdfs-site.xml
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
  <description>
    Default: 4096. Specifies the maximum number of threads to use for transferring data in and out of the DN.
  </description>
</property>
<property>
  <name>dfs.namenode.handler.count</name>
  <value>20</value>
  <description>
    Default: 10. The number of server threads for the namenode.
  </description>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>20</value>
  <description>
    Default: 10. The number of server threads for the datanode.
  </description>
</property>
At the same time I also increased the number of threads used to handle RPC calls (the two handler.count properties above).
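Whether the handlers are really a bottleneck can be checked, for example, through the RPC metrics the NameNode exposes on its /jmx endpoint. This is only a sketch; the host name, web port 50070 and RPC port 8020 are assumptions for a default Hadoop 2.x setup:

# a CallQueueLength that stays high suggests the handler threads cannot keep up
curl -s 'http://master:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020' | grep -E 'CallQueueLength|RpcQueueTimeAvgTime'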
After changing the configuration on every node, refresh it:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
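Note that dfsadmin -refreshSuperUserGroupsConfiguration only reloads the proxy-user/group mappings; thread-count properties such as dfs.datanode.max.transfer.threads and the handler counts normally take effect only after the daemons are restarted. On a plain Hadoop 2.x install (script location assumed to be $HADOOP_HOME/sbin), a rolling restart of each DataNode looks roughly like this, with the NameNode restarted the same way if dfs.namenode.handler.count was changed:

# run on each DataNode host, one host at a time
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode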