A running Spark program fails intermittently
The exception is as follows:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(4) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#1465L])
...
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    ...
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(subject_score_model_SUBJECT_NAMEsort#1118 ASC NULLS FIRST, 200)
...
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    ...
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ... 40 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
    at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:280)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:103)
    at org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:208)
    at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:179)
    ...
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ... 62 more
Exception analysis
Further down the trace we can see:
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
and similar. In other words, the root cause is that a broadcast exchange (the build side of a BroadcastHashJoinExec, per the prepareBroadcast frame) failed to complete within 300 seconds, which is exactly the default value of spark.sql.broadcastTimeout.
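For context, here is a minimal sketch of the kind of query that produces this plan, with hypothetical table and column names standing in for the real job's inputs (the actual plan references subject_score_model and SUBJECT_NAME). Spark broadcasts the smaller side of the join, and the driver must finish collecting and broadcasting it within spark.sql.broadcastTimeout:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-timeout-demo").getOrCreate()

    // Hypothetical stand-ins for the job's real inputs.
    val scores   = spark.table("subject_score_model") // large fact table (assumption)
    val subjects = spark.table("subject_dim")         // small dimension table (assumption)

    // If `subjects` is estimated below spark.sql.autoBroadcastJoinThreshold
    // (10 MB by default), this outer join compiles to a BroadcastHashJoinExec,
    // and the driver must build the broadcast within spark.sql.broadcastTimeout
    // (300 s by default), or it throws the TimeoutException seen above.
    val joined = scores.join(subjects, Seq("SUBJECT_NAME"), "left_outer")

    // The count produces the Exchange SinglePartition / partial_count(1) stage
    // at the top of the stack trace.
    println(joined.count())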
Solutions
1. Increase the broadcastTimeout
spark.sql.broadcastTimeout 1000
spark.network.timeout 10000000
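As a minimal sketch, these settings can be applied when building the SparkSession (the app name is a placeholder, and the same properties can equally be passed to spark-submit via --conf). The values are the ones above, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("score-model-job") // hypothetical name
      // Allow broadcast construction up to 1000 s instead of the 300 s default.
      .config("spark.sql.broadcastTimeout", "1000")
      // Raise the general network timeout as well (a bare number is read as seconds).
      .config("spark.network.timeout", "10000000")
      .getOrCreate()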
This is the fix that immediately comes to mind from the exception, but it is certainly not a good one: how long should the timeout be set, and what is actually making the broadcast take longer?
That question is what leads to the next solution.
2. Allocate more CPU and memory
Any long computation time comes down to either a large data volume or complex logic. The data volume cannot be avoided, and there is only so much the logic can be optimized, so the only remaining option is to add more compute resources.
So I increased both memory and CPU at the same time:
spark.executor.cores 10
spark.executor.instances 10
spark.driver.memory 8g
spark.executor.memory 8g
(spark.num.executors is not a valid Spark property; the property behind spark-submit's --num-executors flag is spark.executor.instances, set above.)
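As a sketch, the same settings applied at session creation (again with a hypothetical app name). One caveat: in YARN cluster mode, which the ApplicationMaster frames in the trace point to, spark.driver.memory only takes effect when supplied at submit time, since the driver JVM is already running by the time this code executes:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("score-model-job")                // hypothetical name
      .config("spark.executor.cores", "10")      // cores per executor
      .config("spark.executor.instances", "10")  // number of executors
      .config("spark.executor.memory", "8g")     // heap per executor
      // spark.driver.memory must be supplied at submit time in cluster mode
      // (e.g. spark-submit --driver-memory 8g); listed here only for symmetry.
      .config("spark.driver.memory", "8g")
      .getOrCreate()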