spark 异常 Missing an output location for shuffle

spark | 2019-09-13 10:02:39

执行spark出现了MetadataFetchFailedException，这个spark任务有分组排序

1.异常信息：

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:867)
at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:863)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
...

2.解决方案：

我的解决方法是增加executor内存，我把spark.executor.memory由8G调到16G就不出现这个异常了

3.原因分析和解决思路：

3.1原因分析

shuffle分为shuffle write和shuffle read两部分。

shuffle write的分区数由上一阶段的RDD分区数控制，shuffle read的分区数则是由Spark提供的一些参数控制。

shuffle write可以简单理解为类似于saveAsLocalDiskFile的操作，将计算的中间结果按某种规则临时放到各个executor所在的本地磁盘上。

shuffle read的时候数据的分区数则是由spark提供的一些参数控制。可以想到的是，如果这个参数值设置的很小，同时shuffle read的量很大，那么将会导致一个task需要处理的数据非常大。结果导致JVM crash，从而导致取shuffle数据失败，同时executor也丢失了，看到Failed to connect to host的错误，也就是executor lost的意思。有时候即使不会导致JVM crash也会造成长时间的gc。

3.2解决思路

知道原因后问题就好解决了，主要从shuffle的数据量和处理shuffle数据的分区数两个角度入手。

3.2.1减少shuffle数据

思考是否可以使用map side join或是broadcast join来规避shuffle的产生。

将不必要的数据在shuffle前进行过滤，比如原始数据有20个字段，只要选取需要的字段进行处理即可，将会减少一定的shuffle数据。

3.2.2SparkSQL和DataFrame的join,group by等操作

通过spark.sql.shuffle.partitions控制分区数，默认为200，根据shuffle的量以及计算的复杂度提高这个值。

3.2.3 Rdd的join,groupBy,reduceByKey等操作

通过spark.default.parallelism控制shuffle read与reduce处理的分区数，默认为运行任务的core的总数（mesos细粒度模式为8个，local模式为本地的core总数），官方建议为设置成运行任务的core的2-3倍。

3.2.4提高executor的内存

通过spark.executor.memory适当提高executor的memory值。

3.2.5是否存在数据倾斜的问题

空值是否已经过滤？异常数据（某个key数据特别大）是否可以单独处理？考虑改变数据分区规则

登录后即可回复登录 | 注册

spark on hive 异常 `hivefileformat` doesn t match `parquetfileformat`spark操作hive orc transactional事务表异常解决spark hive插入数据异常spark currently does not populate bucketed output jdbc连接hive spark thriftserver异常unable to move source java jdbc通过spark连接hive 异常required field client protocol is unset spark hive 异常version information not found in metastore hive on spark异常failed to create spark client for spark session解决过程 spark异常 could not locate executable null bin winutils.exe in the hadoop binaries spark hive 元数据异常 filenotfoundexception spark 异常 missing an output location for shuffle spark on yarn 异常 spark shuffle does not exist spark 开发常见异常处理解决spark shell 写入hbase 异常job in state define instead of running spark rdd写入数据到hbase nullpointerexception异常 spark 异常 timeoutexception futures timed out after 1000 seconds spark 程序执行慢卡住之shuffle优化 spark shuffle原理及参数调优 spark 数据倾斜分析及 shuffle性能优化方案 spark AnalysisException: Resolved attribute(s) field missing from field spark hive Can not create the managed table('`xxx`'). The associated location('xxx') already exists