Problem description:
The data volume in this Spark job is not large, but it performs a lot of computation: every newly computed column is joined back onto the original Dataset, later computations then depend on those new results, and their results are again joined back onto the original Dataset. This produces a huge number of joins, and the RDD lineage becomes extremely complex and extremely long.
The computation code in the Spark job looks roughly like this (Scala DataFrame API):
df=df.withColumn("IS_PASS_SUBJECT",when($"SUBJECT_SCORE"/$"FULL_SCORE" >=0.6,1).otherwise(0))
.... many more new columns are added in the same way (over a hundred of them)
df = df.join(
  df.where("SCHOOL_RANK <= 10")
    .groupBy("SCHOOL_ID", "SUBJECT_ID")
    .agg(round(avg("SUBJECT_SCORE"), 2).as("SCHOOL_TOP10_SUBJECT_SCORE_AVG")),
  Seq("SCHOOL_ID", "SUBJECT_ID"), "left"
)
.... many more new columns are computed like this and joined back onto the original df
df=df.withColumn("SINGLE_SCORE_PROJECT_RANK", rank().over(Window.partitionBy($"PROJECT_ID",$"SUBJECT_ID").orderBy($"SUBJECT_SCORE".desc)))
... many more new columns are derived from the columns computed above
df = df.join(
  df.where("CLASS_RANK <= 10")
    .groupBy("CLASS_ID", "SUBJECT_ID")
    .agg(round(avg("SUBJECT_SCORE"), 2).as("CLASS_TOP10_SUBJECT_SCORE_AVG")),
  Seq("CLASS_ID", "SUBJECT_ID"), "left"
)
.... many more results are computed from the earlier results and then joined back onto the original df (a condensed, self-contained sketch of the whole pattern follows below)
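To make this easier to reproduce, here is a minimal self-contained sketch of the same pattern. The toy data, the loop count and the DERIVED_/AGG_/RANK_ column names are made up for illustration; the real job builds hundreds of derived columns this way:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object LongLineageRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("long-lineage-repro").getOrCreate()
    import spark.implicits._

    // Toy score table with the same kind of columns as the real job
    var df = Seq(
      ("S1", "C1", "P1", "MATH", 95.0, 100.0),
      ("S1", "C2", "P1", "MATH", 52.0, 100.0),
      ("S2", "C3", "P1", "MATH", 88.0, 100.0)
    ).toDF("SCHOOL_ID", "CLASS_ID", "PROJECT_ID", "SUBJECT_ID", "SUBJECT_SCORE", "FULL_SCORE")

    // Repeat the withColumn + window + aggregate-and-self-join pattern many times,
    // the way the real job does with hundreds of derived columns
    for (i <- 1 to 50) {
      df = df.withColumn(s"DERIVED_$i", when($"SUBJECT_SCORE" / $"FULL_SCORE" >= 0.6, 1).otherwise(0))
      df = df.withColumn(s"RANK_$i",
        rank().over(Window.partitionBy($"PROJECT_ID", $"SUBJECT_ID").orderBy($"SUBJECT_SCORE".desc)))
      df = df.join(
        df.groupBy("SCHOOL_ID", "SUBJECT_ID").agg(round(avg("SUBJECT_SCORE"), 2).as(s"AGG_$i")),
        Seq("SCHOOL_ID", "SUBJECT_ID"), "left")
    }

    // This action is where the real job appears to hang: no stages get submitted,
    // presumably because the driver is still analysing/optimising the huge plan
    df.count()
    spark.stop()
  }
}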
When it runs like this, the Spark job gets stuck partway through: nothing executes, no error is reported, and no new jobs or stages are created; everything simply hangs.
I know that, until an action is triggered, Spark only builds up the chain of transformations and then divides that chain into stages based on the operators in it; my chain here is simply too long. How can I resolve this hang?
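In case it helps, a sketch of how the plan size could be inspected just before the action (explain and RDD.toDebugString are standard Spark APIs; note that producing them also requires planning the query, so in this situation they may themselves be slow):

// Print the parsed/analysed/optimised/physical plans; with hundreds of
// withColumn calls and self-joins this output is expected to be enormous
df.explain(true)

// The depth of the RDD lineage can be seen the same way
println(df.rdd.toDebugString)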