Spark really is an excellent framework, but going from knowing nothing to gradually figuring it out is a process of falling into one pit after another. Here is a summary of the problems I ran into most often, so I do not fall into them again.
1. Exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT OUTER join between logical plans LocalLimit 21
When this error occurs, add:
spark.conf.set("spark.sql.crossJoin.enabled", "true") (this still needs testing; it is a solution found online)
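A minimal sketch of the setting above, assuming Spark 2.x and a Scala application (the app name is illustrative); the option can be set either when the SparkSession is built or at runtime on an existing session:

import org.apache.spark.sql.SparkSession

// Enable cartesian products at session build time (app name is illustrative).
val spark = SparkSession.builder()
  .appName("CrossJoinExample")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()

// Equivalent runtime setting on an already-created session:
spark.conf.set("spark.sql.crossJoin.enabled", "true")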
2. Exception: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
When this error occurs, add:
config("spark.debug.maxToStringFields", 1000)
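For reference, a minimal sketch (the app name is illustrative) of where this setting goes when the SparkSession is built:

import org.apache.spark.sql.SparkSession

// Raise the limit on how many fields a plan's string representation may include.
val spark = SparkSession.builder()
  .appName("MaxToStringFieldsExample")
  .config("spark.debug.maxToStringFields", "1000")
  .getOrCreate()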
3. Exception: Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Solution:
The root cause was that the textFile call read no data, which led to the error above. Check the file path and whether HDFS has been started.
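A small sketch of the kind of sanity check implied above, assuming a Scala job; the input path is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("InputCheck").getOrCreate()

// Verify that textFile actually produced data before running downstream transformations.
val lines = spark.sparkContext.textFile("hdfs:///user/data/input.txt") // illustrative path
if (lines.isEmpty()) {
  // An empty result usually means a wrong path or that HDFS is not running.
  println("No data read from the input path; check the path and whether HDFS is up.")
}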
4. Error: ERROR: org.apache.spark.sql.AnalysisException: resolved attribute(s) id#426
Solution: rebuild the DataFrame so its columns get fresh attribute IDs:
bf_dec_sharp = bf_dec_sharp.toDF(bf_dec_sharp.columns: _*)
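For context, a hedged sketch of that workaround (the source table is illustrative, and a SparkSession named spark is assumed): re-creating the DataFrame with toDF assigns new attribute IDs to its columns, which is a common way around the "resolved attribute(s) ... missing" conflict that appears when a DataFrame is joined with something derived from itself:

var bf_dec_sharp = spark.table("gpsdb.some_table") // illustrative source
// toDF over the existing column names rebuilds the DataFrame with fresh attribute IDs.
bf_dec_sharp = bf_dec_sharp.toDF(bf_dec_sharp.columns: _*)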
5. Fixing the "A master URL must be set in your configuration" error
This error appeared while running a Spark program.
The message shows that the master the program should run against cannot be found, so it has to be configured.
The master URL passed to Spark can take the following forms:
local - run locally with a single thread
local[K] - run locally with multiple threads (using K cores)
local[*] - run locally with multiple threads (using all available cores)
spark://HOST:PORT - connect to the specified Spark standalone cluster master; the port must be specified
mesos://HOST:PORT - connect to the specified Mesos cluster; the port must be specified
yarn-client - client mode, connects to a YARN cluster; HADOOP_CONF_DIR must be configured
yarn-cluster - cluster mode, connects to a YARN cluster; HADOOP_CONF_DIR must be configured
In IDEA, click Run => Edit Configurations, select the project on the left, and enter "-Dspark.master=local" in the VM options field on the right, which tells the program to run locally with a single thread; then run it again.
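Alternatively, the master can be set directly in code when the SparkSession is built; a minimal sketch (the app name is illustrative), best kept out of code that is later submitted with spark-submit, since spark-submit supplies the master itself:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LocalRunExample")
  .master("local[*]") // all available cores; use "local" for a single thread
  .getOrCreate()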
6. Exception: org.apache.spark.sql.AnalysisException: Table or view not found: gpsdb.t_ana_ds_dq_conf; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation gpsdb.t_ana_ds_dq_conf
Solution:
Check whether hive-site.xml is configured in your project (this is the key point; it was exactly my mistake).
Where do you find the file?
On a cluster server, run: find -name hive-site.xml
Once found, copy it into the project's resources directory. When the project is packaged, the file sits in the root of the jar, and the hive-site.xml in the jar root is loaded automatically.
Why it is needed: for Spark to query data stored in Hive it needs this configuration file, which contains Hive's connection information.
(This fix still needs verification.)
Problem solved: the hive-site.xml configuration was missing.
(Note: at first the error persisted even after adding the configuration file; a colleague then helped me find that enableHiveSupport() was also missing.)
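Putting the two pieces together, a minimal sketch (the app name is illustrative) of a session that can resolve the Hive table, assuming hive-site.xml is on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveTableExample")
  .enableHiveSupport() // the call that was missing in the note above
  .getOrCreate()

// With Hive support enabled, the table can be resolved.
spark.sql("SELECT * FROM gpsdb.t_ana_ds_dq_conf").show()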
7. Error: Warning: Local jar /var/lib/hadoop-hdfs/2018-06-01 does not exist, skipping.
Warning: Local jar /var/lib/hadoop-hdfs/2018-06-01 does not exist, skipping.
java.lang.ClassNotFoundException: DATA_QUALITY.DQ_MAIN/home/DS_DQ_ANA.jar
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
The command that was submitted:
spark2-submit --master yarn --deploy-mode client --num-executors 20 --executor-memory 6G --executor-cores 3 --driver-memory 2G --conf spark.default.parallelism=500 --class DATA_QUALITY.DQ_MAIN/home/DS_DQ_ANA.jar "2018-03-01" "2018-03-31"
The cause was a missing space between the main class and the jar path (speechless...). The corrected command:
spark2-submit --master yarn --deploy-mode client --num-executors 20 --executor-memory 6G --executor-cores 3 --driver-memory 2G --conf spark.default.parallelism=500 --class DATA_QUALITY.DQ_MAIN /home/DS_DQ_ANA.jar "2018-06-01" "2018-06-30"
8. Error: ERROR scheduler.TaskSetManager: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
ERROR scheduler.TaskSetManager: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
18/08/30 11:21:35 WARN scheduler.TaskSetManager: Lost task 11870.0 in stage 152.0 (TID 24031, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:210)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:310)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
    at cal_corp_monthly_report(<console>:70)
    ... 52 elided
WARN scheduler.TaskSetManager: Lost task 11870.0 in stage 228.0 (TID 36119, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Solution: spark.driver.maxResultSize defaults to 1g. It limits the total size of the serialized results of all partitions for each Spark action (such as collect); in short, the results the executors return to the driver are too large. This error means you either need to raise the value or avoid methods of this kind, such as countByValue, countByKey, etc.
Simply increase the value:
spark.driver.maxResultSize 2g
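A minimal sketch of raising the limit (the app name is illustrative); the same value can also be passed to spark-submit as --conf spark.driver.maxResultSize=2g:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MaxResultSizeExample")
  .config("spark.driver.maxResultSize", "2g") // default is 1g
  .getOrCreate()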
9. Problem: ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(13,WrappedArray())
This error appeared when an unknown variable was encountered while printing logs. Check the variable references and their types in the code.
10. "cannot resolve symbol XXX" appears
Solution:
File -> Invalidate Caches / Restart
Reimport Maven: re-import the Maven project into IDEA.