1. Reproducing the Exception
After a Hive table is dropped and rebuilt, re-reading it from the same long-running Spark session often fails with a FileNotFoundException. The full error is as follows:
Job aborted due to stage failure: Task 32 in stage 66540.0 failed 4 times, most recent failure: Lost task 32.3 in stage 66540.0 (TID 125087, slave2, executor 1): java.io.FileNotFoundException: File does not exist: hdfs://master:9000/user/hive/warehouse/datacenter.db/sample_table_subject_score_model/part-00002-38db9158-c61c-4389-8875-dde31a997265-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
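For context, here is a minimal sketch of how the situation typically arises. The scenario (a second job rebuilding the table mid-session) is an assumption for illustration; the table name is taken from the stack trace above:

import org.apache.spark.sql.SparkSession

object StaleCacheRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stale-cache-repro")
      .enableHiveSupport()
      .getOrCreate()

    // First read: Spark caches the table's metadata, including the
    // listing of its Parquet part files, in the session catalog.
    spark.table("datacenter.sample_table_subject_score_model").count()

    // ASSUMED step: meanwhile, Hive or another external job drops and
    // recreates the table, so the old part-*.snappy.parquet files no
    // longer exist on HDFS.

    // Second read in the SAME session: Spark still plans the scan
    // against the stale cached file listing, and the task fails at
    // execution time with java.io.FileNotFoundException.
    spark.table("datacenter.sample_table_subject_score_model").count()
  }
}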
2. Solution
According to the official Spark documentation:
Metadata Refreshing
Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.
spark.catalog.refreshTable("my_table")
In other words, Spark SQL caches the Hive table's Parquet metadata (including its file listing) for performance. When the table is modified outside the current Spark session, that cache becomes stale, so before reading the table again you must refresh it by calling spark.catalog.refreshTable("my_table") or by running REFRESH TABLE tableName in SQL.
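A minimal sketch of applying the fix inside a long-running session (the table name comes from the stack trace above; an existing SparkSession named spark is assumed):

// Invalidate Spark's cached metadata and cached data for the table
// before reading it again after it was rebuilt outside this session.
spark.catalog.refreshTable("datacenter.sample_table_subject_score_model")

// Equivalent SQL form:
spark.sql("REFRESH TABLE datacenter.sample_table_subject_score_model")

// Subsequent reads re-list the table's files and pick up the new ones.
spark.table("datacenter.sample_table_subject_score_model").count()

Either form works; the SQL variant is convenient when the rebuild happens in a scheduled SQL script, while the catalog API fits programmatic jobs.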