Spark can read a Phoenix HBase table into an RDD or DataFrame either through the generic Spark data-source API (the same way Spark reads MySQL and other databases) or through Phoenix's own helper methods.
First, add the Maven dependency:
<dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-spark</artifactId>
    <version>${phoenix.version}</version>
    <scope>provided</scope>
</dependency>
Method 1: the generic spark.read data-source API
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "subject_score")
  .option("zkUrl", "master,slave1,slave2,slave3,slave4")
  .load()
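Once loaded, the DataFrame behaves like any other Spark data source, so it can be registered as a temporary view and queried with Spark SQL. A minimal sketch, assuming the subject_score table has NAME and SCORE columns (hypothetical names, substitute the real schema):

```scala
// `df` is the DataFrame loaded above via spark.read.
// NAME and SCORE are assumed column names for illustration only.
df.printSchema()
df.createOrReplaceTempView("subject_score")

// Run an ordinary Spark SQL query against the Phoenix-backed view
spark.sql("SELECT NAME, SCORE FROM subject_score WHERE SCORE > 60").show()
```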
Method 2: sqlContext.load (this method is marked deprecated)
val df = sqlContext.load(
  "org.apache.phoenix.spark",
  Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)
Method 3: phoenixTableAsDataFrame (takes a list of column names; pass an empty list to load all columns)
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._   // brings phoenixTableAsDataFrame into scope

val configuration = new Configuration()
// Can set Phoenix-specific settings here; requires 'hbase.zookeeper.quorum'

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// Load the columns 'ID' and 'COL1' from TABLE1 as a DataFrame
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLE1", Array("ID", "COL1"), conf = configuration
)
Method 4: phoenixTableAsRDD (takes a list of column names; pass an empty list to load all columns)
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.phoenix.spark._   // brings phoenixTableAsRDD into scope

val sc = new SparkContext("local", "phoenix-test")

// Load the columns 'ID' and 'COL1' from TABLE1 as an RDD
val rdd: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  "TABLE1", Seq("ID", "COL1"), zkUrl = Some("phoenix-server:2181")
)
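Each element of the resulting RDD is a Map from column name to value, so rows can be processed with ordinary RDD transformations. A minimal sketch, assuming the same 'ID' and 'COL1' columns as above:

```scala
// `rdd` is the RDD[Map[String, AnyRef]] loaded above.
// Pull out one column by its name key and run a standard RDD action.
val col1Values = rdd.map(row => row("COL1").toString)

// Count rows; any RDD transformation/action works the same way
val rowCount = col1Values.count()
```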