spark sql join表链接 left leftjoin leftouter leftsemi leftanti区别

spark | 2020-12-23 09:57:08

为了测试先初始化student和score这两个dataframe

var score = Seq(
  ("student1", "001", "quest1",10),
  ("student1", "001", "quest2",20),
  ("student1", "002", "quest2",20),
  ("student2", "001", "quest1",70),
  ("student2", "001", "quest2",60),
  ("student2", "002", "quest2",50)
).toDF("student_id", "subject_id", "quest_id","quest_score")	

var student=Seq(
  ("student1","张三"),
  ("student1","张三"),
  ("student2","李四")
).toDF("student_id", "student_name")	

 

1.spark sql支持join的类型

val supported = Seq(
        "inner",
        "outer", "full", "fullouter", "full_outer",
        "leftouter", "left", "left_outer",
        "rightouter", "right", "right_outer",
        "leftsemi", "left_semi",
        "leftanti", "left_anti",
        "cross")

 

2.left=leftouter 

scala> score.join(student,Seq("student_id"),"left").show
+----------+----------+--------+-----------+------------+                       
|student_id|subject_id|quest_id|quest_score|student_name|
+----------+----------+--------+-----------+------------+
|  student1|       001|  quest1|         10|        张三|
|  student1|       001|  quest1|         10|        张三|
|  student1|       001|  quest2|         20|        张三|
|  student1|       001|  quest2|         20|        张三|
|  student1|       002|  quest2|         20|        张三|
|  student1|       002|  quest2|         20|        张三|
|  student2|       001|  quest1|         70|        李四|
|  student2|       001|  quest2|         60|        李四|
|  student2|       002|  quest2|         50|        李四|
+----------+----------+--------+-----------+------------+

有些意外

left outer join是以左表为准,在右表中查找匹配的记录,如果查找失败,则返回一个所有字段都为null的记录。

但是右表有重复记录,会返回重复的数据,这也是包括mysql之内所有sql共同的特点,所以 一般我们会把右表distinct

 

3.leftsemi 

scala> score.join(student,Seq("student_id"),"leftsemi").show
+----------+----------+--------+-----------+                                    
|student_id|subject_id|quest_id|quest_score|
+----------+----------+--------+-----------+
|  student1|       001|  quest1|         10|
|  student1|       001|  quest2|         20|
|  student1|       002|  quest2|         20|
|  student2|       001|  quest1|         70|
|  student2|       001|  quest2|         60|
|  student2|       002|  quest2|         50|
+----------+----------+--------+-----------+

left semi join是以左表为准,在右表中查找匹配的记录,如果查找成功,则仅返回左边的记录,否则返回null

4.leftanti

scala> score.join(student,Seq("student_id"),"leftanti").show
+----------+----------+--------+-----------+                                    
|student_id|subject_id|quest_id|quest_score|
+----------+----------+--------+-----------+
+----------+----------+--------+-----------+

left anti join与left semi join相反,是以左表为准,在右表中查找匹配的记录,如果查找成功,则返回null,否则仅返回左边的记录

 

其他right join以此类推

 

 

登录后即可回复 登录 | 注册
    
关注编程学问公众号