为了测试先初始化student和score这两个dataframe
var score = Seq(
("student1", "001", "quest1",10),
("student1", "001", "quest2",20),
("student1", "002", "quest2",20),
("student2", "001", "quest1",70),
("student2", "001", "quest2",60),
("student2", "002", "quest2",50)
).toDF("student_id", "subject_id", "quest_id","quest_score")
var student=Seq(
("student1","张三"),
("student1","张三"),
("student2","李四")
).toDF("student_id", "student_name")
1.spark sql支持join的类型
val supported = Seq(
"inner",
"outer", "full", "fullouter", "full_outer",
"leftouter", "left", "left_outer",
"rightouter", "right", "right_outer",
"leftsemi", "left_semi",
"leftanti", "left_anti",
"cross")
2.left=leftouter
scala> score.join(student,Seq("student_id"),"left").show
+----------+----------+--------+-----------+------------+
|student_id|subject_id|quest_id|quest_score|student_name|
+----------+----------+--------+-----------+------------+
| student1| 001| quest1| 10| 张三|
| student1| 001| quest1| 10| 张三|
| student1| 001| quest2| 20| 张三|
| student1| 001| quest2| 20| 张三|
| student1| 002| quest2| 20| 张三|
| student1| 002| quest2| 20| 张三|
| student2| 001| quest1| 70| 李四|
| student2| 001| quest2| 60| 李四|
| student2| 002| quest2| 50| 李四|
+----------+----------+--------+-----------+------------+
有些意外
left outer join是以左表为准,在右表中查找匹配的记录,如果查找失败,则返回一个所有字段都为null的记录。
但是右表有重复记录,会返回重复的数据,这也是包括mysql之内所有sql共同的特点,所以 一般我们会把右表distinct
3.leftsemi
scala> score.join(student,Seq("student_id"),"leftsemi").show
+----------+----------+--------+-----------+
|student_id|subject_id|quest_id|quest_score|
+----------+----------+--------+-----------+
| student1| 001| quest1| 10|
| student1| 001| quest2| 20|
| student1| 002| quest2| 20|
| student2| 001| quest1| 70|
| student2| 001| quest2| 60|
| student2| 002| quest2| 50|
+----------+----------+--------+-----------+
left semi join是以左表为准,在右表中查找匹配的记录,如果查找成功,则仅返回左边的记录,否则返回null
4.leftanti
scala> score.join(student,Seq("student_id"),"leftanti").show
+----------+----------+--------+-----------+
|student_id|subject_id|quest_id|quest_score|
+----------+----------+--------+-----------+
+----------+----------+--------+-----------+
left anti join与left semi join相反,是以左表为准,在右表中查找匹配的记录,如果查找成功,则返回null,否则仅返回左边的记录
其他right join以此类推