Spark ships with many convenient built-in functions. If the API lacks what you need you can define a UDF, but the built-ins are already rich: avg, sum, count, and other aggregates familiar from SQL. This post covers the more powerful ranking window functions: row_number, rank, dense_rank, and percent_rank. They are used the same way as count in SQL, and they can also be invoked through the Spark DataFrame API.
1. Function explanation
Example:
select gender, age,
       row_number() over(partition by gender order by age) as rowNumber,
       rank() over(partition by gender order by age) as ranks,
       dense_rank() over(partition by gender order by age) as denseRank,
       percent_rank() over(partition by gender order by age) as percentRank
from Affairs
rank() over(partition by gender order by age) as ranks — group the rows by gender, then rank by age within each group.
partition by splits the result set into groups; if omitted, the whole result set is treated as a single group.
order by determines which column the ranking is based on.
| Function | Meaning | Example | Description |
| --- | --- | --- | --- |
| row_number | sequence number | 1,2,3,4 | numbers rows in order; ties still get distinct, consecutive numbers |
| rank | standard rank | 1,2,2,4 | ties share a rank; the next rank skips ahead |
| dense_rank | dense rank | 1,2,2,3 | ties share a rank; the next rank does not skip |
| percent_rank | relative rank | 0.0,0.33,0.33,1.0 | (rank - 1) / (total rows - 1), computed from the standard rank |
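To see the four numbering rules in action without a Spark session, here is a minimal, hand-rolled Scala sketch; the sample ages and the `RankDemo` object name are invented for illustration, and the logic mirrors (but is not) the Spark SQL functions:

```scala
object RankDemo {
  // one ordered column of ages, duplicates included, already sorted
  val ages = Seq(22, 27, 27, 32, 32, 37)

  // row_number: position in the ordering; ties still get distinct numbers
  val rowNumbers: Seq[Int] = ages.indices.map(_ + 1)

  // rank: ties share the number of their first occurrence, so the
  // next distinct value skips ahead
  val ranks: Seq[Int] = ages.map(a => ages.indexOf(a) + 1)

  // dense_rank: ties share a number and no gaps appear afterwards
  val denseRanks: Seq[Int] = {
    val order = ages.distinct.zipWithIndex.toMap
    ages.map(a => order(a) + 1)
  }

  // percent_rank: (rank - 1) / (total rows - 1)
  val percentRanks: Seq[Double] =
    ranks.map(r => (r - 1).toDouble / (ages.size - 1))
}
```

Running this on the six sample ages yields ranks 1,2,2,4,4,6, dense ranks 1,2,2,3,3,4, and percent ranks 0.0,0.2,0.2,0.6,0.6,1.0 — the same tie-handling the table above describes.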
2. Usage
2.1 SQL
select gender, age,
       row_number() over(partition by gender order by age) as rowNumber,
       rank() over(partition by gender order by age) as ranks,
       dense_rank() over(partition by gender order by age) as denseRank,
       percent_rank() over(partition by gender order by age) as percentRank
from Affairs
Straightforward; nothing further to explain.
2.2 DataFrame API
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// no partitioning: the whole result set is one window
df.withColumn("rank", rank().over(Window.partitionBy().orderBy($"age".desc)))
// partition by gender: rank separately within each gender group
df.withColumn("rank", rank().over(Window.partitionBy($"gender").orderBy($"age".desc)))
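To make the partitionBy/orderBy semantics concrete without a cluster, here is a plain-Scala sketch of what the second Window spec does conceptually: split rows into per-gender groups, sort each group by age descending, and rank within the group. The `Row` case class and the sample data are invented for illustration; this mimics, rather than calls, the Spark API:

```scala
object WindowDemo {
  case class Row(gender: String, age: Int)

  val rows = Seq(
    Row("male", 32), Row("female", 27), Row("male", 37),
    Row("female", 27), Row("male", 22)
  )

  // per-partition rank (ties share a rank, the next rank skips), keyed by gender
  val ranked: Map[String, Seq[(Row, Int)]] =
    rows.groupBy(_.gender).map { case (g, part) =>
      val sorted = part.sortBy(-_.age)          // orderBy($"age".desc)
      val ages = sorted.map(_.age)
      g -> sorted.map(r => r -> (ages.indexOf(r.age) + 1))
    }
}
```

The two females tie at age 27 and both get rank 1, while the three males get ranks 1, 2, 3 — exactly what `rank().over(Window.partitionBy($"gender").orderBy($"age".desc))` would produce on the same rows.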