Hive高级函数实战

Hive是一个主要做统计的工具，内置了大量函数以支持各种统计需求。通过show functions;可以查看Hive中的内置函数；查看指定函数的描述信息我们可以使用desc function functionName;

# 分组排序取TopN

一个典型的应用场景是分组排序取TopN操作，主要使用row_number()和over()函数。

row_number()会对数据进行编号，编号从1开始；
over()可以理解为将数据划分到一个窗口内，可以指定按字段对数据进行分组，并对每个分组内的数据按照某个字段进行排序。
需求是根据学生的考试分数信息（语文、数学、英语），计算班级中每门科目排名前三名学生的姓名。

[root@bigdata04 hivedata]# more student_score.data 
1	zs1	chinese	80
2	zs1	math	90
3	zs1	english	89
4	zs2	chinese	60
5	zs2	math	75
6	zs2	english	80
7	zs3	chinese	79
8	zs3	math	83
9	zs3	english	72
10	zs4	chinese	90
11	zs4	math	76
12	zs4	english	80
13	zs5	chinese	98
14	zs5	math	80
15	zs5	english	70

建表语句：

create external table student_score (
    id int,
    name string,
    sub string,
    score int
) row format delimited
fields terminated by '\t'
location '/data/student_score';

加载数据：

hdfs dfs -put /data/soft/hivedata/student_score /data/student_score

创建查询SQL语句：

select *, row_number() over(partition by sub order by score desc) from student_score;
# 执行结果如下，先按照sub字段分组，再根据score降序
13      zs5     chinese 98      1
10      zs4     chinese 90      2
1       zs1     chinese 80      3
7       zs3     chinese 79      4
4       zs2     chinese 60      5
3       zs1     english 89      1
6       zs2     english 80      2
12      zs4     english 80      3
9       zs3     english 72      4
15      zs5     english 70      5
2       zs1     math    90      1
8       zs3     math    83      2
14      zs5     math    80      3
11      zs4     math    76      4
5       zs2     math    75      5

接下来找出前3的数据

select * from (select *, row_number() over(partition by sub order by score desc) as num from student_score) AS s where s.num<=3;
# 执行结果如下
s.id    s.name  s.sub   s.score s.num
13      zs5     chinese	98      1
10      zs4     chinese 90      2
1       zs1     chinese 80      3
3       zs1     english 89      1
6       zs2     english 80      2
12      zs4     english 80      3
2       zs1     math    90      1
8       zs3     math    83      2
14      zs5     math    80      3

SQL中的row_number()可以替换为rank()或者dense_rank()。

rank()：上下两个score相同，记录的行号是一样的
- 3 zs1 english 89 1 12 zs4 english 80 2 6 zs2 english 80 2
dense_rank()：上下两个score相同，记录的行号是一样的，但是会区前几
- 3 zs1 english 89 1 12 zs4 english 80 2 6 zs2 english 80 2 9 zs3 english 72 3

# 行专列

多行数据转成一列数据，用到的函数：concat_ws()、collect_list()、collect_set()

concat_ws()：将多个字符串连接为一个字符串，并使用指定的分隔符进行分隔。
collect_list()：将指定列的值收集到一个数组中。
collect_set()：将指定列的值收集到一个集合中，去除重复值。

[root@flume hivedata]# cat /data/soft/hivedata/student_favors.data 
zs	swing
zs	footbal
zs	sing
zs	codeing
zs	swing
# 希望的效果：zs	swing,footbal,sing,codeing

# 建表及准备操作

create external table student_favors (
    name string,
    favor string
) row format delimited
fields terminated by '\t'
location '/data/student_favors';

上传数据：

hdfs dfs -put /data/soft/hivedata/student_favors.data /data/student_favors

# 编写查询SQL

# 1.将favor转为数组
select name, collect_list(favor) from student_favors group by name;
# 2.将数组转为字符串，使用逗号分割
select name, concat_ws(',', collect_list(favor)) from student_favors group by name;
# 3.如果需要去重，使用collect_set
select name, concat_ws(',', collect_set(favor)) from student_favors group by name;

# 列转行

把一列数据转成多行，主要使用到split()、explode()和lateral view

split()：将字符串按指定的分隔符拆分为数组。
explode()：将数组或Map类型的列拆分成多行，每行包含一个元素。
lateral view()：将一个表的列展开成多行，用于处理复杂的数据结构，如数组或Map。

[root@flume hivedata]# cat student_favors_2.data 
zs	swing,footbal,sing
ls	codeing,swing

希望的结果是这样的：

zs swing
zs footbal
zs sing
ls codeing
ls swing

# 建表及准备操作

create external table student_favors_2 (
    name string, 
    favorlist string
) row format delimited
fields terminated by '\t'
location '/data/student_favors_2';

上传数据：

hdfs dfs -put /data/soft/hivedata/student_favors_2.data /data/student_favors_2

# 编写查询SQL

# 先使用split将字符串转为数组
select name, split(favorlist,',') from student_favors_2;
# 使用explode拆分成多行
select explode(split(favorlist,',')) from student_favors_2;
# 注意，直接使用name会报错
select name, favor_new from student_favors_2 lateral view explode(split(favorlist, ',')) table1 as favor_new;

# Hive排序相关函数

# order by

Hive中的order by跟传统的sql语言中的order by作用是一样的，会对查询的结果做一次全局排序，使用这个语句的时候生成的reduce任务只有一个。

# sort by

Hive中指定了sort by，如果有多个reduce，那么在每个reducer端都会做排序，也就是说保证了局部有序（每个reducer出来的数据是有序的，但是不能保证所有的数据是全局有序的，除非只有一个reducer）。

# distribute by

只会根据指定的key对数据进行分区，但是不会排序。一般情况下可以和sort by结合使用，先对数据分区，再进行排序两者结合使用的时候distribute by必须要写在sort by。

 select id from t2_bak distribute by id sort by id; # 先根据id分区，再根据id排序

# cluster by

cluster by的功能就是distribute by和sort by的简写形式：cluster by id 等于 distribute by id sort by id

注意被cluster by指定的列只能是升序，不能指定asc和desc

# Hive 的分组和去重函数

group by：对数据按照指定字段进行分组
distinct：对数据中指定字段的重复值进行去重

需求：统计order表中name去重后的数据量

第一种：select count(distinct name) from order
第二种：select count(tmp.name) from (select name from order group by name) tmp

第一种：使用distinct会将所有的name都shuffle到一个reducer里面，性能较低
第二种：先对name分组，因为分组的同时其实就是去重，此时是可以并行计算的，然后再计算count

Flume+Hive综合案例 Scala下载及安装

评论区

第三方登录

暂无评论~~

星空小屋

Choose mode