精华内容
下载资源
问答
  • hive窗口函数

    2019-01-09 08:33:38
    hive窗口系列函数
  • Hive窗口函数

    万次阅读 多人点赞 2017-02-08 13:34:15
    Hive窗口函数

    简介

    本文主要介绍hive中的窗口函数.hive中的窗口函数和sql中的窗口函数相类似,都是用来做一些数据分析类的工作,一般用于olap分析

    概念

    我们都知道在sql中有一类函数叫做聚合函数,例如sum()、avg()、max()等等,这类函数可以将多行数据按照规则聚集为一行,一般来讲聚集后的行数是要少于聚集前的行数的.但是有时我们想要既显示聚集前的数据,又要显示聚集后的数据,这时我们便引入了窗口函数.

    在深入研究Over字句之前,一定要注意:在SQL处理中,窗口函数都是最后一步执行,而且仅位于Order by字句之前.

    数据准备

    我们准备一张order表,字段分别为name,orderdate,cost.数据内容如下:

    jack,2015-01-01,10
    tony,2015-01-02,15
    jack,2015-02-03,23
    tony,2015-01-04,29
    jack,2015-01-05,46
    jack,2015-04-06,42
    tony,2015-01-07,50
    jack,2015-01-08,55
    mart,2015-04-08,62
    mart,2015-04-09,68
    neil,2015-05-10,12
    mart,2015-04-11,75
    neil,2015-06-12,80
    mart,2015-04-13,94

    在hive中建立一张表t_window,将数据插入进去.

    实例

    聚合函数+over

    假如说我们想要查询在2015年4月份购买过的顾客及总人数,我们便可以使用窗口函数去去实现

    select name,count(*) over ()
    from t_window
    where substring(orderdate,1,7) = '2015-04'

    得到的结果如下:

    name    count_window_0
    mart    5
    mart    5
    mart    5
    mart    5
    jack    5

    可见其实在2015年4月一共有5次购买记录,mart购买了4次,jack购买了1次.事实上,大多数情况下,我们是只看去重后的结果的.针对于这种情况,我们有两种实现方式

    第一种:distinct

    select distinct name,count(*) over ()
    from t_window
    where substring(orderdate,1,7) = '2015-04'

    第二种:group by

    select name,count(*) over ()
    from t_window
    where substring(orderdate,1,7) = '2015-04'
    group by name

    执行后的结果如下:
    name count_window_0
    mart 2
    jack 2

    partition by子句

    Over子句之后第一个提到的就是Partition By.Partition By子句也可以称为查询分区子句,非常类似于Group By,都是将数据按照边界值分组,而Over之前的函数在每一个分组之内进行,如果超出了分组,则函数会重新计算.

    实例

    我们想要去看顾客的购买明细及月购买总额,可以执行如下的sql

    select name,orderdate,cost,sum(cost) over(partition by month(orderdate))
    from t_window

    执行结果如下:

    name    orderdate   cost    sum_window_0
    jack    2015-01-01  10  205
    jack    2015-01-08  55  205
    tony    2015-01-07  50  205
    jack    2015-01-05  46  205
    tony    2015-01-04  29  205
    tony    2015-01-02  15  205
    jack    2015-02-03  23  23
    mart    2015-04-13  94  341
    jack    2015-04-06  42  341
    mart    2015-04-11  75  341
    mart    2015-04-09  68  341
    mart    2015-04-08  62  341
    neil    2015-05-10  12  12
    neil    2015-06-12  80  80

    可以看出数据已经按照月进行汇总了.

    order by子句

    上述的场景,假如我们想要将cost按照月进行累加.这时我们引入order by子句.

    order by子句会让输入的数据强制排序(文章前面提到过,窗口函数是SQL语句最后执行的函数,因此可以把SQL结果集想象成输入数据)。Order By子句对于诸如Row_Number(),Lead(),LAG()等函数是必须的,因为如果数据无序,这些函数的结果就没有任何意义。因此如果有了Order By子句,则Count(),Min()等计算出来的结果就没有任何意义。

    我们在上面的代码中加入order by

    select name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate )
    from t_window

    得到的结果如下:(order by默认情况下聚合从起始行当当前行的数据)

    name    orderdate   cost    sum_window_0
    jack    2015-01-01  10  10
    tony    2015-01-02  15  25
    tony    2015-01-04  29  54
    jack    2015-01-05  46  100
    tony    2015-01-07  50  150
    jack    2015-01-08  55  205
    jack    2015-02-03  23  23
    jack    2015-04-06  42  42
    mart    2015-04-08  62  104
    mart    2015-04-09  68  172
    mart    2015-04-11  75  247
    mart    2015-04-13  94  341
    neil    2015-05-10  12  12
    neil    2015-06-12  80  80

    window子句

    我们在上面已经通过使用partition by子句将数据进行了分组的处理.如果我们想要更细粒度的划分,我们就要引入window子句了.

    我们首先要理解两个概念:
    - 如果只使用partition by子句,未指定order by的话,我们的聚合是分组内的聚合.
    - 使用了order by子句,未使用window子句的情况下,默认从起点到当前行.

    当同一个select查询中存在多个窗口函数时,他们相互之间是没有影响的.每个窗口函数应用自己的规则.

    window子句:
    - PRECEDING:往前
    - FOLLOWING:往后
    - CURRENT ROW:当前行
    - UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

    我们按照name进行分区,按照购物时间进行排序,做cost的累加.
    如下我们结合使用window子句进行查询

    select name,orderdate,cost,
    sum(cost) over() as sample1,--所有行相加
    sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
    sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
    sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )  as sample4 ,--和sample3一样,由起点到当前行的聚合
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   and current row) as sample5, --当前行和前面一行做聚合
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   AND 1 FOLLOWING  ) as sample6,--当前行和前边一行及后面一行
    sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
    from t_window;

    得到查询结果如下:

    name    orderdate   cost    sample1 sample2 sample3 sample4 sample5 sample6 sample7
    jack    2015-01-01  10  661 176 10  10  10  56  176
    jack    2015-01-05  46  661 176 56  56  56  111 166
    jack    2015-01-08  55  661 176 111 111 101 124 120
    jack    2015-02-03  23  661 176 134 134 78  120 65
    jack    2015-04-06  42  661 176 176 176 65  65  42
    mart    2015-04-08  62  661 299 62  62  62  130 299
    mart    2015-04-09  68  661 299 130 130 130 205 237
    mart    2015-04-11  75  661 299 205 205 143 237 169
    mart    2015-04-13  94  661 299 299 299 169 169 94
    neil    2015-05-10  12  661 92  12  12  12  92  92
    neil    2015-06-12  80  661 92  92  92  92  92  80
    tony    2015-01-02  15  661 94  15  15  15  44  94
    tony    2015-01-04  29  661 94  44  44  44  94  79
    tony    2015-01-07  50  661 94  94  94  79  79  50

    窗口函数中的序列函数

    主要序列函数是不支持window子句的.

    hive中常用的序列函数有下面几个:

    NTILE

    • NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值

    • NTILE不支持ROWS BETWEEN,
      比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)

    • 如果切片不均匀,默认增加第一个切片的分布

    这个函数用什么应用场景呢?假如我们想要每位顾客购买金额前1/3的交易记录,我们便可以使用这个函数.

    select name,orderdate,cost,
           ntile(3) over() as sample1 , --全局数据切片
           ntile(3) over(partition by name), -- 按照name进行分组,在分组内将数据切成3份
           ntile(3) over(order by cost),--全局按照cost升序排列,数据切成3份
           ntile(3) over(partition by name order by cost ) --按照name分组,在分组内按照cost升序排列,数据切成3份
    from t_window

    得到的数据如下:

    name    orderdate   cost    sample1 sample2 sample3 sample4
    jack    2015-01-01  10  3   1   1   1
    jack    2015-02-03  23  3   1   1   1
    jack    2015-04-06  42  2   2   2   2
    jack    2015-01-05  46  2   2   2   2
    jack    2015-01-08  55  2   3   2   3
    mart    2015-04-08  62  2   1   2   1
    mart    2015-04-09  68  1   2   3   1
    mart    2015-04-11  75  1   3   3   2
    mart    2015-04-13  94  1   1   3   3
    neil    2015-05-10  12  1   2   1   1
    neil    2015-06-12  80  1   1   3   2
    tony    2015-01-02  15  3   2   1   1
    tony    2015-01-04  29  3   3   1   2
    tony    2015-01-07  50  2   1   2   3

    如上述数据,我们去sample4 = 1的那部分数据就是我们要的结果

    row_number、rank、dense_rank

    这三个窗口函数的使用场景非常多
    - row_number()从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列
    - RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
    - DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位

    **注意:
    rank和dense_rank的区别在于排名相等时会不会留下空位.**

    举例如下:

    SELECT 
    cookieid,
    createtime,
    pv,
    RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
    DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
    FROM lxw1234 
    WHERE cookieid = 'cookie1';
    
    cookieid day           pv       rn1     rn2     rn3 
    
    cookie1 2015-04-12      7       1       1       1
    cookie1 2015-04-11      5       2       2       2
    cookie1 2015-04-15      4       3       3       3
    cookie1 2015-04-16      4       3       3       4
    cookie1 2015-04-13      3       5       4       5
    cookie1 2015-04-14      2       6       5       6
    cookie1 2015-04-10      1       7       6       7
    rn1: 15号和16号并列第3, 13号排第5
    rn2: 15号和16号并列第3, 13号排第4
    rn3: 如果相等,则按记录值排序,生成唯一的次序,如果所有记录值都相等,或许会随机排吧。

    LAG和LEAD函数

    这两个函数为常用的窗口函数,可以返回上下数据行的数据.
    以我们的订单表为例,假如我们想要查看顾客上次的购买时间可以这样去查询

    select name,orderdate,cost,
    lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,
    lag(orderdate,2) over (partition by name order by orderdate) as time2
    from t_window;

    查询后的数据为:

    name    orderdate   cost    time1   time2
    jack    2015-01-01  10  1900-01-01  NULL
    jack    2015-01-05  46  2015-01-01  NULL
    jack    2015-01-08  55  2015-01-05  2015-01-01
    jack    2015-02-03  23  2015-01-08  2015-01-05
    jack    2015-04-06  42  2015-02-03  2015-01-08
    mart    2015-04-08  62  1900-01-01  NULL
    mart    2015-04-09  68  2015-04-08  NULL
    mart    2015-04-11  75  2015-04-09  2015-04-08
    mart    2015-04-13  94  2015-04-11  2015-04-09
    neil    2015-05-10  12  1900-01-01  NULL
    neil    2015-06-12  80  2015-05-10  NULL
    tony    2015-01-02  15  1900-01-01  NULL
    tony    2015-01-04  29  2015-01-02  NULL
    tony    2015-01-07  50  2015-01-04  2015-01-02

    time1取的为按照name进行分组,分组内升序排列,取上一行数据的值.

    time2取的为按照name进行分组,分组内升序排列,取上面2行的数据的值,注意当lag函数为设置行数值时,默认为1行.未设定取不到时的默认值时,取null值.

    lead函数与lag函数方向相反,取向下的数据.

    first_value和last_value

    first_value取分组内排序后,截止到当前行,第一个值
    last_value取分组内排序后,截止到当前行,最后一个值

    select name,orderdate,cost,
    first_value(orderdate) over(partition by name order by orderdate) as time1,
    last_value(orderdate) over(partition by name order by orderdate) as time2
    from t_window

    查询结果如下:

    name    orderdate   cost    time1   time2
    jack    2015-01-01  10  2015-01-01  2015-01-01
    jack    2015-01-05  46  2015-01-01  2015-01-05
    jack    2015-01-08  55  2015-01-01  2015-01-08
    jack    2015-02-03  23  2015-01-01  2015-02-03
    jack    2015-04-06  42  2015-01-01  2015-04-06
    mart    2015-04-08  62  2015-04-08  2015-04-08
    mart    2015-04-09  68  2015-04-08  2015-04-09
    mart    2015-04-11  75  2015-04-08  2015-04-11
    mart    2015-04-13  94  2015-04-08  2015-04-13
    neil    2015-05-10  12  2015-05-10  2015-05-10
    neil    2015-06-12  80  2015-05-10  2015-06-12
    tony    2015-01-02  15  2015-01-02  2015-01-02
    tony    2015-01-04  29  2015-01-02  2015-01-04
    tony    2015-01-07  50  2015-01-02  2015-01-07
    

    参考内容:
    - SQL Server中的窗口函数
    - 分析函数——排序排列(rank、dense_rank、row_number)
    - SQL中的窗口函数 OVER窗口函数

    展开全文
  • hive窗口函数总结

    千次阅读 2020-12-19 17:49:55
    hive窗口函数细则

    Hive preceding and following理解

    在了解hive开窗函数前我们来看看Hive窗口函数preceding and following是怎么回事呢.
    Hive窗口函数中,有一个功能是统计当前行之前或之后指定行作为一个聚合,关键字是 preceding 和 following,举例说明其使用方法.
    常规的窗口函数比较简单,这里介绍一下分组的,重点是分组排序之后的rows between用法。
    关键是理解rows between中关键字含义:

    关键字 含义
    preceding 往前
    following 往后
    current row 当前行
    unbounded 开始行
    unbounded preceding 表示从前面的起点
    unbounded following 表示到后面的终点

    案例

    select country,time,charge,
    max(charge) over (partition by country order by time) as normal,
    max(charge) over (partition by country order by time rows between unbounded preceding and current row) as unb_pre_cur,
    max(charge) over (partition by country order by time rows between 2 preceding and 1 following) as pre2_fol1,
    max(charge) over (partition by country order by time rows between current row and unbounded following) as cur_unb_fol 
    from temp
    

    注意:默认是在分组类的当前行之前的行中计算。
    rows between unbounded preceding and current row和默认的一样
    rows between 2 preceding and 1 following表示在当前行的前2行和后1行中计算
    rows between current row and unbounded following表示在当前行和到最后行中计算
    rows between对于avg、min、max、sum这几个窗口函数的含义基本是一致的,注意查看当前结果
    注意查看分组后窗口函数统计结果
    在 hive 环境中创建临时表

    create table tmp_student
    (
       name           string,
       class          tinyint,
       cooperator_name   string,
       score          tinyint
    )row format delimited fields terminated by '|';
    

    加载测试数据
    load data local inpath ‘text.txt’ into table tmp_student;
    其中text.txt中内容为:

    adf|3|测试公司1|45
    xx|3|测试公司2|55
    cfe|2|测试公司2|74
    3dd|3|测试公司5|n
    fda|1|测试公司7|80
    gds|2|测试公司9|92
    ffd|1|测试公司10|95
    dss|1|测试公司4|95
    ddd|3|测试公司3|99
    gf|3|测试公司9|99
    

    查看是否加载成功

    hive> select * from tmp_student;
    adf	3	测试公司1	45
    xx 3	测试公司2	55
    cfe	2	测试公司2	74
    3dd	3	测试公司5	NULL
    fda	1	测试公司7	80
    gds	2	测试公司9	92
    ffd	1	测试公司10	95
    dss	1	测试公司4	95
    ddd	3	测试公司3	99
    gf	3	测试公司9	99
    Time taken: 1.314 seconds, Fetched: 10 row(s)
    

    下面来了解preceding and following函数用法,执行下面sql

    select
        name,
        score,
        sum(score) over(order by score range between 2 preceding and 2 following) s1, -- 当前行的score值加减2的范围内的所有行
        sum(score) over(order by score rows between 2 preceding and 2 following) s2, -- 当前行+前后2行,一共5行
        sum(score) over(order by score range between unbounded preceding and unbounded following) s3, -- 全部行,不做限制
        sum(score) over(order by score rows between unbounded preceding and unbounded following) s4, -- 全部行,不做限制
        sum(score) over(order by score) s5, -- 第一行到当前行(和当前行相同score值的所有行都会包含进去)
        sum(score) over(order by score rows between unbounded preceding and current row) s6, -- 第一行到当前行(和当前行相同score值的其他行不会包含进去,这是和上面的区别)
        sum(score) over(order by score rows between 3 preceding and current row) s7, -- 当前行+往前3行
        sum(score) over(order by score rows between 3 preceding and 1 following) s8, --当前行+往前3行+往后1行
        sum(score) over(order by score rows between current row and unbounded following) s9 --当前行+往后所有行
    from
        tmp.tmp_student
    order by 
        score;
    

    注意:
    当ORDER BY后面缺少窗口从句条件,窗口规范默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    当ORDER BY和窗口从句都缺失, 窗口规范默认是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
    rows是物理窗口,是哪一行就是哪一行,与当前行的值(order by key的key的值)无关,只与排序后的行号相关,就是我们常规理解的那样。
    range是逻辑窗口,与当前行的值有关(order by key的key的值),在key上操作range范围。

    得到相关结果如下
    注意查看窗口函数统计结果通过上面的练习我们主要是对preceding and following有了一个比较全面的理解,所谓开窗函数其实就相当于flink中的滚动窗口,统计分析都是基于这个滚动窗口内完成的,所有的聚合计算统计都需要先根据range或者row确定窗口内的数据,然后就很容易得到正确的计算结果,在确定行数的过程中需要根据range和row确定是逻辑范围还是物理范围,最终都可以看作是第N行到第M行内数据的聚合统计.
    Flink窗口说明

    窗口函Windowing functions

    • FIRST_VALUE(col, bool DEFAULT)

      返回分组窗口内第一行col的值,DEFAULT默认为false,如果指定为true,则跳过NULL后再取值,对于FIRST_VALUE每个分组第一行数据的FIRST_VALUE(col, bool DEFAULT) 就等于col,接下来几行数据会参考第一行数据是否为NULL根据True/False进行取舍.

    WITH tmp AS (
    		SELECT 1 AS group_id, 'a' AS col
    		UNION ALL
    		SELECT 1 AS group_id, 'b' AS col
    		UNION ALL
    		SELECT 1 AS group_id, 'c' AS col
    		UNION ALL
    		SELECT 2 AS group_id, NULL AS col
    		UNION ALL
    		SELECT 2 AS group_id, 'e' AS col
    	)
    SELECT group_id, col, FIRST_VALUE(col) OVER (PARTITION BY group_id ORDER BY col) AS col_new
    FROM tmp;
    返回结果为:
    group_id col col_new  
    1 a a 
    1 b a 
    1 c a 
    2 NULL NULL  
    2 e NULL
    如果是True
    WITH tmp AS (
    		SELECT 1 AS group_id, NULL AS col
    		UNION ALL
    		SELECT 1 AS group_id, 'b' AS col
    		UNION ALL
    		SELECT 1 AS group_id, 'c' AS col
    		UNION ALL
    		SELECT 2 AS group_id, NULL AS col
    		UNION ALL
    		SELECT 2 AS group_id, 'e' AS col
    	)
    SELECT group_id, col, FIRST_VALUE(col, true) OVER (PARTITION BY group_id ORDER BY col) AS col_new
    FROM tmp;
    返回结果为:  
    group_id col col_new  
    1 NULL NULL  
    1 b b  
    1 c b  
    2 NULL NULL  
    2 e e
    
    • LAST_VALUE(col, bool DEFAULT)
      返回分组窗口内第后一行col的值,DEFAULT默认为false,如果指定为true,则跳过NULL后再取值.
    WITH tmp AS (
    		SELECT 1 AS group_id, 'a' AS col
    		UNION ALL
    		SELECT 1 AS group_id, NULL AS col
    		UNION ALL
    		SELECT 1 AS group_id, 'c' AS col
    		UNION ALL
    		SELECT 2 AS group_id, 'd' AS col
    		UNION ALL
    		SELECT 2 AS group_id, 'e' AS col
    	)
    SELECT group_id, col, LAST_VALUE(col) OVER (PARTITION BY group_id ORDER BY col DESC) AS col_new FROM tmp; 
    返回结果为: 
    group_id col col_new  
    1 c c  
    1 a a  
    1 NULL NULL  
    2 e e  
    2 d d
    如果是True
    WITH tmp AS (
    		SELECT 1 AS group_id, 'a' AS col
    		UNION ALL
    		SELECT 1 AS group_id, NULL AS col
    		UNION ALL
    		SELECT 1 AS group_id, 'c' AS col
    		UNION ALL
    		SELECT 2 AS group_id, 'd' AS col
    		UNION ALL
    		SELECT 2 AS group_id, 'e' AS col
    	)
    SELECT group_id, col, LAST_VALUE(col, true) OVER (ORDER BY group_id,col DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS col_new FROM tmp;
    返回结果为:  
    group_id col col_new  
    1 c a  
    1 a a  
    1 NULL e  
    2 e d  
    2 d d
    
    开窗函数不同于group by函数,开窗函数能够把所有的记录都显示出来,一般select所选择的列也都与over里面的分组和排序字段相同,这样才能比较清楚地看到当前记录在聚合函数中的区别和贡献,上面两个窗口函数我们针对最后一个案例进行说明下.
    使用了开窗函数首先要确定窗口的大小,根据上面的PRECEDING和FOLLOWING讲解我们可以知道在分析时候窗口大小为[前一行,当前行,后一行],那么对于第一行1 c取出last_value就是从[空值,c,a]取出集合中最后一个就是a,同理对于第二行1 a取出last_value就是从[c,a,Null]中取出最后一个Null跳过再取得到a,对于2 e从集合[Null,e,d]last_value=d*
    
    • LEAD(col, n, DEFAULT)

    返回分组窗口内往下第n行col的值,n默认为1,往下第n没有时返回DEFAULT(DEFAULT默认为NULL)使用分组后那么分组之间就不交叉计算.

    WITH tmp AS
    (
     SELECT 1 AS group_id, 'a' AS col 
     UNION ALL SELECT 1 AS group_id,  'b' AS col 
     UNION ALL SELECT 1 AS group_id,  'c' AS col 
     UNION ALL SELECT 2 AS group_id,  'd' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col
    )
    SELECT group_id,
          col,
          LEAD(col) over(partition by group_id order by col) as col_new
    FROM tmp;
    

    返回结果

    group_id col col_new
    1 a b
    1 b c
    1 c NULL
    2 d e
    2 e NULL
    

    等同于

    WITH tmp AS
    (
     SELECT 1 AS group_id, 'a' AS col 
     UNION ALL SELECT 1 AS group_id,  'b' AS col 
     UNION ALL SELECT 1 AS group_id,  'c' AS col 
     UNION ALL SELECT 2 AS group_id,  'd' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col
    )
    SELECT group_id,
          col,
          LAST_VALUE(col) over(partition by group_id order by col rows between 1 FOLLOWING and 1 FOLLOWING) as col_new
    FROM tmp;
    

    其中rows between 1 FOLLOWING and 1 FOLLOWING为从往后一行开始到往后一行结束=往后一行
    返回结果

    group_id col col_new
    1 a b
    1 b c
    1 c NULL
    2 d e
    2 e NULL
    

    使用LEAD默认值

    WITH tmp AS
    (
     SELECT 1 AS group_id, 'a' AS col 
     UNION ALL SELECT 1 AS group_id,  'b' AS col 
     UNION ALL SELECT 1 AS group_id,  'c' AS col 
     UNION ALL SELECT 2 AS group_id,  'd' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col
    )
    SELECT group_id,
          col,
          LEAD(col, 2, 'z') over(partition by group_id order by col) as col_new
    FROM tmp;
    

    返回结果

    group_id col col_new
    1 a c
    1 b z
    1 c z
    2 d z
    2 e z
    
    • LAG(col, n, DEFAULT)
      返回分组窗口内往上第n行col的值,n默认为1,往上第n没有时返回DEFAULT(DEFAULT默认为NULL)
    WITH tmp AS
    (
     SELECT 1 AS group_id, 'a' AS col 
     UNION ALL SELECT 1 AS group_id,  'b' AS col 
     UNION ALL SELECT 1 AS group_id,  'c' AS col 
     UNION ALL SELECT 2 AS group_id,  'd' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col
    )
    SELECT group_id,
          col,
          LAG(col) over(partition by group_id order by col) as col_new
    FROM tmp;
    

    等同于

    WITH tmp AS
    (
     SELECT 1 AS group_id, 'a' AS col 
     UNION ALL SELECT 1 AS group_id,  'b' AS col 
     UNION ALL SELECT 1 AS group_id,  'c' AS col 
     UNION ALL SELECT 2 AS group_id,  'd' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col
    )
    SELECT group_id,
          col,
          FIRST_VALUE(col) over(partition by group_id order by col rows BETWEEN 1 PRECEDING and 1 PRECEDING) as col_new
    FROM tmp;
    

    返回结果都是

    group_id col col_new
    1 a NULL
    1 b a
    1 c b
    2 d NULL
    2 e d
    

    使用默认值

    WITH tmp AS
    (
     SELECT 1 AS group_id, 'a' AS col 
     UNION ALL SELECT 1 AS group_id,  'b' AS col 
     UNION ALL SELECT 1 AS group_id,  'c' AS col 
     UNION ALL SELECT 2 AS group_id,  'd' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col
    )
    SELECT group_id,
          col,
          LAG(col, 2, 'zz') over(partition by group_id order by col) as col_new
    FROM tmp;
    

    返回结果

    group_id col col_new
    1 a zz
    1 b zz
    1 c a
    2 d zz
    2 e zz
    

    OVER详解 The OVER clause

    ** FUNCTION(expr) OVER([PARTITION BY statement] [ORDER BY statement] [window clause]) **
    中括号为可选参数
    FUNCTION:包括标准聚合函数(COUNT/SUM/MIN/MAX/AVG)和一些分析函数(RANK/ROW_NUMBER/DENSE_RANK等)
    PARTITION BY:可以由一个或者多个列组成
    ORDER BY:可以由一个或者多个列组成
    window clause:(ROWS | RANGE) BETWEEN (UNBOUNDED PRECEDING | num PRECEDING | CURRENT ROW) AND (UNBOUNDED PRECEDING | num PRECEDING | CURRENT ROW)
    当window clause 未指定时默认为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,即分组内第一行至当前行作为窗口
    当 window clause和ORDER BY都未指定时,默认为ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    **即分组内第一行至最后一行作为窗口.**
    

    标准聚合函数

    COUNT(expr) OVER()
    返回窗口内行数
    WITH tmp AS
    (
     SELECT 1 AS group_id, 'a' AS col 
     UNION ALL SELECT 1 AS group_id,  'b' AS col 
     UNION ALL SELECT 1 AS group_id,  'c' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col 
     UNION ALL SELECT 2 AS group_id,  'e' AS col
    )
    SELECT group_id,
          col,
          count(col) over(partition by group_id) as cnt1,
          count(col) over(partition by group_id order by col) as cnt2,
          count(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as cnt3,
          count(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as cnt4
    FROM tmp;
    返回结果为
    group_id col cnt1 cnt2 cnt3 cnt4
    1 a 3 1 3 3
    1 b 3 2 2 2
    1 c 3 3 1 1
    2 e 2 2 2 1
    2 e 2 2 1 1
    
    SUM(expr) OVER()
    返回窗口内求和值
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  2 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col
    )
    SELECT group_id,
          col,
          SUM(col) over(partition by group_id) as sum1,
          SUM(col) over(partition by group_id order by col) as sum2,
          SUM(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as sum3,
          SUM(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as sum4
    FROM tmp;
    返回结果为
    group_id col sum1 sum2 sum3 sum4
    1 1 6 1 6 6
    1 2 6 3 5 5
    1 3 6 6 3 3
    2 4 8 8 8 4
    2 4 8 8 4 4
    
    MIN(expr) OVER()
    返回窗口内最小值
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  2 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          MIN(col) over(partition by group_id) as min1,
          MIN(col) over(partition by group_id order by col) as min2,
          MIN(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as min3
    FROM tmp;
    group_id col min1 min2 min3
    1 1 1 1 1
    1 2 1 1 2
    1 3 1 1 3
    2 4 4 4 4
    2 5 4 4 5
    
    MAX(expr) OVER()
    返回窗口内最大值
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  2 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          MAX(col) over(partition by group_id) as max1,
          MAX(col) over(partition by group_id order by col) as max2,
          MAX(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as max3
    FROM tmp;
    返回结果为
    group_id col max1 max2 max3
    1 1 3 1 3
    1 2 3 2 3
    1 3 3 3 3
    2 4 5 4 5
    2 5 5 5 5
    
    AVG(expr) OVER()
    返回窗口内平均值
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  2 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col
    )
    SELECT group_id,
          col,
          AVG(col) over(partition by group_id) as avg1,
          AVG(col) over(partition by group_id order by col) as avg2,
          AVG(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg3,
          AVG(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg4
    FROM tmp;
    返回结果为
    |group_id|col|avg1|avg2|avg3|avg4|
    |1|1|2.0|1.0|2.0|2.0|
    |1|2|2.0|1.5|2.5|2.5|
    |1|3|2.0|2.0|3.0|3.0|
    |2|4|4.0|4.0|4.0|4.0|
    |2|4|4.0|4.0|4.0|4.0|
    
    分析函数 Analytics functions
    RANK() OVER()
    返回分组内排名(不支持自定义窗口)
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          RANK() over(partition by group_id order by col desc) as r
    FROM tmp;
    返回结果为
    |group_id|col|r|
    |1|3|1|
    |1|3|1|
    |1|1|3|
    |2|5|1|
    |2|4|2|
    
    ROW_NUMBER() OVER()
    返回分组内行号(不支持自定义窗口)
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          ROW_NUMBER() over(partition by group_id order by col desc) as r
    FROM tmp;
    返回结果为
    |group_id|col|r|
    |1|3|1|
    |1|3|2|
    |1|1|3|
    |2|5|1|
    |2|4|2|
    
    DENSE_RANK() OVER()
    返回分组内排名(排名相等不会留下空位,不支持自定义窗口)
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          DENSE_RANK() over(partition by group_id order by col desc) as r
    FROM tmp;
    返回结果为
    |group_id|col|r|
    |1|3|1|
    |1|3|1|
    |1|1|2|
    |2|5|1|
    |2|4|2|
    
    CUME_DIST() OVER()
    返回分组内累计分布值,即分组内小于(或者大于)等于当前值行数/分组内总行数
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          CUME_DIST() over(partition by group_id order by col asc) as d1,
          CUME_DIST() over(partition by group_id order by col desc) as d2
    FROM tmp;
    
    返回结果为
    |group_id|col|d1|d2|
    |1|3|1.0|0.6666666666666666|
    |1|3|1.0|0.6666666666666666|
    |1|1|0.3333333333333333|1.0|
    |2|5|1.0|0.5|
    |2|4|0.5|1.0|
    
    PERCENT_RANK() OVER()
    返回百分比排序值,即分组内当前行的RANK值-1/分组内总行数-1
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          RANK() over(partition by group_id order by col asc) as r1,
          PERCENT_RANK() over(partition by group_id order by col asc) as p1,
          RANK() over(partition by group_id order by col desc) as r2,
          PERCENT_RANK() over(partition by group_id order by col desc) as p2
    FROM tmp;
    
    返回结果为
    |group_id|col|r1|p1|r2|p2|
    |1|3|2|0.5|1|0.0|
    |1|3|2|0.5|1|0.0|
    |1|1|1|0.0|3|1.0|
    |2|5|2|1.0|1|0.0|
    |2|4|1|0.0|2|1.0|
    
    NTILE(INTEGER x) OVER()
    返回分区编号(将有序分区划分为x个组,称为bucket,并为分区中的每一行分配一个bucket编号)
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          col,
          NTILE(2) over(partition by group_id order by col asc) as bucket_id
    FROM tmp;
    
    返回结果为
    |group_id|col|bucket_id|
    |1|1|1|
    |1|3|1|
    |1|3|2|
    |1|3|2|
    |2|4|1|
    |2|5|2|
    OVER子句也支持聚合函数
    Hive 2.1.0及之后版本,OVER子句也支持聚合函数,如:
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  5 AS col
    )
    SELECT group_id,
          RANK() over(order by sum(col) desc) as r
    FROM tmp
    group by group_id;
    结果为
    |group_id|r|
    |2|1|
    |1|2|
    
    window clause 的另一种写法
    将window子句写在from后面,over后使用别名进行引用,如下:
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  2 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col
    )
    SELECT group_id,
          col,
          AVG(col) over w1 as avg1,
          AVG(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg2
    FROM tmp
    WINDOW w1 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following);
    
    结果为
    |group_id|col|avg1|avg2|
    |1|1|2.0|2.0|
    |1|2|2.5|2.5|
    |1|3|3.0|3.0|
    |2|4|4.0|4.0|
    |2|4|4.0|4.0|
    
    WITH tmp AS
    (
     SELECT 1 AS group_id, 1 AS col 
     UNION ALL SELECT 1 AS group_id,  2 AS col 
     UNION ALL SELECT 1 AS group_id,  3 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col 
     UNION ALL SELECT 2 AS group_id,  4 AS col
    )
    SELECT group_id,
          col,
          AVG(col) over w1 as avg1,
          AVG(distinct col) over w2 as avg2
    FROM tmp
    WINDOW w1 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following),
    w2 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following);
    
    结果为
    |group_id|col|avg1|avg2|
    |1|1|2.0|2.0|
    |1|2|2.5|2.5|
    |1|3|3.0|3.0|
    |2|4|4.0|4.0|
    |2|4|4.0|4.0|
    

    本文完.
    Any suggestions and criticisms will be sincerely welcomed.
    资料

    https://blog.csdn.net/happyrocking/article/details/105369558
    https://docs.aws.amazon.com/redshift/latest/dg/redshift
    https://www.jianshu.com/p/3f3cf58472ca
    https://www.cnblogs.com/hyunbar/p/13524855.html
    https://blog.csdn.net/weixin_42307036/article/details/112381387

    展开全文
  • Hive 窗口函数

    2020-06-01 19:45:25
    Hive 窗口函数简介概念数据准备实例聚合函数+overpartition by子句order by子句window子句窗口函数中的序列函数NTILELAG和LEAD函数first_value和last_value扩展:总结: 简介 本文主要介绍hive中的窗口函数.hive中的...

    简介

    本文主要介绍hive中的窗口函数.hive中的窗口函数和sql中的窗口函数相类似,都是用来做一些数据分析类的工作,一般用于olap分析(在线分析处理)。

    概念

    我们都知道在sql中有一类函数叫做聚合函数,例如sum()、avg()、max()等等,这类函数可以将多行数据按照规则聚集为一行,一般来讲聚集后的行数是要少于聚集前的行数的.但是有时我们想要既显示聚集前的数据,又要显示聚集后的数据,这时我们便引入了窗口函数.

    在深入研究Over字句之前,一定要注意:在SQL处理中,窗口函数都是最后一步执行,而且仅位于Order by字句之前。

    数据准备

    我们准备一张order表,字段分别为name,orderdate,cost.数据内容如下:

    jack,2015-01-01,10
    tony,2015-01-02,15
    jack,2015-02-03,23
    tony,2015-01-04,29
    jack,2015-01-05,46
    jack,2015-04-06,42
    tony,2015-01-07,50
    jack,2015-01-08,55
    mart,2015-04-08,62
    mart,2015-04-09,68
    neil,2015-05-10,12
    mart,2015-04-11,75
    neil,2015-06-12,80
    mart,2015-04-13,94
    

    实例

    聚合函数+over

    假如说我们想要查询在2015年4月份购买过的顾客及总人数,我们便可以使用窗口函数去去实现

    select name,count(*) over ()
    from t_window
    where substring(orderdate,1,7) = '2015-04'
    

    得到的结果如下:

    name    count_window_0
    mart    5
    mart    5
    mart    5
    mart    5
    jack    5
    

    可见其实在2015年4月一共有5次购买记录,mart购买了4次,jack购买了1次.事实上,大多数情况下,我们是只看去重后的结果的.针对于这种情况,我们有两种实现方式
    第一种:distinct

    select distinct name,count(*) over ()
    from t_window
    where substring(orderdate,1,7) = '2015-04'
    

    第二种:group by

    select name,count(*) over ()
    from t_window
    where substring(orderdate,1,7) = '2015-04'
    group by name
    

    执行后的结果如下:

    name count_window_0 
    mart 2 
    jack 2
    

    partition by子句

    Over子句之后第一个提到的就是Partition By.Partition By子句也可以称为查询分区子句,非常类似于Group By,都是将数据按照边界值分组,而Over之前的函数在每一个分组之内进行,如果超出了分组,则函数会重新计算.

    实例

    我们想要去看顾客的购买明细及月购买总额,可以执行如下的sql

    select name,orderdate,cost,sum(cost) over(partition by month(orderdate))
    from t_window
    

    执行结果如下:

    name    orderdate   cost    sum_window_0
    jack    2015-01-01  10  205
    jack    2015-01-08  55  205
    tony    2015-01-07  50  205
    jack    2015-01-05  46  205
    tony    2015-01-04  29  205
    tony    2015-01-02  15  205
    jack    2015-02-03  23  23
    mart    2015-04-13  94  341
    jack    2015-04-06  42  341
    mart    2015-04-11  75  341
    mart    2015-04-09  68  341
    mart    2015-04-08  62  341
    neil    2015-05-10  12  12
    neil    2015-06-12  80  80
    

    可以看出数据已经按照月进行汇总了.

    order by子句

    上述的场景,假如我们想要将cost按照月进行累加.这时我们引入order by子句.

    order by子句会让输入的数据强制排序(文章前面提到过,窗口函数是SQL语句最后执行的函数,因此可以把SQL结果集想象成输入数据)。Order By子句对于诸如Row_Number(),Lead(),LAG()等函数是必须的,因为如果数据无序,这些函数的结果就没有任何意义。因此如果有了Order By子句,则Count(),Min()等计算出来的结果就没有任何意义。

    我们在上面的代码中加入order by

    select name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by orderdate )
    from t_window
    

    得到的结果如下:(order by默认情况下聚合从起始行到当前行的数据)

    name    orderdate   cost    sum_window_0
    jack    2015-01-01  10  10
    tony    2015-01-02  15  25 //10+15
    tony    2015-01-04  29  54 //10+15+29
    jack    2015-01-05  46  100 //10+15+29+46
    tony    2015-01-07  50  150
    jack    2015-01-08  55  205
    jack    2015-02-03  23  23
    jack    2015-04-06  42  42
    mart    2015-04-08  62  104
    mart    2015-04-09  68  172
    mart    2015-04-11  75  247
    mart    2015-04-13  94  341
    neil    2015-05-10  12  12
    neil    2015-06-12  80  80
    

    window子句

    我们在上面已经通过使用partition by子句将数据进行了分组的处理.如果我们想要更细粒度的划分,我们就要引入window子句了.

    我们首先要理解两个概念:

    • 如果只使用partition by子句,未指定order by的话,我们的聚合是分组内的聚合. 使用了order
    • by子句,未使用window子句的情况下,默认从起点到当前行.

    当同一个select查询中存在多个窗口函数时,他们相互之间是没有影响的.每个窗口函数应用自己的规则.

    • window子句:
    • PRECEDING:往前
    • FOLLOWING:往后
    • CURRENT ROW:当前行
    • UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED
      FOLLOWING:表示到后面的终点

    我们按照name进行分区,按照购物时间进行排序,做cost的累加.
    如下我们结合使用window子句进行查询

    select name,orderdate,cost,
    sum(cost) over() as sample1,--所有行相加
    sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
    sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
    sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )  as sample4 ,--和sample3一样,由起点到当前行的聚合
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   and current row) as sample5, --当前行和前面一行做聚合
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   AND 1 FOLLOWING  ) as sample6,--当前行和前边一行及后面一行
    sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
    from t_window;
    

    得到查询结果如下:

    name    orderdate   cost    sample1 sample2 sample3 sample4 sample5 sample6 sample7
    jack    2015-01-01  10  661 176 10  10  10  56  176
    jack    2015-01-05  46  661 176 56  56  56  111 166
    jack    2015-01-08  55  661 176 111 111 101 124 120
    jack    2015-02-03  23  661 176 134 134 78  120 65
    jack    2015-04-06  42  661 176 176 176 65  65  42
    mart    2015-04-08  62  661 299 62  62  62  130 299
    mart    2015-04-09  68  661 299 130 130 130 205 237
    mart    2015-04-11  75  661 299 205 205 143 237 169
    mart    2015-04-13  94  661 299 299 299 169 169 94
    neil    2015-05-10  12  661 92  12  12  12  92  92
    neil    2015-06-12  80  661 92  92  92  92  92  80
    tony    2015-01-02  15  661 94  15  15  15  44  94
    tony    2015-01-04  29  661 94  44  44  44  94  79
    tony    2015-01-07  50  661 94  94  94  79  79  50
    

    窗口函数中的序列函数

    主要序列函数是不支持window子句的.

    hive中常用的序列函数有下面几个:

    NTILE

    • NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
    • NTILE不支持ROWS BETWEEN, 比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY
      createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
    • 如果切片不均匀,默认增加第一个切片的分布

    这个函数用什么应用场景呢?假如我们想要每位顾客购买金额前1/3的交易记录,我们便可以使用这个函数.

    select name,orderdate,cost,
           ntile(3) over() as sample1 , --全局数据切片
           ntile(3) over(partition by name), -- 按照name进行分组,在分组内将数据切成3份
           ntile(3) over(order by cost),--全局按照cost升序排列,数据切成3份
           ntile(3) over(partition by name order by cost ) --按照name分组,在分组内按照cost升序排列,数据切成3份
    from t_window
    

    得到的数据如下:

    name    orderdate   cost    sample1 sample2 sample3 sample4
    jack    2015-01-01  10  3   1   1   1
    jack    2015-02-03  23  3   1   1   1
    jack    2015-04-06  42  2   2   2   2
    jack    2015-01-05  46  2   2   2   2
    jack    2015-01-08  55  2   3   2   3
    mart    2015-04-08  62  2   1   2   1
    mart    2015-04-09  68  1   2   3   1
    mart    2015-04-11  75  1   3   3   2
    mart    2015-04-13  94  1   1   3   3
    neil    2015-05-10  12  1   2   1   1
    neil    2015-06-12  80  1   1   3   2
    tony    2015-01-02  15  3   2   1   1
    tony    2015-01-04  29  3   3   1   2
    tony    2015-01-07  50  2   1   2   3
    

    如上述数据,我们去sample4 = 1的那部分数据就是我们要的结果

    row_number
    rank
    dense_rank

    • 这三个窗口函数的使用场景非常多
    • row_number()从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列
    • RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位 DENSE_RANK()
      生成数据项在分组中的排名,排名相等会在名次中不会留下空位
      注意:
      rank和dense_rank的区别在于排名相等时会不会留下空位.

      举例如下:
    SELECT 
    cookieid,
    createtime,
    pv,
    RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
    DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
    FROM lxw1234 
    WHERE cookieid = 'cookie1';
     
    cookieid day           pv       rn1     rn2     rn3 
     
    cookie1 2015-04-12      7       1       1       1
    cookie1 2015-04-11      5       2       2       2
    cookie1 2015-04-15      4       3       3       3
    cookie1 2015-04-16      4       3       3       4
    cookie1 2015-04-13      3       5       4       5
    cookie1 2015-04-14      2       6       5       6
    cookie1 2015-04-10      1       7       6       7
    rn1: 15号和16号并列第3, 13号排第5
    rn2: 15号和16号并列第3, 13号排第4
    rn3: 如果相等,则按记录值排序,生成唯一的次序,如果所有记录值都相等,或许会随机排吧。
    

    LAG和LEAD函数

    这两个函数为常用的窗口函数,可以返回上下数据行的数据.
    以我们的订单表为例,假如我们想要查看顾客上次的购买时间可以这样去查询

    select name,orderdate,cost,
    lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,
    lag(orderdate,2) over (partition by name order by orderdate) as time2
    from t_window;
    

    查询后的数据为:

    name    orderdate   cost    time1   time2
    jack    2015-01-01  10  1900-01-01  NULL
    jack    2015-01-05  46  2015-01-01  NULL
    jack    2015-01-08  55  2015-01-05  2015-01-01
    jack    2015-02-03  23  2015-01-08  2015-01-05
    jack    2015-04-06  42  2015-02-03  2015-01-08
    mart    2015-04-08  62  1900-01-01  NULL
    mart    2015-04-09  68  2015-04-08  NULL
    mart    2015-04-11  75  2015-04-09  2015-04-08
    mart    2015-04-13  94  2015-04-11  2015-04-09
    neil    2015-05-10  12  1900-01-01  NULL
    neil    2015-06-12  80  2015-05-10  NULL
    tony    2015-01-02  15  1900-01-01  NULL
    tony    2015-01-04  29  2015-01-02  NULL
    tony    2015-01-07  50  2015-01-04  2015-01-02
    

    time1取的为按照name进行分组,分组内升序排列,取上一行数据的值,见下图。

    time2取的为按照name进行分组,分组内升序排列,取上面2行的数据的值,注意当lag函数未设置行数值时,默认为1行.设定取不到时的默认值时,取null值.

    lead函数与lag函数方向相反,取向下的数据.
    在这里插入图片描述

    first_value和last_value

    first_value取分组内排序后,截止到当前行,第一个值
    last_value取分组内排序后,截止到当前行,最后一个值

    select name,orderdate,cost,
    first_value(orderdate) over(partition by name order by orderdate) as time1,
    last_value(orderdate) over(partition by name order by orderdate) as time2
    from t_window
    

    查询结果如下:

    name    orderdate   cost    time1   time2
    jack    2015-01-01  10  2015-01-01  2015-01-01
    jack    2015-01-05  46  2015-01-01  2015-01-05
    jack    2015-01-08  55  2015-01-01  2015-01-08
    jack    2015-02-03  23  2015-01-01  2015-02-03
    jack    2015-04-06  42  2015-01-01  2015-04-06
    mart    2015-04-08  62  2015-04-08  2015-04-08
    mart    2015-04-09  68  2015-04-08  2015-04-09
    mart    2015-04-11  75  2015-04-08  2015-04-11
    mart    2015-04-13  94  2015-04-08  2015-04-13
    neil    2015-05-10  12  2015-05-10  2015-05-10
    neil    2015-06-12  80  2015-05-10  2015-06-12
    tony    2015-01-02  15  2015-01-02  2015-01-02
    tony    2015-01-04  29  2015-01-02  2015-01-04
    tony    2015-01-07  50  2015-01-02  2015-01-07
    

    原文参考:https://blog.csdn.net/qq_26937525/article/details/54925827

    扩展:

    • row_number的用途非常广泛,排序最好用它,它会为查询出来的每一行记录生成一个序号,依次排序且不会重复,注意使用row_number函数时必须要用over子句选择对某一列进行排序才能生成序号。
    • rank函数用于返回结果集的分区内每行的排名,行的排名是相关行之前的排名数加一。简单来说rank函数就是对查询出来的记录进行排名,与row_number函数不同的是,rank函数考虑到了over子句中排序字段值相同的情况,如果使用rank函数来生成序号,over子句中排序字段值相同的序号是一样的,后面字段值不相同的序号将跳过相同的排名号排下一个,也就是相关行之前的排名数加一,可以理解为根据当前的记录数生成序号,后面的记录依此类推。
    • dense_rank函数的功能与rank函数类似,dense_rank函数在生成序号时是连续的,而rank函数生成的序号有可能不连续。dense_rank函数出现相同排名时,将不跳过相同排名号,rank值紧接上一次的rank值。在各个分组内,rank()是跳跃排序,有两个第一名时接下来就是第四名,dense_rank()是连续排序,有两个第一名时仍然跟着第二名。

    借助实例能更直观地理解:

    假设现在有一张学生表student,学生表中有姓名、分数、课程编号。

    select * from student;
    

    在这里插入图片描述
    现在需要按照课程对学生的成绩进行排序:

    --row_number() 顺序排序
    select name,course,row_number() over(partition by course order by score desc) rank from student;
    
    --rank() 跳跃排序,如果有两个第一级别时,接下来是第三级别
    select name,course,rank() over(partition by course order by score desc) rank from student;
    
    --dense_rank() 连续排序,如果有两个第一级别时,接下来是第二级别 
    select name,course,dense_rank() over(partition by course order by score desc) rank from student;
    

    取得每门课程的第一名:

    --每门课程第一名只取一个: 
    select * from (select name,course,row_number() over(partition by course order by score desc) rank from student) where rank=1;
    --每门课程第一名取所有: 
    select * from (select name,course,dense_rank() over(partition by course order by score desc) rank from student) where rank=1;
    --每门课程第一名取所有:
    select * from (select name,course,rank() over(partition by course order by score desc) rank from student) where rank=1;
    

    附:每门课程第一名取所有的其他方法(使用group by 而不是partition by):

    select s.* from student s
      inner join(select course,max(score) as score from student group by course) c
      on s.course=c.course and s.score=c.score; 
    --或者使用using关键字简化连接
    select * from student s
      inner join(select course,max(score) as score from student group by course) c
      using(course,score);
    

    关于Parttion by:

    Parttion by关键字是Oracle中分析性函数的一部分,用于给结果集进行分区。它和聚合函数Group by不同的地方在于它只是将原始数据进行名次排列,能够返回一个分组中的多条记录(记录数不变),而Group by是对原始数据进行聚合统计,一般只有一条反映统计值的结果(每组返回一条)。

    TIPS:

    使用rank over()的时候,空值是最大的,如果排序字段为null, 可能造成null字段排在最前面,影响排序结果。

    可以这样: rank over(partition by course order by score desc nulls last)

    总结:

    在使用排名函数的时候需要注意以下三点:

    1、排名函数必须有 OVER 子句。

    2、排名函数必须有包含 ORDER BY 的 OVER 子句。

    3、分组内从1开始排序。

    参考:https://www.cnblogs.com/qiuting/p/7880500.html

    展开全文
  • hive窗口函数使用

    千次阅读 2020-09-24 11:14:24
    hive窗口函数的使用前言一、hive窗口函数语法1、over()窗口函数的语法结构1.1、over()函数中的三个函数讲解2、常与over()一起使用的分析函数2.1、聚合类2.2、排名类2.3、其他类3、窗口函数总结 前言 我们在学习hive...

    前言

    我们在学习hive窗口函数的时候,一定要先了解窗口函数的结构。而不是直接百度sum() over()、row_number() over()、或者count() over()的用法,如果这样做,永远也掌握不到窗口函数的核心。
    看了很多文章后才知道over()才是窗口函数,而sum、row_number、count只是与over()搭配的分析函数,当然除了这三个函数还有其他的函数。

    一、hive窗口函数语法

    在前言中我们已经说了avg()、sum()、max()、min()是分析函数,而over()才是窗口函数,下面我们来看看over()窗口函数的语法结构、及常与over()一起使用的分析函数。

    1. over()窗口函数的语法结构
    2. 常与over()一起使用的分析函数
    3. 窗口函数总结

    1、over()窗口函数的语法结构

    分析函数over(partition by 列名 order by 列名 rows between 开始位置 and 结束位置)

    over()函数中包括三个函数:包括分区partition by 列名、排序order by 列名、指定窗口范围rows between 开始位置 and 结束位置。我们在使用over()窗口函数时,over()函数中的这三个函数可组合使用也可以不使用。
    over()函数中如果不使用这三个函数,窗口大小是针对查询产生的所有数据,如果指定了分区,窗口大小是针对每个分区的数据。
    

    1.1、over()函数中的三个函数讲解

    order by
    order by是排序的意思,是该窗口中的

    A、partition by
    partition by可理解为group by 分组。over(partition by 列名)搭配分析函数时,分析函数按照每一组每一组的数据进行计算的。
    B、rows between 开始位置 and 结束位置
    是指定窗口范围,比如第一行到当前行。而这个范围是随着数据变化的。over(rows between 开始位置 and 结束位置)搭配分析函数时,分析函数按照这个范围进行计算的。
    窗口范围说明
    我们常使用的窗口范围是ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行),常用该窗口来计算累加。

    • PRECEDING:往前
    • FOLLOWING:往后
    • CURRENT ROW:当前行
    • UNBOUNDED:起点(一般结合PRECEDING,FOLLOWING使用)
    • UNBOUNDED PRECEDING 表示该窗口最前面的行(起点)
    • UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)
      比如说
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行)
    ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
    ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW(表示往前2行到当前行)
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示当前行到终点)
    

    2、常与over()一起使用的分析函数

    2.1、聚合类

    avg()、sum()、max()、min()

    2.2、排名类

    row_number()按照值排序时产生一个自增编号,不会重复(如:1、2、3、4、5、6)
    rank() 按照值排序时产生一个自增编号,值相等时会重复,会产生空位(如:1、2、3、3、3、6)
    dense_rank() 按照值排序时产生一个自增编号,值相等时会重复,不会产生空位(如:1、2、3、3、3、4)
    

    2.3、其他类

    lag(列名,往前的行数,[行数为null时的默认值,不指定为null]),可以计算用户上次购买时间,或者用户下次购买时间。
    lead(列名,往后的行数,[行数为null时的默认值,不指定为null])
    ntile(n) 把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,ntile返回此行所属的组的编号
    

    3、窗口函数总结

    其实窗口函数逻辑比较绕,我们可以把窗口理解为对表中的数据进行分组,排序等计算。要真正的理解HIVE窗口函数,还是要结合练习题才行。

    展开全文
  • HIVE窗口函数

    2018-10-09 09:33:21
    HIVE窗口函数 1. Hive窗口函数 SUM,AVG,MIN,MAX 数据格式 zhm,2018-04-10,12 zhm,2018-04-11,51 zhm,2018-04-12,72 zhm,2018-04-13,31 zhm,2018-04-14,21 zhm,2018-04-15,14 zhm,2018-04-16,44 创建表 create ...
  • Hive窗口函数使用

    2020-09-19 16:06:42
    Hive窗口函数:https://zhuanlan.zhihu.com/p/113245904
  • Hive 窗口函数详解

    2020-09-04 14:42:47
    目录1、hive窗口函数2、在日常的开发中常用那些类型函数?3、了解哪些窗口函数,都是什么意思?窗口函数功能over语法1、物理窗口(真实往上下移动多少行rows between):2、 逻辑窗口(满足条件上下多少行):(金融行业...
  • hive 窗口函数 实战

    2021-04-27 14:35:42
    hive 窗口函数 实战一、数据准备1. 创建本地business.txt,导入数据 name,orderdate,cost2.创建hive表并导入数据二、需求1.查询在2017年4月份购买过的顾客及总人数2. 查询顾客的购买明细及月购买总额3.查询顾客的...
  • hive 窗口函数练习

    2021-05-18 21:46:15
    hive 窗口函数练习 第一套练习 1、使用 over() 函数进行数据统计, 统计每个用户及表中数据的总数 2、求用户明细并统计每天的用户总数 3、计算从第一天到现在的所有 score 大于80分的用户总数 4、计算每个用户到当前...
  • Hive窗口函数理解

    2020-09-01 22:08:40
    一直对Hive窗口函数半知半解,最近一直在学习相关知识,记录一下自己的理解。        首先是窗口函数出现的概念,指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着...
  • hive窗口函数记录

    2021-03-26 15:17:46
    hive窗口函数记录 一 、 排序 row_number, rank,dense_rank row_number(): 分组内,从1到n连续 rank(): 分组内,排序如下:1,2,2,3 dense_rank():分组内,排序如下:1,2,2,4 准备数据: create table ...
  • Hive窗口函数进阶指南

    2021-02-16 08:00:00
    作为一名数据小哥,在写SQL的漫漫路上,窗口函数犹如一把披荆斩棘的利剑,帮助作者解决了很多繁琐复杂的需求,在此对窗口函数表示感谢。本文在介绍了窗口函数的同时,着重介绍Hive窗口函数的使用...
  • Hive窗口函数保姆级教程

    千次阅读 2021-06-15 17:37:08
    在SQL中有一类函数叫做聚合函数,...第一部分是Hive窗口函数详解,剖析各种窗口函数(几乎涵盖Hive所有的窗口函数); 第二部分是窗口函数实际应用,这部分总共有五个例子,都是工作常用、面试必问的非常经典的例子。

空空如也

空空如也

1 2 3 4 5 ... 20
收藏数 3,186
精华内容 1,274
关键字:

hive窗口