精华内容
下载资源
问答
  • hive中的开窗函数

    千次阅读 2019-01-23 10:54:39
    目录count开窗函数sum开窗函数min开窗函数max开窗函数avg开窗函数first_value开窗函数last_value开窗函数lag开窗函数、lead开窗函数cume_dist开窗函数排序开窗函数rank开窗函数dense_rank开窗函数ntile开窗函数row_...


    最近遇到了hive开窗函数的工作,并且面试时很有可能会问到hive的开窗函数,特整理加深印象,并方便以后复习

    普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中的每行都返回一个值。简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。
    开窗函数一般分为两类,聚合开窗函数和排序开窗函数。

    表的名称:
    rt_data -> 存放了当天每半小时的店铺销售数据
    表的字段:
    shop_id, stat_date, ordamt
    字段stat_date格式yyMMddHHmm

    表的数据:
    10026 201901230030 5170
    10026 201901230100 5669
    10026 201901230130 2396
    10026 201901230200 1498
    10026 201901230230 1997
    10026 201901230300 1188
    10026 201901230330 598
    10026 201901230400 479
    10026 201901230430 1587
    10026 201901230530 799
    10027 201901230030 2170
    10027 201901230100 1623
    10027 201901230130 3397
    10027 201901230200 1434
    10027 201901230230 1001
    10028 201901230300 1687
    10028 201901230330 1298
    10028 201901230400 149
    10029 201901230430 2587
    10029 201901230530 589

    设置本地模式

    set hive.exec.mode.local.auto=true;
    

    创建数据库

    CREATE DATABASE IF NOT EXISTS db_window;
    

    创建表

    CREATE TABLE IF NOT EXISTS db_window.rt_data(
    shop_id INT, 
    stat_date STRING, 
    ordamt DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
    

    加载数据到表的指定位置

    hdfs dfs -put rt_data.txt /user/hive/warehouse-3.1.1/db_window.db/rt_data/
    

    开窗函数中指定行及值的语法

    (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
    

    count开窗函数

    主要场景算截止到某位置的数目

    select shop_id,stat_date,ordamt,
    -- 以符合条件的所有行作为窗口
    count(shop_id) over() as count1,
     -- 以按shop_id分组的所有行作为窗口
    count(shop_id) over(partition by shop_id) as count2,
     -- 以按shop_id分组、按stat_date排序的所有行作为窗口
    count(shop_id) over(partition by shop_id order by stat_date) as count3,
     -- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口
    count(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as count4,
     -- 以按shop_id分组、按stat_date排序、按从起点到末尾,默认从起点到末尾和count2结果相同
    count(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as count5,
     -- 以按shop_id分组、按stat_date排序、按从起点到当前行的前一行
    count(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as count6,
     -- 以按shop_id分组、按stat_date排序、按从起点到当前行
    count(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as count7,
     -- 以按shop_id分组、按stat_date排序、按从当前行到末尾
    count(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as count8,
     -- 以按shop_id分组、按stat_date排序、按从当前行往后一行到末尾
    count(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as count9,
     -- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行
    count(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as count10
    from rt_data;
    
    OK
    shop_id	stat_date	ordamt	count1	count2	count3	count4	count5	count6	count7	count8	count9	count10
    10026	201901230030	5170.0	20	10	1	3	10	0	1	10	9	2
    10026	201901230100	5669.0	20	10	2	4	10	1	2	9	8	2
    10026	201901230130	2396.0	20	10	3	4	10	2	3	8	7	2
    10026	201901230200	1498.0	20	10	4	4	10	3	4	7	6	2
    10026	201901230230	1997.0	20	10	5	4	10	4	5	6	5	2
    10026	201901230300	1188.0	20	10	6	4	10	5	6	5	4	2
    10026	201901230330	598.0	20	10	7	4	10	6	7	4	3	2
    10026	201901230400	479.0	20	10	8	4	10	7	8	3	2	2
    10026	201901230430	1587.0	20	10	9	3	10	8	9	2	1	1
    10026	201901230530	799.0	20	10	10	2	10	9	10	1	0	0
    10027	201901230030	2170.0	20	5	1	3	5	0	1	5	4	2
    10027	201901230100	1623.0	20	5	2	4	5	1	2	4	3	2
    10027	201901230130	3397.0	20	5	3	4	5	2	3	3	2	2
    10027	201901230200	1434.0	20	5	4	3	5	3	4	2	1	1
    10027	201901230230	1001.0	20	5	5	2	5	4	5	1	0	0
    10028	201901230300	1687.0	20	3	1	3	3	0	1	3	2	2
    10028	201901230330	1298.0	20	3	2	3	3	1	2	2	1	1
    10028	201901230400	149.0	20	3	3	2	3	2	3	1	0	0
    10029	201901230430	2587.0	20	2	1	2	2	0	1	2	1	1
    10029	201901230530	589.0	20	2	2	2	2	1	2	1	0	0
    

    sum开窗函数

    主要场景统计截止到某个位置的值

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分组、按stat_date排序、统计每个商品截止到当前时间的销售总额,默认从起点到当前行
    sum(ordamt) over(partition by shop_id order by stat_date) as sum_amt1,
      -- 以按shop_id分组、按stat_date排序、统计每个商品前半小时到后一小时的销售额(按当前行+往前1行+往后2行的行作为窗口)
    sum(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as sum_amt2,
     -- 以按shop_id分组、按stat_date排序、统计每个商品的销售总额(从起点到末尾)
    sum(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as sum_amt3,
     -- 以按shop_id分组、按stat_date排序、统计截止到前半小时的销售总额(从起点到当前行的前一行)
    sum(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as sum_amt4,
     -- 以按shop_id分组、按stat_date排序、统计每个商品截止到当前时间的销售总额,默认从起点到当前行(从起点到当前行)
    sum(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as sum_amt5,
     -- 以按shop_id分组、按stat_date排序、统计当前时间及之后的销售总额(从当前行的末尾)
    sum(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as sum_amt6,
     -- 以按shop_id分组、按stat_date排序、统计当前时间的后半小时及之后的销售额(当前行后一行到末尾)
    sum(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as sum_amt7,
     -- 以按shop_id分组、按stat_date排序、统计当前时间后半小时到后一小时之间的销售额(按从当前行往后一行到当前行往后2行)
    sum(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as sum_amt8
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	sum_amt1	sum_amt2	sum_amt3	sum_amt4	sum_amt5	sum_amt6	sum_amt7	sum_amt8
    10026	201901230030	5170.0	5170.0	13235.0	21381.0	NULL	5170.0	21381.0	16211.0	8065.0
    10026	201901230100	5669.0	10839.0	14733.0	21381.0	5170.0	10839.0	16211.0	10542.0	3894.0
    10026	201901230130	2396.0	13235.0	11560.0	21381.0	10839.0	13235.0	10542.0	8146.0	3495.0
    10026	201901230200	1498.0	14733.0	7079.0	21381.0	13235.0	14733.0	8146.0	6648.0	3185.0
    10026	201901230230	1997.0	16730.0	5281.0	21381.0	14733.0	16730.0	6648.0	4651.0	1786.0
    10026	201901230300	1188.0	17918.0	4262.0	21381.0	16730.0	17918.0	4651.0	3463.0	1077.0
    10026	201901230330	598.0	18516.0	3852.0	21381.0	17918.0	18516.0	3463.0	2865.0	2066.0
    10026	201901230400	479.0	18995.0	3463.0	21381.0	18516.0	18995.0	2865.0	2386.0	2386.0
    10026	201901230430	1587.0	20582.0	2865.0	21381.0	18995.0	20582.0	2386.0	799.0	799.0
    10026	201901230530	799.0	21381.0	2386.0	21381.0	20582.0	21381.0	799.0	NULL	NULL
    10027	201901230030	2170.0	2170.0	7190.0	9625.0	NULL	2170.0	9625.0	7455.0	5020.0
    10027	201901230100	1623.0	3793.0	8624.0	9625.0	2170.0	3793.0	7455.0	5832.0	4831.0
    10027	201901230130	3397.0	7190.0	7455.0	9625.0	3793.0	7190.0	5832.0	2435.0	2435.0
    10027	201901230200	1434.0	8624.0	5832.0	9625.0	7190.0	8624.0	2435.0	1001.0	1001.0
    10027	201901230230	1001.0	9625.0	2435.0	9625.0	8624.0	9625.0	1001.0	NULL	NULL
    10028	201901230300	1687.0	1687.0	3134.0	3134.0	NULL	1687.0	3134.0	1447.0	1447.0
    10028	201901230330	1298.0	2985.0	3134.0	3134.0	1687.0	2985.0	1447.0	149.0	149.0
    10028	201901230400	149.0	3134.0	1447.0	3134.0	2985.0	3134.0	149.0	NULL	NULL
    10029	201901230430	2587.0	2587.0	3176.0	3176.0	NULL	2587.0	3176.0	589.0	589.0
    10029	201901230530	589.0	3176.0	3176.0	3176.0	2587.0	3176.0	589.0	NULL	NULL
    

    min开窗函数

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分组、按stat_date排序、
    min(ordamt) over(partition by shop_id order by stat_date) as min_amt1,
      -- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的最小数
    min(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as min_amt2,
     -- 以按shop_id分组、按stat_date排序、从起点到末尾的最小数
    min(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as min_amt3,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的最小数
    min(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as min_amt4,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的最小数
    min(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as min_amt5,
     -- 以按shop_id分组、按stat_date排序、从当前行的末尾的最小数
    min(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as min_amt6,
     -- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的最小数
    min(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as min_amt7,
     -- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的最小数
    min(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as min_amt8
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	min_amt1	min_amt2	min_amt3	min_amt4	min_amt5	min_amt6	min_amt7	min_amt8
    10026	201901230030	5170.0	5170.0	2396.0	479.0	NULL	5170.0	479.0	479.0	2396.0
    10026	201901230100	5669.0	5170.0	1498.0	479.0	5170.0	5170.0	479.0	479.0	1498.0
    10026	201901230130	2396.0	2396.0	1498.0	479.0	5170.0	2396.0	479.0	479.0	1498.0
    10026	201901230200	1498.0	1498.0	1188.0	479.0	2396.0	1498.0	479.0	479.0	1188.0
    10026	201901230230	1997.0	1498.0	598.0	479.0	1498.0	1498.0	479.0	479.0	598.0
    10026	201901230300	1188.0	1188.0	479.0	479.0	1498.0	1188.0	479.0	479.0	479.0
    10026	201901230330	598.0	598.0	479.0	479.0	1188.0	598.0	479.0	479.0	479.0
    10026	201901230400	479.0	479.0	479.0	479.0	598.0	479.0	479.0	799.0	799.0
    10026	201901230430	1587.0	479.0	479.0	479.0	479.0	479.0	799.0	799.0	799.0
    10026	201901230530	799.0	479.0	799.0	479.0	479.0	479.0	799.0	NULL	NULL
    10027	201901230030	2170.0	2170.0	1623.0	1001.0	NULL	2170.0	1001.0	1001.0	1623.0
    10027	201901230100	1623.0	1623.0	1434.0	1001.0	2170.0	1623.0	1001.0	1001.0	1434.0
    10027	201901230130	3397.0	1623.0	1001.0	1001.0	1623.0	1623.0	1001.0	1001.0	1001.0
    10027	201901230200	1434.0	1434.0	1001.0	1001.0	1623.0	1434.0	1001.0	1001.0	1001.0
    10027	201901230230	1001.0	1001.0	1001.0	1001.0	1434.0	1001.0	1001.0	NULL	NULL
    10028	201901230300	1687.0	1687.0	149.0	149.0	NULL	1687.0	149.0	149.0	149.0
    10028	201901230330	1298.0	1298.0	149.0	149.0	1687.0	1298.0	149.0	149.0	149.0
    10028	201901230400	149.0	149.0	149.0	149.0	1298.0	149.0	149.0	NULL	NULL
    10029	201901230430	2587.0	2587.0	589.0	589.0	NULL	2587.0	589.0	589.0	589.0
    10029	201901230530	589.0	589.0	589.0	589.0	2587.0	589.0	589.0	NULL	NULL
    

    max开窗函数

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分组、按stat_date排序、
    max(ordamt) over(partition by shop_id order by stat_date) as max_amt1,
      -- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的最大数
    max(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as max_amt2,
     -- 以按shop_id分组、按stat_date排序、从起点到末尾的最大数
    max(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as max_amt3,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的最大数
    max(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as max_amt4,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的最大数
    max(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as max_amt5,
     -- 以按shop_id分组、按stat_date排序、从当前行的末尾的最大数
    max(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as max_amt6,
     -- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的最大数
    max(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as max_amt7,
     -- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的最大数
    max(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as max_amt8
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	max_amt1	max_amt2	max_amt3	max_amt4	max_amt5	max_amt6	max_amt7	max_amt8
    10026	201901230030	5170.0	5170.0	5669.0	5669.0	NULL	5170.0	5669.0	5669.0	5669.0
    10026	201901230100	5669.0	5669.0	5669.0	5669.0	5170.0	5669.0	5669.0	2396.0	2396.0
    10026	201901230130	2396.0	5669.0	5669.0	5669.0	5669.0	5669.0	2396.0	1997.0	1997.0
    10026	201901230200	1498.0	5669.0	2396.0	5669.0	5669.0	5669.0	1997.0	1997.0	1997.0
    10026	201901230230	1997.0	5669.0	1997.0	5669.0	5669.0	5669.0	1997.0	1587.0	1188.0
    10026	201901230300	1188.0	5669.0	1997.0	5669.0	5669.0	5669.0	1587.0	1587.0	598.0
    10026	201901230330	598.0	5669.0	1587.0	5669.0	5669.0	5669.0	1587.0	1587.0	1587.0
    10026	201901230400	479.0	5669.0	1587.0	5669.0	5669.0	5669.0	1587.0	1587.0	1587.0
    10026	201901230430	1587.0	5669.0	1587.0	5669.0	5669.0	5669.0	1587.0	799.0	799.0
    10026	201901230530	799.0	5669.0	1587.0	5669.0	5669.0	5669.0	799.0	NULL	NULL
    10027	201901230030	2170.0	2170.0	3397.0	3397.0	NULL	2170.0	3397.0	3397.0	3397.0
    10027	201901230100	1623.0	2170.0	3397.0	3397.0	2170.0	2170.0	3397.0	3397.0	3397.0
    10027	201901230130	3397.0	3397.0	3397.0	3397.0	2170.0	3397.0	3397.0	1434.0	1434.0
    10027	201901230200	1434.0	3397.0	3397.0	3397.0	3397.0	3397.0	1434.0	1001.0	1001.0
    10027	201901230230	1001.0	3397.0	1434.0	3397.0	3397.0	3397.0	1001.0	NULL	NULL
    10028	201901230300	1687.0	1687.0	1687.0	1687.0	NULL	1687.0	1687.0	1298.0	1298.0
    10028	201901230330	1298.0	1687.0	1687.0	1687.0	1687.0	1687.0	1298.0	149.0	149.0
    10028	201901230400	149.0	1687.0	1298.0	1687.0	1687.0	1687.0	149.0	NULL	NULL
    10029	201901230430	2587.0	2587.0	2587.0	2587.0	NULL	2587.0	2587.0	589.0	589.0
    10029	201901230530	589.0	2587.0	2587.0	2587.0	2587.0	2587.0	589.0	NULL	NULL
    

    avg开窗函数

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分组、按stat_date排序、
    round(avg(ordamt) over(partition by shop_id order by stat_date),2) as avg_amt1,
      -- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的平均值
    round(avg(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following), 2) as avg_amt2,
     -- 以按shop_id分组、按stat_date排序、从起点到末尾的平均值
    round(avg(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following), 2) as avg_amt3,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的平均值
    round(avg(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding), 2) as avg_amt4,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的平均值
    round(avg(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row), 2) as avg_amt5,
     -- 以按shop_id分组、按stat_date排序、从当前行的末尾的平均值
    round(avg(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following), 2) as avg_amt6,
     -- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的平均值
    round(avg(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following), 2) as avg_amt7,
     -- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的平均值
    round(avg(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following), 2) as avg_amt8
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	avg_amt1	avg_amt2	avg_amt3	avg_amt4	avg_amt5	avg_amt6	avg_amt7	avg_amt8
    10026	201901230030	5170.0	5170.0	4411.67	2138.1	NULL	5170.0	2138.1	1801.22	4032.5
    10026	201901230100	5669.0	5419.5	3683.25	2138.1	5170.0	5419.5	1801.22	1317.75	1947.0
    10026	201901230130	2396.0	4411.67	2890.0	2138.1	5419.5	4411.67	1317.75	1163.71	1747.5
    10026	201901230200	1498.0	3683.25	1769.75	2138.1	4411.67	3683.25	1163.71	1108.0	1592.5
    10026	201901230230	1997.0	3346.0	1320.25	2138.1	3683.25	3346.0	1108.0	930.2	893.0
    10026	201901230300	1188.0	2986.33	1065.5	2138.1	3346.0	2986.33	930.2	865.75	538.5
    10026	201901230330	598.0	2645.14	963.0	2138.1	2986.33	2645.14	865.75	955.0	1033.0
    10026	201901230400	479.0	2374.38	865.75	2138.1	2645.14	2374.38	955.0	1193.0	1193.0
    10026	201901230430	1587.0	2286.89	955.0	2138.1	2374.38	2286.89	1193.0	799.0	799.0
    10026	201901230530	799.0	2138.1	1193.0	2138.1	2286.89	2138.1	799.0	NULL	NULL
    10027	201901230030	2170.0	2170.0	2396.67	1925.0	NULL	2170.0	1925.0	1863.75	2510.0
    10027	201901230100	1623.0	1896.5	2156.0	1925.0	2170.0	1896.5	1863.75	1944.0	2415.5
    10027	201901230130	3397.0	2396.67	1863.75	1925.0	1896.5	2396.67	1944.0	1217.5	1217.5
    10027	201901230200	1434.0	2156.0	1944.0	1925.0	2396.67	2156.0	1217.5	1001.0	1001.0
    10027	201901230230	1001.0	1925.0	1217.5	1925.0	2156.0	1925.0	1001.0	NULL	NULL
    10028	201901230300	1687.0	1687.0	1044.67	1044.67	NULL	1687.0	1044.67	723.5	723.5
    10028	201901230330	1298.0	1492.5	1044.67	1044.67	1687.0	1492.5	723.5	149.0	149.0
    10028	201901230400	149.0	1044.67	723.5	1044.67	1492.5	1044.67	149.0	NULL	NULL
    10029	201901230430	2587.0	2587.0	1588.0	1588.0	NULL	2587.0	1588.0	589.0	589.0
    10029	201901230530	589.0	1588.0	1588.0	1588.0	2587.0	1588.0	589.0	NULL	NULL
    

    first_value开窗函数

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分组、按stat_date排序、
    first_value(ordamt) over(partition by shop_id order by stat_date) as first_amt1,
      -- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的第一个值
    first_value(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as first_amt2,
     -- 以按shop_id分组、按stat_date排序、从起点到末尾的第一个值
    first_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as first_amt3,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的第一个值
    first_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as first_amt4,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的第一个值
    first_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as first_amt5,
     -- 以按shop_id分组、按stat_date排序、从当前行的末尾的第一个值
    first_value(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as first_amt6,
     -- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的第一个值
    first_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as first_amt7,
     -- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的第一个值
    first_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as first_amt8
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	first_amt1	first_amt2	first_amt3	first_amt4	first_amt5	first_amt6	first_amt7	first_amt8
    10026	201901230030	5170.0	5170.0	5170.0	5170.0	NULL	5170.0	5170.0	5669.0	5669.0
    10026	201901230100	5669.0	5170.0	5170.0	5170.0	5170.0	5170.0	5669.0	2396.0	2396.0
    10026	201901230130	2396.0	5170.0	5669.0	5170.0	5170.0	5170.0	2396.0	1498.0	1498.0
    10026	201901230200	1498.0	5170.0	2396.0	5170.0	5170.0	5170.0	1498.0	1997.0	1997.0
    10026	201901230230	1997.0	5170.0	1498.0	5170.0	5170.0	5170.0	1997.0	1188.0	1188.0
    10026	201901230300	1188.0	5170.0	1997.0	5170.0	5170.0	5170.0	1188.0	598.0	598.0
    10026	201901230330	598.0	5170.0	1188.0	5170.0	5170.0	5170.0	598.0	479.0	479.0
    10026	201901230400	479.0	5170.0	598.0	5170.0	5170.0	5170.0	479.0	1587.0	1587.0
    10026	201901230430	1587.0	5170.0	479.0	5170.0	5170.0	5170.0	1587.0	799.0	799.0
    10026	201901230530	799.0	5170.0	1587.0	5170.0	5170.0	5170.0	799.0	NULL	NULL
    10027	201901230030	2170.0	2170.0	2170.0	2170.0	NULL	2170.0	2170.0	1623.0	1623.0
    10027	201901230100	1623.0	2170.0	2170.0	2170.0	2170.0	2170.0	1623.0	3397.0	3397.0
    10027	201901230130	3397.0	2170.0	1623.0	2170.0	2170.0	2170.0	3397.0	1434.0	1434.0
    10027	201901230200	1434.0	2170.0	3397.0	2170.0	2170.0	2170.0	1434.0	1001.0	1001.0
    10027	201901230230	1001.0	2170.0	1434.0	2170.0	2170.0	2170.0	1001.0	NULL	NULL
    10028	201901230300	1687.0	1687.0	1687.0	1687.0	NULL	1687.0	1687.0	1298.0	1298.0
    10028	201901230330	1298.0	1687.0	1687.0	1687.0	1687.0	1687.0	1298.0	149.0	149.0
    10028	201901230400	149.0	1687.0	1298.0	1687.0	1687.0	1687.0	149.0	NULL	NULL
    10029	201901230430	2587.0	2587.0	2587.0	2587.0	NULL	2587.0	2587.0	589.0	589.0
    10029	201901230530	589.0	2587.0	2587.0	2587.0	2587.0	2587.0	589.0	NULL	NULL
    

    last_value开窗函数

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分组、按stat_date排序、
    last_value(ordamt) over(partition by shop_id order by stat_date) as last_amt1,
      -- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的最后一个值
    last_value(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as last_amt2,
     -- 以按shop_id分组、按stat_date排序、从起点到末尾的最后一个值
    last_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as last_amt3,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的最后一个值
    last_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as last_amt4,
     -- 以按shop_id分组、按stat_date排序、从起点到当前行的最后一个值
    last_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as last_amt5,
     -- 以按shop_id分组、按stat_date排序、从当前行的末尾的最后一个值
    last_value(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as last_amt6,
     -- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的最后一个值
    last_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as last_amt7,
     -- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的最后一个值
    last_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as last_amt8
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	last_amt1	last_amt2	last_amt3	last_amt4	last_amt5	last_amt6	last_amt7	last_amt8
    10026	201901230030	5170.0	5170.0	2396.0	799.0	NULL	5170.0	799.0	799.0	2396.0
    10026	201901230100	5669.0	5669.0	1498.0	799.0	5170.0	5669.0	799.0	799.0	1498.0
    10026	201901230130	2396.0	2396.0	1997.0	799.0	5669.0	2396.0	799.0	799.0	1997.0
    10026	201901230200	1498.0	1498.0	1188.0	799.0	2396.0	1498.0	799.0	799.0	1188.0
    10026	201901230230	1997.0	1997.0	598.0	799.0	1498.0	1997.0	799.0	799.0	598.0
    10026	201901230300	1188.0	1188.0	479.0	799.0	1997.0	1188.0	799.0	799.0	479.0
    10026	201901230330	598.0	598.0	1587.0	799.0	1188.0	598.0	799.0	799.0	1587.0
    10026	201901230400	479.0	479.0	799.0	799.0	598.0	479.0	799.0	799.0	799.0
    10026	201901230430	1587.0	1587.0	799.0	799.0	479.0	1587.0	799.0	799.0	799.0
    10026	201901230530	799.0	799.0	799.0	799.0	1587.0	799.0	799.0	NULL	NULL
    10027	201901230030	2170.0	2170.0	3397.0	1001.0	NULL	2170.0	1001.0	1001.0	3397.0
    10027	201901230100	1623.0	1623.0	1434.0	1001.0	2170.0	1623.0	1001.0	1001.0	1434.0
    10027	201901230130	3397.0	3397.0	1001.0	1001.0	1623.0	3397.0	1001.0	1001.0	1001.0
    10027	201901230200	1434.0	1434.0	1001.0	1001.0	3397.0	1434.0	1001.0	1001.0	1001.0
    10027	201901230230	1001.0	1001.0	1001.0	1001.0	1434.0	1001.0	1001.0	NULL	NULL
    10028	201901230300	1687.0	1687.0	149.0	149.0	NULL	1687.0	149.0	149.0	149.0
    10028	201901230330	1298.0	1298.0	149.0	149.0	1687.0	1298.0	149.0	149.0	149.0
    10028	201901230400	149.0	149.0	149.0	149.0	1298.0	149.0	149.0	NULL	NULL
    10029	201901230430	2587.0	2587.0	589.0	589.0	NULL	2587.0	589.0	589.0	589.0
    10029	201901230530	589.0	589.0	589.0	589.0	2587.0	589.0	589.0	NULL	NULL
    

    lag开窗函数、lead开窗函数

    这两个函数的应用场景的语法使用在我的第一篇博客上已经有了,就不重复写了,这两个函数可以求连续登录的用户数

    https://blog.csdn.net/lz6363/article/details/82470278

    cume_dist开窗函数

    计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:
    小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。

    select 
    shop_id, stat_date, ordamt,
    -- 统计小于等于当前销售额的订单占总订单的比例
    cume_dist() over(order by ordamt) as cume_dist1,
    -- 统计大于等于当前销售额的订单占总订单的比例
    cume_dist() over(order by ordamt desc) as cume_dist2,
    -- 统计分区内小于等于当前销售额的订单占总订单的比例
    round(cume_dist() over(partition by shop_id order by ordamt), 2) as cume_dist3
    from db_window.rt_data;
    
    shop_id	stat_date	ordamt	cume_dist1	cume_dist2	cume_dist3
    10026	201901230400	479.0	0.1	0.95	0.1
    10026	201901230330	598.0	0.2	0.85	0.2
    10026	201901230530	799.0	0.25	0.8	0.3
    10026	201901230300	1188.0	0.35	0.7	0.4
    10026	201901230200	1498.0	0.5	0.55	0.5
    10026	201901230430	1587.0	0.55	0.5	0.6
    10026	201901230230	1997.0	0.7	0.35	0.7
    10026	201901230130	2396.0	0.8	0.25	0.8
    10026	201901230030	5170.0	0.95	0.1	0.9
    10026	201901230100	5669.0	1.0	0.05	1.0
    10027	201901230230	1001.0	0.3	0.75	0.2
    10027	201901230200	1434.0	0.45	0.6	0.4
    10027	201901230100	1623.0	0.6	0.45	0.6
    10027	201901230030	2170.0	0.75	0.3	0.8
    10027	201901230130	3397.0	0.9	0.15	1.0
    10028	201901230400	149.0	0.05	1.0	0.33
    10028	201901230330	1298.0	0.4	0.65	0.67
    10028	201901230300	1687.0	0.65	0.4	1.0
    10029	201901230530	589.0	0.15	0.9	0.5
    10029	201901230430	2587.0	0.85	0.2	1.0
    

    排序开窗函数

    rank开窗函数

    rank 开窗函数基于 over 子句中的 order by 确定一组值中一个值的排名。
    如果存在partition by ,则为每个分区组中的每个值排名。排名可能不是连续的。例如,如果两个行的排名为 1,则下一个排名为 3。

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id排序
    rank() over(order by shop_id) as rank_amt1,
      -- 以按shop_id分区、按stat_date排序
    rank() over(partition by shop_id order by stat_date) as rank_amt2
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	rank_amt1	rank_amt2
    10026	201901230030	5170.0	1	1
    10026	201901230100	5669.0	1	2
    10026	201901230130	2396.0	1	3
    10026	201901230200	1498.0	1	4
    10026	201901230230	1997.0	1	5
    10026	201901230300	1188.0	1	6
    10026	201901230330	598.0	1	7
    10026	201901230400	479.0	1	8
    10026	201901230430	1587.0	1	9
    10026	201901230530	799.0	1	10
    10027	201901230030	2170.0	11	1
    10027	201901230100	1623.0	11	2
    10027	201901230130	3397.0	11	3
    10027	201901230200	1434.0	11	4
    10027	201901230230	1001.0	11	5
    10028	201901230300	1687.0	16	1
    10028	201901230330	1298.0	16	2
    10028	201901230400	149.0	16	3
    10029	201901230430	2587.0	19	1
    10029	201901230530	589.0	19	2
    

    dense_rank开窗函数

    dense_rank与rank有一点不同,当排名一样的时候,接下来的行是连续的。如两个行的排名为 1,则下一个排名为 2。

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id排序
    dense_rank() over(order by shop_id) as dense_amt1,
      -- 以按shop_id分区、按stat_date排序
    dense_rank() over(partition by shop_id order by stat_date) as dense_amt2
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	dense_amt1	dense_amt2
    10026	201901230030	5170.0	1	1
    10026	201901230100	5669.0	1	2
    10026	201901230130	2396.0	1	3
    10026	201901230200	1498.0	1	4
    10026	201901230230	1997.0	1	5
    10026	201901230300	1188.0	1	6
    10026	201901230330	598.0	1	7
    10026	201901230400	479.0	1	8
    10026	201901230430	1587.0	1	9
    10026	201901230530	799.0	1	10
    10027	201901230030	2170.0	2	1
    10027	201901230100	1623.0	2	2
    10027	201901230130	3397.0	2	3
    10027	201901230200	1434.0	2	4
    10027	201901230230	1001.0	2	5
    10028	201901230300	1687.0	3	1
    10028	201901230330	1298.0	3	2
    10028	201901230400	149.0	3	3
    10029	201901230430	2587.0	4	1
    10029	201901230530	589.0	4	2
    

    ntile开窗函数

    将分区中已排序的行划分为大小尽可能相等的指定数量的排名的组,并返回给定行所在的组的排名。

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分成两组、按stat_date排序
    ntile(2) over(partition by shop_id order by stat_date) as ntile_amt1,
      -- 以按shop_id分成三组、按stat_date排序
    ntile(3) over(partition by shop_id order by stat_date) as ntile_amt2
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	ntile_amt1	ntile_amt2
    10026	201901230030	5170.0	1	1
    10026	201901230100	5669.0	1	1
    10026	201901230130	2396.0	1	1
    10026	201901230200	1498.0	1	1
    10026	201901230230	1997.0	1	2
    10026	201901230300	1188.0	2	2
    10026	201901230330	598.0	2	2
    10026	201901230400	479.0	2	3
    10026	201901230430	1587.0	2	3
    10026	201901230530	799.0	2	3
    10027	201901230030	2170.0	1	1
    10027	201901230100	1623.0	1	1
    10027	201901230130	3397.0	1	2
    10027	201901230200	1434.0	2	2
    10027	201901230230	1001.0	2	3
    10028	201901230300	1687.0	1	1
    10028	201901230330	1298.0	1	2
    10028	201901230400	149.0	2	3
    10029	201901230430	2587.0	1	1
    10029	201901230530	589.0	2	2
    

    row_number开窗函数

    从1开始对分区内的数据排序。

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分区、按stat_date排序
    row_number() over(partition by shop_id order by stat_date) as row_amt
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	row_amt
    10026	201901230030	5170.0	1
    10026	201901230100	5669.0	2
    10026	201901230130	2396.0	3
    10026	201901230200	1498.0	4
    10026	201901230230	1997.0	5
    10026	201901230300	1188.0	6
    10026	201901230330	598.0	7
    10026	201901230400	479.0	8
    10026	201901230430	1587.0	9
    10026	201901230530	799.0	10
    10027	201901230030	2170.0	1
    10027	201901230100	1623.0	2
    10027	201901230130	3397.0	3
    10027	201901230200	1434.0	4
    10027	201901230230	1001.0	5
    10028	201901230300	1687.0	1
    10028	201901230330	1298.0	2
    10028	201901230400	149.0	3
    10029	201901230430	2587.0	1
    10029	201901230530	589.0	2
    

    percent_rank开窗函数

    计算给定行的百分比排名。可以用来计算超过了百分之多少的人。如360小助手开机速度超过了百分之多少的人。
    (当前行的rank值-1)/(分组内的总行数-1)

    select
    shop_id, stat_date, ordamt,
      -- 以按shop_id分区、按stat_date排序
    row_number() over(partition by shop_id order by stat_date) as row_amt,
    round(percent_rank() over(partition by shop_id order by stat_date), 2) as percent_amt
    from db_window.rt_data;
    
    OK
    shop_id	stat_date	ordamt	row_amt	percent_amt
    10026	201901230030	5170.0	1	0.0
    10026	201901230100	5669.0	2	0.11
    10026	201901230130	2396.0	3	0.22
    10026	201901230200	1498.0	4	0.33
    10026	201901230230	1997.0	5	0.44
    10026	201901230300	1188.0	6	0.56
    10026	201901230330	598.0	7	0.67
    10026	201901230400	479.0	8	0.78
    10026	201901230430	1587.0	9	0.89
    10026	201901230530	799.0	10	1.0
    10027	201901230030	2170.0	1	0.0
    10027	201901230100	1623.0	2	0.25
    10027	201901230130	3397.0	3	0.5
    10027	201901230200	1434.0	4	0.75
    10027	201901230230	1001.0	5	1.0
    10028	201901230300	1687.0	1	0.0
    10028	201901230330	1298.0	2	0.5
    10028	201901230400	149.0	3	1.0
    10029	201901230430	2587.0	1	0.0
    10029	201901230530	589.0	2	1.0
    
    展开全文
  • 开窗函数 over():跟在聚合函数后面,指定分析函数工作的数据窗口大小,决定了数据的聚合范围,默认范围是整个数据窗口,即所有行。可使用partition by将数据进行分组聚合;使用order by除了排序,还会改变聚合范围...

    开窗函数

    over():跟在聚合函数后面,指定分析函数工作的数据窗口大小,决定了数据的聚合范围,默认范围是整个数据窗口,即所有行。可使用partition by将数据进行分组聚合;使用order by除了排序(默认asc升序),还会改变聚合范围(从开始行到当前行的聚合),使用distribute by…sort by…和partition by…order by…效果一样。
    如下的窗口子句,能对聚合数据进一步的划分:
    current row:当前行
    n preceding:往前n行数据
    n following:往后n行数据
    unbounded:起点,unbounded preceding表示从前面的起点开始,unbounded following表示到后面的终点结束
    其他分析函数:
    lag(col, n[, default_val]):往前第n行数据
    lead(col, n[, default_val]):往后第n行数据
    ntile(n):把有序分区中的行分发到指定数据的组中,各个组的编号从1开始,一共分成n片。

    例子

    create table business(name string, orderdate string, cost int);
    

    (1)查询在2019年1月份购买过的顾客及总人数

    select name,count(*) over()
    from business
    where date_format(orderdate, 'yyyy-MM')='2019-01'
    group by name;
    

    注意“总人数”指顾客数量,如果不使用over(),结果将是每位顾客对应的购买次数,此处over()省略窗口字句ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING,即针对所有行。
    (2)查询顾客的购买明细及月购买总额

    select name, orderdate, cost, sum(cost) over(partition by month(orderdate))
    from business;
    或:
    select name, orderdate, cost, sum(cost) over(distribute by month(orderdate))
    from business;
    

    由于查询字段中含有与orderdate无关的字段,故使用不了group by。
    (3)查询顾客购买明细,以及每个顾客的cost按照日期进行累加

    select name, orderdate, cost, sum(cost) over(partition by name order by orderdate)
    from business;
    

    order by的聚合范围即从开始行到当前行的聚合,加上窗口语句“ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW”效果一样。
    (4)查询顾客上次的购买时间

    select name, orderdate, lag(orderdate, 1, '0000-00-00') over(partition by name order by orderdate) last_time
    from business;
    

    lag函数如果没有上n次的值,返回null,或者指定的默认值。
    (5)查询前20%时间的订单信息

    select name, orderdate, cost from(
    select name, orderdate, cost, ntile(5) over(order by orderdate) arr
    from business
    )
    where arr = 1;
    

    需要查询前20%(即1/5)的数据可以使用ntile函数,切成5片数据,再从中取第一份数据。

    Rank函数

    rank():相同时排名值会重复,总数不会变,故相同时排名值会跳跃而不连续
    row_number():相同时排名值会根据顺序计算,不会重复
    dense_rank():相同时排名值会重复,总数会减少

    例子

    create table score(name string, subject string, score int);
    

    计算每门学科成绩排名

    select name, subject,score,rank()
    over(partition by subject order by score desc)
    from score;
    
    展开全文
  • 目录前言1.row_number开窗函数2.rank开窗函数3.dense_rank开窗函数4.ntile开窗函数5.percent_rank开窗函数 前言 书接上回,上回我们把聚合开窗函数讲完了,闲言少叙,这次我们讲解排序开窗函数; 排序开窗函数,包含...

    前言

    书接上回,上回我们把聚合开窗函数讲完了,闲言少叙,这次我们讲解排序开窗函数;
    排序开窗函数,包含row_number开窗函数,rank开窗函数,dense_rank开窗函数,ntile开窗函数和percent_rank开窗函数;
    这次重点讲解row_number开窗函数,ntile开窗函数和percent_rank开窗函数;

    1.row_number开窗函数

    语义:从1开始对分区内的数据排序
    排序类似:1,2,3,4,5。

    -- row_number 开窗函数
    
    select studentid,departmentid,classid,math,
    -- 对分区departmentid,classid内的数据按math排序
    row_number() over(partition by departmentid,classid order by math) as row_number
    from student_scores;
    
    结果
    studentid   departmentid    classid math    row_number
    111         department1     class1  69      1
    113         department1     class1  74      2
    112         department1     class1  80      3
    115         department1     class1  93      4
    114         department1     class1  94      5
    124         department1     class2  70      1
    121         department1     class2  74      2
    123         department1     class2  78      3
    122         department1     class2  86      4
    216         department2     class1  74      1
    215         department2     class1  82      2
    212         department2     class1  83      3
    211         department2     class1  93      4
    213         department2     class1  94      5
    214         department2     class1  94      6
    223         department2     class2  74      1
    222         department2     class2  78      2
    224         department2     class2  80      3
    225         department2     class2  85      4
    221         department2     class2  99      5
    
    结果解释:
        同一分区,相同值,不同序。如studentid=213 studentid=214 值都为94 排序为5,6

    2.rank开窗函数

    同row_number开窗函数
    语义:从1开始对分区内的数据排序
    排序类似:1,2,2,4,5。

    3.dense_rank开窗函数

    同row_number开窗函数
    语义:从1开始对分区内的数据排序
    排序类似:1,2,2,3,4。

    4.ntile开窗函数

    语义:将分区中已排序的行划分为大小尽可能相等的指定数量的排名的组,并返回给定行所在的组的排名。

    -- ntile 开窗函数
    
    select *,
    -- 对分区内的数据分成两组
    ntile(2) over(partition by departmentid order by math) as ntile1,
    -- 对分区内的数据分成三组
    ntile(3) over(partition by departmentid order by math) as ntile2
    from student_scores;
    
    结果
    id  studentid   language    math    english classid departmentid    ntile1  ntile2
    1   111         68          69      90      class1  department1     1       1
    9   124         76          70      76      class2  department1     1       1
    6   121         96          74      79      class2  department1     1       1
    3   113         90          74      75      class1  department1     1       2
    8   123         70          78      61      class2  department1     1       2
    2   112         73          80      96      class1  department1     2       2
    7   122         89          86      85      class2  department1     2       3
    5   115         99          93      89      class1  department1     2       3
    4   114         89          94      93      class1  department1     2       3
    18  223         79          74      96      class2  department2     1       1
    15  216         85          74      93      class1  department2     1       1
    17  222         80          78      96      class2  department2     1       1
    19  224         75          80      78      class2  department2     1       1
    14  215         84          82      73      class1  department2     1       2
    11  212         76          83      75      class1  department2     1       2
    20  225         82          85      63      class2  department2     2       2
    10  211         89          93      60      class1  department2     2       2
    12  213         71          94      90      class1  department2     2       3
    13  214         94          94      66      class1  department2     2       3
    16  221         77          99      61      class2  department2     2       3
    
    结果解释:
        第8行
            ntile1:对分区的数据均匀分成2组后,当前行的组排名为2
            ntile2:对分区的数据均匀分成3组后,当前行的组排名为3
    

    5.percent_rank开窗函数

    计算给定行的百分比排名。可以用来计算超过了百分之多少的人,不含本身;
    计算公式:(当前行的rank值-1)/(分组内的总行数-1)
    类似与cume_dist开窗函数,是统计小于当前分数的人数占总人数的比例;

    -- percent_rank 开窗函数
    
    select studentid,departmentid,classid,math,
    row_number() over(partition by departmentid,classid order by math) as row_number,
    percent_rank() over(partition by departmentid,classid order by math) as percent_rank
    from student_scores;
    
    结果
    studentid   departmentid    classid math    row_number  percent_rank
    111         department1     class1  69      1           0.0
    113         department1     class1  74      2           0.25
    112         department1     class1  80      3           0.5
    115         department1     class1  93      4           0.75
    114         department1     class1  94      5           1.0
    124         department1     class2  70      1           0.0
    121         department1     class2  74      2           0.3333333333333333
    123         department1     class2  78      3           0.6666666666666666
    122         department1     class2  86      4           1.0
    216         department2     class1  74      1           0.0
    215         department2     class1  82      2           0.2
    212         department2     class1  83      3           0.4
    211         department2     class1  93      4           0.6
    213         department2     class1  94      5           0.8
    214         department2     class1  94      6           0.8
    223         department2     class2  74      1           0.0
    222         department2     class2  78      2           0.25
    224         department2     class2  80      3           0.5
    225         department2     class2  85      4           0.75
    221         department2     class2  99      5           1.0
    
    结果解释:
        studentid=115,percent_rank=(4-1)/(5-1)=0.75
        studentid=123,percent_rank=(3-1)/(4-1)=0.6666666666666666
    

    参考:https://blog.csdn.net/wangpei1949/article/details/81437574

    展开全文
  • sql开窗函数初体验

    2020-07-26 19:45:33
    目录一、定义:1.1 使用group by进行查询1.2 使用开窗函数1.3 ...number函数2.8 rank开窗函数2.9 dense_rank开窗函数hive同样可以使用开窗函数 十分方便,但是就是mapreduce是真的慢 一、定义: 开窗函数用于为行定义一

    一、定义:

    开窗函数用于为行定义一个窗口(这里的窗口是指运算将要操作的行的集合),它对一组值进行操作,不需要使用GROUP BY子句对数据进行分组,能够在同一行中同时返回基础行的列和聚合列。
    开窗函数与聚合函数计算方式一样,开窗函数也是对行集组进行聚合计算,但是它不像普通聚合函数那样每组只返回一个值,开窗函数可以为每组返回多个值。
    开窗函数的语法为:分析函数 + over(partition by 列名1 order by 列名2 ), 括号中的两个关键词partition by 和order by 可以只出现一个。over() 前面是一个函数。 可以是 count(), sum(), row_number()来自百度百科
    举个例子说明(使用mysql进行简单说明):
    有一个表 websites,里面的数据如下:
    在这里插入图片描述

    1.1 使用group by进行查询

    select count(name) as websites,country from  websites GROUP BY country;
    

    结果如下:
    在这里插入图片描述

    1.2 使用开窗函数

    SELECTname, country, COUNT(name) OVER(PARTITION BY country) AS countCountry FROM websites; 
    

    在这里插入图片描述
    可以看出区别了,group by可以进行分组,但是一组值只会返回一个结果。而开窗函数不同,开窗函数可以每组返回多个值。 在一些复杂的查询环境下,开窗函数更能满足需求。

    1.3 使用ORDER BY

    select name, country,count(name) over (PARTITION by country ORDER BY id) as countCountry from websites;
    

    结果:
    在这里插入图片描述
    结果解释:
    开窗函数没有指定窗口范围的时候,默认统计窗口是从起点到当前行,也就是说,统计到 name = ‘taobao’ 这个网站的时候,从起点到它,都只有一条记录,也就是说此时窗口内只有一条数据,那么count函数获得的值就是1。
    当函数计算到第二条数据,也就是 name = ‘runoonb’ 的时候,此时窗口有三个值,其中 country = ‘CN’ 的记录有两条,那么count的值就是2。
    ……
    当计算到第四条数据,也就是 name = ‘stackoverflow’ 时,此时窗口内一共包含四条数据, country = 'IND’的只有一条,那么该类别的count函数值肯定就是1。

    1.4 rows关键字 指定窗口范围

    select name, country, count(name) over (PARTITION by country ORDER BY id rows between 1 preceding and 2 following) as countCountry from websites;
    

    这里使用rows关键字,将窗口范围限定在当前记录的前一条 – 当前记录的后两条之间。 也就是说窗口中会有四条数据。
    结果:
    在这里插入图片描述
    结果解释:
    当sql开始执行,第一次的窗口就包含了第一条数据,以及第一条数据的后面两条数据。因为它前面没有数据,那么此时窗口中就包含了三条数据。该三条数据,都属于’CN’, 所以此时 count函数的值就是3;第二条数据同理, 处理它时的窗口包含它的上一条数据+它+它的后两条数据,依然是有 三条 countrty = ‘CN’ 的数据。 当处理到第三条数据的时候,情况开始变化,它此时的窗口也是四条数据,前一条+它本身+它的后两条,它是属于’CN’的,当前窗口中的四条数据,已经只有两条的 country = 'CN‘ 了,所以此时的count函数的值为2。

    二、多种分析函数的使用

    2.1 min函数

    表 score
    在这里插入图片描述

    #ROWS between 1 preceding and 0 following  当前行前一行 -- 当前行   
    #ROWS between unbounded preceding and unbounded following   从起始行到结束行所有行
    #rows between unbounded preceding and current row   从起始行到当前行
    select *,min(degree) over(partition by CNO ORDER BY degree) as MinDegree from score
    select *,min(degree) over(partition by CNO ORDER BY degree desc) as MinDegree from score
    

    结果1:
    在这里插入图片描述
    结果2:
    在这里插入图片描述
    确实验证了 不加rows关键字的话,窗口范围就是默认从起始行到当前行

    select CNO,DEGREE,min(degree) over(partition by CNO ) as MinDegree from score;
    

    结果:
    在这里插入图片描述
    所以其实如果不加order by子句 计算范围是整个分区;

    2.2 sum开窗函数

    select CNO,DEGREE,sum(degree) over(partition by CNO ORDER BY degree) as MinDegree from score;
    

    结果:
    在这里插入图片描述

    2.3 first_value开窗函数

    该函数是返回分区的第一个值

    select CNO,DEGREE,first_value(degree) over(PARTITION by cno ORDER BY degree) as first_value1 from score;
    select CNO,DEGREE,first_value(degree) over (PARTITION by cno order by degree rows BETWEEN 1 preceding and 1 following) as first_value1 from score
    

    结果1:
    在这里插入图片描述
    结果2:
    在这里插入图片描述

    2.4 last_vlaue开窗函数

    作用:返回分区最后一个值

    2.5 lag函数

    lag函数其实是用于统计窗口内向上第n行的值。 比如说我需要查询比当前分数排名高3个名次的分数,就可以用到该函数
    lag(column, n,default_value); 如果没有设置default_value 那就是null

    select cno,degree,lag(degree,3,60) over(PARTITION  by cno ORDER BY degree) as lag1 from score;
    select cno,degree,lag(degree,3) over(PARTITION  by cno ORDER BY degree) as lag1 from score;
    

    结果1:
    在这里插入图片描述
    结果2:
    在这里插入图片描述

    2.6 lead函数

    跟lag相反,向下统计第n行的值

    select cno,degree,lead(degree,3,60) over(PARTITION  by cno ORDER BY degree) as lag1 from score;
    select cno,degree,lead(degree,3) over(PARTITION  by cno ORDER BY degree) as lag1 from score;
    

    在这里插入图片描述
    在这里插入图片描述
    可以发现这里依然受partitiion by 函数制约, degree=‘88’ 的值 往后第三行已经是另一个分区的分数了,所以会返回null

    2.7 row_number函数

    顾名思义,就是个计算行号的函数

    select cno,degree,row_number() over(PARTITION  by cno ORDER BY degree) as rowNUmber from score;
    

    结果:
    在这里插入图片描述
    此时往表中插入几条分数相同的数据
    再执行sql,查看结果:
    在这里插入图片描述
    不管你结果degree 是否相同,行号会一直增加。

    2.8 rank开窗函数

    同样也是计算“行号”的函数,但是不同的是,相同的degree 它会返回相同的行号,但是下一个行号就会是递增的行号
    举例: 该表有并列的两个degree 都是返回4,但是 下一个degree 会返回6

    select cno,degree,rank() over(PARTITION  by cno ORDER BY degree) as rowNUmber from score;
    

    结果:
    在这里插入图片描述

    2.9 dense_rank开窗函数

    select cno,degree,dense_rank() over(PARTITION  by cno ORDER BY degree) as rowNUmber from score;
    

    结果:
    在这里插入图片描述
    与rank函数不同, dense_rank不会新增行号,就算你有并列的数据,下一次依然只 + 1;

    hive同样可以使用开窗函数 十分方便,但是就是mapreduce是真的慢
    展开全文
  • 作为一名开发人员来讲,我感觉在职场白混...今天将要介绍SQL Sever的开窗函数,何谓开窗函数,不懂吧。反正对于我来说,我是摸不着头脑了,第一次听说过。那么,什么是开窗函数,其实可以理解为是聚合函数的一个加强版
  • 文章目录SparkSQL 开窗函数聚合函数和开窗函数开窗函数分类聚合开窗函数SparkSQL 排序开窗函数ROW_NUMBER顺序排序RANK跳跃排序DENSE_RANK连续排序NTILE分组排名 SparkSQL 开窗函数 ...开窗函数的引入是为了既显示聚集前...
  • Hive 开窗函数

    万次阅读 多人点赞 2018-08-06 00:19:44
    工作中用到了几个hive开窗函数,便想把hive开窗函数系统梳理一遍。 开窗函数 普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中...
  • 文章目录分析函数和开窗函数数据源排名函数和分析函数rownumrow_numberrankdense_rank开窗函数 over() 分析函数和开窗函数 数据源 with base_source as ( select '李紫萼' as username, '河北' as school, '80' as ...
  • 1.dense_rank() over开窗函数:在做排序时如果遇到列有重复值,则重复值所在行的序列值(排名)相同,而其后的序列值依旧递增。本人牛客网SQL编程使用的具体例子(要求获取当前(to_date='9999-01-01')薪水第二多的...
  • Oracle–开窗函数函数排序rank、desc_rank、row_number rank() over([partition by par_col] order by order_col):相同数据并列排序并且会跳过并列的应该有的次序 select salary,rank() over(order by salary desc) ...
  • 1.开窗函数over partition开窗函数使用开窗函数使用于取出多列分组,取一列分组下另一组前几名,先利用开窗函数对其分组排名,开窗函数排名函数较多使用row_number(),还有rank()等,生成排名列之后将结果集筛选其...
  • ORACLE数据库(六)-----开窗函数开窗函数又名分析函数、窗口函数、OLAP(数据分析)函数聚合函数:将数据按照一定的规则分组,统一分析各组的某项情况,每个分组返回一行结果开窗函数:将数据按照一定的规则分组,统一...
  • Oracle开窗函数

    2020-07-28 17:30:50
    SQL开窗函数 开窗函数:在开窗函数出现之前存在着很多用 SQL 语句很难解决的问题,很多都要通过复杂的相关子查询或者存储过程来完成。为了解决这些问题,在 2003 年 ISO SQL 标准加入了开窗函数开窗函数的使用...
  • Spark SQL 之开窗函数

    千次阅读 2019-06-20 19:08:39
    目录 一:开窗函数 二:row_number()开窗函数实战 三:开窗函数与一般函数使用的...为了解决这些问题,在2003年ISO SQL标准增加了开窗函数开窗函数的使用使得这些经典的难题能够被轻松的解决。眼下在 MSSQLSe...
  • 1、开窗函数和分组函数的区别 分组函数是指按照某列或者某些列分组后进行某种计算,比如计数,求和等聚合函数进行计算。 开窗函数是指基于某列或某些列让数据有序,数据行数和原始数据数相同,依然能曾现个体数据...
  • 开窗函数

    2021-03-30 16:39:25
    一、什么是开窗函数 开窗函数/分析函数:over() 开窗函数也叫分析函数,有两类:一类是聚合开窗函数,一类是排序开窗函数开窗函数的调用格式为: 函数名(列名) OVER(partition by 列名 order by列名) 。 ...
  • Spark 开窗函数

    2019-07-31 08:04:22
    row_number() 开窗函数是按照某个字段分组,然后取另一字段的前几个的值,相当于 分组取topN 如果SQL语句里面使用到了开窗函数,那么这个SQL语句必须使用HiveContext来执行,HiveContext默认情况下在本地无法创建。...
  • Hive开窗函数总结

    2019-04-06 15:11:20
    在 Hive 工作使用过程中,可能使用导开窗函数。这里做一下总结: 开窗函数:普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中的每...
  • 1 排序:row_number(),dense_rank() select sku_id,sku_name,category_id3,sales_count,price, sales_count*price as s1, dense_rank() over(partition by category_id3 order by s1 desc) as r2 -- Invalid ...
  • HIVE开窗函数

    2021-06-10 20:43:52
    目录标题ETL,SQL面试高频考点——HIVE开窗函数一,窗口函数介绍二,开窗函数三,开窗函数分类排序开窗函数:实列解析:后续更新中~ 一,窗口函数介绍 窗口函数,也叫OLAP函数(Online Anallytical Processing,联机...
  • Oracle—开窗函数

    千次阅读 2019-04-18 11:16:00
    开发工具与关键技术:工具:...分组函数用于求取分组后的某个值,有时候这种查询并不能满足我们的需求,这就有了开窗函数开窗函数也能对数据进行分组。   开窗函数的作用是计算基于组的某种聚合值。开窗函数...
  • 开窗函数简介

    2018-03-26 16:47:34
    开窗函数与聚合函数一样,也是对行集组进行聚合计算,但是它不像普通聚合函数那样每组只返回一个值,开窗函数可以为每组返回多个值,因为开窗函数所执行聚合计算的行集组是窗口。 语法:主要是over( PARTITION BY ...
  • 最近又用到oracle over、开窗函数rank、dense_rank,找了一些前辈写的资料,这里转载一下,以备查用 一、oracle的分析函数over 及开窗函数 一:分析函数over Oracle从8.1.6开始提供分析函数,分析函数用于...
  • hive开窗函数

    2020-02-26 14:13:21
    Hive 开窗函数 展开 工作中用到了几个hive开窗函数,便想把hive开窗函数系统梳理一遍。 开窗函数 普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。因此,普通的聚合函数每组(Group by)只返回一个值,而开窗...

空空如也

空空如也

1 2 3 4 5 ... 20
收藏数 5,595
精华内容 2,238
关键字:

rank开窗函数