Lakehouse 时间序列分析指南

概述

时间序列分析是数据分析中的核心场景,广泛应用于业务趋势监控、用户行为分析和运营指标计算。云器 Lakehouse 提供完整的时间函数和窗口函数支持,包括时间截断、日期格式化、滚动窗口聚合和日期序列生成。本指南按业务场景分类,帮助你快速掌握高效的时间序列分析方法。

涉及的 SQL 命令

命令/函数用途适用场景
DATE_FORMAT()
DATE_FORMAT()
将时间戳格式化为字符串按月/日分组、展示
DATE_TRUNC()
DATE_TRUNC()
截断到指定时间粒度按周/月/年聚合
DATEDIFF()
DATEDIFF()
计算两个日期之间的天数差事件间隔、留存分析
LAG()
LAG()
/
LEAD()
LEAD()
访问前后行数据环比、同比计算
AVG() OVER (RANGE BETWEEN INTERVAL ...)
AVG() OVER (RANGE BETWEEN INTERVAL ...)
基于时间范围的滚动聚合7日/30日滚动平均
SEQUENCE()
SEQUENCE()
+
EXPLODE()
EXPLODE()
生成连续日期序列并展开为行日期补零、日历表
INTERVAL
INTERVAL
时间偏移量日期加减、范围过滤
COALESCE()
COALESCE()
空值替换补零、默认值填充

前置准备

以下示例使用模拟的用户事件表

doc_ts_events
doc_ts_events

-- 创建测试表 CREATE TABLE IF NOT EXISTS doc_ts_events ( event_id INT, user_id INT, event_type STRING, amount DOUBLE, event_time TIMESTAMP ); -- 插入测试数据(覆盖 2024-01 至 2024-03) INSERT INTO doc_ts_events VALUES (1, 101, 'purchase', 120.50, CAST('2024-01-05 10:00:00' AS TIMESTAMP)), (2, 102, 'purchase', 85.00, CAST('2024-01-10 14:30:00' AS TIMESTAMP)), (3, 101, 'refund', 30.00, CAST('2024-01-15 09:00:00' AS TIMESTAMP)), (4, 103, 'purchase', 200.00, CAST('2024-01-20 16:00:00' AS TIMESTAMP)), (5, 102, 'purchase', 55.00, CAST('2024-01-25 11:00:00' AS TIMESTAMP)), (6, 101, 'purchase', 300.00, CAST('2024-02-03 10:00:00' AS TIMESTAMP)), (7, 104, 'purchase', 150.00, CAST('2024-02-08 13:00:00' AS TIMESTAMP)), (8, 103, 'refund', 50.00, CAST('2024-02-14 15:00:00' AS TIMESTAMP)), (9, 102, 'purchase', 90.00, CAST('2024-02-18 09:30:00' AS TIMESTAMP)), (10, 101, 'purchase', 75.00, CAST('2024-02-22 17:00:00' AS TIMESTAMP)), (11, 104, 'purchase', 220.00, CAST('2024-03-01 10:00:00' AS TIMESTAMP)), (12, 103, 'purchase', 180.00, CAST('2024-03-07 14:00:00' AS TIMESTAMP)), (13, 101, 'refund', 25.00, CAST('2024-03-12 11:00:00' AS TIMESTAMP)), (14, 102, 'purchase', 130.00, CAST('2024-03-18 16:00:00' AS TIMESTAMP)), (15, 104, 'purchase', 95.00, CAST('2024-03-25 12:00:00' AS TIMESTAMP));


场景一:按时间粒度聚合

按月统计各事件类型的发生次数和金额总和,是最常见的时间序列聚合需求。

-- 按月、事件类型聚合 SELECT DATE_FORMAT(event_time, 'yyyy-MM') AS month, event_type, COUNT(*) AS event_count, ROUND(SUM(amount), 2) AS total_amount FROM doc_ts_events GROUP BY DATE_FORMAT(event_time, 'yyyy-MM'), event_type ORDER BY month, event_type;

执行结果

monthevent_typeevent_counttotal_amount
2024-01purchase4460.5
2024-01refund130
2024-02purchase4615
2024-02refund150
2024-03purchase4625
2024-03refund125

如需按周聚合,使用

DATE_TRUNC('week', ...)
DATE_TRUNC('week', ...)
截断到周一:

-- 按自然周聚合(week_start 为每周周一) SELECT DATE_FORMAT(DATE_TRUNC('week', event_time), 'yyyy-MM-dd') AS week_start, COUNT(*) AS event_count, ROUND(SUM(amount), 2) AS weekly_amount FROM doc_ts_events WHERE event_type = 'purchase' GROUP BY DATE_TRUNC('week', event_time) ORDER BY week_start;

执行结果(前 5 行)

week_startevent_countweekly_amount
2024-01-011120.5
2024-01-08185
2024-01-151200
2024-01-22155
2024-01-291300

场景二:滚动窗口计算

计算每个交易日的 7 日滚动平均金额,用于平滑短期波动、观察趋势。

RANGE BETWEEN INTERVAL N DAY PRECEDING AND CURRENT ROW
RANGE BETWEEN INTERVAL N DAY PRECEDING AND CURRENT ROW
基于时间范围(而非行数)定义窗口,能正确处理日期不连续的情况。

WITH daily_sales AS ( SELECT CAST(DATE_FORMAT(event_time, 'yyyy-MM-dd') AS DATE) AS sale_day, ROUND(SUM(amount), 2) AS daily_amount FROM doc_ts_events WHERE event_type = 'purchase' GROUP BY DATE_FORMAT(event_time, 'yyyy-MM-dd') ) SELECT sale_day, daily_amount, ROUND( AVG(daily_amount) OVER ( ORDER BY sale_day RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW ), 2 ) AS rolling_7d_avg FROM daily_sales ORDER BY sale_day;

执行结果

sale_daydaily_amountrolling_7d_avg
2024-01-05120.5120.5
2024-01-1085102.75
2024-01-20200200
2024-01-2555127.5
2024-02-03300300
2024-02-08150225
2024-02-189090
2024-02-227582.5
2024-03-01220220
2024-03-07180200
2024-03-18130130
2024-03-259595

场景三:同比/环比计算

使用

LAG()
LAG()
获取上一期数据,计算月环比增长率。

WITH monthly_sales AS ( SELECT DATE_FORMAT(event_time, 'yyyy-MM') AS month, ROUND(SUM(amount), 2) AS monthly_amount FROM doc_ts_events WHERE event_type = 'purchase' GROUP BY DATE_FORMAT(event_time, 'yyyy-MM') ) SELECT month, monthly_amount, LAG(monthly_amount, 1) OVER (ORDER BY month) AS prev_month_amount, ROUND( (monthly_amount - LAG(monthly_amount, 1) OVER (ORDER BY month)) / LAG(monthly_amount, 1) OVER (ORDER BY month) * 100, 2 ) AS mom_growth_pct FROM monthly_sales ORDER BY month;

执行结果

monthmonthly_amountprev_month_amountmom_growth_pct
2024-01460.5NULLNULL
2024-02615460.533.55
2024-036256151.63

场景四:时间间隔计算

计算每个用户相邻两次事件之间的间隔天数,用于分析用户活跃频率和购买周期。

SELECT user_id, event_type, DATE_FORMAT(event_time, 'yyyy-MM-dd') AS event_day, LAG(DATE_FORMAT(event_time, 'yyyy-MM-dd'), 1) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_event_day, DATEDIFF( DATE_FORMAT(event_time, 'yyyy-MM-dd'), LAG(DATE_FORMAT(event_time, 'yyyy-MM-dd'), 1) OVER (PARTITION BY user_id ORDER BY event_time) ) AS days_since_last FROM doc_ts_events ORDER BY user_id, event_time;

执行结果

user_idevent_typeevent_dayprev_event_daydays_since_last
101purchase2024-01-05NULLNULL
101refund2024-01-152024-01-0510
101purchase2024-02-032024-01-1519
101purchase2024-02-222024-02-0319
101refund2024-03-122024-02-2219
102purchase2024-01-10NULLNULL
102purchase2024-01-252024-01-1015
102purchase2024-02-182024-01-2524
102purchase2024-03-182024-02-1829
103purchase2024-01-20NULLNULL
103refund2024-02-142024-01-2025
103purchase2024-03-072024-02-1422
104purchase2024-02-08NULLNULL
104purchase2024-03-012024-02-0822
104purchase2024-03-252024-03-0124

场景五:连续日期补零

当某些日期没有数据时,直接

GROUP BY
GROUP BY
会跳过这些日期,导致折线图出现断点。使用
SEQUENCE
SEQUENCE
+
EXPLODE
EXPLODE
生成完整日期序列,再
LEFT JOIN
LEFT JOIN
实际数据,用
COALESCE
COALESCE
将空值填为 0。

WITH date_spine AS ( -- 生成 2024-01-01 到 2024-03-31 的连续日期 SELECT EXPLODE( SEQUENCE( CAST('2024-01-01' AS DATE), CAST('2024-03-31' AS DATE), INTERVAL 1 DAY ) ) AS cal_day ), daily_sales AS ( SELECT CAST(DATE_FORMAT(event_time, 'yyyy-MM-dd') AS DATE) AS sale_day, ROUND(SUM(amount), 2) AS daily_amount FROM doc_ts_events WHERE event_type = 'purchase' GROUP BY DATE_FORMAT(event_time, 'yyyy-MM-dd') ) SELECT d.cal_day, COALESCE(s.daily_amount, 0) AS daily_amount FROM date_spine d LEFT JOIN daily_sales s ON d.cal_day = s.sale_day ORDER BY d.cal_day LIMIT 10;

执行结果(前 10 行)

cal_daydaily_amount
2024-01-010
2024-01-020
2024-01-030
2024-01-040
2024-01-05120.5
2024-01-060
2024-01-070
2024-01-080
2024-01-090
2024-01-1085

场景六:最近N天活跃用户

筛选在最近 90 天内有购买行为且购买次数不少于 2 次的用户,统计其购买频次和最近活跃日期。

SELECT user_id, COUNT(*) AS purchase_count, MAX(DATE_FORMAT(event_time, 'yyyy-MM-dd')) AS last_active_day FROM doc_ts_events WHERE event_type = 'purchase' AND event_time >= CAST('2024-03-31' AS TIMESTAMP) - INTERVAL 90 DAY GROUP BY user_id HAVING COUNT(*) >= 2 ORDER BY last_active_day DESC;

执行结果

user_idpurchase_countlast_active_day
10432024-03-25
10242024-03-18
10322024-03-07
10132024-02-22

清理测试数据

完成时间序列分析验证后,建议清理测试表:

-- 删除测试表 DROP TABLE IF EXISTS doc_ts_events;


注意事项

  1. 时区处理
    TIMESTAMP
    TIMESTAMP
    类型存储 UTC 时间,
    DATE_FORMAT
    DATE_FORMAT
    DATE_TRUNC
    DATE_TRUNC
    按 session 时区(默认 UTC+8)解析。插入数据时建议使用
    CAST('yyyy-MM-dd HH:mm:ss' AS TIMESTAMP)
    CAST('yyyy-MM-dd HH:mm:ss' AS TIMESTAMP)
    明确指定时间字面量,避免隐式转换失败。
  2. RANGE vs ROWS
    ROWS BETWEEN N PRECEDING
    ROWS BETWEEN N PRECEDING
    按行数定义窗口,
    RANGE BETWEEN INTERVAL N DAY PRECEDING
    RANGE BETWEEN INTERVAL N DAY PRECEDING
    按时间范围定义窗口。日期不连续时两者结果不同,时间序列分析通常使用
    RANGE
    RANGE
  3. DATE_FORMAT 格式字符串:Lakehouse 使用 Java 风格格式,
    yyyy
    yyyy
    表示四位年份,
    MM
    MM
    表示月份,
    dd
    dd
    表示日期,
    HH
    HH
    表示 24 小时制小时。格式字母大小写敏感。
  4. DATEDIFF 参数顺序
    DATEDIFF(end_date, start_date)
    DATEDIFF(end_date, start_date)
    返回
    end_date - start_date
    end_date - start_date
    的天数,结束日期在前。
  5. 窗口函数执行顺序:窗口函数在
    WHERE
    WHERE
    GROUP BY
    GROUP BY
    之后执行,不能直接在
    WHERE
    WHERE
    中使用窗口函数结果。如需过滤,请使用
    QUALIFY
    QUALIFY
    或子查询。
  6. SEQUENCE 生成大范围日期
    SEQUENCE
    SEQUENCE
    生成的是数组,
    EXPLODE
    EXPLODE
    将其展开为行。生成跨度较大的日期范围(如数年)时,注意数据量对查询性能的影响。

相关文档

联系我们
预约咨询
微信咨询
电话咨询