Lakehouse 漏斗分析与用户行为分析指南

概述

漏斗分析与用户行为分析是电商、SaaS 和移动应用场景中最常见的数据分析需求,用于衡量用户从注册到最终转化的各步骤流失情况,以及用户的活跃频率、留存率和行为路径。云器 Lakehouse 提供完整的窗口函数、聚合函数和时间函数支持,可高效完成上述分析。本指南按业务场景分类,帮助你快速掌握漏斗分析的核心 SQL 写法。

涉及的 SQL 命令

命令/函数用途适用场景
COUNT(DISTINCT ...)
COUNT(DISTINCT ...)
统计去重用户数漏斗各步骤用户计数
MAX(CASE WHEN ...)
MAX(CASE WHEN ...)
判断用户是否完成某步骤漏斗标记、留存标记
MIN()
MIN()
/
MAX()
MAX()
取最早/最晚时间首次行为、末次活跃
LAG()
LAG()
访问前一行数据有序漏斗、会话切割
UNIX_TIMESTAMP()
UNIX_TIMESTAMP()
时间戳转秒数计算事件间隔秒数
DATEDIFF()
DATEDIFF()
计算日期差留存天数计算
SUM() OVER (... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
SUM() OVER (... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
累计求和会话编号生成
GROUP_CONCAT(... ORDER BY ... SEPARATOR ...)
GROUP_CONCAT(... ORDER BY ... SEPARATOR ...)
有序字符串拼接行为路径字符串
COLLECT_LIST(... ORDER BY ...)
COLLECT_LIST(... ORDER BY ...)
有序数组收集行为路径数组

前置准备

以下示例使用模拟的用户行为事件表

doc_funnel_events
doc_funnel_events
,覆盖注册、浏览、加购、支付四个步骤:

-- 创建测试表 CREATE TABLE IF NOT EXISTS doc_funnel_events ( event_id INT, user_id INT, step_name STRING, event_time TIMESTAMP ); -- 插入测试数据(7 个用户,部分未完成全流程) INSERT INTO doc_funnel_events VALUES (1, 101, 'register', CAST('2024-01-01 08:00:00' AS TIMESTAMP)), (2, 101, 'browse', CAST('2024-01-01 08:05:00' AS TIMESTAMP)), (3, 101, 'add_cart', CAST('2024-01-01 08:10:00' AS TIMESTAMP)), (4, 101, 'pay', CAST('2024-01-01 08:20:00' AS TIMESTAMP)), (5, 102, 'register', CAST('2024-01-02 09:00:00' AS TIMESTAMP)), (6, 102, 'browse', CAST('2024-01-02 09:10:00' AS TIMESTAMP)), (7, 102, 'add_cart', CAST('2024-01-02 09:20:00' AS TIMESTAMP)), (8, 103, 'register', CAST('2024-01-03 10:00:00' AS TIMESTAMP)), (9, 103, 'browse', CAST('2024-01-03 10:15:00' AS TIMESTAMP)), (10, 104, 'register', CAST('2024-01-04 11:00:00' AS TIMESTAMP)), (11, 104, 'browse', CAST('2024-01-04 11:30:00' AS TIMESTAMP)), (12, 104, 'add_cart', CAST('2024-01-04 12:00:00' AS TIMESTAMP)), (13, 104, 'pay', CAST('2024-01-04 12:30:00' AS TIMESTAMP)), (14, 105, 'register', CAST('2024-01-05 14:00:00' AS TIMESTAMP)), (15, 106, 'register', CAST('2024-01-06 15:00:00' AS TIMESTAMP)), (16, 106, 'browse', CAST('2024-01-06 15:10:00' AS TIMESTAMP)), (17, 107, 'register', CAST('2024-01-07 16:00:00' AS TIMESTAMP)), (18, 107, 'browse', CAST('2024-01-07 16:20:00' AS TIMESTAMP)), (19, 107, 'add_cart', CAST('2024-01-07 17:00:00' AS TIMESTAMP)), (20, 107, 'pay', CAST('2024-01-07 17:30:00' AS TIMESTAMP)), -- 后续日期的行为(用于留存和会话分析) (21, 101, 'browse', CAST('2024-01-02 10:00:00' AS TIMESTAMP)), (22, 101, 'pay', CAST('2024-01-08 11:00:00' AS TIMESTAMP)), (23, 102, 'browse', CAST('2024-01-03 09:00:00' AS TIMESTAMP)), (24, 104, 'browse', CAST('2024-01-05 14:00:00' AS TIMESTAMP)), (25, 104, 'pay', CAST('2024-01-11 15:00:00' AS TIMESTAMP)), (26, 107, 'browse', CAST('2024-01-08 10:00:00' AS TIMESTAMP));


场景一:基础漏斗转化率

统计注册、浏览、加购、支付各步骤的用户数,并计算相邻步骤之间的转化率。

WITH funnel AS ( SELECT user_id, MAX(CASE WHEN step_name = 'register' THEN 1 ELSE 0 END) AS has_register, MAX(CASE WHEN step_name = 'browse' THEN 1 ELSE 0 END) AS has_browse, MAX(CASE WHEN step_name = 'add_cart' THEN 1 ELSE 0 END) AS has_add_cart, MAX(CASE WHEN step_name = 'pay' THEN 1 ELSE 0 END) AS has_pay FROM doc_funnel_events GROUP BY user_id ) SELECT COUNT(*) AS total_users, SUM(has_register) AS step1_register, SUM(has_browse) AS step2_browse, SUM(has_add_cart) AS step3_add_cart, SUM(has_pay) AS step4_pay, ROUND(SUM(has_browse) * 100.0 / SUM(has_register), 1) AS browse_rate, ROUND(SUM(has_add_cart) * 100.0 / SUM(has_browse), 1) AS add_cart_rate, ROUND(SUM(has_pay) * 100.0 / SUM(has_add_cart), 1) AS pay_rate FROM funnel;

执行结果

total_usersstep1_registerstep2_browsestep3_add_cartstep4_paybrowse_rateadd_cart_ratepay_rate
7764385.766.775.0

场景二:有序漏斗

基础漏斗只判断用户是否触达某步骤,不关心顺序。有序漏斗要求步骤必须按时间先后发生,过滤掉先加购后才注册等异常行为。

WITH step_times AS ( -- 取每个用户每个步骤的最早发生时间 SELECT user_id, MIN(CASE WHEN step_name = 'register' THEN event_time END) AS t_register, MIN(CASE WHEN step_name = 'browse' THEN event_time END) AS t_browse, MIN(CASE WHEN step_name = 'add_cart' THEN event_time END) AS t_add_cart, MIN(CASE WHEN step_name = 'pay' THEN event_time END) AS t_pay FROM doc_funnel_events GROUP BY user_id ), ordered_funnel AS ( SELECT user_id, CASE WHEN t_register IS NOT NULL THEN 1 ELSE 0 END AS s1, CASE WHEN t_browse > t_register THEN 1 ELSE 0 END AS s2, CASE WHEN t_add_cart > t_browse THEN 1 ELSE 0 END AS s3, CASE WHEN t_pay > t_add_cart THEN 1 ELSE 0 END AS s4 FROM step_times ) SELECT COUNT(*) AS total, SUM(s1) AS step1_register, SUM(s2) AS step2_browse, SUM(s3) AS step3_add_cart, SUM(s4) AS step4_pay FROM ordered_funnel;

执行结果

totalstep1_registerstep2_browsestep3_add_cartstep4_pay
77643

如需查看每个用户的有序漏斗明细:

WITH step_times AS ( SELECT user_id, MIN(CASE WHEN step_name = 'register' THEN event_time END) AS t_register, MIN(CASE WHEN step_name = 'browse' THEN event_time END) AS t_browse, MIN(CASE WHEN step_name = 'add_cart' THEN event_time END) AS t_add_cart, MIN(CASE WHEN step_name = 'pay' THEN event_time END) AS t_pay FROM doc_funnel_events GROUP BY user_id ) SELECT user_id, t_register IS NOT NULL AS did_register, t_browse > t_register AS did_browse_after, t_add_cart > t_browse AS did_add_cart_after, t_pay > t_add_cart AS did_pay_after FROM step_times ORDER BY user_id;

执行结果

user_iddid_registerdid_browse_afterdid_add_cart_afterdid_pay_after
101truetruetruetrue
102truetruetrueNULL
103truetrueNULLNULL
104truetruetruetrue
105trueNULLNULLNULL
106truetrueNULLNULL
107truetruetruetrue

场景三:用户首次/末次行为

使用

MIN
MIN
MAX
MAX
找出每个用户的首次注册时间、首次购买时间和末次活跃时间。

SELECT user_id, MIN(event_time) AS first_event_time, MAX(event_time) AS last_event_time, MIN(CASE WHEN step_name = 'pay' THEN event_time END) AS first_pay_time, MAX(CASE WHEN step_name = 'pay' THEN event_time END) AS last_pay_time FROM doc_funnel_events GROUP BY user_id ORDER BY user_id;

执行结果

user_idfirst_event_timelast_event_timefirst_pay_timelast_pay_time
1012024-01-01T00:00:00.000Z2024-01-08T03:00:00.000Z2024-01-01T00:20:00.000Z2024-01-08T03:00:00.000Z
1022024-01-02T01:00:00.000Z2024-01-03T01:00:00.000ZNULLNULL
1032024-01-03T02:00:00.000Z2024-01-03T02:15:00.000ZNULLNULL
1042024-01-04T03:00:00.000Z2024-01-11T07:00:00.000Z2024-01-04T04:30:00.000Z2024-01-11T07:00:00.000Z
1052024-01-05T06:00:00.000Z2024-01-05T06:00:00.000ZNULLNULL
1062024-01-06T07:00:00.000Z2024-01-06T07:10:00.000ZNULLNULL
1072024-01-07T08:00:00.000Z2024-01-08T02:00:00.000Z2024-01-07T09:30:00.000Z2024-01-07T09:30:00.000Z

场景四:用户留存分析

统计每个用户注册后第 1 天、第 7 天是否有行为,并汇总留存率。

WITH first_register AS ( -- 每个用户的注册日期 SELECT user_id, MIN(event_time) AS register_time FROM doc_funnel_events WHERE step_name = 'register' GROUP BY user_id ), activity AS ( -- 每个用户每天的活跃记录(去重) SELECT DISTINCT user_id, CAST(DATE_FORMAT(event_time, 'yyyy-MM-dd') AS DATE) AS active_day FROM doc_funnel_events ), retention AS ( SELECT r.user_id, DATE_FORMAT(r.register_time, 'yyyy-MM-dd') AS register_day, MAX(CASE WHEN DATEDIFF(a.active_day, CAST(DATE_FORMAT(r.register_time, 'yyyy-MM-dd') AS DATE)) = 1 THEN 1 ELSE 0 END) AS day1_retained, MAX(CASE WHEN DATEDIFF(a.active_day, CAST(DATE_FORMAT(r.register_time, 'yyyy-MM-dd') AS DATE)) = 7 THEN 1 ELSE 0 END) AS day7_retained FROM first_register r LEFT JOIN activity a ON r.user_id = a.user_id GROUP BY r.user_id, r.register_time ) SELECT COUNT(*) AS total_users, SUM(day1_retained) AS day1_count, SUM(day7_retained) AS day7_count, ROUND(SUM(day1_retained) * 100.0 / COUNT(*), 1) AS day1_rate, ROUND(SUM(day7_retained) * 100.0 / COUNT(*), 1) AS day7_rate FROM retention;

执行结果

total_usersday1_countday7_countday1_rateday7_rate
74257.128.6

如需查看每个用户的留存明细:

WITH first_register AS ( SELECT user_id, MIN(event_time) AS register_time FROM doc_funnel_events WHERE step_name = 'register' GROUP BY user_id ), activity AS ( SELECT DISTINCT user_id, CAST(DATE_FORMAT(event_time, 'yyyy-MM-dd') AS DATE) AS active_day FROM doc_funnel_events ) SELECT r.user_id, DATE_FORMAT(r.register_time, 'yyyy-MM-dd') AS register_day, MAX(CASE WHEN DATEDIFF(a.active_day, CAST(DATE_FORMAT(r.register_time, 'yyyy-MM-dd') AS DATE)) = 1 THEN 1 ELSE 0 END) AS day1_retained, MAX(CASE WHEN DATEDIFF(a.active_day, CAST(DATE_FORMAT(r.register_time, 'yyyy-MM-dd') AS DATE)) = 7 THEN 1 ELSE 0 END) AS day7_retained FROM first_register r LEFT JOIN activity a ON r.user_id = a.user_id GROUP BY r.user_id, r.register_time ORDER BY r.user_id;

执行结果

user_idregister_dayday1_retainedday7_retained
1012024-01-0111
1022024-01-0210
1032024-01-0300
1042024-01-0411
1052024-01-0500
1062024-01-0600
1072024-01-0710

场景五:会话切割

将同一用户的事件流按时间间隔切割为会话:两次事件间隔超过 30 分钟(1800 秒)视为新会话开始。

WITH with_prev AS ( SELECT user_id, step_name, event_time, LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time FROM doc_funnel_events ), with_session AS ( SELECT user_id, step_name, event_time, -- 首次事件或间隔超过 30 分钟时标记为新会话 SUM(CASE WHEN prev_time IS NULL OR (UNIX_TIMESTAMP(event_time) - UNIX_TIMESTAMP(prev_time)) > 1800 THEN 1 ELSE 0 END) OVER ( PARTITION BY user_id ORDER BY event_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS session_id FROM with_prev ) SELECT user_id, session_id, COUNT(*) AS event_count, MIN(DATE_FORMAT(event_time, 'yyyy-MM-dd HH:mm:ss')) AS session_start, MAX(DATE_FORMAT(event_time, 'yyyy-MM-dd HH:mm:ss')) AS session_end FROM with_session GROUP BY user_id, session_id ORDER BY user_id, session_id;

执行结果

user_idsession_idevent_countsession_startsession_end
101142024-01-01 08:00:002024-01-01 08:20:00
101212024-01-02 10:00:002024-01-02 10:00:00
101312024-01-08 11:00:002024-01-08 11:00:00
102132024-01-02 09:00:002024-01-02 09:20:00
102212024-01-03 09:00:002024-01-03 09:00:00
103122024-01-03 10:00:002024-01-03 10:15:00
104142024-01-04 11:00:002024-01-04 12:30:00
104212024-01-05 14:00:002024-01-05 14:00:00
104312024-01-11 15:00:002024-01-11 15:00:00
105112024-01-05 14:00:002024-01-05 14:00:00
106122024-01-06 15:00:002024-01-06 15:10:00
107122024-01-07 16:00:002024-01-07 16:20:00
107222024-01-07 17:00:002024-01-07 17:30:00
107312024-01-08 10:00:002024-01-08 10:00:00

场景六:行为路径拼接

将每个用户的行为步骤按时间顺序拼接为字符串或数组,用于路径分析和模式挖掘。

方式一:使用

GROUP_CONCAT
GROUP_CONCAT
生成字符串路径

SELECT user_id, GROUP_CONCAT(step_name ORDER BY event_time SEPARATOR ' -> ') AS behavior_path FROM doc_funnel_events GROUP BY user_id ORDER BY user_id;

执行结果

user_idbehavior_path
101register -> browse -> add_cart -> pay -> browse -> pay
102register -> browse -> add_cart -> browse
103register -> browse
104register -> browse -> add_cart -> pay -> browse -> pay
105register
106register -> browse
107register -> browse -> add_cart -> pay -> browse

方式二:使用

COLLECT_LIST
COLLECT_LIST
生成数组路径

SELECT user_id, COLLECT_LIST(step_name ORDER BY event_time) AS step_list, SIZE(COLLECT_LIST(step_name ORDER BY event_time)) AS step_count FROM doc_funnel_events GROUP BY user_id ORDER BY user_id;

执行结果

user_idstep_liststep_count
101["register","browse","add_cart","pay","browse","pay"]6
102["register","browse","add_cart","browse"]4
103["register","browse"]2
104["register","browse","add_cart","pay","browse","pay"]6
105["register"]1
106["register","browse"]2
107["register","browse","add_cart","pay","browse"]5

清理测试数据

完成漏斗分析验证后,建议清理测试表:

-- 删除测试表 DROP TABLE IF EXISTS doc_funnel_events;


注意事项

  1. 漏斗步骤去重
    MAX(CASE WHEN step_name = 'pay' THEN 1 ELSE 0 END)
    MAX(CASE WHEN step_name = 'pay' THEN 1 ELSE 0 END)
    对每个用户只标记是否触达,不受重复事件影响。若需统计触达次数,改用
    COUNT(CASE WHEN step_name = 'pay' THEN 1 END)
    COUNT(CASE WHEN step_name = 'pay' THEN 1 END)
  2. 有序漏斗的 NULL 传播:当某步骤时间戳为
    NULL
    NULL
    时,
    t_add_cart > t_browse
    t_add_cart > t_browse
    返回
    NULL
    NULL
    而非
    false
    false
    。在
    SUM
    SUM
    NULL
    NULL
    被忽略,等同于 0,不影响汇总结果。
  3. UNIX_TIMESTAMP 精度
    UNIX_TIMESTAMP
    UNIX_TIMESTAMP
    返回秒级整数,适合计算分钟级间隔。若需毫秒级精度,可使用
    UNIX_MILLIS(event_time)
    UNIX_MILLIS(event_time)
    并将阈值改为
    1800000
    1800000
  4. DATEDIFF 参数顺序
    DATEDIFF(end_date, start_date)
    DATEDIFF(end_date, start_date)
    返回
    end_date - start_date
    end_date - start_date
    的天数,结束日期在前。
  5. GROUP_CONCAT 默认长度限制
    GROUP_CONCAT
    GROUP_CONCAT
    生成的字符串有长度上限,路径步骤极多时可能被截断。如需完整路径,优先使用
    COLLECT_LIST
    COLLECT_LIST
  6. 窗口函数执行顺序:窗口函数在
    WHERE
    WHERE
    GROUP BY
    GROUP BY
    之后执行,不能直接在
    WHERE
    WHERE
    中使用窗口函数结果。如需过滤,请使用
    QUALIFY
    QUALIFY
    或子查询。

相关文档

联系我们
预约咨询
微信咨询
电话咨询