云器 Lakehouse Medallion 架构实践:纯 SQL Dynamic Table 方案

Medallion 架构(Bronze → Silver → Gold)是 Databricks 推广的数据湖组织模式。在 ClickZetta Lakehouse 上,除了用 ZettaPark Python API 实现,还有一种更简洁的方案:全部用 SQL 的 Dynamic Table(动态表)声明式构建——无需写一行 Python 代码,无需调度平台配置,三层自动按依赖链增量刷新。

本文用 NHL(北美冰球联赛)真实数据集(10 张表、~1400 万行)完整演示这一方案。

湖上加速全景:本文在哪一环

数据湖加速的典型链路为:对象存储文件 → Volume(挂载)→ Pipe(持续导入)→ 目标表 → Dynamic Table(增量聚合)。前面两步负责"数据自动入仓",本文聚焦最后一步——入仓之后的清洗、建模、聚合,用 Dynamic Table 声明式构建 Bronze → Silver → Gold 三层管道。

如果你还没配好数据入仓,先看 Volume + Pipe 端到端实践,跑通文件自动入库再回来看本文。如果数据已在 Lakehouse 表里(像本文的 NHL 数据集),直接从这里开始。


为什么用 Dynamic Table 构建 Medallion

传统的 Medallion 架构通常依赖调度平台(Airflow/Databricks Workflows)按顺序执行 Python Notebook 或 SQL 脚本。Dynamic Table 提供了一种不同的范式:

对比维度传统 ETL 调度Dynamic Table 方案
编码方式Python/ZettaPark 或 SQL 脚本纯 SQL(
CREATE DYNAMIC TABLE ... AS SELECT
CREATE DYNAMIC TABLE ... AS SELECT
调度配置需要配置 DAG 和 Cron声明式
REFRESH INTERVAL
REFRESH INTERVAL
,系统自动调度
增量计算需手动实现增量逻辑系统 CBO 自动识别增量,只计算变更部分
依赖管理人工编排上下游顺序DT 自动按引用关系确定刷新顺序
数据血缘需要额外工具追踪
SHOW DYNAMIC TABLE REFRESH HISTORY
SHOW DYNAMIC TABLE REFRESH HISTORY
内建
代码资产化Notebook/脚本分散管理集中在 Studio,可搜索、可对比、可复用

核心区别:你不需要关心"什么时候跑"和"跑什么"——你只需要声明"我想要什么结果"。系统负责计算编排、增量识别、并行调度。


数据集概览

NHL 冰球数据,来自

nhl_game_data
nhl_game_data
Schema(Bronze 层,已入库):

行数说明
game
game
26,305比赛主表(对阵、比分、场馆、赛季)
player_info
player_info
3,925球员档案(姓名、国籍、位置、身高体重)
team_info
team_info
33球队信息(名称、缩写)
game_skater_stats
game_skater_stats
945,830球员技术统计(进球、助攻、射门、抢断、+/-等)
game_goalie_stats
game_goalie_stats
56,656门将统计(扑救、失球、扑救率)
game_goals
game_goals
148,992进球明细
game_plays
game_plays
5,050,529比赛事件(play-by-play)
game_plays_players
game_plays_players
7,586,604事件球员参与明细
game_penalties
game_penalties
247,828犯规记录
game_teams_stats
game_teams_stats
52,610球队赛级统计

数据关系:

game
game
是核心事实表,通过
game_id
game_id
player_id
player_id
team_id
team_id
关联其他表。覆盖 2010-2020 共 10 个赛季。


架构设计

Bronze (nhl_game_data.*) Silver (silver.*) DT Gold (gold.*) DT ═══════════════════════ ══════════════════ ══════════════════ 原始数据,零转换 清洗 + 维度关联 业务指标 game ─────────┐ ┌─ dim_team (33) ┌─ scoring_leaders team_info ────┤ LEFT JOIN ──→├─ dim_player (3,925) ├─ player_career_stats player_info ──┘ ├─ fact_skater_stats ├─ team_season_summary skater_stats ── LEFT JOIN ──→ └─ fact_goalie_stats ├─ goalie_season_rankings goalie_stats ── LEFT JOIN ──→ └─ team_home_away_split

三层职责:

Schema表类型职责
Bronze
nhl_game_data
nhl_game_data
普通表原始数据,不做任何转换
Silver
silver
silver
Dynamic TableJOIN 维度表关联名称、清洗字段类型(STRING→INT)、标准化
Gold
gold
gold
Dynamic Table聚合指标:射手榜、球队战绩、门将排名、生涯统计

实施步骤

前置条件

  • VCluster 可用(使用
    DEFAULT
    DEFAULT
    ,GP 型,Serverless 按需唤醒)
  • Bronze 数据已入库(
    nhl_game_data.*
    nhl_game_data.*
    10 张表)
  • 具备 CREATE SCHEMA / CREATE DYNAMIC TABLE 权限

第一步:创建 Schema

用独立的 Schema 物理隔离各层:

CREATE SCHEMA IF NOT EXISTS silver COMMENT 'Medallion Silver 清洗层'; CREATE SCHEMA IF NOT EXISTS gold COMMENT 'Medallion Gold 聚合指标层';

第二步:Silver 层 — 维度表

最简单的 DT:直接从 Bronze 表筛选/转换列。这两张表数据量小(33 行和 3,925 行),每次 FULL 刷新也毫无压力。

-- 球队维度 CREATE OR REPLACE DYNAMIC TABLE silver.dim_team REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver 球队维度表' AS SELECT team_id, franchiseid, shortname, teamname, abbreviation, link FROM nhl_game_data.team_info; -- 球员维度(标准化 + 添加全名列) CREATE OR REPLACE DYNAMIC TABLE silver.dim_player REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver 球员维度表——标准化字段 + 全名' AS SELECT player_id, firstname, lastname, CONCAT(firstname, ' ', lastname) AS full_name, nationality, birthcity, primaryposition AS position, birthdate, height, height_cm, CAST(NULLIF(REGEXP_REPLACE(weight, ',', ''), '') AS INT) AS weight_kg, shootscatches FROM nhl_game_data.player_info;

第三步:Silver 层 — 事实表

事实表的核心工作:JOIN 维度表关联名称 + 类型标准化。以球员技术统计为例:

CREATE OR REPLACE DYNAMIC TABLE silver.fact_skater_stats REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver 球员技术统计事实表——关联球员名+球队名+赛季' AS SELECT s.game_id, s.player_id, p.full_name AS player_name, p.position, s.team_id, t.teamname AS team_name, t.abbreviation AS team_abbr, g.season, g.date_time_gmt AS game_date, s.timeonice, s.goals, s.assists, s.goals + s.assists AS points, -- 计算字段:得分 s.shots, CAST(NULLIF(REGEXP_REPLACE(s.hits, ',', ''), '') AS INT) AS hits, s.powerplaygoals, s.penaltyminutes, s.plusminus, s.eventimeonice, s.powerplaytimeonice FROM nhl_game_data.game_skater_stats s LEFT JOIN nhl_game_data.game g ON s.game_id = g.game_id LEFT JOIN silver.dim_player p ON s.player_id = p.player_id LEFT JOIN silver.dim_team t ON s.team_id = t.team_id;

门将统计事实表同理,额外计算了扑救率:

CREATE OR REPLACE DYNAMIC TABLE silver.fact_goalie_stats REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Silver 门将技术统计事实表——含扑救率计算' AS SELECT gs.game_id, gs.player_id, p.full_name AS player_name, t.teamname AS team_name, t.abbreviation AS team_abbr, g.season, g.date_time_gmt AS game_date, gs.timeonice, gs.shots AS shots_faced, gs.saves, CASE WHEN gs.shots > 0 THEN ROUND(gs.saves * 1.0 / gs.shots, 3) ELSE NULL END AS save_pct, -- 计算字段:扑救率 gs.decision FROM nhl_game_data.game_goalie_stats gs LEFT JOIN nhl_game_data.game g ON gs.game_id = g.game_id LEFT JOIN silver.dim_player p ON gs.player_id = p.player_id LEFT JOIN silver.dim_team t ON gs.team_id = t.team_id;

第四步:首次刷新 Silver 层

DT 创建后只是定义了计算逻辑,还没有数据。需要手动触发首次刷新:

REFRESH DYNAMIC TABLE silver.dim_team; REFRESH DYNAMIC TABLE silver.dim_player; REFRESH DYNAMIC TABLE silver.fact_skater_stats; REFRESH DYNAMIC TABLE silver.fact_goalie_stats;

第五步:Gold 层 — 聚合指标

Gold 层从 Silver 层读取数据,用聚合函数生成业务指标。所有表使用

1 DAY
1 DAY
刷新间隔(T+1 场景)。

射手榜:每赛季 TOP 20 得分手

使用

RANK()
RANK()
窗口函数按赛季排名:

CREATE OR REPLACE DYNAMIC TABLE gold.scoring_leaders REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold 每赛季射手榜 TOP 20——按得分(进球+助攻)排名' AS SELECT season, rank, player_id, player_name, position, team_abbr, games_played, goals, assists, points, ROUND(points * 1.0 / games_played, 2) AS pts_per_game FROM ( SELECT season, player_id, player_name, position, team_abbr, COUNT(*) AS games_played, SUM(goals) AS goals, SUM(assists) AS assists, SUM(points) AS points, RANK() OVER (PARTITION BY season ORDER BY SUM(points) DESC) AS rank FROM silver.fact_skater_stats GROUP BY season, player_id, player_name, position, team_abbr ) t WHERE rank <= 20;

验证结果(2019-20 赛季):

rankplayerteamgoalsassistspoints
1Nikita KucherovTBL160316476
2Nathan MacKinnonCOL176296472
3Leon DraisaitlEDM181274455
4David PastrnakBOS204216420
5Connor McDavidEDM153262415

球队赛季战绩

Bronze 数据只有每场比赛的主客队视角。需要将每场比赛拆成两行(主队一行、客队一行),再按球队和赛季聚合。这里用

UNION ALL
UNION ALL
+
CASE WHEN
CASE WHEN
实现:

CREATE OR REPLACE DYNAMIC TABLE gold.team_season_summary REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold 球队赛季战绩——胜/负/进球/失球/积分' AS SELECT g.season, g.team_id, t.teamname AS team_name, t.abbreviation AS team_abbr, COUNT(*) AS games_played, SUM(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 1 WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 1 ELSE 0 END) AS wins, SUM(CASE WHEN g.side = 'home' AND g.outcome LIKE 'away win%' THEN 1 WHEN g.side = 'away' AND g.outcome LIKE 'home win%' THEN 1 ELSE 0 END) AS losses, SUM(CASE WHEN g.side = 'home' THEN g.home_goals ELSE g.away_goals END) AS goals_for, SUM(CASE WHEN g.side = 'home' THEN g.away_goals ELSE g.home_goals END) AS goals_against, SUM(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 2 WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 2 ELSE 0 END) AS points FROM ( SELECT season, home_team_id AS team_id, outcome, home_goals, away_goals, 'home' AS side FROM nhl_game_data.game UNION ALL SELECT season, away_team_id AS team_id, outcome, home_goals, away_goals, 'away' AS side FROM nhl_game_data.game ) g LEFT JOIN silver.dim_team t ON g.team_id = t.team_id GROUP BY g.season, g.team_id, t.teamname, t.abbreviation;

验证结果(2019-20 赛季 TOP 5):

teamgameswinslossespoints
Lightning (TBL)19012268244
Stars (DAL)19210488208
Golden Knights (VGK)18210280204
Avalanche (COL)17010268204
Flyers (PHI)17010268204

门将赛季排名 + 球员生涯统计 + 主客场对比

完整 DDL 见本文附录。核心模式相同:从 Silver 层聚合 →

RANK() OVER (PARTITION BY season ...)
RANK() OVER (PARTITION BY season ...)
→ 取 TOP N。

第六步:验证全链路

-- 各层行数对比 SELECT 'Bronze game' AS layer, COUNT(*) FROM nhl_game_data.game UNION ALL SELECT 'Silver dim_team', COUNT(*) FROM silver.dim_team UNION ALL SELECT 'Silver fact_skater', COUNT(*) FROM silver.fact_skater_stats UNION ALL SELECT 'Gold scoring_leaders', COUNT(*) FROM gold.scoring_leaders UNION ALL SELECT 'Gold team_season', COUNT(*) FROM gold.team_season_summary; -- 查看 DT 刷新历史 SHOW DYNAMIC TABLE REFRESH HISTORY WHERE name = 'scoring_leaders';

完整验证结果:

行数刷新模式状态
Silverdim_team33FULL✅ 与 Bronze 一致
Silverdim_player3,925FULL✅ 与 Bronze 一致
Silverfact_skater_stats1,130,682FULL✅ 含 player_name/team_name/points
Silverfact_goalie_stats67,642FULL✅ 含 save_pct 计算字段
Goldscoring_leaders399FULL✅ 每赛季 TOP 20
Goldplayer_career_stats3,353FULL✅ 生涯汇总
Goldteam_season_summary580FULL✅ 33 队 × 18 赛季
Goldgoalie_season_rankings294FULL✅ 每赛季 TOP 15
Goldteam_home_away_split580FULL✅ 主客场对比

设计原则

1. 层间引用规则

引用方向允许示例
Silver → Bronze
FROM nhl_game_data.game
FROM nhl_game_data.game
Gold → Silver
FROM silver.fact_skater_stats
FROM silver.fact_skater_stats
Gold → Bronze⚠️ 不推荐应通过 Silver 层间接访问
Gold → Gold⚠️ 谨慎仅在需要多层聚合时使用
Bronze → Silver❌ 禁止下层不应依赖上层

2. LEFT JOIN 过滤条件必须放 ON 子句

-- ❌ 错误:WHERE 过滤导致 LEFT JOIN 退化为 INNER JOIN SELECT * FROM skater_stats s LEFT JOIN team_info t ON s.team_id = t.team_id WHERE t.abbreviation = 'TBL'; -- ✅ 正确:过滤条件放 ON 子句 SELECT * FROM skater_stats s LEFT JOIN team_info t ON s.team_id = t.team_id AND t.abbreviation = 'TBL';

3. 首次刷新基准时间

REFRESH INTERVAL 1 DAY
REFRESH INTERVAL 1 DAY
以创建时间为基准计算下次触发,不对齐整点。建议创建后立即执行
REFRESH DYNAMIC TABLE
REFRESH DYNAMIC TABLE
手动触发首次刷新,同时重置基准时间:

CREATE DYNAMIC TABLE gold.scoring_leaders ...; REFRESH DYNAMIC TABLE gold.scoring_leaders;

4. 字符串清洗

原始数据来自外部系统时,数值字段可能含非标准字符:

CAST(NULLIF(REGEXP_REPLACE(hits, ',', ''), '') AS INT)

三步清洗:去逗号 → NULLIF 空字符串 → CAST 到目标类型。

NULLIF
NULLIF
避免空字符串导致的 CAST 失败。


成本分析

DT 数量刷新频率预估 CRU
Silver41 DAY低(全量刷新,但数据量小)
Gold51 DAY中(涉及聚合,~1400 万行扫描)

全部使用 GP 型 VCluster(

DEFAULT
DEFAULT
),Serverless 按需计费。T+1 场景下每天只刷新一次,比传统小时级 ETL 成本更低。


与 ZettaPark 方案的对比

ZettaPark 方案纯 SQL DT 方案(本文)
适合人群Python 开发者、Data ScientistSQL 开发者、数据分析师
编码量Python 脚本 + Spark API纯 SQL(DDL)
调度需要外部调度(Studio/Notebook)DT 自动刷新,无需调度
增量计算需手动管理 CDC系统自动
灵活性高(Python 可调用任意库)中(SQL 表达能力范围)
学习成本Pandas/PySpark/ZettaPark纯 SQL
适用场景复杂转换、ML 特征工程、外部 API 调用标准 ETL、聚合、JOIN、窗口计算

两种方案共存不冲突:用 ZettaPark 做复杂清洗,用 DT 做聚合指标,在同一套 Medallion 架构中各取所长。


注意事项

注意点说明
Bronze 数据变动时 DT 自动联动全链路 9 张 DT 按依赖顺序刷新,无需手动触发
DT 不支持 ALTER 修改 SQL需用
CREATE OR REPLACE
CREATE OR REPLACE
重建
VCluster 必须用 GP 型AP 型不支持小文件合并,长期查询变慢
Silver 事实表引用 Silver 维度表系统自动保证维度表先刷新
字符串数值字段需清洗去逗号 → NULLIF → CAST,三步走
UNION ALL 行展开时注意业务逻辑本文中主客队拆分后,胜负判定必须交叉匹配 side 和 outcome
首次创建后需手动 REFRESH
REFRESH INTERVAL
REFRESH INTERVAL
不会立即触发首次计算

附录:完整 Gold 层 DDL

球员生涯统计

CREATE OR REPLACE DYNAMIC TABLE gold.player_career_stats REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold 球员生涯总览——全赛季汇总 + 场均效率' AS SELECT player_id, player_name, position, COUNT(*) AS games_played, SUM(goals) AS total_goals, SUM(assists) AS total_assists, SUM(points) AS total_points, ROUND(SUM(points) * 1.0 / COUNT(*), 2) AS pts_per_game, ROUND(SUM(goals) * 1.0 / NULLIF(SUM(shots), 0), 3) AS shooting_pct, AVG(timeonice) AS avg_timeonice_sec, SUM(penaltyminutes) AS total_pim, AVG(plusminus) AS avg_plusminus FROM silver.fact_skater_stats GROUP BY player_id, player_name, position;

门将赛季排名

CREATE OR REPLACE DYNAMIC TABLE gold.goalie_season_rankings REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold 门将赛季排名 TOP 15——按胜场数排名' AS SELECT season, rank, player_id, player_name, team_abbr, games_played, wins, saves, shots_faced, ROUND(save_pct, 3) AS save_pct FROM ( SELECT season, player_id, player_name, team_abbr, COUNT(*) AS games_played, SUM(CASE WHEN decision = 'W' THEN 1 ELSE 0 END) AS wins, SUM(saves) AS saves, SUM(shots_faced) AS shots_faced, CASE WHEN SUM(shots_faced) > 0 THEN SUM(saves) * 1.0 / SUM(shots_faced) ELSE NULL END AS save_pct, RANK() OVER (PARTITION BY season ORDER BY SUM(CASE WHEN decision = 'W' THEN 1 ELSE 0 END) DESC) AS rank FROM silver.fact_goalie_stats GROUP BY season, player_id, player_name, team_abbr ) t WHERE rank <= 15;

主客场对比

CREATE OR REPLACE DYNAMIC TABLE gold.team_home_away_split REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Gold 球队主客场表现对比——主场 vs 客场胜率' AS SELECT g.season, g.team_id, t.teamname AS team_name, t.abbreviation AS team_abbr, COUNT(CASE WHEN g.side = 'home' THEN 1 END) AS home_games, COUNT(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 1 END) AS home_wins, COUNT(CASE WHEN g.side = 'away' THEN 1 END) AS away_games, COUNT(CASE WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 1 END) AS away_wins, ROUND( COUNT(CASE WHEN g.side = 'home' AND g.outcome LIKE 'home win%' THEN 1 END) * 1.0 / NULLIF(COUNT(CASE WHEN g.side = 'home' THEN 1 END), 0), 3 ) AS home_win_pct, ROUND( COUNT(CASE WHEN g.side = 'away' AND g.outcome LIKE 'away win%' THEN 1 END) * 1.0 / NULLIF(COUNT(CASE WHEN g.side = 'away' THEN 1 END), 0), 3 ) AS away_win_pct FROM ( SELECT season, home_team_id AS team_id, outcome, 'home' AS side FROM nhl_game_data.game UNION ALL SELECT season, away_team_id AS team_id, outcome, 'away' AS side FROM nhl_game_data.game ) g LEFT JOIN silver.dim_team t ON g.team_id = t.team_id GROUP BY g.season, g.team_id, t.teamname, t.abbreviation;


相关文档

湖上加速完整链路:Volume 挂载 → Pipe 入仓 → Dynamic Table 建模。以下文档覆盖各环节:

联系我们
预约咨询
微信咨询
电话咨询
邮件咨询