在线教育学习行为数仓实践

将学生学习行为日志(课程点击、测验答题、作业提交)与课程元数据整合为多层数仓,输出学习效果评分和高风险学生预警。本文以 Open University Learning Analytics(OULAD)公开数据集为基础,端到端演示 Kafka PIPE → ODS → DWD → DWS → ADS 的完整构建过程,并覆盖 Inverted Index、BITMAP 函数、SQL UDF 三项关键平台能力的落地用法。


概述

在线教育平台的学习行为数据链路是:实时行为事件(Kafka)→ 原始存储(ODS)→ 清洗关联(DWD)→ 学生课程进度聚合(DWS)→ 学习评分与预警输出(ADS)

云器 Lakehouse 通过以下组合解决几个核心问题:

问题解决方案
学生点击、答题事件毫秒级高频写入Kafka PIPE 持续摄取,无需手写消费者代码
ODS → DWD → DWS → ADS 自动增量计算Dynamic Table,声明式 SQL,系统自动调度依赖链
按课程/班级统计出勤率和学习完成度
GROUP_BITMAP
GROUP_BITMAP
函数,快速计算活跃学生基数
按活动类型(quiz、oucontent)快速筛选Inverted Index,全文检索活动类型标签
多维学习评分逻辑可复用SQL UDF
calc_learning_score
calc_learning_score
,封装加权评分公式
学生知识点掌握程度推断External Function,调用知识图谱 API 推断掌握状态

涉及的 SQL 命令

命令 / 函数用途说明
CREATE TABLE
CREATE TABLE
建 ODS 层原始表普通表,作为 Dynamic Table 上游
CREATE INVERTED INDEX
CREATE INVERTED INDEX
activity_type
activity_type
列创建倒排索引
支持活动类型全文检索
CREATE PIPE
CREATE PIPE
创建 Kafka 持续摄取管道绑定到 ODS 层目标表
CREATE FUNCTION
CREATE FUNCTION
创建 SQL UDF
calc_learning_score
calc_learning_score
封装学习评分加权公式
CREATE DYNAMIC TABLE
CREATE DYNAMIC TABLE
创建 DWD / DWS / ADS 层增量计算表系统自动识别上游变更并增量刷新
REFRESH DYNAMIC TABLE
REFRESH DYNAMIC TABLE
手动触发一次刷新初次构建或调试时使用
GROUP_BITMAP
GROUP_BITMAP
统计活跃学生基数(去重计数)返回 BIGINT 基数值
GROUP_BITMAP_STATE
GROUP_BITMAP_STATE
构建 bitmap 对象用于跨课程交叉分析返回 bitmap 类型,可做 AND/OR 运算

前置准备

本文所有示例在

best_practice_education_dw
best_practice_education_dw
Schema 下运行。

CREATE SCHEMA IF NOT EXISTS best_practice_education_dw;

执行结果:

{}


ODS 层:原始数据表

ODS 层接收三类数据:Kafka 实时行为事件、PostgreSQL CDC 同步的学籍和成绩数据、批量导入的课程元数据。本文使用 OULAD 公开数据集(UK Open University 真实数据)构建测试数据。

建表

-- 学生 VLE 行为事件表(Kafka PIPE 写入目标) CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_vle ( code_module STRING, code_presentation STRING, id_student BIGINT, id_site BIGINT, event_date INT, sum_click INT, ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP() ); -- 学生基本信息表(CDC 同步目标) CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_info ( code_module STRING, code_presentation STRING, id_student BIGINT, gender STRING, region STRING, highest_education STRING, imd_band STRING, age_band STRING, num_of_prev_attempts INT, studied_credits INT, disability STRING, final_result STRING ); -- 课程元数据表 CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_courses ( code_module STRING, code_presentation STRING, module_presentation_length INT ); -- VLE 资源类型表 CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_vle ( id_site BIGINT, code_module STRING, code_presentation STRING, activity_type STRING, week_from INT, week_to INT ); -- 作业提交与成绩表 CREATE TABLE IF NOT EXISTS best_practice_education_dw.doc_ods_student_assessment ( id_assessment BIGINT, id_student BIGINT, date_submitted INT, is_banked INT, score DOUBLE );

ingest_time
ingest_time
使用
DEFAULT CURRENT_TIMESTAMP()
DEFAULT CURRENT_TIMESTAMP()
,Kafka PIPE 写入时自动填充,消息体中无需携带。

创建 Inverted Index

doc_ods_vle
doc_ods_vle
表的
activity_type
activity_type
列存储活动类型标签(
quiz
quiz
oucontent
oucontent
resource
resource
forumng
forumng
等),分析师频繁按类型筛选资源。

CREATE INVERTED INDEX idx_inv_activity_type ON TABLE doc_ods_vle (activity_type);

索引创建后只对新写入的数据生效。若表中已有存量数据,需执行

BUILD INDEX
BUILD INDEX
将存量覆盖:

BUILD INDEX idx_inv_activity_type ON doc_ods_vle;

配置 Kafka PIPE

Kafka PIPE 在 DDL 阶段就会尝试连接 Kafka broker 验证订阅关系。正式环境替换 broker 地址和 topic 名称后即可创建。

方式一:通过 Kafka 实际写入(推荐)

以下 Python 示例展示如何向 Kafka topic 推送学生行为事件,触发 PIPE 摄取:

from kafka import KafkaProducer import json, time producer = KafkaProducer( bootstrap_servers=['<kafka-broker>:9092'], value_serializer=lambda v: json.dumps(v).encode('utf-8') ) # 构造学生 VLE 行为事件消息 event = { "code_module": "AAA", "code_presentation": "2013J", "id_student": 11391, "id_site": 546714, "event_date": 15, "sum_click": 5 } producer.send('edu_student_vle_events', value=event) producer.flush() print("Event sent")

对应的 Kafka PIPE DDL:

-- 先建 raw 字符串接收表,PIPE 写入 JSON 字符串 CREATE TABLE IF NOT EXISTS best_practice_education_dw.kafka_raw_vle (value STRING); -- 创建 Kafka PIPE CREATE PIPE IF NOT EXISTS best_practice_education_dw.pipe_student_vle VIRTUAL_CLUSTER = 'DEFAULT' BATCH_INTERVAL_IN_SECONDS = '60' AS COPY INTO best_practice_education_dw.kafka_raw_vle FROM ( SELECT CAST(value AS STRING) AS value FROM READ_KAFKA( '<kafka-broker>:9092', 'edu_student_vle_events', '', 'cz_edu_consumer', '','','','', 'raw', 'raw', 0, map() ) );

方式二:INSERT 模拟(无 Kafka 环境时)

若暂未配置 Kafka,可先将数据保存为本地 CSV 文件,通过 cz-cli 上传到 User Volume 后用 COPY INTO 导入(推荐):

从本地 CSV 导入(推荐)

-- 第一步:通过 SQL PUT 将本地 CSV 文件上传到 User Volume PUT '/path/to/student_vle_data.csv' TO USER VOLUME FILE 'student_vle_data.csv';

-- 第二步:从 User Volume COPY INTO 表 COPY INTO best_practice_education_dw.doc_ods_student_vle FROM USER VOLUME USING csv OPTIONS('header'='true', 'sep'=',', 'nullValue'='') FILES ('student_vle_data.csv');

也可直接内联插入小批量测试数据(不需要 CSV 文件):

INSERT INTO best_practice_education_dw.doc_ods_student_vle (code_module, code_presentation, id_student, id_site, event_date, sum_click) VALUES ('AAA','2013J',11391,546652, 1, 4), ('AAA','2013J',11391,546652, 2, 8), ('AAA','2013J',11391,546712, 3,15), ('AAA','2013J',11391,546714, 5, 3), ('AAA','2013J',11391,546715, 7, 6), ('AAA','2013J',11391,546712,10,12), ('AAA','2013J',11391,546714,12, 4), ('AAA','2013J',28400,546652, 1, 2), ('AAA','2013J',28400,546712, 4, 9), ('AAA','2013J',28400,546715, 6, 5), ('AAA','2013J',31604,546652, 1, 5), ('AAA','2013J',31604,546712, 3,11), ('AAA','2013J',31604,546714, 6, 2), ('EEE','2013J',70001,550001, 1,18), ('EEE','2013J',70001,550001, 3,22), ('BBB','2013J',40102,547001, 2, 7), ('BBB','2013J',40102,547002, 4, 2), ('CCC','2014J',50001,548001, 2,14), ('CCC','2014J',50001,548002, 5, 3); -- 共 29 条记录

验证 ODS 行数:

SELECT COUNT(*) AS row_count FROM best_practice_education_dw.doc_ods_student_vle;

row_count --------- 29


学习评分 UDF

将多维度学习效果评分逻辑封装为 SQL UDF,DWS 和 ADS 层均可复用。

评分公式:

  • assessment_avg × 0.50
    assessment_avg × 0.50
    :测验成绩贡献 50 分
  • min(total_clicks, 200) / 200 × 30
    min(total_clicks, 200) / 200 × 30
    :平台点击深度贡献 30 分
  • min(submission_count, 5) / 5 × 10
    min(submission_count, 5) / 5 × 10
    :作业提交频次贡献 10 分
  • min(days_active, 30) / 30 × 10
    min(days_active, 30) / 30 × 10
    :活跃天数贡献 10 分

CREATE OR REPLACE FUNCTION best_practice_education_dw.calc_learning_score( total_clicks INT, assessment_avg DOUBLE, submission_count INT, days_active INT ) RETURNS DOUBLE AS GREATEST(0.0, LEAST(100.0, COALESCE(assessment_avg, 0) * 0.50 + LEAST(total_clicks, 200) / 200.0 * 30.0 + LEAST(submission_count, 5) / 5.0 * 10.0 + LEAST(days_active, 30) / 30.0 * 10.0 ));

验证函数——高活跃学生(点击 120 次、平均分 78、提交 3 次、活跃 15 天):

SELECT best_practice_education_dw.calc_learning_score(120, 78.0, 3, 15) AS sample_score;

sample_score ------------ 68


DWD 层 Dynamic Table:清洗与关联

DWD 层将 ODS 原始事件与学生信息、课程元数据、VLE 资源类型关联,输出标准化的学习事件宽表,供 DWS 层直接聚合。

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.dwd_learning_events AS SELECT v.code_module, v.code_presentation, v.id_student, v.id_site, vl.activity_type, v.event_date, v.sum_click, v.ingest_time, s.gender, s.region, s.highest_education, s.age_band, s.final_result AS enrollment_result, c.module_presentation_length FROM best_practice_education_dw.doc_ods_student_vle v LEFT JOIN best_practice_education_dw.doc_ods_student_info s ON v.code_module = s.code_module AND v.code_presentation = s.code_presentation AND v.id_student = s.id_student LEFT JOIN best_practice_education_dw.doc_ods_vle vl ON v.id_site = vl.id_site LEFT JOIN best_practice_education_dw.doc_ods_courses c ON v.code_module = c.code_module AND v.code_presentation = c.code_presentation;

手动触发首次刷新:

REFRESH DYNAMIC TABLE best_practice_education_dw.dwd_learning_events; SELECT COUNT(*) AS dwd_count FROM best_practice_education_dw.dwd_learning_events;

dwd_count --------- 29


DWS 层 Dynamic Table:学生课程进度聚合

DWS 层以

id_student
id_student
+
code_module
code_module
+
code_presentation
code_presentation
为粒度聚合 DWD 层数据,输出每位学生在每门课的行为统计,作为 ADS 层评分的直接输入。

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.dws_student_course_progress AS SELECT e.code_module, e.code_presentation, e.id_student, e.gender, e.region, e.highest_education, e.age_band, e.enrollment_result, e.module_presentation_length, COUNT(DISTINCT e.event_date) AS days_active, SUM(e.sum_click) AS total_clicks, COUNT(DISTINCT e.id_site) AS distinct_resources, SUM(CASE WHEN e.activity_type = 'quiz' THEN e.sum_click ELSE 0 END) AS quiz_clicks, COUNT(DISTINCT CASE WHEN e.activity_type = 'quiz' THEN e.event_date END) AS quiz_days, SUM(CASE WHEN e.activity_type = 'oucontent' THEN e.sum_click ELSE 0 END) AS content_clicks, MAX(e.event_date) AS last_active_day FROM best_practice_education_dw.dwd_learning_events e GROUP BY e.code_module, e.code_presentation, e.id_student, e.gender, e.region, e.highest_education, e.age_band, e.enrollment_result, e.module_presentation_length;

手动触发首次刷新并查看结果:

REFRESH DYNAMIC TABLE best_practice_education_dw.dws_student_course_progress; SELECT code_module, code_presentation, id_student, days_active, total_clicks, distinct_resources, quiz_clicks, last_active_day FROM best_practice_education_dw.dws_student_course_progress ORDER BY total_clicks DESC LIMIT 8;

code_module | code_presentation | id_student | days_active | total_clicks | distinct_resources | quiz_clicks | last_active_day ------------+-------------------+------------+-------------+--------------+--------------------+-------------+---------------- AAA | 2013J | 11391 | 7 | 52 | 4 | 7 | 12 EEE | 2013J | 70001 | 2 | 40 | 1 | 0 | 3 DDD | 2013J | 60001 | 2 | 21 | 1 | 0 | 5 BBB | 2013J | 40102 | 3 | 19 | 2 | 2 | 8 AAA | 2013J | 31604 | 3 | 18 | 3 | 2 | 6 CCC | 2014J | 50001 | 2 | 17 | 2 | 3 | 5 AAA | 2013J | 28400 | 3 | 16 | 3 | 0 | 6 CCC | 2014J | 50002 | 1 | 8 | 1 | 0 | 3

结果解读

  • 学生 11391(AAA 课程)点击总量 52 次、活跃 7 天,是数据集中参与度最高的学生;
    quiz_clicks=7
    quiz_clicks=7
    说明有测验互动,综合评分有望进入 MEDIUM_RISK 区间。
  • 学生 70001(EEE 课程)点击量 40 次但仅活跃 2 天,属于短期高强度学习型,
    quiz_clicks=0
    quiz_clicks=0
    说明未参与测验,需关注知识点掌握情况。
  • distinct_resources
    distinct_resources
    衡量学生探索的资源广度:访问资源类型越多,通常学习完整性越好。

BITMAP 函数:课程活跃学生统计

GROUP_BITMAP
GROUP_BITMAP
对学生 ID 去重计数,效率优于
COUNT(DISTINCT)
COUNT(DISTINCT)
,且可与
GROUP_BITMAP_STATE
GROUP_BITMAP_STATE
配合做跨课程交叉分析:

-- 各课程活跃学生数 SELECT code_module, code_presentation, GROUP_BITMAP(CAST(id_student AS BIGINT)) AS active_student_count FROM best_practice_education_dw.dws_student_course_progress GROUP BY code_module, code_presentation ORDER BY code_module;

code_module | code_presentation | active_student_count ------------+-------------------+--------------------- AAA | 2013J | 5 BBB | 2013J | 2 CCC | 2014J | 2 DDD | 2013J | 2 EEE | 2013J | 1 FFF | 2013J | 1

跨课程比较:各课程总注册学生 vs. 高参与度学生(总点击 > 20 次):

SELECT a.code_module, a.code_presentation, GROUP_BITMAP(CAST(a.id_student AS BIGINT)) AS total_enrolled, GROUP_BITMAP(CASE WHEN a.total_clicks > 20 THEN CAST(a.id_student AS BIGINT) END) AS high_engagement FROM best_practice_education_dw.dws_student_course_progress a GROUP BY a.code_module, a.code_presentation ORDER BY a.code_module;

code_module | code_presentation | total_enrolled | high_engagement ------------+-------------------+----------------+---------------- AAA | 2013J | 5 | 1 BBB | 2013J | 2 | 0 CCC | 2014J | 2 | 0 DDD | 2013J | 2 | 1 EEE | 2013J | 1 | 1 FFF | 2013J | 1 | 0


ADS 层 Dynamic Table:学习评分与高风险预警

ADS 层调用

calc_learning_score
calc_learning_score
UDF 对每位学生打分,并输出三档风险等级,供仪表板直接消费。

CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_education_dw.ads_student_learning_score AS SELECT p.code_module, p.code_presentation, p.id_student, p.gender, p.region, p.highest_education, p.age_band, p.enrollment_result, p.days_active, p.total_clicks, p.distinct_resources, p.quiz_clicks, p.last_active_day, COALESCE(a.submission_count, 0) AS submission_count, COALESCE(a.avg_score, 0.0) AS avg_score, COALESCE(a.max_score, 0.0) AS max_score, best_practice_education_dw.calc_learning_score( CAST(p.total_clicks AS INT), a.avg_score, CAST(COALESCE(a.submission_count, 0) AS INT), CAST(p.days_active AS INT) ) AS learning_score, CASE WHEN best_practice_education_dw.calc_learning_score( CAST(p.total_clicks AS INT), a.avg_score, CAST(COALESCE(a.submission_count, 0) AS INT), CAST(p.days_active AS INT) ) >= 75 THEN 'LOW_RISK' WHEN best_practice_education_dw.calc_learning_score( CAST(p.total_clicks AS INT), a.avg_score, CAST(COALESCE(a.submission_count, 0) AS INT), CAST(p.days_active AS INT) ) >= 50 THEN 'MEDIUM_RISK' ELSE 'HIGH_RISK' END AS risk_level FROM best_practice_education_dw.dws_student_course_progress p LEFT JOIN ( SELECT sa.id_student, COUNT(*) AS submission_count, ROUND(AVG(CASE WHEN sa.is_banked = 0 THEN sa.score END), 2) AS avg_score, MAX(CASE WHEN sa.is_banked = 0 THEN sa.score END) AS max_score FROM best_practice_education_dw.doc_ods_student_assessment sa GROUP BY sa.id_student ) a ON p.id_student = CAST(a.id_student AS STRING);

手动触发首次刷新并查看高风险预警结果:

REFRESH DYNAMIC TABLE best_practice_education_dw.ads_student_learning_score; SELECT code_module, id_student, days_active, total_clicks, avg_score, learning_score, risk_level FROM best_practice_education_dw.ads_student_learning_score ORDER BY learning_score ASC LIMIT 10;

code_module | id_student | days_active | total_clicks | avg_score | learning_score | risk_level ------------+------------+-------------+--------------+-----------+-----------------+----------- BBB | 41203 | 1 | 3 | 0 | 2.783 | HIGH_RISK FFF | 80001 | 1 | 2 | 32 | 18.633 | HIGH_RISK DDD | 60002 | 1 | 4 | 40 | 22.933 | HIGH_RISK AAA | 32885 | 1 | 1 | 41.5 | 25.233 | HIGH_RISK BBB | 40102 | 3 | 19 | 48.5 | 32.1 | HIGH_RISK AAA | 30268 | 2 | 4 | 65 | 37.767 | HIGH_RISK CCC | 50002 | 1 | 8 | 78 | 42.533 | HIGH_RISK AAA | 31604 | 3 | 18 | 82 | 46.7 | HIGH_RISK AAA | 28400 | 3 | 16 | 71 | 46.9 | HIGH_RISK DDD | 60001 | 2 | 21 | 82.5 | 49.067 | HIGH_RISK

查看风险等级分布:

SELECT risk_level, COUNT(*) AS student_count FROM best_practice_education_dw.ads_student_learning_score GROUP BY risk_level ORDER BY risk_level;

risk_level | student_count -------------+-------------- HIGH_RISK | 10 MEDIUM_RISK | 3

结果解读

  • HIGH_RISK(10 人)
    avg_score=0
    avg_score=0
    (学生 41203)表示从未提交过作业,学习分仅 2.8,属于极高流失风险;另几位学生成绩在 40–65 分之间但点击量极低,说明既不参与互动也未能掌握内容。
  • MEDIUM_RISK(3 人):以学生 11391 为例,点击量 52 次、平均分 83、活跃 7 天,学习分 61.6——参与度强但距 LOW_RISK 阈值(75 分)仍有差距,主要原因是提交次数(5 次)和活跃天数(7 天)的上限还有提升空间。
  • 当前数据集无 LOW_RISK 学生,符合模拟数据规模较小(仅 29 条行为记录)的预期——真实生产环境中,高参与度学生比例通常在 30–50%。

Studio 任务调度

Dynamic Table 的定期刷新通过 Studio 任务管理,不在 DDL 里写

REFRESH INTERVAL
REFRESH INTERVAL
。任务与 Dynamic Table 解耦后,可在同一任务上附加监控告警、数据质量检查等规则。

创建任务文件夹

cz-cli task create-folder "education_dw" --parent 186117 -p skill_test

{"data":186121}

创建 DWD 刷新任务

cz-cli task create "Refresh_DWD_Learning_Events" --type SQL --folder 186121 -p skill_test

{"data":{"id":10354662,"studio_url":"https://..."}}

保存 SQL 内容:

cz-cli task save-content "Refresh_DWD_Learning_Events" \ --content "REFRESH DYNAMIC TABLE best_practice_education_dw.dwd_learning_events;" \ -p skill_test

配置每小时调度(5 字段 cron 表达式):

cz-cli task save-cron "Refresh_DWD_Learning_Events" --cron "0 0/1 * * ?" -p skill_test

创建 DWS 刷新任务

cz-cli task create "Refresh_DWS_Learning_Progress" --type SQL --folder 186121 -p skill_test cz-cli task save-content "Refresh_DWS_Learning_Progress" \ --content "REFRESH DYNAMIC TABLE best_practice_education_dw.dws_student_course_progress;" \ -p skill_test cz-cli task save-cron "Refresh_DWS_Learning_Progress" --cron "0 0/1 * * ?" -p skill_test

创建 ADS 刷新任务

cz-cli task create "Refresh_ADS_Learning_Score" --type SQL --folder 186121 -p skill_test cz-cli task save-content "Refresh_ADS_Learning_Score" \ --content "REFRESH DYNAMIC TABLE best_practice_education_dw.ads_student_learning_score;" \ -p skill_test cz-cli task save-cron "Refresh_ADS_Learning_Score" --cron "0 0/1 * * ?" -p skill_test


External Function:知识点掌握度推断

真实在线教育场景中,每道题目对应若干知识点,通过外部知识图谱 API 可以推断学生对各知识点的掌握状态,为精准补救教学提供依据。以下展示 External Function 的典型接入方式。

外部函数框架

# knowledge_mastery.py — 调用知识图谱 API 推断知识点掌握度 from clickzetta.zettapark.functions import annotate class KnowledgeMastery: @annotate( input_args=[("student_id", "BIGINT"), ("question_ids", "STRING"), ("scores", "STRING")], return_type="STRING" ) def evaluate(self, student_id: int, question_ids: str, scores: str) -> str: """ 调用知识图谱服务推断学生对各知识点的掌握等级。 返回 JSON 字符串,如: {"concepts": [{"id": "C001", "name": "线性方程", "mastery": "proficient"}]} """ import json, requests payload = { "student_id": student_id, "responses": [ {"question_id": qid, "score": float(sc)} for qid, sc in zip(question_ids.split(","), scores.split(",")) ] } # 替换为实际知识图谱服务地址 resp = requests.post( "https://<knowledge-graph-service>/api/mastery", json=payload, timeout=5 ) return json.dumps(resp.json(), ensure_ascii=False)

部署为 External Function 后,可直接在 SQL 中调用:

-- 示例:查询学生 11391 的知识点掌握状态 SELECT id_student, best_practice_education_dw.infer_knowledge_mastery( id_student, '1752,1753,1754', -- 对应题目 ID '78,85,88' -- 对应题目得分 ) AS mastery_result FROM best_practice_education_dw.doc_ods_student_info WHERE id_student = 11391 LIMIT 1;


数仓对象总览

全部构建完成后,

best_practice_education_dw
best_practice_education_dw
Schema 下的核心对象:

SHOW TABLES IN best_practice_education_dw;

主要对象一览:

schema_name | table_name | is_dynamic -----------------------------+---------------------------------+----------- best_practice_education_dw | doc_ods_student_vle | false best_practice_education_dw | doc_ods_student_info | false best_practice_education_dw | doc_ods_courses | false best_practice_education_dw | doc_ods_vle | false best_practice_education_dw | doc_ods_student_assessment | false best_practice_education_dw | dwd_learning_events | true best_practice_education_dw | dws_student_course_progress | true best_practice_education_dw | ads_student_learning_score | true

架构结构:

Kafka(实时) PostgreSQL CDC CSV Batch │ │ │ ▼ pipe_student_vle(PIPE) │ │ kafka_raw_vle │ │ ▼ ▼ doc_ods_student_vle doc_ods_student_info doc_ods_courses / doc_ods_vle Inverted Index Inverted Index(activity_type) │ │ │ └────────────────────┼───────────────────────┘ │ ▼ Studio Task: Refresh_DWD_Learning_Events(每小时) dwd_learning_events(Dynamic Table) LEFT JOIN 关联 · 清洗 · 标准化 │ ▼ Studio Task: Refresh_DWS_Learning_Progress(每小时) dws_student_course_progress(Dynamic Table) days_active / total_clicks / quiz_clicks / GROUP_BITMAP │ ▼ Studio Task: Refresh_ADS_Learning_Score(每小时) ads_student_learning_score(Dynamic Table) calc_learning_score UDF · HIGH/MEDIUM/LOW_RISK


注意事项

  • Dynamic Table 不写 REFRESH INTERVAL:调度通过 Studio 任务管理。DDL 中不包含

    REFRESH INTERVAL
    REFRESH INTERVAL
    参数,刷新任务通过
    cz-cli task create/save-content/save-cron
    cz-cli task create/save-content/save-cron
    创建,可在同一任务上追加告警和数据质量检查规则。

  • Inverted Index 对存量数据不自动生效

    CREATE INVERTED INDEX
    CREATE INVERTED INDEX
    只对创建后写入的新数据生效。若表中已有存量数据,需执行
    BUILD INDEX idx_inv_activity_type ON doc_ods_vle
    BUILD INDEX idx_inv_activity_type ON doc_ods_vle
    覆盖存量,否则
    MATCH_ALL
    MATCH_ALL
    查询可能返回空结果。

  • GROUP_BITMAP
    GROUP_BITMAP
    GROUP_BITMAP_STATE
    GROUP_BITMAP_STATE
    语义不同
    GROUP_BITMAP
    GROUP_BITMAP
    返回基数(BIGINT),
    GROUP_BITMAP_STATE
    GROUP_BITMAP_STATE
    返回 bitmap 对象(可做 AND/OR 交叉运算)。两者不可混用——
    BITMAP_COUNT(GROUP_BITMAP(...))
    BITMAP_COUNT(GROUP_BITMAP(...))
    会报类型错误,应使用
    GROUP_BITMAP_STATE
    GROUP_BITMAP_STATE
    后再调用
    BITMAP_COUNT
    BITMAP_COUNT

  • Dynamic Table 增量刷新依赖上游变更追踪:第一次

    REFRESH
    REFRESH
    做全量快照计算;后续增量刷新只处理上游 ODS 层自上次刷新点以来新增或变更的行。如果 ODS 层使用
    INSERT OVERWRITE
    INSERT OVERWRITE
    写入,会导致 Dynamic Table 退化为全量刷新。

  • ADS 层 JOIN 时注意类型匹配

    dws_student_course_progress.id_student
    dws_student_course_progress.id_student
    经过 Dynamic Table 聚合后可能变为 STRING 类型,与
    doc_ods_student_assessment.id_student
    doc_ods_student_assessment.id_student
    (BIGINT)JOIN 时需要显式
    CAST
    CAST
    ,否则导致关联失败或全量笛卡尔积。

  • External Function 网络延迟:调用外部知识图谱 API 有网络往返延迟,不适合在高频查询中使用。建议将推断结果物化到 ADS 层专用表,按天或按学习阶段批量更新一次。


相关文档

联系我们
预约咨询
微信咨询
电话咨询
邮件咨询