dbt Data Quality in Practice

The code in this article comes from clickzetta/jaffle-shop-clickzetta, where all 30 of 30 tests passed in an actual run (27 data tests + 3 unit tests).

Two test types

dbt provides two testing mechanisms that address data quality at different levels:

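Type	What it tests	How it runs	When to use it
data test	Actual data	Runs SQL assertions against real tables	Verifying completeness, uniqueness, and referential integrity of production data
unit test	Model logic	Runs the model's SQL transformation against mock data	Verifying that calculation logic is correct, independent of real data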

The essential difference: a data test asks "is the data correct?", while a unit test asks "is the logic correct?".

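A data test depends on real data, so the tables must first be built with `dbt run` before it can run. It catches problems that appear after the pipeline has run, such as null values or duplicate keys caused by poor upstream data quality.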

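A unit test does not depend on real data at all: it runs the model's SQL against mock rows that you write by hand. It runs as part of `dbt build` and needs no real rows to exist first, so you can verify logic during development instead of discovering problems after the data has been produced.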

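What to do when a test fails?

When `dbt test` fails, dbt reports the failing SQL query. You can run that SQL directly in Studio or cz-cli to see exactly which rows violate the constraint. For example, when a `unique` test fails, running the failing SQL shows which values are duplicated.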
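If re-running the query by hand becomes tedious, dbt can also persist the violating rows. The fragment below is a minimal sketch that enables dbt's `store_failures` test config on the `unique` test of this article's `orders` model. Whether dbt-clickzetta supports this config, and which schema the resulting table lands in, are assumptions to verify against your adapter version.

```yaml
# Sketch only: enable store_failures for a single test.
# Assumption: the adapter in use supports dbt's store_failures config.
models:
  - name: orders
    columns:
      - name: order_id
        data_tests:
          - unique:
              config:
                store_failures: true   # save violating rows to a table for inspection
```

With this in place, a failing run should point to the table holding the violating rows, which can then be queried like any other table.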

The two are complementary: data tests catch data problems, unit tests catch logic problems.

Data Test

Built-in test types

dbt ships with four generic tests, declared in `schema.yml`:

    models:
      - name: orders
        columns:
          - name: order_id
            data_tests:
              - not_null            # must not be null
              - unique              # values must be unique
          - name: customer_id
            data_tests:
              - relationships:      # foreign-key referential integrity
                  to: ref('stg_customers')
                  field: customer_id
          - name: customer_type
            data_tests:
              - accepted_values:    # enumerated value check
                  values: ["new", "returning"]

These tests follow the actual model test configuration in jaffle-shop-clickzetta; `customer_type` is a column of the `customers` model.

Cross-column expression tests

Each built-in test checks a single column; business rules that span several columns need `dbt_utils.expression_is_true`:

    models:
      - name: orders
        data_tests:
          - dbt_utils.expression_is_true:
              expression: "order_items_subtotal = subtotal"
          - dbt_utils.expression_is_true:
              expression: "order_total = subtotal + tax_paid"

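Both tests come from the `orders` model in jaffle-shop-clickzetta and verify how order amounts are calculated:

The sum of all order item subtotals equals the order subtotal
The order total equals the subtotal plus tax paid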

`dbt_utils` is an extension package maintained by dbt Labs. Declare it as a dependency in `packages.yml` and run `dbt deps` to install it; its tests are then available:

    packages:
      - package: dbt-labs/dbt_utils
        version: [">=1.0.0", "<2.0.0"]

Source tests

Source tests validate data before it enters the dbt pipeline, so problems surface earlier:

    sources:
      - name: ecom
        schema: raw
        tables:
          - name: raw_orders
            loaded_at_field: ordered_at   # used by the source freshness check
            columns:
              - name: id
                data_tests:
                  - not_null
                  - unique

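Once `loaded_at_field` names a timestamp column, `dbt source freshness` can check how fresh the data is: if the newest row is older than the configured threshold, dbt issues a warning or an error. The thresholds themselves are set in the source's `freshness` configuration (`warn_after` and `error_after`).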
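As a hedged illustration of where those thresholds live, the fragment below extends the `raw_orders` source with a `freshness` block. The 12-hour and 24-hour values are assumptions chosen for the example, not settings taken from jaffle-shop-clickzetta.

```yaml
sources:
  - name: ecom
    schema: raw
    tables:
      - name: raw_orders
        loaded_at_field: ordered_at
        freshness:
          warn_after: {count: 12, period: hour}    # assumed threshold: warn after 12 hours
          error_after: {count: 24, period: hour}   # assumed threshold: error after 24 hours
```

Depending on the dbt version, `freshness` may need to sit under a `config:` block instead; check the documentation for the version you run.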

Running tests

    dbt test                            # run all tests
    dbt test --select orders            # test only the orders model
    dbt test --select source:ecom       # test only the ecom source
    dbt test --select test_type:data    # run only data tests

Results (from jaffle-shop-clickzetta):

    Done. PASS=27 WARN=0 ERROR=0 SKIP=0 TOTAL=27

All 27 data tests pass, in about 5 seconds.

Unit Test

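What is a unit test

A unit test runs a model's SQL transformation against mock data, independent of the data actually stored in the database.

Suitable scenarios:

Verifying complex CASE WHEN logic
Verifying aggregations (SUM, COUNT, and so on)
Verifying timestamp handling (truncation, format conversion)
Testing logic when no real data is available yet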

Basic syntax

Unit tests are declared in `schema.yml`. Each one consists of `given` (the mock input data) plus `expect` (the expected output):

    unit_tests:
      - name: test_does_location_opened_at_trunc_to_date
        description: "Verify that the opened_at timestamp is truncated to a date"
        model: stg_locations
        given:
          - input: source('ecom', 'raw_stores')
            rows:
              - { id: 1, name: "Vice City", tax_rate: 0.2, opened_at: "2016-09-01T00:00:00" }
              - { id: 2, name: "San Andreas", tax_rate: 0.1, opened_at: "2079-10-27T23:59:59.9999" }
        expect:
          rows:
            - { location_id: 1, location_name: "Vice City", tax_rate: 0.2, opened_date: "2016-09-01" }
            - { location_id: 2, location_name: "San Andreas", tax_rate: 0.1, opened_date: "2079-10-27" }

This is the actual unit test for `stg_locations` in jaffle-shop-clickzetta. It verifies that the timestamp `"2079-10-27T23:59:59.9999"` is truncated to `"2079-10-27"` rather than rounded up to `"2079-10-28"`. Boundary cases like this are hard to cover with real data, but a unit test can construct them exactly.
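The same approach extends to branch boundaries in `CASE WHEN` logic. The sketch below is hypothetical: it assumes a model named `customer_segments` that reads `ref('customer_order_counts')` and labels a customer `returning` when `count_lifetime_orders > 1` and `new` otherwise. Neither model name comes from jaffle-shop-clickzetta; only the `new` and `returning` values appear in this article.

```yaml
unit_tests:
  - name: test_customer_type_boundary          # hypothetical test name
    description: "Check the new/returning boundary at exactly one order"
    model: customer_segments                   # hypothetical model
    given:
      - input: ref('customer_order_counts')    # hypothetical upstream model
        rows:
          - { customer_id: 1, count_lifetime_orders: 0 }
          - { customer_id: 2, count_lifetime_orders: 1 }   # boundary: one order is still "new"
          - { customer_id: 3, count_lifetime_orders: 2 }
    expect:
      rows:
        - { customer_id: 1, customer_type: "new" }
        - { customer_id: 2, customer_type: "new" }
        - { customer_id: 3, customer_type: "returning" }
```

Placing one row exactly on the boundary is the point of the test: an accidental `>=` in place of `>` would flip the second row and fail the comparison.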

Multi-input mocks

When a model references several upstream models, every input needs mock data:

    unit_tests:
      - name: test_supply_costs_sum_correctly
        description: "Verify that supply costs are summed correctly per product"
        model: order_items
        given:
          - input: ref('stg_supplies')
            rows:
              - { product_id: 1, supply_cost: 4.50 }
              - { product_id: 2, supply_cost: 3.50 }
              - { product_id: 2, supply_cost: 5.00 }   # product_id=2 has two supply records
          - input: ref('stg_products')
            rows:
              - { product_id: 1 }
              - { product_id: 2 }
          - input: ref('stg_order_items')
            rows:
              - { order_id: 1, product_id: 1 }
              - { order_id: 2, product_id: 2 }
              - { order_id: 2, product_id: 2 }
          - input: ref('stg_orders')
            rows:
              - { order_id: 1 }
              - { order_id: 2 }
        expect:
          rows:
            - { order_id: 1, product_id: 1, supply_cost: 4.50 }
            - { order_id: 2, product_id: 2, supply_cost: 8.50 }   # 3.50 + 5.00
            - { order_id: 2, product_id: 2, supply_cost: 8.50 }

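This test verifies that `product_id=2`, which has two supply records (3.50 + 5.00), aggregates to 8.50, and that each order item for that product carries the aggregated cost.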

Running unit tests

    dbt test --select test_type:unit    # run only unit tests
    dbt build                           # build includes unit tests automatically

Results (from jaffle-shop-clickzetta):

    Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

All 3 unit tests pass, in about 2 seconds.

⚠️ Note: dbt-clickzetta >= 1.7.5 is required. Earlier versions have a bug in unit test fixture cleanup (`DROP TABLE` issued against a VIEW raises an error); it is fixed in 1.7.5.

Complete testing strategy

Test distribution in jaffle-shop-clickzetta:

Layer	Tests	Count
Source	`not_null`, `unique` (primary keys of raw tables)	6
Staging	`not_null`, `unique` (primary keys of staging views) + 1 unit test (timestamp truncation)	8
Marts	`not_null`, `unique`, `relationships`, `accepted_values`, `expression_is_true` + 2 unit tests	19
Total		30

Test coverage principles:

Every table's primary key must have `not_null` + `unique`
Foreign keys must have a `relationships` test
Enumerated columns use `accepted_values`
Cross-column business rules use `expression_is_true`
Complex transformation logic uses unit tests
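
Taken together, the data test principles above might look like this for a single model. This is a sketch under stated assumptions: the `payments` model, its columns, and the accepted values are hypothetical and not part of jaffle-shop-clickzetta; only the test names and `ref('orders')` follow this article. It also assumes `dbt_utils` is installed.

```yaml
models:
  - name: payments                        # hypothetical model
    data_tests:
      - dbt_utils.expression_is_true:     # cross-column business rule
          expression: "amount_total = amount_net + amount_tax"
    columns:
      - name: payment_id                  # primary key: not_null + unique
        data_tests:
          - not_null
          - unique
      - name: order_id                    # foreign key: relationships
        data_tests:
          - relationships:
              to: ref('orders')
              field: order_id
      - name: payment_method              # enumerated column: accepted_values
        data_tests:
          - accepted_values:
              values: ["card", "cash"]    # assumed value set
```

Complex transformation logic in such a model would additionally get a `unit_tests` entry with `given` and `expect` rows.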