4.1 dbt 测试的本质
dbt 的测试(Test)本质上是一条 返回失败行数的 SQL 查询。如果查询返回 0 行,测试通过;如果返回任何行,测试失败。这个设计极为优雅——测试就是断言"这个条件下不应该有任何数据"。
测试 = "找到违反约束的行" 比如 unique 测试会查询"哪些值出现了多次";not_null 测试会查询"哪些行的该列为 NULL"。如果这些查询返回空结果集,说明数据质量达标。
4.2 四大内置通用测试
dbt 内置四种通用测试(Generic Tests),覆盖绝大多数数据质量需求:
| 测试类型 | 作用 | 适用场景 |
|---|---|---|
unique | 列中无重复值 | 主键、自然键、唯一标识符 |
not_null | 列中无 NULL 值 | 必填字段、外键、关键业务字段 |
accepted_values | 列值在指定集合内 | 枚举类型、状态字段、分类标签 |
relationships | 外键引用完整性 | 确保子表 FK 在父表中存在 |
4.3 在 schema.yml 中声明测试
YAML# models/staging/schema.yml
version: 2
models:
- name: stg_orders
description: "清洗后的订单数据"
columns:
- name: order_id
description: "订单唯一 ID"
tests:
- unique # 不能有重复
- not_null # 不能为空
- name: customer_id
description: "关联的用户 ID"
tests:
- not_null
- relationships: # 引用完整性:必须在 stg_customers 中存在
to: ref('stg_customers')
field: customer_id
- name: status
description: "订单状态"
tests:
- not_null
- accepted_values:
values: ['PENDING', 'COMPLETED', 'CANCELLED', 'REFUNDED']
- name: amount_dollars
description: "订单金额(美元)"
tests:
- not_null
- name: stg_customers
columns:
- name: customer_id
tests: [unique, not_null]
- name: email
tests: [unique, not_null]
4.4 运行测试:dbt test
BASH# 运行所有测试
dbt test
# 只测试特定模型
dbt test --select stg_orders
# 只运行特定类型的测试
dbt test --select "test_type:generic"
dbt test --select "test_type:singular"
# 运行模型并立即测试
dbt build # = dbt run + dbt test(按 DAG 顺序,每个模型跑完就测试)
测试输出示例:
OUTPUTRunning 8 tests
PASS unique_stg_orders_order_id ................. [PASS in 0.12s]
PASS not_null_stg_orders_order_id ............... [PASS in 0.08s]
PASS not_null_stg_orders_customer_id ............ [PASS in 0.09s]
FAIL accepted_values_stg_orders_status ......... [FAIL 3 in 0.15s]
PASS relationships_stg_orders_customer_id ....... [PASS in 0.20s]
Finished running 8 tests in 0.64s.
FAIL=1 PASS=7 ERROR=0 SKIP=0 TOTAL=8
4.5 自定义 SQL 测试(Singular Tests)
对于复杂的业务规则,可以在 tests/ 目录创建自定义 SQL 测试文件(Singular Test)。这些 SQL 应返回"违规的行"——如果返回空则测试通过:
SQL-- tests/assert_order_amount_positive.sql
-- 测试:订单金额必须大于 0
SELECT
order_id,
amount_dollars
FROM {{ ref('stg_orders') }}
WHERE amount_dollars <= 0
OR amount_dollars IS NULL
SQL-- tests/assert_refund_le_original.sql
-- 测试:退款金额不能超过原始订单金额
WITH orders AS (
SELECT order_id, amount_dollars FROM {{ ref('stg_orders') }}
),
refunds AS (
SELECT order_id, refund_amount FROM {{ ref('stg_refunds') }}
)
SELECT
r.order_id,
o.amount_dollars,
r.refund_amount
FROM refunds r
JOIN orders o USING (order_id)
WHERE r.refund_amount > o.amount_dollars
4.6 测试失败处理策略:severity
并非所有测试失败都需要中断 pipeline。dbt 支持两种严重级别:
YAMLmodels:
- name: stg_orders
columns:
- name: order_id
tests:
- unique:
severity: error # 默认:失败时标记为 ERROR,pipeline 中止
- not_null:
severity: warn # 宽松:失败时只警告,pipeline 继续执行
- name: status
tests:
- accepted_values:
values: ['PENDING', 'COMPLETED', 'CANCELLED', 'REFUNDED']
severity: warn
# 也可设置失败行数阈值
error_if: ">100" # 失败行数超过 100 才报 error
warn_if: ">10" # 失败行数超过 10 就警告
4.7 dbt-expectations 扩展测试包
dbt-expectations 是受 Great Expectations 启发的测试扩展包,提供更丰富的断言能力:
YAML# packages.yml — 安装 dbt-expectations
packages:
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
BASH">dbt deps # 安装依赖包
YAML# 使用 dbt-expectations 的测试示例
models:
- name: stg_orders
tests:
# 行数范围检查
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 100
max_value: 10000000
columns:
- name: amount_dollars
tests:
# 数值范围
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0.01
max_value: 99999.99
# 非负数
- dbt_expectations.expect_column_values_to_be_positive
- name: order_date
tests:
# 日期格式
- dbt_expectations.expect_column_values_to_match_regex:
regex: "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
# 非未来日期
- dbt_expectations.expect_column_values_to_be_between:
max_value: "current_date"
- name: email
tests:
# 邮箱格式
- dbt_expectations.expect_column_values_to_match_regex:
regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
4.8 完整数据质量测试套件实战
YAML# models/marts/schema.yml — 宽表层完整测试
version: 2
models:
- name: customer_orders
description: "客户订单汇总宽表,每行一个客户"
tests:
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 1
columns:
- name: customer_id
tests: [unique, not_null]
- name: total_orders
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
- name: lifetime_value
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
- name: first_order_date
tests:
- dbt_expectations.expect_column_values_to_be_between:
max_value: "current_date"
- name: most_recent_order_date
tests:
# 最近订单日期不能早于第一次订单日期
- dbt_expectations.expect_column_pair_values_A_to_be_greater_than_or_equal_to_B:
column_A: most_recent_order_date
column_B: first_order_date
本章小结
dbt 测试是返回违规行的 SELECT 查询,空结果 = 测试通过。四大内置测试:unique(唯一性)、not_null(非空)、accepted_values(枚举值)、relationships(引用完整性)。
在 schema.yml 的 tests: 块中声明测试;自定义业务规则测试放在 tests/ 目录。severity: warn/error 控制失败策略,error_if/warn_if 设置数量阈值。dbt build 一次完成 run + test,推荐用于 CI/CD 流水线。