Chapter 04

测试(Tests)体系

用 YAML 声明数据质量断言,构建覆盖数据完整性、唯一性、引用完整性的测试套件

4.1 dbt 测试的本质

dbt 的测试(Test)本质上是一条 返回失败行数的 SQL 查询。如果查询返回 0 行,测试通过;如果返回任何行,测试失败。这个设计极为优雅——测试就是断言"这个条件下不应该有任何数据"。

💡

测试 = "找到违反约束的行" 比如 unique 测试会查询"哪些值出现了多次";not_null 测试会查询"哪些行的该列为 NULL"。如果这些查询返回空结果集,说明数据质量达标。

4.2 四大内置通用测试

dbt 内置四种通用测试(Generic Tests),覆盖绝大多数数据质量需求:

测试类型作用适用场景
unique列中无重复值主键、自然键、唯一标识符
not_null列中无 NULL 值必填字段、外键、关键业务字段
accepted_values列值在指定集合内枚举类型、状态字段、分类标签
relationships外键引用完整性确保子表 FK 在父表中存在

4.3 在 schema.yml 中声明测试

YAML# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    description: "清洗后的订单数据"
    columns:
      - name: order_id
        description: "订单唯一 ID"
        tests:
          - unique                          # 不能有重复
          - not_null                        # 不能为空

      - name: customer_id
        description: "关联的用户 ID"
        tests:
          - not_null
          - relationships:                  # 引用完整性:必须在 stg_customers 中存在
              to: ref('stg_customers')
              field: customer_id

      - name: status
        description: "订单状态"
        tests:
          - not_null
          - accepted_values:
              values: ['PENDING', 'COMPLETED', 'CANCELLED', 'REFUNDED']

      - name: amount_dollars
        description: "订单金额(美元)"
        tests:
          - not_null

  - name: stg_customers
    columns:
      - name: customer_id
        tests: [unique, not_null]
      - name: email
        tests: [unique, not_null]

4.4 运行测试:dbt test

BASH# 运行所有测试
dbt test

# 只测试特定模型
dbt test --select stg_orders

# 只运行特定类型的测试
dbt test --select "test_type:generic"
dbt test --select "test_type:singular"

# 运行模型并立即测试
dbt build   # = dbt run + dbt test(按 DAG 顺序,每个模型跑完就测试)

测试输出示例:

OUTPUTRunning 8 tests

PASS  unique_stg_orders_order_id ................. [PASS in 0.12s]
PASS  not_null_stg_orders_order_id ............... [PASS in 0.08s]
PASS  not_null_stg_orders_customer_id ............ [PASS in 0.09s]
FAIL  accepted_values_stg_orders_status ......... [FAIL 3 in 0.15s]
PASS  relationships_stg_orders_customer_id ....... [PASS in 0.20s]

Finished running 8 tests in 0.64s.

FAIL=1 PASS=7 ERROR=0 SKIP=0 TOTAL=8

4.5 自定义 SQL 测试(Singular Tests)

对于复杂的业务规则,可以在 tests/ 目录创建自定义 SQL 测试文件(Singular Test)。这些 SQL 应返回"违规的行"——如果返回空则测试通过:

SQL-- tests/assert_order_amount_positive.sql
-- 测试:订单金额必须大于 0
SELECT
    order_id,
    amount_dollars
FROM {{ ref('stg_orders') }}
WHERE amount_dollars <= 0
   OR amount_dollars IS NULL
SQL-- tests/assert_refund_le_original.sql
-- 测试:退款金额不能超过原始订单金额
WITH orders AS (
    SELECT order_id, amount_dollars FROM {{ ref('stg_orders') }}
),
refunds AS (
    SELECT order_id, refund_amount FROM {{ ref('stg_refunds') }}
)
SELECT
    r.order_id,
    o.amount_dollars,
    r.refund_amount
FROM refunds r
JOIN orders o USING (order_id)
WHERE r.refund_amount > o.amount_dollars

4.6 测试失败处理策略:severity

并非所有测试失败都需要中断 pipeline。dbt 支持两种严重级别:

YAMLmodels:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique:
              severity: error      # 默认:失败时标记为 ERROR,pipeline 中止
          - not_null:
              severity: warn       # 宽松:失败时只警告,pipeline 继续执行

      - name: status
        tests:
          - accepted_values:
              values: ['PENDING', 'COMPLETED', 'CANCELLED', 'REFUNDED']
              severity: warn
              # 也可设置失败行数阈值
              error_if: ">100"     # 失败行数超过 100 才报 error
              warn_if: ">10"      # 失败行数超过 10 就警告

4.7 dbt-expectations 扩展测试包

dbt-expectations 是受 Great Expectations 启发的测试扩展包,提供更丰富的断言能力:

YAML# packages.yml — 安装 dbt-expectations
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
BASH">dbt deps   # 安装依赖包
YAML# 使用 dbt-expectations 的测试示例
models:
  - name: stg_orders
    tests:
      # 行数范围检查
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 100
          max_value: 10000000
    columns:
      - name: amount_dollars
        tests:
          # 数值范围
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0.01
              max_value: 99999.99
          # 非负数
          - dbt_expectations.expect_column_values_to_be_positive
      - name: order_date
        tests:
          # 日期格式
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
          # 非未来日期
          - dbt_expectations.expect_column_values_to_be_between:
              max_value: "current_date"
      - name: email
        tests:
          # 邮箱格式
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

4.8 完整数据质量测试套件实战

YAML# models/marts/schema.yml — 宽表层完整测试
version: 2

models:
  - name: customer_orders
    description: "客户订单汇总宽表,每行一个客户"
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: customer_id
        tests: [unique, not_null]

      - name: total_orders
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0

      - name: lifetime_value
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0

      - name: first_order_date
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              max_value: "current_date"

      - name: most_recent_order_date
        tests:
          # 最近订单日期不能早于第一次订单日期
          - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_or_equal_to_B:
              column_A: most_recent_order_date
              column_B: first_order_date
📌

本章小结
dbt 测试是返回违规行的 SELECT 查询,空结果 = 测试通过。四大内置测试:unique(唯一性)、not_null(非空)、accepted_values(枚举值)、relationships(引用完整性)。

在 schema.yml 的 tests: 块中声明测试;自定义业务规则测试放在 tests/ 目录。severity: warn/error 控制失败策略,error_if/warn_if 设置数量阈值。dbt build 一次完成 run + test,推荐用于 CI/CD 流水线。