Chapter 04

窗口函数与分析

数据分析的核心利器:窗口函数、QUALIFY 过滤、用户行为漏斗/留存分析实战

4.1 窗口函数基础

窗口函数在不改变行数的情况下,对"窗口"(一组相关行)进行计算。与 GROUP BY 聚合不同,窗口函数保留每行的原始数据,同时附加计算结果。

SQL-- 基础语法
函数名() OVER (
    [PARTITION BY 分组列]   -- 可选:按此列分组,每组独立计算
    [ORDER BY 排序列]        -- 可选:窗口内排序(某些函数必须)
    [框架子句]               -- 可选:定义行范围
)

4.2 排名函数

SQLSELECT
    name,
    department,
    salary,
    -- ROW_NUMBER:唯一序号,没有并列
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
    -- RANK:有并列,下一名跳号(1,1,3)
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
    -- DENSE_RANK:有并列,不跳号(1,1,2)
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
    -- PERCENT_RANK:百分比排名 (0~1)
    PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS pct_rank,
    -- NTILE(n):分成 n 组(四分位、十分位)
    NTILE(4) OVER (PARTITION BY department ORDER BY salary DESC) AS quartile
FROM employees;

4.3 位移函数 LAG / LEAD

LAG 获取当前行之前的行值,LEAD 获取之后的行值,常用于计算环比变化。

SQLSELECT
    month,
    revenue,
    -- LAG:上月收入
    LAG(revenue, 1, 0) OVER (ORDER BY month) AS prev_revenue,
    -- 计算环比增长率
    round(
        (revenue - LAG(revenue) OVER (ORDER BY month))
        / LAG(revenue) OVER (ORDER BY month) * 100,
        2
    ) AS mom_growth_pct,
    -- LEAD:下月收入(预测对比)
    LEAD(revenue, 1) OVER (ORDER BY month) AS next_revenue
FROM monthly_revenue;

4.4 聚合窗口函数

SQLSELECT
    date,
    daily_sales,
    -- 累计销售额(从开始到当前行)
    SUM(daily_sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sales,
    -- 7日移动平均
    AVG(daily_sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7,
    -- 30日移动平均
    AVG(daily_sales) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS ma30,
    -- 当前行占全局总和的百分比
    round(daily_sales / SUM(daily_sales) OVER() * 100, 2) AS pct_of_total
FROM daily_sales;

4.5 QUALIFY 子句:直接过滤窗口结果

DuckDB 特有的 QUALIFY 子句允许直接在 WHERE 中过滤窗口函数结果,无需嵌套子查询。这是最常用的 DuckDB 专属优化之一。

SQL-- 标准 SQL(需要子查询,繁琐)
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM orders
) WHERE rn = 1;  -- 取每个用户最新一单

-- DuckDB QUALIFY(一行搞定!)
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) = 1;

-- 取每个部门薪资最高的 Top 3
SELECT name, department, salary
FROM employees
QUALIFY RANK() OVER (PARTITION BY department ORDER BY salary DESC) <= 3;

4.6 RANGE vs ROWS vs GROUPS 窗口框架

框架类型含义适用场景
ROWS按物理行号计算偏移,精确移动平均、固定行数的滑动窗口
RANGE按值范围计算,相同排序键的行算同一组处理并列值时希望它们一致
GROUPS按分组数计算偏移跳过相同值组,按组数滑动
SQL-- ROWS 框架:严格按行号
SUM(amount) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day

-- RANGE 框架:按时间范围(前7天到当前)
SUM(amount) OVER (
    ORDER BY date
    RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) AS rolling_7day_range

4.7 实战:用户行为分析

漏斗分析

SQL-- 电商漏斗:浏览 → 加购 → 下单 → 支付
WITH funnel AS (
    SELECT
        user_id,
        MAX(CASE WHEN event_type = 'view'    THEN 1 ELSE 0 END) AS viewed,
        MAX(CASE WHEN event_type = 'cart'    THEN 1 ELSE 0 END) AS carted,
        MAX(CASE WHEN event_type = 'order'   THEN 1 ELSE 0 END) AS ordered,
        MAX(CASE WHEN event_type = 'payment' THEN 1 ELSE 0 END) AS paid
    FROM user_events
    WHERE event_date = current_date()
    GROUP BY user_id
)
SELECT
    SUM(viewed)  AS step1_view,
    SUM(carted)  AS step2_cart,
    SUM(ordered) AS step3_order,
    SUM(paid)    AS step4_payment,
    round(SUM(carted)::FLOAT  / SUM(viewed)  * 100, 1) AS view_to_cart_pct,
    round(SUM(ordered)::FLOAT / SUM(carted)  * 100, 1) AS cart_to_order_pct,
    round(SUM(paid)::FLOAT    / SUM(ordered) * 100, 1) AS order_to_pay_pct
FROM funnel;

N日留存分析

SQL-- 计算第 1/7/30 日留存率
WITH first_login AS (
    SELECT user_id, MIN(login_date) AS reg_date
    FROM logins
    GROUP BY user_id
),
retention AS (
    SELECT
        f.user_id,
        f.reg_date,
        date_diff('day', f.reg_date, l.login_date) AS day_diff
    FROM first_login f
    JOIN logins l ON f.user_id = l.user_id
)
SELECT
    reg_date,
    COUNT(DISTINCT user_id) AS new_users,
    COUNT(DISTINCT CASE WHEN day_diff = 1  THEN user_id END) AS d1_retained,
    COUNT(DISTINCT CASE WHEN day_diff = 7  THEN user_id END) AS d7_retained,
    COUNT(DISTINCT CASE WHEN day_diff = 30 THEN user_id END) AS d30_retained
FROM retention
GROUP BY reg_date
ORDER BY reg_date;
💡

本章小结 DuckDB 的窗口函数覆盖排名(ROW_NUMBER/RANK/DENSE_RANK)、位移(LAG/LEAD)、聚合(SUM/AVG OVER)。QUALIFY 子句是 DuckDB 独有的语法糖,直接过滤窗口结果,避免子查询嵌套。ROWS/RANGE/GROUPS 三种框架各有适用场景。漏斗分析和留存分析是数据分析最核心的两个场景,DuckDB 配合窗口函数可以非常高效地完成。