4.1 Window Function Basics
A window function computes a value over a "window" (a set of related rows) without changing the number of rows. Unlike GROUP BY aggregation, a window function keeps every row's original data and attaches the computed result alongside it.
```sql
-- Basic syntax
function_name() OVER (
    [PARTITION BY partition_cols]  -- optional: each group is computed independently
    [ORDER BY sort_cols]           -- optional: ordering within the window (required by some functions)
    [frame_clause]                 -- optional: defines the range of rows
)
```
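The skeleton above runs on any engine that implements standard window functions. A minimal sketch using Python's built-in sqlite3 module (SQLite 3.25+ supports the same OVER syntax as DuckDB; the `employees` data here is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("alice", "eng", 120), ("bob", "eng", 100), ("carol", "sales", 90),
])

# PARTITION BY restarts the numbering for each department;
# ORDER BY ranks rows within each partition.
rows = con.execute("""
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
    ORDER BY department, rn
""").fetchall()
print(rows)  # [('alice', 1), ('bob', 2), ('carol', 1)]
```

Note that every input row survives: `carol` keeps her own row with `rn = 1` in the `sales` partition, which a GROUP BY would have collapsed.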
4.2 Ranking Functions
```sql
SELECT
    name,
    department,
    salary,
    -- ROW_NUMBER: unique sequence number, no ties
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
    -- RANK: ties share a rank, the next rank is skipped (1, 1, 3)
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank,
    -- DENSE_RANK: ties share a rank, no gaps (1, 1, 2)
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
    -- PERCENT_RANK: percentile rank in [0, 1]
    PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS pct_rank,
    -- NTILE(n): split into n buckets (quartiles, deciles, ...)
    NTILE(4) OVER (PARTITION BY department ORDER BY salary DESC) AS quartile
FROM employees;
```
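The three numbering schemes only diverge when values tie. A small sketch with sqlite3 (invented salaries; DuckDB produces the same numbers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (salary INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(300,), (300,), (200,)])

rows = con.execute("""
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM t
    ORDER BY row_num
""").fetchall()

# The two 300s tie: ROW_NUMBER still assigns 1, 2; RANK assigns 1, 1
# then skips to 3; DENSE_RANK assigns 1, 1 and continues with 2.
for salary, row_num, rnk, dense_rnk in rows:
    print(salary, row_num, rnk, dense_rnk)
```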
4.3 Offset Functions: LAG / LEAD
LAG fetches a value from a row before the current row, LEAD from a row after it; both are commonly used for period-over-period comparisons.
```sql
SELECT
    month,
    revenue,
    -- LAG: previous month's revenue (0 for the first row)
    LAG(revenue, 1, 0) OVER (ORDER BY month) AS prev_revenue,
    -- month-over-month growth rate
    round(
        (revenue - LAG(revenue) OVER (ORDER BY month))
        / LAG(revenue) OVER (ORDER BY month) * 100,
        2
    ) AS mom_growth_pct,
    -- LEAD: next month's revenue (for forward-looking comparison)
    LEAD(revenue, 1) OVER (ORDER BY month) AS next_revenue
FROM monthly_revenue;
```
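The same LAG pattern runs unchanged on sqlite3, which makes it easy to check the month-over-month arithmetic locally (revenue figures invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly_revenue (month TEXT, revenue REAL)")
con.executemany("INSERT INTO monthly_revenue VALUES (?, ?)",
                [("2024-01", 100.0), ("2024-02", 120.0), ("2024-03", 90.0)])

rows = con.execute("""
    SELECT month, revenue,
           LAG(revenue, 1, 0) OVER (ORDER BY month) AS prev_revenue,
           ROUND((revenue - LAG(revenue) OVER (ORDER BY month))
                 / LAG(revenue) OVER (ORDER BY month) * 100, 2) AS mom_growth_pct
    FROM monthly_revenue
    ORDER BY month
""").fetchall()

# Without a default, LAG is NULL on the first row, so the growth rate is
# NULL there; Feb is (120-100)/100 = +20%, Mar is (90-120)/120 = -25%.
print(rows)
```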
4.4 Aggregate Window Functions
```sql
SELECT
    date,
    daily_sales,
    -- running total (from the first row through the current row)
    SUM(daily_sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sales,
    -- 7-day moving average
    AVG(daily_sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7,
    -- 30-day moving average
    AVG(daily_sales) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS ma30,
    -- current row as a percentage of the grand total
    round(daily_sales / SUM(daily_sales) OVER () * 100, 2) AS pct_of_total
FROM daily_sales;
```
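A runnable miniature of the running-total and moving-average frames via sqlite3 (three invented days, and a 2-day window instead of 7 so the numbers are easy to verify by hand):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_sales (date TEXT, daily_sales REAL)")
con.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30)])

rows = con.execute("""
    SELECT date,
           SUM(daily_sales) OVER (ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative,
           AVG(daily_sales) OVER (ORDER BY date
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS ma2
    FROM daily_sales
    ORDER BY date
""").fetchall()

# cumulative: 10, 30, 60
# 2-day moving average: 10, (10+20)/2 = 15, (20+30)/2 = 25
print(rows)
```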
4.5 The QUALIFY Clause: Filtering Window Results Directly
The QUALIFY clause filters on window function results directly, with no nested subquery. Window functions cannot appear in WHERE, because WHERE is evaluated before they are computed; QUALIFY runs after them. It is not part of standard SQL — DuckDB supports it, as do a few other engines such as Teradata, Snowflake, and BigQuery — and it is one of the most frequently used DuckDB conveniences.
```sql
-- Standard SQL (needs a subquery -- verbose)
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM orders
) WHERE rn = 1;  -- latest order per user

-- DuckDB QUALIFY (one step!)
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) = 1;

-- top 3 salaries per department
SELECT name, department, salary
FROM employees
QUALIFY RANK() OVER (PARTITION BY department ORDER BY salary DESC) <= 3;
```
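The subquery form is the portable fallback. A quick check with Python's sqlite3 (which supports window functions but not QUALIFY) shows it selecting the latest order per user (sample orders invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (user_id INTEGER, created_at TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2024-01-01", 50), (1, "2024-02-01", 70),
    (2, "2024-01-15", 30),
])

# rn = 1 marks each user's most recent order; the outer query keeps only those.
rows = con.execute("""
    SELECT user_id, created_at, amount FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY created_at DESC) AS rn
        FROM orders
    ) WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [(1, '2024-02-01', 70), (2, '2024-01-15', 30)]
```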
4.6 Window Frames: RANGE vs ROWS vs GROUPS
| Frame type | Meaning | Typical use |
|---|---|---|
| ROWS | offsets counted in physical rows; exact | moving averages, sliding windows of a fixed row count |
| RANGE | offsets counted in values; rows with the same sort key are peers | when tied values should be treated identically |
| GROUPS | offsets counted in peer groups | sliding by groups of equal values |
```sql
-- ROWS frame: strictly by row position
SUM(amount) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day

-- RANGE frame: by time range (rows whose date falls within the 7 days before the current row)
SUM(amount) OVER (
    ORDER BY date
    RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) AS rolling_7day_range
```
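The ROWS-vs-RANGE distinction only surfaces when sort keys tie. A sketch with sqlite3 (which also implements ROWS and RANGE frames; data invented): the two d = 2 rows are peers under RANGE, so they receive the same running sum.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (d INTEGER, amount INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, 10), (2, 20), (2, 30), (3, 40)])

rows = con.execute("""
    SELECT d, amount,
           -- ORDER BY d, amount makes the ROWS frame deterministic among ties
           SUM(amount) OVER (ORDER BY d, amount
               ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
           -- under RANGE with ORDER BY d, both d = 2 rows are peers
           SUM(amount) OVER (ORDER BY d
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
    FROM t
    ORDER BY d, amount
""").fetchall()
print(rows)
# rows_sum:  10, 30, 60, 100  -- advances one physical row at a time
# range_sum: 10, 60, 60, 100  -- peers get the same value
```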
4.7 In Practice: User Behavior Analysis
Funnel analysis
```sql
-- E-commerce funnel: view → add to cart → order → payment
WITH funnel AS (
    SELECT
        user_id,
        MAX(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS viewed,
        MAX(CASE WHEN event_type = 'cart' THEN 1 ELSE 0 END) AS carted,
        MAX(CASE WHEN event_type = 'order' THEN 1 ELSE 0 END) AS ordered,
        MAX(CASE WHEN event_type = 'payment' THEN 1 ELSE 0 END) AS paid
    FROM user_events
    WHERE event_date = current_date
    GROUP BY user_id
)
SELECT
    SUM(viewed) AS step1_view,
    SUM(carted) AS step2_cart,
    SUM(ordered) AS step3_order,
    SUM(paid) AS step4_payment,
    round(SUM(carted)::FLOAT / SUM(viewed) * 100, 1) AS view_to_cart_pct,
    round(SUM(ordered)::FLOAT / SUM(carted) * 100, 1) AS cart_to_order_pct,
    round(SUM(paid)::FLOAT / SUM(ordered) * 100, 1) AS order_to_pay_pct
FROM funnel;
```
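The funnel aggregation itself is portable SQL, so it can be smoke-tested with sqlite3 (invented events; the date filter is dropped for brevity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_events (user_id INTEGER, event_type TEXT)")
con.executemany("INSERT INTO user_events VALUES (?, ?)", [
    (1, "view"), (1, "cart"), (1, "order"), (1, "payment"),
    (2, "view"), (2, "cart"),
    (3, "view"),
])

# MAX(CASE ...) turns each user's events into 0/1 flags per funnel step;
# summing the flags counts how many users reached each step.
row = con.execute("""
    WITH funnel AS (
        SELECT user_id,
               MAX(CASE WHEN event_type = 'view'    THEN 1 ELSE 0 END) AS viewed,
               MAX(CASE WHEN event_type = 'cart'    THEN 1 ELSE 0 END) AS carted,
               MAX(CASE WHEN event_type = 'order'   THEN 1 ELSE 0 END) AS ordered,
               MAX(CASE WHEN event_type = 'payment' THEN 1 ELSE 0 END) AS paid
        FROM user_events
        GROUP BY user_id
    )
    SELECT SUM(viewed), SUM(carted), SUM(ordered), SUM(paid) FROM funnel
""").fetchone()
print(row)  # (3, 2, 1, 1): 3 viewed, 2 carted, 1 ordered, 1 paid
```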
N-day retention analysis
```sql
-- Day-1 / day-7 / day-30 retention rates
WITH first_login AS (
    SELECT user_id, MIN(login_date) AS reg_date
    FROM logins
    GROUP BY user_id
),
retention AS (
    SELECT
        f.user_id,
        f.reg_date,
        date_diff('day', f.reg_date, l.login_date) AS day_diff
    FROM first_login f
    JOIN logins l ON f.user_id = l.user_id
)
SELECT
    reg_date,
    COUNT(DISTINCT user_id) AS new_users,
    COUNT(DISTINCT CASE WHEN day_diff = 1 THEN user_id END) AS d1_retained,
    COUNT(DISTINCT CASE WHEN day_diff = 7 THEN user_id END) AS d7_retained,
    COUNT(DISTINCT CASE WHEN day_diff = 30 THEN user_id END) AS d30_retained
FROM retention
GROUP BY reg_date
ORDER BY reg_date;
```
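A miniature of the retention join, using sqlite3's `julianday()` in place of DuckDB's `date_diff()` (two invented users: one returns the next day, one never does):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
con.executemany("INSERT INTO logins VALUES (?, ?)", [
    (1, "2024-01-01"), (1, "2024-01-02"),  # user 1 comes back on day 1
    (2, "2024-01-01"),                     # user 2 never returns
])

# Join every login back to the user's first login date, then count users
# whose login falls exactly 1 day after registration.
rows = con.execute("""
    WITH first_login AS (
        SELECT user_id, MIN(login_date) AS reg_date FROM logins GROUP BY user_id
    )
    SELECT f.reg_date,
           COUNT(DISTINCT f.user_id) AS new_users,
           COUNT(DISTINCT CASE
               WHEN julianday(l.login_date) - julianday(f.reg_date) = 1
               THEN f.user_id END) AS d1_retained
    FROM first_login f
    JOIN logins l ON f.user_id = l.user_id
    GROUP BY f.reg_date
""").fetchall()
print(rows)  # [('2024-01-01', 2, 1)] -> day-1 retention is 1/2 = 50%
```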
Chapter Summary
DuckDB's window functions cover ranking (ROW_NUMBER/RANK/DENSE_RANK), offsets (LAG/LEAD), and aggregation (SUM/AVG ... OVER). The QUALIFY clause — supported by DuckDB, though not part of standard SQL — is convenient syntactic sugar that filters window results directly and avoids nested subqueries. The ROWS/RANGE/GROUPS frame types each have their place. Funnel analysis and retention analysis are two of the most common scenarios in data analytics, and DuckDB's window functions handle both efficiently.