Chapter 7: Multimodal RAG (The Complete ColPali Guide)

End-to-end pipeline

User query
    │
    ▼
ColPali retrieval → top-3 pages (still images)
    │
    ▼
VLM generation: prompt + [query, page1.png, page2.png, page3.png]
    │
    ▼
Structured answer (with page-level citations)

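Claude Vision in practice

import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def image_block(path):
    """Read a PNG from disk and wrap it as a base64 image content block."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": "image/png", "data": data},
    }

def answer_with_colpali(query, top_pages):
    """Send the retrieved page images plus the question to the VLM."""
    # Each item in top_pages is expected to expose an image_path attribute.
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                # Images go first, the instruction and question last.
                *[image_block(p.image_path) for p in top_pages],
                {"type": "text", "text": f"""
Answer using only the {len(top_pages)} page images above, and cite a page number for every fact.
Question: {query}
"""},
            ],
        }],
    )
    return msg.content[0].text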

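Citations: page-level vs. region-level

Page-level citations (simple)

Ask the model in the prompt to emit citations in the "[p.3]" format. Retrieval already returns the page number of every hit, so pass those numbers along with the images and have the model reuse them. This covers roughly 90% of use cases.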
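
The sketch below shows one way to make page-level citations checkable. It is not part of the original pipeline: `label_pages`, `extract_cited_pages`, and `invalid_citations` are hypothetical helper names. The idea is to put a "Page N:" text block in front of each image so the model knows the real page numbers, then pull the "[p.N]" markers out of the answer and flag any that point at a page that was never supplied.

```python
import re

CITATION_RE = re.compile(r"\[p\.(\d+)\]")

def label_pages(image_blocks, page_nums):
    """Interleave a 'Page N:' text block before each image block."""
    content = []
    for block, num in zip(image_blocks, page_nums):
        content.append({"type": "text", "text": f"Page {num}:"})
        content.append(block)
    return content

def extract_cited_pages(answer):
    """Return page numbers cited as [p.N], in order of first appearance."""
    seen = []
    for match in CITATION_RE.finditer(answer):
        num = int(match.group(1))
        if num not in seen:
            seen.append(num)
    return seen

def invalid_citations(answer, retrieved_pages):
    """Return cited page numbers that were not among the retrieved pages."""
    allowed = set(retrieved_pages)
    return [n for n in extract_cited_pages(answer) if n not in allowed]
```

A non-empty result from `invalid_citations` is a cheap signal that the answer cites a page the model never saw, which is worth logging or retrying.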

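Region-level citations (advanced)

Have the VLM output bounding boxes (Claude Opus and Gemini 2 already support this). The front end overlays each box on the original page, so users can see exactly which region the answer came from.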

prompt = """
Answer the question. For each fact cite the page number AND the
bounding box [x1, y1, x2, y2] on that page, normalized to a 0-1000 scale.
Output JSON: {"answer": "...", "citations": [{"page": N, "bbox": [..]}]}
"""

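Before a front end can draw these boxes, the 0-1000 coordinates have to be mapped back to pixels of the rendered page. The following is a minimal sketch of that step, assuming the model returns exactly the JSON shape requested in the prompt. `scale_bbox` and `parse_citations` are illustrative names, and `page_sizes` is an assumed mapping from page number to the rendered image size.

```python
import json

def scale_bbox(bbox, width, height, scale=1000):
    """Convert [x1, y1, x2, y2] on a 0-`scale` grid to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return (
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    )

def parse_citations(raw_json, page_sizes):
    """Parse the model's JSON and attach pixel-space boxes.

    page_sizes maps page number -> (width_px, height_px) of the rendered page.
    Citations that reference an unknown page are dropped.
    """
    payload = json.loads(raw_json)
    citations = []
    for cite in payload.get("citations", []):
        page = cite["page"]
        if page not in page_sizes:
            continue  # the model cited a page that was not supplied
        width, height = page_sizes[page]
        citations.append({"page": page, "bbox_px": scale_bbox(cite["bbox"], width, height)})
    return payload.get("answer", ""), citations
```

In practice the model's output may not always be valid JSON, so wrapping `json.loads` in error handling with a retry is a reasonable addition.
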
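Decoupling the layers: different models for retrieval and generation

Stage	Recommended model	Why
Retrieval	ColPali / ColQwen2	Cheap, specialized, fine-tunable
Generation (short answers)	Claude Haiku / GPT-4o-mini	Fast and cheap, well suited to Q&A
Generation (complex reasoning)	Claude Opus / GPT-4o	Strong multi-image reasoning
Generation (local)	Qwen2.5-VL-72B	Data stays on your own infrastructure (no cross-border transfer)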
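
Because retrieval and generation are decoupled, the generation tier can be chosen per request. The sketch below encodes the table as a small routing function. The function name and its two flags are assumptions for illustration, and the strings are the table's labels rather than real API model identifiers.

```python
# Labels copied from the table above; map them to real model IDs in your own config.
GENERATION_TIERS = {
    "short": "Claude Haiku / GPT-4o-mini",
    "complex": "Claude Opus / GPT-4o",
    "local": "Qwen2.5-VL-72B",
}

def pick_generation_tier(needs_reasoning=False, must_stay_local=False):
    """Choose a generation tier; the data-residency constraint wins over quality."""
    if must_stay_local:
        return GENERATION_TIERS["local"]
    if needs_reasoning:
        return GENERATION_TIERS["complex"]
    return GENERATION_TIERS["short"]
```

Checking the residency flag first means a request that must stay local is never routed to a hosted model, even when it also needs complex reasoning.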

Cost estimate

One query: retrieval at ~$0 + 3 page images as input + 200 generated tokens

Claude Opus 4.7: 3 images ≈ 5000 input tokens, $0.075 input + $0.015 output ≈ $0.09
GPT-4o: $0.075 + $0.006 ≈ $0.08
Haiku 4.5: $0.004 + $0.0008 ≈ $0.005 (under 0.04 RMB per query)
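
The arithmetic behind these figures can be reproduced with a few lines. The per-million-token rates below are back-solved from the numbers above (for example $0.075 / 5000 tokens implies $15 per million input tokens). They are assumptions for illustration, not quoted list prices, so check current pricing before budgeting.

```python
def query_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost in USD of one request, given per-million-token rates."""
    input_cost = input_tokens * usd_per_m_input / 1_000_000
    output_cost = output_tokens * usd_per_m_output / 1_000_000
    return input_cost + output_cost

# Rates implied by the figures in the text (assumed, not authoritative).
opus_cost = query_cost(5000, 200, usd_per_m_input=15.0, usd_per_m_output=75.0)
haiku_cost = query_cost(5000, 200, usd_per_m_input=0.8, usd_per_m_output=4.0)

print(f"{opus_cost:.3f}")   # 0.090
print(f"{haiku_cost:.4f}")  # 0.0048
```

Multiplying the result by expected daily query volume gives a quick sanity check on whether the cheaper tier is worth routing to.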

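Cost optimization
Resize input images to 1024 px on the long edge before sending them. Image token cost grows with resolution (GPT bills per tile, and Claude's token count scales with pixel area), so downscaling oversized pages can save 30% or more. For pages ColPali has already selected, the visual information at that size is still sufficient.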
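
A minimal sketch of the size calculation, assuming only that the aspect ratio should be preserved and that images already within the limit are left untouched. `fit_long_edge` is a hypothetical helper; it only computes the target dimensions and leaves the actual resampling to an imaging library.

```python
def fit_long_edge(width, height, max_edge=1024):
    """Return (width, height) scaled so the longer side is at most max_edge."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height  # never upscale
    ratio = max_edge / long_edge
    # Clamp to 1 px so extreme aspect ratios never yield a zero dimension.
    return max(1, round(width * ratio)), max(1, round(height * ratio))
```

The returned tuple can be passed to Pillow's `Image.resize`, for example on a page rendered at 1700x2200 before it is base64-encoded.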

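Comparison: OCR pipeline vs. visual pipeline

Dimension	OCR → text RAG	ColPali → visual RAG
Indexing time (500 pages)	~10 minutes (mostly OCR)	~12 seconds
Retrieval nDCG@5	0.55	0.82
Citation accuracy in generation	Prone to hallucinated page numbers	Original images fed directly, low chance of miscitation
Table/chart questions	Poor	Excellent
Generation cost	Low (text only)	Medium (image tokens cost more)
Overall cost	High indexing and maintenance cost	Slightly higher per query, low maintenance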
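
For readers who want to reproduce the retrieval row on their own corpus, the following is a small sketch of the standard nDCG@k definition with a log2 rank discount. The function names and argument layout are assumptions for illustration; the source does not say how its scores were computed.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_rels, all_rels, k=5):
    """nDCG@k for one query.

    ranked_rels: relevance grades of the retrieved pages, in ranked order.
    all_rels: relevance grades of every judged page for this query,
              used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_rels, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant page exists for this query
    return dcg_at_k(ranked_rels, k) / ideal
```

A corpus-level score such as those in the table is the mean of `ndcg_at_k` over all evaluation queries.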

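Hybrid mode: visual retrieval + text generation

The most frugal setup:

ColPali retrieves the top-3 pages
OCR those 3 pages on demand (only the hits, with no upfront pass over the whole corpus)
Feed the OCR text together with the original images to the LLM

Indexing incurs zero OCR cost, and at query time only the few retrieved pages are processed, which balances cost and quality.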
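
The three steps above can be sketched as a single function. This is an illustration under stated assumptions: `ocr_fn` stands in for whatever OCR engine is used, `image_block_fn` for a helper that turns an image path into an image content block, and `cache` is a plain dict, so a page that is retrieved again is not OCR'd twice. None of these names come from the original.

```python
def build_hybrid_content(query, hits, ocr_fn, image_block_fn, cache):
    """Build message content with both the page image and its lazily OCR'd text.

    hits: iterable of (page_id, image_path) pairs from the retriever.
    ocr_fn: callable image_path -> str (hypothetical OCR wrapper).
    image_block_fn: callable image_path -> image content block.
    cache: dict reused across queries to avoid repeated OCR of the same page.
    """
    content = []
    for page_id, image_path in hits:
        if page_id not in cache:
            cache[page_id] = ocr_fn(image_path)  # OCR only pages that were retrieved
        content.append(image_block_fn(image_path))
        content.append({
            "type": "text",
            "text": f"OCR text for page {page_id}:\n{cache[page_id]}",
        })
    content.append({"type": "text", "text": f"Question: {query}"})
    return content
```

Keeping the cache outside the function lets frequently hit pages amortize their OCR cost over many queries, which is where most of the savings come from.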

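Quick Streamlit demo

import base64

import streamlit as st
from byaldi import RAGMultiModalModel

@st.cache_resource  # load the index once per server process, not on every rerun
def load_model():
    # Assumes an index built with vidore/colpali-v1.2 and saved under ./index.
    return RAGMultiModalModel.from_index("index", index_root=".")

model = load_model()

q = st.text_input("Question")
if q:
    hits = model.search(q, k=3)
    cols = st.columns(3)
    for col, h in zip(cols, hits):
        # h.base64 is only populated when the index stores the page images.
        col.image(base64.b64decode(h.base64), caption=f"p.{h.page_num} score={h.score:.2f}")
    # answer_with_colpali and its client come from the Claude Vision section.
    # It reads p.image_path, while byaldi hits carry base64 data instead,
    # so attach paths to the hits or adapt image_block before calling it.
    answer = answer_with_colpali(q, hits)
    st.markdown(answer)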
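
Search hits in the demo carry each page as a base64 string rather than a file path, so the image block can be built without touching disk. The sketch below assumes the stored images are PNG (an assumption about how the index was built) and that each hit exposes a `base64` attribute as in the demo. `image_block_from_b64` and `content_from_hits` are hypothetical helper names.

```python
import base64

def image_block_from_b64(b64_png):
    """Wrap an already base64-encoded PNG as an image content block."""
    # Fail early on malformed data instead of sending it to the API.
    base64.b64decode(b64_png, validate=True)
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": "image/png", "data": b64_png},
    }

def content_from_hits(query, hits):
    """Images first, then the instruction and question, mirroring answer_with_colpali."""
    content = [image_block_from_b64(h.base64) for h in hits]
    content.append({
        "type": "text",
        "text": (
            f"Answer using only the {len(hits)} page images above, "
            f"and cite a page number for every fact.\nQuestion: {query}"
        ),
    })
    return content
```

The returned list can be used as the `content` of the user message in a `client.messages.create` call.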

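Chapter summary

End-to-end visual RAG: ColPali retrieval + a VLM that reads the page images directly
Skipping OCR markedly improves answers to table and chart questions
Citations default to page level; bounding-box region localization is the advanced option
Use ColPali for retrieval and pick Haiku, Opus, or Qwen-VL for generation according to budget
Hybrid mode: OCR only the retrieved pages to balance cost and quality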

  
