Chapter 05

Fine-Tuning: Another 10+ Points on Domain Documents

The generic ColPali is strong on ViDoRe, but your drug package inserts, shipbuilding blueprints, and court rulings were never in its training set. This chapter walks through the full loop: synthesizing queries, mining hard negatives, and fine-tuning with LoRA to lift nDCG@5 by 10+ points at low cost.

When to Fine-Tune

The generic model's recall is below 60% in your evaluation
First build ~200 human-annotated queries and compute nDCG@5 (see the sketch after this list). A score below 0.6 signals a clear domain gap, leaving significant headroom for fine-tuning.
Terminology, figures, and layout are highly industry-specific
Drug molecular formulas, three-view ship drawings, vertically typeset classical texts: the generic PaliGemma has never seen any of them.
Query style is idiosyncratic
Legal users like to search by case number or attorney name; a generic embedding gravitates toward misleading pages that merely share keywords.
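
A minimal nDCG@5 sketch for that first check, assuming binary relevance with exactly one gold page per query; model_top5, gold, and annotated_queries are hypothetical lookups built from your 200 annotations:

import numpy as np

def ndcg_at_5(ranked_page_ids, gold_page_id):
    # With a single relevant page the ideal DCG is 1/log2(2) = 1, so
    # nDCG@5 reduces to 1/log2(rank + 1) when the gold page is in the top 5.
    for rank, page_id in enumerate(ranked_page_ids[:5], start=1):
        if page_id == gold_page_id:
            return 1.0 / np.log2(rank + 1)
    return 0.0

scores = [ndcg_at_5(model_top5[q], gold[q]) for q in annotated_queries]
print(f"nDCG@5 = {np.mean(scores):.3f}")  # below 0.6: fine-tuning pays off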

Step 1: Synthesize Training Data

No labeled data on hand? Use a strong VLM (Qwen2.5-VL-72B / Gemini / Claude) to auto-generate 3-5 plausible queries per page:

import json

from openai import OpenAI

client = OpenAI()

prompt_template = """
You are given a document page image. Generate 5 diverse realistic search queries
that a user would type to find this exact page. Mix:
- factoid queries about specific numbers/terms on the page
- conceptual queries about the topic
- visual queries about charts/tables/figures
- short (3-5 words) and long (15-25 words) variants
Respond only with JSON array of strings.
"""

def gen_queries(image_b64):
    # Ask the VLM for diverse candidate queries for one page image.
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt_template},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(r.choices[0].message.content)

10K document pages × 5 queries = 50K training pairs; a GPT-4o run finishes overnight for roughly $100, and the data quality is far higher than BM25-mined pairs.
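
A minimal driver over a folder of page images, assuming one PNG per page in a hypothetical pages/ directory; it produces the (query, gold page id) pairs consumed in Step 2:

import base64
from pathlib import Path

train_pairs = []
for png in sorted(Path("pages/").glob("*.png")):
    image_b64 = base64.b64encode(png.read_bytes()).decode()
    for q in gen_queries(image_b64):
        train_pairs.append((q, png.stem))  # (query text, page id)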

Step 2: Mine Hard Negatives

In-batch negatives are not hard enough for contrastive learning: the model trivially tells "a page about earnings reports" from "a page about cats". What we need are fine-grained negatives like "the 2023 earnings report" versus "the 2024 earnings report":

import random

def mine_hard_negatives(train_pairs, base_model):
    # Retrieve top-50 for each query with the current (pre-fine-tuning) model,
    # then randomly sample up to 5 hard negatives from ranks 5..50.
    for q, pos_page_id in train_pairs:
        hits = base_model.search(q, k=50)
        candidates = [h for h in hits[4:] if h.page_id != pos_page_id]
        yield {"q": q, "pos": pos_page_id,
               "negs": random.sample(candidates, k=min(5, len(candidates)))}
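
One caveat: ranks 5..50 can hide false negatives, pages that genuinely answer the query, and training against them hurts. A common mitigation, sketched here under the assumption that each hit carries the model's retrieval score (the 0.95 margin is likewise an assumption, not a colpali_engine API):

def drop_likely_false_negatives(candidates, positive_score, margin=0.95):
    # A candidate scoring within 95% of the gold page may be genuinely
    # relevant; excluding it keeps the negative set clean.
    return [h for h in candidates if h.score < margin * positive_score]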

Step 3: LoRA PEFT Training

Full-parameter fine-tuning of PaliGemma-3B needs 40 GB+ of VRAM. LoRA trains only the adapters, so a single 24 GB GPU is enough:

import torch

from transformers import TrainingArguments
from peft import LoraConfig, get_peft_model
from colpali_engine.models import ColPali
from colpali_engine.loss import ColbertPairwiseCELoss
from colpali_engine.trainer import ContrastiveTrainer

model = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16)
lora_config = LoraConfig(
    r=32, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="FEATURE_EXTRACTION",
)
model = get_peft_model(model, lora_config)
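model.print_trainable_parameters()  # sanity check: only the LoRA adapters should be trainable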

trainer = ContrastiveTrainer(
    model=model,
    args=TrainingArguments(
        output_dir="./colpali-lora-medical",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        warmup_steps=100,
        num_train_epochs=3,
        bf16=True,
        remove_unused_columns=False,
    ),
    train_dataset=train_ds,  # records yielded by mine_hard_negatives in Step 2
    loss_func=ColbertPairwiseCELoss(),
)
trainer.train()
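
After training, persist just the adapter; save_pretrained on a PEFT model writes only the LoRA weights, which is exactly what the deployment section below loads:

model.save_pretrained("./colpali-lora-medical")  # adapter only, ~100 MB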

Step 4: Evaluation

from colpali_engine.evaluation import CustomEvaluator

evaluator = CustomEvaluator(is_multi_vector=True)
metrics = evaluator.evaluate(
    qs=query_embeddings,
    ps=page_embeddings,
    relevant_docs=qrels,
)
print(metrics)
# {'ndcg@5': 0.89, 'mrr@5': 0.85, 'recall@10': 0.93}
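
The qrels come almost for free from Step 1: each synthetic query's gold page is the page it was generated from. A sketch under two assumptions, the query-id scheme and a 10% held-out split (never evaluate on pairs the model trained on):

# Hold out 10% of the synthetic pairs before training; the gold page for
# each query is the page it was generated from in Step 1.
eval_pairs = train_pairs[: len(train_pairs) // 10]
qrels = {f"q{i}": {page_id: 1} for i, (_, page_id) in enumerate(eval_pairs)}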

Typical Gains

Domain                               Baseline nDCG@5   Fine-tuned   Gain
Medical (drug package inserts)       0.67              0.83         +16 pt
Legal (court rulings)                0.72              0.89         +17 pt
Industrial (manufacturing manuals)   0.58              0.78         +20 pt
Chinese financial reports            0.65              0.81         +16 pt

Common Pitfalls

Deploying the LoRA Adapter

# Production loading
import torch

from colpali_engine.models import ColPali
from peft import PeftModel

base = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "./colpali-lora-medical")
model = model.merge_and_unload()      # merge adapter into base weights for faster inference

The LoRA adapter is only ~100 MB, so it can be hosted on S3. In multi-tenant setups, each customer gets their own adapter while all of them share the base model.
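
A sketch of that multi-tenant pattern, assuming adapters were already synced from S3 into a local ./adapters/ cache. Note the trade-off: hot-swapping requires leaving adapters unmerged, so you give up the merge_and_unload speedup:

# One shared 3B base in GPU memory; per-customer adapters are hot-swapped.
model = PeftModel.from_pretrained(base, "./adapters/customer-a",
                                  adapter_name="customer-a")
model.load_adapter("./adapters/customer-b", adapter_name="customer-b")

model.set_adapter("customer-a")   # serve customer A's queries
model.set_adapter("customer-b")   # switch tenants without reloading the base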

Chapter Summary