Chapter 4: Multi-Vector Indexing - The Complete ColPali Guide

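Problem scale

Storage: 1 million pages × 1024 vectors/page ≈ 1 billion vectors; at 128 dimensions × 2 bytes each, that is about 256 GB.
Query: 20 tokens × 1 billion vectors of brute-force MaxSim is unacceptable.

Two things are needed:

Physical storage that can hold 1 billion 128-dimensional vectors and still support random access
A scoring path that approximates MaxSim instead of computing it exhaustively in high-dimensional space (two-stage retrieval)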
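For reference, the MaxSim score used throughout this chapter can be sketched in a few lines. This is a minimal pure-Python illustration, not how any of the engines below implement it; the function names `dot` and `maxsim` and the toy vectors are assumptions made for the example. For each query token it takes the best dot product over all page patches, then sums those maxima.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token take the best-matching
    page patch, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy example (hypothetical data): 2 query tokens, 3 page patches, 2 dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
page = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
score = maxsim(query, page)
```

Scoring one page costs (query tokens × patches) dot products, which is why running this over every page in a 1-billion-vector corpus is not practical.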

Option 1: Qdrant Multivector

Qdrant 1.10+ supports multivectors natively: each record stores a matrix, the query supplies a matrix, and the engine computes MaxSim automatically.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    MultiVectorComparator,
    MultiVectorConfig,
    PointStruct,
    VectorParams,
)

client = QdrantClient("localhost", port=6333)

client.create_collection(
    "colpali-pages",
    vectors_config=VectorParams(
        size=128,
        distance=Distance.COSINE,
        multivector_config=MultiVectorConfig(
            comparator=MultiVectorComparator.MAX_SIM,
        ),
    ),
)

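# Write: each page's 1024 vectors are stored as one record.
# page_vecs, doc_id, page_num, image_b64 and N are assumed to come from earlier steps.
client.upsert(
    "colpali-pages",
    points=[
        PointStruct(
            id=i,
            # Shape (1024, 128), converted to a nested list for PointStruct
            vector=page_vecs[i].cpu().float().numpy().tolist(),
            payload={"doc": doc_id, "page": page_num, "image": image_b64},
        ) for i in range(N)
    ],
)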

# Query
results = client.query_points(
    "colpali-pages",
    query=q_vecs.squeeze().cpu().float().numpy(),   # (n_tokens, 128)
    limit=5,
).points

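Pros

Intuitive API, Rust implementation, and Binary Quantization/Scalar Quantization available out of the box, which together save 4x (scalar) to 32x (binary) of space relative to float32.

Cons

A pure full-scan MaxSim is still expensive; for high-recall scenarios, combine it with the two-stage approach described below.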

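Option 2: Vespa Tensor + ColBERT ranking

Vespa is a long-established search engine (originally from Yahoo), and its ColPali integration builds on the tensor ranking it already offered for ColBERT. Its advantage is built-in two-stage retrieval: coarse filtering + MaxSim re-ranking.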

# schemas/page.sd
schema page {
  document page {
    field embedding type tensor<bfloat16>(patch{}, v[128]) {
      indexing: attribute
    }
    field quant_emb type tensor<int8>(patch{}, v[16]) {
      indexing: attribute | index
    }
  }
  rank-profile colpali {
    inputs {
      query(qt) tensor<float>(querytoken{}, v[128])
    }
    first-phase {
      # Coarse pass: MaxSim against the binary embedding, unpacked to 128 bits per patch
      expression: sum(reduce(sum(query(qt) * unpack_bits(attribute(quant_emb)), v), max, patch), querytoken)
    }
    second-phase {
      # Re-ranking: full-precision MaxSim over the top candidates
      expression: sum(reduce(sum(query(qt) * attribute(embedding), v), max, patch), querytoken)
      rerank-count: 100
    }
  }
}

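This schema stores two embeddings per patch: binary (16 bytes/patch) and bfloat16 (256 bytes/patch). The first phase is a very fast coarse filter, and the second phase computes full MaxSim on the top 100 candidates to re-rank them.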
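The control flow of such a two-stage search can be sketched in plain Python. This is an illustration under stated assumptions, not Vespa's implementation: the names `binarize` and `two_stage_search`, the dictionary-based indexes, and the toy data are all hypothetical, and stage 1 here still scans every page, whereas a real engine would use packed bits and an approximate index.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def maxsim(query_vecs: Matrix, page_vecs: Matrix) -> float:
    """Sum over query tokens of the best dot product against any patch."""
    return sum(
        max(sum(x * y for x, y in zip(q, p)) for p in page_vecs)
        for q in query_vecs
    )


def binarize(page_vecs: Matrix) -> List[List[float]]:
    """Keep only the sign of each dimension: 1.0 if positive, else 0.0."""
    return [[1.0 if x > 0 else 0.0 for x in p] for p in page_vecs]


def two_stage_search(
    query_vecs: Matrix,
    full_index: Dict[str, Matrix],
    binary_index: Dict[str, Matrix],
    rerank_count: int = 100,
    limit: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap coarse score for every page, using binarized patches.
    candidates = sorted(
        binary_index,
        key=lambda pid: maxsim(query_vecs, binary_index[pid]),
        reverse=True,
    )[:rerank_count]
    # Stage 2: exact MaxSim only on the surviving candidates.
    reranked = sorted(
        ((pid, maxsim(query_vecs, full_index[pid])) for pid in candidates),
        key=lambda item: item[1],
        reverse=True,
    )
    return reranked[:limit]


# Hypothetical toy index: 3 pages, 2 patches each, 2 dimensions.
full_index = {
    "a": [[0.9, 0.1], [-0.2, 0.8]],
    "b": [[-0.7, -0.3], [-0.1, -0.9]],
    "c": [[0.6, -0.4], [0.3, 0.2]],
}
binary_index = {pid: binarize(vecs) for pid, vecs in full_index.items()}
query = [[1.0, 0.0], [0.0, 1.0]]
hits = two_stage_search(query, full_index, binary_index, rerank_count=2, limit=1)
```

The expensive full-precision scoring runs on at most `rerank_count` pages, so its cost no longer grows with corpus size.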

Option 3: Weaviate 1.28+

Weaviate added multivector support in early 2025. Its API resembles Qdrant's and is friendly to Python developers:

import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

client.collections.create(
    "ColPaliPage",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.hnsw(
        multi_vector=Configure.VectorIndex.MultiVector.multi_vector(),
        quantizer=Configure.VectorIndex.Quantizer.rq(),   # Rotational Quantization
    ),
    properties=[
        Property(name="doc", data_type=DataType.TEXT),
        Property(name="page", data_type=DataType.INT),
    ],
)

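Side-by-side comparison

Dimension	Qdrant	Vespa	Weaviate
Multivector support	✅ Native (1.10+)	✅ Earliest	✅ 1.28+
Binary quantization	✅	✅	RQ quantization
Two-stage retrieval	Must be assembled manually	✅ Built in	Partial
Learning curve	Gentle	Steep	Moderate
Deployment complexity	Single Docker command	Requires ZooKeeper and other components	Single Docker command
Recommended for	< 10 million pages, fast launch	> 100 million pages, maximum performance	TypeScript ecosystem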

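The truth about storage cost

Precision	Bytes per patch	Per page (1024 patches)	Total for 1 million pages
float32	512	512 KB	512 GB
bfloat16	256	256 KB	256 GB
int8 (scalar quantization)	128	128 KB	128 GB
binary (after quantization)	16	16 KB	16 GB
binary + token pooling (next chapter)	16	1-4 KB	1-4 GB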
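The first four rows of the table can be reproduced with a small calculator. This is a sketch; the constant and function names are assumptions for illustration. Note that the table rounds 1 million pages × 1024 patches to 1 billion vectors, so the exact totals computed here come out about 2.4% higher.

```python
PATCHES_PER_PAGE = 1024
DIMS = 128

# Bits stored per dimension at each precision.
BITS_PER_DIM = {"float32": 32, "bfloat16": 16, "int8": 8, "binary": 1}


def bytes_per_patch(precision: str) -> int:
    """Bytes needed for one 128-dimensional patch vector."""
    return DIMS * BITS_PER_DIM[precision] // 8


def page_kib(precision: str) -> float:
    """Storage for one page of 1024 patches, in KiB."""
    return bytes_per_patch(precision) * PATCHES_PER_PAGE / 1024


def total_gb(precision: str, pages: int) -> float:
    """Raw vector storage for the whole corpus, in decimal GB."""
    return bytes_per_patch(precision) * PATCHES_PER_PAGE * pages / 1e9


for name in BITS_PER_DIM:
    print(name, bytes_per_patch(name), page_kib(name), round(total_gb(name, 1_000_000), 1))
```

These figures cover raw vectors only; payloads, graph indexes, and replication are not included.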

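How much does binarization cost?
Measured nDCG@5 drops from 0.82 to 0.79 (ViDoRe average), which is acceptable for most workloads, in exchange for a 16x storage saving relative to bfloat16. Qdrant supports binary quantization out of the box, so enabling it is recommended.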
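To make the 16 bytes per patch concrete, here is a minimal sketch of sign-based binary quantization. It assumes a threshold at zero, a common convention that individual engines may not follow exactly; `pack_binary`, `hamming`, and the sample vector are hypothetical names and data for illustration.

```python
from typing import Sequence


def pack_binary(vec: Sequence[float]) -> bytes:
    """Sign-binarize a float vector and pack 8 dimensions into each byte."""
    if len(vec) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(vec), 8):
        byte = 0
        for x in vec[start:start + 8]:
            # Shift in one bit per dimension: 1 if positive, else 0.
            byte = (byte << 1) | (1 if x > 0 else 0)
        out.append(byte)
    return bytes(out)


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two packed vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical 128-dimensional vector built from a repeating 8-value pattern.
vec = [0.3, -0.1, 0.7, -0.9, 0.2, 0.5, -0.4, -0.6] * 16
packed = pack_binary(vec)
flipped = pack_binary([-x for x in vec])
distance = hamming(packed, flipped)
```

A 128-dimensional vector becomes 16 bytes, and comparing two packed vectors reduces to XOR plus a bit count, which is what makes the coarse stage cheap.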

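Real-world performance numbers

Qdrant on 4 vCPUs + 32 GB RAM, with binary quantization and a 1-million-page index:

Index writes: ~400 pages/s (with batched upserts)
Query latency: P50 18 ms, P99 45 ms
Memory usage: ~18 GB (hot data resident)
Disk: 22 GB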

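Chapter summary

ColPali produces 1024 vectors per page, so a multivector-capable database is required
Qdrant: low barrier to entry, recommended below 10 million pages
Vespa: native two-stage retrieval, for 100-million-page scale and beyond
Weaviate: friendly to the TypeScript ecosystem, an emerging option
Binary quantization is a switch worth always turning on: 16x space compression for a drop of only 3 points in nDCG@5
In production: Qdrant + binary, 1 million pages in 22 GB on disk, P99 45 ms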

  
