Chapter 6 · Vector Store · From SimpleVectorStore to Qdrant

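1. Do you need a "dedicated" vector store?

Not every project needs Qdrant or Weaviate. Choose by data scale:

Scale	Recommendation	Rationale
< 10k vectors	SimpleVectorStore (in-memory)	Tens of MB of RAM at most; enough for local development and small assistants
10k-100k	Chroma / FAISS (local)	Single-machine persistence, zero ops
100k-10M	Qdrant / pgvector / Weaviate	Dedicated vector stores with HNSW indexes and strong filtering
10M+ / multi-tenant	Milvus / Qdrant Cloud / Pinecone	Sharding, replicas, IVF indexes and quantization
Already on Postgres	pgvector	Transactional consistency, and no extra service to operate 🔥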
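
To make the scale tiers concrete, here is a rough back-of-the-envelope sketch. It assumes float32 embeddings and counts only the raw vectors (no index, metadata, or text), so real memory use will be higher. The function names and the 1536-dimension figure are illustrative choices, not part of any library.

```python
def raw_vector_memory_mib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw embedding storage in MiB (float32 by default), excluding index and metadata overhead."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)


def scale_tier(num_vectors: int) -> str:
    """Map a vector count onto the tiers from the table above."""
    if num_vectors < 10_000:
        return "SimpleVectorStore"
    if num_vectors < 100_000:
        return "Chroma / FAISS"
    if num_vectors < 10_000_000:
        return "Qdrant / pgvector / Weaviate"
    return "Milvus / Qdrant Cloud / Pinecone"


# Print the raw footprint at each tier boundary for 1536-dimensional embeddings
for n in (10_000, 100_000, 10_000_000):
    mib = raw_vector_memory_mib(n, 1536)
    print(f"{n:>10,} vectors x 1536 dims = {mib:,.0f} MiB -> {scale_tier(n)}")
```

The boundaries are soft: embedding dimension, metadata size, and query load all shift where a given store stops being comfortable.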

2. SimpleVectorStore: the quiet default

from llama_index.core import VectorStoreIndex

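# `docs` is a list of Document objects loaded earlier (e.g. via a reader)
index = VectorStoreIndex.from_documents(docs)
# The default backend is SimpleVectorStore, held entirely in memory

# Persist to JSON files
index.storage_context.persist("./storage")

# Load it back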
from llama_index.core import load_index_from_storage, StorageContext
index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage"))

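Good for tutorials, prototypes, and small CLI tools. Drawbacks: the whole store must fit in memory, startup reads the full JSON, and filtering scans every record. Past roughly 50k vectors it starts to struggle.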
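
Why the cost grows linearly can be seen in a small conceptual sketch of brute-force retrieval. This is not SimpleVectorStore's actual implementation, only a stand-in for the idea: every query touches every record, and the metadata filter is just another check inside the same loop. The record layout `(id, vector, metadata)` and the `where` callback are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_top_k(query, records, k=2, where=None):
    """Score every record against the query: O(n) work per query, with or without a filter."""
    scored = []
    for record_id, vector, metadata in records:
        if where is not None and not where(metadata):
            continue  # the filter is evaluated record by record, not via an index
        scored.append((cosine(query, vector), record_id))
    scored.sort(reverse=True)
    return [record_id for _, record_id in scored[:k]]


records = [
    ("a", [1.0, 0.0], {"year": 2025}),
    ("b", [0.9, 0.1], {"year": 2024}),
    ("c", [0.0, 1.0], {"year": 2025}),
]
print(brute_force_top_k([1.0, 0.0], records))
print(brute_force_top_k([1.0, 0.0], records, where=lambda m: m["year"] == 2025))
```

Dedicated stores avoid this full scan with an approximate index such as HNSW plus indexes on the metadata fields.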

3. Qdrant: a front-runner among open-source vector stores

docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant:latest

pip install llama-index-vector-stores-qdrant qdrant-client

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, AsyncQdrantClient

client       = QdrantClient(url="http://localhost:6333")
async_client = AsyncQdrantClient(url="http://localhost:6333")

vs = QdrantVectorStore(
    client=client,
    aclient=async_client,           # async client, needed for the async query path
    collection_name="docs_v1",
    enable_hybrid=True,             # sparse + dense hybrid search
    batch_size=64,
)

storage_context = StorageContext.from_defaults(vector_store=vs)
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

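Why Qdrant is the first choice:

Written in Rust; throughput and latency are in the top tier
Native HNSW plus scalar/product quantization (scalar int8 gives about 4x memory compression)
Hybrid search (dense + sparse, e.g. BM25) has been supported out of the box since 2024
Metadata filters use payload indexes and are evaluated during the HNSW search itself, so filtering and retrieval happen in one query; pgvector's approximate indexes, by contrast, apply the filter after the index scan and can return fewer rows than requested
Qdrant Cloud has a free tier; production can run on Cloud or self-hosted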

4. pgvector: if you already run Postgres, don't hesitate

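# pgvector is a separate extension, not bundled with Postgres itself;
# install it on the server (many managed Postgres services ship it), then run:
# CREATE EXTENSION vector;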

pip install llama-index-vector-stores-postgres

from llama_index.vector_stores.postgres import PGVectorStore

vs = PGVectorStore.from_params(
    database="mydb",
    host="localhost", port=5432,
    user="postgres", password="pw",
    table_name="llama_docs",
    embed_dim=1536,
    hnsw_kwargs={"hnsw_m": 16, "hnsw_ef_construction": 64, "hnsw_ef_search": 40},
    hybrid_search=True,          # pgvector + tsvector full-text search combined
    text_search_config="english",
)

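pgvector strengths: it lives in the same database as your business data, so you get transactional consistency (an order row and its vector can be updated in the same transaction), one backup, and one ops dashboard. Performance is sufficient up to about 5 million vectors; if you can avoid adding a new component, do.
pgvector weaknesses: HNSW indexes require pgvector 0.5.0 or later (this depends on the extension version, not on Postgres 16); under extreme throughput/QPS it sits a tier below Qdrant.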

5. Weaviate / Milvus / Chroma and others: second choices

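Option	Strengths	Weaknesses	Best for
Weaviate	GraphQL queries, modular (built-in embedding modules)	Memory-hungry	Teams that already operate Weaviate
Milvus	Very large scale (1B+), full range of IVF/PQ algorithms	Complex operations (etcd + S3 + Pulsar)	Tens to hundreds of millions of vectors
Chroma	Simplest local setup, SQLite-backed	Poor performance at large scale	POCs / desktop apps
Pinecone	Zero-ops SaaS	Vendor lock-in and cost	Small teams shipping quickly
FAISS	Algorithm library, extremely fast	No server features, no metadata filtering	Research, offline evaluation
Elasticsearch 8+	Reuses existing ES expertise, full-text + vector	Vectors are not its strong suit	Logging/search stacks already on ES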

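6. MetadataFilters: adding conditions to vector retrieval

Vector retrieval plus "only 2025 documents, category = product manual, not confidential": this is exactly the use case for MetadataFilters: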

from llama_index.core.vector_stores import (
    MetadataFilters, MetadataFilter, FilterOperator, FilterCondition
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="year", value=2025, operator=FilterOperator.EQ),
        MetadataFilter(key="category", value="manual", operator=FilterOperator.EQ),
        MetadataFilter(key="confidential", value=True, operator=FilterOperator.NE),
    ],
    condition=FilterCondition.AND,
)

qe = index.as_query_engine(filters=filters, similarity_top_k=5)
ans = qe.query("How do I configure two-factor authentication?")

Common operators

Operator	Meaning	Example
`EQ / NE`	Equal / not equal	`year == 2025`
`GT / GTE / LT / LTE`	Numeric comparison	`updated_at > 1700000000`
`IN / NIN`	In / not in a set	`tag IN ["hr", "legal"]`
`CONTAINS`	List field contains a value	`tags CONTAINS "urgent"`
`TEXT_MATCH`	Full-text string match	Supported by some backends (e.g. Qdrant)

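Backend differences: vector stores do not all support the same operators. SimpleVectorStore evaluates filters in memory (broad coverage, but slow), Qdrant and pgvector cover the mainstream ones, and FAISS does not support filtering at all. Confirm your filter requirements before choosing a store.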
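
As a mental model for what the operators mean, the sketch below evaluates a filter list against a plain metadata dict in pure Python. It is an illustration only, not how LlamaIndex or any backend implements filtering; the `(key, operator, value)` tuple layout and the `matches` helper are hypothetical names invented for this example.

```python
# Each operator takes (field value from metadata, value from the filter)
OPS = {
    "EQ": lambda field, value: field == value,
    "NE": lambda field, value: field != value,
    "GT": lambda field, value: field is not None and field > value,
    "GTE": lambda field, value: field is not None and field >= value,
    "LT": lambda field, value: field is not None and field < value,
    "LTE": lambda field, value: field is not None and field <= value,
    "IN": lambda field, value: field in value,
    "NIN": lambda field, value: field not in value,
    "CONTAINS": lambda field, value: isinstance(field, list) and value in field,
}


def matches(metadata, filters, condition="AND"):
    """Return True if the metadata satisfies the filters under AND / OR."""
    results = [OPS[op](metadata.get(key), value) for key, op, value in filters]
    return all(results) if condition == "AND" else any(results)


doc = {"year": 2025, "category": "manual", "confidential": False, "tags": ["urgent"]}
filters = [
    ("year", "EQ", 2025),
    ("category", "EQ", "manual"),
    ("confidential", "NE", True),
]
print(matches(doc, filters))
print(matches({**doc, "year": 2024}, filters))
print(matches({**doc, "year": 2024}, filters, condition="OR"))
```

One detail worth checking per backend is how a missing field behaves: in this sketch a document without a `confidential` key passes the `NE True` test, which may or may not match what a given store does.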

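7. Hybrid search: sparse + dense

Pure vector retrieval tends to fail on exact strings such as "product model XJ-2024" or "CVE-2024-3094". The fix: run BM25 (sparse) and vector (dense) retrieval together and fuse the results.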

# Supported natively by Qdrant
vs = QdrantVectorStore(
    client=client, aclient=async_client,
    collection_name="docs",
    enable_hybrid=True,                    # just switch it on
    fastembed_sparse_model="Qdrant/bm25",   # or a SPLADE model
)

# Or use LlamaIndex's generic QueryFusionRetriever (works with any vector store)
# Requires: pip install llama-index-retrievers-bm25
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=5)
# `nodes` is the list of parsed nodes the index was built from
bm25_retriever   = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

fusion = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,               # 1 disables query rewriting
    mode="reciprocal_rerank",   # alternatives: dist_based_score / relative_score
)
# Use a separate name so the source `nodes` list is not overwritten
results = fusion.retrieve("CVE-2024-3094 remediation")

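Fusion modes:

reciprocal_rerank (RRF): sums reciprocal ranks, so it is insensitive to differing score scales; the most common and most stable choice
relative_score: min-max normalizes each retriever's scores, then combines them with weights
dist_based_score: normalizes using the mean and standard deviation of each retriever's scores; best suited to homogeneous retrievers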
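
Reciprocal rank fusion is simple enough to sketch in a few lines. The version below is a minimal standalone illustration, not LlamaIndex's code: each document receives 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. The constant k = 60 is a commonly used default and is an assumption here, as are the document IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked ID lists by summing 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


vector_hits = ["d1", "d2", "d3"]  # ranked by embedding similarity
bm25_hits = ["d3", "d1", "d4"]    # ranked by keyword score
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Because only ranks enter the formula, a BM25 score of 17.3 and a cosine similarity of 0.82 never have to be put on the same scale, which is why this mode is robust across heterogeneous retrievers.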

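8. Performance tuning

HNSW parameters

m: number of connections per node. 16 is a solid default; 32 is more accurate but slower and uses more memory
ef_construction: neighbor search width while building the graph, commonly 64-128; larger values slow index builds but improve recall
ef_search: search width at query time, 40-128; it can be adjusted per query, larger for accuracy, smaller for speed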

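Quantization (essential at scale)

Scalar quantization: float32 → int8, 4x compression with little accuracy loss; Qdrant's recommended default
Product quantization (PQ): more aggressive compression (16-32x) with a noticeable accuracy loss but large memory savings; for the hundred-million scale
Binary quantization: 1 bit per dimension, 32x; only works well with embedding models that tolerate it (reportedly Cohere and OpenAI v3 models)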
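
The 4x figure for scalar quantization follows from replacing each 4-byte float32 with a 1-byte int8. The sketch below shows the idea on a single vector using a simple min/max range. This is a simplification for illustration: production engines typically calibrate the range over the whole collection, and the function names here are made up for the example.

```python
def quantize_int8(vec):
    """Map floats onto int8 codes in [-128, 127] using the vector's own min/max range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((x - lo) / scale) - 128 for x in vec]
    return codes, lo, scale


def dequantize_int8(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [(code + 128) * scale + lo for code in codes]


vec = [0.12, -0.5, 0.33, 0.9]
codes, lo, scale = quantize_int8(vec)
restored = dequantize_int8(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes)
print(f"max reconstruction error: {max_error:.5f} (bounded by scale / 2 = {scale / 2:.5f})")
```

The reconstruction error per component is at most half a quantization step, which is why recall usually drops only slightly while memory falls by a factor of four.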

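Batch inserts

# Don't insert one at a time; per-request latency will kill you
# Batch size is set on the store: QdrantVectorStore(..., batch_size=100)  (100-500 is common)
index.insert_nodes(nodes)              # embeds the nodes and upserts them in batches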
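
If you drive inserts yourself (for example to log progress or retry a failed batch), a small chunking helper is usually all that is needed. The sketch below is generic Python, independent of any vector store; `batched` is a name chosen for this example, and the commented loop at the end is a hypothetical usage.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sizes = [len(chunk) for chunk in batched(list(range(250)), 100)]
print(sizes)

# Hypothetical usage with an existing index and node list:
# for chunk in batched(nodes, 100):
#     index.insert_nodes(chunk)
```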

9. Async and concurrency

import asyncio

async def query_all(questions):
    qe = index.as_query_engine(similarity_top_k=5, use_async=True)
    tasks = [qe.aquery(q) for q in questions]
    return await asyncio.gather(*tasks)

answers = asyncio.run(query_all(["Question 1", "Question 2", "Question 3"]))

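Key point: pass aclient (QdrantVectorStore accepts one, and several other stores have an equivalent). Without it the async path does not give you real non-blocking I/O: depending on the store it either fails or falls back to synchronous calls. Real concurrency comes from an async client plus aquery.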
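
Unbounded `gather` over hundreds of questions can overwhelm the vector store or hit LLM rate limits, so a concurrency cap is often useful. The sketch below uses only the standard library; `fake_query` is a stand-in for something like `qe.aquery`, and the limit of 2 is an arbitrary value for the demonstration.

```python
import asyncio


async def fake_query(question: str) -> str:
    """Stand-in for an async query call; sleeps briefly to simulate I/O."""
    await asyncio.sleep(0.01)
    return f"answer to {question}"


async def query_all_bounded(questions, query_fn, max_concurrency=8):
    """Run query_fn over all questions with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(question):
        async with semaphore:
            return await query_fn(question)

    # gather returns results in the same order as the input questions
    return await asyncio.gather(*(run_one(q) for q in questions))


answers = asyncio.run(query_all_bounded(["q1", "q2", "q3"], fake_query, max_concurrency=2))
print(answers)
```

In a real pipeline, `query_fn` would be the query engine's async method, and the cap would be tuned against the store's connection limits and the LLM provider's rate limits.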

10. Migrating from SimpleVectorStore to Qdrant

# 1. Pull the nodes out of the old index's docstore
from llama_index.core import VectorStoreIndex, load_index_from_storage, StorageContext
old = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage_old"))
nodes = list(old.docstore.docs.values())   # original nodes (text + metadata; embeddings may be absent)

# 2. Create a new Qdrant collection and load the nodes into it
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

vs = QdrantVectorStore(client=QdrantClient(url="http://localhost:6333"), collection_name="docs_v2")
storage_context = StorageContext.from_defaults(vector_store=vs)

# Nodes without an embedding are re-embedded at this step
new = VectorStoreIndex(nodes=nodes, storage_context=storage_context)
new.storage_context.persist("./storage_new")

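If the old nodes carry no embeddings (text only), they have to be re-embedded. Check in advance whether the docstore nodes hold vectors; if not, plan for a full re-run, which for tens of thousands of vectors typically takes a few minutes and costs a few dollars at most.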

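11. Production launch checklist

✅ The vector store has its own persistent volume and scheduled backups
✅ HNSW parameters have been tuned deliberately (e.g. m=16, ef=64+) rather than left at defaults
✅ Metadata fields used in filters have a payload index (Qdrant) or B-tree index (Postgres), so filtering is fast
✅ Scalar quantization is enabled at large scale (1M+ vectors)
✅ Async client + batch inserts
✅ Monitoring: vector count, collection size, p99 query latency, filter hit rate
✅ A clear collection versioning policy: new embedding model → new collection, then shift traffic gradually
✅ If you need hybrid search (many exact keywords): Qdrant enable_hybrid or QueryFusionRetriever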
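
For the monitoring item, p99 latency is easy to compute from raw samples if your metrics stack does not already provide it. The sketch below uses the nearest-rank definition of a percentile; the latency values are invented for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 13]  # made-up query latencies
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

A single slow query dominates p99 while leaving the median untouched, which is why tail latency, not the average, is the number to alert on.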

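12. Chapter summary

Remember:
① Under 10k vectors SimpleVectorStore is enough; for 10k-100k, Chroma; for production, Qdrant or pgvector. If you already run Postgres, use pgvector.
② MetadataFilters are key to RAG precision: "2025, not confidential, product manual" narrows the candidate set far more reliably than semantic similarity alone.
③ Exact keywords (code names, CVEs, model numbers) call for hybrid search: Qdrant native or QueryFusionRetriever (RRF).
④ Three tuning levers: HNSW parameters, scalar quantization, and async + batching. Each can cut latency substantially.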