Chapter 5 · The Index Family · Choosing the Right Index - LlamaIndex Data Framework in Practice

1. The Index Landscape

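Index	Core mechanism	Typical use case	Query style
`VectorStoreIndex` 🔥	Vector similarity	Semantic Q&A (the workhorse)	"What is X?"
`SummaryIndex`	Iterates over every node	Whole-document summarization	"Summarize this report for me"
`KeywordTableIndex`	LLM keyword extraction + inverted index	Exact keyword lookup	"Everything that mentions CVE-2024-XX"
`DocumentSummaryIndex`	Per-document summary + vectors	Large document collections: pick the right document, then drill in	"Which contract mentions..."
`KnowledgeGraphIndex`	LLM-extracted triplets built into a graph	Entity-relationship Q&A	"Who does Zhang San report to?"
`TreeIndex`	Hierarchical summary tree	Recursive summarization of long text	"What is the main thread of this book?"
`ComposableGraph`	Composition of multiple indexes	Heterogeneous knowledge bases	Routes to the matching sub-index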

2. VectorStoreIndex: The Workhorse

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

vs = QdrantVectorStore(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="docs",
)
storage_context = StorageContext.from_defaults(vector_store=vs)

# Option 1: build directly from documents
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

# Option 2: if you already have nodes (for example from an IngestionPipeline)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)

# Option 3: load straight from an existing vector store, with no re-embedding
index = VectorStoreIndex.from_vector_store(vs)

qe = index.as_query_engine(similarity_top_k=5)
print(qe.query("2025 Q3 revenue"))

The workhorse among workhorses: roughly 90% of Q&A scenarios run through it. Choosing a specific vector store is covered in Ch6.

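3. SummaryIndex: When the Answer Requires Reading Everything

"Summarize this 200-page annual report for me": vector retrieval only pulls back the top-k chunks, so it inherently lacks a global view. A SummaryIndex query passes over every Node: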

from llama_index.core import SummaryIndex

index = SummaryIndex.from_documents(docs)
qe = index.as_query_engine(
    response_mode="tree_summarize",  # summarize each batch, then summarize the summaries
)
ans = qe.query("Summarize the three key figures in this annual report")

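Common response_mode values:

refine: one LLM call per chunk, refining the answer step by step. Thorough, but slow and expensive.
compact: packs as many chunks as fit into the context window before each call, so fewer calls are needed. A balanced choice.
tree_summarize: summarizes batches of chunks, then recursively merges the summaries bottom-up. The best fit for summarization tasks 🔥
simple_summarize: crudely concatenates everything (truncating to fit) and generates once. For small documents.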
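
To make the cost difference between the modes concrete, here is a rough pure-Python sketch that counts LLM calls for a refine-style loop versus a tree_summarize-style merge. The names `make_counting_llm`, `refine`, and `tree_summarize` are illustrative stand-ins written for this example, not LlamaIndex functions, and the fixed `batch_size` is an assumption standing in for "however many chunks fit in the context window".

```python
from typing import Callable, Dict, List, Tuple


def make_counting_llm() -> Tuple[Callable[[str], str], Dict[str, int]]:
    """Return a stub 'LLM' and a counter that tracks how often it is called."""
    stats = {"calls": 0}

    def fake_llm(prompt: str) -> str:
        stats["calls"] += 1
        # Stand-in for a real summary: keep the first 40 characters.
        return prompt[:40]

    return fake_llm, stats


def refine(chunks: List[str], llm: Callable[[str], str]) -> str:
    """One call per chunk; each call revises the running answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"{answer}\n{chunk}")
    return answer


def tree_summarize(chunks: List[str], llm: Callable[[str], str], batch_size: int = 4) -> str:
    """Summarize batches, then summarize the summaries until one remains."""
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        batches = [level[i:i + batch_size] for i in range(0, len(level), batch_size)]
        level = [llm("\n".join(batch)) for batch in batches]
        if len(level) == 1:
            return level[0]


chunks = [f"chunk {i}" for i in range(16)]

llm_a, stats_a = make_counting_llm()
refine(chunks, llm_a)

llm_b, stats_b = make_counting_llm()
tree_summarize(chunks, llm_b, batch_size=4)

print(stats_a["calls"], stats_b["calls"])  # 16 5
```

Both strategies still read every chunk, so the token cost stays proportional to N; what changes is the number of calls and whether they can run in parallel within a level.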

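Cost warning: a SummaryIndex query costs O(N) in the number of nodes, so the more documents, the more expensive it gets. It only suits a single short-to-medium document (a few hundred chunks at most). To summarize a large corpus, filter with DocumentSummaryIndex first, then run SummaryIndex over the few documents selected.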

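4. KeywordTableIndex: Exact Keywords

Scenario: finding "every ticket that mentions CVE-2024-3094" in support tickets, or "patents containing polymer electrolyte" in a patent database. Vector retrieval gravitates toward semantically similar text and can miss the exact string.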

from llama_index.core import KeywordTableIndex, SimpleKeywordTableIndex

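# LLM-based keyword extraction: more expensive, but more accurate
index = KeywordTableIndex.from_documents(docs)

# Rule-based extraction (an alternative to the line above): fast and free
index = SimpleKeywordTableIndex.from_documents(docs)

qe = index.as_query_engine()
ans = qe.query("Which tickets mention CVE-2024-3094?")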

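At index time, a set of keywords is extracted from each chunk and stored in an inverted table mapping keyword → nodes. At query time, keywords are extracted from the question and the nodes they point to are looked up, with nodes that match more keywords ranked first. In a typical RAG setup this serves as the sparse leg of hybrid retrieval and works best alongside vector retrieval.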
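
The inverted-table idea can be sketched in a few lines of plain Python. This is a toy illustration of the mechanism, not LlamaIndex's implementation: the regex tokenizer, the small `STOPWORDS` set, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set

# Hypothetical stopword list, just large enough for the demo.
STOPWORDS = {"the", "a", "an", "in", "of", "to", "is", "which", "and", "for", "by"}


def extract_keywords(text: str) -> Set[str]:
    """Rule-based extraction: lowercase tokens (digits and hyphens kept) minus stopwords."""
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def build_keyword_table(chunks: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build the inverted table: keyword -> ids of the nodes that contain it."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for node_id, text in chunks.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def retrieve(table: Dict[str, Set[str]], query: str, top_k: int = 2) -> List[str]:
    """Rank nodes by how many query keywords they match."""
    hits: Counter = Counter()
    for keyword in extract_keywords(query):
        for node_id in table.get(keyword, ()):
            hits[node_id] += 1
    return [node_id for node_id, _ in hits.most_common(top_k)]


tickets = {
    "t1": "Customer reports CVE-2024-3094 in the xz package",
    "t2": "Password reset request for billing portal",
    "t3": "Patched CVE-2023-4863 in image library",
}
table = build_keyword_table(tickets)
result = retrieve(table, "Which tickets mention CVE-2024-3094?")
print(result)  # ['t1']
```

Because the CVE identifier is kept as a single token, the lookup is an exact match: `t3` mentions a different CVE and is never returned, which is the behavior embedding similarity alone does not guarantee.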

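5. DocumentSummaryIndex: Pick the Document First, Then Search

Suppose 500 contracts need to be searched for "the NDA between party X and party Y". A vector search straight over the full contract text gets drowned out by clause-level noise. A smarter approach:

Generate a summary for each contract first (one LLM summarization pass per document)
Embed the summaries and index them
At query time, use the summaries to pick the 3 most likely contracts
Then drop into a chunk-level VectorStoreIndex over those 3 for fine-grained retrieval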

from llama_index.core import DocumentSummaryIndex, get_response_synthesizer

response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)

index = DocumentSummaryIndex.from_documents(
    docs,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)

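# Fetch the LLM-generated summary of a document (assumes a document with this id exists)
print(index.get_document_summary(doc_id="contract-42"))

# Retrieval: question → summary vector search → selected documents → their nodes
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("NDA terms signed between Acme and Foo in 2024")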
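
The two-stage flow (select documents by summary, then search chunks only inside the selected documents) can be sketched without any LLM. In this toy version, `overlap` is a word-overlap score standing in for embedding similarity, and all names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def overlap(a: str, b: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def two_stage_retrieve(
    query: str,
    summaries: Dict[str, str],
    chunks: Dict[str, List[str]],
    top_docs: int = 1,
    top_chunks: int = 2,
) -> List[Tuple[str, str]]:
    # Stage 1: rank whole documents by their summaries.
    ranked_docs = sorted(
        summaries, key=lambda doc_id: overlap(query, summaries[doc_id]), reverse=True
    )[:top_docs]
    # Stage 2: rank chunks, but only those belonging to the selected documents.
    candidates = [(doc_id, chunk) for doc_id in ranked_docs for chunk in chunks[doc_id]]
    candidates.sort(key=lambda pair: overlap(query, pair[1]), reverse=True)
    return candidates[:top_chunks]


summaries = {
    "contract-1": "nda between acme and foo signed 2024",
    "contract-2": "supply agreement between bar and baz signed 2023",
}
chunks = {
    "contract-1": [
        "confidentiality term lasts five years",
        "acme and foo agree to protect confidential information",
    ],
    "contract-2": [
        "acme is listed as a reference customer of bar",
        "delivery terms and penalties",
    ],
}

result = two_stage_retrieve("nda between acme and foo", summaries, chunks)
for doc_id, chunk in result:
    print(doc_id, "|", chunk)
```

The chunk in `contract-2` that happens to mention Acme never competes, because its document was filtered out in stage 1. That is the noise reduction the summary layer buys.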

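6. KnowledgeGraphIndex: The Entity-Relationship Network

Scenario: org charts, product dependency graphs, paper co-author networks. Questions of the form "who -- relation -- whom" are handled poorly by vector retrieval; graph retrieval is the right tool.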

from llama_index.core import KnowledgeGraphIndex

index = KnowledgeGraphIndex.from_documents(
    docs,
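    max_triplets_per_chunk=10,     # extract at most 10 triplets per chunk
    include_embeddings=True,        # also store embeddings, enabling hybrid retrieval
    show_progress=True,
)

# Example triplets:
# ("Zhang San", "reports to", "Li Si")
# ("Product A", "depends on", "Library B")

qe = index.as_query_engine(include_text=True, response_mode="tree_summarize")
ans = qe.query("What are the upstream dependencies of Product A?")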

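Going further: when you need persistence plus a graph query language, switch to PropertyGraphIndex with Neo4j or FalkorDB. That combination supports Cypher queries, so it can handle graph traversals such as "everyone within 3 hops". For small-scale trials, KnowledgeGraphIndex is enough.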
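
To show what a "within N hops" question means over extracted triplets, here is a small pure-Python sketch using breadth-first search. It illustrates the graph idea only; it is not how KnowledgeGraphIndex or a Cypher engine executes queries, and the triplets and function names are made up for the example.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def build_graph(triplets: List[Triplet]) -> Dict[str, List[Tuple[str, str]]]:
    """Adjacency list: subject -> [(relation, object), ...]."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subj, rel, obj in triplets:
        graph[subj].append((rel, obj))
    return graph


def within_hops(
    graph: Dict[str, List[Tuple[str, str]]], start: str, relation: str, max_hops: int
) -> Set[str]:
    """Collect every entity reachable from `start` via `relation` in at most `max_hops` steps."""
    seen: Set[str] = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for rel, nxt in graph.get(node, []):
            if rel == relation and nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


triplets = [
    ("Product A", "depends_on", "Lib B"),
    ("Lib B", "depends_on", "Lib C"),
    ("Lib C", "depends_on", "Lib D"),
    ("Lib D", "depends_on", "Lib E"),
    ("Zhang San", "reports_to", "Li Si"),
]
graph = build_graph(triplets)

direct = within_hops(graph, "Product A", "depends_on", max_hops=1)
three_hops = within_hops(graph, "Product A", "depends_on", max_hops=3)
print(sorted(direct))      # ['Lib B']
print(sorted(three_hops))  # ['Lib B', 'Lib C', 'Lib D']
```

A multi-hop question like this has no natural top-k formulation, which is why graph retrieval handles it better than similarity search.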

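7. TreeIndex: A Hierarchical Summary Tree

"What is this book about?" TreeIndex first groups chunks and summarizes each group, then groups and summarizes those summaries, recursing until a pyramid of summaries forms. A query walks down from the root: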

from llama_index.core import TreeIndex

index = TreeIndex.from_documents(
    docs,
    num_children=10,            # merge 10 children into each parent
)

qe = index.as_query_engine(
    child_branch_factor=1,     # follow the single most relevant child at each level
)
ans = qe.query("What are the main arguments of this book?")

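It suits broad thematic questions over long articles and books. But the index is expensive to build (every level requires LLM summaries), so it is rarely used. Most projects get by with DocumentSummaryIndex.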
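
A back-of-the-envelope estimate of the build cost: the sketch below counts the summary calls needed per level for a given number of leaf chunks and fan-out. It assumes merging continues until a single root remains; the actual TreeIndex builder may stop at a slightly different point, so treat the numbers as an order-of-magnitude guide. The function name is hypothetical.

```python
import math
from typing import List


def summary_calls_per_level(num_leaves: int, num_children: int = 10) -> List[int]:
    """Number of summary calls at each level, bottom-up, until one root remains."""
    if num_leaves < 1 or num_children < 2:
        raise ValueError("need at least 1 leaf and a fan-out of 2 or more")
    calls: List[int] = []
    level = num_leaves
    while level > 1:
        level = math.ceil(level / num_children)  # one summary per group of children
        calls.append(level)
    return calls


levels = summary_calls_per_level(1000, num_children=10)
print(levels, sum(levels))  # [100, 10, 1] 111
```

So a 1000-chunk book costs on the order of a hundred LLM summaries before the first query is answered, whereas a VectorStoreIndex needs only embedding calls at build time.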

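8. ComposableGraph: Heterogeneous Composition

In a real company knowledge base, contracts suit DocumentSummaryIndex, product docs suit VectorStoreIndex, and the org chart suits KnowledgeGraphIndex. ComposableGraph combines them into a single top-level index: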

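from llama_index.core import (
    ComposableGraph,
    DocumentSummaryIndex,
    KnowledgeGraphIndex,
    SummaryIndex,
    VectorStoreIndex,
)

# Three sub-indexes
idx_contracts = DocumentSummaryIndex.from_documents(contract_docs)
idx_products  = VectorStoreIndex.from_documents(product_docs)
idx_people    = KnowledgeGraphIndex.from_documents(people_docs)

# Give each sub-index a description (routing relies on it)
graph = ComposableGraph.from_indices(
    SummaryIndex,
    [idx_contracts, idx_products, idx_people],
    index_summaries=[
        "All contracts and NDAs the company has signed with external parties",
        "Product documentation, API reference, user manuals",
        "Employees and org relationships, reporting structure",
    ],
)

qe = graph.as_query_engine()
ans = qe.query("Which of Li Si's direct reports has signed a contract with Acme?")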

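LlamaIndex first routes the question to the relevant sub-indexes (contracts + org chart), retrieves from each, and then synthesizes the answer. The more modern approach is RouterQueryEngine (Ch7), which is more concise.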
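
Why the wording of each sub-index description matters can be seen in a deliberately simplified router. Real routing in LlamaIndex is driven by an LLM or embeddings; the word-overlap scoring below is only a stand-in, and `route` and the sample summaries are hypothetical.

```python
from typing import Dict, List


def route(query: str, index_summaries: Dict[str, str], min_overlap: int = 1) -> List[str]:
    """Return sub-index names whose summary shares at least `min_overlap` words with the query."""
    query_words = set(query.lower().split())
    scored = {
        name: len(query_words & set(summary.lower().split()))
        for name, summary in index_summaries.items()
    }
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= min_overlap]


specific = {
    "contracts": "contracts and nda signed with external parties",
    "products": "product documentation api reference user manuals",
    "people": "employees reporting structure and org chart",
}
vague = {
    "contracts": "company documents",
    "products": "company documents",
    "people": "company documents",
}

query = "who signed the nda with acme"
print(route(query, specific))  # ['contracts']
print(route(query, vague))     # []
```

With specific summaries the router has something to discriminate on; with identical generic summaries every sub-index looks the same, and any selection it makes is effectively arbitrary.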

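9. Decision Tree: Which One Should I Pick?

Question type?
├─ Semantic similarity / general Q&A ───────────▶ VectorStoreIndex (default)
├─ Full-text summary (single document) ─────────▶ SummaryIndex (tree_summarize)
├─ Exact keyword / code name / ID ──────────────▶ KeywordTableIndex + vector hybrid
├─ Large document collection, locate then search ▶ DocumentSummaryIndex
├─ Entity relationships / graphs ───────────────▶ KnowledgeGraphIndex / PropertyGraphIndex
├─ Thematic questions about a long book ────────▶ TreeIndex
└─ Heterogeneous knowledge base, needs routing ─▶ ComposableGraph / RouterQueryEngine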

10. Persistence and Loading

Every Index supports unified persistence through StorageContext:

from llama_index.core import StorageContext, load_index_from_storage

# Save
index.storage_context.persist(persist_dir="./storage")

# Load (with an external vector store, rebuild the StorageContext so it points at that store)
storage_context = StorageContext.from_defaults(
    persist_dir="./storage",
    vector_store=vs,         # external store such as Qdrant, pg, or Weaviate
)
index = load_index_from_storage(storage_context)

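Important: when a VectorStoreIndex is backed by an external vector store, persist only writes the docstore and index_store; the vectors themselves live in the vector store. In other words, an external vector store (Qdrant/pg) already handles its own persistence, and persist only saves node metadata and index structure.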
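
A common way to wire persist and load together is a "load if present, otherwise build" helper. The sketch below keeps the LlamaIndex calls out of the helper and takes them as callables, so it runs with stubs; `load_or_build`, `fake_build`, and `fake_load` are hypothetical names, and the "directory exists and is non-empty" check is a simplifying assumption.

```python
import os
import tempfile
from typing import Callable, List, TypeVar

T = TypeVar("T")


def load_or_build(persist_dir: str, load: Callable[[str], T], build: Callable[[str], T]) -> T:
    """Load a persisted index if persist_dir already has files, otherwise build and persist it."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    os.makedirs(persist_dir, exist_ok=True)
    return build(persist_dir)


# Stubs for the demo. In real use, `build` would call something like
# VectorStoreIndex.from_documents(...) followed by
# index.storage_context.persist(persist_dir=...), and `load` would call
# load_index_from_storage(StorageContext.from_defaults(persist_dir=...)).
calls: List[str] = []


def fake_build(persist_dir: str) -> str:
    calls.append("build")
    with open(os.path.join(persist_dir, "docstore.json"), "w") as fh:
        fh.write("{}")
    return "index"


def fake_load(persist_dir: str) -> str:
    calls.append("load")
    return "index"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "storage")
    load_or_build(target, fake_load, fake_build)  # first start: builds
    load_or_build(target, fake_load, fake_build)  # later starts: loads

print(calls)  # ['build', 'load']
```

The expensive embedding and LLM work then happens once, on the first start, rather than on every process launch.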

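11. Sharing a Docstore Across Indexes

When you want a vector index, a keyword index, and a graph index over the same data at once, share one docstore to avoid storing the nodes repeatedly: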

from llama_index.core import SimpleKeywordTableIndex, StorageContext, VectorStoreIndex
from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)  # write once

storage_context = StorageContext.from_defaults(docstore=docstore, vector_store=vs)

vector_index   = VectorStoreIndex(nodes, storage_context=storage_context)
keyword_index  = SimpleKeywordTableIndex(nodes, storage_context=storage_context)
# Both indexes share the same nodes; only one copy is stored

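12. Anti-Patterns

Using SummaryIndex for Q&A over a large corpus: with many nodes, the O(N) query cost explodes.
Relying on VectorStoreIndex alone for "exact identifiers": product models, CVE IDs, and order numbers get blurred once embedded.
Taking KnowledgeGraphIndex straight to production: graph construction is slow and querying is coarse; in production, PropertyGraphIndex + Neo4j is what holds up.
Rebuilding the index on every startup: persist once and load, rather than calling from_documents each time.
Forgetting to rebuild after changing the embedding model: once the model changes, the old vectors are useless.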
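Turning off embed_summaries on DocumentSummaryIndex: in recent llama-index releases it appears to be on by default (check your version). Keep it on, because embedding-based retrieval over the summaries depends on it and retrieval quality is noticeably better with it.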
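Writing ComposableGraph index_summaries too vaguely: the router cannot tell the sub-indexes apart and the answers miss the question. Summaries should be specific and include representative terms.
Persisting to local disk in a distributed deployment: each instance reads its own local directory and the data becomes inconsistent. Use an external vector store plus a centralized docstore (Redis).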

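13. Chapter Summary

Remember:
① VectorStoreIndex is the default but not a cure-all: keywords, full-text summaries, and entity relationships each have a dedicated index.
② Large document collections need two stages: DocumentSummaryIndex picks the documents → VectorStoreIndex does the fine-grained search.
③ Real projects usually combine several indexes: different data shapes go to different indexes, with RouterQueryEngine/ComposableGraph doing the routing.
④ For persistence, remember that the vector store persists its own vectors and persist saves the metadata. Distributed deployments should always use an external vector store.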