Chapter 5: Vision and Image Understanding - Anthropic API Complete Guide

Supported formats

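Formats: JPEG / PNG / GIF / WebP
Size: ≤ 5 MB per image (before base64 encoding)
Dimensions: recommended maximum 1568×1568; larger images are scaled down automatically, so extra resolution does not improve results
Per message: up to 100 images (4.x models; check the docs for specifics)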
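The format and size limits above can be checked locally before a request goes out. The following is a minimal sketch, not part of the SDK: `checkImage`, `base64Length`, and the constants are hypothetical names, and it assumes the 5 MB limit is counted on the raw bytes (as stated above) and means 5 × 1024 × 1024 bytes.

```typescript
// Hypothetical pre-flight check for the limits listed above.
const ALLOWED_MEDIA_TYPES = ["image/jpeg", "image/png", "image/gif", "image/webp"];
const MAX_IMAGE_BYTES = 5 * 1024 * 1024; // assumption: 5 MB measured before base64

// Length of the padded base64 string produced from `rawBytes` bytes.
function base64Length(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}

// Returns a list of problems; an empty list means the image passes these checks.
function checkImage(mediaType: string, rawBytes: number): string[] {
  const problems: string[] = [];
  if (ALLOWED_MEDIA_TYPES.indexOf(mediaType) === -1) {
    problems.push(`unsupported media type: ${mediaType}`);
  }
  if (rawBytes > MAX_IMAGE_BYTES) {
    problems.push(`image is ${rawBytes} bytes, limit is ${MAX_IMAGE_BYTES}`);
  }
  return problems;
}
```

`base64Length` is included because base64 inflates the payload by about a third, which matters for the overall request size even when the raw image is under the limit.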

Image input method 1: base64

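import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();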

const imageData = fs.readFileSync("./chart.png", { encoding: "base64" });

const msg = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/png",
          data: imageData,
        },
      },
      { type: "text", text: "What does this chart tell me?" },
    ],
  }],
});

Image input method 2: URL

{
  type: "image",
  source: {
    type: "url",
    url: "https://example.com/chart.png",
  },
}

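With the URL method, Anthropic's servers fetch the image. Requirements:

Publicly accessible (no authentication)
HTTPS
Returns the image binary directly, not an HTML page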
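These requirements can be sanity-checked from your own side before you hand the URL to the API. The sketch below is a hypothetical helper (`checkImageUrl` is not an SDK function). It assumes you have already made your own HEAD or GET request and pass in the status code and `Content-Type` header. Passing this check does not guarantee the server-side fetch will succeed, since hotlink protection may treat other clients differently.

```typescript
// Hypothetical check run on the result of your own request to the image URL.
function checkImageUrl(
  url: string,
  status: number,
  contentType: string | null,
): string[] {
  const problems: string[] = [];
  if (!/^https:\/\//i.test(url)) {
    problems.push("URL must use HTTPS");
  }
  if (status === 401 || status === 403) {
    problems.push("URL requires authentication or blocks external fetches");
  } else if (status < 200 || status >= 300) {
    problems.push(`unexpected HTTP status ${status}`);
  }
  if (contentType === null || !/^image\//i.test(contentType)) {
    problems.push("response is not an image (check for an HTML landing page)");
  }
  return problems;
}
```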

Image input method 3: Files API

Large images, or images reused across multiple conversations → upload to the Anthropic Files API first, then reference them by file_id:

const uploaded = await client.beta.files.upload({
  file: fs.createReadStream("./big.jpg"),
});

{
  type: "image",
  source: { type: "file", file_id: uploaded.id },
}

Advantages: saves bandwidth / better hit rates when combined with Prompt Caching / no re-encoding when reused across turns.

Token cost calculation

Images are converted into tokens. The approximate formula:

tokens ≈ (width × height) / 750

A few reference points:

Dimensions	Estimated tokens
200×200	~54
500×500	~334
1000×1000	~1334
1568×1568 (max)	~3279
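The formula is easy to wrap in a helper for budgeting. This is a minimal sketch of the approximation given above, rounding up as the table's smaller entries do. `estimateImageTokens` is a hypothetical name, and the result is an estimate, not the billed figure.

```typescript
// Approximate token cost of an image, using tokens ≈ (width × height) / 750.
function estimateImageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / 750);
}

console.log(estimateImageTokens(200, 200));   // 54
console.log(estimateImageTokens(500, 500));   // 334
console.log(estimateImageTokens(1000, 1000)); // 1334
```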

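Cost-saving tip
Downscale to a suitable resolution before sending to Claude. For text-dense screenshots use 1568 px (so every character stays legible); for UI thumbnails 800 px is enough. Doubling both width and height quadruples the token count, so cropping and downscaling save money directly.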
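To pick target dimensions before resizing, you can compute the largest size that fits inside a maximum edge while keeping the aspect ratio. A minimal sketch follows. `fitInside` is a hypothetical helper that only does the arithmetic, and the actual resampling is left to an image library.

```typescript
// Scale (width, height) down so the longer edge is at most maxEdge.
// Images that already fit are returned unchanged (never upscaled).
function fitInside(
  width: number,
  height: number,
  maxEdge: number,
): { width: number; height: number } {
  const scale = Math.min(1, maxEdge / Math.max(width, height));
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

console.log(fitInside(3136, 1568, 1568)); // { width: 1568, height: 784 }
console.log(fitInside(800, 600, 1568));   // { width: 800, height: 600 }
```

Halving both edges, as in the first call, cuts the pixel count (and so the estimated tokens) to a quarter.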

Multi-image comparison

messages: [{
  role: "user",
  content: [
    { type: "image", source: {...}, /* image 1 */ },
    { type: "text", text: "This is the v1 design." },
    { type: "image", source: {...}, /* image 2 */ },
    { type: "text", text: "This is the v2 design. Compare them and list 5 changes in v2 relative to v1." },
  ],
}]

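Key point: alternate image blocks with text labels so the model knows which image is which. Sending two images plus a bare "compare" leaves Claude unsure which is "before" and which is "after".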
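The alternating pattern can be generated rather than hand-written when the number of images varies. The sketch below uses simplified local types instead of the SDK's own, and `buildLabeledContent` is a hypothetical helper. It emits each image followed by its label, then the question as a final text block.

```typescript
// Simplified stand-ins for the SDK's content block types (assumption).
type ImageSource =
  | { type: "base64"; media_type: string; data: string }
  | { type: "url"; url: string };

type ContentBlock =
  | { type: "image"; source: ImageSource }
  | { type: "text"; text: string };

// Builds image/label pairs in order, followed by the question.
function buildLabeledContent(
  images: { label: string; source: ImageSource }[],
  question: string,
): ContentBlock[] {
  const blocks: ContentBlock[] = [];
  for (const { label, source } of images) {
    blocks.push({ type: "image", source });
    blocks.push({ type: "text", text: label });
  }
  blocks.push({ type: "text", text: question });
  return blocks;
}

const content = buildLabeledContent(
  [
    { label: "This is the v1 design.", source: { type: "url", url: "https://example.com/v1.png" } },
    { label: "This is the v2 design.", source: { type: "url", url: "https://example.com/v2.png" } },
  ],
  "List the changes in v2 relative to v1.",
);
```

The example.com URLs are placeholders. The resulting array can be passed as the `content` of a user message.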

OCR / document reading

{
  role: "user",
  content: [
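    { type: "image", source: {...}, /* scanned invoice */ },
    { type: "text", text:
`Extract from the invoice:
- Invoice number
- Issue date (ISO format)
- Amount (excluding tax)
- Tax rate
- Total amount
- Seller name

Return JSON with English field names.` },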
  ],
}

A more robust approach: pair it with Tool Use, covered in the next section.

Vision + Tool Use: the golden combination

const tools = [{
  name: "record_invoice",
  description: "Record the structured fields of an invoice",
  input_schema: {
    type: "object",
    properties: {
      invoice_no: { type: "string" },
      date: { type: "string", format: "date" },
      subtotal: { type: "number" },
      tax_rate: { type: "number" },
      total: { type: "number" },
      seller: { type: "string" },
    },
    required: ["invoice_no", "date", "total"],
  },
}];

const resp = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  tools,
  tool_choice: { type: "tool", name: "record_invoice" },
  messages: [{
    role: "user",
    content: [
      { type: "image", source: imageSource },
      { type: "text", text: "Extract the information from this invoice" },
    ],
  }],
});

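// content is a union of block types; narrow to the tool_use block before reading .input.
const toolUse = resp.content.find((block) => block.type === "tool_use");
const data = toolUse?.type === "tool_use" ? toolUse.input : undefined;
// data is an object shaped like {invoice_no, date, subtotal, ...}, following the schema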

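This is far more reliable than asking Claude to output a JSON string and parsing it yourself: the schema pins down field names and types, discourages extra fields, and marks which fields are required. The schema guides generation rather than acting as a hard validator, so check the result before storing it.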
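Because the tool input arrives as untyped data on the client side, a small runtime check is a cheap safeguard before writing it to a database. The sketch below mirrors the `record_invoice` schema from this section. `Invoice` and `parseInvoice` are hypothetical names, and the date check is only a shape check (YYYY-MM-DD), not full calendar validation.

```typescript
interface Invoice {
  invoice_no: string;
  date: string;
  total: number;
  subtotal?: number;
  tax_rate?: number;
  seller?: string;
}

// Validates the required fields and copies optional ones when well-typed.
function parseInvoice(input: unknown): Invoice {
  if (typeof input !== "object" || input === null) {
    throw new Error("tool input is not an object");
  }
  const raw = input as Record<string, unknown>;
  const invoiceNo = raw["invoice_no"];
  const date = raw["date"];
  const total = raw["total"];
  if (typeof invoiceNo !== "string") throw new Error("invoice_no must be a string");
  if (typeof date !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error("date must look like YYYY-MM-DD");
  }
  if (typeof total !== "number" || !isFinite(total)) {
    throw new Error("total must be a finite number");
  }
  const invoice: Invoice = { invoice_no: invoiceNo, date, total };
  const subtotal = raw["subtotal"];
  const taxRate = raw["tax_rate"];
  const seller = raw["seller"];
  if (typeof subtotal === "number") invoice.subtotal = subtotal;
  if (typeof taxRate === "number") invoice.tax_rate = taxRate;
  if (typeof seller === "string") invoice.seller = seller;
  return invoice;
}
```

Unknown extra fields are dropped by construction, since only the listed keys are copied into the result.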

PDF support

Claude 4.x models support PDF natively (they OCR the pages, read the text, and interpret charts):

{
  type: "document",
  source: {
    type: "base64",
    media_type: "application/pdf",
    data: fs.readFileSync("./report.pdf", { encoding: "base64" }),
  },
  cache_control: { type: "ephemeral" },   // PDFs are large, so caching is recommended
}

Limits: 32 MB / 100 pages. Claude loads the entire PDF into context in one go, then handles QA, summarization, and extraction as it would for any document.

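The coordinates problem

Claude can describe positions ("there is a red box in the top-right corner") but is not good at returning exact pixel coordinates ("the box's top-left corner is at x=342, y=128"). For precise bounding-box tasks, use a dedicated object detection model (YOLO, DETR, etc.).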

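Where it is not a good fit

Video: not supported; you have to extract frames yourself and send them as multiple images
Face recognition / identifying people: prohibited by Anthropic policy
Very dense tables of small text: OCR will drop characters; try DocumentAI or pre-process first
Medical imaging diagnosis: Claude can describe the image, but the output must not be used for clinical decisions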

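Error quick reference

Wrong media_type

Must be image/png, image/jpeg, image/gif, or image/webp; omitting it or getting it wrong returns a 400

Image too large

Over 5 MB before base64 → rejected; compress beforehand with sharp / PIL

base64 with a data: prefix

Do not include the data:image/png;base64, prefix; send only the raw base64 string

URL not reachable

Must be public HTTPS and return binary image data; hotlink protection will make the fetch fail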
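The `data:` prefix error is common when the base64 string comes from a browser (for example `FileReader.readAsDataURL` or a canvas `toDataURL`). A minimal sketch of stripping it follows. `splitDataUrl` is a hypothetical helper, and it also recovers the media type so you can fill in `media_type` correctly.

```typescript
// Splits "data:<media type>;base64,<payload>" into its parts.
// Strings without the prefix are passed through with mediaType = null.
function splitDataUrl(input: string): { mediaType: string | null; data: string } {
  const match = /^data:([^;,]+);base64,([\s\S]*)$/.exec(input);
  if (match === null) {
    return { mediaType: null, data: input };
  }
  return { mediaType: match[1], data: match[2] };
}

const parts = splitDataUrl("data:image/png;base64,iVBORw0KGgo=");
// parts.mediaType is "image/png", parts.data is "iVBORw0KGgo="
```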

Image preprocessing code

import sharp from "sharp";

async function toClaudeImage(filePath: string) {
  const buf = await sharp(filePath)
    .resize({ width: 1568, height: 1568, fit: "inside" })
    .jpeg({ quality: 85 })
    .toBuffer();
  return {
    type: "image",
    source: {
      type: "base64",
      media_type: "image/jpeg",
      data: buf.toString("base64"),
    },
  };
}

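Chapter summary

Three ways to send images: base64 / URL / Files API; formats JPG/PNG/GIF/WebP
Image tokens ≈ pixels / 750; 1568 px is the cost-performance sweet spot
For multi-image comparison, alternate image blocks and text labels
OCR / extraction → pair with Tool Use and a forced tool_choice so the schema shapes the output
PDF is natively supported, with a 32 MB / 100 page limit; enabling Prompt Caching is recommended
No face recognition, no precise coordinates; medical use is assistive only, not diagnostic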

  
