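Chapter 8: Evals - The Complete Mastra Guide

Why You Need Evals

Switching models: after moving from 4o-mini to Haiku, how do you confirm nothing regressed?
Changing the prompt: does adding a single line such as "answer in Chinese" ripple through everything else?
Adding a tool: does the new tool make the Agent call tools erratically instead?
Upgrading the framework: did the underlying behavior change?

Spot-checking ten conversations by eye may turn up nothing, and a month later you discover that an entire scenario has collapsed. Evals back every change with objective data.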

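Metric Categories

Textual (text quality)

answer-relevancy, content-similarity, completeness, toxicity, bias, and others.

RAG (retrieval-augmented generation)

faithfulness (does the answer stay true to the context), context-recall (did retrieval return enough), context-precision (is what it returned on target).

Prompt Alignment

prompt-alignment: whether the output follows the requirements stated in instructions.

Rule-based

word-inclusion, keyword-coverage, tone, JSON-validity, and others. They are fast, stable, and cost no tokens.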
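
To make "fast, stable, and cost no tokens" concrete, here is a minimal sketch of a rule-based check. The helper `keywordCoverage` is hypothetical and written for illustration; it is not the implementation of any built-in metric.

```typescript
// Hypothetical helper: fraction of required keywords that appear in the output.
// Pure string matching, so it is deterministic and makes no model calls.
function keywordCoverage(output: string, keywords: string[]): number {
  if (keywords.length === 0) return 1; // nothing required, trivially covered
  const haystack = output.toLowerCase();
  const hits = keywords.filter((kw) => haystack.includes(kw.toLowerCase()));
  return hits.length / keywords.length;
}

const coverageExample = keywordCoverage(
  'Run wrangler login, then mastra deploy.',
  ['wrangler', 'mastra deploy', 'docker'],
);
console.log(coverageExample.toFixed(2)); // prints "0.67" (2 of 3 keywords found)
```

Because the score depends only on the output string, the same input always yields the same score, which makes checks like this well suited to strict CI gates.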

Custom

Extend the Metric base class to implement any business logic.

Attaching Metrics to an Agent

import { Agent } from '@mastra/core/agent';
import {
  AnswerRelevancyMetric,
  ToxicityMetric,
  FaithfulnessMetric,
} from '@mastra/evals/llm';
import { openai } from '@ai-sdk/openai';

const judge = openai('gpt-4o-mini');

export const qaAgent = new Agent({
  name: 'qa',
  instructions: 'Answer strictly from the documentation; if you do not know, say so.',
  model: openai('gpt-4o-mini'),
  evals: {
    relevancy: new AnswerRelevancyMetric(judge),
    toxicity:  new ToxicityMetric(judge),
    faithful:  new FaithfulnessMetric(judge),
  },
});

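Once the metrics are attached, the Playground automatically shows a score card under each conversation turn. You can also read the scores directly in code:

const { text, evals } = await qaAgent.generate('How do I deploy to Cloudflare?');
console.log(evals.relevancy.score);  // 0.0 ~ 1.0
console.log(evals.relevancy.info);   // explanation of the score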

Running a Test Set

// evals.test.ts (Vitest)
import { describe, it, expect } from 'vitest';
import { openai } from '@ai-sdk/openai'; // needed for the judge model below
import { qaAgent } from './agents';
import { AnswerRelevancyMetric } from '@mastra/evals/llm';

// Each case pairs a question with keywords the answer must contain.
const cases = [
  { q: 'How do I deploy to Cloudflare?', mustInclude: ['wrangler', 'mastra deploy'] },
  { q: 'How do I enable memory?', mustInclude: ['Memory', 'storage'] },
];

describe('qaAgent', () => {
  const metric = new AnswerRelevancyMetric(openai('gpt-4o-mini'));

  cases.forEach(({ q, mustInclude }) => {
    it(q, async () => {
      const { text } = await qaAgent.generate(q);
      // LLM-judged relevancy score plus a cheap keyword assertion
      const { score } = await metric.measure(q, text);
      expect(score).toBeGreaterThan(0.7);
      mustInclude.forEach(kw => expect(text).toContain(kw));
    });
  });
});

Running in CI

# .github/workflows/eval.yml
name: Mastra Evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - run: pnpm install
      - run: pnpm test:eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

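Controlling Eval Cost
Every LLM-as-judge call costs money. A workable split: run a small regression set (~20 cases) on each PR and the full set (~500 cases) nightly. 4o-mini is good enough as the judge model; there is no need for Opus.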
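
One way to implement the PR/nightly split is to pick the small set deterministically from the full set, so that PR results stay comparable from run to run. The sketch below is an assumption about how you might organize this; `EvalCase`, `selectCases`, and the `mode` values are hypothetical names, not part of Mastra.

```typescript
interface EvalCase {
  id: string;
  q: string;
}

// Hypothetical selector: nightly runs get everything, PR runs get an evenly
// spaced sample of prSize cases. No randomness, so the subset is stable.
function selectCases(all: EvalCase[], mode: 'pr' | 'nightly', prSize = 20): EvalCase[] {
  if (mode === 'nightly' || all.length <= prSize) return all;
  const stride = all.length / prSize;
  const picked = new Set(Array.from({ length: prSize }, (_, i) => Math.floor(i * stride)));
  return all.filter((_, i) => picked.has(i));
}

const fullSet: EvalCase[] = Array.from({ length: 500 }, (_, i) => ({
  id: `case-${i}`,
  q: `question ${i}`,
}));
const prSet = selectCases(fullSet, 'pr');
const nightlySet = selectCases(fullSet, 'nightly');
console.log(prSet.length, nightlySet.length); // prints "20 500"
```

In practice the mode would probably come from an environment variable set by the workflow, with the scheduled nightly job passing a different value than the pull request job.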

Custom Metrics

import { Metric, type MetricResult } from '@mastra/core/eval';

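// Scores an output by the share of its non-whitespace characters that are Chinese.
export class ChineseOnlyMetric extends Metric {
  async measure(input: string, output: string): Promise<MetricResult> {
    // CJK Unified Ideographs, U+4E00 to U+9FA5
    const cnChars = output.match(/[\u4e00-\u9fa5]/g)?.length ?? 0;
    const total = output.replace(/\s/g, '').length;
    const ratio = total ? cnChars / total : 0; // empty output scores 0
    return {
      score: ratio,
      info: { reason: `Chinese character ratio ${(ratio*100).toFixed(1)}%` },
    };
  }
}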
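
To see what scores this metric produces, the same ratio logic can be pulled out into a plain function and run without any framework. `chineseRatio` is a hypothetical standalone copy made for illustration.

```typescript
// Standalone copy of the scoring logic: Chinese characters divided by all
// non-whitespace characters. Returns 0 for empty output.
function chineseRatio(output: string): number {
  const cnChars = output.match(/[\u4e00-\u9fa5]/g)?.length ?? 0;
  const total = output.replace(/\s/g, '').length;
  return total ? cnChars / total : 0;
}

console.log(chineseRatio('全部中文'));                // prints 1
console.log(chineseRatio('你好 world').toFixed(3));  // prints "0.286" (2 of 7 characters)
console.log(chineseRatio(''));                       // prints 0
```

Note that full-width punctuation, digits, and code identifiers all fall outside the matched range, so a perfectly good Chinese answer that quotes a command will score below 1. A threshold for this metric should leave room for that.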

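The Three RAG Metrics

Faithfulness

Whether the answer uses only the retrieved context and makes nothing up. A low score means the model is hallucinating heavily.

Context Recall (retrieval completeness)

How many of the key points in the reference answer appear in the retrieved results. A low score means retrieval missed material: revisit the embedding model, reranking, or chunking.

Context Precision (retrieval precision)

How much of what was retrieved was actually used in the answer. A low score means retrieval is too broad: tune topK or add a reranker.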

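import {
  FaithfulnessMetric,
  ContextualRecallMetric,
  ContextPrecisionMetric, // exported under this name, not ContextualPrecisionMetric
} from '@mastra/evals/llm';

// judge is the model instance defined earlier; retrievedChunks is the array of
// text chunks from your retriever; userQ and agentAnswer are the user's
// question and the agent's reply.
const faith = new FaithfulnessMetric(judge, { context: retrievedChunks });
const r = await faith.measure(userQ, agentAnswer);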
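
The built-in metrics above rely on an LLM judge. To make the definitions of recall and precision concrete, here is a deliberately simplified string-matching sketch. `contextRecall` and `contextPrecision` are hypothetical helpers, and the "was this chunk used" flags are assumed to come from manual labeling or a judge.

```typescript
// Recall: share of reference key points found anywhere in the retrieved chunks.
function contextRecall(referencePoints: string[], chunks: string[]): number {
  if (referencePoints.length === 0) return 1;
  const corpus = chunks.join('\n').toLowerCase();
  const found = referencePoints.filter((p) => corpus.includes(p.toLowerCase()));
  return found.length / referencePoints.length;
}

// Precision: share of retrieved chunks that the answer actually used.
function contextPrecision(chunkUsed: boolean[]): number {
  if (chunkUsed.length === 0) return 0;
  return chunkUsed.filter(Boolean).length / chunkUsed.length;
}

const sampleChunks = [
  'Install wrangler and log in.',
  'Set an API token in the dashboard.',
  'Pricing tiers for Workers.',
  'History of the company.',
];
const recallScore = contextRecall(['wrangler', 'api token', 'mastra build'], sampleChunks);
const precisionScore = contextPrecision([true, true, false, false]);
console.log(recallScore.toFixed(2), precisionScore.toFixed(2)); // prints "0.67 0.50"
```

In this example the missing "mastra build" point pulls recall down, which points at retrieval coverage, while the two unused chunks pull precision down, which points at an overly broad topK.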

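The Eval Panel

The Eval panel in the Playground lets you:

Import a CSV/JSON test set
Run every metric on the current Agent with one click
View a score histogram and sort to find the worst cases
Diff two runs (a regression comparison after a prompt change)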
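
The worst-case sort and the run-to-run diff are easy to reproduce outside the panel, for example in a CI report. The sketch below assumes each run has been exported as a list of per-case scores; the `RunScore` shape and both helper names are hypothetical.

```typescript
interface RunScore {
  caseId: string;
  score: number;
}

// Lowest-scoring cases first.
function worstCases(run: RunScore[], n: number): RunScore[] {
  return [...run].sort((a, b) => a.score - b.score).slice(0, n);
}

// Cases whose score dropped by more than `tolerance` between two runs,
// biggest drop first. Cases missing from the baseline are ignored.
function regressions(before: RunScore[], after: RunScore[], tolerance = 0.05) {
  const baseline = new Map<string, number>(
    before.map((r): [string, number] => [r.caseId, r.score]),
  );
  return after
    .map((r) => ({ caseId: r.caseId, delta: r.score - (baseline.get(r.caseId) ?? r.score) }))
    .filter((d) => d.delta < -tolerance)
    .sort((a, b) => a.delta - b.delta);
}

const runBefore: RunScore[] = [
  { caseId: 'a', score: 0.9 },
  { caseId: 'b', score: 0.8 },
  { caseId: 'c', score: 0.7 },
];
const runAfter: RunScore[] = [
  { caseId: 'a', score: 0.92 },
  { caseId: 'b', score: 0.55 },
  { caseId: 'c', score: 0.68 },
];
const dropped = regressions(runBefore, runAfter);
console.log(dropped.map((d) => d.caseId)); // only case "b" dropped beyond the tolerance
```

The tolerance matters because LLM-judged scores fluctuate slightly between runs; without it, ordinary noise such as the small drop on case "c" would be reported as a regression.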

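Common Pitfalls

Judge model bias

Using gpt-4o-mini as the judge for its own family's gpt-4o may inflate scores. A judge from a different vendor (Claude grading GPT, or the reverse) is fairer.

Test set too small

Ten cases are too few to be statistically meaningful. Use at least 50-100, covering typical, long-tail, and edge cases.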
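
A quick calculation shows why ten cases are not enough. Treating each case as pass/fail, the standard error of an observed pass rate p over n cases is sqrt(p(1-p)/n). The sketch below uses this normal approximation, which is itself rough at small n; `passRateStdErr` is a hypothetical helper name.

```typescript
// Standard error of a pass rate measured over n independent cases.
function passRateStdErr(passRate: number, n: number): number {
  return Math.sqrt((passRate * (1 - passRate)) / n);
}

const seSmall = passRateStdErr(0.8, 10);
const seLarge = passRateStdErr(0.8, 100);
console.log(seSmall.toFixed(3)); // prints "0.126"
console.log(seLarge.toFixed(3)); // prints "0.040"
```

With an observed 80% pass rate, a 95% interval (about 1.96 standard errors) spans roughly ±25 points at n = 10 and roughly ±8 points at n = 100. At ten cases, a drop from 80% to 70% is indistinguishable from noise.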

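One threshold for everything

Critical flows (payments, medical) should require ≥ 0.9; for entertainment use cases 0.6 may be enough. Set thresholds per scenario.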
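
Per-scenario thresholds can live in a small lookup table that the test suite consults instead of a hard-coded number. The scenario names and the `passes` helper below are hypothetical; the values mirror the ones suggested above.

```typescript
type Scenario = 'payment' | 'medical' | 'entertainment';

// Minimum acceptable score for each scenario.
const minScore: Record<Scenario, number> = {
  payment: 0.9,
  medical: 0.9,
  entertainment: 0.6,
};

function passes(scenario: Scenario, score: number): boolean {
  return score >= minScore[scenario];
}

console.log(passes('payment', 0.85));       // prints false
console.log(passes('entertainment', 0.85)); // prints true
```

The same score of 0.85 fails one scenario and passes another, which is the point: the bar reflects the cost of a wrong answer, not the metric.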

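Chapter Summary

An Agent needs Evals before you can iterate on it with confidence
Built-in metrics cover Textual / RAG / Prompt Alignment / Rule-based
Attach metrics to Agent.evals and the Playground shows scores automatically
Run regressions in CI: small set on PRs, full set nightly, with a cheap judge model to control cost
Extend the Metric class for business-specific metrics and combine them with the built-in ones
Faithfulness + Context Recall/Precision are the three core RAG metrics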
