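Chapter 8: K8s + Helm Production Deployment (Self-Hosting Langfuse in Practice)

Prerequisites

Kubernetes 1.28+, starting with 3 worker nodes (at least 8 vCPU / 32 GB each)
A StorageClass that supports dynamic provisioning, preferably SSD-backed (ClickHouse is sensitive to disk performance)
An Ingress Controller (nginx / traefik) plus cert-manager (automatic TLS)
S3-compatible object storage: AWS S3 / R2 / Aliyun OSS / self-hosted MinIO (installed via its Operator)
helm 3.14+, kubectl, and yq (handy for editing values)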

Deployment Topology at a Glance

              ┌────────────────────────────────────┐
Internet ──>  │   Ingress (nginx + cert-manager)   │
              └────────────────┬───────────────────┘
                               │
           ┌───────────────────┴──────────────────┐
           ▼                                      ▼
      ┌──────────┐                          ┌───────────┐
      │ Web x 3  │ (Next.js, HPA)           │ Worker x3 │ (queue consumers)
      └────┬─────┘                          └─────┬─────┘
           │ reads/writes                         │ writes
           ├──────────┬─────────┐                 │
           ▼          ▼         ▼                 ▼
       ┌────────┐ ┌───────┐ ┌────────┐     ┌──────────────┐
       │Postgres│ │ Redis │ │   S3   │     │  ClickHouse  │
       │   HA   │ │  HA   │ │ Bucket │     │ 2x shard +   │
       └────────┘ └───────┘ └────────┘     │ 2x replica   │
                                           │ + Keeper x3  │
                                           └──────────────┘

Helm Repository and Version Pinning

helm repo add langfuse https://langfuse.github.io/langfuse-k8s
helm repo update

# List available versions; always pin one in production
helm search repo langfuse/langfuse --versions | head

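Pass --version 1.x.y to every helm command. In production, never let the chart version float (omitting --version pulls the newest chart): one day a breaking upgrade rewrites your ClickHouse tables and you are stuck.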
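One way to keep the pinned version under code review is a small wrapper chart whose Chart.yaml declares the Langfuse chart as a dependency. The following is only a sketch: the wrapper name langfuse-prod is hypothetical, and 1.x.y stands for whichever exact version you picked from the search output.

```yaml
# Chart.yaml of a hypothetical wrapper chart; names are illustrative
apiVersion: v2
name: langfuse-prod            # assumed name for the wrapper chart
version: 0.1.0                 # version of the wrapper itself
dependencies:
  - name: langfuse
    repository: https://langfuse.github.io/langfuse-k8s
    version: "1.x.y"           # placeholder: replace with an exact version, not a range
```

With this layout, helm dependency update records the resolved version in Chart.lock, and the values shown in this chapter would nest one level deeper under the dependency name (langfuse:).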

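Key Sections of values.yaml

The official default values are fine for a demo. The sections below walk through each block that must change for production.

① Images and Replicas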

langfuse:
  image:
    repository: langfuse/langfuse
    tag: "3.20.1"           # pin explicitly
    pullPolicy: IfNotPresent

  web:
    replicas: 3
    resources:
      requests: { cpu: 500m, memory: 1Gi }
      limits:   { cpu: 2,    memory: 4Gi }

  worker:
    replicas: 3              # add more during ingestion peaks
    resources:
      requests: { cpu: 500m, memory: 1Gi }
      limits:   { cpu: 4,    memory: 8Gi }

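② Secrets You Must Change

langfuse:
  nextauth:
    # In production, generate with: openssl rand -hex 32
    secret: "CHANGE_ME_nextauth_secret_32_hex"
    url: "https://langfuse.yourcompany.com"

  salt: "CHANGE_ME_salt_32_hex"
  encryptionKey: "CHANGE_ME_encryption_64_hex"   # must be 64 hex characters (256 bits), e.g. openssl rand -hex 32

  # Disable public sign-up (internal users log in via SSO)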
  additionalEnv:
    - name: AUTH_DISABLE_SIGNUP
      value: "true"
    - name: LANGFUSE_INIT_ORG_NAME
      value: "YourCompany"

Never commit secrets to Git
Inject the secrets above with Sealed Secrets / External Secrets / SOPS / Vault instead of writing them into values.yaml. Many teams come to regret that first commit of secrets into an internal repository.
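As an illustration of the External Secrets route, the sketch below asks the External Secrets Operator to materialize a Kubernetes Secret from an external store. The store name company-vault, the remote path langfuse/prod, and the property names are all assumptions made for this example, and the apiVersion may differ with your operator release.

```yaml
# Sketch only: store name, remote key path, and property names are assumptions
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: langfuse-app-secrets
  namespace: langfuse
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: company-vault              # hypothetical store set up by your platform team
  target:
    name: langfuse-app-secrets       # Kubernetes Secret that the operator creates
  data:
    - secretKey: nextauth-secret
      remoteRef: { key: langfuse/prod, property: nextauth_secret }
    - secretKey: salt
      remoteRef: { key: langfuse/prod, property: salt }
    - secretKey: encryption-key
      remoteRef: { key: langfuse/prod, property: encryption_key }
```

How the chart consumes this Secret depends on the chart version, so check its values reference for the secret-reference fields before wiring it in.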

③ Postgres

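There are two options: the chart's built-in Postgres (postgresql.deploy=true) or an external managed service (AWS RDS / CloudSQL). An external managed service is strongly recommended for production:

postgresql:
  deploy: false   # use external RDS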
  host: "langfuse-pg.cluster-xxx.us-east-1.rds.amazonaws.com"
  port: 5432
  database: "langfuse"
  auth:
    existingSecret: "langfuse-pg-credentials"
    usernameKey: "username"
    passwordKey: "password"

④ Redis

redis:
  deploy: false
  host: "langfuse-redis.xxx.cache.amazonaws.com"
  port: 6379
  auth:
    existingSecret: "langfuse-redis-credentials"
    passwordKey: "password"
  tls:
    enabled: true

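⑤ ClickHouse (the hard part in production)

The chart's built-in single-node ClickHouse is enough for a demo. For replicas, shards, and HA, run ClickHouse under the Altinity clickhouse-operator and point Langfuse at it as an external service: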

clickhouse:
  deploy: false
  host: "chi-langfuse-langfuse-0-0.langfuse.svc"   # operator naming: chi-<CHI name>-<cluster>-<shard>-<replica>
  port: 9000
  database: "default"
  auth:
    existingSecret: "langfuse-ch-credentials"
    usernameKey: "username"
    passwordKey: "password"
  # Must match the cluster name defined in the operator's ClickHouseInstallation
  cluster: "langfuse"

A minimal ClickHouseInstallation with 2 shards and 2 replicas:

apiVersion: "clickhouse.altinity.com/v1"
kind: ClickHouseInstallation
metadata:
  name: langfuse
spec:
  configuration:
    clusters:
      - name: langfuse
        layout:
          shardsCount: 2
          replicasCount: 2
    zookeeper:
      nodes:
        - host: chk-langfuse-0-0
        - host: chk-langfuse-0-1
        - host: chk-langfuse-0-2
  defaults:
    templates:
      podTemplate: ch-pod
      dataVolumeClaimTemplate: ch-data
  templates:
    podTemplates:
      - name: ch-pod
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:24.8
              resources:
                requests: { cpu: 2, memory: 8Gi }
                limits:   { cpu: 8, memory: 32Gi }
    volumeClaimTemplates:
      - name: ch-data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp3
          resources: { requests: { storage: 500Gi } }

⑥ S3 / MinIO

s3:
  deploy: false                     # use external S3
  eventUpload:
    bucket: "langfuse-events"
    region: "us-east-1"
    endpoint: ""                    # empty = AWS; for R2/MinIO set the matching endpoint
    accessKeyId:
      valueFrom: { secretKeyRef: { name: langfuse-s3, key: accessKey } }
    secretAccessKey:
      valueFrom: { secretKeyRef: { name: langfuse-s3, key: secretKey } }
    forcePathStyle: false           # must be true for MinIO
  mediaUpload:
    bucket: "langfuse-media"
    # same credentials as above

⑦ Ingress + TLS

ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"    # large prompt payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
  hosts:
    - host: langfuse.yourcompany.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: langfuse-tls
      hosts: [langfuse.yourcompany.com]
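
The cert-manager.io/cluster-issuer annotation above only works if a ClusterIssuer with that name exists. A minimal sketch of one, assuming Let's Encrypt with an HTTP-01 challenge through the nginx ingress class; the contact email and the account-key Secret name are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod                   # must match the Ingress annotation
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com             # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # assumed Secret name for the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx        # older cert-manager releases use "class: nginx"
```

HTTP-01 requires the host to be reachable from the public internet; for an internal-only hostname a DNS-01 solver would be the usual substitute.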

First Install

kubectl create namespace langfuse

# Create the secrets (kubectl shown for illustration; use ExternalSecret in production)
kubectl -n langfuse create secret generic langfuse-pg-credentials \
  --from-literal=username=langfuse \
  --from-literal=password="$(openssl rand -hex 24)"

helm upgrade --install langfuse langfuse/langfuse \
  -n langfuse \
  --version 1.x.y \
  -f values-prod.yaml

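# Watch the pods come up
kubectl -n langfuse get pods -w

On first start, the Web pod automatically runs the Postgres migrations and the Worker pod runs the ClickHouse migrations. Once the logs show migration completed, the install is in good shape.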

Autoscaling with HPA

langfuse:
  web:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 20
      targetCPUUtilizationPercentage: 70
  worker:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 30
      targetCPUUtilizationPercentage: 60
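      # Workers are better scaled on queue length: use KEDA with a Redis trigger
      # See KEDA's Redis Lists scaler (type: redis)

The right way to scale Workers is by queue length: with KEDA and its Redis Lists scaler, extra Workers start as soon as the ingestion queue backlog grows.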
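A rough sketch of that KEDA setup follows. The Deployment name langfuse-worker, the Redis list key, and the listLength threshold are assumptions for illustration; inspect your Redis instance for the actual key backing the ingestion queue before relying on it.

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: langfuse-redis-auth
  namespace: langfuse
spec:
  secretTargetRef:
    - parameter: password
      name: langfuse-redis-credentials     # Secret referenced in the Redis values above
      key: password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: langfuse-worker
  namespace: langfuse
spec:
  scaleTargetRef:
    name: langfuse-worker                  # assumed Deployment name; check kubectl get deploy
  minReplicaCount: 3
  maxReplicaCount: 30
  pollingInterval: 15                      # seconds between queue checks
  cooldownPeriod: 300                      # seconds before scaling back down
  triggers:
    - type: redis
      metadata:
        address: langfuse-redis.xxx.cache.amazonaws.com:6379
        listName: "bull:ingestion-queue:wait"   # assumed key name; verify in Redis
        listLength: "500"                       # target backlog per replica (illustrative)
        enableTLS: "true"
      authenticationRef:
        name: langfuse-redis-auth
```

KEDA creates and manages its own HPA for the target, so the chart's worker autoscaling would need to be disabled to keep two autoscalers from fighting over the same Deployment.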

PodDisruptionBudget + topologySpreadConstraints

The Langfuse chart does not configure a PDB by default. Add one yourself so that node maintenance cannot take down every replica at once:

langfuse:
  web:
    podDisruptionBudget:
      enabled: true
      minAvailable: 2
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app.kubernetes.io/name: langfuse-web }
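
If the chart version in use does not expose a podDisruptionBudget value, a standalone manifest achieves the same effect. This is a minimal sketch; the label selector is assumed to match the Web pods, so verify it with kubectl get pods --show-labels first.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: langfuse-web
  namespace: langfuse
spec:
  minAvailable: 2                              # with 3 replicas, at most 1 voluntary eviction at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: langfuse-web     # assumed label; must match the actual pod labels
```

A PDB only guards against voluntary disruptions such as node drains; it does not protect against a node crashing, which is what the zone spread constraint is for.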

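Observability: Langfuse Needs Monitoring Too

Ironically, the observability platform itself needs to be observed. At a minimum, cover three things:

Prometheus scrape: Langfuse exposes /api/metrics (Prometheus format); set serviceMonitor.enabled=true in the chart to turn it on
ClickHouse exporter: monitor query latency, the merge queue, and disk usage, and alert when any of them crosses its threshold
Redis queue length: a steadily growing ingestion queue means the Workers cannot keep up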
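
The queue-length check could be expressed as a Prometheus Operator alert rule along these lines. It assumes redis_exporter is running with --check-keys so that it exports redis_key_size; the key name and the numeric thresholds are illustrative guesses, not values taken from Langfuse.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: langfuse-queue-alerts
  namespace: langfuse
spec:
  groups:
    - name: langfuse.ingestion
      rules:
        - alert: LangfuseIngestionQueueGrowing
          # Fires when the list is both large and still growing over the last 15 minutes.
          # The key label value is an assumption; use the key your queue actually writes to.
          expr: |
            redis_key_size{key="bull:ingestion-queue:wait"} > 1000
            and
            delta(redis_key_size{key="bull:ingestion-queue:wait"}[15m]) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Ingestion queue keeps growing; Workers may not be keeping up"
```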

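Backup Strategy

Postgres

RDS / CloudSQL ship with built-in PITR; keep 7-30 days of retention. If self-hosted, run Patroni with WAL-G shipping to S3: a daily full backup plus continuous WAL archiving.

ClickHouse

Use BACKUP DATABASE default TO S3(...) for a weekly full backup plus daily incrementals. The clickhouse-backup tool can help manage them.

S3 blob

S3 supports object versioning; enable it and add cross-region replication for DR. For self-hosted MinIO, configure site replication.

Redis

Queue data is essentially replayable (the SDK retries), so AOF persistence is enough and no dedicated backup is needed.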

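Zero-Downtime Upgrades

Read the Release Notes and check for breaking DB schema changes
Upgrade the staging environment first and run smoke tests
Upgrade the Workers first (a helm upgrade that changes only the worker image tag) and let the new ClickHouse migrations finish
Then upgrade Web; the pods roll automatically behind the Ingress, and a browser refresh picks up the new version
Watch the error rate for 10 minutes; if nothing moves, you are done

Canary is also an option
For extra safety, use Argo Rollouts for a canary: roll 1 Web pod to the new version first, watch the error rate for 5 minutes, then promote gradually to the full fleet. The chart does not ship canary support, so you wire it up yourself.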
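
A sketch of what that could look like with Argo Rollouts, wrapping the existing Web Deployment through workloadRef. The Deployment name and label are assumptions about what the chart renders; with 3 replicas and no traffic router, setWeight: 33 works out to roughly one canary pod.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: langfuse-web
  namespace: langfuse
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: langfuse-web   # assumed label on the Web pods
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langfuse-web                       # assumed Deployment name rendered by the chart
  strategy:
    canary:
      steps:
        - setWeight: 33                      # about 1 of 3 pods on the new version
        - pause: { duration: 5m }            # watch the error rate here
        - setWeight: 66
        - pause: { duration: 5m }            # full promotion follows the last step
```

Because the Rollout creates its own pods from the referenced template, the original Deployment is normally scaled down to zero, and any HPA should target the Rollout rather than the Deployment.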

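Common Pitfalls

Ingestion queue backlog: 90% of the time the cause is slow ClickHouse writes or too few Worker replicas. Check ClickHouse CPU / IO first, then check whether any Worker pod was OOMKilled
Web 504: usually a slow ClickHouse query. If the Traces detail view opens slowly, ClickHouse merges are backed up or an index is missing. Chapter 9 covers tuning
S3 5xx: check the bucket policy and the access key permissions; forcePathStyle must be enabled for MinIO
NextAuth secure cookie issue: if the Ingress is not serving HTTPS or NEXTAUTH_URL is set to http, login loops forever. Use https end to end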

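Chapter Summary

Use the Helm chart in production and always pin the version; never let it float to the latest release
Scale Web and Worker independently; Workers are best scaled on queue length with KEDA
Use managed services for Postgres / Redis / ClickHouse / S3 where possible; self-hosting costs more than you expect
Production ClickHouse should run under the Altinity operator for sharding and replication
Manage secrets with External Secrets / SOPS / Vault and never commit them to Git
PDB + topologySpread + HPA are the three pillars of high availability
Langfuse itself must be monitored by Prometheus; ClickHouse and the Redis queue length are the primary metrics
Upgrade Workers first, then Web, and always read the Release Notes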