Chapter 08

K8s + Helm: Production Deployment

Docker Compose is fine for local use; production needs rolling upgrades, high availability, backups, and observability. The official Helm chart takes care of all of that, and this chapter covers how to use it.

Prerequisites

The deployment topology at a glance

                ┌─────────────────────────────────────┐
 Internet ──>   │ Ingress (nginx + cert-manager)      │
                └──────────┬──────────────────────────┘
                           │
            ┌──────────────┴────────────────────┐
            ▼                                   ▼
      ┌──────────┐                        ┌───────────┐
      │ Web x 3  │ (Next.js, HPA)         │ Worker x3 │ (queue consumers)
      └────┬─────┘                        └─────┬─────┘
           │ reads/writes                       │ writes
           ├─────────┬──────────┐               │
           ▼         ▼          ▼               ▼
      ┌────────┐ ┌───────┐ ┌────────┐    ┌──────────────┐
      │Postgres│ │ Redis │ │   S3   │    │  ClickHouse  │
      │   HA   │ │  HA   │ │ Bucket │    │  2x shard +  │
      └────────┘ └───────┘ └────────┘    │  2x replica  │
                                         │ + Keeper x3  │
                                         └──────────────┘

Helm repo and version pinning

helm repo add langfuse https://langfuse.github.io/langfuse-k8s
helm repo update

# list available versions; always pin one in production
helm search repo langfuse/langfuse --versions | head

Every helm command carries --version 1.x.y. Production must never float on the latest chart version: one day a sudden breaking upgrade rewrites the ClickHouse tables and you're stuck.

Key sections of values.yaml

The official default values are demo-grade. The blocks you must change for production are walked through one by one below.

① Image and replicas

langfuse:
  image:
    repository: langfuse/langfuse
    tag: "3.20.1"           # 显式锁定
    pullPolicy: IfNotPresent

  web:
    replicas: 3
    resources:
      requests: { cpu: 500m, memory: 1Gi }
      limits:   { cpu: 2,    memory: 4Gi }

  worker:
    replicas: 3              # add more during ingestion peaks
    resources:
      requests: { cpu: 500m, memory: 1Gi }
      limits:   { cpu: 4,    memory: 8Gi }

② Secrets you must change

langfuse:
  nextauth:
    # generate with openssl rand -hex 32 in production
    secret: "CHANGE_ME_nextauth_secret_32_hex"
    url: "https://langfuse.yourcompany.com"

  salt: "CHANGE_ME_salt_32_hex"
  encryptionKey: "CHANGE_ME_encryption_64_hex"   # note: 64 hex chars

  # disable public sign-up (internal SSO only)
  additionalEnv:
    - name: AUTH_DISABLE_SIGNUP
      value: "true"
    - name: LANGFUSE_INIT_ORG_NAME
      value: "YourCompany"
Never commit secrets to Git
The secrets above should be injected via Sealed Secrets, External Secrets, SOPS, or Vault, never written into values.yaml. Plenty of teams came to regret committing them to an internal repo just that once.
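
For example, with the External Secrets Operator you could pull them from AWS Secrets Manager. A minimal sketch, assuming a ClusterSecretStore named aws-secrets and a stored secret langfuse/core with matching properties (all of these names are placeholders):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: langfuse-core-secrets
  namespace: langfuse
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets                # placeholder ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: langfuse-core-secrets      # Kubernetes Secret created by the operator
  data:
    - secretKey: nextauth-secret
      remoteRef: { key: langfuse/core, property: nextauth_secret }
    - secretKey: salt
      remoteRef: { key: langfuse/core, property: salt }
    - secretKey: encryption-key
      remoteRef: { key: langfuse/core, property: encryption_key }

The chart then consumes langfuse-core-secrets instead of inline literals; check your chart version for the exact existing-secret keys.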

③ Postgres

Two ways to do it: the chart's bundled instance (postgresql.deploy=true) or an external managed one (AWS RDS / CloudSQL). For production, external is strongly recommended:

postgresql:
  deploy: false   # use an external RDS instance
  host: "langfuse-pg.cluster-xxx.us-east-1.rds.amazonaws.com"
  port: 5432
  database: "langfuse"
  auth:
    existingSecret: "langfuse-pg-credentials"
    usernameKey: "username"
    passwordKey: "password"

④ Redis

redis:
  deploy: false
  host: "langfuse-redis.xxx.cache.amazonaws.com"
  port: 6379
  auth:
    existingSecret: "langfuse-redis-credentials"
    passwordKey: "password"
  tls:
    enabled: true
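
The existingSecret referenced here has to exist before the install. A quick sketch with kubectl (in production, inject it the same way as the other secrets):

kubectl -n langfuse create secret generic langfuse-redis-credentials \
  --from-literal=password="<elasticache-auth-token>"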

⑤ ClickHouse (the hard part in production)

The chart's bundled single-node ClickHouse is enough for a demo, but for replicas + shards + HA you want the Altinity clickhouse-operator, with Langfuse connecting to it as an external ClickHouse:

clickhouse:
  deploy: false
  host: "chi-langfuse-cluster-0-0.langfuse.svc"
  port: 9000
  database: "default"
  auth:
    existingSecret: "langfuse-ch-credentials"
    usernameKey: "username"
    passwordKey: "password"
  # must match the cluster name defined in the operator CR
  cluster: "langfuse"

A minimal ClickHouseInstallation with 2 shards and 2 replicas:

apiVersion: "clickhouse.altinity.com/v1"
kind: ClickHouseInstallation
metadata:
  name: langfuse
spec:
  configuration:
    clusters:
      - name: langfuse
        layout:
          shardsCount: 2
          replicasCount: 2
    zookeeper:
      nodes:
        - host: chk-langfuse-0-0
        - host: chk-langfuse-0-1
        - host: chk-langfuse-0-2
  defaults:
    templates:
      podTemplate: ch-pod
      dataVolumeClaimTemplate: ch-data
  templates:
    podTemplates:
      - name: ch-pod
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:24.8
              resources:
                requests: { cpu: 2, memory: 8Gi }
                limits:   { cpu: 8, memory: 32Gi }
    volumeClaimTemplates:
      - name: ch-data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp3
          resources: { requests: { storage: 500Gi } }
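
The zookeeper.nodes above point at a three-node ClickHouse Keeper ensemble (the chk-langfuse-* hosts). With recent versions of the Altinity operator that can be declared as a ClickHouseKeeperInstallation; a rough sketch, with the exact schema depending on your operator version:

apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: ClickHouseKeeperInstallation
metadata:
  name: langfuse
spec:
  configuration:
    clusters:
      - name: langfuse
        layout:
          replicasCount: 3        # three Keeper replicas for quorum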

⑥ S3 / MinIO

s3:
  deploy: false                     # use external S3
  eventUpload:
    bucket: "langfuse-events"
    region: "us-east-1"
    endpoint: ""                    # 空 = AWS, 用 R2/MinIO 填对应 endpoint
    accessKeyId:
      valueFrom: { secretKeyRef: { name: langfuse-s3, key: accessKey } }
    secretAccessKey:
      valueFrom: { secretKeyRef: { name: langfuse-s3, key: secretKey } }
    forcePathStyle: false           # MinIO needs true
  mediaUpload:
    bucket: "langfuse-media"
    # same credentials as above
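
The langfuse-s3 secret referenced via secretKeyRef needs to exist with exactly those keys; a sketch (replace the placeholders with real credentials):

kubectl -n langfuse create secret generic langfuse-s3 \
  --from-literal=accessKey="<AWS_ACCESS_KEY_ID>" \
  --from-literal=secretKey="<AWS_SECRET_ACCESS_KEY>"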

⑦ Ingress + TLS

ingress:
  enabled: true
  className: "nginx"
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"    # large prompt payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
  hosts:
    - host: langfuse.yourcompany.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: langfuse-tls
      hosts: [langfuse.yourcompany.com]
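
The cert-manager annotation assumes a ClusterIssuer named letsencrypt-prod already exists in the cluster. If it doesn't, a minimal HTTP-01 issuer looks roughly like this (the contact email is a placeholder):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@yourcompany.com          # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key     # stores the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx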

The first install

kubectl create namespace langfuse

# create the secrets (kubectl for the example; use ExternalSecret in production)
kubectl -n langfuse create secret generic langfuse-pg-credentials \
  --from-literal=username=langfuse \
  --from-literal=password="$(openssl rand -hex 24)"

helm upgrade --install langfuse langfuse/langfuse \
  -n langfuse \
  --version 1.x.y \
  -f values-prod.yaml

# watch the pods come up
kubectl -n langfuse get pods -w

On first startup the Web pods automatically run the Postgres migrations, and the Worker pods run the ClickHouse migrations. Once the logs show migration completed, you're in good shape.
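
To verify, grep the logs; the Deployment names below assume the chart's defaults for a release called langfuse, so adjust them to yours:

kubectl -n langfuse logs deploy/langfuse-web    | grep -i migrat
kubectl -n langfuse logs deploy/langfuse-worker | grep -i migrat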

HPA autoscaling

langfuse:
  web:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 20
      targetCPUUtilizationPercentage: 70
  worker:
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 30
      targetCPUUtilizationPercentage: 60
      # Workers should really scale on queue length: KEDA + a Redis trigger
      # see the KEDA Redis Lists scaler (trigger type: redis)

The truly correct way to scale Workers is by queue length: KEDA with the Redis Lists scaler, so that as soon as the ingestion queue backlog grows, extra Workers are pulled in to drain it. A sketch follows.
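
A minimal ScaledObject sketch. The Deployment name, Redis address, and especially the list name are assumptions here: Langfuse's queues are BullMQ lists in Redis, so check the real key names in your instance before wiring this up, and put the Redis password in a TriggerAuthentication:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: langfuse-worker-queue
  namespace: langfuse
spec:
  scaleTargetRef:
    name: langfuse-worker                        # assumed worker Deployment name
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
    - type: redis
      metadata:
        address: langfuse-redis.xxx.cache.amazonaws.com:6379
        listName: "bull:ingestion-queue:wait"    # placeholder; verify the actual queue key
        listLength: "1000"                       # target backlog per replica
      authenticationRef:
        name: langfuse-redis-trigger-auth        # TriggerAuthentication holding the password

If you go this route, disable the chart's CPU-based worker autoscaling so the two controllers don't fight over the replica count.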

PodDisruptionBudget + topologySpreadConstraints

The Langfuse chart ships no PDB by default. Patch one in yourself so that node maintenance can't take down every replica at once; if your chart version doesn't expose these keys, see the standalone manifest after the snippet:

langfuse:
  web:
    podDisruptionBudget:
      enabled: true
      minAvailable: 2
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app.kubernetes.io/name: langfuse-web }
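
If your chart version doesn't accept these keys yet, an equivalent standalone PodDisruptionBudget applied next to the release does the same job (the label selector is an assumption, so check the actual labels on your web pods):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: langfuse-web
  namespace: langfuse
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: langfuse-web       # assumed pod label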

Observability: Langfuse itself needs to be monitored too

Ironically, the observability platform itself also has to be observed. At minimum, three things:

Backup strategy

Postgres
RDS / CloudSQL come with PITR; keep 7-30 days. Self-hosted: Patroni + WAL-G shipping to S3, daily full backups plus continuous WAL archiving.
ClickHouse
BACKUP DATABASE default TO S3(...) for weekly fulls plus daily incrementals; the clickhouse-backup tool helps manage them (see the sketch after this list).
S3 blobs
S3 itself has versioning; turn on cross-region replication for DR. For self-hosted MinIO, configure site replication.
Redis
The queue data is replayable by nature (SDKs retry), so AOF persistence is enough; no dedicated backups needed.
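
A hedged sketch of the weekly full ClickHouse backup via clickhouse-client; the bucket and credentials are placeholders, and in practice you would drive this from a CronJob or clickhouse-backup rather than by hand:

clickhouse-client --query "
  BACKUP DATABASE default
  TO S3('https://<bucket>.s3.us-east-1.amazonaws.com/backups/weekly-full',
        '<access-key-id>', '<secret-access-key>')
"

Daily incrementals then add SETTINGS base_backup = S3(...) pointing at the most recent full backup.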

Zero-downtime upgrades

  1. Read the release notes and check for breaking DB schema changes
  2. Upgrade a staging environment first and run a smoke test
  3. Upgrade the Workers first (a helm upgrade that only changes the worker image tag) and let the new ClickHouse migrations finish
  4. Then upgrade Web; the Ingress rolls over automatically and a browser refresh serves the new version
  5. Watch error rates for 10 minutes; if nothing moves, call it done
Canary works too
If you really want to play it safe, use Argo Rollouts for a canary: roll one Web pod to the new version, watch the error rate for 5 minutes, then progressively roll out the rest. The chart doesn't ship canary support itself; you wire it on top.

Common pitfalls

Chapter summary