Automating support ticket investigations with AI: from log structure to agent implementation

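Billing is inherently stateful. The result of an API call depends on the customer's billing state and history. For example, if a customer schedules a cancellation and later upgrades their plan, the cancellation may need to be reverted automatically. As a result, simple operations sometimes produce confusing outcomes, which leads to more support tickets and bug reports.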
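
To make the state dependence concrete, here is a minimal sketch. The `BillingState` shape, the field names, and the rule that an upgrade always clears a pending cancellation are assumptions for illustration, not the actual billing logic.

```typescript
// Hypothetical billing state; field names are illustrative assumptions.
interface BillingState {
  plan: string;
  cancelAtPeriodEnd: boolean;
}

function scheduleCancellation(state: BillingState): BillingState {
  return { ...state, cancelAtPeriodEnd: true };
}

function upgradePlan(state: BillingState, newPlan: string): BillingState {
  // Assumption: an upgrade implies the customer wants to stay,
  // so any pending cancellation is reverted.
  return { plan: newPlan, cancelAtPeriodEnd: false };
}

const initialState: BillingState = { plan: "starter", cancelAtPeriodEnd: false };
const afterCancel = scheduleCancellation(initialState);
const afterUpgrade = upgradePlan(afterCancel, "pro");

console.log(afterCancel);
console.log(afterUpgrade);
```

The same upgrade call gives a different visible result depending on what happened before it, which is why a single request is rarely enough to explain a ticket.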

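Investigating these issues usually means digging through logs to reconstruct the customer's billing timeline: what actions they took, what state they were in at the time of each request, and what happened afterwards. A single ticket can require writing a batch of queries against Axiom (the log store), sifting through hundreds of log entries, and analysing request/response payloads to piece together what happened.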
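
The core of that reconstruction can be sketched as filtering to one customer and ordering by time. The `RequestLog` shape, the paths, and the sample entries below are made up for illustration; real log entries carry far more detail.

```typescript
// Hypothetical shape of one request-level log entry.
interface RequestLog {
  timestamp: string; // ISO 8601
  customerId: string;
  method: string;
  path: string;
  status: number;
}

// Keep only one customer's requests and order them chronologically.
function buildTimeline(logs: RequestLog[], customerId: string): RequestLog[] {
  return logs
    .filter((log) => log.customerId === customerId)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}

// Made-up sample data, deliberately out of order.
const sampleLogs: RequestLog[] = [
  { timestamp: "2024-05-02T10:00:00Z", customerId: "cus_1", method: "POST", path: "/v1/upgrade", status: 200 },
  { timestamp: "2024-05-01T09:00:00Z", customerId: "cus_1", method: "POST", path: "/v1/cancel", status: 200 },
  { timestamp: "2024-05-01T12:00:00Z", customerId: "cus_2", method: "POST", path: "/v1/cancel", status: 200 },
];

const timeline = buildTimeline(sampleLogs, "cus_1");
for (const entry of timeline) {
  console.log(`${entry.timestamp} ${entry.method} ${entry.path} -> ${entry.status}`);
}
```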

You can imagine how tedious this is to do by hand. Sometimes entire days are spent on investigations.

Structured logs are the prerequisite

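To understand how the investigation can be automated, you first need to understand how the logs are structured. The author spent a lot of time making the logs structured, queryable, and easy to reason about. That work was originally done for humans, and incidentally it also made the logs extremely effective for AI agents.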

Request-centric logs

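Every log call in the codebase adds context to its originating request by enriching a single JSON payload, which is emitted once when the request ends. This follows the wide logging principle. Every request carries at least: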

Tenant context: which org and customer made the request
Request/response body
Miscellaneous: timestamp, API version, etc.
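
As a rough illustration of the fields above, a wide event can be modelled as one object that is enriched during the request and serialised once at the end. The `WideRequestEvent` type, the `enrich` helper, and all sample values are assumptions, not the codebase's actual types.

```typescript
// Hypothetical shape of a wide, request-centric log event.
interface WideRequestEvent {
  tenant: { orgId: string; customerId?: string };
  request: { method: string; path: string; body: unknown };
  response: { status: number; body: unknown };
  misc: { timestamp: string; apiVersion?: string };
}

type PartialEvent = Partial<WideRequestEvent>;

// Merge new context into the event without emitting anything yet.
function enrich(event: PartialEvent, patch: PartialEvent): PartialEvent {
  return { ...event, ...patch };
}

let wideEvent: PartialEvent = {};
wideEvent = enrich(wideEvent, { tenant: { orgId: "org_123", customerId: "cus_456" } });
wideEvent = enrich(wideEvent, { request: { method: "POST", path: "/v1/example", body: { plan: "pro" } } });
wideEvent = enrich(wideEvent, { response: { status: 200, body: { ok: true } } });
wideEvent = enrich(wideEvent, { misc: { timestamp: "2024-05-01T09:00:00Z", apiVersion: "1.2.0" } });

// One JSON line per request, written when the request ends.
const logLine = JSON.stringify(wideEvent);
console.log(logLine);
```

Emitting a single line per request means a query for one tenant returns complete requests rather than fragments that have to be stitched together.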

This is implemented with middleware. For example, the middleware that adds tenant context:

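import type { Context, Next } from "hono";
// HonoEnv, addAppContextToLogs and logRequestResult are project-internal
// (their imports are not shown in the source).

export const analyticsMiddleware = async (c: Context<HonoEnv>, next: Next) => {
    const ctx = c.get("ctx");
    // URLs handed to logRequestResult as skipUrls.
    const skipUrls = ["/v1/customers/all/search"];
    // Attach tenant context so every log for this request carries it.
    ctx.logger = addAppContextToLogs({
        logger: ctx.logger,
        appContext: {
            org_id: ctx.org?.id,
            org_slug: ctx.org?.slug,
            env: ctx.env,
            auth_type: ctx.authType,
            customer_id: ctx.customerId,
            api_version: ctx.apiVersion?.semver,
            scopes: ctx.scopes
        },
    });
    // Run the rest of the handler chain.
    await next();
    // Re-read the context after the handlers have run.
    const finalCtx = c.get("ctx");
    // Log the request result once, at the end of the request.
    logRequestResult({ ctx: finalCtx, skipUrls });
};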

If you can only add one thing to your logs, add tenant context. It is by far the most effective filter and the starting point for most investigations.

Append-only extras

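As each endpoint walks through a request, it intentionally calls functions that append to the extras field on the request context object, capturing the information needed to understand the customer's state at that point in time. For example, an upgrade request logs the outgoing plan, the incoming plan, whether there was a previous cancellation, and so on.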
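
A minimal sketch of the idea, assuming extras are stored as an ordered list that is only ever appended to. The `RequestContext` shape, `addExtra`, `logUpgradeContext`, and the key names are hypothetical.

```typescript
type ExtraValue = string | number | boolean | null;

// Hypothetical request context: extras are appended, never overwritten.
interface RequestContext {
  extras: Array<{ key: string; value: ExtraValue }>;
}

function addExtra(ctx: RequestContext, key: string, value: ExtraValue): void {
  ctx.extras.push({ key, value });
}

// What an upgrade endpoint might record as it walks through the request.
function logUpgradeContext(
  ctx: RequestContext,
  args: { outgoingPlan: string; incomingPlan: string; hadPreviousCancellation: boolean },
): void {
  addExtra(ctx, "outgoing_plan", args.outgoingPlan);
  addExtra(ctx, "incoming_plan", args.incomingPlan);
  addExtra(ctx, "had_previous_cancellation", args.hadPreviousCancellation);
}

const upgradeCtx: RequestContext = { extras: [] };
logUpgradeContext(upgradeCtx, {
  outgoingPlan: "starter",
  incomingPlan: "pro",
  hadPreviousCancellation: true,
});
console.log(JSON.stringify(upgradeCtx.extras));
```

Because entries are appended in the order the endpoint runs, the extras list preserves the sequence of decisions made during the request.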

Teaching an AI agent how to investigate

With the logs structured this way, an investigation becomes much more definable. In essence, it means iterating over queries until you have the right set of logs to make an accurate assumption.
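
That loop can be sketched as: run a query, check whether the result is enough, otherwise widen the query and try again. Everything here is an assumption for illustration: the `LogQuery` shape, the "double the time window" refinement, and the synchronous in-memory stub standing in for a real (asynchronous) log store.

```typescript
interface LogEntry { customerId: string; hoursAgo: number; path: string }
interface LogQuery { customerId: string; windowHours: number }

type RunQuery = (query: LogQuery) => LogEntry[];

// Iterate until the logs are sufficient or the iteration budget runs out.
function investigate(
  runQuery: RunQuery,
  isSufficient: (logs: LogEntry[]) => boolean,
  initial: LogQuery,
  maxIterations = 5,
): { logs: LogEntry[]; iterations: number; sufficient: boolean } {
  let query = initial;
  let logs: LogEntry[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    logs = runQuery(query);
    if (isSufficient(logs)) return { logs, iterations: i, sufficient: true };
    // Hypothetical refinement step: look further back in time.
    query = { ...query, windowHours: query.windowHours * 2 };
  }
  return { logs, iterations: maxIterations, sufficient: false };
}

// In-memory stub with made-up data.
const storedLogs: LogEntry[] = [
  { customerId: "cus_1", hoursAgo: 30, path: "/v1/cancel" },
  { customerId: "cus_1", hoursAgo: 2, path: "/v1/upgrade" },
];
const stubRunQuery: RunQuery = (q) =>
  storedLogs.filter((l) => l.customerId === q.customerId && l.hoursAgo <= q.windowHours);

// "Sufficient" here means both the cancel and the upgrade are visible.
const hasBothEvents = (logs: LogEntry[]) =>
  logs.some((l) => l.path === "/v1/cancel") && logs.some((l) => l.path === "/v1/upgrade");

const investigation = investigate(stubRunQuery, hasBothEvents, { customerId: "cus_1", windowHours: 6 });
console.log(investigation);
```

An agent does the same thing with more freedom: it picks the next query itself instead of following a fixed refinement rule.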

Passing this information to the agent is simple:

Hook Claude Code up to Axiom's MCP and skills
Write a few custom skills detailing how the logs are structured and the different investigation "domains" (billing, Stripe webhooks, entitlements, etc.)
For each domain, add the core pieces of information about how it works under the hood (for example, the caching structure for entitlements)

With these skills in place, Claude can be prompted once: paste a ticket from Slack and it runs freely, figures out the root cause, and can even implement a fix.

Investigations not only finish faster; engineers can also handle several of them in the background while building new features.

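These skills were written very intentionally. Earlier attempts at one-shotting them were never effective. This time it worked because half a day went into iterating with Claude: sharing personal examples of how investigations are done, which kinds of queries to run in Axiom, and how to break the work down into domains.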

Going further: automatic triggering

The investigation skill is a lifesaver on its own, but the end goal is for these tickets to be handled fully autonomously.

The first step is to trigger an investigation automatically when a ticket comes in. That requires infrastructure to:

Host an AI agent in the cloud with access to the codebase, MCPs, and skills
Connect the AI agent to Slack so it can ingest support tickets and the necessary context

The Slack API was not used directly, because the right foundations matter. Plain was chosen as the support infrastructure: it handles thread infrastructure, channel integration, webhook triggers, and other support primitives, so none of that has to be built in-house.

For hosting the AI agent, Mastra was chosen. It abstracts away many lower-level tools such as memory, loading skills, MCP support, and file system support. Spinning up an agent with all the tools a local Claude Code normally has is actually fairly straightforward.

The current setup: Plain webhook → Mastra agent → Claude Code with Axiom MCP → Slack thread with results.
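
The shape of that flow can be sketched with plain interfaces. None of the names below are the Plain, Mastra, or Slack APIs: `TicketEvent`, `Investigator`, `ThreadPoster`, and `handleTicketWebhook` are hypothetical stand-ins, and the stubs replace the real agent and the real Slack thread.

```typescript
// Hypothetical payload delivered by the support tool's webhook.
interface TicketEvent {
  ticketId: string;
  threadId: string;
  text: string;
}

// Stand-in for the hosted agent that runs the investigation.
interface Investigator {
  investigate(ticketText: string): Promise<string>;
}

// Stand-in for whatever posts results back to the support thread.
interface ThreadPoster {
  post(threadId: string, message: string): Promise<void>;
}

// Webhook -> agent -> thread, in that order.
async function handleTicketWebhook(
  event: TicketEvent,
  investigator: Investigator,
  poster: ThreadPoster,
): Promise<string> {
  const findings = await investigator.investigate(event.text);
  const message = `Investigation for ticket ${event.ticketId}:\n${findings}`;
  await poster.post(event.threadId, message);
  return message;
}

// Stubs so the sketch runs without any external service.
const postedMessages: Array<{ threadId: string; message: string }> = [];
const stubInvestigator: Investigator = {
  async investigate(ticketText) {
    return `Stub findings for: ${ticketText}`;
  },
};
const stubPoster: ThreadPoster = {
  async post(threadId, message) {
    postedMessages.push({ threadId, message });
  },
};

const demoRun = handleTicketWebhook(
  { ticketId: "T-1", threadId: "thread-1", text: "Customer was charged after cancelling" },
  stubInvestigator,
  stubPoster,
);
demoRun.then((message) => console.log(message));
```

Keeping the agent and the posting target behind interfaces makes it possible to swap either side without touching the handler.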

Honest limitations

This version of the agent is mostly useful only for straightforward investigations. Once an issue needs deeper iteration, the usual fallback is local Claude Code.

Fully replicating the local Claude Code environment in the cloud is surprisingly hard. Locally, Claude Code can already autonomously handle fairly involved bug fixes through a test-driven workflow. Recreating that reliably in a hosted agent is much harder than expected: MCP auth is flaky, skills don't seem to trigger correctly, and the overall understanding of the codebase feels noticeably worse. It is hard to pinpoint exactly where things break down, but the quality gap between the local and hosted agent is still very real.

The interaction model matters too. The autonomous support agent is used through Slack: the agent investigates a ticket, posts results to Slack, and then converses in the thread to dig deeper. But compared to Claude Code or Codex, the Slack interface just feels kind of clunky. Many features are not quite optimised, such as agent thinking and tool calling. At some point, the overhead of continuing an investigation over Slack becomes higher than just opening a new Claude Code session locally.

Future direction

I suspect the future is about exposing functionality to existing harnesses like Claude Code or Codex, rather than rebuilding those interaction patterns from scratch.

I can't quite visualise what this looks like yet, but I imagine it could be something like a Plain webhook triggering a cloud Claude Code session, which we can then continue locally.

What I do know, though, is that there's something really powerful about having a single harness with access to all the relevant tools and context.

Key insights

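Billing is inherently stateful: an investigation is not about looking up one event, it is about reconstructing a timeline
Structured logs are the prerequisite for AI investigation: request-centric logging plus append-only extras, built for humans rather than for AI, and incidentally usable by AI
Skills need intentional iteration: one-shotting never works, and it takes time iterating together with Claude
The hosted agent quality gap is real: there is a noticeable difference between local Claude Code and the cloud version
The future is not rebuilding interaction patterns but extending existing harnesses: Claude Code/Codex as the core, cloud for the trigger, local for the deep dive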