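Tools Are Contracts, Not Functions: Rethinking Tool Calls in Coding Agents

One-Sentence Framing

Part 3 of the Harness series: Tools Are Contracts, Not Functions.

If a coding agent has a world of its own, then every tool call is generated text requesting a change to the state of that world.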

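Toy vs. Real Agent

A tool call can easily look simple: the model emits JSON, the harness parses it, some local function runs, and the result is fed back into the next turn's prompt.

That is the tiny-agent version.

But a coding agent's tool call is not a plain function call. It originates in generated text, and it is requesting access to the workspace, the shell, the network, the transcript, or another agent. That difference matters because the request can change both the agent's state and the environment it runs in.

The model can ask. The harness decides.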

The Proposal and Contract

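Imagine the model emits JSON that looks well structured. It is still only a proposal.

Producing valid-looking JSON does not automatically grant the model write access. The harness still has to answer a series of questions:

Does this tool exist?
Do the args match the schema?
Is the path inside the workspace?
Is this a create, an overwrite, or a patch?
Has the model recently read the existing file? Is the file-state baseline fresh?
Does this action need approval?
What should the human see before approving?
How much result output can safely enter the context?
Which transcript entries and state need updating after execution?

This series of questions is the tool contract.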
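To make the "proposal" idea concrete, here is a minimal sketch of the first two questions (does the tool exist, do the args match). The registry `KNOWN_TOOLS`, the `Proposal` type, and the `{"tool": ..., "args": ...}` wire shape are all assumptions made for illustration; the article does not prescribe them. Nothing is executed at this stage.

```python
import json
from dataclasses import dataclass

# Hypothetical registry: tool name -> required argument names and their types.
KNOWN_TOOLS = {
    "read_file": {"path": str},
    "patch_file": {"path": str, "old_text": str, "new_text": str},
}


@dataclass(frozen=True)
class Proposal:
    tool: str
    args: dict


def parse_proposal(raw: str) -> Proposal:
    """Turn model text into a Proposal, or raise ValueError. Runs no tool."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed output: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("proposal must be a JSON object")
    tool = data.get("tool")
    if not isinstance(tool, str) or tool not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    args = data.get("args")
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    # Every required argument must be present and have the expected type.
    for name, expected in KNOWN_TOOLS[tool].items():
        if not isinstance(args.get(name), expected):
            raise ValueError(f"argument {name!r} must be {expected.__name__}")
    return Proposal(tool, args)
```

A successful parse yields only a `Proposal`; path checks, policy, and approval still lie ahead of it.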

Where the Naive Design Fails

A naive design treats tools as a function map:

tool_map = {
  "read_file": lambda args: open(args["path"]).read(),
  "run_shell": lambda args: subprocess.run(args["cmd"], ...),
  "patch_file": lambda args: edit_file(...)
}

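The code is short and looks clean. But it collapses nine responsibilities into a single line per tool:

parsing
schema validation
path safety
permission policy
sandbox choice
execution
output clipping
transcript recording
state updates

That is fine for a demo. But a coding agent that can modify a real repo needs stronger boundaries. The problem is not the function itself; the problem is that the tool boundary has different trust rules.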
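One of the responsibilities the naive map skips is path safety: `open(args["path"])` will happily follow `../` or an absolute path out of the repo. The sketch below shows one way a harness might check containment before any handler runs. The function name `resolve_in_workspace` is hypothetical, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve `requested` against the workspace root; raise if it escapes."""
    root = workspace.resolve()
    # Joining with an absolute `requested` discards `root`, and resolve()
    # collapses ".." segments and follows existing symlinks, so both kinds
    # of escape end up outside `root` and are caught by the check below.
    target = (root / requested).resolve()
    if not target.is_relative_to(root):
        raise PermissionError(f"path escapes workspace: {requested}")
    return target
```

Under this sketch, `resolve_in_workspace(Path("repo"), "src/main.py")` returns a path under the repo, while `"../secrets.txt"` raises `PermissionError`.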

Function call vs Tool call

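Ordinary program	Coding agent
A function call is an implementation detail	A tool call is where generated text requests a change to the real world
Internal detail	Interface contract
Same trust domain	Crosses a trust boundary
Failure is a bug	Failure is a policy decision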

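What a Tool Contract Should Contain

A useful tool contract cannot consist of just a name and a handler. It should describe:

Argument schema (model-facing): exact fields, types, and constraints
Path / resource scope (harness-facing): what this tool can read or mutate
Fresh-state requirements: which baselines are needed before it runs (file already read? build already run?)
Permission policy: how allow / ask / sandbox / deny is decided
Result limits: how much output is allowed into the context
Transcript hooks: which durable state changes after success or failure
Pairing strategy: how calls and results are matched (async, concurrent, out-of-order)

Some fields face the model; others belong only to the harness. The model needs enough information to form a correct request, and the harness needs enough metadata to adjudicate it correctly.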
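The split between model-facing and harness-only fields can be sketched as a single record with a restricted view. All names here (`ToolContract`, `Decision`, `model_view`, and the individual fields) are illustrative assumptions, not an API from the article or from any particular harness; transcript hooks and pairing are left out to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    SANDBOX = "sandbox"
    DENY = "deny"


@dataclass(frozen=True)
class ToolContract:
    # Model-facing: what the model needs in order to form a valid request.
    name: str
    description: str
    argument_schema: dict[str, Any]
    # Harness-only: metadata used to adjudicate the request.
    mutates: bool
    allowed_roots: tuple[str, ...]
    requires_fresh_read: bool
    default_decision: Decision
    max_result_chars: int
    handler: Callable[[dict[str, Any]], str]

    def model_view(self) -> dict[str, Any]:
        """Return only the model-facing fields; policy stays in the harness."""
        return {
            "name": self.name,
            "description": self.description,
            "parameters": self.argument_schema,
        }
```

Only `model_view()` would be serialized into the prompt; the policy fields never need to be shown to the model for the harness to enforce them.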

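A Minimal Viable Lifecycle

From the moment it is issued until execution completes, a tool call passes through these seven steps:

Step	What it handles
parse	malformed output
schema validation	missing fields, wrong types
path validation	workspace escape
policy	risky action, approval, sandbox, denial
execution	run real handler
bounding	keeps a single command or file read from flooding the next turn's prompt
recording	lets later turns resume and audit what happened

So the "just call the function" mental model falls short: the function is only a small piece of the lifecycle.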
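The seven steps can be sketched end to end for a single hypothetical `read_file` tool. This is an illustration under stated assumptions rather than a reference implementation: the wire format, the `.env` deny rule, the `MAX_CHARS` budget, and the transcript entry shape are all invented for the example.

```python
import json
from pathlib import Path

MAX_CHARS = 2000  # assumed result budget for this sketch


def run_tool_call(raw: str, workspace: Path, transcript: list[dict]) -> dict:
    """Run one read_file proposal through parse -> ... -> record."""

    def finish(status: str, detail: str) -> dict:
        entry = {"raw": raw, "status": status, "detail": detail}
        transcript.append(entry)  # 7. recording (rejections are recorded too)
        return entry

    try:  # 1. parse
        call = json.loads(raw)
    except json.JSONDecodeError:
        return finish("rejected", "malformed output")

    # 2. schema validation
    if not isinstance(call, dict) or call.get("tool") != "read_file":
        return finish("rejected", "unknown tool")
    args = call.get("args")
    if not isinstance(args, dict) or not isinstance(args.get("path"), str):
        return finish("rejected", "args.path must be a string")

    # 3. path validation
    root = workspace.resolve()
    target = (root / args["path"]).resolve()
    if not target.is_relative_to(root):
        return finish("rejected", "workspace escape")

    # 4. policy (assumed rule: never read a secrets file)
    if target.name == ".env":
        return finish("denied", "policy: secrets file")

    # 5. execution: the only step the naive function map performs
    try:
        output = target.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        return finish("error", type(exc).__name__)

    # 6. bounding
    if len(output) > MAX_CHARS:
        dropped = len(output) - MAX_CHARS
        output = output[:MAX_CHARS] + f"\n[truncated {dropped} chars]"
    return finish("ok", output)
```

Note that the actual file read is one statement out of the whole function, which is the point of the table above.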

Validate Before Approval

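Validation should happen before approval. A user should not be asked to judge a request that was invalid to begin with.

Cases that must be caught before approval:

patch_file points outside the workspace → reject
old_text is missing or ambiguous → reject
a tool call immediately repeats the previous turn's failed request → reject or retry
a write targets an existing file but lacks a fresh baseline → reject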
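Two of these checks, the fresh baseline and the ambiguous `old_text`, can be sketched as a pure function that runs before any approval prompt is built. Using a SHA-256 digest of the last-read content as the baseline is an assumption made for this example; a harness could equally track modification times or version counters. The names `digest` and `validate_patch` are hypothetical.

```python
import hashlib
from typing import Optional


def digest(text: str) -> str:
    """Fingerprint of file content, stored when the model last read the file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def validate_patch(
    current_text: str, baseline_digest: Optional[str], old_text: str
) -> Optional[str]:
    """Return a rejection reason, or None if the patch may move on to policy."""
    if baseline_digest is None:
        return "file was never read: no baseline"
    if baseline_digest != digest(current_text):
        return "stale baseline: file changed since last read"
    # An empty old_text matches everywhere, so treat it as missing.
    matches = current_text.count(old_text) if old_text else 0
    if matches == 0:
        return "old_text missing"
    if matches > 1:
        return f"old_text ambiguous: {matches} matches"
    return None
```

Only when this returns `None` does it make sense to spend a human's attention on the request.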

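Approval is a product surface. When approval is required, the summary should stay bounded:

The approval prompt for a file edit should show the affected path and the change shape
Avoid dumping unbounded raw content
The harness can show the diff, counts, and a preview afterwards
Keep the approval boundary small and clear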
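As an illustration of "affected path plus change shape", the sketch below builds a summary whose length does not grow with the size of the edit. It relies on the standard library's `difflib.unified_diff`; the function name `approval_summary`, the `preview_lines` default, and the summary wording are assumptions.

```python
import difflib


def approval_summary(path: str, before: str, after: str, preview_lines: int = 6) -> str:
    """Summarize an edit as path, +/- counts, and a capped diff preview."""
    diff = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=path,
            tofile=path,
            lineterm="",
            n=1,  # one line of surrounding context per hunk
        )
    )
    body = diff[2:]  # drop the two "---"/"+++" file header lines
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    shown = body[:preview_lines]
    hidden = len(body) - len(shown)
    lines = [f"patch_file {path}: +{added} -{removed}"] + shown
    if hidden > 0:
        lines.append(f"... {hidden} more diff lines not shown")
    return "\n".join(lines)
```

The full diff can still be made available on demand; the prompt itself stays at most `preview_lines + 2` lines long.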

Bounded Results Are Part Of Safety

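Tool output becomes context, and context influences agent behavior.

Dangerous patterns:

search returns 10,000 results → the model is no better informed
run_shell prints a huge log → the next turn may lose the actual user request
read_file re-reads the same unchanged file every turn → useful context gets crowded out by duplicate text

So a tool contract needs result limits. This is not only about token cost; it determines whether the model's working set stays accurate.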

The harness should decide:

Which result evidence enters the transcript
Which becomes durable state
Which is kept as an external artifact reference
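One possible way to combine a result limit with an external artifact reference: keep the head and tail of a long output in context, and write the full text to disk under a content-derived name. The function `bound_result`, the head/tail split, and the `.log` naming scheme are assumptions for this sketch.

```python
import hashlib
from pathlib import Path


def bound_result(output: str, artifact_dir: Path, max_lines: int = 40) -> str:
    """Return output unchanged if short; otherwise head + marker + tail."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    # Store the full text outside the context, named by its content hash.
    artifact_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12] + ".log"
    (artifact_dir / name).write_text(output, encoding="utf-8")
    half = max(1, max_lines // 2)
    head, tail = lines[:half], lines[-half:]
    omitted = len(lines) - len(head) - len(tail)
    marker = f"[{omitted} lines omitted; full output: {name}]"
    # The bounded result is about max_lines long, plus the one marker line.
    return "\n".join(head + [marker] + tail)
```

Keeping both ends is a design choice: build logs and test runs often put the command echo at the top and the failure summary at the bottom.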

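Eight Questions for Evaluating a Tool

When evaluating a coding-agent tool, do not start with "can the model call it?" Ask instead:

Is the argument schema precise?
What can this tool read or mutate?
Which paths, commands, and resources are allowed?
What kind of fresh state is needed before it runs?
What policy decides between allow, ask, sandbox, and deny?
What result evidence goes back to the model?
Which durable state changes after success or failure?
How will this call be paired with its result later?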
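The last question, pairing, is easy to overlook until results start arriving out of order. A minimal sketch, assuming the harness assigns each call an opaque id (here a `uuid4` hex string) and refuses results that do not match a pending call; the `CallLedger` class and its method names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CallLedger:
    """Tracks which calls are still awaiting a result."""

    pending: dict[str, dict] = field(default_factory=dict)
    completed: list[tuple[dict, str]] = field(default_factory=list)

    def open_call(self, tool: str, args: dict) -> str:
        call_id = uuid.uuid4().hex
        self.pending[call_id] = {"id": call_id, "tool": tool, "args": args}
        return call_id

    def close_call(self, call_id: str, result: str) -> None:
        # A result must match exactly one pending call; anything else is
        # an orphan or a duplicate and is refused rather than recorded.
        call = self.pending.pop(call_id, None)
        if call is None:
            raise KeyError(f"no pending call with id {call_id}")
        self.completed.append((call, result))
```

With this in place, results may arrive in any order, and whatever is still in `pending` at the end of a turn is a call that never got an answer.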

One-Sentence Summary

The model can ask; the harness adjudicates. A tool call executes only after the harness approves it. That process is the contract.

Tools are contracts.

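Translation and Extrapolation

This article complements Peter Wang's "Building a Good Vertical Agent" (L1/L2/L3 memory tiering):

Dimension	Peter Wang	马东锡 NLP
Theme	Context is a cache	Tools are contracts
Layers	L1/L2/L3 (tiered by frequency)	parse / schema / path / policy / execute / bound / record (tiered by lifecycle)
Core tension	Accuracy vs. context size	Autonomy vs. trust boundary
Who decides	The model reads the structure	The harness decides policy
Common root	Treat context as a scarce resource	Treat tools as a trust boundary

Taken together, the two articles give a complete agent design philosophy:

Context is a scarce resource (Peter): tier context into L1/L2/L3 and compress it according to the task distribution
Tools are a trust boundary (马东锡): every tool call is a contract and must be validated, bounded, and recorded

If you are building a product-grade agent engine like AgentBase, both layerings have to be designed together. Cursor, Claude Code, and Devin are all working on this. Do it well and you have a good harness; do it poorly and you have a toy harness.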

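A concrete engineering checklist (applicable to any harness):

Every tool has an explicit trust boundary (what it reads, what it mutates, what requires approval)
Argument schemas are strictly defined with JSON Schema / Zod / Pydantic
Path validation adds an explicit check on top of the sandbox
Results are always bounded (max_tokens, max_lines, max_matches)
The fresh-baseline check is enforced before every write
Repeated failing requests are automatically retried or rejected instead of burning tokens over and over
The transcript fully records the call, the result, and the state changes
Approval prompts are small and clear (affected paths + change shape, no content dumps)

Only when all eight are in place does your tool contract really "work".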
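The checklist item on repeated failing requests can be sketched with a small guard that remembers only the most recent failure. Comparing calls by a canonical JSON key, and the class name `RepeatGuard`, are assumptions for illustration; a real harness might instead allow a bounded number of retries.

```python
import json
from typing import Optional


class RepeatGuard:
    """Flags a call identical to the immediately preceding failed call."""

    def __init__(self) -> None:
        self._last_failed: Optional[str] = None

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        # sort_keys makes the key independent of argument order.
        return json.dumps({"tool": tool, "args": args}, sort_keys=True)

    def should_reject(self, tool: str, args: dict) -> bool:
        return self._key(tool, args) == self._last_failed

    def record(self, tool: str, args: dict, ok: bool) -> None:
        # Any success clears the memory; a failure replaces it.
        self._last_failed = None if ok else self._key(tool, args)
```

The harness would call `should_reject` before execution and `record` after it, so an identical retry is stopped before it spends another turn.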