Test-Time Compute Is the Hidden Variable Behind Capability: Why Single-Number Benchmarks Are Already Distorted

TL;DR: As LLMs become more capable, benchmark performance is increasingly a function of test-time compute. In fact, we likely don't know what the capability ceiling is for modern LLMs because it's too expensive to measure. We should change LLM evaluations to account for that by measuring performance vs tokens, cost, or time.

The odd thing about GPT-5.5's launch day

On the day GPT-5.5 was released, the first reaction was skepticism. The benchmark numbers were better, but not by much.

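A few hours later, once people had had time to try it hands-on, the conclusion became clear: 5.5 is a step-change over 5.4. The classic "benchmark grid" clearly was not telling the whole story. Why?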

Plot GPT-5.5 and 5.4 with tokens on the x-axis and the answer becomes clear:

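Left plot (cyber eval 1): with each model at its own "maximum" test-time compute, 5.5 does not look much stronger than 5.4.
Right plot (cyber eval 2): once tokens / cost / latency are controlled for, 5.5 is much stronger than 5.4.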

GPT-5.5 was not evaluated under the same token budget (or dollar budget) as 5.4. Once test-time compute is controlled for, 5.5 is much stronger than 5.4.
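The difference between a headline comparison and a matched-budget comparison can be sketched in a few lines. The curves below are made-up numbers, not measurements of any real model, and `score_at_budget` is a hypothetical helper. It interpolates linearly in log-token space, which is itself an assumption about how a curve behaves between measured points.

```python
import math

def score_at_budget(curve, budget_tokens):
    """Interpolate a score at `budget_tokens`, linearly in log10(tokens).

    `curve` is a list of (tokens, score) points sorted by tokens, with at
    least two points. Returns None outside the measured range: this helper
    deliberately does not extrapolate.
    """
    if budget_tokens < curve[0][0] or budget_tokens > curve[-1][0]:
        return None
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if t0 <= budget_tokens <= t1:
            span = math.log10(t1) - math.log10(t0)
            frac = (math.log10(budget_tokens) - math.log10(t0)) / span
            return s0 + frac * (s1 - s0)
    return None

# Hypothetical (tokens, score) curves; NOT real measurements.
older_model = [(1e5, 0.10), (1e6, 0.20), (1e7, 0.30)]
newer_model = [(1e4, 0.10), (1e5, 0.22), (1e6, 0.34)]

# Headline comparison: each model at its own maximum budget.
headline_gap = newer_model[-1][1] - older_model[-1][1]

# Matched comparison: both models at the same 1M-token budget.
matched_budget = 1e6
matched_gap = (score_at_budget(newer_model, matched_budget)
               - score_at_budget(older_model, matched_budget))

print(f"gap at each model's own max budget: {headline_gap:.2f}")
print(f"gap at a matched budget of {matched_budget:.0e} tokens: {matched_gap:.2f}")
```

With these illustrative numbers the gap is 0.04 when each model runs at its own maximum budget and 0.14 at the matched budget, which is the same pattern the two plots describe.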

The plateau is disappearing

At this point people often ask: why not just use a harness that keeps pushing test-time compute until performance plateaus?

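The problem is that, empirically, the plateau sits very far out. Sometimes no plateau is observed at all within a reasonable budget. Two pieces of evidence: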

Karpathy's autoresearch experiment: performance was still climbing after several hundred experiments.
METR's cyber eval: the performance of Mythos and GPT-5.5 was still rising quickly beyond 100M tokens.

Stronger models gain more from long horizons. The plateau point gets pushed further out, and may even disappear.

What evaluation should look like

The right way to evaluate is a plot of performance vs test-time compute, with tokens, cost, or wall-clock time on the x-axis. Some benchmarks are already moving in this direction: ARC-AGI measures score vs cost.

Another reasonable option is to set an explicit token / time / cost budget and tell the model what it is. This is how human exams (SAT, IMO) are run.

Each x-axis has tradeoffs:

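Tokens are not directly comparable across models: tokenizers, speed, and per-token cost all differ.
Dollars depend on implementation details (batching, hardware utilization), and cost can be traded against latency.
Wall-clock time is not perfect either: multi-agent techniques (such as best-of-N) can scale test-time compute with almost no increase in latency.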

Still, any of these curves is more informative than a single scalar.
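A small sketch shows why the three axes can disagree. Everything here is an assumption for illustration: the `RunUsage` record, the helper names, and the per-million-token prices are hypothetical, and the best-of-N model is idealized (fully parallel samples, no selection overhead).

```python
from dataclasses import dataclass

@dataclass
class RunUsage:
    """Resource usage of one evaluation attempt (hypothetical record)."""
    input_tokens: int
    output_tokens: int
    wall_clock_s: float

def cost_usd(usage, usd_per_m_input, usd_per_m_output):
    """Convert token counts to dollars using per-million-token prices."""
    total = (usage.input_tokens * usd_per_m_input
             + usage.output_tokens * usd_per_m_output)
    return total / 1_000_000

def best_of_n_usage(single, n):
    """Usage of n independent samples run fully in parallel (idealized).

    Tokens, and therefore cost, scale by n; wall-clock time stays about
    the same because the samples do not wait on each other.
    """
    return RunUsage(single.input_tokens * n,
                    single.output_tokens * n,
                    single.wall_clock_s)

# Hypothetical single attempt and hypothetical prices.
single = RunUsage(input_tokens=2_000, output_tokens=8_000, wall_clock_s=60.0)
bo16 = best_of_n_usage(single, 16)

for label, usage in [("single", single), ("best-of-16", bo16)]:
    dollars = cost_usd(usage, usd_per_m_input=1.0, usd_per_m_output=4.0)
    tokens = usage.input_tokens + usage.output_tokens
    print(f"{label}: {tokens} tokens, ${dollars:.3f}, {usage.wall_clock_s:.0f}s")
```

On a wall-clock axis the two runs land at the same x value, while on a token or dollar axis they sit 16x apart, so the choice of axis changes which comparison looks fair.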

Safety policy: the real problem with Gemini 3 Deep Think

Before a frontier model is released, labs routinely evaluate cyber, bio, and other misuse risks. If a model crosses a capability threshold, the release may be delayed until mitigations are in place.

But if capability is a function of inference compute, at what inference budget should the safety evaluation be run?

In practice, most safety evaluations for model releases have not accounted for inference usage. The release of Gemini 3 Deep Think, and the anger it provoked, is an instructive example.

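When Gemini 3 Deep Think was released, its benchmark scores were much higher than those of earlier models. But no system card assessing its risks was published at the time. Anger from the AI safety community followed.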

Noam's take is that the people criticizing DeepMind over this release missed the deeper issue: AI labs and safety orgs do not systematically account for test-time compute when evaluating models.

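Deep Think is most likely a scaffold around other models (ones that do have system cards). Anyone outside the lab could reproduce such a scaffold. In other words, Deep Think's capabilities were already available to anyone willing to pay for Deep Think-level inference, since the scaffold is just a set of model queries stitched together. Deep Think only gave casual users a more convenient way to get there.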

What really deserves criticism is that when Gemini 3 and other models were released, their system cards did not report benchmarks as a function of test-time compute.

New problems for long-horizon evaluation

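A nation-state actor can spend $10M+ of inference on a single task. But evaluating a model routinely requires thousands to millions of rollouts, and running every rollout at that budget is not realistic.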

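Fortunately, performance appears to scale fairly predictably with inference compute. So it is possible to evaluate at a relatively low inference budget and then extrapolate, with uncertainty, to what capability might look like at a high budget.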
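One way such an extrapolation could look is sketched below. It assumes, purely for illustration, that score is linear in log10(tokens) over the range of interest; real curves may saturate or bend, and scores are bounded, which this simple fit ignores. The data points and function names are hypothetical, and the uncertainty is the standard prediction error of an ordinary least-squares fit, not a calibrated forecast.

```python
import math

def fit_log_linear(points):
    """Least-squares fit of score = intercept + slope * log10(tokens).

    `points` is a list of (tokens, score) with at least three entries at
    distinct budgets, so the residual standard deviation is defined.
    """
    xs = [math.log10(t) for t, _ in points]
    ys = [s for _, s in points]
    n = len(points)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    resid_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sigma = math.sqrt(resid_ss / (n - 2))  # residual standard deviation
    return {"slope": slope, "intercept": intercept, "sigma": sigma,
            "n": n, "x_mean": x_mean, "sxx": sxx}

def extrapolate(fit, budget_tokens):
    """Predicted score and its standard error at `budget_tokens`.

    The error term grows with distance from the measured budgets, which is
    the "with uncertainty" part of the extrapolation.
    """
    x0 = math.log10(budget_tokens)
    mean = fit["intercept"] + fit["slope"] * x0
    spread = 1 + 1 / fit["n"] + (x0 - fit["x_mean"]) ** 2 / fit["sxx"]
    return mean, fit["sigma"] * math.sqrt(spread)

# Hypothetical low-budget measurements; NOT real data.
low_budget_runs = [(1e4, 0.10), (1e5, 0.22), (1e6, 0.29), (1e7, 0.41)]
fit = fit_log_linear(low_budget_runs)

for budget in (1e7, 1e9):
    mean, se = extrapolate(fit, budget)
    print(f"{budget:.0e} tokens: predicted {mean:.2f} +/- {2 * se:.2f} (rough 2-sigma band)")
```

The band widens as the target budget moves away from the measured range, so a prediction two orders of magnitude beyond the data should be read as a coarse estimate at best.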

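But long-horizon evaluation may introduce complexity that extrapolation from small budgets cannot cover. For example, evaluating AI agent misalignment over a 1-year horizon may only be trustworthy if the agent is actually run for a year.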

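AI labs may soon find themselves in a strange position: an agent's operating horizon exceeds the development cycle of a new model. At that point, finishing an evaluation that covers a new model's maximum lifetime before its release becomes impossible, unless the release is delayed.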

Three concrete recommendations for the AI community

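AI labs should publish benchmark performance with tokens / cost / time on the x-axis when releasing a new model. At a minimum, they should report the inference budget used to obtain each scalar benchmark result.
Benchmarks should track inference usage on their leaderboards, or set explicit token / cost / time budgets. Many benchmarks are already moving in this direction, but it is not yet standard practice.
Preparedness Frameworks and Responsible Scaling Policies should explicitly factor inference compute into the determination of whether a model crosses a safety threshold. In addition, evaluations should estimate capability at multiple inference budgets, including extrapolation from small-budget runs (with stated uncertainty).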
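The third recommendation can be illustrated with a toy threshold check. The curve, the threshold value, and the helper `min_budget_crossing` are all hypothetical; no real framework's thresholds or procedures are represented here.

```python
def min_budget_crossing(curve, threshold):
    """Smallest measured budget whose score meets the threshold, else None.

    `curve` is a list of (tokens, score) points. Only measured points are
    checked; nothing is interpolated or extrapolated.
    """
    for tokens, score in sorted(curve):
        if score >= threshold:
            return tokens
    return None

# Hypothetical capability curve on one risk eval; NOT real data.
risk_eval_curve = [(1e5, 0.10), (1e6, 0.30), (1e7, 0.60)]
THRESHOLD = 0.50  # hypothetical policy threshold

# Verdict if the model is judged only at the lowest budget.
single_budget_verdict = risk_eval_curve[0][1] >= THRESHOLD

# Verdict if every measured budget is considered.
crossing = min_budget_crossing(risk_eval_curve, THRESHOLD)

print(f"crosses threshold at lowest budget: {single_budget_verdict}")
print(f"smallest measured budget that crosses: {crossing}")
```

In this made-up case the single-budget check reports no crossing while the multi-budget check finds one at 10M tokens, which is the gap a budget-aware policy is meant to close.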

Closing

If you have followed me for a while, this whole post may look like nothing new. We have known since o1 was released in September 2024 that reasoning model performance scales with inference compute.

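Yet nearly two years later, frontier AI labs still publish single-number benchmark results when releasing new models; AI safety orgs are still surprised when a scaffold gets better performance with a 100x inference budget; and Preparedness Frameworks and RSPs still often ignore inference compute when determining whether a model has reached a critical capability threshold.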

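The latest generation of models is better than ever at making use of test-time compute, pushing the performance plateau further out. If this trend continues (and I fully expect it to), benchmark scores that do not account for inference compute will become less informative with every generation.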

It is time to treat inference budget as a first-class part of capability measurement and safety policy.

Where this fits in the 2026 AI evaluation paradigm

Test-time compute should be a first-order dimension on par with training compute. Reading a benchmark without controlling for inference budget means looking at the wrong object.

The best evaluation in 2026 looks like this: a performance curve with tokens / cost / time on the x-axis, not a scalar.

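The most dangerous evaluation mistake in 2026 is using "this model's score on a single prompt run" to decide whether it crosses a capability threshold. Scaffolds like Deep Think have put the cost of that mistake on the table.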