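CLAUDE.md does not lie on purpose. It lies the way a stale config file lies: rules written months ago, for a context that no longer exists, that Claude still follows in every session.
The author found this by running Dreaming, a new Anthropic feature, locally. Dreaming reads up to 100 past sessions and rewrites the memory file. The official version sits behind a beta header on Managed Agents. The author's version is 80 lines of Python.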
The data
- 90 days of Claude Code sessions
- 6M tokens
- 100 sessions
- 11 minutes
- $4.20
The result: a 38-line file that cut three quarters of CLAUDE.md and surfaced four patterns that had never been written down.
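What Dreaming is
Anthropic's official description: Dreaming is an asynchronous pass over an existing memory store. The agent reads the transcripts of up to 100 previous sessions, finds patterns, and outputs a new memory store. The original file is preserved; you review the new one and decide whether to keep or discard it.
Three key numbers:
- Up to 100 sessions per dream pass
- Run time of "tens of minutes"
- Harvey reports a 6x increase in task completion rate after running Dreaming on its drafting agent
Six times. Not 14%. Not 41%. Six times.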
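Replicating it locally
Managed Agents carries enterprise pricing. Dreaming itself runs on standard API tokens, but the surrounding platform is not designed for individual users.
The raw material is already on disk: ~/.claude/projects/<project>/, where each session is a JSONL file, alongside a memory subdirectory. Nothing to upload, nothing to migrate.
The question is not "should I wait" but "what does the Managed Agents version have that a Python script and a good rubric cannot reproduce?"
The answer, 90 minutes later: nothing. For a single-user workflow, the script below is enough.
The script: four phases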
# dream.py: local Dreaming replica
# Reads ~/.claude/projects/*/sessions/*.jsonl
# Outputs ~/.claude/memory/dream_output.md
import os, json, glob
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()
SESSION_DIR = Path.home() / ".claude" / "projects"
OUTPUT = Path.home() / ".claude" / "memory" / "dream_output.md"

# Phase 1: Orient. Read existing memory if any.
existing = ""
existing_path = Path.home() / ".claude" / "memory" / "MEMORY.md"
if existing_path.exists():
    existing = existing_path.read_text()

# Phase 2: Gather. Pull the last 100 sessions.
sessions = sorted(
    glob.glob(str(SESSION_DIR / "*" / "sessions" / "*.jsonl")),
    key=os.path.getmtime,
    reverse=True
)[:100]
transcripts = []
for s in sessions:
    with open(s) as f:
        msgs = [json.loads(line) for line in f if line.strip()]
    clean = [m for m in msgs if m.get("type") in ("user", "assistant")]
    transcripts.append("\n".join(json.dumps(m) for m in clean))

# Phase 3: Dream. Single API call with rubric prompt.
rubric = Path(__file__).parent / "rubric.md"
prompt = rubric.read_text() + "\n\n" + \
    f"Existing memory:\n{existing}\n\n" + \
    f"100 sessions follow:\n\n" + "\n---\n".join(transcripts)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8000,
    messages=[{"role": "user", "content": prompt}]
)

# Phase 4: Output. Write the new memory file.
OUTPUT.parent.mkdir(parents=True, exist_ok=True)
OUTPUT.write_text(response.content[0].text)
print(f"Dream complete. Output: {OUTPUT}")
Cost of one dream pass: under $0.80.
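Because the script sends every transcript in a single request, the combined prompt has to fit whatever context window the chosen model offers, and the cost scales with the input size. The following is a minimal pre-flight sketch, not part of the author's script. It assumes a rough 4-characters-per-token heuristic (an approximation, not the real tokenizer), and the helper names `estimate_tokens`, `pack_within_budget`, and `estimate_cost` are hypothetical. The budget and the price are passed in by the caller rather than hard-coded.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4 chars/token ratio is an assumption."""
    return int(len(text) / chars_per_token)


def pack_within_budget(transcripts: list[str], budget_tokens: int) -> list[str]:
    """Keep transcripts in their given order (newest first) until the budget is used up."""
    kept, used = [], 0
    for t in transcripts:
        size = estimate_tokens(t)
        if used + size > budget_tokens:
            break
        kept.append(t)
        used += size
    return kept


def estimate_cost(input_tokens: int, usd_per_million_input: float) -> float:
    """Input-side cost only; the price is supplied by the caller, not assumed here."""
    return input_tokens / 1_000_000 * usd_per_million_input


if __name__ == "__main__":
    # Synthetic stand-ins for transcripts: roughly 1000, 2000, and 1000 tokens.
    sample = ["a" * 4000, "b" * 8000, "c" * 4000]
    kept = pack_within_budget(sample, budget_tokens=3000)
    print(len(kept), "of", len(sample), "transcripts fit")
```

In dream.py this would sit between Phase 2 and Phase 3, trimming `transcripts` before the prompt is assembled. Dropping the oldest sessions first is one possible policy; summarizing them in a separate pass is another.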
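The key: the rubric
The first run had no rubric, just "summarize what you learned." The output was useless: "the user values efficiency," "the user prefers clear communication." Statements like these would be true of anyone.
The second run used this 12-line rubric and produced the 38-line file in the next section.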
# Dream pass rubric
You are doing a forensic pass over 100 of my Claude Code sessions.
Your job is not to summarize what I asked. It is to find patterns I would not write down myself.
Output a memory file with three sections:
## Workflow patterns observed
- Cite frequency: "[high-confidence, 50+ sessions]", "[medium]", "[low]"
- One line per pattern. No prose.
- Behavioral observations only. No declared preferences.
## Decisions and reasoning
- Capture architectural and stylistic choices I made and rejected
- Include the date of the decision if visible in transcript
- Note what was tried and rejected, not just what I picked
## Patterns to NOT re-suggest
- Things I've rejected multiple times across sessions
- Format: brief reason, no defense of why I should reconsider
Rules:
- Maximum 40 lines total. Trim anything that doesn't earn its line.
- If a "preference" appears once or contradicts another, delete it.
- Cite session count, not session names.
- One-off corrections are NOT preferences. Recurring patterns are.
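The rubric is the biggest takeaway of this article. Without it, Dreaming produces a generic, LinkedIn-style summary. With it, the output is useful.
A 12-line rubric. A 38-line output. Behind both, 6M tokens of session data.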
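The rubric's hard rules (three sections, at most 40 lines) can be checked mechanically before a human reads the file. Here is a minimal sketch of such a check. The function name `check_dream_output` is hypothetical, the section keywords are loose matches chosen because the model may reword headings, and counting only non-empty lines is an assumption about how the 40-line limit should be read.

```python
# Loose keywords for the rubric's three sections (an assumption; headings may be reworded).
REQUIRED_SECTIONS = ("workflow patterns", "decisions", "not re-suggest")
MAX_LINES = 40


def check_dream_output(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Return a list of rubric violations; an empty list means the file passes."""
    problems = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} non-empty lines, limit is {max_lines}")
    headings = [ln.lower() for ln in lines if ln.startswith("## ")]
    for key in REQUIRED_SECTIONS:
        if not any(key in h for h in headings):
            problems.append(f"missing section containing '{key}'")
    return problems


if __name__ == "__main__":
    sample = (
        "## Workflow patterns observed\n- a\n"
        "## Decisions and reasoning\n- b\n"
        "## Patterns to NOT re-suggest\n- c\n"
    )
    print(check_dream_output(sample) or "passes")
```

A failing check is a reason to rerun the pass, not to hand-edit the output, since the point of the rubric is that the model does the trimming.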
The 38-line output: behavior, not preferences
## Workflow patterns observed across 100 sessions
[high-confidence, 50+ sessions]
- Asks for "review" or "feedback" but accepts approval 73% of the time without revision
- Switches between TypeScript and Python mid-conversation; rarely re-states stack context
- Treats /clear as a checkpoint not a reset — expects context retention after /clear
- "Quick fix" requests average 12 turns to resolution; flag at turn 4 to redirect
- Corrects prose output 8.2x more than code output
[medium-confidence, 20-50 sessions]
- Prefers diffs over rewrites for changes >40 lines
- Asks "what did you change" after edits; pre-emptive summary saves a turn
- Uses Polymarket-related vocabulary; codebase context is trading infrastructure
- Discards 3-step explanations; keeps single-line answers
- Will ask for shorter output 3-5 messages in; default shorter from start
[low-confidence but worth keeping, <20 sessions]
- Sometimes builds in restricted networks (Hetzner / Riga / proxy hops); test commands accordingly
- Prefers ALL_CAPS for env var documentation
- Em-dashes flagged as "AI-sounding"; minimize unless rhythmic
## Decisions made and the reasoning behind them
[architectural]
- Hetzner Falkenstein chosen 2026-02: German jurisdiction, $4.51/mo, low latency to Polymarket EU
- 4-process screen pipeline (scanner -> brain -> executor -> exit_monitor): debugged once, never refactor
- Sonnet 4.6 for scoring, Opus 4.7 for full theses. Cost split, not capability split.
[style]
- Articles run 2,500-3,000 words target; drop below if filler
- Banner format locked: #EFEAE0 + serif + monospace strip
- "What didn't work" section mandatory for credibility load
## Patterns Claude should NOT re-suggest
- Switching from screen to systemd (rejected 4 times, complexity not worth it)
- Kubernetes deployment (rejected, single-VPS architecture is intentional)
- Switching trading from Polymarket to Kalshi as default (Polymarket-first is explicit)
- Adding Grafana / monitoring stack (rejected, logs + cron alerts are enough)
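38 lines. Not hundreds. Not "comprehensive." Not "About me: I am a developer who likes clean code."
This is a forensic summary of the patterns Claude observed across 6 million tokens of real work. Half of it is uncomfortable to read, because it records things that were never written down.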
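Since the dream pass writes a new file and leaves the existing memory untouched, the review step comes down to comparing the two. A minimal sketch using the standard library's `difflib` is below. The function name `review_diff` is hypothetical, and the paths reuse the ones from dream.py on the assumption that both files live in the same directory.

```python
import difflib
from pathlib import Path


def review_diff(old_text: str, new_text: str) -> str:
    """Unified diff between the current memory file and the dream output."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="MEMORY.md",
        tofile="dream_output.md",
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    memory_dir = Path.home() / ".claude" / "memory"  # same location dream.py uses
    old_path = memory_dir / "MEMORY.md"
    new_path = memory_dir / "dream_output.md"
    old = old_path.read_text() if old_path.exists() else ""
    new = new_path.read_text() if new_path.exists() else ""
    print(review_diff(old, new) or "No differences.")
```

Lines prefixed with `-` are rules the dream pass dropped, and lines prefixed with `+` are patterns it added, which makes the keep-or-discard decision a line-by-line one.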
Four findings that sting
Finding one: "review" means "approve"
Asks for "review" or "feedback" but accepts approval 73% of the time without revision
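The author has written this prompt hundreds of times: "Review this and tell me what's wrong."
Claude lists a pile of problems. The author says "thanks, let's keep moving" and changes none of them.
Claude's reading: this is not a request for review, it is a request for permission. 73% of these requests end with no edits.
Once seen, it cannot be unseen. The fix: ask for a binary "approval or block" decision, or actually ask for a rewrite. Stop pretending the middle option is the one you want.
Finding two: switching stacks without declaring the context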
Switches between TypeScript and Python mid-conversation; rarely re-states stack context
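A session starts in the trading bot (Python), moves to a question about the article builder (TypeScript), then returns to the bot. Claude does not know a switch happened. Half of the broken output in the session history comes from Claude applying TS habits to Python files, because the author never said the location had changed.
The fix: add one line to CLAUDE.md: "Always re-confirm language at the start of any new sub-task."
Why this matters: Claude flags this problem only in the dream pass, not in real time. Dreaming catches drift that the model cannot catch mid-session, because in-session it is busy answering and has no room to audit.
Finding three: quick fixes are never quick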
"Quick fix" requests average 12 turns to resolution; flag at turn 4 to redirect
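The author knew this in the abstract. What was not known was the average: 12 turns. The dream output is specific: by turn 4, the odds of resolving the issue without a restart are below 30%. The signal Claude wants to send is "we are past the quick zone; do you want to restart?"
The author built a tiny hook for this: any prompt containing "quick fix" or "small change" starts a counter. At turn 4 it injects: "This thread has 4 turns of debugging. Restart or commit to a longer fix?" It is annoying. It also cut the average turn count in half.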
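The author's hook itself is not shown, so what follows is only a sketch of the counting logic it describes. The class name `QuickFixCounter` is hypothetical, and two details are assumptions: the count starts at the prompt that contains the trigger phrase, and the nudge fires exactly once. Wiring this into an actual hook mechanism is left out.

```python
from dataclasses import dataclass
from typing import Optional

TRIGGERS = ("quick fix", "small change")  # phrases named in the article
NUDGE_TURN = 4
NUDGE = "This thread has 4 turns of debugging. Restart or commit to a longer fix?"


@dataclass
class QuickFixCounter:
    """Counts turns once a trigger phrase appears and returns the nudge at turn 4."""
    active: bool = False
    turns: int = 0

    def on_prompt(self, prompt: str) -> Optional[str]:
        # Start counting on the first prompt that contains a trigger phrase.
        if not self.active and any(t in prompt.lower() for t in TRIGGERS):
            self.active = True
        if not self.active:
            return None
        self.turns += 1
        return NUDGE if self.turns == NUDGE_TURN else None


if __name__ == "__main__":
    counter = QuickFixCounter()
    for p in ["Quick fix: rename this flag", "still failing", "no", "same error"]:
        message = counter.on_prompt(p)
        if message:
            print(message)
```

One counter per session is the simplest scoping. Resetting it after a restart or after the nudge is accepted would be a natural extension, but the article does not say how the author handled that.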
Finding four: prose gets corrected 8x more than code
Corrects prose output 8.2x more than code output
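Checked three times. The number is right.
Over 90 days, the author corrected Claude's prose output 8.2 times as often as its code output. That ratio is wrong from both directions:
- Either the code is bad and Claude is silently absorbing the errors
- Or the prose taste is over-calibrated, correcting things that are already 90% fine
Probably both, but the second matters more. 80% of the editing effort goes to things that are already at 90%, and 20% goes to things that may only be at 60%. The ratio is upside down.
The core insight
The problem with CLAUDE.md is not that it is badly written. It is that it was written without knowing it was lying. Rules from months ago, for a context that no longer exists, that Claude still follows.
The value of Dreaming is not "summarizing," it is auditing. The in-session model is busy answering and has no room to audit. Dreaming is after-the-fact forensics: it finds drift, contradictions, and behavior patterns that were never recorded.
A 12-line rubric determines the output quality. This is prompt engineering in its final form: not writing a prompt for the model, but writing a prompt for the "auditing model."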