73% of My CLAUDE.md Was Lying: Finding the Truth with Dreaming

Not on purpose. It lied the way old config files lie: rules written months ago, for context that no longer exists, that Claude still follows in every session.

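The author discovered this by running a local version of Dreaming, Anthropic's new feature. Dreaming reads up to 100 past sessions and rewrites the memory files. The official version lives inside Managed Agents, behind a beta header. The author's version is 80 lines of Python.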

The numbers

90 days of Claude Code sessions
6M tokens
100 sessions
11 minutes
$4.20

The result: a 38-line file that deleted three quarters of CLAUDE.md and surfaced four patterns that had never been written down.

What Dreaming is

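Anthropic's official description: Dreaming is an asynchronous pass over an existing memory store. The agent reads the transcripts of up to 100 previous sessions, finds patterns, and outputs a new memory store. The original files are preserved; you review the new store and decide whether to keep or discard it.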
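The keep-or-discard step is easy to approximate locally with a plain diff. A minimal sketch using Python's standard `difflib`, assuming the current memory is in `MEMORY.md` and the dream output in `dream_output.md` in the same directory (the file names the local script later in this piece uses). `review_dream` and `accept_dream` are hypothetical helpers, not part of any Anthropic API.

```python
import difflib
from pathlib import Path


def review_dream(old_text: str, new_text: str) -> str:
    """Return a unified diff from the existing memory to the dream output."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="MEMORY.md",
        tofile="dream_output.md",
    )
    return "".join(diff)


def accept_dream(memory_dir: Path) -> None:
    """Promote dream_output.md to MEMORY.md, keeping the old file as MEMORY.md.bak."""
    current = memory_dir / "MEMORY.md"
    dream = memory_dir / "dream_output.md"
    if current.exists():
        # Preserve the original instead of overwriting it.
        current.replace(memory_dir / "MEMORY.md.bak")
    dream.replace(current)
```

Print `review_dream(old_text, new_text)`, read the diff, and call `accept_dream(memory_dir)` only if the new file earns its place. The `.bak` copy keeps the original recoverable, in the same spirit as the official feature leaving the old store intact.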

Three key numbers:

Up to 100 sessions per dream pass
A runtime of "tens of minutes"
Harvey reports a 6x increase in task completion rate after running Dreaming on its drafting agent

Six times. Not 14%, not 41%. Six times.

Replicating it locally

Managed Agents has enterprise pricing. Dreaming itself runs on standard API tokens, but the platform around it isn't built for individual users.

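The raw material is already on disk in ~/.claude/projects/<project>/: each session is a JSONL file, and there is a memory subdirectory alongside them. Nothing to upload, nothing to migrate.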

The real question isn't "should I wait?" It's "what does the Managed Agents version have that a Python script and a good rubric can't reproduce?"

After 90 minutes, the answer: nothing. For a single-user workflow, the script below is enough.

The script: four phases

# dream.py — local Dreaming replica
# Reads ~/.claude/projects/*/*.jsonl (and */sessions/*.jsonl if present)
# Outputs ~/.claude/memory/dream_output.md

import os, json, glob
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()
SESSION_DIR = Path.home() / ".claude" / "projects"
OUTPUT = Path.home() / ".claude" / "memory" / "dream_output.md"

# Phase 1 — Orient. Read existing memory if any.
existing = ""
existing_path = Path.home() / ".claude" / "memory" / "MEMORY.md"
if existing_path.exists():
    existing = existing_path.read_text()

# Phase 2 — Gather. Pull the last 100 sessions.
# Session files sit directly in each project directory; the sessions/
# subdirectory pattern is also checked in case a layout uses one.
patterns = ("*/*.jsonl", "*/sessions/*.jsonl")
sessions = sorted(
    {p for pat in patterns for p in glob.glob(str(SESSION_DIR / pat))},
    key=os.path.getmtime,
    reverse=True
)[:100]

transcripts = []
for s in sessions:
    with open(s, encoding="utf-8") as f:
        msgs = [json.loads(line) for line in f if line.strip()]
        # Keep only conversation turns; drop any other record types.
        clean = [m for m in msgs if m.get("type") in ("user", "assistant")]
        transcripts.append("\n".join(json.dumps(m) for m in clean))

# Phase 3 — Dream. Single API call with rubric prompt.
# Assumes the prompt fits the model's context window; larger corpora need batching.
rubric = Path(__file__).parent / "rubric.md"
prompt = rubric.read_text() + "\n\n" + \
         f"Existing memory:\n{existing}\n\n" + \
         f"{len(sessions)} sessions follow:\n\n" + "\n---\n".join(transcripts)

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8000,
    messages=[{"role": "user", "content": prompt}]
)

# Phase 4 — Output. Write the new memory file.
OUTPUT.parent.mkdir(parents=True, exist_ok=True)
OUTPUT.write_text(response.content[0].text)
print(f"Dream complete. Output: {OUTPUT}")
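One caveat on Phase 3: at roughly 6M tokens, the concatenated prompt is several times larger than even a 1M-token context window, so a single request built this way would be rejected. A common workaround, which is an assumption here and not something the article describes, is map-reduce: pack the transcripts into batches under a token budget, run the rubric over each batch, then merge the partial memory files in one final call. The sketch shows only the packing step. The 4-characters-per-token estimate and the 150K budget are rough assumptions; Anthropic's token-counting endpoint would give exact numbers. `pack_batches` is a hypothetical helper.

```python
from typing import Iterable, List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an assumption, not exact)."""
    return max(1, len(text) // 4)


def pack_batches(transcripts: Iterable[str], budget: int = 150_000) -> List[List[str]]:
    """Greedily group transcripts so each batch stays within the token budget.

    A single transcript larger than the budget is truncated to fit, which
    keeps its opening turns and drops the tail.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for t in transcripts:
        cost = estimate_tokens(t)
        if cost > budget:
            t = t[: budget * 4]  # crude truncation to roughly the budget
            cost = estimate_tokens(t)
        if current and used + cost > budget:
            batches.append(current)  # close the full batch and start a new one
            current, used = [], 0
        current.append(t)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch would then go through the same rubric prompt, and a last call would merge the partial outputs under the rubric's 40-line limit.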

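Cost of one dream pass: $4.20 in API tokens. 6M input tokens is a lot, but the output is small (a memory file is a few thousand tokens at most). With caching, another pass within the hour costs under $0.80.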
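The script above sets no cache markers, and Anthropic's prompt caching only applies to prompt prefixes marked with `cache_control`; cache reads are then billed at a fraction of the normal input price. A minimal sketch of how the request could be restructured for caching, assuming the rubric moves into the system prompt and the transcript corpus becomes one cached user block. The `"ttl": "1h"` field requests the one-hour cache duration; depending on your API and SDK version it may need a beta header, so check it against the docs. `build_cached_request` and the final instruction string are hypothetical.

```python
from typing import Any, Dict, List


def build_cached_request(rubric: str, existing: str, transcripts: List[str],
                         model: str = "claude-opus-4-7",
                         max_tokens: int = 8000) -> Dict[str, Any]:
    """Build keyword arguments for client.messages.create with a cacheable prefix."""
    corpus = (
        f"Existing memory:\n{existing}\n\n"
        f"{len(transcripts)} sessions follow:\n\n" + "\n---\n".join(transcripts)
    )
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The rubric rarely changes, so it goes first, in the system prompt.
        "system": [{"type": "text", "text": rubric}],
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": corpus,
                    # Cache breakpoint: the prefix up to and including this block is cached.
                    "cache_control": {"type": "ephemeral", "ttl": "1h"},
                },
                # Short, uncached instruction after the breakpoint (hypothetical wording).
                {"type": "text", "text": "Write the memory file now, following the rubric."},
            ],
        }],
    }
```

Usage would be `client.messages.create(**build_cached_request(rubric_text, existing, transcripts))`. Caching matches on an identical prefix, so a rerun only hits the cache if the rubric, existing memory, and session list are unchanged; one new session among the last 100 invalidates it.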

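The key: the rubric

The first run had no rubric, just "summarize what you've learned." The output was useless: "the user values efficiency," "the user prefers clear communication." True of anyone.

The second run used this 12-line rubric and produced the 38-line file shown in the next section.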

# Dream pass rubric

You are doing a forensic pass over 100 of my Claude Code sessions.
Your job is not to summarize what I asked. It is to find patterns I would not write down myself.

Output a memory file with three sections:

## Workflow patterns observed
- Cite frequency: "[high-confidence, 50+ sessions]", "[medium]", "[low]"
- One line per pattern. No prose.
- Behavioral observations only. No declared preferences.

## Decisions and reasoning
- Capture architectural and stylistic choices I made and rejected
- Include the date of the decision if visible in transcript
- Note what was tried and rejected, not just what I picked

## Patterns to NOT re-suggest
- Things I've rejected multiple times across sessions
- Format: brief reason, no defense of why I should reconsider

Rules:
- Maximum 40 lines total. Trim anything that doesn't earn its line.
- If a "preference" appears once or contradicts another, delete it.
- Cite session count, not session names.
- One-off corrections are NOT preferences. Recurring patterns are.
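Because several of the rubric's rules are mechanical, they can be checked before you accept a dream output. A small sketch of such a check: it verifies only the 40-line cap (counting non-blank lines, an assumption) and the presence of the three sections. Sections are matched by keyword rather than exact title because the sample output below words its headings differently from the rubric. `check_dream_output` is a hypothetical helper.

```python
from typing import List

# Keywords assumed to identify the rubric's three sections (matched case-insensitively).
REQUIRED_SECTIONS = ("workflow patterns", "decisions", "not re-suggest")
MAX_LINES = 40


def check_dream_output(text: str) -> List[str]:
    """Return a list of rubric violations; an empty list means the output passes."""
    problems: List[str] = []
    lines = [ln for ln in text.splitlines() if ln.strip()]  # blank lines are not counted
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} non-blank lines, rubric allows {MAX_LINES}")
    headings = [ln.lower() for ln in lines if ln.startswith("## ")]
    for key in REQUIRED_SECTIONS:
        if not any(key in h for h in headings):
            problems.append(f"missing section: {key}")
    return problems
```

An empty result does not mean the output is good, only that it is shaped the way the rubric asks; the frequency tags and the "no declared preferences" rule still need a human read.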

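The rubric is the biggest takeaway of this piece. Without it, Dreaming produces generic, LinkedIn-style summaries. With it, the output is actually useful.

A 12-line rubric. 38 lines of output. 6M tokens of session data behind them.

The 38-line output: behaviors, not preferences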

## Workflow patterns observed across 100 sessions

[high-confidence, 50+ sessions]
- Asks for "review" or "feedback" but accepts approval 73% of the time without revision
- Switches between TypeScript and Python mid-conversation; rarely re-states stack context
- Treats /clear as a checkpoint not a reset — expects context retention after /clear
- "Quick fix" requests average 12 turns to resolution; flag at turn 4 to redirect
- Corrects prose output 8.2x more than code output

[medium-confidence, 20-50 sessions]
- Prefers diffs over rewrites for changes >40 lines
- Asks "what did you change" after edits; pre-emptive summary saves a turn
- Uses Polymarket-related vocabulary; codebase context is trading infrastructure
- Discards 3-step explanations; keeps single-line answers
- Will ask for shorter output 3-5 messages in; default shorter from start

[low-confidence but worth keeping, <20 sessions]
- Sometimes builds in restricted networks (Hetzner / Riga / proxy hops); test commands accordingly
- Prefers ALL_CAPS for env var documentation
- Em-dashes flagged as "AI-sounding"; minimize unless rhythmic

## Decisions made and the reasoning behind them

[architectural]
- Hetzner Falkenstein chosen 2026-02: German jurisdiction, $4.51/mo, low latency to Polymarket EU
- 4-process screen pipeline (scanner -> brain -> executor -> exit_monitor): debugged once, never refactor
- Sonnet 4.6 for scoring, Opus 4.7 for full theses. Cost split, not capability split.

[style]
- Articles run 2,500-3,000 words target; drop below if filler
- Banner format locked: #EFEAE0 + serif + monospace strip
- "What didn't work" section mandatory for credibility load

## Patterns Claude should NOT re-suggest

- Switching from screen to systemd (rejected 4 times, complexity not worth it)
- Kubernetes deployment (rejected, single-VPS architecture is intentional)
- Switching trading from Polymarket to Kalshi as default (Polymarket-first is explicit)
- Adding Grafana / monitoring stack (rejected, logs + cron alerts are enough)

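38 lines. Not hundreds. Not "comprehensive." Not "About me: I'm a developer who likes clean code."

This is a forensic summary of the patterns Claude observed across 6 million tokens of real work. Half of it is uncomfortable to read, because it records things no one ever wrote down.

Four findings that sting

Finding one: "review" means "approve"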

Asks for "review" or "feedback" but accepts approval 73% of the time without revision

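The author had written this prompt hundreds of times: "Review this and tell me what's wrong."

Claude would list a set of problems. The author would reply "thanks, let's keep moving" and fix none of them.

Claude's reading: this isn't a request for review; it's a request for permission. 73% of these requests ended with no edits.

Once you see it, you can't unsee it. The fix: ask for a binary "approve or block" decision, or actually ask for a rewrite. Stop pretending the middle option is what you want.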
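A ratio like this can be estimated from the session files directly; the article does not say how Claude arrived at 73%. The sketch assumes a record layout like Claude Code's JSONL: each record has a `type` of `user` or `assistant`, `message.content` is either a string or a list of blocks, and file edits appear as `tool_use` blocks named `Edit`, `MultiEdit`, or `Write`. It also assumes that a user record with plain-string content is a typed prompt, while tool results arrive as block lists. Check all of this against your own files. `review_without_edit_ratio` is hypothetical: it counts prompts mentioning "review" or "feedback" that are followed by no edit before the next prompt.

```python
from typing import Any, Dict, List, Optional

EDIT_TOOLS = {"Edit", "MultiEdit", "Write"}  # assumed names of file-editing tools
REVIEW_WORDS = ("review", "feedback")


def _is_prompt(rec: Dict[str, Any]) -> bool:
    # Assumption: typed prompts carry plain-string content; tool results carry block lists.
    return rec.get("type") == "user" and isinstance(rec.get("message", {}).get("content"), str)


def _has_edit(rec: Dict[str, Any]) -> bool:
    content = rec.get("message", {}).get("content")
    if rec.get("type") != "assistant" or not isinstance(content, list):
        return False
    return any(b.get("type") == "tool_use" and b.get("name") in EDIT_TOOLS for b in content)


def review_without_edit_ratio(records: List[Dict[str, Any]]) -> Optional[float]:
    """Share of review/feedback prompts followed by no edit before the next prompt."""
    total = unedited = 0
    in_review = edited = False
    for rec in records:
        if _is_prompt(rec):
            if in_review:  # a new prompt closes the previous review window
                total += 1
                if not edited:
                    unedited += 1
            text = rec["message"]["content"].lower()
            in_review = any(w in text for w in REVIEW_WORDS)
            edited = False
        elif in_review and _has_edit(rec):
            edited = True
    if in_review:  # the session ended inside a review window
        total += 1
        if not edited:
            unedited += 1
    return unedited / total if total else None
```

Run it per session and aggregate the counts; keyword matching will catch some false positives ("review the logs"), so treat the number as a rough signal.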

Finding two: switching stacks without restating context

Switches between TypeScript and Python mid-conversation; rarely re-states stack context

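A session starts in the trading bot (Python), moves to a question about the article builder (TypeScript), then goes back to the bot. Claude doesn't know the switch happened. Half of the broken output in the session history came from Claude applying TypeScript habits to Python files, because the author never said the work had moved.

The fix: add one line to CLAUDE.md: "Always re-confirm language at the start of any new sub-task."

Why it matters: Claude flagged this only in the dream pass, never in real time. Dreaming catches drift the model misses mid-session, because in-session it is busy answering and has no room to audit.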

Finding three: quick fixes are never quick

"Quick fix" requests average 12 turns to resolution; flag at turn 4 to redirect

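The author knew this in the abstract, just not that the average was 12 turns. The dream output was specific: by turn 4, the odds of resolving the problem without a restart fall below 30%. The signal Claude wanted to send was "we're past the quick zone; do you want to restart?"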

The author built a tiny hook for this: any prompt containing "quick fix" or "small change" starts a counter. At turn 4 it injects: "This thread has 4 turns of debugging. Restart or commit to a longer fix?" Annoying. It also cut the average turn count in half.
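The article doesn't show the hook itself. A minimal sketch of how it could work as a Claude Code `UserPromptSubmit` hook, assuming the hook receives the event as JSON on stdin with `session_id` and `prompt` fields, and that text printed to stdout with exit code 0 is added to Claude's context; verify both against the hooks documentation for your version. Each submitted prompt counts as one turn. The state-file location, the `step` helper, and the `--hook` flag are hypothetical.

```python
import json
import sys
from pathlib import Path
from typing import Dict, Optional, Tuple

TRIGGERS = ("quick fix", "small change")
LIMIT = 4
NUDGE = f"This thread has {LIMIT} turns of debugging. Restart or commit to a longer fix?"
# Hypothetical location for the per-session counters.
STATE_FILE = Path.home() / ".claude" / "quickfix_state.json"


def step(state: Dict[str, int], session_id: str,
         prompt: str) -> Tuple[Dict[str, int], Optional[str]]:
    """Advance the per-session turn counter; return the new state and an optional nudge."""
    state = dict(state)  # do not mutate the caller's dict
    if any(t in prompt.lower() for t in TRIGGERS):
        state[session_id] = 1  # a trigger phrase (re)starts the count at turn 1
    elif session_id in state:
        state[session_id] += 1  # every later prompt in the session is one more turn
    else:
        return state, None  # nothing is being tracked for this session
    nudge = NUDGE if state[session_id] == LIMIT else None
    return state, nudge


def main() -> None:
    """Hook entry point: read the event from stdin, update state, maybe print the nudge."""
    event = json.load(sys.stdin)
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state, nudge = step(state, event["session_id"], event.get("prompt", ""))
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state))
    if nudge:
        # Assumed behavior: stdout from a UserPromptSubmit hook is added to Claude's context.
        print(nudge)


if __name__ == "__main__" and "--hook" in sys.argv[1:]:
    main()
```

Registered as a `UserPromptSubmit` command such as `python3 /path/to/quickfix_hook.py --hook` in your Claude Code settings, it fires once per prompt. Keeping the counting logic in the pure `step` function makes it testable without a live session.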

Finding four: prose gets corrected 8x more than code

Corrects prose output 8.2x more than code output

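The author checked it three times. The number is right.

Over 90 days, the author corrected Claude's prose output 8.2 times as often as its code output. Whichever way you read it, that ratio is wrong:

Either the code is bad and Claude is silently absorbing the errors
Or the taste in prose is overcalibrated, correcting things that were already 90% there

Probably both, but the second matters more. 80% of the editing effort goes to work that is already at 90%, and 20% to work that may only be at 60%. The ratio is upside down.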

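The core insight

The problem with CLAUDE.md isn't that it was badly written. It's that there was no way to know, at the time of writing, that it would end up lying: rules from months ago, for context that no longer exists, still followed by Claude.

The value of Dreaming isn't summarization; it's auditing. The in-session model is busy answering and has no time to audit. Dreaming is after-the-fact forensics: it surfaces drift, contradictions, and behavior patterns that were never recorded.

The 12-line rubric determines output quality. This is prompt engineering in its final form: not writing a prompt for the model, but writing a prompt for the "auditing model."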

The numbers