The treatment beats bare Claude Code on the tasks the field can't reliably solve
Higher resolution, half the wall-clock time, on inputs the leaderboard has barely touched.
Same Claude Code agent, same SWE-bench harness, same tool budget — the only thing we change is the prompt construction. With Hypernym-compressed context the model converges faster, more predictably, and on pytest-dev/pytest-5787 resolves a task that bare Claude Code did not solve in five attempts.
Four hard tasks with real runs in both arms. Each card shows mean wall-clock (μ), standard deviation (σ), and the full range across all completed runs. Quick failures (0-hop or sub-30s runs) are excluded so the times reflect actual attempts. Resolution is from harness_runs.db.
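
Those per-card numbers can be recomputed straight from the database. A minimal Python sketch, assuming wall-clock seconds live in a `wall_clock_s` column (an assumption; the query at the bottom of this page only confirms `task_id`, `condition`, `status`, and `resolved`, and the 0-hop filter is omitted because its column isn't shown):

```python
import sqlite3
from statistics import mean, stdev

# Assumed column name: wall_clock_s (not confirmed by the source).
QUERY = """
SELECT condition, wall_clock_s
FROM harness_runs
WHERE task_id = ?
  AND condition IN ('control', 'treatment_api', 'treatment_probes')
  AND status = 'completed'
  AND wall_clock_s >= 30          -- drop quick failures (sub-30s runs)
"""

def card_stats(db_path: str, task_id: str) -> dict:
    """Mean, stddev, and range of wall-clock time per condition for one task."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(QUERY, (task_id,)).fetchall()
    by_condition: dict[str, list[float]] = {}
    for condition, seconds in rows:
        by_condition.setdefault(condition, []).append(seconds)
    return {
        cond: {
            "mu": mean(times),
            "sigma": stdev(times) if len(times) > 1 else 0.0,
            "range": (min(times), max(times)),
            "n": len(times),
        }
        for cond, times in by_condition.items()
    }

print(card_stats("harness_runs.db", "pytest-dev__pytest-5787"))
```
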
Eleven stages, every artifact on disk. Each file is compressed to ~10-20% of its original size via the Hypernym float pipeline; shield rotation and packed-facts selection decide which trials of which floats survive into the final prompt.
Jarvis PageRank + D1+D7 forward selection picks the top 40 source files for the bug.
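
The Jarvis PageRank features and the D1+D7 forward-selection criteria aren't reproduced here; as a rough stand-in, here is a sketch that ranks files by networkx PageRank over the repository's import graph, which is the same shape of computation under much simpler signals:

```python
import ast
from pathlib import Path

import networkx as nx

def import_graph(repo_root: str) -> nx.DiGraph:
    """Edge a -> b means module a imports module b (names reduced to file stems)."""
    modules = {p.stem: p for p in Path(repo_root).rglob("*.py")}
    graph = nx.DiGraph()
    graph.add_nodes_from(modules)
    for name, path in modules.items():
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                targets = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module.split(".")[0]]
            else:
                continue
            for target in targets:
                if target in modules and target != name:
                    graph.add_edge(name, target)
    return graph

def top_files(repo_root: str, k: int = 40) -> list[str]:
    """Rank modules with PageRank and keep the top-k candidate files."""
    modules = {p.stem: p for p in Path(repo_root).rglob("*.py")}
    scores = nx.pagerank(import_graph(repo_root))
    return [str(modules[name]) for name, _ in
            sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```
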
AST chunking, P-Span sweep, 60 stochastic trials, coherence + coverage validation.
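
The P-Span sweep over 5-30 element counts and the 60 stochastic trials are not shown; this is only a minimal sketch of the AST-chunking idea, assuming chunks are top-level functions and classes plus whatever module-level code sits between them:

```python
import ast

def ast_chunks(source: str) -> list[str]:
    """Split a Python file into top-level def/class chunks plus interstitial code."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, last_end = [], 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            if start > last_end:                      # code between definitions
                chunks.append("\n".join(lines[last_end:start]))
            chunks.append("\n".join(lines[start:node.end_lineno]))
            last_end = node.end_lineno
    if last_end < len(lines):                         # trailing module-level code
        chunks.append("\n".join(lines[last_end:]))
    return [c for c in chunks if c.strip()]
```
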
Shield rotation + aboutness-guided fact resolution selects what survives into 80K tokens.
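
At its core this stage is a packing problem. A greedy sketch under the budgets from the settings table (80,000 prompt tokens, 3,077 tokens per float, cl100k_base), assuming each candidate fact arrives pre-scored as `{"float_id", "score", "text"}`; the shield rotation and the aboutness scoring themselves are not shown, and those field names are illustrative:

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
PROMPT_BUDGET = 80_000       # final prompt budget from the settings table
PER_FLOAT_CAP = 3_077        # per-float fact budget from the settings table

def pack_facts(facts: list[dict]) -> list[dict]:
    """Greedily keep the highest-scored facts that fit both token budgets."""
    kept, total = [], 0
    spent_per_float: dict[str, int] = {}
    for fact in sorted(facts, key=lambda f: -f["score"]):
        cost = len(ENC.encode(fact["text"]))
        spent = spent_per_float.get(fact["float_id"], 0)
        if total + cost > PROMPT_BUDGET or spent + cost > PER_FLOAT_CAP:
            continue
        kept.append(fact)
        total += cost
        spent_per_float[fact["float_id"]] = spent + cost
    return kept
```
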
Claude Code generates a patch. Zephyr SWE-bench harness runs the gold tests.
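
Resolution is judged from the gold tests, per the definition in the settings table below. A minimal sketch of that check, assuming per-test outcomes are available as status strings (the harness's actual report format isn't shown):

```python
def resolved(fail_to_pass: dict[str, str], pass_to_pass: dict[str, str]) -> bool:
    """A run resolves the task when every FAIL_TO_PASS test flips to passing
    and every PASS_TO_PASS test keeps passing."""
    return (all(status == "PASSED" for status in fail_to_pass.values())
            and all(status == "PASSED" for status in pass_to_pass.values()))

print(resolved({"test_new_fixture": "PASSED"}, {"test_regression": "PASSED"}))  # True
```
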
The leaderboard says: tune the agent, scaffold the tools, prompt-engineer per task. We sent the same Claude Code at the bug with no scaffolding and lost. We sent the same Claude Code with a Hypernym-compressed view of the codebase and on pytest-5787 won four out of five — recovering 80% of what tool-agent scaffolding delivers, with nothing but the prompt changing.
Settings used for every run in the scorecard above.
| Setting | Value |
| --- | --- |
| Source benchmark | SWE-bench Verified (500 tasks) |
| Hard-task cutoff | ≤31 external solves |
| Float compression | ~10-20% of original size |
| P-Span sweep | 5-30 element counts |
| Comprehensive trials | 60 per file |
| Coherence target | ~0.79-0.82 (BAAI/bge-m3) |
| Final prompt budget | 80,000 tokens (cl100k_base) |
| Per-float fact budget | 3,077 tokens |
| Treatment_api prompt size | ~85K tokens (84-95K range) |
| Patch generation model | Claude Code (Anthropic) |
| Harness | Zephyr B SWE-bench remote executor |
| Gold-test definition | FAIL_TO_PASS flips + PASS_TO_PASS holds |
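
The coherence target in the table can be sanity-checked with off-the-shelf embeddings. A sketch assuming coherence is the cosine similarity between each original chunk and its compressed float, computed with sentence-transformers and BAAI/bge-m3; the exact Hypernym coherence metric may differ:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")

def coherence(originals: list[str], compressed: list[str]) -> list[float]:
    """Cosine similarity between each original chunk and its compressed version."""
    orig_vecs = model.encode(originals, normalize_embeddings=True)
    comp_vecs = model.encode(compressed, normalize_embeddings=True)
    return [float(util.cos_sim(o, c)) for o, c in zip(orig_vecs, comp_vecs)]

# A float is accepted when its coherence lands in the ~0.79-0.82 target band.
def in_target_band(score: float, low: float = 0.79, high: float = 0.82) -> bool:
    return low <= score <= high
```
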
The numbers in the scorecard come from harness_runs.db. The aggregate is reproducible with a single SQL query.

    -- harness_runs.db (sqlite3)
    SELECT
      SUM(CASE WHEN condition='control' THEN 1 ELSE 0 END)                        AS ctl_runs,
      SUM(CASE WHEN condition='control' AND resolved=1 THEN 1 ELSE 0 END)         AS ctl_res,
      SUM(CASE WHEN condition LIKE 'treatment%' THEN 1 ELSE 0 END)                AS trt_runs,
      SUM(CASE WHEN condition LIKE 'treatment%' AND resolved=1 THEN 1 ELSE 0 END) AS trt_res
    FROM harness_runs
    WHERE task_id IN (
        'astropy__astropy-13398',
        'sphinx-doc__sphinx-7590',
        'pytest-dev__pytest-6197',
        'pytest-dev__pytest-5787'
      )
      AND condition IN ('control', 'treatment_api', 'treatment_probes')
      AND status='completed';
    -- ctl_runs=20  ctl_res=3  trt_runs=22  trt_res=5