Hypernym
SWE-bench Verified · Hard-Task Scorecard

Hypernym × SWE-bench

The treatment beats bare Claude Code on the tasks the field can't reliably solve
Higher resolution, half the wall-clock time, on inputs the leaderboard has barely touched.

Same Claude Code agent, same SWE-bench harness, same tool budget — the only thing we change is the prompt construction. With Hypernym-compressed context the model converges faster, more predictably, and on pytest-dev/pytest-5787 resolves a task that bare Claude Code did not solve in five attempts.

−28%
Avg time reduction
−71%
Max time reduction
2.85×
Avg σ tightening
8.08×
Max σ tightening
Scorecard

Per-task wall-clock and variance

Four hard tasks with real runs in both arms. Each card shows mean wall-clock (μ), standard deviation (σ), and the full range across all completed runs. Quick failures (0-hop or sub-30s runs) are excluded so the times reflect actual attempts. Resolution is from harness_runs.db.

pytest-dev/pytest-6197 Hard · 1-4h
−15m 48s
Time saved (−71%)
8.08×
σ tightening
Control · n=10
1346s± 934s
range [294s, 2437s] · 5–41 min
Very difficult solve
Treatment · n=5
397s± 116s
range [237s, 552s] · 4–9 min
Very difficult solve
Treatment runs cluster between 4 and 9 minutes; control runs spread from 5 to 41 minutes. Same model, same harness — variance reduced by 8×.
sphinx-doc/sphinx-7590 Near-frontier · >4h
−2m 06s
Time saved (−28%)
1.71×
σ tightening
Control · n=2
453s± 94s
range [386s, 520s]
Claude unsolvable
Treatment · n=5
326s± 55s
range [239s, 371s]
Claude unsolvable
Sphinx C++ User-Defined Literals support — never-solved frontier. Neither arm resolved, but treatment converges faster and tighter.
astropy/astropy-13398 Never solved · 1-4h
−0m 51s
Time saved (−19%)
1.26×
σ tightening
Control · n=5
264s± 42s
range [232s, 333s]
Claude unsolvable
Treatment · n=5
213s± 33s
range [174s, 239s]
Claude unsolvable
Never-solved frontier — 0 external resolutions on the SWE-bench leaderboard. Neither arm cracked it; treatment was faster and tighter.
pytest-dev/pytest-5787 Hard · 1-4h
+0m 17s
Time (+8%)
0 → 4
Resolved gain
Control · n=5
225s± 14s
range [211s, 247s]
Claude unsolvable
Treatment · n=5
243s± 42s
range [200s, 313s]
Hypernym solved
Star result. Bare Claude Code: 0/5 resolved across 5 attempts. With Hypernym context: 4/5 resolved. The +17s/run cost buys 4× the resolution rate. Glowing bars = harness-resolved runs.
Methodology

From source repo to one 80K-token prompt

Eleven stages, every artifact on disk. Each file gets compressed to ~10-20% of original via the Hypernym float pipeline; shield rotation and packed-facts pick which trials of which floats survive into the final prompt.

1

Rank files

Jarvis PageRank + D1+D7 forward selection picks the top 40 source files for the bug.

2

Float each file

AST chunking, P-Span sweep, 60 stochastic trials, coherence + coverage validation.

3

Pack facts

Shield rotation + aboutness-guided fact resolution selects what survives into 80K tokens.

4

Validate

Claude Code generates a patch. Zephyr SWE-bench harness runs the gold tests.

The leaderboard says: tune the agent, scaffold the tools, prompt-engineer per task. We sent the same Claude Code at the bug with no scaffolding and lost. We sent the same Claude Code with a Hypernym-compressed view of the codebase and on pytest-5787 won four out of five — recovering 80% of what tool-agent scaffolding delivers, with nothing but the prompt changing.
Specifications

Pipeline at the bytes level

Settings used for every run in the scorecard above.

Source benchmarkSWE-bench Verified (500 tasks)
Hard-task cutoff≤31 external solves
Float compression~10-20% of original size
P-Span sweep5-30 element counts
Comprehensive trials60 per file
Coherence target~0.79-0.82 (BAAI/bge-m3)
Final prompt budget80,000 tokens (cl100k_base)
Per-float fact budget3,077 tokens
Treatment_api prompt size~85K tokens (84-95K range)
Patch generation modelClaude Code (Anthropic)
HarnessZephyr B SWE-bench remote executor
Gold-test definitionFAIL_TO_PASS flips + PASS_TO_PASS holds
Reproducibility

Every cell, queryable

The numbers in the scorecard come from harness_runs.db. The aggregate is reproducible with a single SQL query.

-- harness_runs.db (sqlite3)
SELECT
    SUM(CASE WHEN condition='control' THEN 1 ELSE 0 END) AS ctl_runs,
    SUM(CASE WHEN condition='control' AND resolved=1 THEN 1 ELSE 0 END) AS ctl_res,
    SUM(CASE WHEN condition LIKE 'treatment%' THEN 1 ELSE 0 END) AS trt_runs,
    SUM(CASE WHEN condition LIKE 'treatment%' AND resolved=1 THEN 1 ELSE 0 END) AS trt_res
FROM harness_runs
WHERE task_id IN (
    'astropy__astropy-13398',
    'sphinx-doc__sphinx-7590',
    'pytest-dev__pytest-6197',
    'pytest-dev__pytest-5787'
)
AND condition IN ('control', 'treatment_api', 'treatment_probes')
AND status='completed';

-- ctl_runs=20  ctl_res=3  trt_runs=22  trt_res=5