The treatment beats bare Claude Code on the tasks the field can't reliably solve
Higher resolution, half the wall-clock time, on inputs the leaderboard has barely touched.
Same Claude Code agent, same SWE-bench harness, same tool budget — the only thing we change is the prompt construction. With Hypernym-compressed context the model converges faster, more predictably, and on pytest-dev/pytest-5787 resolves a task that bare Claude Code did not solve in five attempts.
Four hard tasks with real runs in both arms. Each card shows mean wall-clock (μ), standard deviation (σ), and the full range across all completed runs. Quick failures (0-hop or sub-30s runs) are excluded so the times reflect actual attempts. Resolution is from harness_runs.db.
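
Those per-card numbers can be recomputed straight from the database. A minimal Python sketch, assuming wall-clock seconds live in a `wall_clock_s` column (an assumption; the query at the bottom of this page only confirms `task_id`, `condition`, `status`, and `resolved`, and the 0-hop filter is omitted because its column isn't shown):

```python
import sqlite3
from statistics import mean, stdev

# Assumed column name: wall_clock_s (not confirmed by the source).
QUERY = """
SELECT condition, wall_clock_s
FROM harness_runs
WHERE task_id = ?
  AND condition IN ('control', 'treatment_api', 'treatment_probes')
  AND status = 'completed'
  AND wall_clock_s >= 30          -- drop quick failures (sub-30s runs)
"""

def card_stats(db_path: str, task_id: str) -> dict:
    """Mean, stddev, and range of wall-clock time per condition for one task."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(QUERY, (task_id,)).fetchall()
    by_condition: dict[str, list[float]] = {}
    for condition, seconds in rows:
        by_condition.setdefault(condition, []).append(seconds)
    return {
        cond: {
            "mu": mean(times),
            "sigma": stdev(times) if len(times) > 1 else 0.0,
            "range": (min(times), max(times)),
            "n": len(times),
        }
        for cond, times in by_condition.items()
    }

print(card_stats("harness_runs.db", "pytest-dev__pytest-5787"))
```
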
Eleven stages, every artifact on disk. Each file is compressed to ~10-20% of its original size via the Hypernym float pipeline; shield rotation and packed-facts selection decide which trials of which floats survive into the final prompt.
Jarvis PageRank + D1+D7 forward selection picks the top 40 source files for the bug.
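
The Jarvis PageRank features and the D1+D7 forward-selection criteria aren't reproduced here; as a rough stand-in, here is a sketch that ranks files by networkx PageRank over the repository's import graph, which is the same shape of computation under much simpler signals:

```python
import ast
from pathlib import Path

import networkx as nx

def import_graph(repo_root: str) -> nx.DiGraph:
    """Edge a -> b means module a imports module b (names reduced to file stems)."""
    modules = {p.stem: p for p in Path(repo_root).rglob("*.py")}
    graph = nx.DiGraph()
    graph.add_nodes_from(modules)
    for name, path in modules.items():
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                targets = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module.split(".")[0]]
            else:
                continue
            for target in targets:
                if target in modules and target != name:
                    graph.add_edge(name, target)
    return graph

def top_files(repo_root: str, k: int = 40) -> list[str]:
    """Rank modules with PageRank and keep the top-k candidate files."""
    modules = {p.stem: p for p in Path(repo_root).rglob("*.py")}
    scores = nx.pagerank(import_graph(repo_root))
    return [str(modules[name]) for name, _ in
            sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```
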
AST chunking, P-Span sweep, 60 stochastic trials, coherence + coverage validation.
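
The P-Span sweep over 5-30 element counts and the 60 stochastic trials are not shown; this is only a minimal sketch of the AST-chunking idea, assuming chunks are top-level functions and classes plus whatever module-level code sits between them:

```python
import ast

def ast_chunks(source: str) -> list[str]:
    """Split a Python file into top-level def/class chunks plus interstitial code."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, last_end = [], 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            if start > last_end:                      # code between definitions
                chunks.append("\n".join(lines[last_end:start]))
            chunks.append("\n".join(lines[start:node.end_lineno]))
            last_end = node.end_lineno
    if last_end < len(lines):                         # trailing module-level code
        chunks.append("\n".join(lines[last_end:]))
    return [c for c in chunks if c.strip()]
```
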
Shield rotation + aboutness-guided fact resolution selects what survives into 80K tokens.
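
At its core this stage is a packing problem. A greedy sketch under the budgets from the settings table (80,000 prompt tokens, 3,077 tokens per float, cl100k_base), assuming each candidate fact arrives pre-scored as `{"float_id", "score", "text"}`; the shield rotation and the aboutness scoring themselves are not shown, and those field names are illustrative:

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
PROMPT_BUDGET = 80_000       # final prompt budget from the settings table
PER_FLOAT_CAP = 3_077        # per-float fact budget from the settings table

def pack_facts(facts: list[dict]) -> list[dict]:
    """Greedily keep the highest-scored facts that fit both token budgets."""
    kept, total = [], 0
    spent_per_float: dict[str, int] = {}
    for fact in sorted(facts, key=lambda f: -f["score"]):
        cost = len(ENC.encode(fact["text"]))
        spent = spent_per_float.get(fact["float_id"], 0)
        if total + cost > PROMPT_BUDGET or spent + cost > PER_FLOAT_CAP:
            continue
        kept.append(fact)
        total += cost
        spent_per_float[fact["float_id"]] = spent + cost
    return kept
```
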
Claude Code generates a patch. Zephyr SWE-bench harness runs the gold tests.
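
Resolution is judged from the gold tests, per the definition in the settings table below. A minimal sketch of that check, assuming per-test outcomes are available as status strings (the harness's actual report format isn't shown):

```python
def resolved(fail_to_pass: dict[str, str], pass_to_pass: dict[str, str]) -> bool:
    """A run resolves the task when every FAIL_TO_PASS test flips to passing
    and every PASS_TO_PASS test keeps passing."""
    return (all(status == "PASSED" for status in fail_to_pass.values())
            and all(status == "PASSED" for status in pass_to_pass.values()))

print(resolved({"test_new_fixture": "PASSED"}, {"test_regression": "PASSED"}))  # True
```
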
The leaderboard says: tune the agent, scaffold the tools, prompt-engineer per task. We sent the same Claude Code at the bug with no scaffolding and lost. We sent the same Claude Code with a Hypernym-compressed view of the codebase and on pytest-5787 won four out of five — recovering 80% of what tool-agent scaffolding delivers, with nothing but the prompt changing.
Settings used for every run in the scorecard above.
| Setting | Value |
| --- | --- |
| Source benchmark | SWE-bench Verified (500 tasks) |
| Hard-task cutoff | ≤31 external solves |
| Float compression | ~10-20% of original size |
| P-Span sweep | 5-30 element counts |
| Comprehensive trials | 60 per file |
| Coherence target | ~0.79-0.82 (BAAI/bge-m3) |
| Final prompt budget | 80,000 tokens (cl100k_base) |
| Per-float fact budget | 3,077 tokens |
| Treatment_api prompt size | ~85K tokens (84-95K range) |
| Patch generation model | Claude Code (Anthropic) |
| Harness | Zephyr B SWE-bench remote executor |
| Gold-test definition | FAIL_TO_PASS flips + PASS_TO_PASS holds |
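
The coherence target in the table can be sanity-checked with off-the-shelf embeddings. A sketch assuming coherence is the cosine similarity between each original chunk and its compressed float, computed with sentence-transformers and BAAI/bge-m3; the exact Hypernym coherence metric may differ:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")

def coherence(originals: list[str], compressed: list[str]) -> list[float]:
    """Cosine similarity between each original chunk and its compressed version."""
    orig_vecs = model.encode(originals, normalize_embeddings=True)
    comp_vecs = model.encode(compressed, normalize_embeddings=True)
    return [float(util.cos_sim(o, c)) for o, c in zip(orig_vecs, comp_vecs)]

# A float is accepted when its coherence lands in the ~0.79-0.82 target band.
def in_target_band(score: float, low: float = 0.79, high: float = 0.82) -> bool:
    return low <= score <= high
```
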
The numbers in the scorecard come from harness_runs.db. The aggregate is reproducible with a single SQL query.

    -- harness_runs.db (sqlite3)
    SELECT
      SUM(CASE WHEN condition='control' THEN 1 ELSE 0 END)                        AS ctl_runs,
      SUM(CASE WHEN condition='control' AND resolved=1 THEN 1 ELSE 0 END)         AS ctl_res,
      SUM(CASE WHEN condition LIKE 'treatment%' THEN 1 ELSE 0 END)                AS trt_runs,
      SUM(CASE WHEN condition LIKE 'treatment%' AND resolved=1 THEN 1 ELSE 0 END) AS trt_res
    FROM harness_runs
    WHERE task_id IN (
        'astropy__astropy-13398',
        'sphinx-doc__sphinx-7590',
        'pytest-dev__pytest-6197',
        'pytest-dev__pytest-5787'
      )
      AND condition IN ('control', 'treatment_api', 'treatment_probes')
      AND status='completed';
    -- ctl_runs=20  ctl_res=3  trt_runs=22  trt_res=5