swebench_harness
SWE-bench harness scorer — runs the official swebench test harness.
This is the authoritative pass/fail scorer for SWE-bench-Verified.
The lightweight SWEBenchScorer class in swebench_structural.py
checks only that the model produced something patch-shaped; this one
actually applies the patch, runs the targeted tests, and reads the
harness's report JSON.
Backends (selected by SWEBENCH_BACKEND env var):
- modal (default): runs on Modal in the cloud; needs swebench[modal] installed and modal token new configured once.
- docker: runs locally; needs the Docker daemon and the user in the docker group.
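The backend selection described above can be sketched as follows. This is a minimal illustration, not the scorer's actual code: the helper name resolve_backend, the case and whitespace normalisation, and the error on unknown values are all assumptions of this sketch. Only the SWEBENCH_BACKEND variable, the two backend names, and the modal default come from the description above.

```python
import os

VALID_BACKENDS = ("modal", "docker")  # the two backends named above


def resolve_backend(environ=None):
    """Return the backend named by SWEBENCH_BACKEND, defaulting to "modal".

    Hypothetical helper: the real scorer may read the variable differently.
    """
    env = os.environ if environ is None else environ
    # Normalising case and whitespace is an assumption of this sketch.
    backend = env.get("SWEBENCH_BACKEND", "modal").strip().lower() or "modal"
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown SWEBENCH_BACKEND: {backend!r}")
    return backend
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise without touching the real environment.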
Ported from hybrid-local-cloud-compute/benches/swebench_verified/{runner,parsing}.py,
with two upstream-swebench patches applied at import time:
- Modal cgroup-v2 fix: swebench/harness/modal_eval/run_evaluation_modal.py:66 writes to /sys/fs/cgroup/cpu/cpu.shares (cgroup v1). Modal v2 sandboxes use cgroup v2, so the path doesn't exist and every sandbox dies on the write. The patch wraps the write in try/except.
- Rescore *_ids fix: older harness rescore code read resolved_instances / unresolved_instances / error_instances as lists. Current swebench writes counts there and puts IDs in *_ids fields. Wherever we read these we use *_ids.
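The two fixes can be illustrated in isolation. The sketch below does not reproduce the patched swebench code: write_cpu_shares and ids_for are hypothetical names, and the exact exception type caught by the real patch is not stated above, so OSError is an assumption. The cgroup path and the *_ids field convention are taken from the description.

```python
def write_cpu_shares(value, path="/sys/fs/cgroup/cpu/cpu.shares"):
    """Best-effort write of a cgroup v1 setting.

    Returns True if the write succeeded and False if the file is unusable.
    """
    try:
        with open(path, "w") as fh:
            fh.write(str(value))
        return True
    except OSError:
        # On cgroup v2 hosts the v1 path does not exist; skip instead of crashing.
        return False


def ids_for(report, kind):
    """Return instance IDs for kind in {"resolved", "unresolved", "error"}.

    Reads the "<kind>_ids" list, never "<kind>_instances", which holds
    a count in current harness reports.
    """
    return list(report.get(f"{kind}_ids", []))
```

For example, a report shaped like {"resolved_instances": 1, "resolved_ids": ["some-id"]} yields ["some-id"] from ids_for(report, "resolved"), whereas iterating over resolved_instances would fail on the integer.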
Both patches are idempotent and only fire when the harness modules are
imported via this scorer (we don't touch swebench until score() is
called for the first time).
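One common way to get idempotent, lazily applied patches is a module-level guard that is checked on first use. The following is a sketch of that pattern only; the names _PATCHED and ensure_patched and the callback argument are hypothetical, and the scorer's real mechanism may differ.

```python
_PATCHED = False  # module-level guard; flipped once the patches have run


def ensure_patched(apply_patches):
    """Run apply_patches() at most once per process.

    Returns True if the patches were applied by this call, False if they
    had already been applied. Intended to be called at the start of
    score(), so nothing is imported or patched before first use.
    """
    global _PATCHED
    if _PATCHED:
        return False
    apply_patches()
    _PATCHED = True
    return True
```

Calling ensure_patched repeatedly is safe: only the first call invokes the callback.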
Classes
SWEBenchHarnessScorer
SWEBenchHarnessScorer(*, timeout_s: int = 1800, judge_backend: object = None, judge_model: str = '')
Bases: Scorer
SWE-bench Verified scorer that runs the official harness.
score(record, model_answer) returns (is_correct, details):
- is_correct = True if the harness marks the instance resolved.
- is_correct = False on harness failure or unresolved.
- details includes the raw harness report under ["report"] plus a "patch" field with the extracted patch text.
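A caller might consume the (is_correct, details) pair along these lines. The summarize helper is hypothetical and works on any tuple of that shape; it assumes only the "report" and "patch" keys described above and tolerates either being absent.

```python
def summarize(result):
    """Turn a (is_correct, details) pair into a one-line status string."""
    is_correct, details = result
    report = details.get("report")
    patch = details.get("patch") or ""
    status = "resolved" if is_correct else "not resolved"
    return (
        f"{status}; patch lines: {len(patch.splitlines())}; "
        f"report present: {report is not None}"
    )
```

In practice the argument would be the return value of scorer.score(record, model_answer).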
Source code in src/openjarvis/evals/scorers/swebench_harness.py
Functions
extract_patch
Pull a unified diff out of agent output. Returns None if no diff is found.
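The real extraction rules are not documented here, so the following is only one plausible approach, under stated assumptions: it prefers a fenced block whose body starts with a diff header, and otherwise takes everything from the first "diff --git " or "--- " line to the end of the text. The name extract_patch_sketch is hypothetical and chosen to avoid confusion with the actual function.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built up to keep this block readable
_FENCED = re.compile(FENCE + r"(?:diff|patch)?\n(.*?)" + FENCE, re.DOTALL)
_HEADER = re.compile(r"^(?:diff --git |--- )", re.MULTILINE)


def extract_patch_sketch(text):
    """Return the first unified diff found in text, or None."""
    # Prefer fenced blocks whose body looks like a diff.
    for body in _FENCED.findall(text):
        if body.startswith(("diff --git ", "--- ")):
            return body
    # Otherwise take everything from the first diff header line onwards.
    match = _HEADER.search(text)
    if match:
        return text[match.start():]
    return None
```

The fallback branch is deliberately permissive, so surrounding prose after the diff is not trimmed; a stricter variant would stop at the end of the last hunk.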