swebench_harness


SWE-bench harness scorer — runs the official swebench test harness.

This is the authoritative pass/fail scorer for SWE-bench-Verified. The lightweight SWEBenchScorer class in swebench_structural.py checks only that the model produced something patch-shaped; this scorer applies the patch, runs the targeted tests, and reads the harness's report JSON.

Backends (selected by SWEBENCH_BACKEND env var):

modal (default): runs on Modal in the cloud; requires swebench[modal] to be installed and `modal token new` to have been run once.
docker: runs locally; requires a running Docker daemon and a user in the docker group.
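
As a minimal sketch of the backend selection described above: the helper name `resolve_backend`, the tuple of known backends, and the validation step are assumptions made for illustration, not the scorer's actual code. Only the `SWEBENCH_BACKEND` variable and the `modal` default come from this page.

```python
import os

# Hypothetical helper: the name and the validation are illustrative only.
_KNOWN_BACKENDS = ("modal", "docker")


def resolve_backend(env=None):
    """Return the backend named by SWEBENCH_BACKEND, defaulting to "modal"."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the documented default.
    backend = (env.get("SWEBENCH_BACKEND") or "modal").strip().lower()
    if backend not in _KNOWN_BACKENDS:
        raise ValueError(f"unknown SWEBENCH_BACKEND: {backend!r}")
    return backend
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to exercise without touching the real environment.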

Ported from hybrid-local-cloud-compute/benches/swebench_verified/{runner,parsing}.py, with two upstream-swebench patches applied at import time:

Modal cgroup-v2 fix: swebench/harness/modal_eval/run_evaluation_modal.py:66 writes to /sys/fs/cgroup/cpu/cpu.shares, which is a cgroup v1 path. Modal v2 sandboxes use cgroup v2, where that path does not exist, so every sandbox dies on the write. The fix wraps the write in try/except. In swebench 4.x the call site is ModalSandboxRuntime.__init__ → self.write_file(...) → self.sandbox.open(path, "w"); in older swebench it was a free function, set_cpu_quota. We patch both: write_file swallows FileNotFoundError for cgroup paths, and set_cpu_quota (if present) is wrapped too.
Rescore *_ids fix: older harness rescore code read resolved_instances / unresolved_instances / error_instances as lists. Current swebench writes counts there and puts IDs in *_ids fields. Wherever we read these we use *_ids.
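
The rescore fix can be illustrated with a small reader, offered as a sketch rather than the scorer's actual code. The function name `resolved_ids` and the fallback for legacy list-valued reports are assumptions; the `resolved_ids` field follows the `*_ids` naming described above.

```python
def resolved_ids(report):
    """Return the resolved instance IDs from a harness report dict."""
    ids = report.get("resolved_ids")
    if ids is not None:
        return list(ids)
    # Assumed fallback: older reports stored the ID list directly in
    # resolved_instances, whereas current ones store a count there.
    legacy = report.get("resolved_instances")
    return list(legacy) if isinstance(legacy, list) else []
```

Reading `resolved_instances` as a list on a current report would fail or silently misbehave, because the value is an integer count.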

Both patches are idempotent and only fire when the harness modules are imported via this scorer (we don't touch swebench until score() is called for the first time).
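
One way to make such a patch idempotent is to tag the wrapper and skip re-wrapping when the tag is present. The sketch below uses a stand-in class, `FakeRuntime`, in place of the real harness runtime. The marker attribute, the helper name, and the method signature are all hypothetical and not taken from swebench.

```python
import functools

_PATCHED_FLAG = "_cgroup_patch_applied"  # hypothetical marker attribute


class FakeRuntime:
    """Stand-in for the harness runtime; not the real swebench class."""

    def write_file(self, path, content):
        if path.startswith("/sys/fs/cgroup/cpu/"):
            raise FileNotFoundError(path)  # simulates a cgroup-v2 sandbox
        return len(content)


def patch_write_file(cls):
    """Wrap cls.write_file once so missing cgroup paths are ignored."""
    original = cls.write_file
    if getattr(original, _PATCHED_FLAG, False):
        return  # already patched; a second call is a no-op

    @functools.wraps(original)
    def safe_write_file(self, path, content):
        try:
            return original(self, path, content)
        except FileNotFoundError:
            # Swallow the error only for cgroup paths; re-raise otherwise.
            if str(path).startswith("/sys/fs/cgroup/"):
                return None
            raise

    setattr(safe_write_file, _PATCHED_FLAG, True)
    cls.write_file = safe_write_file
```

Calling `patch_write_file(FakeRuntime)` twice leaves a single wrapper in place, so repeated `score()` calls would not stack wrappers.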

Classes

SWEBenchHarnessScorer

SWEBenchHarnessScorer(*, timeout_s: int = 1800, cell_name: Optional[str] = None, judge_backend: object = None, judge_model: str = '')

Bases: Scorer

SWE-bench Verified scorer that runs the official harness.

score(record, model_answer) returns (is_correct, details):

is_correct = True if the harness marks the instance resolved.
is_correct = False if the harness fails or the instance is unresolved.
details includes the raw harness report under ["report"] plus a "patch" field with the extracted patch text.
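
A caller might consume the returned tuple roughly as follows. The helper `summarize` is hypothetical; only the `"report"` and `"patch"` keys come from the description above, and the sketch assumes `"patch"` may be missing or `None` when no patch was extracted.

```python
def summarize(result):
    """Render a one-line summary of a (is_correct, details) result."""
    is_correct, details = result
    status = "resolved" if is_correct else "not resolved"
    # Assumption: "patch" can be absent or None when extraction failed.
    patch = details.get("patch") or ""
    return f"{status} ({len(patch.splitlines())} patch lines)"
```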

Source code in src/openjarvis/evals/scorers/swebench_harness.py

def __init__(
    self,
    *,
    timeout_s: int = 1800,
    cell_name: Optional[str] = None,
    judge_backend: object = None,  # noqa: ARG002 — CLI factory compat
    judge_model: str = "",  # noqa: ARG002 — CLI factory compat
) -> None:
    self._timeout_s = int(timeout_s)
    # ``cell_name`` namespaces the ``run_id`` so concurrent cells scoring
    # the same SWE instance don't collide on the harness's shared cache.
    # See :func:`_build_run_id` for the failure mode this prevents. Pass
    # the hybrid cell name (e.g. ``"skillorchestra-qwen36-opus47-swe-n100"``)
    # or leave as ``None`` for single-cell callers.
    self._cell_name = cell_name
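
The body of `_build_run_id` is not shown on this page, so the following is only a guess at how namespacing a run ID by cell name could work. The function name `build_run_id`, the `score-` prefix, and the sanitising regex are all assumptions.

```python
import re


def build_run_id(instance_id, cell_name=None):
    """Return a run ID, prefixed with the cell name when one is given."""
    base = f"score-{instance_id}"  # hypothetical base format
    if not cell_name:
        return base  # single-cell callers share the plain ID
    # Keep the ID safe for paths by collapsing unusual characters.
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "-", cell_name)
    return f"{safe}-{base}"
```

With this scheme, two cells scoring the same instance produce distinct run IDs and therefore do not write to the same cache entry.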

Functions

extract_patch

extract_patch(text: str) -> Optional[str]

Pull a unified diff out of agent output. None if not found.

Source code in src/openjarvis/evals/scorers/swebench_harness.py

def extract_patch(text: str) -> Optional[str]:
    """Pull a unified diff out of agent output. ``None`` if not found."""
    if not text:
        return None
    for pat in _FENCE_PATTERNS:
        m = pat.search(text)
        if m:
            return m.group(1).strip() + "\n"
    if "diff --git" in text:
        start = text.index("diff --git")
        return text[start:].strip() + "\n"
    return None
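
The listing above relies on `_FENCE_PATTERNS`, which is not shown on this page. The self-contained sketch below substitutes a single assumed pattern for fenced `diff` or `patch` blocks so that the behaviour can be tried out; the real pattern list may differ.

```python
import re
from typing import Optional

_FENCE = "`" * 3  # three backticks, built indirectly to keep this block readable

# Assumed pattern; the real _FENCE_PATTERNS is not shown in this page.
_FENCE_PATTERNS = [
    re.compile(_FENCE + r"(?:diff|patch)\s*\n(.*?)" + _FENCE, re.DOTALL),
]


def extract_patch(text: str) -> Optional[str]:
    """Pull a unified diff out of agent output. ``None`` if not found."""
    if not text:
        return None
    for pat in _FENCE_PATTERNS:
        m = pat.search(text)
        if m:
            return m.group(1).strip() + "\n"
    # Fall back to everything from the first raw "diff --git" header.
    if "diff --git" in text:
        start = text.index("diff --git")
        return text[start:].strip() + "\n"
    return None


answer = (
    "Here is the fix:\n"
    + _FENCE + "diff\n"
    + "diff --git a/x.py b/x.py\n--- a/x.py\n+++ b/x.py\n"
    + _FENCE + "\nDone."
)
patch = extract_patch(answer)
```

Note that the raw fallback keeps everything after the first `diff --git`, including any trailing prose, so fenced output is the more reliable form.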