swebench_harness

SWE-bench harness scorer — runs the official swebench test harness.

This is the authoritative pass/fail scorer for SWE-bench-Verified. The lightweight SWEBenchScorer class in swebench_structural.py checks only that the model produced something patch-shaped; this one actually applies the patch, runs the targeted tests, and reads the harness's report JSON.

Backends (selected by SWEBENCH_BACKEND env var):

  • modal (default): runs on Modal in the cloud; requires swebench[modal] and a Modal token, configured once by running modal token new.
  • docker: runs locally; requires a running Docker daemon and a user in the docker group.
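The backend selection described above can be sketched as follows. This is a minimal illustration, not the scorer's actual code: the helper name `resolve_backend`, the whitespace and case normalization, and the error raised for unknown values are all assumptions. Only the variable name SWEBENCH_BACKEND, the two backend names, and the modal default come from the text.

```python
import os
from typing import Mapping, Optional

# The two backends named in the documentation above.
VALID_BACKENDS = ("modal", "docker")


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the backend name, falling back to 'modal' when unset or empty.

    Hypothetical helper: the real scorer may read the variable differently.
    """
    env = os.environ if env is None else env
    backend = (env.get("SWEBENCH_BACKEND") or "modal").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"SWEBENCH_BACKEND must be one of {VALID_BACKENDS}, got {backend!r}"
        )
    return backend
```

Passing a plain dict instead of reading os.environ directly keeps the helper easy to test.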

Ported from hybrid-local-cloud-compute/benches/swebench_verified/{runner,parsing}.py, with two upstream-swebench patches applied when the harness modules are imported:

  1. Modal cgroup-v2 fix: swebench/harness/modal_eval/run_evaluation_modal.py:66 writes to /sys/fs/cgroup/cpu/cpu.shares, a cgroup v1 path. Modal v2 sandboxes use cgroup v2, so the path doesn't exist and every sandbox dies on the write. The fix wraps the write in try/except. In swebench 4.x the call chain is ModalSandboxRuntime.__init__ → self.write_file(...) → self.sandbox.open(path, "w"); in older swebench it was a free set_cpu_quota function. We patch both: write_file swallows FileNotFoundError for cgroup paths, and set_cpu_quota (if present) is wrapped too.

  2. Rescore *_ids fix: older harness rescore code read resolved_instances / unresolved_instances / error_instances as lists. Current swebench writes counts there and puts IDs in *_ids fields. Wherever we read these we use *_ids.
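Reading an outcome from the *_ids fields might look like the sketch below. The helper name `instance_outcome` and the sample report are illustrative assumptions; the report layout (counts under *_instances, IDs under *_ids) follows the description in item 2, and the outcomes shown are made up for the example.

```python
from typing import Any


def instance_outcome(report: dict[str, Any], instance_id: str) -> str:
    """Classify one instance using the *_ids lists, never the count fields."""
    for outcome in ("resolved", "unresolved", "error"):
        if instance_id in report.get(f"{outcome}_ids", []):
            return outcome
    return "missing"


# Illustrative report: *_instances hold counts, *_ids hold the instance IDs.
sample_report = {
    "resolved_instances": 1,
    "unresolved_instances": 1,
    "error_instances": 0,
    "resolved_ids": ["astropy__astropy-12907"],
    "unresolved_ids": ["django__django-11099"],
    "error_ids": [],
}
```

Code that iterated over resolved_instances as a list would fail here with a TypeError, because the field now holds an integer.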

Both patches are idempotent and only fire when the harness modules are imported via this scorer (we don't touch swebench until score() is called for the first time).
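One way to make such a patch idempotent is to mark the wrapper and skip re-wrapping when the mark is present. The sketch below demonstrates the idea on a stand-in class rather than on swebench itself; `FakeRuntime`, `patch_write_file`, and the `_cgroup_tolerant` marker are hypothetical names, and the real scorer may detect prior patching differently.

```python
_PATCH_FLAG = "_cgroup_tolerant"  # hypothetical marker attribute


class FakeRuntime:
    """Stand-in for the harness runtime; not the real swebench class."""

    def __init__(self) -> None:
        self.written: dict[str, str] = {}

    def write_file(self, path: str, content: str) -> None:
        # Simulate a sandbox where cgroup v1 paths and /missing/ do not exist.
        if path.startswith(("/sys/fs/cgroup/cpu/", "/missing/")):
            raise FileNotFoundError(path)
        self.written[path] = content


def patch_write_file(cls: type) -> None:
    """Wrap cls.write_file once so missing cgroup paths are ignored."""
    original = cls.write_file
    if getattr(original, _PATCH_FLAG, False):
        return  # already patched: a second call is a no-op

    def tolerant_write_file(self, path, content):
        try:
            return original(self, path, content)
        except FileNotFoundError:
            if str(path).startswith("/sys/fs/cgroup/"):
                return None  # cgroup v1 path absent under cgroup v2
            raise  # any other missing path is still an error

    setattr(tolerant_write_file, _PATCH_FLAG, True)
    cls.write_file = tolerant_write_file


patch_write_file(FakeRuntime)
patch_write_file(FakeRuntime)  # safe to repeat
```

Restricting the swallowed error to cgroup paths keeps genuine missing-file failures visible.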

Classes

SWEBenchHarnessScorer

SWEBenchHarnessScorer(*, timeout_s: int = 1800, cell_name: Optional[str] = None, judge_backend: object = None, judge_model: str = '')

Bases: Scorer

SWE-bench Verified scorer that runs the official harness.

score(record, model_answer) returns (is_correct, details):

  • is_correct = True if the harness marks the instance resolved.
  • is_correct = False if the harness fails or leaves the instance unresolved.
  • details includes the raw harness report under ["report"] plus a "patch" field with the extracted patch text.
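A caller might consume the (is_correct, details) pair along the lines of the sketch below. The helper `describe_result` is hypothetical and the sample values are invented for illustration; only the "report" and "patch" keys come from the description above.

```python
from typing import Any


def describe_result(is_correct: bool, details: dict[str, Any]) -> str:
    """Summarize a score() result in one line (hypothetical helper)."""
    patch = details.get("patch") or ""
    status = "resolved" if is_correct else "not resolved"
    report_note = "with" if details.get("report") else "without"
    return (
        f"{status}: {len(patch.splitlines())}-line patch, "
        f"{report_note} harness report"
    )


# Invented example values, shaped like the tuple score() is documented to return.
example = describe_result(
    True,
    {"patch": "diff --git a/x b/x\n+1\n", "report": {"resolved": True}},
)
```

Because is_correct is False both for harness failures and for unresolved instances, a caller that needs to tell them apart has to inspect details rather than the boolean.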
Source code in src/openjarvis/evals/scorers/swebench_harness.py
def __init__(
    self,
    *,
    timeout_s: int = 1800,
    cell_name: Optional[str] = None,
    judge_backend: object = None,  # noqa: ARG002 — CLI factory compat
    judge_model: str = "",  # noqa: ARG002 — CLI factory compat
) -> None:
    self._timeout_s = int(timeout_s)
    # ``cell_name`` namespaces the ``run_id`` so concurrent cells scoring
    # the same SWE instance don't collide on the harness's shared cache.
    # See :func:`_build_run_id` for the failure mode this prevents. Pass
    # the hybrid cell name (e.g. ``"skillorchestra-qwen36-opus47-swe-n100"``)
    # or leave as ``None`` for single-cell callers.
    self._cell_name = cell_name
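
The namespacing idea in the comment above can be illustrated with a small sketch. The real _build_run_id is not shown on this page, so the function below (`sketch_run_id`), its separator, and its sanitization rule are assumptions that demonstrate only the principle: two cells scoring the same instance get distinct run IDs.

```python
import re
from typing import Optional


def _sanitize(value: str) -> str:
    """Keep characters that are safe in directory names (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "-", value)


def sketch_run_id(instance_id: str, cell_name: Optional[str] = None) -> str:
    """Build a run ID that is unique per (cell, instance) pair.

    Hypothetical stand-in for _build_run_id; the actual format may differ.
    """
    base = _sanitize(instance_id)
    if cell_name is None:
        return base  # single-cell callers need no namespace
    return f"{_sanitize(cell_name)}__{base}"
```

With this scheme, cells named cell-a and cell-b scoring django__django-11099 would write to separate cache entries instead of overwriting each other.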

Functions

extract_patch

extract_patch(text: str) -> Optional[str]

Pull a unified diff out of agent output. None if not found.

Source code in src/openjarvis/evals/scorers/swebench_harness.py
def extract_patch(text: str) -> Optional[str]:
    """Pull a unified diff out of agent output. ``None`` if not found."""
    if not text:
        return None
    for pat in _FENCE_PATTERNS:
        m = pat.search(text)
        if m:
            return m.group(1).strip() + "\n"
    if "diff --git" in text:
        start = text.index("diff --git")
        return text[start:].strip() + "\n"
    return None
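
The two extraction paths can be exercised with the sketch below. _FENCE_PATTERNS is not shown on this page, so the single pattern here (a fenced block tagged diff or patch) is an assumption; the function body is copied from the listing above. The fence string is built at runtime only to keep literal backtick runs out of the example.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built at runtime

# Assumed pattern: the real _FENCE_PATTERNS may contain more or different entries.
_FENCE_PATTERNS = [
    re.compile(r"`{3}(?:diff|patch)[^\n]*\n(.*?)`{3}", re.DOTALL),
]


def extract_patch(text: str) -> Optional[str]:
    """Pull a unified diff out of agent output. ``None`` if not found."""
    if not text:
        return None
    for pat in _FENCE_PATTERNS:
        m = pat.search(text)
        if m:
            return m.group(1).strip() + "\n"
    if "diff --git" in text:
        start = text.index("diff --git")
        return text[start:].strip() + "\n"
    return None


DIFF = "diff --git a/a.py b/a.py\n--- a/a.py\n+++ b/a.py\n@@ -1 +1 @@\n-x = 1\n+x = 2"
fenced = f"Here is the fix:\n{FENCE}diff\n{DIFF}\n{FENCE}\nDone."
bare = f"I changed one line.\n{DIFF}\n"

from_fence = extract_patch(fenced)  # text inside the fence only
from_bare = extract_patch(bare)     # everything from "diff --git" onward
```

Note that the bare fallback returns everything from the first diff --git to the end of the text, so any prose the agent writes after an unfenced diff ends up inside the patch.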