swebench_harness

SWE-bench harness scorer — runs the official swebench test harness.

This is the authoritative pass/fail scorer for SWE-bench-Verified. The lightweight SWEBenchScorer class in swebench_structural.py checks only that the model produced something patch-shaped; this scorer applies the patch, runs the targeted tests, and reads the harness's report JSON.

Backends (selected by SWEBENCH_BACKEND env var):

  • modal (default): runs on Modal in the cloud. Requires swebench[modal] to be installed and the modal token new command to have been run once.
  • docker: runs locally. Requires a running Docker daemon and the current user in the docker group.
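
As an illustration of the selection rule above, here is a minimal sketch of reading SWEBENCH_BACKEND with modal as the default. The helper name resolve_backend, the case-insensitive matching, and the error message are assumptions made for this example, not the scorer's actual code.

```python
import os

# The two backend names come from the list above. Everything else here
# (helper name, normalization, error text) is an illustrative assumption.
SUPPORTED_BACKENDS = ("modal", "docker")


def resolve_backend(env=None):
    """Return the backend named by SWEBENCH_BACKEND, defaulting to "modal"."""
    env = os.environ if env is None else env
    backend = env.get("SWEBENCH_BACKEND", "modal").strip().lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"SWEBENCH_BACKEND={backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return backend
```

Passing a plain dict as env keeps the helper easy to exercise without touching the real process environment.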

Ported from hybrid-local-cloud-compute/benches/swebench_verified/{runner,parsing}.py, with two upstream-swebench patches applied at import time:

  1. Modal cgroup-v2 fix: swebench/harness/modal_eval/run_evaluation_modal.py:66 writes to /sys/fs/cgroup/cpu/cpu.shares (cgroup v1). Modal v2 sandboxes use cgroup v2, where that path does not exist, so every sandbox dies on the write. The patch wraps the write in try/except.

  2. Rescore *_ids fix: older harness rescore code read resolved_instances / unresolved_instances / error_instances as lists. Current swebench writes counts to those fields and puts the instance IDs in the matching *_ids fields. Wherever we need instance IDs, we read the *_ids fields.
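
To make the second fix concrete, the sketch below looks up an instance in the ID lists rather than in the count fields. The field names resolved_ids, unresolved_ids, and error_ids are inferred from the *_ids pattern described above; the helper classify_instance and the sample report are hypothetical.

```python
def classify_instance(report, instance_id):
    """Return "resolved", "unresolved", "error", or "missing" for one instance."""
    for status in ("resolved", "unresolved", "error"):
        # Read the ID list, never the similarly named *_instances count.
        if instance_id in report.get(f"{status}_ids", []):
            return status
    return "missing"


# Hypothetical report fragment: counts and ID lists side by side.
report = {
    "resolved_instances": 1,
    "unresolved_instances": 1,
    "error_instances": 0,
    "resolved_ids": ["demo__repo-1"],
    "unresolved_ids": ["demo__repo-2"],
    "error_ids": [],
}
```

Testing membership against resolved_instances instead would raise a TypeError here, because that field holds an integer count.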

Both patches are idempotent and only fire when the harness modules are imported via this scorer (we don't touch swebench until score() is called for the first time).
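
One way such an idempotent patch could be structured is sketched below, using a stand-in object rather than the real swebench module. The sentinel attribute, the helper names, and the set_cpu_shares stand-in are all hypothetical; the actual patching mechanism used by this scorer is not shown in this page.

```python
import functools
import types

_PATCH_FLAG = "_cgroup_patch_applied"  # hypothetical sentinel attribute


def ignore_missing_cgroup(fn):
    """Wrap fn so a failed cgroup write (OSError) is swallowed instead of fatal."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except OSError:
            # cgroup v1 path absent under cgroup v2: skip the CPU-shares tweak.
            return None
    return wrapper


def apply_patch_once(module, attr):
    """Replace module.<attr> with the tolerant wrapper; later calls are no-ops."""
    if getattr(module, _PATCH_FLAG, False):
        return False
    setattr(module, attr, ignore_missing_cgroup(getattr(module, attr)))
    setattr(module, _PATCH_FLAG, True)
    return True


def set_cpu_shares():
    # Stand-in that simulates the upstream write failing on a cgroup v2 host.
    raise FileNotFoundError("/sys/fs/cgroup/cpu/cpu.shares")


# Stand-in for the harness module; the real module is not imported here.
fake_harness = types.SimpleNamespace(set_cpu_shares=set_cpu_shares)
```

Calling apply_patch_once(fake_harness, "set_cpu_shares") a second time returns False and leaves the wrapper in place, which is the idempotence property described above.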

Classes

SWEBenchHarnessScorer

SWEBenchHarnessScorer(*, timeout_s: int = 1800, judge_backend: object = None, judge_model: str = '')

Bases: Scorer

SWE-bench Verified scorer that runs the official harness.

score(record, model_answer) returns (is_correct, details):

  • is_correct = True if the harness marks the instance resolved.
  • is_correct = False if the harness fails or leaves the instance unresolved.
  • details includes the raw harness report under ["report"] plus a "patch" field with the extracted patch text.
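
As a usage sketch, the (is_correct, details) pairs described above could be aggregated into a resolve rate as follows. The helper resolve_rate is hypothetical, and treating an empty or absent "patch" value as "no patch extracted" is an assumption about how details is populated.

```python
def resolve_rate(results):
    """Summarize (is_correct, details) pairs as returned by score()."""
    results = list(results)
    resolved = sum(1 for ok, _ in results if ok)
    # Assumption: a falsy or missing "patch" means no patch was extracted.
    no_patch = sum(1 for _, details in results if not details.get("patch"))
    total = len(results)
    return {
        "total": total,
        "resolved": resolved,
        "no_patch": no_patch,
        "rate": resolved / total if total else 0.0,
    }
```
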

Source code in src/openjarvis/evals/scorers/swebench_harness.py
def __init__(
    self,
    *,
    timeout_s: int = 1800,
    judge_backend: object = None,  # noqa: ARG002 — CLI factory compat
    judge_model: str = "",         # noqa: ARG002 — CLI factory compat
) -> None:
    self._timeout_s = int(timeout_s)

Functions

extract_patch

extract_patch(text: str) -> Optional[str]

Pull a unified diff out of agent output. None if not found.

Source code in src/openjarvis/evals/scorers/swebench_harness.py
def extract_patch(text: str) -> Optional[str]:
    """Pull a unified diff out of agent output. ``None`` if not found."""
    if not text:
        return None
    for pat in _FENCE_PATTERNS:
        m = pat.search(text)
        if m:
            return m.group(1).strip() + "\n"
    if "diff --git" in text:
        start = text.index("diff --git")
        return text[start:].strip() + "\n"
    return None
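
The following self-contained sketch exercises both branches of extract_patch. The real _FENCE_PATTERNS is not shown in this page, so the single pattern below (a fence tagged diff or patch) is an illustrative stand-in; the function body is copied from the listing above.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Assumption: stand-in for the module's real _FENCE_PATTERNS.
_FENCE_PATTERNS = [
    re.compile(FENCE + r"(?:diff|patch)[ \t]*\n(.*?)" + FENCE, re.DOTALL),
]


def extract_patch(text: str) -> Optional[str]:
    """Pull a unified diff out of agent output. ``None`` if not found."""
    if not text:
        return None
    for pat in _FENCE_PATTERNS:
        m = pat.search(text)
        if m:
            return m.group(1).strip() + "\n"
    if "diff --git" in text:
        start = text.index("diff --git")
        return text[start:].strip() + "\n"
    return None


DIFF = "diff --git a/f.py b/f.py\n--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-x = 1\n+x = 2"

fenced = "Here is the fix:\n" + FENCE + "diff\n" + DIFF + "\n" + FENCE + "\nDone."
bare = "I changed one line.\n" + DIFF + "\n\n"

from_fence = extract_patch(fenced)   # fenced branch: only the fence body
from_bare = extract_patch(bare)      # fallback branch: from "diff --git" onward
nothing = extract_patch("No changes were needed.")
```

Note that the fallback branch returns everything from the first "diff --git" to the end of the text, so any prose the agent writes after an unfenced diff would be included in the returned patch.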