swebench_structural

SWE-bench scorer — structural patch validation.

Full SWE-bench evaluation requires running tests inside the repository environment. This scorer performs lightweight structural checks on the model output (e.g. whether it looks like a valid patch) and defers the authoritative pass/fail to external test execution.

Classes

SWEBenchScorer

SWEBenchScorer(judge_backend: object = None, judge_model: str = '')

Bases: Scorer

Structural validation scorer for SWE-bench patches.

Because true SWE-bench scoring requires executing tests in a sandboxed repository checkout, this scorer only checks whether the model produced something that looks like a valid unified diff. When a patch-like response is detected, the is_correct field is set to None (indeterminate), and downstream harnesses should run the actual tests.
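As a rough illustration of what "looks like a valid unified diff" could mean, the sketch below checks for file headers and at least one hunk header. The function names, the regular expressions, and the choice to return False for non-patch output are all assumptions made for this example; the actual checks in swebench_structural.py may differ.

```python
import re
from typing import Optional

# Hypothetical structural markers of a unified diff (not taken from the
# real scorer): old/new file headers and a hunk header.
_FILE_OLD = re.compile(r"^--- \S+", re.MULTILINE)
_FILE_NEW = re.compile(r"^\+\+\+ \S+", re.MULTILINE)
_HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def looks_like_unified_diff(text: str) -> bool:
    """Return True if text has both file headers and at least one hunk."""
    has_headers = bool(_FILE_OLD.search(text) and _FILE_NEW.search(text))
    has_hunk = bool(_HUNK.search(text))
    return has_headers and has_hunk


def structural_is_correct(response: str) -> Optional[bool]:
    """Map a model response to an is_correct-style value.

    None means indeterminate: the response is patch-like, so the real
    verdict must come from running the tests. Returning False for
    non-patch output is an assumption of this sketch.
    """
    return None if looks_like_unified_diff(response) else False
```

A check like this only filters out responses that cannot possibly be applied as a patch; a None result says nothing about whether the patch fixes the issue.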

Source code in src/openjarvis/evals/scorers/swebench_structural.py

def __init__(
    self,
    judge_backend: object = None,
    judge_model: str = "",
) -> None:
    # Accept judge_backend/judge_model so the CLI factory pattern works,
    # but they are unused — scoring is purely structural.
    self._judge_backend = judge_backend
    self._judge_model = judge_model