swebench_structural
SWE-bench scorer — structural patch validation.
Full SWE-bench evaluation requires running tests inside the repository environment. This scorer performs lightweight structural checks on the model output (e.g. whether it looks like a valid patch) and defers the authoritative pass/fail to external test execution.
Classes
SWEBenchScorer
Bases: Scorer
Structural validation scorer for SWE-bench patches.
Since true SWE-bench scoring requires test execution in a sandboxed
repository checkout, this scorer only checks whether the model
produced something that looks like a valid unified diff. When a
patch-like response is detected, the is_correct field is set to None
(indeterminate); downstream harnesses should run the actual tests
to determine pass/fail.
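The structural check described above might be sketched as follows. This is a minimal illustration, not the scorer's actual implementation: the names `looks_like_unified_diff` and `structural_is_correct` and both regular expressions are hypothetical. Returning `False` for a response with no patch-like content is also an assumption, since the description only specifies the `None` case.

```python
import re
from typing import Optional

# Hypothetical heuristics; the real scorer may use different patterns.
# A unified diff has a "--- old" line immediately followed by a "+++ new" line...
_FILE_HEADER = re.compile(r"^--- \S.*\n\+\+\+ \S.*$", re.MULTILINE)
# ...and at least one hunk header such as "@@ -1,2 +1,2 @@".
_HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def looks_like_unified_diff(text: str) -> bool:
    """Return True if text contains a file header pair and a hunk header."""
    return bool(_FILE_HEADER.search(text)) and bool(_HUNK_HEADER.search(text))


def structural_is_correct(response: str) -> Optional[bool]:
    """Map a model response to an is_correct value.

    None means "patch-like, defer to test execution".
    False for non-patch output is an assumption of this sketch.
    """
    if looks_like_unified_diff(response):
        return None
    return False


sample_patch = "--- a/foo.py\n+++ b/foo.py\n@@ -1,2 +1,2 @@\n-old\n+new\n"
print(structural_is_correct(sample_patch))         # patch-like: indeterminate
print(structural_is_correct("The bug is in foo"))  # no diff found
```

Because the patterns use `search` rather than matching the whole string, a diff embedded in surrounding prose or a fenced block would still be detected. Whether the real scorer is that permissive is not stated in the description.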