terminalbench_native_structural

terminalbench_native_structural ¶

TerminalBench Native scorer — test-result-based evaluation.

Reads is_resolved and test_results from the record's metadata (populated by the native terminal-bench harness) and returns a deterministic pass/fail without any LLM judging.

Classes¶

TerminalBenchNativeScorer ¶

TerminalBenchNativeScorer(judge_backend: object = None, judge_model: str = '')

Bases: Scorer

Test-result-based scorer for TerminalBench Native tasks.

The native terminal-bench package produces is_resolved and test_results fields after executing a task. This scorer reads those fields from record.metadata and translates them into the standard (is_correct, meta) tuple.

Source code in src/openjarvis/evals/scorers/terminalbench_native_structural.py

def __init__(
    self,
    judge_backend: object = None,
    judge_model: str = "",
) -> None:
    # Accept judge_backend/judge_model so the CLI factory pattern works,
    # but they are unused — scoring is based on test results.
    self._judge_backend = judge_backend
    self._judge_model = judge_model