terminalbench_v2_1


TerminalBench V2.1 scorer.

Two modes:

Agentic mode (preferred): the dataset's create_task_env spun up a container, the agent interacted with it through docker_shell_exec, and on context exit the env ran tests/test.sh and wrote tbv21_reward into record.metadata. The scorer just reads that value.
One-shot mode (fallback): no env was attached. The model answer is treated as a bash script (extracted from a fenced bash code block if one is present). The scorer runs the script in the task container and then runs the tests, same as before.
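
As an illustration of the one-shot extraction step, the following is a minimal sketch of pulling a bash script out of a model answer. It is not the scorer's actual code: the function name extract_bash_script, the accepted fence tags, and the fall-back to the whole answer when no fence is found are all assumptions made for this example.

```python
import re

# Three backticks, built programmatically so this example is easy to embed.
FENCE = "`" * 3

# Assumption: the fence may be tagged bash, sh, or shell, or carry no tag.
_FENCE_RE = re.compile(
    FENCE + r"(?:bash|sh|shell)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_bash_script(answer: str) -> str:
    """Return the body of the first matching fence, else the whole answer.

    Hypothetical helper: the real scorer may extract scripts differently.
    """
    match = _FENCE_RE.search(answer)
    if match:
        return match.group(1).strip()
    # No fence found: treat the entire answer as the script.
    return answer.strip()
```

For example, an answer that wraps `echo hi` in a bash fence between two sentences of prose would yield just `echo hi`, while an answer with no fence is passed through after trimming whitespace.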

This means the same scorer supports both backend = "jarvis-direct" and backend = "jarvis-agent" TB v2.1 configs.
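
The choice between the two modes can be sketched as a single lookup with a fallback. This is a hedged illustration rather than the scorer's implementation: resolve_reward and the run_one_shot callback are hypothetical names, and the sketch assumes that tbv21_reward is stored as a number and that its absence from the metadata signals one-shot mode.

```python
from typing import Any, Callable, Mapping

# Metadata key named in the module description above.
REWARD_KEY = "tbv21_reward"


def resolve_reward(
    metadata: Mapping[str, Any],
    model_answer: str,
    run_one_shot: Callable[[str], bool],
) -> float:
    """Return 1.0 if the task's tests passed, 0.0 otherwise.

    Hypothetical helper. `run_one_shot` stands in for "run the script in
    the task container, then run the tests" and returns True on success.
    """
    if REWARD_KEY in metadata:
        # Agentic mode: the env already ran the tests; just read the value.
        # Assumption: the stored value is numeric (0 or 1).
        return float(metadata[REWARD_KEY])
    # One-shot mode: no env was attached, so execute the answer now.
    return 1.0 if run_one_shot(model_answer) else 0.0
```

Because the fallback is only invoked when the key is missing, the same function serves both configurations without the caller having to know which backend produced the record.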

Classes

TerminalBenchV21Scorer

TerminalBenchV21Scorer(judge_backend=None, judge_model: str = '')

Bases: Scorer

Reward = 1 if the task's tests pass, 0 otherwise.

Source code in src/openjarvis/evals/scorers/terminalbench_v2_1.py

def __init__(
    self,
    judge_backend=None,
    judge_model: str = "",
) -> None:
    self._judge_backend = judge_backend
    self._judge_model = judge_model