terminalbench_v2_1
terminalbench_v2_1
¶
TerminalBench V2.1 scorer.
Two modes:
-
Agentic mode (preferred) — the dataset's
create_task_envspun up a container, the agent interacted with it throughdocker_shell_exec, and on context exit the env rantests/test.shand wrotetbv21_rewardintorecord.metadata. The scorer just reads that value. -
One-shot mode (fallback) — no env was attached. The model answer is treated as a bash script (extracted from a
bash ...fence if present). The scorer runs the script in the task container and then the tests, same as before.
This means the same scorer supports both backend = "jarvis-direct" and
backend = "jarvis-agent" TB v2.1 configs.