terminalbench_v2_1
TerminalBench V2.1 dataset provider.
Loads tasks from the terminal-bench-2.1 repo layout (ekellbuch/terminal-bench-2, branch terminal-bench-2.1). Each task lives in a top-level directory containing:
    <task_name>/
        task.toml        # metadata + docker image + timeouts
        instruction.md   # the agent prompt
        environment/     # Dockerfile + supporting files (pre-built into task.toml's docker_image)
        solution/        # oracle solve.sh (not used by eval)
        tests/           # test.sh + test_outputs.py (pytest) used by the verifier
Reference: https://github.com/ekellbuch/terminal-bench-2/tree/terminal-bench-2.1
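As a rough illustration of the layout above, the following sketch checks whether a directory contains the five entries listed. It is not taken from the module: `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names, and the actual provider may validate tasks differently or not at all.

```python
from pathlib import Path

# Entries named in the task layout; a trailing "/" marks a directory.
# This tuple is an assumption for illustration, not the module's own constant.
EXPECTED_ENTRIES = (
    "task.toml",
    "instruction.md",
    "environment/",
    "solution/",
    "tests/",
)


def missing_entries(task_dir):
    """Return the expected entries that are absent from task_dir."""
    root = Path(task_dir)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry.rstrip("/")
        # Directories and files are checked separately so that a file named
        # "tests" does not pass for the tests/ directory.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_entries` on a complete task directory returns an empty list; on an empty directory it returns all five entries in the order shown.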
Classes
TerminalBenchV21Dataset
TerminalBenchV21Dataset(repo_url: str = _DEFAULT_REPO, branch: str = _DEFAULT_BRANCH, path: Optional[str] = None, task_ids: Optional[List[str]] = None)
Bases: DatasetProvider
TerminalBench V2.1 dataset (89 Docker-based terminal tasks).
Source code in src/openjarvis/evals/datasets/terminalbench_v2_1.py
Functions
create_task_env
Return a per-task Docker environment (context manager).
The runner enters this around the agent call so that tools like
docker_shell_exec can target the running container.
Source code in src/openjarvis/evals/datasets/terminalbench_v2_1.py
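The context-manager pattern described for create_task_env can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `task_container` is a hypothetical helper, it assumes the image has no entrypoint that would swallow the `sleep infinity` command, and the injectable `run` parameter exists only so the sketch can be exercised without Docker installed.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def task_container(image, run=subprocess.run):
    """Start a detached container from image, yield its ID, remove it on exit.

    Hypothetical helper for illustration; the real create_task_env may
    start, configure, and tear down the container differently.
    """
    started = run(
        ["docker", "run", "-d", image, "sleep", "infinity"],
        check=True,
        capture_output=True,
        text=True,
    )
    # `docker run -d` prints the new container's ID on stdout.
    container_id = started.stdout.strip()
    try:
        yield container_id
    finally:
        # Force-remove the container even if the body raised.
        run(
            ["docker", "rm", "-f", container_id],
            check=False,
            capture_output=True,
            text=True,
        )
```

A runner following this pattern would wrap the agent call in `with task_container(image) as container_id:` so that any tool executing commands in the container has a live target for the duration of the call, and cleanup happens even when the agent fails.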
verify_requirements
Check runtime prerequisites (docker, git, tomllib).
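One way such a prerequisite check could look is sketched below. It is an assumption about the approach, not the module's code: `missing_requirements` is a hypothetical name, and the real verify_requirements may raise an error or probe the tools differently (for example, by confirming that the Docker daemon is reachable).

```python
import importlib.util
import shutil


def missing_requirements(which=shutil.which):
    """Return the names of prerequisites that cannot be found.

    Hypothetical sketch: docker and git are looked up on PATH, and
    tomllib (standard library from Python 3.11) is looked up as a module.
    """
    missing = [tool for tool in ("docker", "git") if which(tool) is None]
    if importlib.util.find_spec("tomllib") is None:
        missing.append("tomllib")
    return missing
```

An empty list means all three prerequisites were found. The `which` parameter defaults to `shutil.which` and is injectable only to make the sketch testable on machines without Docker.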