terminalbench_v2_1_env

TerminalBench V2.1 task environment.

Manages the per-task Docker container and scoring lifecycle. Intended for use as a context manager by the eval runner, so that the agent has a live container to interact with through :mod:`openjarvis.tools.docker_shell_exec`.

On __enter__:

* Pulls / runs the task's docker image with sleep infinity.
* Mounts the task's tests/ directory read-only at /tests.
* Creates /logs/verifier/ for reward output.
* Binds the container name into :mod:`docker_shell_exec`'s thread-local state so the agent's docker_shell_exec tool targets this container.

On __exit__:

* Runs /tests/test.sh to produce /logs/verifier/reward.txt.
* Reads the reward, stashes it on record.metadata.
* Clears the docker_shell_exec thread-local.
* Tears down the container.

Classes

TerminalBenchV21TaskEnv

TerminalBenchV21TaskEnv(metadata: MutableMapping[str, Any])

Per-task Docker + scoring lifecycle for TerminalBench V2.1.

Source code in src/openjarvis/evals/execution/terminalbench_v2_1_env.py
def __init__(self, metadata: MutableMapping[str, Any]) -> None:
    self._metadata = metadata
    self._container: Optional[str] = None
    self._started = False