terminalbench_native
terminalbench_native
¶
Native TerminalBench V2.1 backend.
Uses Harness for Docker-based execution and scoring.
Classes¶
TerminalBenchNativeBackend
¶
TerminalBenchNativeBackend(model: str = 'openai/default', api_base: str = 'http://localhost:8000/v1', temperature: float = 0.2, agent_name: str = 'naive', output_dir: str = 'results/terminalbench/', max_samples: Optional[int] = None, dataset_name: str = 'terminal-bench-core', dataset_version: str = '0.1.1', system_prompt: str = '', max_tokens: int = 16384, n_concurrent: int = 4, global_agent_timeout_sec: Optional[float] = 1800.0, global_timeout_multiplier: Optional[float] = None)
Bases: InferenceBackend
Runs terminal-bench tasks natively via Harness with Docker execution.
Uses terminal-bench's own agent + LiteLLM to call the model, Docker containers for task execution, and built-in test scripts for scoring. This gives real agentic evaluation, not text-only.
Args of note:
global_agent_timeout_sec: Hard wall-clock bound for each trial's
agent phase. terminal-bench runs installed-agent SETUP inside
this same budget with an infinite tmux timeout, so this bounds
SETUP+RUN together (a setup-only timeout needs an upstream
terminal-bench change). When set, it REPLACES each task's own
max_agent_timeout_sec. Set None or 0 to fall back
to per-task budgets.
global_timeout_multiplier: Scales per-task budgets when
global_agent_timeout_sec is not set. None keeps
terminal-bench's default (1.0).
Source code in src/openjarvis/evals/backends/terminalbench_native.py
Functions¶
run_harness
¶
Run the full terminal-bench harness and return results.
Source code in src/openjarvis/evals/backends/terminalbench_native.py
Functions¶
summarize_benchmark_results
¶
summarize_benchmark_results(results: Any, *, model: str, benchmark: str = 'terminalbench-native') -> Tuple[RunSummary, List[Dict[str, str]]]
Convert terminal-bench BenchmarkResults into a RunSummary.
Trials are classified into three buckets:
- resolved:
is_resolvedis True -> counted correct. - model miss: unresolved, but the model was actually contacted -> counted in the accuracy denominator.
- harness/infra failure: excluded from the accuracy denominator and
reported in
RunSummary.errorsplus the returned failure list.
Zero-model-contact signal choice: terminal-bench 0.2.18 leaves
failure_mode UNSET both on clean success and on genuine unresolved
misses, so failure_mode cannot distinguish "the model tried and failed"
from "the agent never called the model". Token usage can: this backend
always runs terminus-2, which reports real LiteLLM usage, so an
unresolved trial with zero/missing input+output tokens means no model
request ever completed — an infrastructure failure (in-container setup
hang/death, tmux failure), not a model miss. CAVEAT: terminal-bench
"installed agents" (openhands, claude-code, ...) hardcode 0 tokens even
on success; if this backend ever honors agent_name for installed
agents, this heuristic must be gated on the agent type.