terminalbench_judge
TerminalBench scorer: an LLM-as-judge scorer for terminal task evaluation.
It uses an LLM judge to compare the predicted terminal output or commands against the expected answer.
Classes
TerminalBenchScorer
TerminalBenchScorer(judge_backend: InferenceBackend, judge_model: str)
Bases: LLMJudgeScorer
LLM-as-judge evaluation for TerminalBench terminal tasks.
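The constructor signature above suggests a scorer that holds a judge backend and a judge model name, then asks that model to grade a prediction against the expected answer. The following is a minimal, self-contained sketch of that pattern, not the library's implementation. Everything except the two constructor parameter names is an assumption made for illustration: `FakeJudgeBackend`, its `generate` method, `SketchTerminalBenchScorer`, its `score` method, the prompt wording, and the `CORRECT`/`INCORRECT` reply format are all hypothetical.

```python
class FakeJudgeBackend:
    """Hypothetical stand-in for an InferenceBackend; returns a canned verdict."""

    def __init__(self, verdict):
        self.verdict = verdict
        self.prompts = []  # records every prompt sent to the "judge"

    def generate(self, model, prompt):
        # Assumed method name; the real backend interface may differ.
        self.prompts.append(prompt)
        return self.verdict


class SketchTerminalBenchScorer:
    """Illustrative scorer mirroring the documented constructor parameters."""

    def __init__(self, judge_backend, judge_model):
        self.judge_backend = judge_backend
        self.judge_model = judge_model

    def score(self, predicted, expected):
        # Build a grading prompt (wording is an assumption, not the real prompt).
        prompt = (
            "You are grading a terminal task.\n"
            f"Expected answer:\n{expected}\n\n"
            f"Predicted output or commands:\n{predicted}\n\n"
            "Reply with exactly CORRECT or INCORRECT."
        )
        reply = self.judge_backend.generate(self.judge_model, prompt)
        # Exact comparison, because "INCORRECT" contains "CORRECT" as a substring.
        return reply.strip().upper() == "CORRECT"


backend = FakeJudgeBackend("CORRECT")
scorer = SketchTerminalBenchScorer(
    judge_backend=backend,
    judge_model="example-judge-model",  # placeholder model name
)
result = scorer.score(predicted="ls -la", expected="ls -la")
```

Passing the backend in through the constructor keeps the judge swappable, so a canned backend like the one above can stand in for a real model during tests.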