toolcall15
ToolCall-15 scorer — deterministic tool-calling evaluation.
Scores each of the 15 scenarios based on whether the model called the correct tool(s) with correct arguments, following the scoring rubric defined in the benchmark's METHODOLOGY.md.
Scoring: 0 (fail), 1 (partial), or 2 (full pass) per scenario. is_correct = True when score == 2 (full pass).
Reference: https://github.com/stevibe/ToolCall-15
Classes
ToolCall15Scorer
ToolCall15Scorer(judge_backend: InferenceBackend, judge_model: str)
Bases: LLMJudgeScorer
Deterministic scorer for ToolCall-15 benchmark.
Scores each scenario based on whether the model called the correct tool(s) with correct arguments. No LLM judge is needed because scoring is fully deterministic; the class extends LLMJudgeScorer only to satisfy the _build_scorer interface.
Scoring: 0 (fail), 1 (partial), 2 (full pass). is_correct = True when score == 2.
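The 0/1/2 scale and its mapping to is_correct can be sketched as below. This is a minimal illustration, not the scorer's actual code: the ToolCall dataclass, the score_scenario and is_correct helpers, and the "same tools but wrong arguments earns partial credit" rule are all assumptions made for the example. The real per-scenario criteria are those defined in the benchmark's METHODOLOGY.md.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation: name plus arguments."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def score_scenario(expected: List[ToolCall], actual: List[ToolCall]) -> int:
    """Return 2 (full pass), 1 (partial), or 0 (fail).

    Assumed rule for illustration only: an exact match of tools and
    arguments is a full pass; the right tools in the right order with
    different arguments is partial; anything else fails.
    """
    if actual == expected:
        return 2
    if [call.name for call in actual] == [call.name for call in expected]:
        return 1
    return 0


def is_correct(score: int) -> bool:
    """Only a full pass (score == 2) counts as correct."""
    return score == 2
```

Because the comparison uses plain equality on names and argument dictionaries, the same inputs always produce the same score, which is the sense in which this kind of scoring is deterministic.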