toolcall15

ToolCall-15 scorer — deterministic tool-calling evaluation.

Scores each of the 15 scenarios based on whether the model called the correct tool(s) with correct arguments, following the scoring rubric defined in the benchmark's METHODOLOGY.md.

Scoring: 0 (fail), 1 (partial), or 2 (full pass) per scenario. is_correct = True when score == 2 (full pass).
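
The 0/1/2 scale and the is_correct rule can be illustrated with a minimal sketch. The ToolCall type, the score_scenario and is_correct helpers, and in particular the partial-credit rule (right tool, wrong arguments) are assumptions made for illustration; the actual rubric lives in the benchmark's METHODOLOGY.md and may differ per scenario.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one tool call (not the benchmark's own type)."""

    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def score_scenario(expected: ToolCall, actual: ToolCall | None) -> int:
    """Return 0, 1, or 2 under an ASSUMED partial-credit rule."""
    if actual is None or actual.name != expected.name:
        return 0  # fail: no call was made, or the wrong tool was called
    if actual.arguments == expected.arguments:
        return 2  # full pass: correct tool with correct arguments
    return 1  # partial (assumption): correct tool, incorrect arguments


def is_correct(score: int) -> bool:
    # Only a full pass (score == 2) counts as correct.
    return score == 2
```

Because the comparison is plain equality on names and arguments, the same inputs always produce the same score, which is what makes this style of evaluation deterministic.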

Reference: https://github.com/stevibe/ToolCall-15

Classes

ToolCall15Scorer

ToolCall15Scorer(judge_backend: InferenceBackend, judge_model: str)

Bases: LLMJudgeScorer

Deterministic scorer for ToolCall-15 benchmark.

Scores each scenario based on whether the model called the correct tool(s) with correct arguments. No LLM judge is needed because scoring is fully deterministic; the class extends LLMJudgeScorer only to satisfy the _build_scorer interface, so the judge_backend and judge_model arguments are accepted but play no part in scoring.

Scoring: 0 (fail), 1 (partial), 2 (full pass). is_correct = True when score == 2.
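
The design described above, a deterministic scorer that inherits from a judge-based base class purely for constructor compatibility, might look roughly like the following sketch. JudgeScorerBase, DeterministicScorer, build_scorer, and the score method are hypothetical stand-ins; the real LLMJudgeScorer and _build_scorer signatures are not shown in this page and may differ.

```python
from __future__ import annotations

from typing import Any


class JudgeScorerBase:
    """Hypothetical stand-in for LLMJudgeScorer; the real base may differ."""

    def __init__(self, judge_backend: Any, judge_model: str) -> None:
        self._judge_backend = judge_backend
        self._judge_model = judge_model


class DeterministicScorer(JudgeScorerBase):
    """Accepts judge arguments for compatibility but never calls the judge."""

    def score(self, expected_tool: str, called_tool: str | None) -> tuple[bool, int]:
        # Simplified assumption: this sketch awards only 0 or 2 points.
        points = 2 if called_tool == expected_tool else 0
        return points == 2, points


def build_scorer(scorer_cls: type[JudgeScorerBase], judge_backend: Any, judge_model: str) -> JudgeScorerBase:
    """Hypothetical factory that constructs every scorer the same way."""
    return scorer_cls(judge_backend, judge_model)
```

With a shared constructor signature, a factory of this kind can build judge-based and deterministic scorers through one code path, at the cost of passing arguments the deterministic scorer ignores.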

Source code in src/openjarvis/evals/core/scorer.py
def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None:
    # Stored to match the LLMJudgeScorer constructor signature.
    self._judge_backend = judge_backend
    self._judge_model = judge_model