ama_bench_judge

LLM-judge scorer for AMA-Bench agent memory assessment.

Follows the evaluation protocol from the AMA-Bench paper (Appendix C.1):

- The judge receives a (question, reference_answer, predicted_answer) triplet
- It returns a binary yes/no decision
- Both judge-based Accuracy and token-level F1 are reported
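Token-level F1 is typically computed SQuAD-style over whitespace tokens. Below is a minimal sketch of that computation; the exact tokenization and normalization used by the AMA-Bench protocol may differ, so treat this as illustrative only:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokenization; the benchmark's actual
    normalization rules (punctuation stripping, article removal) may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, F1 is 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat", "the cat sat")` yields precision 1.0 and recall 2/3, giving F1 = 0.8.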

Classes

AMABenchScorer

AMABenchScorer(judge_backend: InferenceBackend, judge_model: str)

Bases: LLMJudgeScorer

Score AMA-Bench QA via LLM judge + token F1.

Follows the paper's evaluation protocol: Accuracy via LLM-as-judge (Qwen3-32B recommended) plus token-level F1 as a secondary metric.

Source code in src/openjarvis/evals/core/scorer.py
def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None:
    # Backend used to run judge inference, and the identifier of the
    # judge model to query (the paper recommends Qwen3-32B).
    self._judge_backend = judge_backend
    self._judge_model = judge_model
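The judge call itself is not shown in this excerpt. As a rough sketch of the protocol, the triplet can be formatted into a yes/no prompt and the model's free-text reply mapped to a binary decision. The prompt wording and helper names below are assumptions for illustration, not the scorer's actual API:

```python
# Hypothetical prompt template; AMABenchScorer's real wording is not shown
# in this excerpt.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Predicted answer: {prediction}\n"
    "Is the predicted answer correct with respect to the reference? "
    "Answer with 'yes' or 'no' only."
)

def build_judge_prompt(question: str, reference: str, prediction: str) -> str:
    """Format the (question, reference_answer, predicted_answer) triplet."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction
    )

def parse_judge_verdict(raw: str) -> bool:
    """Map the judge's free-text output to a binary correct/incorrect decision.

    Assumption: any response beginning with "yes" (case-insensitive)
    counts as correct.
    """
    return raw.strip().lower().startswith("yes")
```

Accuracy is then the fraction of triplets for which the parsed verdict is true.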