ama_bench_judge
LLM-judge scorer for AMA-Bench agent memory assessment.
Follows the evaluation protocol from the AMA-Bench paper (Appendix C.1):
- The judge receives a (question, reference_answer, predicted_answer) triplet.
- The judge returns a binary yes/no decision.
- Both Accuracy (judge-based) and token-level F1 are reported.
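The protocol above can be illustrated with a minimal sketch. The helper names (`build_judge_prompt`, `parse_verdict`, `accuracy`) and the prompt wording are hypothetical. They are not taken from this module or from the paper, and the actual judge prompt in Appendix C.1 may differ.

```python
import re
from typing import Sequence


def build_judge_prompt(question: str, reference_answer: str, predicted_answer: str) -> str:
    """Format the triplet for the judge (hypothetical wording, not the paper's prompt)."""
    return (
        "Decide whether the predicted answer matches the reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Predicted answer: {predicted_answer}\n"
        "Reply with exactly one word: yes or no."
    )


def parse_verdict(judge_output: str) -> bool:
    """Map the judge's reply to a binary decision.

    Assumption: the first standalone 'yes' or 'no' in the reply is the verdict;
    a reply containing neither is counted as incorrect.
    """
    match = re.search(r"\b(yes|no)\b", judge_output.lower())
    return match is not None and match.group(1) == "yes"


def accuracy(verdicts: Sequence[bool]) -> float:
    """Fraction of items the judge accepted; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Treating an unparseable reply as "no" is a conservative choice made for this sketch; a real scorer might instead retry the judge or flag the item.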
Classes
AMABenchScorer
AMABenchScorer(judge_backend: InferenceBackend, judge_model: str)
Bases: LLMJudgeScorer
Score AMA-Bench QA via LLM judge + token F1.
Follows the paper's evaluation protocol: Accuracy via LLM-as-judge (Qwen3-32B recommended) plus token-level F1 as a secondary metric.
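For reference, token-level F1 is commonly computed as the harmonic mean of token precision and recall between the predicted and reference answers. The sketch below uses a hypothetical `token_f1` helper and assumes lowercasing plus whitespace tokenization; the normalization used by `AMABenchScorer` or by the paper is not stated here and may differ.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (assumed normalization: lowercase + whitespace split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, with prediction "the cat sat" and reference "the cat", precision is 2/3 and recall is 1, giving an F1 of 0.8. Unlike the judge's binary decision, this score gives partial credit for partially overlapping answers.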