liveresearchbench
LiveResearchBench scorer — checklist-based LLM-as-judge scoring.
For each task, LiveResearchBench provides a list of checklist items that a good response must cover. The scorer asks an LLM judge to evaluate each checklist item against the response, producing a per-item pass/fail verdict. The final score is the fraction of items that pass (coverage).
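The coverage computation described above can be sketched roughly as follows. This is a minimal illustration, not the scorer's actual implementation: `coverage_score` and `keyword_judge` are hypothetical names, the judge is a plain callable standing in for an LLM call, and returning 0.0 for an empty checklist is an assumption.

```python
from typing import Callable, Sequence


def coverage_score(
    response: str,
    checklist: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of checklist items the judge marks as covered.

    `judge(response, item)` is a hypothetical stand-in for the LLM judge call;
    it returns True when the response covers the item.
    """
    if not checklist:
        # Assumption: an empty checklist scores 0.0 rather than raising.
        return 0.0
    passed = sum(1 for item in checklist if judge(response, item))
    return passed / len(checklist)


def keyword_judge(response: str, item: str) -> bool:
    """Toy judge for illustration only: case-insensitive substring match."""
    return item.lower() in response.lower()


report = "The report covers pricing and market size."
items = ["pricing", "market size", "regulation", "competitors"]
score = coverage_score(report, items, keyword_judge)
```

Here two of the four items are matched by the toy judge, so `score` is 0.5. A real judge would prompt a model once per checklist item and parse its verdict.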
Reference: https://arxiv.org/abs/2510.14240
Classes
LiveResearchBenchScorer
LiveResearchBenchScorer(judge_backend: InferenceBackend, judge_model: str)
Bases: LLMJudgeScorer
Checklist-based LLM-as-judge scorer for LiveResearchBench.
For each sample, the judge evaluates every checklist item against the
model's report. The score is the fraction of items covered. Tasks with a
score >= PASS_THRESHOLD (default 0.5) are marked correct.
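As a small sketch of the pass rule stated in the docstring, assuming the comparison is inclusive (`>=`) as written. The helper name `is_correct` is hypothetical and is not part of the documented class; only `PASS_THRESHOLD` and its default of 0.5 come from the text above.

```python
PASS_THRESHOLD = 0.5  # default value stated in the docstring


def is_correct(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    """Hypothetical helper: a task counts as correct when score >= threshold."""
    return score >= threshold


# A report covering exactly half of its checklist items still passes,
# because the comparison is inclusive.
half_covered = is_correct(0.5)
just_below = is_correct(0.49)
```

Under this rule a coverage of exactly 0.5 is marked correct, while anything below it is not.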