liveresearch
LiveResearchBench scorer: an LLM-as-judge for deep research quality.
Evaluates research output quality across four dimensions from the LiveResearchBench rubric: comprehensiveness, insight, instruction_following, and readability. Uses LLM-as-judge with per-task criteria when available, falling back to a generic research quality rubric.
Reference: https://github.com/Ayanami0730/deep_research_bench
Paper: https://arxiv.org/abs/2510.14240
Classes
LiveResearchBenchScorer
LiveResearchBenchScorer(judge_backend: InferenceBackend, judge_model: str)
Bases: LLMJudgeScorer
LLM-as-judge scorer for LiveResearchBench deep research tasks.
Evaluates research reports across four dimensions: comprehensiveness, insight, instruction_following, readability. Uses task-specific criteria when available from the benchmark data.
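The criteria fallback and the four-dimension scoring described above can be sketched roughly as follows. This is an illustrative sketch only, not the scorer's actual implementation: the helper names (`select_criteria`, `build_judge_prompt`, `mean_score`), the wording of the generic rubric, the per-dimension fallback, and the unweighted mean are all assumptions made for the example. Only the four dimension names come from the documentation.

```python
from typing import Dict, Optional

# The four dimension names are taken from the documentation above.
DIMENSIONS = ("comprehensiveness", "insight", "instruction_following", "readability")

# Hypothetical generic rubric; the real fallback wording is not documented here.
GENERIC_RUBRIC: Dict[str, str] = {
    "comprehensiveness": "Does the report cover the key aspects of the question?",
    "insight": "Does the report offer analysis beyond restating sources?",
    "instruction_following": "Does the report satisfy every explicit instruction?",
    "readability": "Is the report clearly organized and easy to follow?",
}


def select_criteria(task_criteria: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Prefer task-specific criteria; fall back to the generic rubric.

    Assumption: the fallback is applied per dimension, so a task that only
    defines some dimensions still gets a criterion for all four.
    """
    task_criteria = task_criteria or {}
    return {dim: task_criteria.get(dim) or GENERIC_RUBRIC[dim] for dim in DIMENSIONS}


def build_judge_prompt(question: str, report: str, criteria: Dict[str, str]) -> str:
    """Assemble a plain-text prompt asking a judge model to score each dimension."""
    lines = [
        "You are grading a research report.",
        f"Question: {question}",
        f"Report: {report}",
        "Score each dimension from 0 to 10:",
    ]
    lines += [f"- {dim}: {criteria[dim]}" for dim in DIMENSIONS]
    return "\n".join(lines)


def mean_score(scores: Dict[str, float]) -> float:
    """Unweighted mean over the four dimensions (an assumption, not the documented rule)."""
    return sum(scores[dim] for dim in DIMENSIONS) / len(DIMENSIONS)


if __name__ == "__main__":
    criteria = select_criteria({"insight": "Does the report explain why the trend occurred?"})
    print(build_judge_prompt("What drove the trend?", "(report text)", criteria))
    print(mean_score({dim: 7.0 for dim in DIMENSIONS}))
```

In this sketch the judge call itself is left out on purpose, since the `InferenceBackend` interface is not described in this section; the prompt string would be passed to whatever backend is configured, and the parsed per-dimension scores would then be aggregated.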