liveresearch

LiveResearchBench scorer — LLM-as-judge for deep research quality.

Evaluates research output quality across four dimensions from the LiveResearchBench rubric: comprehensiveness, insight, instruction_following, and readability. Uses an LLM judge with per-task criteria when available, falling back to a generic research-quality rubric.

Reference: https://github.com/Ayanami0730/deep_research_bench
Paper: https://arxiv.org/abs/2510.14240
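
A minimal sketch of the judging flow, assuming a hypothetical judge callable and a simple JSON score format; the actual prompt construction and response parsing live in scorer.py and may differ:

import json
from typing import Callable

DIMENSIONS = ["comprehensiveness", "insight", "instruction_following", "readability"]

def score_report(report: str, criteria: str | None, judge: Callable[[str], str]) -> dict[str, float]:
    # Fall back to a generic rubric when no per-task criteria are provided.
    rubric = criteria or "Generic research quality rubric."
    prompt = (
        "Rate the report on each dimension from 1 to 10 and reply with JSON "
        f"using the keys {DIMENSIONS}.\n\nCriteria:\n{rubric}\n\nReport:\n{report}"
    )
    raw = judge(prompt)  # hypothetical: a single call to the judge LLM
    scores = json.loads(raw)
    return {dim: float(scores[dim]) for dim in DIMENSIONS}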

Classes

LiveResearchBenchScorer

LiveResearchBenchScorer(judge_backend: InferenceBackend, judge_model: str)

Bases: LLMJudgeScorer

LLM-as-judge scorer for LiveResearchBench deep research tasks.

Evaluates research reports across four dimensions: comprehensiveness, insight, instruction_following, readability. Uses task-specific criteria when available from the benchmark data.

Source code in src/openjarvis/evals/core/scorer.py
def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None:
    # Backend used to run judge-model calls.
    self._judge_backend = judge_backend
    # Identifier of the model acting as judge.
    self._judge_model = judge_model
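
A hypothetical usage sketch; the concrete InferenceBackend to construct is not shown on this page, so the backend line below is an assumption:

from openjarvis.evals.core.scorer import LiveResearchBenchScorer

backend = ...  # assumption: construct a concrete InferenceBackend here
scorer = LiveResearchBenchScorer(judge_backend=backend, judge_model="gpt-4o")
# scorer is now ready to judge research reports on the four rubric dimensions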