hle_judge

hle_judge ¶

HLE scorer -- LLM-as-judge for Humanity's Last Exam.

Uses the same exact-match-then-LLM-fallback pattern as the reasoning judge but with an HLE-specific grading template. Adapted from IPW's evaluation handlers.

Classes¶

HLEScorer ¶

HLEScorer(judge_backend: InferenceBackend, judge_model: str)

Bases: LLMJudgeScorer

LLM-as-judge evaluation for HLE benchmark.

Fast path: normalized exact match (no API call). Slow path: LLM judge for semantic equivalence.

Source code in src/openjarvis/evals/core/scorer.py

def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None:
    self._judge_backend = judge_backend
    self._judge_model = judge_model

hle_judge

hle_judge ¶

Classes¶

HLEScorer ¶

Functions¶