simpleqa_judge

SimpleQA scorer -- normalized exact match with LLM fallback.

Evaluates short factual answers by first trying a normalized exact string match; when that fails, it falls back to an LLM judge for semantic comparison.
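The two-stage flow described above can be sketched roughly as follows. This is an illustration only, not the module's implementation: `score_answer`, its local `normalize` helper, and the `judge` callable are hypothetical names, and the normalization shown (lowercasing and collapsing whitespace) is an assumption.

```python
from typing import Callable


def score_answer(
    model_answer: str,
    ground_truth: str,
    judge: Callable[[str, str], bool],
) -> bool:
    """Hypothetical two-stage scorer: cheap string check, then a judge."""

    def normalize(text: str) -> str:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    # Stage 1: normalized exact match, which needs no model call.
    if normalize(model_answer) == normalize(ground_truth):
        return True

    # Stage 2: defer to the judge (an LLM in practice) for semantic comparison.
    return judge(model_answer, ground_truth)
```

Trying the string comparison first means the judge is only invoked for answers that differ textually from the reference, which keeps the number of judge calls down.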

Classes

SimpleQAScorer

SimpleQAScorer(judge_backend: InferenceBackend, judge_model: str)

Bases: LLMJudgeScorer

SimpleQA evaluation: exact match with normalization + LLM fallback.

Source code in src/openjarvis/evals/core/scorer.py

def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None:
    self._judge_backend = judge_backend
    self._judge_model = judge_model

Functions

exact_match

exact_match(model_answer: str, ground_truth: str) -> bool

Exact-match scorer with normalization for numbers and strings.

Source code in src/openjarvis/evals/scorers/simpleqa_judge.py

def exact_match(model_answer: str, ground_truth: str) -> bool:
    """Exact-match scorer with normalization for numbers and strings."""
    if model_answer is None:
        model_answer = "None"

    if _is_float(ground_truth):
        normalized = _normalize_number_str(model_answer)
        return normalized == float(ground_truth)

    return _normalize_str(model_answer) == _normalize_str(ground_truth)
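The listing above relies on three private helpers (`_is_float`, `_normalize_number_str`, `_normalize_str`) whose bodies are not shown on this page. The sketch below fills them in with assumed behavior so that `exact_match` can be run in isolation; the real helpers may normalize differently (for example, in which symbols they strip or how they treat unparseable numbers).

```python
import re
import string


def _is_float(value) -> bool:
    # Assumption: "numeric" means float() accepts the value.
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def _normalize_number_str(text: str) -> float:
    # Assumption: strip currency, percent, and thousands separators.
    cleaned = re.sub(r"[$%,]", "", text).strip()
    try:
        return float(cleaned)
    except ValueError:
        # Assumption: an unparseable answer should never equal a real number.
        return float("inf")


def _normalize_str(text: str) -> str:
    # Assumption: drop whitespace and ASCII punctuation, then lowercase.
    no_space = re.sub(r"\s+", "", text)
    return no_space.lower().translate(str.maketrans("", "", string.punctuation))


def exact_match(model_answer: str, ground_truth: str) -> bool:
    """Same control flow as the listing above, using the assumed helpers."""
    if model_answer is None:
        model_answer = "None"

    if _is_float(ground_truth):
        return _normalize_number_str(model_answer) == float(ground_truth)

    return _normalize_str(model_answer) == _normalize_str(ground_truth)


print(exact_match("1,000", "1000"))    # numeric path
print(exact_match("Paris.", "paris"))  # string path
```

Under these assumptions, the numeric branch compares parsed floats, so `"1,000"` matches `"1000"`, while the string branch ignores case, whitespace, and punctuation. An answer with extra words, such as `"about 42"` against `"42"`, does not match and would be left to the LLM judge.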