
reasoning_judge

Reasoning judge scorer -- LLM-as-judge for math and reasoning tasks.

Attempts normalized exact match first, then falls back to an LLM judge for semantic comparison. Adapted from IPW's reasoning evaluation handlers.

Classes

ReasoningJudgeScorer

ReasoningJudgeScorer(judge_backend: InferenceBackend, judge_model: str)

Bases: LLMJudgeScorer

LLM-as-judge evaluation for reasoning tasks.

Fast path: normalized exact match (no API call). Slow path: LLM judge for semantic equivalence.

Source code in src/openjarvis/evals/core/scorer.py
def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None:
    self._judge_backend = judge_backend
    self._judge_model = judge_model
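The two-stage flow described above can be sketched as plain Python. This is a minimal, runnable illustration of the control flow only: `simple_exact_match` is a stand-in for the module's `reasoning_exact_match`, and the judge is a plain callable rather than a real `InferenceBackend` call, so the names here are assumptions and not the actual implementation.

```python
def simple_exact_match(model_answer: str, ground_truth: str) -> bool:
    """Stand-in for reasoning_exact_match: lowercase and strip whitespace."""
    return model_answer.strip().lower() == ground_truth.strip().lower()


def two_stage_score(model_answer: str, ground_truth: str, judge) -> bool:
    # Fast path: cheap normalized comparison, no API call.
    if simple_exact_match(model_answer, ground_truth):
        return True
    # Slow path: defer to the (expensive) LLM judge for semantic equivalence.
    return judge(model_answer, ground_truth)
```

The key design point is cost: most correct answers match after normalization, so the judge backend is only invoked for the residue of answers that are phrased differently from the ground truth.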

Functions

reasoning_exact_match

reasoning_exact_match(model_answer: str, ground_truth: str) -> bool

Normalized exact match for reasoning answers.

Handles numbers, LaTeX boxed answers, and plain strings.

Source code in src/openjarvis/evals/scorers/reasoning_judge.py
def reasoning_exact_match(model_answer: str, ground_truth: str) -> bool:
    """Normalized exact match for reasoning answers.

    Handles numbers, LaTeX boxed answers, and plain strings.
    """
    if model_answer is None:
        return False

    # Try extracting boxed answer from model output
    boxed = _extract_boxed(model_answer)
    if boxed is not None:
        model_answer = boxed

    # Also extract boxed from ground truth if present
    gt_boxed = _extract_boxed(ground_truth)
    if gt_boxed is not None:
        ground_truth = gt_boxed

    # Numeric comparison
    if _is_float(ground_truth):
        return _normalize_number_str(model_answer) == float(ground_truth)

    return _normalize_str(model_answer) == _normalize_str(ground_truth)