judge

judge ¶

TraceJudge -- LLM-as-judge scoring for agent traces.

Classes¶

TraceJudge ¶

TraceJudge(backend: InferenceBackend, model: str)

LLM-as-judge for scoring traces when no ground truth exists.

Given a :class:Trace, the judge constructs a prompt showing the query, agent steps, and final result, then asks an LLM to rate the quality on a 0-1 scale.

Source code in src/openjarvis/learning/optimize/feedback/judge.py

def __init__(self, backend: InferenceBackend, model: str) -> None:
    self._backend = backend
    self._model = model

Functions¶

score_trace ¶

score_trace(trace: Trace) -> Tuple[float, str]

Score a single trace.

Returns: (score, feedback) where score is in [0, 1] and feedback is the judge's textual reasoning.

Source code in src/openjarvis/learning/optimize/feedback/judge.py

def score_trace(self, trace: Trace) -> Tuple[float, str]:
    """Score a single trace.

    Returns:
        ``(score, feedback)`` where *score* is in [0, 1] and
        *feedback* is the judge's textual reasoning.
    """
    prompt = _format_trace(trace)
    response = self._backend.generate(
        prompt,
        model=self._model,
        system=_SYSTEM_PROMPT,
        temperature=0.0,
        max_tokens=1024,
    )
    score = _parse_score(response)
    return score, response

batch_evaluate ¶

batch_evaluate(traces: List[Trace]) -> List[Tuple[float, str]]

Evaluate multiple traces sequentially.

Returns a list of (score, feedback) tuples, one per trace.

Source code in src/openjarvis/learning/optimize/feedback/judge.py

def batch_evaluate(
    self,
    traces: List[Trace],
) -> List[Tuple[float, str]]:
    """Evaluate multiple traces sequentially.

    Returns a list of ``(score, feedback)`` tuples, one per trace.
    """
    results: List[Tuple[float, str]] = []
    for trace in traces:
        results.append(self.score_trace(trace))
    return results