Index

personal ¶

Personal benchmark system: synthesizes benchmarks from interaction traces.

Classes¶

PersonalBenchmarkDataset ¶

PersonalBenchmarkDataset(benchmark: PersonalBenchmark)

Bases: DatasetProvider

Wraps a PersonalBenchmark as a DatasetProvider for EvalRunner.

Source code in src/openjarvis/learning/optimize/personal/dataset.py

def __init__(self, benchmark: PersonalBenchmark) -> None:
    self._benchmark = benchmark
    self._records: List[EvalRecord] = []

Functions¶

load ¶

load(*, max_samples: Optional[int] = None, split: Optional[str] = None, seed: Optional[int] = None) -> None

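Convert PersonalBenchmarkSample instances to EvalRecord. In the source shown below, max_samples truncates the sample list from the front; split and seed are accepted but not used.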

Source code in src/openjarvis/learning/optimize/personal/dataset.py

def load(
    self,
    *,
    max_samples: Optional[int] = None,
    split: Optional[str] = None,
    seed: Optional[int] = None,
) -> None:
    """Convert :class:`PersonalBenchmarkSample` instances to :class:`EvalRecord`."""
    samples = self._benchmark.samples
    if max_samples is not None:
        samples = samples[:max_samples]
    self._records = [
        EvalRecord(
            record_id=s.trace_id,
            problem=s.query,
            reference=s.reference_answer,
            category=s.category,
            subject=s.agent or "general",
            metadata=s.metadata,
        )
        for s in samples
    ]
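The field mapping performed by load can be sketched in isolation. This is a minimal illustration, not the library's code: SampleStandIn and RecordStandIn are hypothetical stand-ins whose fields are copied from the signatures on this page, and to_records is a made-up helper name. The real EvalRecord may have more fields than are modeled here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SampleStandIn:
    """Hypothetical stand-in for PersonalBenchmarkSample."""
    trace_id: str
    query: str
    reference_answer: str
    agent: str = ""
    category: str = "chat"
    feedback_score: float = 0.0
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RecordStandIn:
    """Hypothetical stand-in for EvalRecord; only the fields load() sets."""
    record_id: str
    problem: str
    reference: str
    category: str
    subject: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_records(
    samples: List[SampleStandIn],
    max_samples: Optional[int] = None,
) -> List[RecordStandIn]:
    """Mirror the mapping in load(): truncate first, then convert."""
    if max_samples is not None:
        samples = samples[:max_samples]
    return [
        RecordStandIn(
            record_id=s.trace_id,
            problem=s.query,
            reference=s.reference_answer,
            category=s.category,
            # An empty agent string falls back to "general".
            subject=s.agent or "general",
            metadata=s.metadata,
        )
        for s in samples
    ]


samples = [
    SampleStandIn("t1", "What is 2 + 2?", "4"),
    SampleStandIn("t2", "Reverse a list", "xs[::-1]", agent="coder", category="code"),
    SampleStandIn("t3", "Summarize notes", "A short summary"),
]
records = to_records(samples, max_samples=2)
```

Because truncation is a plain slice, max_samples keeps the first entries in whatever order the benchmark stores them; no shuffling takes place in this sketch.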

iter_records ¶

iter_records() -> Iterable[EvalRecord]

Iterate over loaded records.

Source code in src/openjarvis/learning/optimize/personal/dataset.py

def iter_records(self) -> Iterable[EvalRecord]:
    """Iterate over loaded records."""
    return iter(self._records)

size ¶

size() -> int

Return the number of loaded records.

Source code in src/openjarvis/learning/optimize/personal/dataset.py

def size(self) -> int:
    """Return the number of loaded records."""
    return len(self._records)

PersonalBenchmarkScorer ¶

PersonalBenchmarkScorer(judge_backend: InferenceBackend, judge_model: str)

Bases: LLMJudgeScorer

Judges a candidate response against the best-known response from traces.

Source code in src/openjarvis/learning/optimize/personal/scorer.py

def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None:
    super().__init__(judge_backend, judge_model)

Functions¶

score ¶

score(record: EvalRecord, model_answer: str) -> Tuple[Optional[bool], Dict[str, Any]]

Compare model_answer against record.reference using the judge LLM.

Returns (is_correct, metadata), where is_correct indicates whether the judge rated the candidate answer at least as good as the reference, and metadata holds the raw judge output under the judge_response key.

Source code in src/openjarvis/learning/optimize/personal/scorer.py

def score(
    self,
    record: EvalRecord,
    model_answer: str,
) -> Tuple[Optional[bool], Dict[str, Any]]:
    """Compare *model_answer* against *record.reference* using the judge LLM.

    Returns ``(is_correct, metadata)`` where *is_correct* indicates whether
    the candidate answer is at least as good as the reference.
    """
    prompt = (
        "Compare these two answers to the query.\n\n"
        f"Query: {record.problem}\n\n"
        "Reference answer (known good):\n"
        f"{record.reference}\n\n"
        "Candidate answer:\n"
        f"{model_answer}\n\n"
        "Is the candidate answer at least as good as the reference? "
        'Respond with exactly "YES" or "NO" on the first line, '
        "then explain your reasoning."
    )
    response = self._ask_judge(
        prompt,
        system="You are an impartial quality judge.",
    )
    first_line = response.strip().split("\n")[0].strip().upper()
    is_correct = first_line.startswith("YES")
    return is_correct, {"judge_response": response}
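The verdict parsing at the end of score can be tried without a judge backend. The sketch below copies the two parsing lines into a standalone function; parse_verdict is a hypothetical name used only for illustration, and the sample responses are invented.

```python
def parse_verdict(response: str) -> bool:
    """Return True when the first non-blank line of the judge response starts with YES."""
    # Same steps as in score(): trim, take the first line, normalize case.
    first_line = response.strip().split("\n")[0].strip().upper()
    return first_line.startswith("YES")


accepted = parse_verdict("YES\nThe candidate covers every point in the reference.")
lowercase = parse_verdict("  yes, it is equally complete.\nDetails follow.")
rejected = parse_verdict("NO\nThe candidate omits the second step.")
buried = parse_verdict("After careful thought, the answer is YES.")
empty = parse_verdict("")
```

Two consequences of this approach are worth keeping in mind: a verdict that does not lead the first line is treated as a rejection, and the prefix match accepts any first line that begins with "YES", including longer words.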

PersonalBenchmark `dataclass` ¶

PersonalBenchmark(workflow_id: str, samples: List[PersonalBenchmarkSample] = list(), created_at: float = 0.0)

A synthesized benchmark from user interaction traces.

PersonalBenchmarkSample `dataclass` ¶

PersonalBenchmarkSample(trace_id: str, query: str, reference_answer: str, agent: str = '', category: str = 'chat', feedback_score: float = 0.0, metadata: Dict[str, Any] = dict())

A single sample in a personal benchmark.

PersonalBenchmarkSynthesizer ¶

PersonalBenchmarkSynthesizer(trace_store: TraceStore)

Mines interaction traces into a reusable personal benchmark.

Source code in src/openjarvis/learning/optimize/personal/synthesizer.py

def __init__(self, trace_store: TraceStore) -> None:
    self._store = trace_store

Functions¶

synthesize ¶

synthesize(workflow_id: str = 'default', min_feedback: float = 0.7, max_samples: int = 100) -> PersonalBenchmark

Build a personal benchmark from high-quality traces.

1. Query traces that have feedback >= min_feedback.
2. Group by query class (agent + first 50 chars of query).
3. For each class, pick the trace with the highest feedback as reference.
4. Return a PersonalBenchmark capped at max_samples.

Source code in src/openjarvis/learning/optimize/personal/synthesizer.py

def synthesize(
    self,
    workflow_id: str = "default",
    min_feedback: float = 0.7,
    max_samples: int = 100,
) -> PersonalBenchmark:
    """Build a personal benchmark from high-quality traces.

    1. Query traces that have feedback >= *min_feedback*.
    2. Group by query class (agent + first 50 chars of query).
    3. For each class, pick the trace with the highest feedback as reference.
    4. Return a :class:`PersonalBenchmark` capped at *max_samples*.
    """
    # Fetch a large pool of traces (limit high enough to cover most stores)
    all_traces = self._store.list_traces(limit=10_000)

    # Filter to traces with sufficient feedback
    qualified = [
        t
        for t in all_traces
        if t.feedback is not None and t.feedback >= min_feedback
    ]

    # Group by query class
    groups: Dict[str, list] = defaultdict(list)
    for trace in qualified:
        key = _query_class_key(trace.agent, trace.query)
        groups[key].append(trace)

    # Pick best trace per class
    samples: List[PersonalBenchmarkSample] = []
    for _key, traces in groups.items():
        best = max(traces, key=lambda t: t.feedback or 0.0)
        samples.append(
            PersonalBenchmarkSample(
                trace_id=best.trace_id,
                query=best.query,
                reference_answer=best.result,
                agent=best.agent,
                category=_infer_category(best.agent),
                feedback_score=best.feedback or 0.0,
                metadata=best.metadata,
            ),
        )

    # Sort deterministically (highest feedback first) and cap
    samples.sort(key=lambda s: (-s.feedback_score, s.trace_id))
    samples = samples[:max_samples]

    return PersonalBenchmark(
        workflow_id=workflow_id,
        samples=samples,
        created_at=time.time(),
    )
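The filter, group, pick-best, and cap steps can be exercised on in-memory data. This is a sketch under stated assumptions: TraceStandIn is a hypothetical stand-in for the trace objects returned by the store, and query_class_key is a guess at the grouping key based only on the description "agent + first 50 chars of query"; the actual _query_class_key helper is not shown on this page and may normalize its inputs differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TraceStandIn:
    """Hypothetical stand-in for a stored trace."""
    trace_id: str
    agent: str
    query: str
    result: str
    feedback: Optional[float] = None


def query_class_key(agent: str, query: str) -> str:
    """Assumed key format: agent plus the first 50 characters of the query."""
    return f"{agent}::{query[:50]}"


def pick_references(
    traces: List[TraceStandIn],
    min_feedback: float = 0.7,
    max_samples: int = 100,
) -> List[TraceStandIn]:
    # 1. Keep traces that have feedback at or above the threshold.
    qualified = [
        t for t in traces
        if t.feedback is not None and t.feedback >= min_feedback
    ]
    # 2. Group by query class.
    groups: Dict[str, List[TraceStandIn]] = defaultdict(list)
    for t in qualified:
        groups[query_class_key(t.agent, t.query)].append(t)
    # 3. Pick the highest-feedback trace in each class.
    best = [max(g, key=lambda t: t.feedback or 0.0) for g in groups.values()]
    # 4. Sort by feedback (descending), break ties on trace_id, then cap.
    best.sort(key=lambda t: (-(t.feedback or 0.0), t.trace_id))
    return best[:max_samples]


traces = [
    TraceStandIn("t1", "coder", "Write a function to reverse a list", "v1", 0.8),
    TraceStandIn("t2", "coder", "Write a function to reverse a list", "v2", 0.95),
    TraceStandIn("t3", "chat", "Summarize my meeting notes", "s1", 0.9),
    TraceStandIn("t4", "chat", "What is the weather", "w1", 0.4),
    TraceStandIn("t5", "chat", "Plan my week", "p1", None),
]
ids = [t.trace_id for t in pick_references(traces)]
top_one = [t.trace_id for t in pick_references(traces, max_samples=1)]
```

In this data, t1 and t2 share a query class, so only the higher-rated t2 survives; t4 falls below the threshold and t5 has no feedback, so neither is considered.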
