personal
Personal benchmark system -- synthesize benchmarks from interaction traces.
Classes
PersonalBenchmarkDataset
PersonalBenchmarkDataset(benchmark: PersonalBenchmark)
Bases: DatasetProvider
Wraps a PersonalBenchmark as a DatasetProvider for EvalRunner.
Source code in src/openjarvis/learning/optimize/personal/dataset.py
Functions
load
load(*, max_samples: Optional[int] = None, split: Optional[str] = None, seed: Optional[int] = None) -> None
Convert PersonalBenchmarkSample instances to EvalRecord instances.
Source code in src/openjarvis/learning/optimize/personal/dataset.py
iter_records
iter_records() -> Iterable[EvalRecord]
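The load / iter_records pair can be pictured with the minimal sketch below. It is not the library's implementation: SampleStub, RecordStub, and DatasetSketch are hypothetical stand-ins, the RecordStub field names other than reference are assumptions, and the way seed (shuffle) and split (ignored) are handled here is a guess at reasonable behavior rather than documented fact.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class SampleStub:
    """Hypothetical stand-in for PersonalBenchmarkSample (subset of fields)."""
    trace_id: str
    query: str
    reference_answer: str


@dataclass
class RecordStub:
    """Hypothetical stand-in for EvalRecord; field names are assumptions."""
    record_id: str
    problem: str
    reference: str


class DatasetSketch:
    """Illustrative wrapper: load() materializes records, iter_records() yields them."""

    def __init__(self, samples: List[SampleStub]) -> None:
        self._samples = samples
        self._records: List[RecordStub] = []

    def load(
        self,
        *,
        max_samples: Optional[int] = None,
        split: Optional[str] = None,  # accepted but ignored in this sketch
        seed: Optional[int] = None,
    ) -> None:
        samples = list(self._samples)
        if seed is not None:
            # Assumption: a seed means "shuffle deterministically".
            random.Random(seed).shuffle(samples)
        if max_samples is not None:
            samples = samples[:max_samples]
        self._records = [
            RecordStub(s.trace_id, s.query, s.reference_answer) for s in samples
        ]

    def iter_records(self) -> Iterable[RecordStub]:
        return iter(self._records)


ds = DatasetSketch(
    [
        SampleStub("t1", "How do I rebase a branch?", "Use git rebase."),
        SampleStub("t2", "What is a monad?", "A monoid in the category of endofunctors."),
        SampleStub("t3", "Summarize my notes.", "Three key points."),
    ]
)
ds.load(max_samples=2)
records = list(ds.iter_records())
```

Because load returns None, callers read the converted records through iter_records afterwards.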
PersonalBenchmarkScorer
PersonalBenchmarkScorer(judge_backend: InferenceBackend, judge_model: str)
Bases: LLMJudgeScorer
Judges a candidate response against the best-known response from traces.
Source code in src/openjarvis/learning/optimize/personal/scorer.py
Functions
score
score(record: EvalRecord, model_answer: str) -> Tuple[Optional[bool], Dict[str, Any]]
Compare model_answer against record.reference using the judge LLM.
Returns (is_correct, metadata) where is_correct indicates whether
the candidate answer is at least as good as the reference.
Source code in src/openjarvis/learning/optimize/personal/scorer.py
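The (is_correct, metadata) return shape of score can be illustrated with a small sketch. Everything here is hypothetical: sketch_score, the judge callable, the prompt wording, and the YES/NO parsing are assumptions made for illustration, as is the reading that None means the judge's verdict could not be determined.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def sketch_score(
    reference: str,
    model_answer: str,
    judge: Callable[[str], str],  # hypothetical: prompt in, raw judge text out
) -> Tuple[Optional[bool], Dict[str, Any]]:
    prompt = (
        "Is the candidate answer at least as good as the reference?\n"
        f"Reference: {reference}\n"
        f"Candidate: {model_answer}\n"
        "Reply with YES or NO."
    )
    verdict = judge(prompt).strip().upper()
    if verdict.startswith("YES"):
        return True, {"raw_verdict": verdict}
    if verdict.startswith("NO"):
        return False, {"raw_verdict": verdict}
    # Assumption: an unparseable verdict is reported as None, not guessed.
    return None, {"raw_verdict": verdict}


# A stub judge that always approves, standing in for a real LLM call.
is_correct, meta = sketch_score("Use git rebase.", "Run git rebase main.", lambda p: "yes")
```

Returning None instead of False for an unreadable verdict keeps judge failures from being counted as wrong answers.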
PersonalBenchmark
dataclass
PersonalBenchmark(workflow_id: str, samples: List[PersonalBenchmarkSample] = list(), created_at: float = 0.0)
A synthesized benchmark from user interaction traces.
PersonalBenchmarkSample
dataclass
PersonalBenchmarkSample(trace_id: str, query: str, reference_answer: str, agent: str = '', category: str = 'chat', feedback_score: float = 0.0, metadata: Dict[str, Any] = dict())
A single sample in a personal benchmark.
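The two dataclass signatures above can be mirrored in a short, self-contained sketch. The classes are redeclared locally only so the example runs on its own; the use of field(default_factory=...) is an assumption inferred from the rendered list() and dict() defaults, and the example values, including time.time() for created_at, are invented for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PersonalBenchmarkSample:
    trace_id: str
    query: str
    reference_answer: str
    agent: str = ""
    category: str = "chat"
    feedback_score: float = 0.0
    # Assumed default_factory so each sample gets its own dict.
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PersonalBenchmark:
    workflow_id: str
    # Assumed default_factory so each benchmark gets its own list.
    samples: List[PersonalBenchmarkSample] = field(default_factory=list)
    created_at: float = 0.0


sample = PersonalBenchmarkSample(
    trace_id="t1",
    query="How do I rebase a branch?",
    reference_answer="Use git rebase.",
    feedback_score=0.9,
)
bench = PersonalBenchmark(workflow_id="default", samples=[sample], created_at=time.time())
```

Only trace_id, query, and reference_answer are required for a sample; the remaining fields fall back to the defaults shown in the signature.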
PersonalBenchmarkSynthesizer
PersonalBenchmarkSynthesizer(trace_store: TraceStore)
Mines interaction traces into a reusable personal benchmark.
Source code in src/openjarvis/learning/optimize/personal/synthesizer.py
Functions
synthesize
synthesize(workflow_id: str = 'default', min_feedback: float = 0.7, max_samples: int = 100) -> PersonalBenchmark
Build a personal benchmark from high-quality traces.
- Query traces that have feedback >= min_feedback.
- Group by query class (agent + first 50 chars of query).
- For each class, pick the trace with the highest feedback as reference.
- Return a PersonalBenchmark capped at max_samples.
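The four steps above can be sketched as follows. This is a minimal illustration, not the code in synthesizer.py: Trace is a hypothetical stand-in for whatever TraceStore returns, sketch_synthesize returns the selected traces rather than a PersonalBenchmark, and two details are assumptions (ties keep the first trace seen, and the cap keeps groups in first-seen order).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trace:
    """Hypothetical stand-in for a stored trace; not the real TraceStore row."""
    trace_id: str
    agent: str
    query: str
    response: str
    feedback: float


def sketch_synthesize(
    traces: List[Trace], min_feedback: float = 0.7, max_samples: int = 100
) -> List[Trace]:
    # Step 1: keep only traces whose feedback meets the threshold.
    kept = [t for t in traces if t.feedback >= min_feedback]

    # Steps 2 and 3: group by (agent, first 50 chars of query) and keep
    # the highest-feedback trace in each group as the reference.
    best: Dict[Tuple[str, str], Trace] = {}
    for t in kept:
        key = (t.agent, t.query[:50])
        if key not in best or t.feedback > best[key].feedback:
            best[key] = t

    # Step 4: cap the result at max_samples (first-seen group order).
    return list(best.values())[:max_samples]


traces = [
    Trace("t1", "chat", "How do I rebase a branch?", "Use git rebase.", 0.9),
    Trace("t2", "chat", "How do I rebase a branch?", "Try git merge.", 0.75),
    Trace("t3", "chat", "What is a monad?", "Not sure.", 0.4),
    Trace("t4", "code", "How do I rebase a branch?", "git rebase main", 0.8),
]
picked = sketch_synthesize(traces)
print([t.trace_id for t in picked])  # ['t1', 't4']
```

Here t3 is dropped by the feedback threshold, t2 loses to t1 within the same query class, and t4 survives because a different agent puts it in a different class.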