benchmark_gate
BenchmarkGate: accept or reject edits based on benchmark performance.
Runs the personal benchmark via a provided scorer callable, compares before/after snapshots, and decides accept/reject based on thresholds.
See spec §7.3.
Classes
GateResult (dataclass)
GateResult(accepted: bool, snapshot: BenchmarkSnapshot, delta: float, reason: str = '')
Result of a benchmark gate evaluation.
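As a rough illustration of the shape of this result, the sketch below mirrors the field list from the signature above. The `BenchmarkSnapshot` here is a hypothetical stand-in: its fields (`overall_score`, `cluster_scores`) are assumptions made for the example, since the real class is not documented on this page.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BenchmarkSnapshot:
    # Hypothetical stand-in: the real snapshot's fields are not shown on this page.
    overall_score: float
    cluster_scores: Dict[str, float] = field(default_factory=dict)


@dataclass
class GateResult:
    # Field names, types, and the default follow the documented signature.
    accepted: bool
    snapshot: BenchmarkSnapshot
    delta: float
    reason: str = ""


# A rejected edit: the reason string explains the decision to the caller.
rejected = GateResult(
    accepted=False,
    snapshot=BenchmarkSnapshot(overall_score=0.68),
    delta=-0.02,
    reason="overall score dropped",
)

# An accepted edit can leave `reason` at its empty default.
accepted = GateResult(
    accepted=True,
    snapshot=BenchmarkSnapshot(overall_score=0.74),
    delta=0.04,
)
```

Callers would typically branch on `accepted` and log `reason` when an edit is rejected.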
BenchmarkGate
BenchmarkGate(*, scorer: ScorerFn, benchmark_version: str, min_improvement: float = 0.0, max_regression: float = 0.05, subsample_size: int = 50)
Runs the personal benchmark and decides accept/reject.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| scorer | ScorerFn | Callable that runs the benchmark and returns a BenchmarkSnapshot. |
| benchmark_version | str | Which benchmark version to score against (locked per session). |
| min_improvement | float | Minimum overall score improvement to accept (default 0.0). |
| max_regression | float | Maximum per-cluster score drop before rejecting (default 0.05). |
| subsample_size | int | Number of tasks to score per gate run (default 50). |
Source code in src/openjarvis/learning/distillation/gate/benchmark_gate.py
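The sketch below is not the module's source. It is one plausible reading of how the two thresholds described above could combine, written as a standalone function. The function name `decide`, the use of plain score mappings in place of snapshots, and the order of the checks are all assumptions made for illustration.

```python
from typing import Mapping, Tuple


def decide(
    before_overall: float,
    after_overall: float,
    before_clusters: Mapping[str, float],
    after_clusters: Mapping[str, float],
    min_improvement: float = 0.0,
    max_regression: float = 0.05,
) -> Tuple[bool, float, str]:
    """Hypothetical decision rule; returns (accepted, delta, reason)."""
    delta = after_overall - before_overall

    # Reject if any cluster present in both snapshots dropped by more
    # than max_regression, even when the overall score improved.
    for name, before_score in before_clusters.items():
        if name not in after_clusters:
            continue
        drop = before_score - after_clusters[name]
        if drop > max_regression:
            return False, delta, f"cluster '{name}' regressed by {drop:.3f}"

    # Reject if the overall score did not improve by at least min_improvement.
    if delta < min_improvement:
        return False, delta, f"overall delta {delta:.3f} below {min_improvement:.3f}"

    return True, delta, ""


# Overall score improves and no cluster drops by more than 0.05.
ok = decide(0.70, 0.74, {"math": 0.80, "code": 0.60}, {"math": 0.78, "code": 0.70})

# Same overall improvement, but the "math" cluster drops by 0.08.
bad = decide(0.70, 0.74, {"math": 0.80, "code": 0.60}, {"math": 0.72, "code": 0.76})
```

Under this reading, `max_regression` acts as a per-cluster veto, so an edit that raises the overall score can still be rejected if it damages one cluster.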
Functions
evaluate
evaluate(*, before: BenchmarkSnapshot, session_seed: int) -> GateResult
Run the benchmark and compare against the before snapshot.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| before | BenchmarkSnapshot | Snapshot captured before the edit was applied. |
| session_seed | int | Deterministic seed for subsampling (same across all gate runs in one session). |
| RETURNS | DESCRIPTION |
|---|---|
| GateResult | Result of the benchmark gate evaluation. |
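To show why a fixed `session_seed` matters, here is a minimal sketch of seeded subsampling. How the gate actually draws its subsample is not documented on this page; the helper name `subsample_tasks`, the task ID format, and the use of `random.Random` are assumptions for illustration only.

```python
import random
from typing import List, Sequence


def subsample_tasks(
    task_ids: Sequence[str],
    session_seed: int,
    subsample_size: int = 50,
) -> List[str]:
    """Pick a reproducible subset of tasks for one gate run (hypothetical helper)."""
    # Never request more tasks than exist; random.sample would raise ValueError.
    k = min(subsample_size, len(task_ids))
    # A private Random instance keeps the draw independent of global random state.
    rng = random.Random(session_seed)
    return rng.sample(list(task_ids), k)


tasks = [f"task-{i}" for i in range(200)]

# Two gate runs in the same session share the seed, so they score the same tasks.
first_run = subsample_tasks(tasks, session_seed=1234)
second_run = subsample_tasks(tasks, session_seed=1234)
```

Scoring the same subset before and after an edit keeps the comparison like-for-like, so a score change reflects the edit and not a different draw of tasks.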