composite_reward
Composite reward for Intelligence-edit training (paper §3.3, Eq. 1).
R(q, y) = alpha * R_acc(q, y) - beta * E_hat(q, y) - gamma * L_hat(q, y) - delta * C_hat(q, y)
The efficiency quantities (E, L, C) are z-score normalised within the evaluated benchmark before weighting, so the reward trades off dimensionless deviations rather than raw joules, seconds, or dollars (paper Appendix C.6).
Default weights (alpha, beta, gamma, delta) = (0.5, 0.1, 0.1, 0.3).
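As a worked illustration of Eq. 1, the sketch below evaluates the reward for a single candidate whose efficiency terms are assumed to be already z-scored. The function name `composite_reward` and its parameter names are hypothetical and chosen only to mirror the symbols in the equation; they are not part of the documented API.

```python
def composite_reward(
    r_acc: float,
    e_hat: float,
    l_hat: float,
    c_hat: float,
    alpha: float = 0.5,
    beta: float = 0.1,
    gamma: float = 0.1,
    delta: float = 0.3,
) -> float:
    """Evaluate Eq. 1 for one candidate (hypothetical helper).

    e_hat, l_hat and c_hat are assumed to be z-scored already, so a
    negative value means "better than average" on that axis.
    """
    return alpha * r_acc - beta * e_hat - gamma * l_hat - delta * c_hat


# A correct answer with above-average energy use, below-average latency,
# and slightly above-average cost:
# 0.5*1.0 - 0.1*1.0 - 0.1*(-1.0) - 0.3*0.5, i.e. approximately 0.35
reward = composite_reward(r_acc=1.0, e_hat=1.0, l_hat=-1.0, c_hat=0.5)
```

With the default weights, cost deviations are penalised three times as heavily as energy or latency deviations, which is why the modest cost overrun above removes more reward than the energy overrun.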
This reward is consumed only inside an Intelligence edit that triggers training (LoRA/GRPO). The held-out gate (BenchmarkGate) evaluates the resulting spec end-to-end and is not affected by these weights.
Classes
RewardWeights
dataclass
Composite-reward weights from paper Eq. 1.
TrainingSample
dataclass
One (query, response) candidate scored during Intelligence training.
All efficiency quantities are raw (un-normalised); normalisation is
applied across the batch by score_batch.
Functions
score_batch
score_batch(samples: Sequence[TrainingSample], weights: RewardWeights | None = None) -> list[float]
Score a batch of candidates with the paper's composite reward.
Energy / latency / cost are z-scored within the batch before being weighted, so the reward magnitudes are comparable across benchmarks and hardware platforms.
Args:
    samples: Candidate (query, response) pairs to score.
    weights: Composite-reward weights; uses the paper defaults if None.
Returns:
    One scalar reward per sample, in the same order.
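The behaviour described above can be approximated by a minimal, self-contained sketch. It is not the library's implementation: `WeightsSketch` and `SampleSketch` are stand-ins for `RewardWeights` and `TrainingSample`, every field name on them is hypothetical, and the choice of population standard deviation and the zero-variance fallback are assumptions the documentation does not confirm.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import Optional, Sequence


@dataclass(frozen=True)
class WeightsSketch:
    # Stand-in for RewardWeights; field names mirror the symbols in Eq. 1.
    alpha: float = 0.5
    beta: float = 0.1
    gamma: float = 0.1
    delta: float = 0.3


@dataclass(frozen=True)
class SampleSketch:
    # Stand-in for TrainingSample; all field names are hypothetical.
    accuracy: float   # R_acc, e.g. 1.0 for a correct response
    energy_j: float   # raw energy in joules
    latency_s: float  # raw latency in seconds
    cost_usd: float   # raw cost in dollars


def _zscore(values: Sequence[float]) -> list[float]:
    # Population z-score. A zero-variance column maps to all zeros
    # (an assumption made here to avoid division by zero).
    mean = fmean(values)
    std = pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def score_batch_sketch(
    samples: Sequence[SampleSketch],
    weights: Optional[WeightsSketch] = None,
) -> list[float]:
    """Score a batch with Eq. 1, z-scoring E, L and C within the batch."""
    if not samples:
        return []
    w = weights if weights is not None else WeightsSketch()
    e_hat = _zscore([s.energy_j for s in samples])
    l_hat = _zscore([s.latency_s for s in samples])
    c_hat = _zscore([s.cost_usd for s in samples])
    return [
        w.alpha * s.accuracy - w.beta * e - w.gamma * l - w.delta * c
        for s, e, l, c in zip(samples, e_hat, l_hat, c_hat)
    ]


batch = [
    SampleSketch(accuracy=1.0, energy_j=10.0, latency_s=1.0, cost_usd=0.01),
    SampleSketch(accuracy=0.0, energy_j=30.0, latency_s=3.0, cost_usd=0.03),
]
rewards = score_batch_sketch(batch)
```

In this two-sample batch each efficiency column z-scores to -1 for the first candidate and +1 for the second, so the rewards come out at roughly 1.0 and -0.5. Because normalisation is relative to the batch, the same raw measurements can receive different rewards when scored alongside different candidates.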