Multi-objective reward function for orchestrator training.
Adapted from IPW's reward.py. Balances accuracy, cost, energy,
latency, and power into a single scalar reward used by both the SFT
grading pipeline and the GRPO policy gradient.
Classes
RewardWeights
dataclass
RewardWeights(alpha: float = 0.4, beta_cost: float = 0.15, beta_energy: float = 0.15, gamma_latency: float = 0.15, gamma_power: float = 0.15)
Weights for the multi-objective reward function.
Each metric has its own coefficient:
- alpha: Accuracy (correctness of answer)
- beta_cost: API/cloud cost in USD
- beta_energy: Energy consumption in joules
- gamma_latency: Response time in seconds
- gamma_power: Peak power usage in watts
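For orientation, the default coefficients sum to 1.0, so the best attainable reward (a correct answer with negligible penalties) is alpha = 0.4. A minimal construction sketch, assuming the module path shown under "Source code" below; the second profile is illustrative, not a recommendation:

```python
from openjarvis.learning.orchestrator.reward import RewardWeights

weights = RewardWeights()  # defaults: alpha=0.4, the other four at 0.15 each
total = (
    weights.alpha
    + weights.beta_cost
    + weights.beta_energy
    + weights.gamma_latency
    + weights.gamma_power
)
assert abs(total - 1.0) < 1e-9  # the default coefficients sum to 1.0

# A task-specific profile that penalizes energy more heavily (illustrative values).
green_weights = RewardWeights(
    alpha=0.4, beta_cost=0.1, beta_energy=0.3, gamma_latency=0.1, gamma_power=0.1
)
```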
Normalizers
dataclass
Normalizers(energy_scale: float = 100.0, cost_scale: float = 0.1, latency_scale: float = 30.0, power_scale: float = 200.0)
Normalization constants for reward scaling.
These defaults put each metric on a similar scale; tune them for your specific tools and tasks.
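A hedged sketch of one way to tune the scales: set each near the typical magnitude of its metric so that every normalized penalty lands in roughly the same range. The numbers below are illustrative, not recommendations:

```python
from openjarvis.learning.orchestrator.reward import Normalizers

# Pick each scale near the typical magnitude of its metric so the
# normalized penalties all land in roughly the same range.
normalizers = Normalizers(
    energy_scale=50.0,   # typical episode uses ~50 J
    cost_scale=0.05,     # typical cloud call costs ~$0.05
    latency_scale=10.0,  # typical episode finishes in ~10 s
    power_scale=150.0,   # typical peak draw ~150 W
)
```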
MultiObjectiveReward
Multi-objective reward combining accuracy, cost, energy, latency, power.
Formula:

```
reward = alpha * accuracy
         - beta_cost     * (cost    / cost_scale)
         - beta_energy   * (energy  / energy_scale)
         - gamma_latency * (latency / latency_scale)
         - gamma_power   * (power   / power_scale)
```
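As a worked example, assume the default weights and normalizers and an episode that answers correctly, costs $0.02, uses 50 J, takes 6 s, and peaks at 120 W (the metric values are illustrative):

```python
# Default weights (alpha=0.4, all others 0.15) and default normalizers
# (cost_scale=0.1, energy_scale=100, latency_scale=30, power_scale=200).
reward = (
    0.4 * 1.0               # accuracy: correct answer
    - 0.15 * (0.02 / 0.1)   # cost penalty    -> -0.030
    - 0.15 * (50 / 100.0)   # energy penalty  -> -0.075
    - 0.15 * (6 / 30.0)     # latency penalty -> -0.030
    - 0.15 * (120 / 200.0)  # power penalty   -> -0.090
)
assert abs(reward - 0.175) < 1e-9
```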
Source code in src/openjarvis/learning/orchestrator/reward.py
```python
def __init__(
    self,
    weights: RewardWeights,
    normalizers: Normalizers,
) -> None:
    self.weights = weights
    self.normalizers = normalizers
```
Functions
compute
compute(episode: Episode) -> float
Compute scalar reward for an episode.
Source code in src/openjarvis/learning/orchestrator/reward.py
```python
def compute(self, episode: Episode) -> float:
    """Compute scalar reward for an episode."""
    accuracy_reward = 1.0 if episode.correct else 0.0
    cost_penalty = episode.total_cost_usd / self.normalizers.cost_scale
    energy_penalty = (
        episode.total_energy_joules / self.normalizers.energy_scale
    )
    latency_penalty = (
        episode.total_latency_seconds / self.normalizers.latency_scale
    )
    power_penalty = episode.max_power_watts / self.normalizers.power_scale
    return (
        self.weights.alpha * accuracy_reward
        - self.weights.beta_cost * cost_penalty
        - self.weights.beta_energy * energy_penalty
        - self.weights.gamma_latency * latency_penalty
        - self.weights.gamma_power * power_penalty
    )
```
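A usage sketch. The real Episode type is not documented on this page, so the example uses a SimpleNamespace stand-in that only provides the attributes compute() reads; substitute a real Episode in practice:

```python
from types import SimpleNamespace

from openjarvis.learning.orchestrator.reward import (
    MultiObjectiveReward,
    Normalizers,
    RewardWeights,
)

reward_fn = MultiObjectiveReward(weights=RewardWeights(), normalizers=Normalizers())

# Stand-in for an Episode: only the attributes compute() reads are provided.
episode = SimpleNamespace(
    correct=True,
    total_cost_usd=0.02,
    total_energy_joules=50.0,
    total_latency_seconds=6.0,
    max_power_watts=120.0,
)
print(reward_fn.compute(episode))  # -> 0.175 with the defaults, up to float rounding
```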
compute_with_breakdown
compute_with_breakdown(episode: Episode) -> Dict[str, float]
Compute reward with detailed per-component breakdown.
Source code in src/openjarvis/learning/orchestrator/reward.py
```python
def compute_with_breakdown(self, episode: Episode) -> Dict[str, float]:
    """Compute reward with detailed per-component breakdown."""
    accuracy_reward = 1.0 if episode.correct else 0.0
    cost_penalty = episode.total_cost_usd / self.normalizers.cost_scale
    energy_penalty = (
        episode.total_energy_joules / self.normalizers.energy_scale
    )
    latency_penalty = (
        episode.total_latency_seconds / self.normalizers.latency_scale
    )
    power_penalty = episode.max_power_watts / self.normalizers.power_scale

    accuracy_component = self.weights.alpha * accuracy_reward
    cost_component = -self.weights.beta_cost * cost_penalty
    energy_component = -self.weights.beta_energy * energy_penalty
    latency_component = -self.weights.gamma_latency * latency_penalty
    power_component = -self.weights.gamma_power * power_penalty

    total_reward = (
        accuracy_component
        + cost_component
        + energy_component
        + latency_component
        + power_component
    )

    ipj = episode.compute_ipj()

    return {
        "total_reward": total_reward,
        "accuracy_reward": accuracy_reward,
        "accuracy_component": accuracy_component,
        "cost_penalty": cost_penalty,
        "cost_component": cost_component,
        "energy_penalty": energy_penalty,
        "energy_component": energy_component,
        "latency_penalty": latency_penalty,
        "latency_component": latency_component,
        "power_penalty": power_penalty,
        "power_component": power_component,
        "ipj": ipj,
        "total_energy_joules": episode.total_energy_joules,
        "total_cost_usd": episode.total_cost_usd,
        "total_latency_seconds": episode.total_latency_seconds,
    }
```
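A sketch of logging the per-component breakdown, continuing the stand-in episode from the compute() example above. Note that compute_with_breakdown() also calls compute_ipj(), so the stand-in needs one; the zero-argument lambda and its value are purely illustrative:

```python
# Extend the stand-in episode with a hypothetical inferences-per-joule value.
episode.compute_ipj = lambda: 0.02

breakdown = reward_fn.compute_with_breakdown(episode)
for key in (
    "accuracy_component",
    "cost_component",
    "energy_component",
    "latency_component",
    "power_component",
):
    print(f"{key}: {breakdown[key]:+.4f}")
print("total_reward:", breakdown["total_reward"])
```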
compute_batch
compute_batch(episodes: List[Episode]) -> List[float]
Compute rewards for a batch of episodes.
Source code in src/openjarvis/learning/orchestrator/reward.py
```python
def compute_batch(self, episodes: List[Episode]) -> List[float]:
    """Compute rewards for a batch of episodes."""
    return [self.compute(ep) for ep in episodes]
```
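One way a batch of rewards might feed a group-relative baseline of the kind GRPO uses; the advantage computation below is an illustrative sketch, not the project's training loop:

```python
# `episodes` is assumed to be a list of completed Episode objects.
rewards = reward_fn.compute_batch(episodes)
baseline = sum(rewards) / len(rewards)        # group mean reward
advantages = [r - baseline for r in rewards]  # group-relative advantages (illustrative)
```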
AdaptiveRewardWeights
AdaptiveRewardWeights(initial_alpha: float = 0.6, final_alpha: float = 0.3, initial_beta_cost: float = 0.1, final_beta_cost: float = 0.15, initial_beta_energy: float = 0.1, final_beta_energy: float = 0.2, initial_gamma_latency: float = 0.1, final_gamma_latency: float = 0.15, initial_gamma_power: float = 0.1, final_gamma_power: float = 0.2, total_steps: int = 10000)
Adaptive reward weights that shift during training.
Early training focuses on accuracy (higher alpha).
Late training emphasizes efficiency (higher cost, energy, latency, and power weights).
Source code in src/openjarvis/learning/orchestrator/reward.py
```python
def __init__(
    self,
    initial_alpha: float = 0.6,
    final_alpha: float = 0.3,
    initial_beta_cost: float = 0.1,
    final_beta_cost: float = 0.15,
    initial_beta_energy: float = 0.1,
    final_beta_energy: float = 0.2,
    initial_gamma_latency: float = 0.1,
    final_gamma_latency: float = 0.15,
    initial_gamma_power: float = 0.1,
    final_gamma_power: float = 0.2,
    total_steps: int = 10000,
) -> None:
    self.initial_alpha = initial_alpha
    self.final_alpha = final_alpha
    self.initial_beta_cost = initial_beta_cost
    self.final_beta_cost = final_beta_cost
    self.initial_beta_energy = initial_beta_energy
    self.final_beta_energy = final_beta_energy
    self.initial_gamma_latency = initial_gamma_latency
    self.final_gamma_latency = final_gamma_latency
    self.initial_gamma_power = initial_gamma_power
    self.final_gamma_power = final_gamma_power
    self.total_steps = total_steps
```
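A sketch of how the schedule might be wired into a training loop: re-derive the weights each step and rebuild the reward function with them. The loop itself is hypothetical:

```python
from openjarvis.learning.orchestrator.reward import (
    AdaptiveRewardWeights,
    MultiObjectiveReward,
    Normalizers,
)

adaptive = AdaptiveRewardWeights(total_steps=10_000)
normalizers = Normalizers()

for step in range(10_000):                     # hypothetical training loop
    weights = adaptive.get_weights(step)       # accuracy-heavy early, efficiency-heavy late
    reward_fn = MultiObjectiveReward(weights=weights, normalizers=normalizers)
    # ... roll out episodes with the current policy, then e.g.:
    # rewards = reward_fn.compute_batch(episodes)
```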
Functions
get_weights
get_weights(current_step: int) -> RewardWeights
Get weights for current_step via linear interpolation.
Source code in src/openjarvis/learning/orchestrator/reward.py
```python
def get_weights(self, current_step: int) -> RewardWeights:
    """Get weights for *current_step* via linear interpolation."""
    progress = min(1.0, current_step / self.total_steps)

    alpha = self.initial_alpha + (self.final_alpha - self.initial_alpha) * progress
    beta_cost = (
        self.initial_beta_cost
        + (self.final_beta_cost - self.initial_beta_cost) * progress
    )
    beta_energy = (
        self.initial_beta_energy
        + (self.final_beta_energy - self.initial_beta_energy) * progress
    )
    gamma_latency = (
        self.initial_gamma_latency
        + (self.final_gamma_latency - self.initial_gamma_latency) * progress
    )
    gamma_power = (
        self.initial_gamma_power
        + (self.final_gamma_power - self.initial_gamma_power) * progress
    )

    # Normalize to sum to 1.0
    total = alpha + beta_cost + beta_energy + gamma_latency + gamma_power
    return RewardWeights(
        alpha=alpha / total,
        beta_cost=beta_cost / total,
        beta_energy=beta_energy / total,
        gamma_latency=gamma_latency / total,
        gamma_power=gamma_power / total,
    )
```
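A worked check of the interpolation at the halfway point. With the default endpoints the weights already sum to 1.0 at every step, so the final normalization is a no-op here; it only matters for custom endpoints:

```python
adaptive = AdaptiveRewardWeights(total_steps=10_000)
w = adaptive.get_weights(5_000)  # progress = 0.5

# alpha         = 0.6 + (0.3  - 0.6) * 0.5 = 0.45
# beta_cost     = 0.1 + (0.15 - 0.1) * 0.5 = 0.125
# beta_energy   = 0.1 + (0.2  - 0.1) * 0.5 = 0.15
# gamma_latency = 0.125, gamma_power = 0.15  -> already sum to 1.0
assert abs(w.alpha - 0.45) < 1e-9
assert abs(w.beta_energy - 0.15) < 1e-9
```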