grpo_policy¶
GRPO router — Group Relative Policy Optimization for query→model routing.
Classes¶
GRPOSample
dataclass
¶
A single sample in a GRPO group.
GRPOState
dataclass
¶
GRPOState(weights: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float)), sample_counts: Dict[str, int] = defaultdict(int), total_updates: int = 0)
Persistent state for GRPO policy weights.
GRPORouterPolicy
¶
GRPORouterPolicy(*, learning_rate: float = 0.1, min_samples: int = 5, group_size: int = 4, temperature: float = 1.0)
Group Relative Policy Optimization for routing queries to models.
Groups samples by query_class, computes each sample's relative advantage within its group as (reward - mean_reward) / std_reward, and updates policy weights via a softmax gradient step.
Falls back to random selection while a query class has fewer than min_samples samples.
Source code in src/openjarvis/learning/grpo_policy.py
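The group-relative advantage described above can be sketched as follows. This is a minimal illustration of the (reward - mean) / std normalization, not the actual implementation in grpo_policy.py; the function name is hypothetical.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one group: (r - mean) / std.

    If the std is zero (all rewards in the group are equal),
    every advantage is zero and no weight would move.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Because the advantage is relative to the group mean, only models that beat (or trail) their peers on the same query class get their weights moved.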
Functions¶
route
¶
route(context: RoutingContext, models: List[str]) -> str
Select the best model for the given routing context.
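Selection in route plausibly samples from a temperature-scaled softmax over the learned per-model weights. The sketch below shows that mechanism under stated assumptions (flat weights dict, injectable rng); the real method reads from GRPOState and may differ.

```python
import math
import random

def softmax_select(weights, models, temperature=1.0, rng=random.random):
    """Sample a model from a softmax over per-model weights.

    weights: dict mapping model name -> learned weight (missing = 0.0).
    temperature: higher values flatten the distribution toward uniform.
    rng: callable returning a float in [0, 1); injectable for testing.
    """
    logits = [weights.get(m, 0.0) / temperature for m in models]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng()
    cum = 0.0
    for model, p in zip(models, probs):
        cum += p
        if r <= cum:
            return model
    return models[-1]  # guard against floating-point rounding
```

With temperature near zero this approaches greedy argmax; with a large temperature it approaches the uniform random fallback used before enough samples accumulate.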
add_sample
¶
Add a training sample to the buffer.
update
¶
Run GRPO update on accumulated samples.
Groups samples by query_class, computes relative advantages, and updates policy weights.
Returns stats about the update.
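An update pass of this shape can be sketched end to end: group by query_class, skip groups under min_samples, normalize rewards within each group, then nudge weights by learning_rate × advantage. The sample tuple layout, stats keys, and function name are assumptions for illustration only.

```python
from collections import defaultdict

def grpo_update(samples, weights, learning_rate=0.1, min_samples=5):
    """Illustrative GRPO-style update pass.

    samples: iterable of (query_class, model, reward) tuples.
    weights: nested dict weights[query_class][model] -> float, mutated in place.
    Groups with fewer than min_samples members are skipped (the policy
    keeps falling back to random routing for those query classes).
    Returns simple stats about the pass.
    """
    groups = defaultdict(list)
    for query_class, model, reward in samples:
        groups[query_class].append((model, reward))

    groups_updated = 0
    for query_class, members in groups.items():
        if len(members) < min_samples:
            continue
        rewards = [r for _, r in members]
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        for model, reward in members:
            advantage = (reward - mean) / std if std > 0 else 0.0
            weights.setdefault(query_class, {})
            weights[query_class][model] = (
                weights[query_class].get(model, 0.0) + learning_rate * advantage
            )
        groups_updated += 1

    return {"groups_updated": groups_updated, "samples_seen": len(list(samples))}
```

Note the gradient step is linear in the advantage, so above-average models gain weight and below-average models lose it symmetrically within each group.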
Functions¶
ensure_registered
¶
Register GRPORouterPolicy if not already present.