grpo_trainer
¶
General-purpose GRPO trainer -- Group Relative Policy Optimization.
Fine-tunes any local model by sampling N responses per prompt, computing group-relative advantages, and applying a clipped policy gradient with KL penalty vs a frozen reference model.
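To make the algorithm summary concrete, here is a minimal sketch of the two core computations: group-relative advantages (rewards normalized within each prompt's group of N samples) and the clipped objective with a KL penalty. The hyperparameter names `clip_eps` and `kl_coef` are illustrative assumptions, not documented fields of `GRPOConfig`.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of N sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_objective(ratio: float, advantage: float, kl: float,
                   clip_eps: float = 0.2, kl_coef: float = 0.04) -> float:
    """Per-token GRPO objective (to be maximized): the pessimistic minimum of
    the unclipped and clipped policy-gradient terms, minus a KL penalty
    against the frozen reference model."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped) - kl_coef * kl
```

Because advantages are normalized per group rather than by a learned value function, no critic network is needed; a group where every sample earns the same reward contributes zero gradient.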
Classes¶
RewardFn
¶
Bases: Protocol
Protocol for reward functions used by GRPOTrainer.
DefaultRewardFn
¶
Default reward function using length-normalized response quality heuristics.
Functions¶
score
¶
Score a response. Higher is better, range [0, 1].
Source code in src/openjarvis/learning/intelligence/grpo_trainer.py
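As one illustration of a length-normalized heuristic in the spirit of the description above (this is a hypothetical stand-in, not the shipped `DefaultRewardFn`):

```python
def heuristic_score(response: str, target_len: int = 200) -> float:
    """Illustrative length-normalized heuristic: reward responses near a
    target word count, clamped to [0, 1]. Empty responses score 0."""
    if not response.strip():
        return 0.0
    length = len(response.split())
    # Linear penalty for deviating from the target length, normalized by the
    # target so the result stays within [0, 1].
    return max(0.0, 1.0 - abs(length - target_len) / target_len)
```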
GRPOTrainer
¶
GRPOTrainer(config: GRPOConfig, reward_fn: RewardFn | None = None)
General-purpose GRPO trainer.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | GRPOConfig controlling model, sampling, and optimization params.<br>TYPE: `GRPOConfig` |
| `reward_fn` | Pluggable reward function. Defaults to `DefaultRewardFn`.<br>TYPE: `RewardFn \| None` |
Functions¶
train
¶
End-to-end: mine prompts from traces, then train.
| PARAMETER | DESCRIPTION |
|---|---|
| `trace_store` | Object with … |
train_on_prompts
¶
train_on_prompts(prompts: List[str], ground_truths: List[str | None] | None = None) -> Dict[str, Any]
Run GRPO training on a set of prompts.
| PARAMETER | DESCRIPTION |
|---|---|
| `prompts` | List of prompt strings to train on.<br>TYPE: `List[str]` |
| `ground_truths` | Optional parallel list of ground-truth answers for reward scoring.<br>TYPE: `List[str \| None] \| None` |