ama_bench
AMA-Bench dataset loader.
Reference dataset: https://huggingface.co/datasets/AMA-bench/AMA-bench
Paper: https://arxiv.org/abs/2602.22769
This implementation follows the published schema with fields like:
- episode_id
- task / task_type / domain / source / success / num_turns / total_tokens
- trajectory: list[{turn_idx, action, observation}]
- qa_pairs: list[{question, answer, question_uuid, type}]
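As a rough illustration of the schema above, the fields can be written down as typed dictionaries. This is only a sketch: the field names come from the list above, but the Python types (for example `str` for `episode_id`, `bool` for `success`), the class names `Turn`, `QAPair` and `Episode`, the helper `missing_fields`, and the sample values are all assumptions made for illustration, not part of the loader.

```python
from typing import List, TypedDict


class Turn(TypedDict):
    turn_idx: int       # assumed to be an integer position in the trajectory
    action: str
    observation: str


class QAPair(TypedDict):
    question: str
    answer: str
    question_uuid: str
    type: str


class Episode(TypedDict):
    episode_id: str     # type assumed; the schema only names the field
    task: str
    task_type: str
    domain: str
    source: str
    success: bool       # type assumed
    num_turns: int
    total_tokens: int
    trajectory: List[Turn]
    qa_pairs: List[QAPair]


def missing_fields(record: dict) -> List[str]:
    """Return the top-level schema fields absent from a raw record (hypothetical helper)."""
    return sorted(set(Episode.__annotations__) - set(record))


# Hand-written sample values, not taken from the dataset.
example: Episode = {
    "episode_id": "ep-0",
    "task": "find the red key",
    "task_type": "navigation",
    "domain": "gridworld",
    "source": "synthetic",
    "success": True,
    "num_turns": 1,
    "total_tokens": 12,
    "trajectory": [{"turn_idx": 0, "action": "look", "observation": "a red key"}],
    "qa_pairs": [
        {
            "question": "What colour was the key?",
            "answer": "red",
            "question_uuid": "q-0",
            "type": "recall",
        }
    ],
}
```

A check such as `missing_fields(raw_record)` can be run on a freshly downloaded record to spot schema drift before building evaluation records.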
The evaluation protocol follows the paper's long-context baseline: the trajectory is packed into the model input, with space reserved for the question and answer. When a trajectory exceeds the token budget, it is truncated from the middle: the first half of the budget is filled from the start of the trajectory and the second half from its end (matching Appendix B of the paper).
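The head-and-tail truncation described above can be sketched as follows. This is a minimal illustration, not the loader's actual code: the function names `trajectory_budget` and `truncate_middle`, the parameters `context_window`, `question_tokens` and `answer_reserve`, and the treatment of odd budgets (the extra token goes to the tail) are assumptions. Tokens are represented as a plain list so that any tokenizer can be plugged in.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def trajectory_budget(context_window: int, question_tokens: int, answer_reserve: int) -> int:
    """Tokens left for the trajectory after reserving room for the question and answer."""
    return max(0, context_window - question_tokens - answer_reserve)


def truncate_middle(tokens: Sequence[T], budget: int) -> List[T]:
    """Keep the first half and last half of the budget, dropping the middle."""
    if budget <= 0:
        return []
    if len(tokens) <= budget:
        return list(tokens)  # already fits, nothing to drop
    head = budget // 2       # tokens kept from the start
    tail = budget - head     # tokens kept from the end (>= 1 here)
    return list(tokens[:head]) + list(tokens[-tail:])


# Usage: a 10-token trajectory squeezed into a 4-token budget.
tokens = list(range(10))
budget = trajectory_budget(context_window=8, question_tokens=2, answer_reserve=2)
kept = truncate_middle(tokens, budget)
```

Keeping both ends preserves the task setup at the start of the trajectory and the most recent turns at the end, at the cost of losing the middle of long episodes.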
Classes
AMABenchDataset
AMABenchDataset(subset: str = 'default', cache_dir: Optional[str] = None, max_trajectory_tokens: Optional[int] = None)
Bases: DatasetProvider
AMA-Bench agent memory assessment benchmark.
Source code in src/openjarvis/evals/datasets/ama_bench.py
Functions
iter_episodes
iter_episodes() -> Iterable[List[EvalRecord]]