lifelong_agent
LifelongAgentBench dataset loader.
Faithful reimplementation of:
- https://github.com/caixd-220529/LifelongAgentBench
- https://huggingface.co/datasets/csyq/LifelongAgentBench
- Paper: arXiv:2505.11942
The HF dataset has three subsets stored as separate parquet directories:
- db_bench (SQL tasks against MySQL database tables)
- knowledge_graph (KG reasoning via multi-turn API actions)
- os_interaction (Bash tasks in a Docker Ubuntu container)
Key design decisions matching the original:
- Lifelong episodes: Tasks within each subset are ordered by sample_index and yielded as a single episode via iter_episodes(). When the eval runner uses episode_mode=True, tasks are processed sequentially, enabling lifelong learning (in-context example injection from prior successes, mirroring the original's PreviousSampleUtilizationCallback).
- Multi-turn interaction: The dataset's create_task_env(record) returns a TaskEnvironment for multi-turn agent interaction. DB tasks interact with a real database, KG tasks with an API simulator, and OS tasks with a Docker container, matching the original's protocol.
- Environment requirements: DB tasks need Docker+MySQL (SQLite fallback warns loudly). OS tasks need Docker. KG tasks need a SPARQL endpoint for full fidelity (oracle simulation with clear warning otherwise).
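The lifelong protocol described above can be illustrated with a minimal sketch. It is not the code of the original PreviousSampleUtilizationCallback or of the eval runner: the Task class, the run_lifelong_episode function, the solve callable, and the max_examples cap are all hypothetical names and policies introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    """Hypothetical stand-in for one benchmark record."""
    sample_index: int
    prompt: str
    answer: str


def run_lifelong_episode(
    tasks: List[Task],
    solve: Callable[[str, List[Tuple[str, str]]], str],
    max_examples: int = 3,
) -> List[bool]:
    """Process tasks in sample_index order, reusing prior successes as examples."""
    successes: List[Tuple[str, str]] = []  # (prompt, output) of solved tasks
    results: List[bool] = []
    for task in sorted(tasks, key=lambda t: t.sample_index):
        # Assumed policy: inject only the most recent successes.
        examples = successes[-max_examples:]
        output = solve(task.prompt, examples)
        ok = output == task.answer
        results.append(ok)
        if ok:
            # Only successful completions become future in-context examples.
            successes.append((task.prompt, output))
    return results
```

Here solve stands in for the agent call. It receives the task prompt plus the accumulated examples, so later tasks in the episode can benefit from earlier successes.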
Classes
LifelongAgentDataset
LifelongAgentDataset(subset: str = 'all', cache_dir: Optional[str] = None, sparql_endpoint: Optional[str] = None, os_image: Optional[str] = None)
Bases: DatasetProvider
LifelongAgentBench dataset loader.
Loads from HuggingFace with three subsets: db_bench, knowledge_graph,
os_interaction. Set subset="all" to load all subsets (default).
Tasks within each subset form a lifelong episode — they are
processed sequentially with the agent accumulating experience across
tasks. Use episode_mode=True in RunConfig to enable this,
mirroring the original's PreviousSampleUtilizationCallback.
Provides create_task_env(record) for multi-turn interactive
evaluation, matching the original's multi-turn agent-environment
interaction protocol.
Source code in src/openjarvis/evals/datasets/lifelong_agent.py
Functions
iter_episodes
iter_episodes() -> Iterable[List[EvalRecord]]
Yield one lifelong episode per subset, ordered by sample_index.
The original benchmark processes all tasks within a subset
sequentially, accumulating successful completions as in-context
examples for subsequent tasks. This method groups records by
subset and sorts by sample_index so the eval runner can
replicate this lifelong protocol when episode_mode=True.
Source code in src/openjarvis/evals/datasets/lifelong_agent.py
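The grouping and ordering step can be sketched as follows. This is an illustration under stated assumptions, not the source of iter_episodes: the Record class is a hypothetical stand-in for EvalRecord, and storing subset and sample_index in a metadata dict is an assumption about the record layout.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Record:
    """Hypothetical stand-in for EvalRecord; the real field layout may differ."""
    record_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def iter_episodes_sketch(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield one list of records per subset, sorted by sample_index."""
    by_subset: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        by_subset[rec.metadata["subset"]].append(rec)
    # Subsets are visited in alphabetical order here (an arbitrary choice).
    for subset in sorted(by_subset):
        yield sorted(by_subset[subset], key=lambda r: r.metadata["sample_index"])
```

Each yielded list is one lifelong episode, so a runner can process its records in order and carry state from one task to the next.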
create_task_env
create_task_env(record: EvalRecord)
Create a multi-turn TaskEnvironment for interactive evaluation.
EvalRunner calls this when episode_mode=True to run the same multi-turn interaction protocol as the original benchmark.
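A multi-turn interaction loop of this kind might look like the sketch below. The TaskEnvironment interface is not documented in this section, so the reset, step, evaluate, and close methods are assumptions. ToyTaskEnvironment and run_multi_turn are hypothetical names that stand in for a real database, KG, or Docker-backed environment and for the runner's loop.

```python
from typing import Callable, List, Tuple


class ToyTaskEnvironment:
    """Hypothetical environment: the task is solved by 'ls' followed by 'done'."""

    def __init__(self) -> None:
        self.history: List[str] = []
        self.closed = False

    def reset(self) -> str:
        self.history.clear()
        return "You are in a shell. List files, then reply 'done'."

    def step(self, action: str) -> Tuple[str, bool]:
        """Apply one agent action; return (observation, done)."""
        self.history.append(action)
        if action == "done":
            return "episode finished", True
        return f"output of {action!r}", False

    def evaluate(self) -> bool:
        return self.history == ["ls", "done"]

    def close(self) -> None:
        self.closed = True


def run_multi_turn(
    env: ToyTaskEnvironment,
    agent: Callable[[str], str],
    max_turns: int = 5,
) -> bool:
    """Alternate agent actions and environment observations until done."""
    observation = env.reset()
    try:
        for _ in range(max_turns):
            action = agent(observation)
            observation, done = env.step(action)
            if done:
                break
        return env.evaluate()
    finally:
        # Always release resources (containers, DB connections) on exit.
        env.close()
```

The try/finally reflects that real environments hold external resources such as containers or database connections, which should be released even if the agent raises mid-episode.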