lifelong_agent

LifelongAgentBench dataset loader.

Faithful reimplementation of https://github.com/caixd-220529/LifelongAgentBench (dataset: https://huggingface.co/datasets/csyq/LifelongAgentBench; paper: arXiv:2505.11942).

The HF dataset has three subsets stored as separate parquet directories:

  - db_bench (SQL tasks against MySQL database tables)
  - knowledge_graph (KG reasoning via multi-turn API actions)
  - os_interaction (Bash tasks in a Docker Ubuntu container)

Key design decisions matching the original:

  1. Lifelong episodes: Tasks within each subset are ordered by sample_index and yielded as a single episode via iter_episodes(). When the eval runner uses episode_mode=True, tasks are processed sequentially, enabling lifelong learning (in-context example injection from prior successes, mirroring the original's PreviousSampleUtilizationCallback).

  2. Multi-turn interaction: Each record provides a create_task_env() that returns a TaskEnvironment for multi-turn agent interaction. DB tasks interact with a real database, KG tasks with an API simulator, OS tasks with a Docker container — matching the original's protocol.

  3. Environment requirements: DB tasks need Docker+MySQL (falling back to SQLite with a loud warning). OS tasks need Docker. KG tasks need a SPARQL endpoint for full fidelity (otherwise an oracle simulation is used, with a clear warning).
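The lifelong protocol in point 1 can be illustrated with a minimal, self-contained sketch. The task format, solver, and loop below are hypothetical stand-ins for illustration, not the openjarvis API:

```python
# Minimal sketch of the lifelong protocol: tasks are processed in order,
# and each success is stored as an in-context example for later tasks,
# mirroring the original's PreviousSampleUtilizationCallback. All names
# here are illustrative stand-ins, not the real openjarvis API.

def run_lifelong(tasks, solve):
    """Process tasks sequentially, accumulating successful completions."""
    examples = []  # (instruction, answer) pairs from prior successes
    results = []
    for task in tasks:
        # Prior successes are passed to the solver as in-context examples.
        answer, success = solve(task, examples)
        results.append(success)
        if success:
            examples.append((task["instruction"], answer))
    return results, examples

# Toy solver: succeeds on the first task, and on any later task once at
# least one in-context example has been accumulated.
def toy_solve(task, examples):
    ok = task["sample_index"] == 0 or len(examples) > 0
    return f"answer-{task['sample_index']}", ok

tasks = [{"instruction": f"t{i}", "sample_index": i} for i in range(3)]
results, examples = run_lifelong(tasks, toy_solve)
```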

Classes

LifelongAgentDataset

LifelongAgentDataset(subset: str = 'all', cache_dir: Optional[str] = None, sparql_endpoint: Optional[str] = None, os_image: Optional[str] = None)

Bases: DatasetProvider

LifelongAgentBench dataset loader.

Loads from HuggingFace with three subsets: db_bench, knowledge_graph, os_interaction. Pass subset="all" (the default) to load all three.

Tasks within each subset form a lifelong episode — they are processed sequentially with the agent accumulating experience across tasks. Use episode_mode=True in RunConfig to enable this, mirroring the original's PreviousSampleUtilizationCallback.

Provides create_task_env(record) for multi-turn interactive evaluation, matching the original's multi-turn agent-environment interaction protocol.

Source code in src/openjarvis/evals/datasets/lifelong_agent.py
def __init__(
    self,
    subset: str = "all",
    cache_dir: Optional[str] = None,
    sparql_endpoint: Optional[str] = None,
    os_image: Optional[str] = None,
) -> None:
    if subset != "all" and subset not in _VALID_SUBSETS:
        raise ValueError(
            f"Unknown subset {subset!r}. "
            f"Choose from: {list(_VALID_SUBSETS)} or 'all'"
        )
    self._subset = subset
    self._cache_dir = cache_dir
    self._sparql_endpoint = sparql_endpoint
    self._os_image = os_image
    self._records: List[EvalRecord] = []
Functions
iter_episodes
iter_episodes() -> Iterable[List[EvalRecord]]

Yield one lifelong episode per subset, ordered by sample_index.

The original benchmark processes all tasks within a subset sequentially, accumulating successful completions as in-context examples for subsequent tasks. This method groups records by subset and sorts by sample_index so the eval runner can replicate this lifelong protocol when episode_mode=True.

Source code in src/openjarvis/evals/datasets/lifelong_agent.py
def iter_episodes(self) -> Iterable[List[EvalRecord]]:
    """Yield one lifelong episode per subset, ordered by sample_index.

    The original benchmark processes all tasks within a subset
    sequentially, accumulating successful completions as in-context
    examples for subsequent tasks.  This method groups records by
    subset and sorts by ``sample_index`` so the eval runner can
    replicate this lifelong protocol when ``episode_mode=True``.
    """
    by_subset: Dict[str, List[EvalRecord]] = defaultdict(list)
    for record in self._records:
        by_subset[record.metadata["subset"]].append(record)

    for subset in sorted(by_subset):
        episode = sorted(
            by_subset[subset],
            key=lambda r: r.metadata["sample_index"],
        )
        for i, record in enumerate(episode):
            record.metadata["episode_task_index"] = i
            record.metadata["episode_length"] = len(episode)
        yield episode
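The grouping logic above can be exercised standalone. This sketch reproduces the same group-by-subset, sort-by-sample_index behavior on plain dicts (standing in for EvalRecord and its metadata, purely for illustration):

```python
# Self-contained sketch of iter_episodes: records are grouped by subset,
# each group is sorted by sample_index, and every record is annotated
# with its position in, and the length of, its episode.
from collections import defaultdict

def iter_episodes(records):
    by_subset = defaultdict(list)
    for record in records:
        by_subset[record["subset"]].append(record)
    for subset in sorted(by_subset):
        episode = sorted(by_subset[subset], key=lambda r: r["sample_index"])
        for i, record in enumerate(episode):
            record["episode_task_index"] = i
            record["episode_length"] = len(episode)
        yield episode

records = [
    {"subset": "os_interaction", "sample_index": 1},
    {"subset": "db_bench", "sample_index": 2},
    {"subset": "db_bench", "sample_index": 0},
]
episodes = list(iter_episodes(records))
```

Note that episodes are yielded in sorted subset order ("db_bench" before "os_interaction"), and records within an episode come back sorted regardless of input order.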
create_task_env
create_task_env(record: EvalRecord)

Create a multi-turn TaskEnvironment for interactive evaluation.

This is called by EvalRunner when episode_mode=True and enables the faithful multi-turn interaction protocol matching the original.

Source code in src/openjarvis/evals/datasets/lifelong_agent.py
def create_task_env(self, record: EvalRecord):
    """Create a multi-turn TaskEnvironment for interactive evaluation.

    This is called by EvalRunner when episode_mode=True and enables
    the faithful multi-turn interaction protocol matching the original.
    """
    from openjarvis.evals.environments.lifelong_agent_env import (
        create_task_environment,
    )
    return create_task_environment(
        record,
        sparql_endpoint=self._sparql_endpoint,
        os_image=self._os_image,
    )
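The factory create_task_environment itself lives in openjarvis.evals.environments and is not shown on this page. A plausible shape, dispatching on the record's subset to build the matching environment, might look like the following sketch; the stub class, dispatch branches, and keyword handling are assumptions for illustration, not the real implementation:

```python
# Hypothetical sketch of per-subset environment dispatch. The real
# create_task_environment is defined in openjarvis.evals.environments;
# _StubEnv and the branch bodies below are illustrative stand-ins only.

class _StubEnv:
    def __init__(self, kind, **kwargs):
        self.kind = kind          # which backend this environment wraps
        self.options = kwargs     # backend-specific configuration

def create_task_environment(record, sparql_endpoint=None, os_image=None):
    subset = record["metadata"]["subset"]
    if subset == "db_bench":
        return _StubEnv("mysql")                          # Docker+MySQL
    if subset == "knowledge_graph":
        return _StubEnv("kg", endpoint=sparql_endpoint)   # SPARQL or oracle sim
    if subset == "os_interaction":
        return _StubEnv("bash", image=os_image)           # Docker Ubuntu container
    raise ValueError(f"Unknown subset: {subset!r}")

env = create_task_environment(
    {"metadata": {"subset": "knowledge_graph"}},
    sparql_endpoint="http://localhost:3001/sparql",
)
```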