lifelong_agent_env
Multi-turn task environments for LifelongAgentBench.
Implements faithful reproductions of the original's interaction protocols:
- DB: MySQL Docker container (SQLite fallback with degraded-mode warning)
- KG: Variable-store API simulation (SPARQL endpoint if configured)
- OS: Docker container with correct image
Reference: https://github.com/caixd-220529/LifelongAgentBench
Classes
TaskEnvironment
Bases: ABC
Environment for multi-turn interactive evaluation.
Subclasses implement the reset/step/evaluate lifecycle:
1. reset(record) — initialize the environment, return initial observation
2. step(agent_response) — parse agent action, execute, return feedback
3. evaluate() — assess final state, return (is_correct, metadata)
4. close() — release resources
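The lifecycle above can be sketched with a toy subclass. This is an illustrative stand-in, not the real base class (which lives in src/openjarvis/evals/environments/base.py); the class name and task logic are invented for the example.

```python
from typing import Any, Optional


class EchoEnvironment:
    """Toy environment following the reset/step/evaluate/close protocol.

    The agent is 'correct' iff it eventually replies 'Final Answer: 42'.
    """

    def reset(self, record: dict) -> str:
        # Initialize per-record state and return the initial observation.
        self._answer: Optional[str] = None
        return f"Task: {record['question']}"

    def step(self, agent_response: str) -> tuple[str, bool]:
        # Parse the agent action; return (feedback, is_done).
        if agent_response.startswith("Final Answer:"):
            self._answer = agent_response.removeprefix("Final Answer:").strip()
            return "Answer recorded.", True
        return "Continue.", False

    def evaluate(self) -> tuple[Optional[bool], dict[str, Any]]:
        # Assess final state; None means the episode is not scorable.
        if self._answer is None:
            return None, {"reason": "no answer given"}
        return self._answer == "42", {"answer": self._answer}

    def close(self) -> None:
        # Nothing to release in this toy environment.
        pass
```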
Attributes
max_turns (property)
Maximum interaction turns for this environment.
Subclasses should override to match the original benchmark's per-task-type turn limits. Default: 15.
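As a sketch, overriding the property in a subclass looks like this (the class name and the limit of 8 are illustrative, not taken from the benchmark):

```python
class ShortTaskEnvironment:
    @property
    def max_turns(self) -> int:
        # Override the default of 15 with a tighter per-task-type limit.
        return 8
```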
Functions
reset (abstractmethod)
reset(record: EvalRecord) -> str
Initialize environment for a record.
Returns the initial observation/context for the agent (e.g. schema description, entity list, task instructions).
step (abstractmethod)
Process an agent response.
Returns:
- observation: Feedback text to show the agent (e.g. SQL result, API return value, bash output).
- is_done: True if the agent signaled completion (e.g. Action: Answer, Final Answer:, Act: finish).
Source code in src/openjarvis/evals/environments/base.py
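The three completion markers above (one per task type) can be detected with a small helper; the function name and exact regexes are a sketch, not the implementation in the source file:

```python
import re

# Terminal markers per task type: DB ("Action: Answer"),
# KG ("Final Answer:"), OS ("Act: finish").
_DONE_PATTERNS = [
    re.compile(r"Action:\s*Answer"),
    re.compile(r"Final Answer:"),
    re.compile(r"Act:\s*finish"),
]


def signals_completion(agent_response: str) -> bool:
    """Return True if the response contains any terminal marker."""
    return any(p.search(agent_response) for p in _DONE_PATTERNS)
```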
evaluate (abstractmethod)
Evaluate the final state after interaction completes.
Returns:
- is_correct: True/False/None (None = not scorable).
- metadata: Scoring details dict.
Source code in src/openjarvis/evals/environments/base.py
close
Release resources held by the environment.
run_agent_loop
Run the full reset → [generate → step] × N → evaluate cycle.
Called by AgenticRunner for environments that use the reset/step/evaluate protocol instead of one-shot generation.
Maintains full conversation history across turns so the agent retains context. Strips <think> tags from agent responses before passing to step().
After completion, sets on self:
- last_eval_result: (is_correct, metadata)
- all_responses: list of raw agent responses per turn
- turn_wall_clocks: list of per-turn wall clock seconds
- interaction_history: message list (for lifelong injection)
Source code in src/openjarvis/evals/environments/base.py
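The loop can be sketched as follows. This is a hedged approximation of the documented behavior, not the real method: the parameter names (`env`, `record`, `generate`) are illustrative, and `env` is any object exposing the reset/step/evaluate/close protocol plus max_turns.

```python
import re
import time
from typing import Callable


def run_agent_loop(env, record, generate: Callable[[list[dict]], str]) -> None:
    """Sketch of reset -> [generate -> step] x N -> evaluate.

    generate() maps a message list to the agent's raw response text.
    """
    messages = [{"role": "user", "content": env.reset(record)}]
    env.all_responses, env.turn_wall_clocks = [], []
    for _ in range(env.max_turns):
        t0 = time.monotonic()
        raw = generate(messages)
        env.turn_wall_clocks.append(time.monotonic() - t0)
        env.all_responses.append(raw)
        # Strip <think>...</think> reasoning before passing to step().
        cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
        messages.append({"role": "assistant", "content": cleaned})
        observation, is_done = env.step(cleaned)
        if is_done:
            break
        messages.append({"role": "user", "content": observation})
    # Expose results on the environment, mirroring the documented attributes.
    env.interaction_history = messages
    env.last_eval_result = env.evaluate()
    env.close()
```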
DBEnvironment
Bases: TaskEnvironment
Multi-turn DB environment matching the original's interaction protocol.
The original uses MySQL Docker containers. We try MySQL first and fall back to SQLite with a clear degraded-mode warning.
Source code in src/openjarvis/evals/environments/lifelong_agent_env.py
KGEnvironment
Bases: TaskEnvironment
KG API simulation environment.
The original uses a Freebase SPARQL endpoint. When no endpoint is configured, we simulate API calls using the gold action_list from the dataset as oracle responses and clearly warn about degraded mode.
Metrics match the original: exact-set match + F1 on answer entities.
Source code in src/openjarvis/evals/environments/lifelong_agent_env.py
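The two answer-entity metrics can be sketched like this (function names are illustrative; the actual scoring lives in the source file above):

```python
def exact_set_match(pred: set[str], gold: set[str]) -> bool:
    """True iff the predicted entities equal the gold entities as sets."""
    return pred == gold


def entity_f1(pred: set[str], gold: set[str]) -> float:
    """F1 over answer entities: harmonic mean of precision and recall."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)  # true positives: entities in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```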
OSEnvironment
Bases: TaskEnvironment
Docker-based OS interaction environment.
Matches the original's three-phase protocol:
1. Start container, run initialization_command_item
2. Agent sends bash commands, gets output
3. Run evaluation_command_item — pass iff exit_code == 0
Source code in src/openjarvis/evals/environments/lifelong_agent_env.py
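The three phases can be sketched abstractly. Here `run_in_container` stands in for executing a bash command inside the task's Docker container; all parameter names are illustrative, and the real implementation handles container lifecycle and timeouts as well:

```python
from typing import Callable


def run_os_task(
    run_in_container: Callable[[str], tuple[str, int]],
    init_command: str,
    agent_commands: list[str],
    eval_command: str,
) -> tuple[bool, list[str]]:
    """Sketch of the three-phase OS protocol.

    run_in_container(cmd) executes a bash command in the container and
    returns (output, exit_code).
    """
    outputs: list[str] = []
    run_in_container(init_command)            # phase 1: initialization
    for cmd in agent_commands:                # phase 2: agent interaction
        out, _ = run_in_container(cmd)
        outputs.append(out)
    _, exit_code = run_in_container(eval_command)  # phase 3: evaluation
    return exit_code == 0, outputs            # pass iff exit_code == 0
```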
Functions
create_task_environment
create_task_environment(record: EvalRecord, *, sparql_endpoint: Optional[str] = None, os_image: Optional[str] = None, os_timeout: int = 120) -> TaskEnvironment
Factory: create the right environment for a LifelongAgentBench record.
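A plausible shape for the dispatch, shown with placeholder return values; the real factory inspects the EvalRecord and constructs DBEnvironment, KGEnvironment, or OSEnvironment, and the task-type keys here are assumptions:

```python
def dispatch_task_environment(task_type: str) -> str:
    """Illustrative dispatch only; returns the environment class name.

    The real create_task_environment instantiates the class with the
    record and the sparql_endpoint / os_image / os_timeout options.
    """
    mapping = {
        "db": "DBEnvironment",
        "kg": "KGEnvironment",
        "os": "OSEnvironment",
    }
    try:
        return mapping[task_type]
    except KeyError:
        raise ValueError(f"unknown LifelongAgentBench task type: {task_type!r}")
```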