lifelong_agent_env
Multi-turn task environments for LifelongAgentBench.
Implements faithful reproductions of the original's interaction protocols:
- DB: MySQL Docker container (SQLite fallback with degraded-mode warning)
- KG: Variable-store API simulation (SPARQL endpoint if configured)
- OS: Docker container with correct image
Reference: https://github.com/caixd-220529/LifelongAgentBench
Classes
TaskEnvironment
Bases: ABC
Environment for multi-turn interactive evaluation.
Subclasses implement the reset/step/evaluate lifecycle:
1. reset(record) — initialize the environment, return initial observation
2. step(agent_response) — parse agent action, execute, return feedback
3. evaluate() — assess final state, return (is_correct, metadata)
4. close() — release resources
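The lifecycle above can be sketched with a toy subclass. This is an illustrative stand-in, not the real base class (which lives in src/openjarvis/evals/environments/base.py); the class name and task logic are invented for the example.

```python
from typing import Any, Optional


class EchoEnvironment:
    """Toy environment following the reset/step/evaluate/close protocol.

    The agent is 'correct' iff it eventually replies 'Final Answer: 42'.
    """

    def reset(self, record: dict) -> str:
        # Initialize per-record state and return the initial observation.
        self._answer: Optional[str] = None
        return f"Task: {record['question']}"

    def step(self, agent_response: str) -> tuple[str, bool]:
        # Parse the agent action; return (feedback, is_done).
        if agent_response.startswith("Final Answer:"):
            self._answer = agent_response.removeprefix("Final Answer:").strip()
            return "Answer recorded.", True
        return "Continue.", False

    def evaluate(self) -> tuple[Optional[bool], dict[str, Any]]:
        # Assess final state; None means the episode is not scorable.
        if self._answer is None:
            return None, {"reason": "no answer given"}
        return self._answer == "42", {"answer": self._answer}

    def close(self) -> None:
        # Nothing to release in this toy environment.
        pass
```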
Attributes
max_turns (property)
Maximum interaction turns for this environment.
Subclasses should override to match the original benchmark's per-task-type turn limits. Default: 15.
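As a sketch, overriding the property in a subclass looks like this (the class name and the limit of 8 are illustrative, not taken from the benchmark):

```python
class ShortTaskEnvironment:
    @property
    def max_turns(self) -> int:
        # Override the default of 15 with a tighter per-task-type limit.
        return 8
```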
Functions
reset (abstractmethod)
reset(record: EvalRecord) -> str
Initialize environment for a record.
Returns the initial observation/context for the agent (e.g. schema description, entity list, task instructions).
step (abstractmethod)
Process an agent response.
Returns:
- observation: Feedback text to show the agent (e.g. SQL result, API return value, bash output).
- is_done: True if the agent signaled completion (e.g. Action: Answer, Final Answer:, Act: finish).
Source code in src/openjarvis/evals/environments/base.py
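The three completion markers above (one per task type) can be detected with a small helper; the function name and exact regexes are a sketch, not the implementation in the source file:

```python
import re

# Terminal markers per task type: DB ("Action: Answer"),
# KG ("Final Answer:"), OS ("Act: finish").
_DONE_PATTERNS = [
    re.compile(r"Action:\s*Answer"),
    re.compile(r"Final Answer:"),
    re.compile(r"Act:\s*finish"),
]


def signals_completion(agent_response: str) -> bool:
    """Return True if the response contains any terminal marker."""
    return any(p.search(agent_response) for p in _DONE_PATTERNS)
```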
evaluate (abstractmethod)
Evaluate the final state after interaction completes.
Returns:
- is_correct: True/False/None (None = not scorable).
- metadata: Scoring details dict.
Source code in src/openjarvis/evals/environments/base.py
close
Release resources held by the environment.
run_agent_loop
Run the full reset → [generate → step] × N → evaluate cycle.
Called by AgenticRunner for environments that use the reset/step/evaluate protocol instead of one-shot generation.
Maintains full conversation history across turns so the agent retains context. Strips <think> tags from agent responses before passing to step().
After completion, sets on self:
- last_eval_result: (is_correct, metadata)
- all_responses: list of raw agent responses per turn
- turn_wall_clocks: list of per-turn wall clock seconds
- interaction_history: message list (for lifelong injection)
Source code in src/openjarvis/evals/environments/base.py
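The loop can be sketched as follows. This is a hedged approximation of the documented behavior, not the real method: the parameter names (`env`, `record`, `generate`) are illustrative, and `env` is any object exposing the reset/step/evaluate/close protocol plus max_turns.

```python
import re
import time
from typing import Callable


def run_agent_loop(env, record, generate: Callable[[list[dict]], str]) -> None:
    """Sketch of reset -> [generate -> step] x N -> evaluate.

    generate() maps a message list to the agent's raw response text.
    """
    messages = [{"role": "user", "content": env.reset(record)}]
    env.all_responses, env.turn_wall_clocks = [], []
    for _ in range(env.max_turns):
        t0 = time.monotonic()
        raw = generate(messages)
        env.turn_wall_clocks.append(time.monotonic() - t0)
        env.all_responses.append(raw)
        # Strip <think>...</think> reasoning before passing to step().
        cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
        messages.append({"role": "assistant", "content": cleaned})
        observation, is_done = env.step(cleaned)
        if is_done:
            break
        messages.append({"role": "user", "content": observation})
    # Expose results on the environment, mirroring the documented attributes.
    env.interaction_history = messages
    env.last_eval_result = env.evaluate()
    env.close()
```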
DBEnvironment
Bases: TaskEnvironment
Multi-turn DB environment matching the original's interaction protocol.
The original uses MySQL Docker containers. We try MySQL first and fall back to SQLite with a clear degraded-mode warning.
Source code in src/openjarvis/evals/environments/lifelong_agent_env.py
KGEnvironment
Bases: TaskEnvironment
KG API simulation environment.
The original uses a Freebase SPARQL endpoint. When no endpoint is configured, we simulate API calls using the gold action_list from the dataset as oracle responses and clearly warn about degraded mode.
Metrics match the original: exact-set match + F1 on answer entities.
Source code in src/openjarvis/evals/environments/lifelong_agent_env.py
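The two answer-entity metrics can be sketched like this (function names are illustrative; the actual scoring lives in the source file above):

```python
def exact_set_match(pred: set[str], gold: set[str]) -> bool:
    """True iff the predicted entities equal the gold entities as sets."""
    return pred == gold


def entity_f1(pred: set[str], gold: set[str]) -> float:
    """F1 over answer entities: harmonic mean of precision and recall."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)  # true positives: entities in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```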
OSEnvironment
Bases: TaskEnvironment
Docker-based OS interaction environment.
Matches the original's three-phase protocol:
1. Start container, run initialization_command_item
2. Agent sends bash commands, gets output
3. Run evaluation_command_item — pass iff exit_code == 0
Source code in src/openjarvis/evals/environments/lifelong_agent_env.py
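The three phases can be sketched abstractly. Here `run_in_container` stands in for executing a bash command inside the task's Docker container; all parameter names are illustrative, and the real implementation handles container lifecycle and timeouts as well:

```python
from typing import Callable


def run_os_task(
    run_in_container: Callable[[str], tuple[str, int]],
    init_command: str,
    agent_commands: list[str],
    eval_command: str,
) -> tuple[bool, list[str]]:
    """Sketch of the three-phase OS protocol.

    run_in_container(cmd) executes a bash command in the container and
    returns (output, exit_code).
    """
    outputs: list[str] = []
    run_in_container(init_command)            # phase 1: initialization
    for cmd in agent_commands:                # phase 2: agent interaction
        out, _ = run_in_container(cmd)
        outputs.append(out)
    _, exit_code = run_in_container(eval_command)  # phase 3: evaluation
    return exit_code == 0, outputs            # pass iff exit_code == 0
```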
Functions
create_task_environment
create_task_environment(record: EvalRecord, *, sparql_endpoint: Optional[str] = None, os_image: Optional[str] = None, os_timeout: int = 120) -> TaskEnvironment
Factory: create the right environment for a LifelongAgentBench record.
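A plausible shape for the dispatch, shown with placeholder return values; the real factory inspects the EvalRecord and constructs DBEnvironment, KGEnvironment, or OSEnvironment, and the task-type keys here are assumptions:

```python
def dispatch_task_environment(task_type: str) -> str:
    """Illustrative dispatch only; returns the environment class name.

    The real create_task_environment instantiates the class with the
    record and the sparql_endpoint / os_image / os_timeout options.
    """
    mapping = {
        "db": "DBEnvironment",
        "kg": "KGEnvironment",
        "os": "OSEnvironment",
    }
    try:
        return mapping[task_type]
    except KeyError:
        raise ValueError(f"unknown LifelongAgentBench task type: {task_type!r}")
```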