Index
environments
¶
Task environments for interactive multi-turn evaluation.
Classes¶
TaskEnvironment
¶
Bases: ABC
Environment for multi-turn interactive evaluation.
Subclasses implement the reset/step/evaluate lifecycle:
1. reset(record) — initialize the environment, return initial observation
2. step(agent_response) — parse agent action, execute, return feedback
3. evaluate() — assess final state, return (is_correct, metadata)
4. close() — release resources
Attributes¶
max_turns
property
¶
Maximum interaction turns for this environment.
Subclasses should override to match the original benchmark's per-task-type turn limits. Default: 15.
Functions¶
reset
abstractmethod
¶
reset(record: EvalRecord) -> str
Initialize environment for a record.
Returns the initial observation/context for the agent (e.g. schema description, entity list, task instructions).
step
abstractmethod
¶
Process an agent response.
Returns:
observation: Feedback text to show the agent (e.g. SQL result,
API return value, bash output).
is_done: True if the agent signaled completion (e.g. Action:
Answer, Final Answer:, Act: finish).
Source code in src/openjarvis/evals/environments/base.py
evaluate
abstractmethod
¶
Evaluate the final state after interaction completes.
Returns: is_correct: True/False/None (None = not scorable). metadata: Scoring details dict.
Source code in src/openjarvis/evals/environments/base.py
close
¶
run_agent_loop
¶
Run the full reset → [generate → step] × N → evaluate cycle.
Called by AgenticRunner for environments that use the reset/step/evaluate protocol instead of one-shot generation.
Maintains full conversation history across turns so the agent
retains context. Strips <think> tags from agent responses
before passing to step().
After completion, sets on self:
- last_eval_result: (is_correct, metadata)
- all_responses: list of raw agent responses per turn
- turn_wall_clocks: list of per-turn wall clock seconds
- interaction_history: message list (for lifelong injection)