workarena_env
WorkArena task environment: per-task BrowserGym lifecycle and validation.
Wraps BrowserGym's BrowserEnv to provide per-task browser/ServiceNow
setup, observation access, action stepping, and native validate()
scoring against the live ServiceNow instance.
Classes
WorkArenaTaskEnv
Per-task BrowserGym environment for WorkArena.
Context manager that creates a BrowserEnv, resets the task against
the ServiceNow instance, and exposes observation/action/validate methods.
After the agent finishes, run_tests() calls the task's native
validate() to determine pass/fail from the actual ServiceNow state.
Source code in src/openjarvis/evals/execution/workarena_env.py
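The lifecycle described above (set up on entry, act, validate, tear down on exit) can be sketched with a stand-in class. This is a minimal illustration, not the real implementation: `FakeTaskEnv` and `run_episode` are hypothetical names, the constructor arguments of `WorkArenaTaskEnv` are not documented here, and the boolean return value of `run_tests()` is an assumption.

```python
from typing import Any, Dict, List, Tuple


class FakeTaskEnv:
    """Hypothetical stand-in that mirrors only the documented method names."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def __enter__(self) -> "FakeTaskEnv":
        # The real class would create the BrowserEnv and reset the task here.
        self.events.append("setup")
        return self

    def __exit__(self, exc_type: Any, exc: Any, tb: Any) -> bool:
        # The real class would release the browser here, even after an error.
        self.events.append("teardown")
        return False  # do not swallow exceptions

    def get_observation_text(self) -> str:
        return "page: incident form"

    def step(self, action: str) -> Tuple[str, float, bool, Dict]:
        self.events.append("step:" + action)
        return "page: incident form (updated)", 0.0, True, {}

    def run_tests(self) -> bool:
        # Assumed to return pass/fail as a bool.
        self.events.append("validate")
        return True


def run_episode(env_factory) -> Tuple[bool, List[str]]:
    """Act once, then validate while the environment is still open."""
    with env_factory() as env:
        env.get_observation_text()
        env.step('click("bid_123")')
        passed = env.run_tests()
    return passed, env.events
```

Calling `run_tests()` inside the `with` block matters in this sketch because validation reads live ServiceNow state, which presumably requires the browser session to still be open.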
Functions
get_observation_text
Return the current observation formatted as text for the agent.
step
Execute a BrowserGym action and return (obs_text, reward, done, info).
Actions use BrowserGym's high-level action format, for example:
    click("bid_123")
    fill("bid_456", "hello world")
    scroll(0, 300)
    send_msg_to_user("The answer is 42")
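Because each action is a string that looks like a single Python call, a caller may want to sanity-check it before passing it to `step`. The sketch below is one possible approach using the standard `ast` module; `parse_action` and `KNOWN_ACTIONS` are hypothetical helpers, and the action names are only those shown in the examples above, so the real action set may be larger.

```python
import ast

# Names taken from the examples above; the real action set may be larger.
KNOWN_ACTIONS = {"click", "fill", "scroll", "send_msg_to_user"}


def parse_action(action: str):
    """Return (name, args) for a single call with literal arguments, else None."""
    try:
        tree = ast.parse(action.strip(), mode="eval")
    except (SyntaxError, ValueError):
        return None
    call = tree.body
    # Accept only a plain call such as click("bid_123"), not attribute calls.
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return None
    if call.func.id not in KNOWN_ACTIONS or call.keywords:
        return None
    try:
        # literal_eval rejects anything that is not a plain literal.
        args = [ast.literal_eval(arg) for arg in call.args]
    except ValueError:
        return None
    return call.func.id, args
```

Parsing rather than evaluating the string means model output is never executed directly; anything that is not a known call with literal arguments is rejected.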
send_chat_message
Send a message from the assistant to the chat.
run_tests
Validate the task using the native WorkArena validate() method.
This calls task.validate(page, chat_messages) which checks the
actual state of the ServiceNow instance — the canonical evaluation
method from the original benchmark.
run_agent_loop
Drive the BrowserGym env in a step loop using generate_fn for LLM calls.
generate_fn(prompt) -> response is called once per step.
The loop feeds observations to the LLM, parses a BrowserGym
action from its response, and steps the environment until the
task is done or max_steps is reached.
Validation (run_tests) is not called here — the caller
(e.g. AgenticRunner) is responsible for that.
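The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the actual `run_agent_loop`: `ScriptedEnv`, `agent_loop_sketch`, and `ACTION_RE` are hypothetical, the regex-based extraction of the action from the model response is one possible parsing strategy, and the fallback message for unparseable responses is invented for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed convention: the first well-formed call in the response is the action.
ACTION_RE = re.compile(
    r'\b(?:click|fill|scroll|send_msg_to_user)\((?:[^()"]|"[^"]*")*\)'
)


class ScriptedEnv:
    """Hypothetical stand-in env that reports done after a fixed number of steps."""

    def __init__(self, steps_until_done: int) -> None:
        self.remaining = steps_until_done
        self.actions: List[str] = []

    def get_observation_text(self) -> str:
        return "{} step(s) left".format(self.remaining)

    def step(self, action: str) -> Tuple[str, float, bool, Dict]:
        self.actions.append(action)
        self.remaining -= 1
        return self.get_observation_text(), 0.0, self.remaining <= 0, {}


def agent_loop_sketch(env, generate_fn: Callable[[str], str], max_steps: int) -> int:
    """Return the number of steps taken; validation is left to the caller."""
    obs = env.get_observation_text()
    for step_idx in range(max_steps):
        response = generate_fn(obs)  # one LLM call per step
        match = ACTION_RE.search(response)
        # Assumption: an unparseable response falls back to a message action.
        action = match.group(0) if match else 'send_msg_to_user("no action parsed")'
        obs, _reward, done, _info = env.step(action)
        if done:
            return step_idx + 1
    return max_steps
```

Keeping validation out of the loop, as the documentation states, lets the caller decide when to run `run_tests()` and how to record the result.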