WebChoreArena task environment — Playwright-based browser interaction.
Wraps the WebArena browser environment to provide per-task setup,
observation access, action stepping, and evaluation using the
original WebArena evaluation harness (StringEvaluator, URLEvaluator,
and HTMLContentEvaluator, combined multiplicatively).
Requires:
- playwright (pip install playwright && playwright install)
- Running WebArena standalone sites (Shopping, Reddit, GitLab, etc.)
- Environment variables: SHOPPING, SHOPPING_ADMIN, REDDIT, GITLAB, MAP, WIKIPEDIA
Classes
WebChoreArenaTaskEnv
WebChoreArenaTaskEnv(metadata: MutableMapping[str, Any], headless: bool = True)
Per-task browser environment for WebChoreArena.
Context manager that creates a Playwright browser, navigates to the
task's start URL, and exposes observation/action/evaluate methods.
Evaluation uses the original WebArena evaluator harness with
multiplicative combination of StringEvaluator, URLEvaluator, and
HTMLContentEvaluator.
Source code in src/openjarvis/evals/execution/webchorearena_env.py
```python
def __init__(
    self,
    metadata: MutableMapping[str, Any],
    headless: bool = True,
) -> None:
    self._metadata = metadata
    self._headless = headless
    self._playwright: Any = None
    self._browser: Any = None
    self._context: Any = None
    self._page: Any = None
    self._cdp_session: Any = None
    self._done = False
    self._agent_answer = ""
    self._step_count = 0
    self._task_config: Dict[str, Any] = metadata.get("task_config", {})
```
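A minimal usage sketch. Only the constructor signature is taken from the source; the shape of the inner task config (the `"intent"` and `"start_url"` fields) and the localhost URL are assumptions based on the WebArena task-file format:

```python
from typing import Any, MutableMapping

# Hypothetical task metadata; __init__ above reads only the "task_config"
# key, while the inner fields are assumptions about the WebArena task format.
metadata: MutableMapping[str, Any] = {
    "task_config": {
        "intent": "Find the cheapest USB-C cable",
        "start_url": "http://localhost:7770",
    },
}

# The class is a context manager: the Playwright browser is created on
# __enter__ and torn down on __exit__; before __enter__, the browser
# attributes initialized above remain None.
#
# with WebChoreArenaTaskEnv(metadata, headless=True) as env:
#     answer = env.run_agent_loop(generate_fn=my_llm)
```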
Functions
run_agent_loop
run_agent_loop(generate_fn: Callable[[str], str], max_steps: Optional[int] = None) -> str
Drive the browser env in a step loop using generate_fn for LLM calls.
Returns the agent's final answer text.
Source code in src/openjarvis/evals/execution/webchorearena_env.py
```python
def run_agent_loop(
    self,
    generate_fn: Callable[[str], str],
    max_steps: Optional[int] = None,
) -> str:
    """Drive the browser env in a step loop using *generate_fn* for LLM calls.

    Returns the agent's final answer text.
    """
    if self._page is None:
        raise RuntimeError("Environment not initialized — use as context manager")
    if max_steps is None:
        max_steps = _MAX_STEPS_DEFAULT
    responses: List[str] = []
    intent = self._task_config.get(
        "intent", self._task_config.get("intent_template", ""),
    )
    for step_idx in range(max_steps):
        if self._done:
            break
        prompt = self._build_step_prompt(intent, step_idx, max_steps)
        response = generate_fn(prompt)
        responses.append(response)
        action = response.strip()
        self._execute_action(action)
        self._step_count += 1
        if self._done:
            break
    # Run evaluation after the interaction loop
    self._run_evaluation()
    return self._agent_answer or "\n---\n".join(responses)
```
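Any callable mapping a prompt string to an action string satisfies the `generate_fn` contract. A scripted stand-in (a sketch for testing without an LLM; the `"stop []"` fallback is an assumption about the env's action grammar) can look like this:

```python
def make_scripted_agent(actions):
    """Return a generate_fn that replays a fixed action script."""
    it = iter(actions)

    def generate_fn(prompt: str) -> str:
        # Ignore the prompt; replay the next scripted action, falling back
        # to a stop action once the script is exhausted.
        return next(it, "stop []")

    return generate_fn
```

This is useful for exercising the step loop deterministically before wiring in a real model call.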
run_tests
run_tests() -> None
Run the WebArena evaluation harness and populate metadata.
Source code in src/openjarvis/evals/execution/webchorearena_env.py
```python
def run_tests(self) -> None:
    """Run the WebArena evaluation harness and populate metadata."""
    self._run_evaluation()
```
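The multiplicative combination described above means each evaluator yields a score in [0, 1] and the final score is their product, so a single failing check zeroes the task. A sketch of just the combination rule (the evaluator classes themselves come from the WebArena harness and are not reproduced here):

```python
from math import prod


def combine_scores(scores):
    """Multiplicatively combine per-evaluator scores in [0, 1]."""
    # Any evaluator returning 0.0 forces the overall score to 0.0.
    return prod(scores)
```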