WebChoreArena task environment — Playwright-based browser interaction.
Wraps the WebArena browser environment to provide per-task setup,
observation access, action stepping, and evaluation using the
original WebArena evaluation harness (StringEvaluator, URLEvaluator,
and HTMLContentEvaluator, combined multiplicatively).
Requires:
- playwright (pip install playwright && playwright install)
- Running WebArena standalone sites (Shopping, Reddit, GitLab, etc.)
- Environment variables: SHOPPING, SHOPPING_ADMIN, REDDIT, GITLAB, MAP, WIKIPEDIA
Classes
WebChoreArenaTaskEnv
WebChoreArenaTaskEnv(metadata: MutableMapping[str, Any], headless: bool = True)
Per-task browser environment for WebChoreArena.
Context manager that creates a Playwright browser, navigates to the
task's start URL, and exposes observation/action/evaluate methods.
Evaluation uses the original WebArena evaluator harness with
multiplicative combination of StringEvaluator, URLEvaluator, and
HTMLContentEvaluator.
Source code in src/openjarvis/evals/execution/webchorearena_env.py
```python
def __init__(
    self,
    metadata: MutableMapping[str, Any],
    headless: bool = True,
) -> None:
    self._metadata = metadata
    self._headless = headless
    self._playwright: Any = None
    self._browser: Any = None
    self._context: Any = None
    self._page: Any = None
    self._cdp_session: Any = None
    self._done = False
    self._agent_answer = ""
    self._step_count = 0
    self._task_config: Dict[str, Any] = metadata.get("task_config", {})
```
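A minimal usage sketch. Only the constructor signature is taken from the source; the shape of the inner task config (the `"intent"` and `"start_url"` fields) and the localhost URL are assumptions based on the WebArena task-file format:

```python
from typing import Any, MutableMapping

# Hypothetical task metadata; __init__ above reads only the "task_config"
# key, while the inner fields are assumptions about the WebArena task format.
metadata: MutableMapping[str, Any] = {
    "task_config": {
        "intent": "Find the cheapest USB-C cable",
        "start_url": "http://localhost:7770",
    },
}

# The class is a context manager: the Playwright browser is created on
# __enter__ and torn down on __exit__; before __enter__, the browser
# attributes initialized above remain None.
#
# with WebChoreArenaTaskEnv(metadata, headless=True) as env:
#     answer = env.run_agent_loop(generate_fn=my_llm)
```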
Functions
run_agent_loop
run_agent_loop(generate_fn: Callable[[str], str], max_steps: Optional[int] = None) -> str
Drive the browser env in a step loop using generate_fn for LLM calls.
Returns the agent's final answer text.
Source code in src/openjarvis/evals/execution/webchorearena_env.py
```python
def run_agent_loop(
    self,
    generate_fn: Callable[[str], str],
    max_steps: Optional[int] = None,
) -> str:
    """Drive the browser env in a step loop using *generate_fn* for LLM calls.

    Returns the agent's final answer text.
    """
    if self._page is None:
        raise RuntimeError("Environment not initialized — use as context manager")
    if max_steps is None:
        max_steps = _MAX_STEPS_DEFAULT
    responses: List[str] = []
    intent = self._task_config.get(
        "intent", self._task_config.get("intent_template", ""),
    )
    for step_idx in range(max_steps):
        if self._done:
            break
        prompt = self._build_step_prompt(intent, step_idx, max_steps)
        response = generate_fn(prompt)
        responses.append(response)
        action = response.strip()
        self._execute_action(action)
        self._step_count += 1
        if self._done:
            break
    # Run evaluation after the interaction loop
    self._run_evaluation()
    return self._agent_answer or "\n---\n".join(responses)
```
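Any callable mapping a prompt string to an action string satisfies the `generate_fn` contract. A scripted stand-in (a sketch for testing without an LLM; the `"stop []"` fallback is an assumption about the env's action grammar) can look like this:

```python
def make_scripted_agent(actions):
    """Return a generate_fn that replays a fixed action script."""
    it = iter(actions)

    def generate_fn(prompt: str) -> str:
        # Ignore the prompt; replay the next scripted action, falling back
        # to a stop action once the script is exhausted.
        return next(it, "stop []")

    return generate_fn
```

This is useful for exercising the step loop deterministically before wiring in a real model call.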
run_tests
run_tests() -> None
Run the WebArena evaluation harness and populate metadata.
Source code in src/openjarvis/evals/execution/webchorearena_env.py
```python
def run_tests(self) -> None:
    """Run the WebArena evaluation harness and populate metadata."""
    self._run_evaluation()
```
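The multiplicative combination described above means each evaluator yields a score in [0, 1] and the final score is their product, so a single failing check zeroes the task. A sketch of just the combination rule (the evaluator classes themselves come from the WebArena harness and are not reproduced here):

```python
from math import prod


def combine_scores(scores):
    """Multiplicatively combine per-evaluator scores in [0, 1]."""
    # Any evaluator returning 0.0 forces the overall score to 0.0.
    return prod(scores)
```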