webchorearena

WebChoreArena: Realistic tedious web browsing tasks.

Evaluates web agents on 532 tasks across Shopping, Shopping Admin, Reddit, GitLab, and Cross-site environments. Tests massive memory, calculation, and long-term memory capabilities.

Requires a running WebArena standalone environment (Shopping/Magento, Reddit/Postmill, GitLab, Shopping Admin). Tasks are per-site JSON configs cloned from the original GitHub repository.

Source: https://github.com/WebChoreArena/WebChoreArena

Classes

WebChoreArenaDataset

WebChoreArenaDataset(subset: str = 'all', cache_dir: Optional[str] = None, headless: bool = True)

Bases: DatasetProvider

WebChoreArena benchmark — interactive browser-based web tasks.

Tasks are enumerated from the original GitHub repository's config_files/ JSON files. Each task requires a live WebArena standalone environment and Playwright for evaluation.

Source code in src/openjarvis/evals/datasets/webchorearena.py

def __init__(
    self,
    subset: str = "all",
    cache_dir: Optional[str] = None,
    headless: bool = True,
) -> None:
    self._subset = subset  # "all", "small", or a site name
    self._cache_dir = (
        Path(cache_dir) if cache_dir else Path.home() / ".cache" / "webchorearena"
    )
    self._headless = headless
    self._records: List[EvalRecord] = []
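
The cache_dir fallback in the constructor can be illustrated in isolation. The following is a minimal sketch, not part of the library: resolve_cache_dir is a hypothetical helper name, and the paths are illustrative only. It shows that any falsy value (None or an empty string) selects the default location under the user's home directory.

```python
from pathlib import Path
from typing import Optional


def resolve_cache_dir(cache_dir: Optional[str] = None) -> Path:
    """Hypothetical helper mirroring the constructor's cache_dir fallback."""
    # A non-empty string is used as given; None or "" falls back to the default.
    if cache_dir:
        return Path(cache_dir)
    return Path.home() / ".cache" / "webchorearena"


explicit = resolve_cache_dir("/tmp/wca-cache")  # illustrative path
default = resolve_cache_dir()
empty = resolve_cache_dir("")  # treated the same as None

print(explicit)
print(default)
```

Because the check is a truthiness test rather than a comparison with None, passing an empty string does not resolve to the current directory; it resolves to the default cache location.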

Functions

create_task_env

create_task_env(record: EvalRecord)

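Return a WebChoreArenaTaskEnv for the given record, built from record.metadata and the dataset's headless setting. Returns None if openjarvis.evals.execution.webchorearena_env cannot be imported.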

Source code in src/openjarvis/evals/datasets/webchorearena.py

def create_task_env(self, record: EvalRecord):
    """Return a WebChoreArenaTaskEnv for the given record."""
    try:
        from openjarvis.evals.execution.webchorearena_env import (
            WebChoreArenaTaskEnv,
        )

        return WebChoreArenaTaskEnv(record.metadata, headless=self._headless)
    except ImportError:
        return None
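
Since create_task_env may return None, callers need to handle a missing environment. The following is a rough sketch of the same optional-import pattern using only the standard library; load_optional and describe_env are hypothetical names, not part of openjarvis. Unlike the source above, where the constructor call also sits inside the try block, this variant catches ImportError around the import alone, so errors raised while building the object are not silently turned into None.

```python
import importlib
from typing import Any, Optional


def load_optional(module_name: str, attr: str) -> Optional[Any]:
    """Hypothetical helper: return module.attr, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The optional dependency is not installed; signal that with None.
        return None
    return getattr(module, attr, None)


def describe_env(env: Optional[Any]) -> str:
    """Hypothetical caller-side check for a missing task environment."""
    if env is None:
        return "skipped: environment backend not available"
    return "ready"


available = load_optional("json", "dumps")  # stdlib module, always present
missing = load_optional("no_such_module_for_this_sketch", "Anything")

print(describe_env(available))
print(describe_env(missing))
```

Whether a caller should skip the task or fail fast on None depends on the evaluation harness; the sketch only shows where that decision has to be made.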