webchorearena

WebChoreArena: realistic, tedious web-browsing tasks.

Evaluates web agents on 532 tasks across the Shopping, Shopping Admin, Reddit, GitLab, and Cross-site environments. The tasks test three capabilities: massive memory, calculation, and long-term memory.

Requires a running WebArena standalone environment (Shopping/Magento, Reddit/Postmill, GitLab, Shopping Admin). Task definitions are per-site JSON config files cloned from the original GitHub repository.
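Because every task needs the live sites, a pre-flight check before a run can save time. The sketch below is a minimal illustration that assumes each site's base URL is supplied through an environment variable named as in the original WebArena setup (`SHOPPING`, `SHOPPING_ADMIN`, `REDDIT`, `GITLAB`). Those variable names and the `missing_sites` helper are hypothetical; this page does not say how `WebChoreArenaDataset` locates the sites.

```python
import os
from typing import Dict, List, Mapping, Optional
from urllib.parse import urlparse

# Hypothetical site -> environment-variable mapping, modeled on the original
# WebArena setup; not confirmed to be what this dataset class reads.
SITE_ENV_VARS: Dict[str, str] = {
    "shopping": "SHOPPING",
    "shopping_admin": "SHOPPING_ADMIN",
    "reddit": "REDDIT",
    "gitlab": "GITLAB",
}


def missing_sites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return site keys whose base-URL variable is unset or not an http(s) URL."""
    env = os.environ if env is None else env
    missing: List[str] = []
    for site, var in SITE_ENV_VARS.items():
        parsed = urlparse(env.get(var, ""))
        # This sketch requires an explicit scheme and host,
        # e.g. "http://localhost:7770"; anything else counts as missing.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            missing.append(site)
    return missing
```

For example, `missing_sites({"SHOPPING": "http://localhost:7770"})` returns `["shopping_admin", "reddit", "gitlab"]`. The check only validates configuration; it does not confirm that the servers are actually reachable.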

Source: https://github.com/WebChoreArena/WebChoreArena

Classes

WebChoreArenaDataset

WebChoreArenaDataset(subset: str = 'all', cache_dir: Optional[str] = None, headless: bool = True)

Bases: DatasetProvider

WebChoreArena benchmark: interactive, browser-based web tasks.

Tasks are enumerated from the original GitHub repository's config_files/ JSON files. Each task requires a live WebArena standalone environment and Playwright for evaluation.
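Enumerating tasks from `config_files/` and narrowing them by `subset` could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the helper `load_task_configs` is hypothetical, and the schema (each JSON file holds one task object or a list of them, each with a WebArena-style `sites` list) is assumed rather than taken from the repository. The `"small"` subset is not modeled because this page does not define it.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_task_configs(config_dir: Path, subset: str = "all") -> List[Dict[str, Any]]:
    """Collect task dicts from every *.json file in config_dir, filtered by subset.

    Assumed schema: a file holds a task object or a list of them, and each
    task has a "sites" list. Illustrative only.
    """
    tasks: List[Dict[str, Any]] = []
    # Sort so enumeration order is deterministic across platforms.
    for path in sorted(config_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        tasks.extend(data if isinstance(data, list) else [data])
    if subset == "all":
        return tasks
    # Any other value is treated as a site name; a cross-site task is kept
    # if the requested site is one of the sites it touches.
    return [t for t in tasks if subset in t.get("sites", [])]
```

Keeping cross-site tasks under every site they touch is a design choice of this sketch; an implementation could equally restrict a site subset to single-site tasks.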

Source code in src/openjarvis/evals/datasets/webchorearena.py
```python
def __init__(
    self,
    subset: str = "all",
    cache_dir: Optional[str] = None,
    headless: bool = True,
) -> None:
    self._subset = subset  # "all", "small", or a site name
    # Use the given cache directory, else default to ~/.cache/webchorearena
    self._cache_dir = (
        Path(cache_dir) if cache_dir
        else Path.home() / ".cache" / "webchorearena"
    )
    self._headless = headless  # forwarded to each WebChoreArenaTaskEnv
    self._records: List[EvalRecord] = []  # task records; starts empty
```
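The cache-directory rule in `__init__` can be isolated into a small helper to make its edge cases explicit. `resolve_cache_dir` is a hypothetical name used only for illustration; its body mirrors the constructor's `Path(cache_dir) if cache_dir else ...` expression.

```python
from pathlib import Path
from typing import Optional


def resolve_cache_dir(cache_dir: Optional[str] = None) -> Path:
    """Mirror the constructor: explicit path if truthy, else ~/.cache/webchorearena."""
    # An empty string is falsy, so it also falls back to the default.
    return Path(cache_dir) if cache_dir else Path.home() / ".cache" / "webchorearena"
```

One consequence worth knowing: `Path` does not expand `~`, so `cache_dir="~/wca"` resolves to a relative directory literally named `~` under the current working directory. Passing an already expanded path (for example via `os.path.expanduser`) avoids that.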
Functions
create_task_env
create_task_env(record: EvalRecord)

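Return a WebChoreArenaTaskEnv for the given record, or None if the environment module cannot be imported (the ImportError is caught rather than raised).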

Source code in src/openjarvis/evals/datasets/webchorearena.py
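```python
def create_task_env(self, record: EvalRecord):
    """Return a WebChoreArenaTaskEnv for the given record."""
    try:
        # Lazy import: a missing execution module (or a dependency it
        # imports) raises ImportError, which is handled below.
        from openjarvis.evals.execution.webchorearena_env import (
            WebChoreArenaTaskEnv,
        )
        return WebChoreArenaTaskEnv(record.metadata, headless=self._headless)
    except ImportError:
        # Also reached if the constructor itself raises ImportError.
        return None
```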
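The lazy-import-with-`None`-fallback pattern used by `create_task_env` can be expressed generically. The sketch below is illustrative; `make_optional` is a hypothetical helper, not part of OpenJarvis. Unlike the original, it constructs the object outside the `try`, so an `ImportError` raised inside the constructor propagates instead of being silently turned into `None`.

```python
import importlib
from typing import Any, Optional


def make_optional(module_name: str, attr: str, *args: Any, **kwargs: Any) -> Optional[Any]:
    """Instantiate module_name.attr(*args, **kwargs), or return None if unavailable."""
    try:
        cls = getattr(importlib.import_module(module_name), attr)
    except (ImportError, AttributeError):
        # A missing module or missing name yields None, like the
        # `from ... import ...` failure handled in create_task_env.
        return None
    # Constructed outside the try so constructor errors are not masked.
    return cls(*args, **kwargs)
```

Callers of either function should check for `None` before running a task, e.g. skip or mark the record as not runnable when `dataset.create_task_env(record)` returns `None`.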