webchorearena
WebChoreArena: Realistic tedious web browsing tasks.
Evaluates web agents on 532 tasks across the Shopping, Shopping Admin, Reddit, GitLab, and Cross-site environments. The tasks stress three capabilities: massive memory (retaining large amounts of on-page information), calculation, and long-term memory (carrying information across pages).
Requires a running WebArena standalone environment (Shopping/Magento, Reddit/Postmill, GitLab, Shopping Admin). Task definitions are per-site JSON configs read from a clone of the original GitHub repository.
Source: https://github.com/WebChoreArena/WebChoreArena
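Because every task needs the live sites listed above, a preflight check can catch a missing deployment before an evaluation run starts. The sketch below is illustrative only: the variable names follow the upstream WebArena convention, and whether this provider reads the same names is an assumption. `missing_site_urls` is a hypothetical helper, not part of this module.

```python
import os

# Variable names follow the upstream WebArena convention; that this
# provider reads the same names is an assumption, not a documented fact.
REQUIRED_SITE_VARS = ("SHOPPING", "SHOPPING_ADMIN", "REDDIT", "GITLAB")


def missing_site_urls(env=None):
    """Return the names of site URL variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SITE_VARS if not env.get(name, "").strip()]
```

An empty return value means every expected site URL is configured; otherwise the list names the variables to set before launching the evaluation.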
Classes
WebChoreArenaDataset
Bases: DatasetProvider
WebChoreArena benchmark — interactive browser-based web tasks.
Tasks are enumerated from the original GitHub repository's
config_files/ JSON files. Each task requires a live WebArena
standalone environment and Playwright for evaluation.
Source code in src/openjarvis/evals/datasets/webchorearena.py
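The enumeration step described above can be pictured with a minimal sketch. It assumes a local clone whose config directory holds one JSON file per site, each containing either a list of task objects or a single task object; the real file layout and task schema may differ, and `enumerate_tasks` is a hypothetical helper rather than a method of this class.

```python
import json
from pathlib import Path


def enumerate_tasks(config_dir):
    """Yield (site, task) pairs from every *.json file in config_dir.

    The site name is taken from the file stem, which is an assumption
    about how the per-site configs are named.
    """
    for path in sorted(Path(config_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Accept either a list of tasks or a single task object.
        tasks = data if isinstance(data, list) else [data]
        for task in tasks:
            yield path.stem, task
```

Files are visited in sorted order so that task enumeration is stable across runs.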
Functions
create_task_env
create_task_env(record: EvalRecord)
Return a WebChoreArenaTaskEnv for the given record.