workarena
WorkArena++ enterprise workflow benchmark on ServiceNow.
Faithful integration of the original browsergym-workarena package. Tasks are Python classes that run against a live ServiceNow instance via BrowserGym / Playwright — NOT a static JSON dataset.
L1 = 33 atomic tasks (ICML 2024)
L2/L3 = 682 composite tasks (NeurIPS 2024)
Source: https://github.com/ServiceNow/WorkArena
Requires: pip install browsergym-workarena playwright==1.44.0
Classes
WorkArenaDataset
WorkArenaDataset(level: str = 'l2', n_seed_l1: int = 10, meta_seed: int = 42, headless: bool = True)
Bases: DatasetProvider
WorkArena++ benchmark using the native browsergym-workarena package.
Tasks are enumerated from the installed browsergym-workarena
package exactly as in the original benchmark. Each task class is
instantiated with a seed by BrowserGym at evaluation time. Scoring
uses the task's native validate() method against the live
ServiceNow instance — no LLM judge or text matching.
Source code in src/openjarvis/evals/datasets/workarena.py
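The validate()-based scoring described for WorkArenaDataset can be illustrated with a minimal sketch. It assumes the task's validate() method returns a `(reward, done, message, info)` tuple, as BrowserGym tasks generally do. `StubTask` and `score_with_validate` are hypothetical names used only for illustration, and the stub replaces the live ServiceNow check with a trivial rule.

```python
from typing import Any


class StubTask:
    """Hypothetical stand-in for a task class; a real task inspects the live instance."""

    def __init__(self, seed: int) -> None:
        self.seed = seed

    def validate(self, page: Any, chat_messages: list) -> tuple:
        # Illustrative rule only: treat even seeds as "instance state is correct".
        reward = 1.0 if self.seed % 2 == 0 else 0.0
        return reward, True, "", {}


def score_with_validate(task: Any, page: Any = None, chat_messages=None) -> bool:
    """Score an episode purely from the task's own validate() result."""
    reward, done, _message, _info = task.validate(page, chat_messages or [])
    # No LLM judge and no text matching: the task decides success.
    return bool(done and reward >= 1.0)
```

In a real run the `page` argument would be the Playwright page driven by BrowserGym, so the result reflects the actual state of the ServiceNow instance rather than the agent's textual output.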
Functions
create_task_env
create_task_env(record: EvalRecord)
Return a WorkArenaTaskEnv for the given record.
Source code in src/openjarvis/evals/datasets/workarena.py
verify_requirements
Check that all prerequisites for WorkArena evaluation are met.