workarena_scorer
Scorer for WorkArena++ enterprise workflow tasks.
Uses the native task.validate() reward from BrowserGym, which checks
the actual state of the ServiceNow instance via Playwright. No LLM
judging is involved; scoring is deterministic and based solely on environment validation.
This mirrors the TerminalBenchNativeScorer pattern: the WorkArenaTaskEnv
populates record.metadata["is_resolved"] and record.metadata["reward"]
after calling task.validate(), and this scorer reads those fields.
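
The environment side of this hand-off can be sketched as follows. This is a minimal illustration, not the actual WorkArenaTaskEnv code: `Record`, `StubTask`, and `store_validation` are hypothetical names, the four-element return value of `validate()` is an assumption about the BrowserGym task interface, and treating a positive reward as resolved is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Record:
    """Hypothetical stand-in for the framework's record type."""
    metadata: dict[str, Any] = field(default_factory=dict)


class StubTask:
    """Hypothetical stand-in for a BrowserGym task; no browser is involved."""

    def validate(self, page, chat_messages):
        # Assumed return shape: (reward, done, message, info).
        return 1.0, True, "", {}


def store_validation(record, task, page, chat_messages):
    """Run environment validation and copy the outcome into record.metadata."""
    reward, _done, _message, _info = task.validate(page, chat_messages)
    record.metadata["reward"] = float(reward)
    # Assumption: a task counts as resolved when its reward is positive.
    record.metadata["is_resolved"] = reward > 0
    return record


record = store_validation(Record(), StubTask(), page=None, chat_messages=[])
```

Keeping the validation call inside the environment means the scorer never needs a live browser session; it only needs the two metadata fields.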
Classes
WorkArenaScorer
Bases: Scorer
Environment-validated scorer for WorkArena++ tasks.
Reads is_resolved and reward from record.metadata. Both fields are
populated by WorkArenaTaskEnv.run_tests(), which calls the
original task.validate(page, chat_messages).
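
The read side might look roughly like the sketch below. `score_from_metadata` and the `score`/`passed` keys are hypothetical and do not reflect the real Scorer interface; treating missing fields as a failure is likewise an assumption, not documented behavior.

```python
from typing import Any, Mapping


def score_from_metadata(metadata: Mapping[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: turn environment-provided fields into a score."""
    # Assumption: absent fields mean validation never ran, so score as failure.
    reward = float(metadata.get("reward", 0.0))
    is_resolved = bool(metadata.get("is_resolved", False))
    return {"score": reward, "passed": is_resolved}


resolved = score_from_metadata({"is_resolved": True, "reward": 1.0})
missing = score_from_metadata({})
```

Because the function only reads values already present in the metadata, the same record always yields the same result.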