workarena_scorer

Scorer for WorkArena++ enterprise workflow tasks.

Uses the native task.validate() reward from BrowserGym, which checks the actual state of the ServiceNow instance via Playwright. No LLM judge is involved: scoring is deterministic and based entirely on environment validation.

This mirrors the TerminalBenchNativeScorer pattern: the WorkArenaTaskEnv populates record.metadata["is_resolved"] and record.metadata["reward"] after calling task.validate(), and this scorer reads those fields.
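The hand-off described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Record` dataclass, the `populate_from_validation` and `score_from_metadata` helpers, and the returned tuple shape are all hypothetical names chosen for the example. Only the two metadata keys, `is_resolved` and `reward`, come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for an eval record; only `metadata` matters here."""
    metadata: Dict[str, Any] = field(default_factory=dict)


def populate_from_validation(record: Record, reward: float, is_resolved: bool) -> None:
    """Environment side (assumed shape): store the outcome of task.validate()."""
    record.metadata["reward"] = float(reward)
    record.metadata["is_resolved"] = bool(is_resolved)


def score_from_metadata(record: Record) -> Tuple[Optional[bool], Dict[str, Any]]:
    """Scorer side (assumed shape): read the fields back, with no LLM call."""
    is_resolved = record.metadata.get("is_resolved")
    if is_resolved is None:
        # The environment never ran validation, so the record cannot be scored.
        return None, {"reason": "missing is_resolved"}
    return bool(is_resolved), {"reward": record.metadata.get("reward", 0.0)}


record = Record()
populate_from_validation(record, reward=1.0, is_resolved=True)
print(score_from_metadata(record))
```

Because the scorer only reads what the environment wrote, scoring the same record twice gives the same result, which is the property the deterministic design is after.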

Classes

WorkArenaScorer

WorkArenaScorer(judge_backend: object = None, judge_model: str = '')

Bases: Scorer

Environment-validated scorer for WorkArena++ tasks.

Reads is_resolved and reward from record.metadata. Both fields are populated by WorkArenaTaskEnv.run_tests(), which calls the original task.validate(page, chat_messages).

Source code in src/openjarvis/evals/scorers/workarena_scorer.py

def __init__(
    self,
    judge_backend: object = None,
    judge_model: str = "",
) -> None:
    self._judge_backend = judge_backend
    self._judge_model = judge_model