webchorearena_scorer

Scorer for WebChoreArena web chore tasks.

Uses the environment-validated scoring pattern (the same one WorkArenaScorer uses). WebChoreArenaTaskEnv runs the original WebArena evaluation harness, which multiplies the StringEvaluator, URLEvaluator, and HTMLContentEvaluator scores together, and populates record.metadata["is_resolved"] and record.metadata["reward"]. This scorer reads those fields.
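
As a rough illustration of the multiplicative combination described above, the sketch below multiplies per-evaluator scores so that any single failing evaluator zeroes the reward. The helper name combine_scores and the rule that is_resolved means a reward of exactly 1.0 are assumptions made for this example, not code taken from the task env.

```python
from math import prod
from typing import Iterable


def combine_scores(scores: Iterable[float]) -> float:
    """Multiply per-evaluator scores; a single 0.0 zeroes the result.

    Hypothetical helper for illustration only.
    """
    return prod(scores)


# Example: string and URL checks pass, the HTML content check fails.
string_score, url_score, html_score = 1.0, 1.0, 0.0
reward = combine_scores([string_score, url_score, html_score])

# Assumption: a task counts as resolved only when every evaluator passes.
is_resolved = reward == 1.0
```

Because the scores are multiplied rather than averaged, partial success on one evaluator cannot compensate for a failure on another.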

The evaluation harness inside the task env faithfully mirrors the original:

- StringEvaluator: exact_match, must_include (with |OR| support), fuzzy_match (LLM-judged via GPT-4o), ua_match (unachievable task detection)
- URLEvaluator: checks the browser's current page URL against reference URLs
- HTMLContentEvaluator: navigates to URLs, runs JS locators on the DOM, checks element content against expected values
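
To make the must_include check with |OR| support more concrete, here is a simplified sketch. It assumes case-insensitive substring matching and that each required entry may list alternatives separated by |OR|. The function name must_include_score is hypothetical, and the real harness may normalize or tokenize text differently.

```python
from typing import List


def must_include_score(prediction: str, required: List[str]) -> float:
    """Return 1.0 if every required entry is satisfied, else 0.0.

    An entry such as "usd |OR| dollars" is satisfied when any one of its
    alternatives appears in the prediction. Illustrative sketch only.
    """
    pred = prediction.lower()
    score = 1.0
    for entry in required:
        # Split the entry into its alternatives and normalize each one.
        alternatives = [alt.strip().lower() for alt in entry.split("|OR|")]
        # The entry passes if at least one alternative is a substring.
        score *= float(any(alt in pred for alt in alternatives))
    return score
```

For instance, must_include_score("The total is 42 USD", ["42", "usd |OR| dollars"]) passes both entries, while dropping "USD" from the prediction fails the second entry and zeroes the score.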

Classes

WebChoreArenaScorer

WebChoreArenaScorer(judge_backend: object = None, judge_model: str = '')

Bases: Scorer

Environment-validated scorer for WebChoreArena tasks.

Reads is_resolved and reward from record.metadata, populated by WebChoreArenaTaskEnv._run_evaluation() which runs the original WebArena evaluation harness against the live browser state.
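
The read side of this pattern can be sketched as follows. FakeRecord and read_env_verdict are hypothetical stand-ins written for this example. The actual record type, the Scorer base-class interface, and the behavior when a field is missing are not shown on this page, so the defaults below are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class FakeRecord:
    """Hypothetical stand-in for an eval record carrying a metadata dict."""
    metadata: Dict[str, Any] = field(default_factory=dict)


def read_env_verdict(record: FakeRecord) -> Tuple[bool, float]:
    """Read the verdict that the task env stored in record.metadata.

    Assumption: missing fields are treated as unresolved with zero reward.
    """
    is_resolved = bool(record.metadata.get("is_resolved", False))
    reward = float(record.metadata.get("reward", 0.0))
    return is_resolved, reward


# Example: the env marked the task as resolved with full reward.
verdict = read_env_verdict(FakeRecord({"is_resolved": True, "reward": 1.0}))
```

The point of the pattern is that the scorer does no evaluation of its own: the verdict was computed against the live browser state before the record reached the scorer.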

Source code in src/openjarvis/evals/scorers/webchorearena_scorer.py

def __init__(
    self,
    judge_backend: object = None,
    judge_model: str = "",
) -> None:
    self._judge_backend = judge_backend
    self._judge_model = judge_model