lifelong_agent_scorer

Scorer for LifelongAgentBench.

Reproduces the original evaluation methodology from: https://github.com/caixd-220529/LifelongAgentBench

When used with interactive environments (episode_mode=True), the TaskEnvironment handles scoring directly. This scorer serves as:

  1. Fallback for non-interactive (single-shot) evaluation — with loud warnings that results are degraded and not faithful to the original.
  2. Helper library for shared scoring functions used by both the environments and this scorer.

Three subset scoring strategies matching the original:

  1. db_bench: two modes, matching the original Task._complete():
     - direct (SELECT): Execute agent SQL, compare tuples with numeric tolerance (rel_tol=1e-6).
     - md5 (INSERT/UPDATE/DELETE): Execute SQL, compare full table state.
  2. knowledge_graph: Exact-set match + F1 score on answer entities, matching the original's calculate_metric().
  3. os_interaction: Docker evaluation with exit_code == 0.
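
The direct-mode tuple comparison and the knowledge_graph set metrics can be sketched as follows. This is an illustration only, not the scorer's actual code: the function names (values_match, rows_match, set_f1) are hypothetical, rows are assumed to be compared in order, and treating two empty answer sets as a perfect match is an assumption.

```python
import math
from typing import Iterable, Sequence, Tuple


def values_match(a: object, b: object, rel_tol: float = 1e-6) -> bool:
    """Compare two cells: numbers use a relative tolerance, everything else uses ==."""
    numeric = (int, float)
    both_numeric = isinstance(a, numeric) and isinstance(b, numeric)
    has_bool = isinstance(a, bool) or isinstance(b, bool)
    if both_numeric and not has_bool:
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def rows_match(got: Sequence[Tuple], expected: Sequence[Tuple]) -> bool:
    """Hypothetical direct-mode check: same row count, same width, close cells."""
    if len(got) != len(expected):
        return False
    return all(
        len(g) == len(e) and all(values_match(x, y) for x, y in zip(g, e))
        for g, e in zip(got, expected)
    )


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[bool, float]:
    """Return (exact_set_match, F1) over two collections of answer entities."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return True, 1.0  # assumption: two empty answer sets count as a match
    overlap = len(pred & ref)
    if overlap == 0:
        return False, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return pred == ref, 2 * precision * recall / (precision + recall)
```

With these definitions, a result row such as (1, 2.0000001) matches an expected (1, 2.0) under the 1e-6 relative tolerance, while predicting {"a", "b"} against gold {"b", "c"} is not an exact match and scores F1 = 0.5.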

IMPORTANT: Single-shot scoring is DEGRADED and will always emit warnings. The original benchmark is multi-turn interactive. Use episode_mode=True with the jarvis-agent backend for faithful evaluation.

Classes

LifelongAgentScorer

LifelongAgentScorer(judge_backend: Any = None, judge_model: str = '')

Bases: Scorer

Scorer for LifelongAgentBench.

When used with interactive environments (episode_mode), the environment handles scoring and this scorer is bypassed. For single-shot mode, this scorer provides degraded fallback scoring with clear warnings.

Constructor accepts (judge_backend, judge_model) for CLI compatibility but does not use them — all scoring is deterministic.

Source code in src/openjarvis/evals/scorers/lifelong_agent_scorer.py
def __init__(self, judge_backend: Any = None, judge_model: str = "") -> None:
    pass

Functions

build_db

build_db(table_info: Dict[str, Any]) -> Connection

Build an in-memory SQLite DB from table_info.

Note: The original uses MySQL Docker containers. SQLite is used as a portable fallback. MySQL-specific features (e.g. backtick quoting, GROUP_CONCAT, MD5()) will not be available.
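
The shape of table_info is not spelled out on this page. Judging from the keys the function reads (name, column_info_list, row_list), an input might look like the sketch below, which also mirrors the type normalization and row padding steps. TYPE_MAP here is a hypothetical stand-in for the module's private _TYPE_MAP, whose contents are not shown, and the sample table is made up for illustration.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the module's private _TYPE_MAP (real contents not shown).
TYPE_MAP = {"INT": "INTEGER", "VARCHAR": "TEXT", "FLOAT": "REAL"}

# Assumed input shape, inferred from the keys build_db reads.
table_info: Dict[str, Any] = {
    "name": "employees",
    "column_info_list": [
        {"name": "id", "type": "INT"},
        {"name": "name", "type": "varchar(64)"},
        {"name": "salary", "type": "FLOAT"},
    ],
    "row_list": [
        [1, "Ada", 5200.0],
        [2, "Linus"],                 # short row: padded with None
        [3, "Grace", 6100.0, "x"],    # long row: truncated to three columns
    ],
}


def sqlite_type(raw_type: str) -> str:
    """Drop any length suffix, upper-case the name, and fall back to TEXT."""
    base = raw_type.split("(")[0].strip().upper()
    return TYPE_MAP.get(base, "TEXT")


def fit_row(row: List[Any], ncols: int) -> List[Any]:
    """Truncate or None-pad a row so it holds exactly ncols values."""
    padded = list(row[:ncols])
    padded.extend([None] * (ncols - len(padded)))
    return padded


ncols = len(table_info["column_info_list"])
column_types = [sqlite_type(c["type"]) for c in table_info["column_info_list"]]
fitted_rows = [fit_row(r, ncols) for r in table_info["row_list"]]
```

Any type name missing from the map falls back to TEXT, so an unrecognized MySQL type does not stop the table from being created.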

Source code in src/openjarvis/evals/scorers/lifelong_agent_scorer.py
def build_db(table_info: Dict[str, Any]) -> sqlite3.Connection:
    """Build an in-memory SQLite DB from table_info.

    Note: The original uses MySQL Docker containers.  SQLite is used
    as a portable fallback.  MySQL-specific features (e.g. backtick
    quoting, GROUP_CONCAT, MD5()) will not be available.
    """
    conn = sqlite3.connect(":memory:")
    table_name = table_info.get("name", "data")
    columns = table_info.get("column_info_list", [])
    rows = table_info.get("row_list", [])

    col_defs = []
    for col in columns:
        raw_type = col.get("type", "TEXT")
        base = raw_type.split("(")[0].strip().upper()
        stype = _TYPE_MAP.get(base, "TEXT")
        col_defs.append(f'"{col.get("name", "col")}" {stype}')

    if not col_defs:
        col_defs = ['"value" TEXT']

    conn.execute(f'CREATE TABLE "{table_name}" ({", ".join(col_defs)})')

    if rows and columns:
        ncols = len(columns)
        ph = ", ".join(["?"] * ncols)
        for row_idx, row in enumerate(rows):
            padded = list(row[:ncols])
            while len(padded) < ncols:
                padded.append(None)
            try:
                conn.execute(
                    f'INSERT INTO "{table_name}" VALUES ({ph})', padded,
                )
            except sqlite3.Error as exc:
                logger.debug(
                    "Skipping row %d in table %s: %s", row_idx, table_name, exc,
                )

    conn.commit()
    return conn
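
A connection like the one returned here can back the md5-style comparison of full table state used for INSERT/UPDATE/DELETE tasks. The sketch below is a rough illustration under stated assumptions, not the scorer's implementation: make_db, table_digest, and same_final_state are hypothetical names, the small items table stands in for a database produced by build_db, and sorting rows before hashing (so row order does not matter) is an assumption about how state is canonicalized.

```python
import hashlib
import sqlite3
from typing import List, Tuple


def make_db() -> sqlite3.Connection:
    """Small in-memory table standing in for a connection from build_db."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "items" ("id" INTEGER, "qty" INTEGER)')
    conn.executemany('INSERT INTO "items" VALUES (?, ?)', [(1, 5), (2, 7)])
    conn.commit()
    return conn


def table_digest(conn: sqlite3.Connection, table: str) -> str:
    """Hash the full table contents, independent of row order."""
    rows: List[Tuple] = conn.execute(f'SELECT * FROM "{table}"').fetchall()
    canonical = repr(sorted(rows, key=repr))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def same_final_state(agent_sql: str, reference_sql: str, table: str = "items") -> bool:
    """Run each statement on its own fresh copy and compare the resulting tables."""
    agent_db, ref_db = make_db(), make_db()
    try:
        try:
            agent_db.execute(agent_sql)
        except sqlite3.Error:
            return False  # assumption: SQL that fails to run counts as incorrect
        ref_db.execute(reference_sql)
        return table_digest(agent_db, table) == table_digest(ref_db, table)
    finally:
        agent_db.close()
        ref_db.close()
```

Two differently worded statements that leave the table in the same state compare equal, which is the point of checking state rather than SQL text.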