lifelong_agent_scorer
Scorer for LifelongAgentBench.
Reproduces the original evaluation methodology from: https://github.com/caixd-220529/LifelongAgentBench
When used with interactive environments (episode_mode=True), the TaskEnvironment handles scoring directly. This scorer serves as:
- Fallback for non-interactive (single-shot) evaluation — with loud warnings that results are degraded and not faithful to the original.
- Helper library for shared scoring functions used by both the environments and this scorer.
Three subset scoring strategies matching the original:
- db_bench (two modes matching the original Task._complete()):
  - direct (SELECT): Execute agent SQL, compare tuples with numeric tolerance (rel_tol=1e-6).
  - md5 (INSERT/UPDATE/DELETE): Execute SQL, compare full table state.
- knowledge_graph: Exact-set match + F1 score on answer entities, matching the original's calculate_metric().
- os_interaction: Docker evaluation with exit_code == 0.
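The db_bench direct comparison and the knowledge_graph metric can be sketched roughly as follows. This is a minimal illustration, not the scorer's actual code: the function names `rows_match` and `set_match_and_f1` are hypothetical, and the sketch assumes that rows are compared in order and that F1 is computed over sets of answer entities.

```python
import math
from typing import Any, Iterable, Sequence, Tuple


def rows_match(expected: Sequence[Sequence[Any]],
               actual: Sequence[Sequence[Any]],
               rel_tol: float = 1e-6) -> bool:
    """Compare two result sets cell by cell (hypothetical helper).

    Numeric cells are compared with a relative tolerance; all other
    cells must be equal. Row order is assumed to matter in this sketch.
    """
    if len(expected) != len(actual):
        return False
    for exp_row, act_row in zip(expected, actual):
        if len(exp_row) != len(act_row):
            return False
        for e, a in zip(exp_row, act_row):
            both_numeric = (
                isinstance(e, (int, float)) and isinstance(a, (int, float))
                and not isinstance(e, bool) and not isinstance(a, bool)
            )
            if both_numeric:
                if not math.isclose(e, a, rel_tol=rel_tol):
                    return False
            elif e != a:
                return False
    return True


def set_match_and_f1(gold: Iterable[str],
                     predicted: Iterable[str]) -> Tuple[bool, float]:
    """Return (exact set match, F1) over answer entities (hypothetical helper).

    When there is no overlap (including two empty sets), this sketch
    reports an F1 of 0.0; the original may treat that case differently.
    """
    gold_set, pred_set = set(gold), set(predicted)
    exact = gold_set == pred_set
    overlap = len(gold_set & pred_set)
    if overlap == 0:
        return exact, 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return exact, 2 * precision * recall / (precision + recall)
```

For example, `rows_match([(1, 2.0)], [(1, 2.0000000001)])` passes because the difference falls within the relative tolerance, while a prediction that shares one of two gold entities and adds one wrong entity yields an F1 of 0.5 without an exact match.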
IMPORTANT: Single-shot scoring is DEGRADED and will always emit warnings.
The original benchmark is multi-turn interactive. Use episode_mode=True
with the jarvis-agent backend for faithful evaluation.
Classes
LifelongAgentScorer
Bases: Scorer
Scorer for LifelongAgentBench.
When used with interactive environments (episode_mode), the environment handles scoring and this scorer is bypassed. For single-shot mode, this scorer provides degraded fallback scoring with clear warnings.
The constructor accepts (judge_backend, judge_model) for CLI compatibility but does not use them; all scoring is deterministic.
Source code in src/openjarvis/evals/scorers/lifelong_agent_scorer.py
Functions
build_db
Build an in-memory SQLite DB from table_info.
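Note: The original uses MySQL Docker containers. SQLite is used as a portable fallback. Some MySQL-specific features will be unavailable or behave differently: MD5() is not built in, and GROUP_CONCAT does not support MySQL's SEPARATOR / ORDER BY syntax. SQLite does accept backtick-quoted identifiers for MySQL compatibility.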
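As a rough illustration of the idea, the sketch below builds an in-memory SQLite database and fingerprints a table's full contents, which is one way an md5-style state comparison could work. Everything here is an assumption made for the example: the `table_info` layout (`name`, `columns`, `rows`), the function names `build_db_sketch` and `table_state_digest`, and the choice of hashing sorted rows are hypothetical and may not match the real `build_db`.

```python
import hashlib
import sqlite3
from typing import Any, Dict


def build_db_sketch(table_info: Dict[str, Any]) -> sqlite3.Connection:
    """Create an in-memory SQLite table from an assumed table_info layout.

    Assumed (hypothetical) shape:
      {"name": str,
       "columns": [{"name": str, "type": str}, ...],
       "rows": [[value, ...], ...]}
    """
    conn = sqlite3.connect(":memory:")
    table = table_info["name"]
    columns = table_info["columns"]
    # Double quotes are standard SQL identifier quoting in SQLite.
    col_defs = ", ".join(f'"{c["name"]}" {c["type"]}' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        table_info["rows"],
    )
    conn.commit()
    return conn


def table_state_digest(conn: sqlite3.Connection, table: str) -> str:
    """Hash the full table contents, independent of row order."""
    rows = conn.execute(f'SELECT * FROM "{table}"').fetchall()
    canonical = repr(sorted(rows, key=repr))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Under these assumptions, a mutation check would build one database for the agent's SQL and one for the reference SQL, run each statement, and compare the two digests. The digest is computed in Python because SQLite has no built-in MD5() function.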