Evaluations¶
The OpenJarvis evaluation framework (openjarvis.evals) measures model correctness and accuracy on academic datasets. It ships inside the main openjarvis package (at src/openjarvis/evals/) and is designed specifically for research workflows where you need reproducible, dataset-driven quality assessments.
Evals vs. Benchmarks
OpenJarvis has two distinct measurement systems that complement each other:
| System | Module | Measures | Entry Point |
|---|---|---|---|
| Evaluations | openjarvis.evals | Correctness on academic datasets (accuracy, pass rate) | jarvis eval |
| Benchmarks | openjarvis.bench | Engine performance (latency, throughput) | jarvis bench |
Use evaluations to answer "does this model get the right answer?" and benchmarks to answer "how fast does this model respond?". See the Benchmarks guide for the performance measurement system.
Tip: LLM-guided spec search uses this same eval infrastructure to gate edits against your personal benchmark. See LLM-guided spec search.
Installation¶
The evaluation framework is part of the main openjarvis package, so no separate install or extra is required. The standard dev setup (uv sync --extra dev) is enough.
The framework's core dependencies (click, datasets, rich) are base dependencies of openjarvis. Two optional extras enable experiment tracking integrations:
uv sync --extra dev --extra eval-wandb # Weights & Biases run tracking
uv sync --extra dev --extra eval-sheets # Google Sheets results export
Python version requirement
Python 3.10 has no built-in TOML parser (tomllib joined the standard library in 3.11), so it requires the tomli package for TOML config parsing. openjarvis declares it as a conditional dependency, so it is installed automatically.
Entry Points¶
Two entry points expose the framework. They share list and run but differ in their other subcommands:
| Command | Surface |
|---|---|
| jarvis eval {list,run,compare,report} | Canonical CLI. run covers the common options; compare and report post-process result files. |
| python -m openjarvis.evals {list,run,run-all,summarize,reparse-judge} | Full research surface, including judge configuration, the agentic runner, and episode mode. |
The openjarvis-eval console script is an alias for python -m openjarvis.evals — same commands, same options. This guide uses jarvis eval wherever its option set suffices and the module form for research-only options.
Datasets¶
The framework ships with 40 registered benchmarks covering academic reasoning, agentic tasks, coding, retrieval, conversation quality, and practical use-case benchmarks. Datasets are grouped by category below; uv run python -m openjarvis.evals list prints the authoritative registry.
Use-Case Benchmarks¶
These benchmarks evaluate models on practical tasks that mirror real OpenJarvis use cases.
| Dataset | Key | Description |
|---|---|---|
| CodingAssistant | coding_assistant | Bug-fix coding assistant (test-based) |
| SecurityScanner | security_scanner | Security vulnerability scanner |
| DailyDigest | daily_digest | Daily briefing generation |
| DocQA | doc_qa | Document-grounded QA with citations |
| BrowserAssistant | browser_assistant | Web research with fact verification |
| EmailTriage | email_triage | Email triage classification + draft |
| MorningBrief | morning_brief | Morning briefing generation |
| ResearchMining | research_mining | Research synthesis + accuracy |
| KnowledgeBase | knowledge_base | Document-grounded retrieval QA |
| CodingTask | coding_task | Function-level code generation |
Academic Benchmarks¶
These benchmarks measure reasoning and knowledge on established academic datasets.
| Dataset | Key | Category | Description |
|---|---|---|---|
| SuperGPQA | supergpqa | reasoning | Graduate-level multiple-choice across scientific disciplines |
| GPQA | gpqa | reasoning | Graduate-level MCQ (Diamond, Extended, Main variants) |
| MMLU-Pro | mmlu-pro | reasoning | Enhanced MMLU multiple-choice |
| MATH-500 | math500 | reasoning | Competition-level math problems |
| NaturalReasoning | natural-reasoning | reasoning | Natural language reasoning |
| HLE | hle | reasoning | Humanity's Last Exam hard challenges |
| LiveResearchBench | liveresearchbench | reasoning | Recent research comprehension (Salesforce) |
| SimpleQA | simpleqa | chat | Short-form factual question answering |
| IPW | ipw | chat | Intelligence Per Watt mixed benchmark |
Agent Benchmarks¶
These benchmarks test multi-step agent capabilities including tool use, code generation, and long-horizon planning.
| Dataset | Key | Category | Description |
|---|---|---|---|
| GAIA | gaia | agentic | Multi-step tasks with file I/O, calculations, web lookup |
| SWE-bench | swebench | agentic | Real-world GitHub code patches |
| SWEfficiency | swefficiency | agentic | Software optimization tasks |
| TerminalBench | terminalbench | agentic | Terminal-based task completion |
| TerminalBench Native | terminalbench-native | agentic | TerminalBench with native Docker execution |
| TerminalBench V2.1 | terminalbench-v2.1 | agentic | TB v2.1 Harbor-style Docker tasks |
| PinchBench | pinchbench | agentic | Real-world agent tasks |
| TauBench | taubench | agentic | Multi-turn customer service |
| DeepResearchBench | liveresearch | agentic | Deep research report generation |
| DeepResearchBench (alias) | deepresearch | agentic | Same benchmark as liveresearch |
| ToolCall-15 | toolcall15 | agentic | Tool calling benchmark |
| LifelongAgent | lifelong-agent | agentic | Sequential task learning across sessions |
| PaperArena | paperarena | agentic | Scientific paper analysis |
| DeepPlanning | deepplanning | agentic | Shopping constraint planning |
| LogHub | loghub | agentic | Log anomaly detection |
| AMA-Bench | ama-bench | agentic | Agent memory assessment |
| WebChoreArena | webchorearena | agentic | Web chore tasks |
| WorkArena | workarena | agentic | WorkArena++ enterprise workflows |
Both liveresearch and deepresearch are registered keys for the DeepResearchBench report-generation benchmark.
Coding Benchmarks¶
| Dataset | Key | Category | Description |
|---|---|---|---|
| LiveCodeBench | livecodebench | coding | Competitive programming |
Retrieval Benchmarks¶
| Dataset | Key | Category | Description |
|---|---|---|---|
| FRAMES | frames | rag | Multi-hop factual retrieval across Wikipedia articles |
Conversation Benchmarks¶
| Dataset | Key | Category | Description |
|---|---|---|---|
| WildChat | wildchat | chat | Real user conversation quality (pairwise LLM judge) |
Dataset Details¶
SuperGPQA is a large-scale multiple-choice benchmark spanning graduate-level questions across scientific disciplines. Each sample has a question, a set of lettered options, and a reference answer letter.
GAIA is an agentic benchmark requiring models to complete multi-step tasks that may involve file reading, calculations, and web lookup. Questions are drawn from the 2023 GAIA challenge set.
FRAMES tests multi-hop factual retrieval. Each question requires synthesizing information across multiple Wikipedia articles, making it a strong probe of retrieval-augmented generation capability.
WildChat uses real user conversations filtered to English single-turn exchanges. The reference answer is the original assistant response from the dataset; the model under evaluation is compared against it by an LLM judge.
GAIA dataset access
The GAIA dataset requires a HuggingFace account and acceptance of the dataset's terms of use. The loader downloads the full dataset snapshot on first use and caches it at ~/.cache/gaia_benchmark/. Subsequent runs use the local cache.
Use-Case Eval Configs¶
The framework includes two pre-built configs for evaluating models on the five core use-case benchmarks (coding_assistant, security_scanner, daily_digest, doc_qa, browser_assistant).
Cloud models¶
This config evaluates 6 cloud models (Claude Opus 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3.1 Flash Lite, GPT-5.4, GPT-5 Mini) against all 5 use-case benchmarks with 30 samples each, producing a 6x5 = 30-run matrix. Results are written to results/use-cases-v2-cloud/.
Local models¶
This config evaluates 5 local models via Ollama (Qwen3.5 122B-A10B, GPT-OSS 120B, GLM4, Qwen3.5 35B-A3B, GLM-4.7-Flash) against the same 5 benchmarks, producing a 5x5 = 25-run matrix. It uses 2 workers, which suits single-GPU setups. Results are written to results/use-cases-v2-local/.
Customizing use-case evals
Copy one of the use_case_v2_*.toml configs and modify the [[models]] entries to evaluate your own models. The five use-case benchmarks use synthetic datasets (no HuggingFace download required) and run quickly with 30 samples each.
Inference Backends¶
Every evaluation run routes model calls through one of four backends:
| Backend | Key | Description |
|---|---|---|
| jarvis-direct | jarvis-direct | Engine-level inference via SystemBuilder. Works for local (Ollama, vLLM, llama.cpp) and cloud models. |
| jarvis-agent | jarvis-agent | Agent-level inference with tool calling. Uses JarvisSystem.ask() with the specified agent and tools. |
| hermes | hermes | Real Hermes Agent (Nous Research) via subprocess. Requires --base-url and --api-key. |
| openclaw | openclaw | Real OpenClaw via Node subprocess. Requires --base-url and --api-key. |
Use jarvis-direct for most evaluations. Use jarvis-agent when the benchmark requires tool use — for example, GAIA tasks that reference files that must be read with file_read, or arithmetic tasks that benefit from calculator.
The hermes and openclaw backends shell out to external agent frameworks and need an OpenAI-compatible endpoint for their model calls: pass --base-url/--api-key, set the JARVIS_BACKEND_BASE_URL/JARVIS_BACKEND_API_KEY environment variables, or add a [backend.external] section to your config (see Config Reference).
TerminalBench Native
jarvis eval run --backend additionally accepts terminalbench-native, a Docker-based execution backend used by the TerminalBench Native benchmark.
CLI Usage¶
List available benchmarks and backends¶
uv run python -m openjarvis.evals list
Abridged output (40 benchmarks, 4 backends):
Available Benchmarks
┌──────────────────────┬───────────┬───────────────────────────────────┐
│ Name                 │ Category  │ Description                       │
├──────────────────────┼───────────┼───────────────────────────────────┤
│ supergpqa            │ reasoning │ SuperGPQA multiple-choice         │
│ gpqa                 │ reasoning │ GPQA graduate-level MCQ           │
│ ...                  │ ...       │ ...                               │
│ livecodebench        │ coding    │ LiveCodeBench competitive progr.  │
│ toolcall15           │ agentic   │ ToolCall-15 tool calling benchmark│
└──────────────────────┴───────────┴───────────────────────────────────┘
Available Backends
┌───────────────┬──────────────────────────────────────────────────┐
│ jarvis-direct │ Engine-level inference (local or cloud)          │
│ jarvis-agent  │ Agent-level inference with tool calling          │
│ hermes        │ Real Hermes Agent (Nous Research) via subprocess │
│ openclaw      │ Real OpenClaw via Node subprocess                │
└───────────────┴──────────────────────────────────────────────────┘
jarvis eval list prints a similar table but currently shows a curated subset of the registry; the module form above is the authoritative listing.
Run a single benchmark¶
# Evaluate qwen3:8b on SuperGPQA (engine-level, 10 samples)
uv run jarvis eval run -b supergpqa -m qwen3:8b -n 10
# Evaluate GPT-5 Mini on GAIA using the agent backend with tools
uv run jarvis eval run -b gaia -m gpt-5-mini --backend jarvis-agent \
--agent orchestrator --tools calculator,file_read -n 50
# Run FRAMES with the vLLM engine, write output to a file
uv run jarvis eval run -b frames -m llama3:70b -e vllm \
-o results/frames_llama70b.jsonl
# Run WildChat with a higher temperature for chat quality
uv run jarvis eval run -b wildchat -m qwen3:8b --temperature 0.7 -n 100
jarvis eval run option reference¶
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --config | -c | path | — | TOML config file; when provided, -b and -m are not required |
| --benchmark | -b | str | required* | Any registered benchmark key (see ... list) |
| --model | -m | str | required* | Model identifier (e.g., qwen3:8b, gpt-5-mini) |
| --max-samples | -n | int | all | Limit the number of samples evaluated |
| --backend | | choice | jarvis-direct | jarvis-direct, jarvis-agent, hermes, openclaw, or terminalbench-native |
| --base-url | | str | — | OpenAI-compatible endpoint URL (env: JARVIS_BACKEND_BASE_URL) |
| --api-key | | str | — | API key for the endpoint (env: JARVIS_BACKEND_API_KEY) |
| --agent | | str | — | Agent name for jarvis-agent backend (e.g., orchestrator) |
| --engine | -e | str | auto | Engine key (ollama, vllm, cloud, ...) |
| --tools | | str | "" | Comma-separated tool names (e.g., calculator,file_read) |
| --telemetry/--no-telemetry | | flag | off | Enable telemetry collection during eval |
| --gpu-metrics/--no-gpu-metrics | | flag | off | Enable GPU metric polling |
| --seed | | int | 42 | Random seed for dataset shuffling |
| --temperature | | float | 0.0 | Generation temperature |
| --max-tokens | | int | 2048 | Maximum output tokens |
| --model-filter | | str | — | Filter models by name substring (multi-model configs) |
| --output | -o | path | auto-generated | Output JSONL file path |
| --wandb-project / --wandb-entity / --wandb-tags / --wandb-group | | str | "" | Weights & Biases tracking (requires eval-wandb extra) |
| --sheets-id / --sheets-worksheet / --sheets-creds | | str | "" | Google Sheets export (requires eval-sheets extra) |
| --verbose | -v | flag | off | Enable debug logging |
*Required when --config is not provided.
Research-only options (python -m openjarvis.evals run)¶
The module CLI accepts everything above plus research-grade options that jarvis eval run does not expose:
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --max-workers | -w | int | 4 | Parallel evaluation workers |
| --judge-model | | str | gpt-5-mini-2025-08-07 | LLM used for judge-based scoring (see --help for the current default) |
| --judge-engine | | str | cloud | Engine key for the LLM judge; use vllm to judge locally |
| --split | | str | dataset default | Override the dataset split |
| --compact | | flag | off | Dense single-table output |
| --trace-detail | | flag | off | Full per-step trace listing |
| --agentic | | flag | off | Use AgenticRunner for multi-turn agent execution |
| --episode-mode | | flag | off | Sequential episode processing with lifelong learning (required for lifelong-agent and similar benchmarks) |
| --concurrency | | int | 1 | Parallel query execution (AgenticRunner only) |
| --query-timeout | | float | — | Per-query wall-clock timeout in seconds (AgenticRunner only) |
Note: the module CLI's --backend choice covers jarvis-direct, jarvis-agent, hermes, and openclaw; terminalbench-native as a backend is available via jarvis eval run and TOML configs.
Run all benchmarks at once¶
The run-all command (module CLI only) evaluates a single model against every registered benchmark sequentially and writes results to an output directory:
uv run python -m openjarvis.evals run-all -m qwen3:8b
# With options
uv run python -m openjarvis.evals run-all -m gpt-5-mini -n 100 --output-dir results/gpt5mini/
Output files are written as {output_dir}/{benchmark}_{model-slug}.jsonl. The model slug replaces / and : with -, so qwen3:8b becomes qwen3-8b.
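The naming rule above can be reproduced in a few lines of Python, which is handy when a script needs to locate result files after a run. This is a minimal sketch: model_slug and result_path are hypothetical helper names, not functions exported by openjarvis.evals.

```python
def model_slug(model: str) -> str:
    """Replace '/' and ':' with '-' as described in the text above."""
    return model.replace("/", "-").replace(":", "-")


def result_path(output_dir: str, benchmark: str, model: str) -> str:
    """Hypothetical helper: build {output_dir}/{benchmark}_{model-slug}.jsonl."""
    return f"{output_dir.rstrip('/')}/{benchmark}_{model_slug(model)}.jsonl"


print(result_path("results/gpt5mini/", "supergpqa", "qwen3:8b"))
```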
Summarize results¶
After a run, inspect a JSONL results file:
Output:
File: results/supergpqa_qwen3-8b.jsonl
Benchmark: supergpqa
Model: qwen3:8b
Total: 200
Scored: 198
Correct: 143
Accuracy: 0.7222
Errors: 2
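The counters in this output can be recomputed from the raw JSONL records, which is useful for sanity checks in notebooks. The sketch below is illustrative only: the summarize function is a hypothetical stand-in, it assumes accuracy is correct / scored (consistent with 143 / 198 = 0.7222 above), and it assumes a sample counts as an error when its error field is non-null.

```python
import json


def summarize(lines):
    """Aggregate JSONL result lines into total/scored/correct/accuracy/errors."""
    records = [json.loads(line) for line in lines if line.strip()]
    # A record is "scored" when is_correct is true or false (not null).
    scored = [r for r in records if r.get("is_correct") is not None]
    correct = sum(1 for r in scored if r["is_correct"])
    return {
        "total": len(records),
        "scored": len(scored),
        "correct": correct,
        # Assumption: accuracy is computed over scored samples only.
        "accuracy": round(correct / len(scored), 4) if scored else 0.0,
        # Assumption: any non-null error field marks a failed sample.
        "errors": sum(1 for r in records if r.get("error")),
    }


sample_lines = [
    '{"record_id": "a", "is_correct": true, "error": null}',
    '{"record_id": "b", "is_correct": false, "error": null}',
    '{"record_id": "c", "is_correct": null, "error": "timeout"}',
]
print(summarize(sample_lines))
```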
The module CLI also provides reparse-judge, which re-parses stored judge output in a results file and recovers records whose judge verdicts initially failed to parse — useful after improving the judge-output parser without re-running inference.
Compare and report¶
jarvis eval adds two post-processing commands for result files:
# Side-by-side metric comparison across runs
uv run jarvis eval compare results/supergpqa_qwen3-8b.jsonl results/supergpqa_gpt-5-mini.jsonl
# Detailed report (accuracy, latency, cost, per-subject breakdown) for one run
uv run jarvis eval report results/supergpqa_qwen3-8b.jsonl
Evaluating an Already-Running Endpoint¶
If you already have an OpenAI-compatible server running — jarvis serve, vLLM, SGLang, llama.cpp's server, or a hosted endpoint — point an eval directly at it with --base-url and --api-key:
# A vLLM server is already serving Qwen/Qwen3-8B on a GPU node:
# vllm serve Qwen/Qwen3-8B --port 8000
uv run jarvis eval run -b supergpqa -m Qwen/Qwen3-8B \
--base-url http://gpu-node:8000/v1 \
--api-key local-key \
-n 50
The -m value must match a model id the server reports at GET /v1/models. Both flags fall back to the JARVIS_BACKEND_BASE_URL and JARVIS_BACKEND_API_KEY environment variables, so CI jobs can set them once:
export JARVIS_BACKEND_BASE_URL=http://gpu-node:8000/v1
export JARVIS_BACKEND_API_KEY=local-key
uv run jarvis eval run -b gaia -m Qwen/Qwen3-8B --backend jarvis-agent -n 25
For the external hermes and openclaw backends these values are required (the foreign frameworks need an endpoint to send model calls to).
Engine-level alternative for vLLM
The vLLM engine also honors the VLLM_HOST environment variable (default http://localhost:8000):
VLLM_HOST=http://gpu-node:8000 uv run python -m openjarvis.evals run \
-b supergpqa -m Qwen/Qwen3-8B -e vllm -n 50
VLLM_HOST is process-global — if the candidate and the judge both use the vllm engine, they share the same endpoint. Prefer --base-url when you need them separate.
TOML Config System¶
For research workflows that compare multiple models across multiple benchmarks, use a TOML config file to define the evaluation as a models x benchmarks matrix. This is the recommended approach for systematic evaluations.
Running from a config¶
When --config is provided, the -b/--benchmark and -m/--model options are not required. All settings come from the config file. The CLI expands the matrix, prints a progress table, and writes results to the configured output_dir.
Config file format¶
A config file has six main sections: [meta], [defaults], [judge], [run], [[models]], and [[benchmarks]], plus an optional [backend.external] section for the hermes and openclaw backends. Only [[models]] and [[benchmarks]] are required; all other sections are optional and fall back to built-in defaults.
# Suite-level metadata (optional)
[meta]
name = "full-suite-v1"
description = "Evaluate all benchmarks against production models"
# Default generation parameters (optional)
[defaults]
temperature = 0.0
max_tokens = 2048
# LLM judge configuration (optional)
[judge]
model = "gpt-4o"
temperature = 0.0
max_tokens = 1024
# Execution settings (optional)
[run]
max_workers = 4
output_dir = "results/"
seed = 42
# --- Models (one [[models]] block per model) ---
[[models]]
name = "qwen3:8b"
engine = "ollama"
temperature = 0.3 # overrides [defaults] for this model
max_tokens = 4096
[[models]]
name = "gpt-4o"
provider = "openai" # uses cloud engine
[[models]]
name = "llama3:70b"
engine = "vllm"
temperature = 0.1
# --- Benchmarks (one [[benchmarks]] block per benchmark) ---
[[benchmarks]]
name = "supergpqa"
backend = "jarvis-direct"
max_samples = 200
split = "train"
[[benchmarks]]
name = "gaia"
backend = "jarvis-agent"
agent = "orchestrator"
tools = ["file_read", "calculator"]
max_samples = 50
judge_model = "claude-sonnet-4-20250514" # override judge for this benchmark
[[benchmarks]]
name = "frames"
backend = "jarvis-direct"
max_samples = 100
[[benchmarks]]
name = "wildchat"
backend = "jarvis-direct"
max_samples = 150
temperature = 0.7 # override temperature for this benchmark
This config produces 3 models x 4 benchmarks = 12 evaluation runs.
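The matrix expansion is a plain Cartesian product of the two lists. The sketch below illustrates the arithmetic with the model and benchmark names from the config above; it is not the framework's own expansion code, and the runs structure is a hypothetical shape chosen for illustration.

```python
from itertools import product

models = ["qwen3:8b", "gpt-4o", "llama3:70b"]
benchmarks = ["supergpqa", "gaia", "frames", "wildchat"]

# One run per (model, benchmark) pair.
runs = [{"model": m, "benchmark": b} for m, b in product(models, benchmarks)]

print(len(runs))
for run in runs[:4]:
    print(run["model"], "x", run["benchmark"])
```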
Merge precedence¶
Settings are resolved with the following precedence, from highest to lowest: [[benchmarks]] fields, then [[models]] fields, then [defaults].
For example, temperature is resolved as: use [defaults].temperature (0.0), then apply [[models]].temperature if set (0.3 for qwen3:8b), then override with [[benchmarks]].temperature if set (0.7 for wildchat). The WildChat run with qwen3:8b therefore runs at temperature = 0.7.
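The same resolution can be written as a lookup that checks the highest-precedence layer first. This is a sketch of the rule described above, not the framework's implementation; resolve is a hypothetical helper and the dictionaries mirror the example config.

```python
def resolve(field, defaults, model, benchmark):
    """Return the first value that is set, checking highest precedence first."""
    for layer in (benchmark, model, defaults):
        value = layer.get(field)
        if value is not None:
            return value
    return None


defaults = {"temperature": 0.0, "max_tokens": 2048}
qwen = {"name": "qwen3:8b", "temperature": 0.3, "max_tokens": 4096}
wildchat = {"name": "wildchat", "temperature": 0.7}
supergpqa = {"name": "supergpqa"}

print(resolve("temperature", defaults, qwen, wildchat))   # benchmark override wins
print(resolve("temperature", defaults, qwen, supergpqa))  # model override wins
print(resolve("max_tokens", defaults, qwen, wildchat))    # model override wins
```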
Minimal config¶
A config requires only one [[models]] and one [[benchmarks]] entry:
[[models]]
name = "qwen3:8b"
[[benchmarks]]
name = "supergpqa"
This runs SuperGPQA against qwen3:8b with all default settings. Use this as a starting point when iterating on a single model or dataset.
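Before launching a long run, it can help to check that a config parses and contains both required sections. The sketch below uses the standard-library tomllib (with the tomli backport on Python 3.10); the validation logic is an illustrative assumption and not the loader that openjarvis.evals uses.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API on Python 3.10

MINIMAL_CONFIG = """
[[models]]
name = "qwen3:8b"

[[benchmarks]]
name = "supergpqa"
"""

config = tomllib.loads(MINIMAL_CONFIG)

# Both array-of-tables sections must be present and non-empty.
missing = [key for key in ("models", "benchmarks") if not config.get(key)]
if missing:
    raise ValueError(f"config is missing required sections: {missing}")

print([m["name"] for m in config["models"]])
print([b["name"] for b in config["benchmarks"]])
```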
Single-run config with full options¶
[meta]
name = "single-run-example"
description = "Evaluate SuperGPQA with a single model and full configuration"
[defaults]
temperature = 0.0
max_tokens = 2048
[judge]
model = "gpt-4o"
temperature = 0.0
max_tokens = 1024
[run]
max_workers = 4
output_dir = "results/"
seed = 42
[[models]]
name = "qwen3:8b"
engine = "ollama"
temperature = 0.3
max_tokens = 4096
[[benchmarks]]
name = "supergpqa"
backend = "jarvis-direct"
max_samples = 100
split = "train"
Config Reference¶
[meta]¶
Suite-level metadata. Neither field affects evaluation behavior; both are used in CLI output and summary files.
| Field | Type | Default | Description |
|---|---|---|---|
| name | str | "" | Suite name shown in CLI output |
| description | str | "" | Human-readable description |
[defaults]¶
Default generation parameters applied to every run unless overridden at the model or benchmark level.
| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.0 | Sampling temperature |
| max_tokens | int | 2048 | Maximum output tokens |
[judge]¶
Configuration for the LLM used as a judge in GAIA, FRAMES, and WildChat scoring.
| Field | Type | Default | Description |
|---|---|---|---|
| model | str | "gpt-5-mini-2025-08-07" | Judge model identifier |
| engine | str | None | Engine key for the judge (e.g., "vllm" to judge locally; defaults to cloud) |
| provider | str | None | Provider override (e.g., "openai") |
| temperature | float | 0.0 | Judge sampling temperature |
| max_tokens | int | 1024 | Maximum judge output tokens |
Judge model costs
Every sample that requires LLM-based scoring makes a separate call to the judge model. For large runs with hundreds of samples, judge costs can exceed evaluation costs. GAIA, FRAMES, and WildChat all require a judge; SuperGPQA uses an LLM to extract the answer letter, then compares it against the reference without a separate judge call.
[run]¶
Execution settings that apply to the entire suite.
| Field | Type | Default | Description |
|---|---|---|---|
| max_workers | int | 4 | Number of parallel evaluation threads |
| output_dir | str | "results/" | Directory where JSONL and summary files are written |
| seed | int | 42 | Random seed for dataset shuffling |
| telemetry | bool | false | Enable GPU telemetry capture (energy, power, utilization, throughput) |
| gpu_metrics | bool | false | Enable GPU metric polling via pynvml (requires pynvml or nvidia-ml-py) |
| warmup_samples | int | 0 | Untimed warmup samples before measurement |
| energy_vendor | str | "" | GPU energy vendor override |
| max_turns | int | None | Maximum agent turns per query |
| wandb_project / wandb_entity / wandb_tags / wandb_group | str | "" | Weights & Biases tracking |
| sheets_spreadsheet_id / sheets_worksheet / sheets_credentials_path | str | "" / "Results" / "" | Google Sheets export |
[backend.external]¶
Endpoint settings for the hermes and openclaw backends. Environment variables override TOML values.
| Field | Type | Default | Description |
|---|---|---|---|
| base_url | str | None | OpenAI-compatible endpoint URL (env: JARVIS_BACKEND_BASE_URL) |
| api_key | str | None | API key for the endpoint (env: JARVIS_BACKEND_API_KEY) |
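Taken together with the CLI behavior described earlier (flags fall back to environment variables, and environment variables override TOML values), the endpoint settings resolve in the order flag, environment, config. The sketch below illustrates that order; resolve_setting is a hypothetical helper, and the environ parameter exists only to make the example testable.

```python
import os


def resolve_setting(cli_value, env_name, toml_value, environ=None):
    """Pick the first available value: CLI flag, then env var, then TOML."""
    environ = os.environ if environ is None else environ
    return cli_value or environ.get(env_name) or toml_value


fake_env = {"JARVIS_BACKEND_BASE_URL": "http://gpu-node:8000/v1"}
toml_base_url = "http://localhost:8000/v1"  # value from [backend.external]

# No flag given: the environment variable overrides the TOML value.
print(resolve_setting(None, "JARVIS_BACKEND_BASE_URL", toml_base_url, fake_env))
# Flag given: it wins over both.
print(resolve_setting("http://other:9000/v1", "JARVIS_BACKEND_BASE_URL", toml_base_url, fake_env))
```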
[[models]]¶
One block per model. The name field is required.
| Field | Type | Default | Description |
|---|---|---|---|
| name | str | required | Model identifier (e.g., "qwen3:8b", "gpt-5-mini") |
| engine | str | None | Engine key to use ("ollama", "vllm", "cloud", ...) |
| provider | str | None | Provider override for cloud models (e.g., "openai") |
| temperature | float | None | Override [defaults].temperature for this model |
| max_tokens | int | None | Override [defaults].max_tokens for this model |
| param_count_b | float | 0.0 | Total model parameter count in billions (for MFU/MBU computation) |
| active_params_b | float | None | Active parameters per token in billions (for MoE models; defaults to param_count_b) |
| gpu_peak_tflops | float | 0.0 | GPU peak FP16 TFLOPS (e.g., 312.0 for A100 SXM) |
| gpu_peak_bandwidth_gb_s | float | 0.0 | GPU peak memory bandwidth in GB/s (e.g., 2039.0 for A100 SXM) |
| num_gpus | int | 1 | Number of GPUs used (for tensor-parallel inference) |
[[benchmarks]]¶
One block per benchmark. The name field is required.
| Field | Type | Default | Description |
|---|---|---|---|
| name | str | required | Any registered benchmark key (see uv run python -m openjarvis.evals list) |
| backend | str | "jarvis-direct" | jarvis-direct, jarvis-agent, hermes, openclaw, or terminalbench-native |
| max_samples | int | None | Limit number of samples; None evaluates the full dataset |
| split | str | None | Override the default dataset split |
| subset | str | None | Dataset subset/variant (benchmark-specific) |
| record_ids | list[str] | None | Evaluate only these record ids |
| agent | str | None | Agent name for jarvis-agent backend (e.g., "orchestrator") |
| tools | list[str] | [] | Tool names for jarvis-agent backend |
| judge_model | str | None | Override [judge].model for this benchmark only |
| temperature | float | None | Override temperature for this benchmark (highest precedence) |
| max_tokens | int | None | Override max tokens for this benchmark (highest precedence) |
Output Format¶
JSONL results file¶
Each completed sample is appended to the output JSONL file immediately after scoring. The file path is either specified with -o/--output, or auto-generated as {output_dir}/{benchmark}_{model-slug}.jsonl.
Each line is a JSON object with the following fields:
{
  "record_id": "supergpqa-42",
  "benchmark": "supergpqa",
  "model": "qwen3:8b",
  "backend": "jarvis-direct",
  "model_answer": "The answer is C because...",
  "is_correct": true,
  "score": 1.0,
  "latency_seconds": 1.34,
  "prompt_tokens": 187,
  "completion_tokens": 12,
  "cost_usd": 0.0,
  "error": null,
  "scoring_metadata": {"reference_letter": "C", "candidate_letter": "C"},
  "ttft": 0.0,
  "energy_joules": 140792.95,
  "power_watts": 893.0,
  "gpu_utilization_pct": 47.4,
  "throughput_tok_per_sec": 36.6,
  "mfu_pct": 0.0176,
  "mbu_pct": 26.89,
  "ipw": 0.00112,
  "ipj": 0.000007
}
| Field | Type | Description |
|---|---|---|
| record_id | str | Unique sample identifier |
| benchmark | str | Benchmark name |
| model | str | Model identifier |
| backend | str | Backend used |
| model_answer | str | Raw model output |
| is_correct | bool or null | Scoring result (null if unscored) |
| score | float or null | Numeric score (1.0 correct, 0.0 incorrect, null unscored) |
| latency_seconds | float | Inference latency |
| prompt_tokens | int | Input tokens consumed |
| completion_tokens | int | Output tokens generated |
| cost_usd | float | Estimated cost in USD |
| error | str or null | Error message if the sample failed |
| scoring_metadata | dict | Scorer-specific details (extracted letters, judge output, etc.) |
| ttft | float | Time to first token in seconds (0.0 if unavailable) |
| energy_joules | float | GPU energy consumed for this sample (joules) |
| power_watts | float | Average GPU power draw during inference (watts) |
| gpu_utilization_pct | float | Average GPU utilization percentage |
| throughput_tok_per_sec | float | Output token throughput (tokens/sec) |
| mfu_pct | float | Model FLOPs Utilization percentage (requires model hardware params) |
| mbu_pct | float | Memory Bandwidth Utilization percentage (requires model hardware params) |
| ipw | float | Intelligence Per Watt: accuracy / power_watts (0 if incorrect or no power data) |
| ipj | float | Intelligence Per Joule: accuracy / energy_joules (0 if incorrect or no energy data) |
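The two efficiency fields follow directly from the definitions in the table. The sketch below recomputes them for the example record shown earlier (score 1.0, 893.0 W, 140792.95 J); the efficiency function is a hypothetical helper, and it assumes the per-sample score stands in for accuracy.

```python
def efficiency(score, power_watts, energy_joules):
    """Return (ipw, ipj); each is 0.0 when the score or the denominator is missing or zero."""
    ipw = score / power_watts if score and power_watts else 0.0
    ipj = score / energy_joules if score and energy_joules else 0.0
    return ipw, ipj


ipw, ipj = efficiency(1.0, 893.0, 140792.95)
print(f"ipw={ipw:.5f} ipj={ipj:.6f}")
```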
Summary JSON file¶
After all samples complete, a summary file is written alongside the JSONL at {output_path}.summary.json:
{
  "benchmark": "supergpqa",
  "category": "reasoning",
  "backend": "jarvis-direct",
  "model": "qwen3:8b",
  "total_samples": 200,
  "scored_samples": 198,
  "correct": 143,
  "accuracy": 0.7222,
  "errors": 2,
  "mean_latency_seconds": 1.4821,
  "total_cost_usd": 0.0,
  "per_subject": {
    "chemistry": {"accuracy": 0.74, "total": 50.0, "scored": 50.0, "correct": 37.0},
    "mathematics": {"accuracy": 0.68, "total": 50.0, "scored": 49.0, "correct": 33.0}
  },
  "started_at": 1708789200.0,
  "ended_at": 1708789496.3,
  "accuracy_stats": {"mean": 0.72, "median": 1.0, "min": 0.0, "max": 1.0, "std": 0.45},
  "energy_stats": {"mean": 140792.95, "median": 135112.79, "min": 3926.17, "max": 1806568.12, "std": 156038.54},
  "power_stats": {"mean": 892.98, "median": 898.19, "min": 811.50, "max": 1104.90, "std": 42.65},
  "gpu_utilization_stats": {"mean": 47.41, "median": 47.45, "min": 42.38, "max": 56.23, "std": 2.72},
  "throughput_stats": {"mean": 36.55, "median": 37.22, "min": 26.22, "max": 45.03, "std": 5.00},
  "mfu_stats": {"mean": 0.0176, "median": 0.0179, "min": 0.0126, "max": 0.0216, "std": 0.0024},
  "mbu_stats": {"mean": 26.89, "median": 27.38, "min": 19.29, "max": 33.13, "std": 3.68},
  "ipw_stats": {"mean": 0.00113, "median": 0.00112, "min": 0.00100, "max": 0.00123, "std": 0.00005},
  "ipj_stats": {"mean": 0.00003, "median": 0.00001, "min": 0.000002, "max": 0.00021, "std": 0.00004},
  "total_energy_joules": 28158590.26
}
When telemetry = true and gpu_metrics = true are set in [run], the summary includes MetricStats (mean, median, min, max, std) for every telemetry metric plus total_energy_joules. These stats are null when no values are available for that metric.
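A MetricStats-shaped dictionary can be reproduced from per-sample values with the standard statistics module. This is a sketch under assumptions: metric_stats is a hypothetical helper, and it uses the population standard deviation, whereas the framework may use the sample standard deviation instead.

```python
import statistics


def metric_stats(values):
    """Return mean/median/min/max/std for the non-null values, or None if there are none."""
    values = [v for v in values if v is not None]
    if not values:
        return None  # mirrors the null stats described above
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # Assumption: population standard deviation.
        "std": statistics.pstdev(values),
    }


power_samples = [811.5, 898.19, None, 1104.9]
print(metric_stats(power_samples))
print(metric_stats([None, None]))
```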
The per_subject breakdown groups results by the dataset's subject or category field, which varies per benchmark:
- SuperGPQA: subfield, field, or discipline
- GAIA: difficulty level (level_1, level_2, level_3)
- FRAMES: reasoning type(s) (e.g., temporal, intersection)
- WildChat: always "conversation"
Scoring Methods¶
Each benchmark uses a scorer tuned to its answer format.
SuperGPQA: LLM-assisted MCQ extraction¶
SuperGPQA responses are free-form text that must contain one of the valid option letters (A, B, C, D, ...). The scorer uses the judge LLM to extract the final answer letter from the model's response, then compares it against the reference letter with exact string matching.
The judge is prompted with the original problem and the model's response and asked to return only a single letter. This handles cases where the model reasons extensively before stating its final answer.
Scoring metadata includes: reference_letter, candidate_letter, and valid_letters.
GAIA: Normalized exact match with LLM fallback¶
GAIA answers are typically numbers, short phrases, or comma-separated lists. The scorer applies a normalization pass before comparison:
- Numbers: strips $, %, and , then converts to float for comparison
- Lists: splits on , or ; and compares element-by-element (with per-element type detection)
- Strings: lowercases, strips whitespace and punctuation
If the normalized exact match fails, the scorer falls back to the judge LLM, which returns a structured response with extracted_final_answer, reasoning, and correct: yes/no. The LLM fallback handles cases like unit variations, alternative phrasings, and equivalent but differently-formatted answers.
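The normalization pass can be sketched as follows. This is an illustrative approximation, not the scorer's actual code: the function names are hypothetical, the type of each comparison is taken from the reference answer, and whitespace is removed entirely from strings, all of which are assumptions.

```python
import re
import string


def is_float(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def to_number(text: str):
    """Strip '$', '%', and ',' then parse as float; None if not numeric."""
    cleaned = re.sub(r"[$%,]", "", text.strip())
    return float(cleaned) if is_float(cleaned) else None


def norm_text(text: str) -> str:
    """Lowercase, drop punctuation, and remove all whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return "".join(no_punct.split())


def element_match(candidate: str, reference: str) -> bool:
    # Per-element type detection: numeric references compare as floats.
    if is_float(reference.strip()):
        return to_number(candidate) == float(reference)
    return norm_text(candidate) == norm_text(reference)


def exact_match(candidate: str, reference: str) -> bool:
    # Non-numeric references containing ',' or ';' are treated as lists.
    if not is_float(reference.strip()) and re.search(r"[,;]", reference):
        cand_parts = re.split(r"[,;]", candidate)
        ref_parts = re.split(r"[,;]", reference)
        return len(cand_parts) == len(ref_parts) and all(
            element_match(c, r) for c, r in zip(cand_parts, ref_parts)
        )
    return element_match(candidate, reference)


print(exact_match("$1,234.50", "1234.5"))
print(exact_match("red; 2, Blue", "Red, 2, blue"))
print(exact_match("12", "13"))
```

In this sketch a False result would be the point at which the judge LLM fallback takes over.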
FRAMES: LLM-as-judge (factual correctness)¶
FRAMES uses an LLM judge that evaluates semantic equivalence between the model's answer and the ground truth. The judge receives the question, ground truth, and predicted answer, then responds with a structured verdict:
The scorer parses the correct: line and falls back to presence of TRUE/FALSE tokens if the structured format is missing.
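The parse-then-fallback behavior can be sketched as below. The exact verdict layout is not reproduced here, so this example assumes a line of the form "correct: TRUE" or "correct: FALSE"; parse_verdict is a hypothetical name, and returning None when the output is ambiguous is also an assumption.

```python
import re


def parse_verdict(judge_output: str):
    """Return True/False from the judge output, or None if no verdict is found."""
    # Preferred path: a structured "correct:" line.
    match = re.search(
        r"^\s*correct\s*:\s*(true|false)\b",
        judge_output,
        re.IGNORECASE | re.MULTILINE,
    )
    if match:
        return match.group(1).lower() == "true"

    # Fallback: look for bare TRUE/FALSE tokens anywhere in the text.
    has_true = re.search(r"\bTRUE\b", judge_output) is not None
    has_false = re.search(r"\bFALSE\b", judge_output) is not None
    if has_true != has_false:
        return has_true
    return None  # neither or both tokens present: leave unscored


print(parse_verdict("reasoning: dates match\ncorrect: TRUE"))
print(parse_verdict("The prediction is FALSE."))
print(parse_verdict("no verdict here"))
```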
WildChat: Pairwise LLM comparison¶
WildChat does not have a single "correct" answer — it measures chat response quality. The scorer runs a dual pairwise comparison:
- The judge evaluates (model answer as A, reference as B) and returns a verdict token such as [[A>>B]], [[A>B]], [[A=B]], [[B>A]], or [[B>>A]].
- The judge then evaluates (reference as A, model answer as B) and returns another verdict.
The model is considered to have passed (is_correct = True) if it wins or ties in either comparison. The dual comparison reduces positional bias in the judge.
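The pass rule can be expressed as two small lookup sets, one per ordering. This is a sketch of the rule as stated above, with hypothetical names; it assumes the verdict tokens arrive already extracted from the judge output.

```python
# Verdicts that count as a win or tie for the model in each ordering.
MODEL_AS_A_OK = {"[[A>>B]]", "[[A>B]]", "[[A=B]]"}  # first pass: model is A
MODEL_AS_B_OK = {"[[B>>A]]", "[[B>A]]", "[[A=B]]"}  # second pass: model is B


def passes(verdict_model_as_a: str, verdict_model_as_b: str) -> bool:
    """True if the model wins or ties in either of the two comparisons."""
    return verdict_model_as_a in MODEL_AS_A_OK or verdict_model_as_b in MODEL_AS_B_OK


print(passes("[[B>A]]", "[[B>A]]"))   # loses as A, wins as B
print(passes("[[B>>A]]", "[[A>B]]"))  # loses in both orderings
```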
The judge uses a multi-step rubric that distinguishes subjective queries (scored on correctness, helpfulness, relevance, conciseness, and creativity) from objective/technical queries (scored on correctness only).
Interpreting WildChat accuracy
A WildChat accuracy score of 0.50 means the model matched or beat the reference response in half of comparisons. Because the reference response comes from the original dataset (which may include responses from capable models), a score above 0.50 indicates strong chat quality for that sample set.
Parallel Execution¶
The EvalRunner processes samples concurrently using a ThreadPoolExecutor. Results are flushed to the JSONL file incrementally as each sample completes, so you can inspect partial results during a long run.
# Use more workers for faster evaluation (if the engine supports concurrent requests)
uv run python -m openjarvis.evals run -b supergpqa -m qwen3:8b -w 8 -n 500
Worker count and engine load
Higher worker counts increase throughput only if the inference engine can handle concurrent requests. Local Ollama instances typically handle 1-2 concurrent requests. Cloud APIs (OpenAI, Anthropic) can handle higher concurrency. Set -w based on your engine's actual parallelism.
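The concurrency pattern described in this section (a thread pool plus incremental JSONL flushing) looks roughly like the sketch below. It is not the EvalRunner source: evaluate is a hypothetical stand-in for inference and scoring, and the sample data is made up for illustration.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def evaluate(sample):
    """Hypothetical stand-in for running inference and scoring one sample."""
    return {
        "record_id": sample["id"],
        "is_correct": sample["answer"] == sample["expected"],
    }


def run(samples, output_path, max_workers=4):
    """Score samples concurrently and append each result as soon as it is ready."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool, open(
        output_path, "a", encoding="utf-8"
    ) as out:
        futures = [pool.submit(evaluate, s) for s in samples]
        # Writes happen only in this thread, so no lock is needed.
        for future in as_completed(futures):
            out.write(json.dumps(future.result()) + "\n")
            out.flush()  # partial results are visible during a long run


samples = [
    {"id": "q1", "answer": "A", "expected": "A"},
    {"id": "q2", "answer": "B", "expected": "C"},
    {"id": "q3", "answer": "D", "expected": "D"},
    {"id": "q4", "answer": "A", "expected": "A"},
]
out_file = Path(tempfile.mkdtemp()) / "demo.jsonl"
run(samples, out_file, max_workers=2)
print(out_file.read_text(encoding="utf-8"))
```

Because results are written in completion order, the line order in the file can differ from the input order; key on record_id when comparing runs.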
See Also¶
- Benchmarks — Measure inference engine latency and throughput
- Telemetry & Traces — Record and analyze inference metrics from production use
- Agents — Configure the OrchestratorAgent used by the jarvis-agent backend
- Tools — Available tools for agent-backed evaluations
- Python SDK — Programmatic access to OpenJarvis inference and agents