Skip to content

Evaluations

The OpenJarvis evaluation framework (openjarvis.evals) measures model correctness and accuracy on academic datasets. It ships inside the main openjarvis package (at src/openjarvis/evals/) and is designed specifically for research workflows where you need reproducible, dataset-driven quality assessments.

Evals vs. Benchmarks

OpenJarvis has two distinct measurement systems that complement each other:

System Module Measures Entry Point
Evaluations openjarvis.evals Correctness on academic datasets (accuracy, pass rate) jarvis eval
Benchmarks openjarvis.bench Engine performance (latency, throughput) jarvis bench

Use evaluations to answer "does this model get the right answer?" and benchmarks to answer "how fast does this model respond?". See the Benchmarks guide for the performance measurement system.


Tip: LLM-guided spec search uses this same eval infrastructure to gate edits against your personal benchmark. See LLM-guided spec search.

Installation

The evaluation framework is part of the main openjarvis package — no separate install or extra is required. The standard dev setup is enough:

uv sync --extra dev

The framework's core dependencies (click, datasets, rich) are base dependencies of openjarvis. Two optional extras enable experiment tracking integrations:

uv sync --extra dev --extra eval-wandb     # Weights & Biases run tracking
uv sync --extra dev --extra eval-sheets    # Google Sheets results export

Python version requirement

Python 3.10 requires the tomli package for TOML config parsing. openjarvis declares it as a conditional dependency, so it is installed automatically.

Entry Points

Two equivalent entry points expose the framework:

Command Surface
jarvis eval {list,run,compare,report} Canonical CLI. run covers the common options; compare and report post-process result files.
python -m openjarvis.evals {list,run,run-all,summarize,reparse-judge} Full research surface, including judge configuration, the agentic runner, and episode mode.

The openjarvis-eval console script is an alias for python -m openjarvis.evals — same commands, same options. This guide uses jarvis eval wherever its option set suffices and the module form for research-only options.


Datasets

The framework ships with 40 registered benchmarks covering academic reasoning, agentic tasks, coding, retrieval, conversation quality, and practical use-case benchmarks. Datasets are grouped by category below; uv run python -m openjarvis.evals list prints the authoritative registry.

Use-Case Benchmarks

These benchmarks evaluate models on practical tasks that mirror real OpenJarvis use cases.

Dataset Key Description
CodingAssistant coding_assistant Bug-fix coding assistant (test-based)
SecurityScanner security_scanner Security vulnerability scanner
DailyDigest daily_digest Daily briefing generation
DocQA doc_qa Document-grounded QA with citations
BrowserAssistant browser_assistant Web research with fact verification
EmailTriage email_triage Email triage classification + draft
MorningBrief morning_brief Morning briefing generation
ResearchMining research_mining Research synthesis + accuracy
KnowledgeBase knowledge_base Document-grounded retrieval QA
CodingTask coding_task Function-level code generation

Academic Benchmarks

These benchmarks measure reasoning and knowledge on established academic datasets.

Dataset Key Category Description
SuperGPQA supergpqa reasoning Graduate-level multiple-choice across scientific disciplines
GPQA gpqa reasoning Graduate-level MCQ (Diamond, Extended, Main variants)
MMLU-Pro mmlu-pro reasoning Enhanced MMLU multiple-choice
MATH-500 math500 reasoning Competition-level math problems
NaturalReasoning natural-reasoning reasoning Natural language reasoning
HLE hle reasoning Humanity's Last Exam hard challenges
LiveResearchBench liveresearchbench reasoning Recent research comprehension (Salesforce)
SimpleQA simpleqa chat Short-form factual question answering
IPW ipw chat Intelligence Per Watt mixed benchmark

Agent Benchmarks

These benchmarks test multi-step agent capabilities including tool use, code generation, and long-horizon planning.

Dataset Key Category Description
GAIA gaia agentic Multi-step tasks with file I/O, calculations, web lookup
SWE-bench swebench agentic Real-world GitHub code patches
SWEfficiency swefficiency agentic Software optimization tasks
TerminalBench terminalbench agentic Terminal-based task completion
TerminalBench Native terminalbench-native agentic TerminalBench with native Docker execution
TerminalBench V2.1 terminalbench-v2.1 agentic TB v2.1 Harbor-style Docker tasks
PinchBench pinchbench agentic Real-world agent tasks
TauBench taubench agentic Multi-turn customer service
DeepResearchBench liveresearch agentic Deep research report generation
DeepResearchBench (alias) deepresearch agentic Same benchmark as liveresearch
ToolCall-15 toolcall15 agentic Tool calling benchmark
LifelongAgent lifelong-agent agentic Sequential task learning across sessions
PaperArena paperarena agentic Scientific paper analysis
DeepPlanning deepplanning agentic Shopping constraint planning
LogHub loghub agentic Log anomaly detection
AMA-Bench ama-bench agentic Agent memory assessment
WebChoreArena webchorearena agentic Web chore tasks
WorkArena workarena agentic WorkArena++ enterprise workflows

Both liveresearch and deepresearch are registered keys for the DeepResearchBench report-generation benchmark.

Coding Benchmarks

Dataset Key Category Description
LiveCodeBench livecodebench coding Competitive programming

Retrieval Benchmarks

Dataset Key Category Description
FRAMES frames rag Multi-hop factual retrieval across Wikipedia articles

Conversation Benchmarks

Dataset Key Category Description
WildChat wildchat chat Real user conversation quality (pairwise LLM judge)

Dataset Details

SuperGPQA is a large-scale multiple-choice benchmark spanning graduate-level questions across scientific disciplines. Each sample has a question, a set of lettered options, and a reference answer letter.

GAIA is an agentic benchmark requiring models to complete multi-step tasks that may involve file reading, calculations, and web lookup. Questions are drawn from the 2023 GAIA challenge set.

FRAMES tests multi-hop factual retrieval. Each question requires synthesizing information across multiple Wikipedia articles, making it a strong probe of retrieval-augmented generation capability.

WildChat uses real user conversations filtered to English single-turn exchanges. The reference answer is the original assistant response from the dataset; the model under evaluation is compared against it by an LLM judge.

GAIA dataset access

The GAIA dataset requires a HuggingFace account and acceptance of the dataset's terms of use. The loader downloads the full dataset snapshot on first use and caches it at ~/.cache/gaia_benchmark/. Subsequent runs use the local cache.


Use-Case Eval Configs

The framework includes two pre-built configs for evaluating models on the five core use-case benchmarks (coding_assistant, security_scanner, daily_digest, doc_qa, browser_assistant).

Cloud models

uv run jarvis eval run --config src/openjarvis/evals/configs/use_case_v2_cloud.toml

This config evaluates 6 cloud models (Claude Opus 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3.1 Flash Lite, GPT-5.4, GPT-5 Mini) against all 5 use-case benchmarks with 30 samples each, producing a 6x5 = 30-run matrix. Results are written to results/use-cases-v2-cloud/.

Local models

uv run jarvis eval run --config src/openjarvis/evals/configs/use_case_v2_local.toml

This config evaluates 5 local models via Ollama (Qwen3.5 122B-A10B, GPT-OSS 120B, GLM4, Qwen3.5 35B-A3B, GLM-4.7-Flash) against the same 5 benchmarks, producing a 5x5 = 25-run matrix. Uses 2 workers (suitable for single-GPU setups). Results are written to results/use-cases-v2-local/.

Customizing use-case evals

Copy one of the use_case_v2_*.toml configs and modify the [[models]] entries to evaluate your own models. The five use-case benchmarks use synthetic datasets (no HuggingFace download required) and run quickly with 30 samples each.


Inference Backends

Every evaluation run routes model calls through one of four backends:

Backend Key Description
jarvis-direct jarvis-direct Engine-level inference via SystemBuilder. Works for local (Ollama, vLLM, llama.cpp) and cloud models.
jarvis-agent jarvis-agent Agent-level inference with tool calling. Uses JarvisSystem.ask() with the specified agent and tools.
hermes hermes Real Hermes Agent (Nous Research) via subprocess. Requires --base-url and --api-key.
openclaw openclaw Real OpenClaw via Node subprocess. Requires --base-url and --api-key.

Use jarvis-direct for most evaluations. Use jarvis-agent when the benchmark requires tool use — for example, GAIA tasks that reference files that must be read with file_read, or arithmetic tasks that benefit from calculator.

The hermes and openclaw backends shell out to external agent frameworks and need an OpenAI-compatible endpoint for their model calls: pass --base-url/--api-key, set the JARVIS_BACKEND_BASE_URL/JARVIS_BACKEND_API_KEY environment variables, or add a [backend.external] section to your config (see Config Reference).

TerminalBench Native

jarvis eval run --backend additionally accepts terminalbench-native, a Docker-based execution backend used by the TerminalBench Native benchmark.


CLI Usage

List available benchmarks and backends

uv run python -m openjarvis.evals list

Abridged output (40 benchmarks, 4 backends):

                         Available Benchmarks
┌──────────────────────┬───────────┬───────────────────────────────────┐
│ Name                 │ Category  │ Description                       │
├──────────────────────┼───────────┼───────────────────────────────────┤
│ supergpqa            │ reasoning │ SuperGPQA multiple-choice         │
│ gpqa                 │ reasoning │ GPQA graduate-level MCQ           │
│ ...                  │ ...       │ ...                               │
│ livecodebench        │ coding    │ LiveCodeBench competitive progr.  │
│ toolcall15           │ agentic   │ ToolCall-15 tool calling benchmark│
└──────────────────────┴───────────┴───────────────────────────────────┘
                         Available Backends
┌───────────────┬──────────────────────────────────────────────────┐
│ jarvis-direct │ Engine-level inference (local or cloud)          │
│ jarvis-agent  │ Agent-level inference with tool calling          │
│ hermes        │ Real Hermes Agent (Nous Research) via subprocess │
│ openclaw      │ Real OpenClaw via Node subprocess                │
└───────────────┴──────────────────────────────────────────────────┘

jarvis eval list prints a similar table but currently shows a curated subset of the registry; the module form above is the authoritative listing.

Run a single benchmark

# Evaluate qwen3:8b on SuperGPQA (engine-level, 10 samples)
uv run jarvis eval run -b supergpqa -m qwen3:8b -n 10

# Evaluate GPT-5 Mini on GAIA using the agent backend with tools
uv run jarvis eval run -b gaia -m gpt-5-mini --backend jarvis-agent \
    --agent orchestrator --tools calculator,file_read -n 50

# Run FRAMES with the vLLM engine, write output to a file
uv run jarvis eval run -b frames -m llama3:70b -e vllm \
    -o results/frames_llama70b.jsonl

# Run WildChat with a higher temperature for chat quality
uv run jarvis eval run -b wildchat -m qwen3:8b --temperature 0.7 -n 100

jarvis eval run option reference

Option Short Type Default Description
--config -c path TOML config file; when provided, -b and -m are not required
--benchmark -b str required* Any registered benchmark key (see ... list)
--model -m str required* Model identifier (e.g., qwen3:8b, gpt-5-mini)
--max-samples -n int all Limit the number of samples evaluated
--backend choice jarvis-direct jarvis-direct, jarvis-agent, hermes, openclaw, or terminalbench-native
--base-url str OpenAI-compatible endpoint URL (env: JARVIS_BACKEND_BASE_URL)
--api-key str API key for the endpoint (env: JARVIS_BACKEND_API_KEY)
--agent str Agent name for jarvis-agent backend (e.g., orchestrator)
--engine -e str auto Engine key (ollama, vllm, cloud, ...)
--tools str "" Comma-separated tool names (e.g., calculator,file_read)
--telemetry/--no-telemetry flag off Enable telemetry collection during eval
--gpu-metrics/--no-gpu-metrics flag off Enable GPU metric polling
--seed int 42 Random seed for dataset shuffling
--temperature float 0.0 Generation temperature
--max-tokens int 2048 Maximum output tokens
--model-filter str Filter models by name substring (multi-model configs)
--output -o path auto-generated Output JSONL file path
--wandb-project / --wandb-entity / --wandb-tags / --wandb-group str "" Weights & Biases tracking (requires eval-wandb extra)
--sheets-id / --sheets-worksheet / --sheets-creds str "" Google Sheets export (requires eval-sheets extra)
--verbose -v flag off Enable debug logging

*Required when --config is not provided.

Research-only options (python -m openjarvis.evals run)

The module CLI accepts everything above plus research-grade options that jarvis eval run does not expose:

Option Short Type Default Description
--max-workers -w int 4 Parallel evaluation workers
--judge-model str gpt-5-mini-2025-08-07 LLM used for judge-based scoring (see --help for the current default)
--judge-engine str cloud Engine key for the LLM judge; use vllm to judge locally
--split str dataset default Override the dataset split
--compact flag off Dense single-table output
--trace-detail flag off Full per-step trace listing
--agentic flag off Use AgenticRunner for multi-turn agent execution
--episode-mode flag off Sequential episode processing with lifelong learning (required for lifelong-agent and similar benchmarks)
--concurrency int 1 Parallel query execution (AgenticRunner only)
--query-timeout float Per-query wall-clock timeout in seconds (AgenticRunner only)

Note: the module CLI's --backend choice covers jarvis-direct, jarvis-agent, hermes, and openclaw; terminalbench-native as a backend is available via jarvis eval run and TOML configs.

Run all benchmarks at once

The run-all command (module CLI only) evaluates a single model against every registered benchmark sequentially and writes results to an output directory:

uv run python -m openjarvis.evals run-all -m qwen3:8b

# With options
uv run python -m openjarvis.evals run-all -m gpt-5-mini -n 100 --output-dir results/gpt5mini/

Output files are written as {output_dir}/{benchmark}_{model-slug}.jsonl. The model slug replaces / and : with -, so qwen3:8b becomes qwen3-8b.

Summarize results

After a run, inspect a JSONL results file:

uv run python -m openjarvis.evals summarize results/supergpqa_qwen3-8b.jsonl

Output:

File:      results/supergpqa_qwen3-8b.jsonl
Benchmark: supergpqa
Model:     qwen3:8b
Total:     200
Scored:    198
Correct:   143
Accuracy:  0.7222
Errors:    2

The module CLI also provides reparse-judge, which re-parses stored judge output in a results file and recovers records whose judge verdicts initially failed to parse — useful after improving the judge-output parser without re-running inference.

Compare and report

jarvis eval adds two post-processing commands for result files:

# Side-by-side metric comparison across runs
uv run jarvis eval compare results/supergpqa_qwen3-8b.jsonl results/supergpqa_gpt-5-mini.jsonl

# Detailed report (accuracy, latency, cost, per-subject breakdown) for one run
uv run jarvis eval report results/supergpqa_qwen3-8b.jsonl

Evaluating an Already-Running Endpoint

If you already have an OpenAI-compatible server running — jarvis serve, vLLM, SGLang, llama.cpp's server, or a hosted endpoint — point an eval directly at it with --base-url and --api-key:

# A vLLM server is already serving Qwen/Qwen3-8B on a GPU node:
#   vllm serve Qwen/Qwen3-8B --port 8000
uv run jarvis eval run -b supergpqa -m Qwen/Qwen3-8B \
    --base-url http://gpu-node:8000/v1 \
    --api-key local-key \
    -n 50

The -m value must match a model id the server reports at GET /v1/models. Both flags fall back to the JARVIS_BACKEND_BASE_URL and JARVIS_BACKEND_API_KEY environment variables, so CI jobs can set them once:

export JARVIS_BACKEND_BASE_URL=http://gpu-node:8000/v1
export JARVIS_BACKEND_API_KEY=local-key
uv run jarvis eval run -b gaia -m Qwen/Qwen3-8B --backend jarvis-agent -n 25

For the external hermes and openclaw backends these values are required (the foreign frameworks need an endpoint to send model calls to).

Engine-level alternative for vLLM

The vLLM engine also honors the VLLM_HOST environment variable (default http://localhost:8000):

VLLM_HOST=http://gpu-node:8000 uv run python -m openjarvis.evals run \
    -b supergpqa -m Qwen/Qwen3-8B -e vllm -n 50

VLLM_HOST is process-global — if the candidate and the judge both use the vllm engine, they share the same endpoint. Prefer --base-url when you need them separate.


TOML Config System

For research workflows that compare multiple models across multiple benchmarks, use a TOML config file to define the evaluation as a models x benchmarks matrix. This is the recommended approach for systematic evaluations.

Running from a config

uv run jarvis eval run --config src/openjarvis/evals/configs/full-suite.toml

When --config is provided, the -b/--benchmark and -m/--model options are not required. All settings come from the config file. The CLI expands the matrix, prints a progress table, and writes results to the configured output_dir.

Config file format

A config file has six sections: [meta], [defaults], [judge], [run], [[models]], and [[benchmarks]]. Only [[models]] and [[benchmarks]] are required — all other sections are optional and fall back to built-in defaults.

src/openjarvis/evals/configs/full-suite.toml
# Suite-level metadata (optional)
[meta]
name = "full-suite-v1"
description = "Evaluate all benchmarks against production models"

# Default generation parameters (optional)
[defaults]
temperature = 0.0
max_tokens = 2048

# LLM judge configuration (optional)
[judge]
model = "gpt-4o"
temperature = 0.0
max_tokens = 1024

# Execution settings (optional)
[run]
max_workers = 4
output_dir = "results/"
seed = 42

# --- Models (one [[models]] block per model) ---

[[models]]
name = "qwen3:8b"
engine = "ollama"
temperature = 0.3    # overrides [defaults] for this model
max_tokens = 4096

[[models]]
name = "gpt-4o"
provider = "openai"  # uses cloud engine

[[models]]
name = "llama3:70b"
engine = "vllm"
temperature = 0.1

# --- Benchmarks (one [[benchmarks]] block per benchmark) ---

[[benchmarks]]
name = "supergpqa"
backend = "jarvis-direct"
max_samples = 200
split = "train"

[[benchmarks]]
name = "gaia"
backend = "jarvis-agent"
agent = "orchestrator"
tools = ["file_read", "calculator"]
max_samples = 50
judge_model = "claude-sonnet-4-20250514"  # override judge for this benchmark

[[benchmarks]]
name = "frames"
backend = "jarvis-direct"
max_samples = 100

[[benchmarks]]
name = "wildchat"
backend = "jarvis-direct"
max_samples = 150
temperature = 0.7   # override temperature for this benchmark

This config produces 3 models x 4 benchmarks = 12 evaluation runs.

Merge precedence

Settings are resolved with the following precedence, from highest to lowest:

benchmark-level  >  model-level  >  [defaults]  >  built-in defaults

For example, temperature is resolved as: use [defaults].temperature (0.0), then apply [[models]].temperature if set (0.3 for qwen3:8b), then override with [[benchmarks]].temperature if set (0.7 for wildchat). The WildChat run with qwen3:8b therefore runs at temperature = 0.7.

Minimal config

A config requires only one [[models]] and one [[benchmarks]] entry:

src/openjarvis/evals/configs/minimal.toml
[[models]]
name = "qwen3:8b"

[[benchmarks]]
name = "supergpqa"

This runs SuperGPQA against qwen3:8b with all default settings. Use this as a starting point when iterating on a single model or dataset.

Single-run config with full options

src/openjarvis/evals/configs/single-run.toml
[meta]
name = "single-run-example"
description = "Evaluate SuperGPQA with a single model and full configuration"

[defaults]
temperature = 0.0
max_tokens = 2048

[judge]
model = "gpt-4o"
temperature = 0.0
max_tokens = 1024

[run]
max_workers = 4
output_dir = "results/"
seed = 42

[[models]]
name = "qwen3:8b"
engine = "ollama"
temperature = 0.3
max_tokens = 4096

[[benchmarks]]
name = "supergpqa"
backend = "jarvis-direct"
max_samples = 100
split = "train"

Config Reference

[meta]

Suite-level metadata. Neither field affects evaluation behavior; both are used in CLI output and summary files.

Field Type Default Description
name str "" Suite name shown in CLI output
description str "" Human-readable description

[defaults]

Default generation parameters applied to every run unless overridden at the model or benchmark level.

Field Type Default Description
temperature float 0.0 Sampling temperature
max_tokens int 2048 Maximum output tokens

[judge]

Configuration for the LLM used as a judge in GAIA, FRAMES, and WildChat scoring.

Field Type Default Description
model str "gpt-5-mini-2025-08-07" Judge model identifier
engine str None Engine key for the judge (e.g., "vllm" to judge locally; defaults to cloud)
provider str None Provider override (e.g., "openai")
temperature float 0.0 Judge sampling temperature
max_tokens int 1024 Maximum judge output tokens

Judge model costs

Every sample that requires LLM-based scoring makes a separate call to the judge model. For large runs with hundreds of samples, judge costs can exceed evaluation costs. GAIA, FRAMES, and WildChat all require a judge; SuperGPQA uses an LLM to extract the answer letter, then compares it against the reference without a separate judge call.

[run]

Execution settings that apply to the entire suite.

Field Type Default Description
max_workers int 4 Number of parallel evaluation threads
output_dir str "results/" Directory where JSONL and summary files are written
seed int 42 Random seed for dataset shuffling
telemetry bool false Enable GPU telemetry capture (energy, power, utilization, throughput)
gpu_metrics bool false Enable GPU metric polling via pynvml (requires pynvml or nvidia-ml-py)
warmup_samples int 0 Untimed warmup samples before measurement
energy_vendor str "" GPU energy vendor override
max_turns int None Maximum agent turns per query
wandb_project / wandb_entity / wandb_tags / wandb_group str "" Weights & Biases tracking
sheets_spreadsheet_id / sheets_worksheet / sheets_credentials_path str "" / "Results" / "" Google Sheets export

[backend.external]

Endpoint settings for the hermes and openclaw backends. Environment variables override TOML values.

Field Type Default Description
base_url str None OpenAI-compatible endpoint URL (env: JARVIS_BACKEND_BASE_URL)
api_key str None API key for the endpoint (env: JARVIS_BACKEND_API_KEY)

[[models]]

One block per model. The name field is required.

Field Type Default Description
name str required Model identifier (e.g., "qwen3:8b", "gpt-5-mini")
engine str None Engine key to use ("ollama", "vllm", "cloud", ...)
provider str None Provider override for cloud models (e.g., "openai")
temperature float None Override [defaults].temperature for this model
max_tokens int None Override [defaults].max_tokens for this model
param_count_b float 0.0 Total model parameter count in billions (for MFU/MBU computation)
active_params_b float None Active parameters per token in billions (for MoE models; defaults to param_count_b)
gpu_peak_tflops float 0.0 GPU peak FP16 TFLOPS (e.g., 312.0 for A100 SXM)
gpu_peak_bandwidth_gb_s float 0.0 GPU peak memory bandwidth in GB/s (e.g., 2039.0 for A100 SXM)
num_gpus int 1 Number of GPUs used (for tensor-parallel inference)

[[benchmarks]]

One block per benchmark. The name field is required.

Field Type Default Description
name str required Any registered benchmark key (see uv run python -m openjarvis.evals list)
backend str "jarvis-direct" jarvis-direct, jarvis-agent, hermes, openclaw, or terminalbench-native
max_samples int None Limit number of samples; None evaluates the full dataset
split str None Override the default dataset split
subset str None Dataset subset/variant (benchmark-specific)
record_ids list[str] None Evaluate only these record ids
agent str None Agent name for jarvis-agent backend (e.g., "orchestrator")
tools list[str] [] Tool names for jarvis-agent backend
judge_model str None Override [judge].model for this benchmark only
temperature float None Override temperature for this benchmark (highest precedence)
max_tokens int None Override max tokens for this benchmark (highest precedence)

Output Format

JSONL results file

Each completed sample is appended to the output JSONL file immediately after scoring. The file path is either specified with -o/--output, or auto-generated as {output_dir}/{benchmark}_{model-slug}.jsonl.

Each line is a JSON object with the following fields:

results/supergpqa_qwen3-8b.jsonl (one line per sample)
{
  "record_id": "supergpqa-42",
  "benchmark": "supergpqa",
  "model": "qwen3:8b",
  "backend": "jarvis-direct",
  "model_answer": "The answer is C because...",
  "is_correct": true,
  "score": 1.0,
  "latency_seconds": 1.34,
  "prompt_tokens": 187,
  "completion_tokens": 12,
  "cost_usd": 0.0,
  "error": null,
  "scoring_metadata": {"reference_letter": "C", "candidate_letter": "C"},
  "ttft": 0.0,
  "energy_joules": 140792.95,
  "power_watts": 893.0,
  "gpu_utilization_pct": 47.4,
  "throughput_tok_per_sec": 36.6,
  "mfu_pct": 0.0176,
  "mbu_pct": 26.89,
  "ipw": 0.00112,
  "ipj": 0.000007
}
Field Type Description
record_id str Unique sample identifier
benchmark str Benchmark name
model str Model identifier
backend str Backend used
model_answer str Raw model output
is_correct bool or null Scoring result (null if unscored)
score float or null Numeric score (1.0 correct, 0.0 incorrect, null unscored)
latency_seconds float Inference latency
prompt_tokens int Input tokens consumed
completion_tokens int Output tokens generated
cost_usd float Estimated cost in USD
error str or null Error message if the sample failed
scoring_metadata dict Scorer-specific details (extracted letters, judge output, etc.)
ttft float Time to first token in seconds (0.0 if unavailable)
energy_joules float GPU energy consumed for this sample (joules)
power_watts float Average GPU power draw during inference (watts)
gpu_utilization_pct float Average GPU utilization percentage
throughput_tok_per_sec float Output token throughput (tokens/sec)
mfu_pct float Model FLOPs Utilization percentage (requires model hardware params)
mbu_pct float Memory Bandwidth Utilization percentage (requires model hardware params)
ipw float Intelligence Per Watt: accuracy / power_watts (0 if incorrect or no power data)
ipj float Intelligence Per Joule: accuracy / energy_joules (0 if incorrect or no energy data)

Summary JSON file

After all samples complete, a summary file is written alongside the JSONL at {output_path}.summary.json:

results/supergpqa_qwen3-8b.jsonl.summary.json
{
  "benchmark": "supergpqa",
  "category": "reasoning",
  "backend": "jarvis-direct",
  "model": "qwen3:8b",
  "total_samples": 200,
  "scored_samples": 198,
  "correct": 143,
  "accuracy": 0.7222,
  "errors": 2,
  "mean_latency_seconds": 1.4821,
  "total_cost_usd": 0.0,
  "per_subject": {
    "chemistry": {"accuracy": 0.74, "total": 50.0, "scored": 50.0, "correct": 37.0},
    "mathematics": {"accuracy": 0.68, "total": 50.0, "scored": 49.0, "correct": 33.0}
  },
  "started_at": 1708789200.0,
  "ended_at": 1708789496.3,
  "accuracy_stats": {"mean": 0.72, "median": 1.0, "min": 0.0, "max": 1.0, "std": 0.45},
  "energy_stats": {"mean": 140792.95, "median": 135112.79, "min": 3926.17, "max": 1806568.12, "std": 156038.54},
  "power_stats": {"mean": 892.98, "median": 898.19, "min": 811.50, "max": 1104.90, "std": 42.65},
  "gpu_utilization_stats": {"mean": 47.41, "median": 47.45, "min": 42.38, "max": 56.23, "std": 2.72},
  "throughput_stats": {"mean": 36.55, "median": 37.22, "min": 26.22, "max": 45.03, "std": 5.00},
  "mfu_stats": {"mean": 0.0176, "median": 0.0179, "min": 0.0126, "max": 0.0216, "std": 0.0024},
  "mbu_stats": {"mean": 26.89, "median": 27.38, "min": 19.29, "max": 33.13, "std": 3.68},
  "ipw_stats": {"mean": 0.00113, "median": 0.00112, "min": 0.00100, "max": 0.00123, "std": 0.00005},
  "ipj_stats": {"mean": 0.00003, "median": 0.00001, "min": 0.000002, "max": 0.00021, "std": 0.00004},
  "total_energy_joules": 28158590.26
}

When telemetry = true and gpu_metrics = true are set in [run], the summary includes MetricStats (mean, median, min, max, std) for every telemetry metric plus total_energy_joules. These stats are null when no values are available for that metric.

The per_subject breakdown groups results by the dataset's subject or category field, which varies per benchmark:

  • SuperGPQA: subfield, field, or discipline
  • GAIA: difficulty level (level_1, level_2, level_3)
  • FRAMES: reasoning type(s) (e.g., temporal, intersection)
  • WildChat: always "conversation"

Scoring Methods

Each benchmark uses a scorer tuned to its answer format.

SuperGPQA: LLM-assisted MCQ extraction

SuperGPQA responses are free-form text that must contain one of the valid option letters (A, B, C, D, ...). The scorer uses the judge LLM to extract the final answer letter from the model's response, then compares it against the reference letter with exact string matching.

The judge is prompted with the original problem and the model's response and asked to return only a single letter. This handles cases where the model reasons extensively before stating its final answer.

is_correct = extracted_letter == reference_letter

Scoring metadata includes: reference_letter, candidate_letter, and valid_letters.

GAIA: Normalized exact match with LLM fallback

GAIA answers are typically numbers, short phrases, or comma-separated lists. The scorer applies a normalization pass before comparison:

  • Numbers: strips $, %, , and converts to float for comparison
  • Lists: splits on ,/; and compares element-by-element (with per-element type detection)
  • Strings: lowercases, strips whitespace and punctuation

If the normalized exact match fails, the scorer falls back to the judge LLM, which returns a structured response with extracted_final_answer, reasoning, and correct: yes/no. The LLM fallback handles cases like unit variations, alternative phrasings, and equivalent but differently-formatted answers.

FRAMES: LLM-as-judge (factual correctness)

FRAMES uses an LLM judge that evaluates semantic equivalence between the model's answer and the ground truth. The judge receives the question, ground truth, and predicted answer, then responds with a structured verdict:

extracted_final_answer: <extracted answer>
reasoning: <brief explanation>
correct: yes / no

The scorer parses the correct: line and falls back to presence of TRUE/FALSE tokens if the structured format is missing.

WildChat: Pairwise LLM comparison

WildChat does not have a single "correct" answer — it measures chat response quality. The scorer runs a dual pairwise comparison:

  1. The judge evaluates (model answer as A, reference as B) and returns a verdict token such as [[A>>B]], [[A>B]], [[A=B]], [[B>A]], or [[B>>A]].
  2. The judge then evaluates (reference as A, model answer as B) and returns another verdict.

The model is considered to have passed (is_correct = True) if it wins or ties in either comparison. The dual comparison reduces positional bias in the judge.

The judge uses a multi-step rubric that distinguishes subjective queries (scored on correctness, helpfulness, relevance, conciseness, and creativity) from objective/technical queries (scored on correctness only).

Interpreting WildChat accuracy

A WildChat accuracy score of 0.50 means the model matched or beat the reference response in half of comparisons. Because the reference response comes from the original dataset (which may include responses from capable models), a score above 0.50 indicates strong chat quality for that sample set.


Parallel Execution

The EvalRunner processes samples concurrently using a ThreadPoolExecutor. Results are flushed to the JSONL file incrementally as each sample completes, so you can inspect partial results during a long run.

# Use more workers for faster evaluation (if the engine supports concurrent requests)
uv run python -m openjarvis.evals run -b supergpqa -m qwen3:8b -w 8 -n 500

Worker count and engine load

Higher worker counts increase throughput only if the inference engine can handle concurrent requests. Local Ollama instances typically handle 1-2 concurrent requests. Cloud APIs (OpenAI, Anthropic) can handle higher concurrency. Set -w based on your engine's actual parallelism.


See Also

  • Benchmarks — Measure inference engine latency and throughput
  • Telemetry & Traces — Record and analyze inference metrics from production use
  • Agents — Configure the OrchestratorAgent used by jarvis-agent backend
  • Tools — Available tools for agent-backed evaluations
  • Python SDK — Programmatic access to OpenJarvis inference and agents