telemetry
¶
Telemetry — SQLite-backed inference recording and instrumented wrappers.
Classes¶
AggregatedStats
dataclass
¶
AggregatedStats(total_calls: int = 0, total_tokens: int = 0, total_cost: float = 0.0, total_latency: float = 0.0, total_energy_joules: float = 0.0, avg_throughput_tok_per_sec: float = 0.0, avg_gpu_utilization_pct: float = 0.0, avg_energy_per_output_token_joules: float = 0.0, avg_throughput_per_watt: float = 0.0, total_prefill_energy_joules: float = 0.0, total_decode_energy_joules: float = 0.0, avg_mean_itl_ms: float = 0.0, avg_median_itl_ms: float = 0.0, avg_p95_itl_ms: float = 0.0, per_model: List[ModelStats] = list(), per_engine: List[EngineStats] = list())
Top-level summary combining per-model and per-engine stats.
EngineStats
dataclass
¶
EngineStats(engine: str = '', call_count: int = 0, total_tokens: int = 0, total_latency: float = 0.0, avg_latency: float = 0.0, total_cost: float = 0.0, avg_ttft: float = 0.0, total_energy_joules: float = 0.0, avg_gpu_utilization_pct: float = 0.0, avg_throughput_tok_per_sec: float = 0.0, avg_tokens_per_joule: float = 0.0, avg_energy_per_output_token_joules: float = 0.0, avg_throughput_per_watt: float = 0.0, total_prefill_energy_joules: float = 0.0, total_decode_energy_joules: float = 0.0, avg_mean_itl_ms: float = 0.0, avg_median_itl_ms: float = 0.0, avg_p95_itl_ms: float = 0.0)
Aggregated statistics for a single engine backend.
ModelStats
dataclass
¶
ModelStats(model_id: str = '', call_count: int = 0, total_tokens: int = 0, prompt_tokens: int = 0, completion_tokens: int = 0, total_latency: float = 0.0, avg_latency: float = 0.0, total_cost: float = 0.0, avg_ttft: float = 0.0, total_energy_joules: float = 0.0, avg_gpu_utilization_pct: float = 0.0, avg_throughput_tok_per_sec: float = 0.0, avg_tokens_per_joule: float = 0.0, avg_energy_per_output_token_joules: float = 0.0, avg_throughput_per_watt: float = 0.0, total_prefill_energy_joules: float = 0.0, total_decode_energy_joules: float = 0.0, avg_mean_itl_ms: float = 0.0, avg_median_itl_ms: float = 0.0, avg_p95_itl_ms: float = 0.0)
Aggregated statistics for a single model.
TelemetryAggregator
¶
Read-only query layer over the telemetry SQLite database.
Source code in src/openjarvis/telemetry/aggregator.py
Functions¶
per_batch_stats
¶
per_batch_stats(*, since: Optional[float] = None, until: Optional[float] = None, exclude_warmup: bool = False) -> List[Dict[str, Any]]
Aggregate telemetry by batch_id.
Returns a list of dicts, each containing batch_id, total_requests, total_tokens, total_energy_joules, and energy_per_token_joules.
Source code in src/openjarvis/telemetry/aggregator.py
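The aggregation can be pictured as a GROUP BY over raw telemetry rows. The sketch below is a standalone, pure-Python stand-in (the real method queries SQLite); the input field names `tokens` and `energy_joules` are illustrative, but the output dict keys mirror the ones documented above.

```python
from collections import defaultdict
from typing import Any, Dict, List

def aggregate_by_batch(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group rows by batch_id and derive per-token energy (illustrative)."""
    groups: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"total_requests": 0, "total_tokens": 0, "total_energy_joules": 0.0}
    )
    for row in rows:
        g = groups[row["batch_id"]]
        g["total_requests"] += 1
        g["total_tokens"] += row["tokens"]
        g["total_energy_joules"] += row["energy_joules"]
    out = []
    for batch_id, g in sorted(groups.items()):
        tokens = g["total_tokens"]
        out.append({
            "batch_id": batch_id,
            **g,
            # Guard against empty batches to avoid division by zero.
            "energy_per_token_joules": g["total_energy_joules"] / tokens if tokens else 0.0,
        })
    return out

rows = [
    {"batch_id": "b1", "tokens": 100, "energy_joules": 50.0},
    {"batch_id": "b1", "tokens": 100, "energy_joules": 30.0},
    {"batch_id": "b2", "tokens": 40, "energy_joules": 8.0},
]
stats = aggregate_by_batch(rows)
```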
TelemetryStore
¶
Append-only SQLite store for inference telemetry records.
Source code in src/openjarvis/telemetry/store.py
Functions¶
record
¶
record(rec: TelemetryRecord) -> None
Persist a single telemetry record.
Source code in src/openjarvis/telemetry/store.py
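An append-only SQLite store reduces to INSERT-only writes against a fixed table. This minimal sketch uses an in-memory database and an illustrative four-column schema; the actual TelemetryRecord carries many more fields.

```python
import sqlite3

# Illustrative schema -- the real store's columns are richer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS telemetry ("
    "ts REAL, model_id TEXT, tokens INTEGER, energy_joules REAL)"
)

def record(ts: float, model_id: str, tokens: int, energy_joules: float) -> None:
    # INSERT only: the store never updates or deletes existing rows.
    conn.execute(
        "INSERT INTO telemetry VALUES (?, ?, ?, ?)",
        (ts, model_id, tokens, energy_joules),
    )
    conn.commit()

record(1700000000.0, "llama-3-8b", 128, 42.5)
count = conn.execute("SELECT COUNT(*) FROM telemetry").fetchone()[0]
```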
GpuHardwareSpec
dataclass
¶
Peak theoretical capabilities for a known GPU model.
GpuMonitor
¶
Background GPU poller using pynvml.
Usage::

    mon = GpuMonitor(poll_interval_ms=50)
    with mon.sample() as result:
        # ... run inference ...
        pass
    print(result.energy_joules)
    mon.close()
Source code in src/openjarvis/telemetry/gpu_monitor.py
Functions¶
available
staticmethod
¶
Return True if pynvml is importable and can be initialized.
Source code in src/openjarvis/telemetry/gpu_monitor.py
sample
¶
sample() -> Generator[GpuSample, None, None]
Context manager that polls GPUs during the block, then populates the sample.
If pynvml is unavailable or no devices are found, yields an empty GpuSample without starting a background thread.
Source code in src/openjarvis/telemetry/gpu_monitor.py
close
¶
Shut down pynvml if it was initialized.
Source code in src/openjarvis/telemetry/gpu_monitor.py
GpuSample
dataclass
¶
GpuSample(energy_joules: float = 0.0, mean_power_watts: float = 0.0, peak_power_watts: float = 0.0, mean_utilization_pct: float = 0.0, peak_utilization_pct: float = 0.0, mean_memory_used_gb: float = 0.0, peak_memory_used_gb: float = 0.0, mean_temperature_c: float = 0.0, peak_temperature_c: float = 0.0, duration_seconds: float = 0.0, num_snapshots: int = 0)
Aggregated GPU metrics over an inference bracket.
GpuSnapshot
dataclass
¶
GpuSnapshot(power_watts: float, utilization_pct: float, memory_used_gb: float, temperature_c: float, device_id: int = 0)
A single point-in-time reading from one GPU device.
EfficiencyMetrics
dataclass
¶
EfficiencyMetrics(mfu_pct: float = 0.0, mbu_pct: float = 0.0, actual_flops: float = 0.0, peak_flops: float = 0.0, actual_bandwidth_gb_s: float = 0.0, peak_bandwidth_gb_s: float = 0.0, ipj: float = 0.0)
Results of an MFU/MBU efficiency calculation.
VLLMMetrics
dataclass
¶
VLLMMetrics(ttft_p50: float = 0.0, ttft_p95: float = 0.0, ttft_p99: float = 0.0, gpu_cache_usage_pct: float = 0.0, e2e_latency_p50: float = 0.0, e2e_latency_p95: float = 0.0, queue_depth: float = 0.0)
Parsed vLLM performance metrics.
VLLMMetricsScraper
¶
Scrapes vLLM's Prometheus /metrics endpoint.
Source code in src/openjarvis/telemetry/vllm_metrics.py
Functions¶
scrape
¶
scrape() -> VLLMMetrics
Fetch and parse vLLM metrics. Returns zeroed metrics on error.
Source code in src/openjarvis/telemetry/vllm_metrics.py
EnergyMonitor
¶
Bases: ABC
Abstract base class for energy measurement backends.
Each vendor implementation probes for hardware support at init,
exposes an available() class method, and provides a sample()
context manager that measures energy over a code block.
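The contract above (a static `available()` probe plus a `sample()` context manager populated on block exit) can be sketched in isolation. The names below mirror the docs but are simplified stand-ins, not the real openjarvis classes.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from dataclasses import dataclass

@dataclass
class Sample:
    """Simplified stand-in for EnergySample."""
    energy_joules: float = 0.0

class Monitor(ABC):
    """Simplified stand-in for the EnergyMonitor ABC."""

    @staticmethod
    @abstractmethod
    def available() -> bool: ...

    @abstractmethod
    def sample(self): ...

class FakeMonitor(Monitor):
    """A stub backend that always reports a fixed energy reading."""

    @staticmethod
    def available() -> bool:
        return True   # a real backend would probe vendor hardware here

    @contextmanager
    def sample(self):
        s = Sample()
        try:
            yield s                    # caller runs inference inside the block
        finally:
            s.energy_joules = 12.0     # populated when the block exits

mon = FakeMonitor()
with mon.sample() as s:
    pass   # ... run inference ...
```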
Functions¶
available
abstractmethod
staticmethod
¶
vendor
abstractmethod
¶
vendor() -> EnergyVendor
energy_method
abstractmethod
¶
sample
abstractmethod
¶
sample() -> Generator[EnergySample, None, None]
Context manager that measures energy during the enclosed block.
Yields an EnergySample that is populated when the block exits.
Source code in src/openjarvis/telemetry/energy_monitor.py
snapshot
¶
snapshot() -> EnergySample
Return an instantaneous energy reading without a start/stop bracket.
Subclasses should override this to provide actual readings; the default returns an empty sample.
EnergySample
dataclass
¶
EnergySample(energy_joules: float = 0.0, mean_power_watts: float = 0.0, peak_power_watts: float = 0.0, duration_seconds: float = 0.0, num_snapshots: int = 0, mean_utilization_pct: float = 0.0, peak_utilization_pct: float = 0.0, mean_memory_used_gb: float = 0.0, peak_memory_used_gb: float = 0.0, mean_temperature_c: float = 0.0, peak_temperature_c: float = 0.0, vendor: str = '', device_name: str = '', device_count: int = 0, energy_method: str = '', cpu_energy_joules: float = 0.0, gpu_energy_joules: float = 0.0, dram_energy_joules: float = 0.0, ane_energy_joules: float = 0.0)
Aggregated energy metrics over an inference bracket.
Superset of GpuSample — adds vendor, device info, energy method,
and per-component breakdown (CPU, GPU, DRAM, ANE).
EnergyVendor
¶
Bases: str, Enum
Supported energy measurement vendors.
BatchMetrics
dataclass
¶
BatchMetrics(batch_id: str = '', total_requests: int = 0, total_tokens: int = 0, total_energy_joules: float = 0.0, energy_per_token_joules: float = 0.0, energy_per_request_joules: float = 0.0, mean_power_watts: float = 0.0, mean_throughput_tok_per_sec: float = 0.0, prefill_energy_joules: float = 0.0, decode_energy_joules: float = 0.0, per_request_energy: List[float] = list())
Aggregated metrics for a batch of inference requests.
EnergyBatch
¶
Group inference requests into a batch and compute per-token energy.
Works with or without an EnergyMonitor. When no monitor is provided,
request counts are still tracked but energy values stay at zero.
Source code in src/openjarvis/telemetry/batch.py
Functions¶
sample
¶
Wrap an energy monitor sample and provide a context for recording requests.
Yields a _BatchContext whose record_request() method should be
called once per inference request inside the block.
Source code in src/openjarvis/telemetry/batch.py
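The batch pattern is: wrap a block, count `record_request()` calls, then attribute measured energy across requests on exit. In this standalone sketch the total energy is passed in directly and split evenly; the real EnergyBatch obtains it from an EnergyMonitor sample.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List

@dataclass
class BatchContext:
    """Simplified stand-in for the _BatchContext yielded by sample()."""
    total_requests: int = 0
    per_request_energy: List[float] = field(default_factory=list)

    def record_request(self) -> None:
        self.total_requests += 1

@contextmanager
def batch(total_energy_joules: float):
    ctx = BatchContext()
    try:
        yield ctx
    finally:
        # Attribute the measured total evenly across recorded requests.
        if ctx.total_requests:
            share = total_energy_joules / ctx.total_requests
            ctx.per_request_energy = [share] * ctx.total_requests

with batch(total_energy_joules=90.0) as ctx:
    for _ in range(3):        # three inference requests in this batch
        ctx.record_request()
```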
SteadyStateConfig
dataclass
¶
SteadyStateConfig(warmup_samples: int = 5, window_size: int = 5, cv_threshold: float = 0.05, min_steady_samples: int = 3, metric: str = 'throughput')
Configuration for steady-state detection.
SteadyStateDetector
¶
SteadyStateDetector(config: SteadyStateConfig | None = None)
Detect steady state using coefficient of variation over a sliding window.
The first warmup_samples recordings are always classified as warmup.
After warmup, the CV (stdev / mean) of the last window_size values is
checked. When CV < cv_threshold for min_steady_samples consecutive
checks, steady state is declared.
Source code in src/openjarvis/telemetry/steady_state.py
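The detection rule above can be sketched as a standalone function: skip the warmup recordings, then watch the coefficient of variation of a sliding window until it stays below threshold for enough consecutive checks. This is an illustration of the described logic, not the class's actual implementation.

```python
from collections import deque
from statistics import mean, stdev

def detect_steady_state(values, warmup_samples=5, window_size=5,
                        cv_threshold=0.05, min_steady_samples=3):
    """Return the index at which steady state is declared, or None."""
    window = deque(maxlen=window_size)
    consecutive = 0
    for i, v in enumerate(values):
        if i < warmup_samples:
            continue                       # warmup recordings never count
        window.append(v)
        if len(window) < window_size:
            continue                       # need a full window first
        cv = stdev(window) / mean(window)  # coefficient of variation
        consecutive = consecutive + 1 if cv < cv_threshold else 0
        if consecutive >= min_steady_samples:
            return i
    return None

# Noisy warmup throughputs followed by a flat steady-state plateau.
samples = [10, 40, 20, 35, 15] + [100.0] * 10
steady_at = detect_steady_state(samples)
```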
Attributes¶
Functions¶
record
¶
Record a sample. Returns True when steady state is reached.
Source code in src/openjarvis/telemetry/steady_state.py
SteadyStateResult
dataclass
¶
SteadyStateResult(total_samples: int = 0, warmup_samples: int = 0, steady_state_samples: int = 0, steady_state_reached: bool = False, warmup_throughputs: List[float] = list(), warmup_energies: List[float] = list(), steady_throughputs: List[float] = list(), steady_energies: List[float] = list())
Result of steady-state detection.
TelemetrySample
dataclass
¶
TelemetrySample(timestamp_ns: int, gpu_power_w: float = 0.0, cpu_power_w: float = 0.0, gpu_energy_j: float = 0.0, cpu_energy_j: float = 0.0, gpu_util_pct: float = 0.0, gpu_temp_c: float = 0.0, gpu_mem_gb: float = 0.0)
Single telemetry sample.
TelemetrySession
¶
TelemetrySession(monitor: Optional[EnergyMonitor] = None, interval_ms: int = 100, buffer_size: int = 100000)
Background-sampling telemetry session.
Spawns a daemon thread that calls monitor.snapshot() at the configured interval. Stores samples in a ring buffer (Rust-backed if available, else pure-Python fallback).
Source code in src/openjarvis/telemetry/session.py
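The session pattern reduces to a daemon thread snapshotting a monitor at a fixed interval into a bounded ring buffer. The sketch below uses a plain `deque` as a stand-in for the Rust-backed buffer and a lambda in place of `monitor.snapshot()`.

```python
import threading
import time
from collections import deque

class Session:
    """Background sampler: snapshot() into a ring buffer until stopped."""

    def __init__(self, snapshot, interval_s: float = 0.01, buffer_size: int = 1000):
        self.samples = deque(maxlen=buffer_size)   # ring buffer: old samples drop off
        self._snapshot = snapshot
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self._snapshot())
            self._stop.wait(self._interval)        # wakes early on stop()

    def stop(self):
        self._stop.set()
        self._thread.join()

session = Session(snapshot=lambda: {"gpu_power_w": 250.0})
time.sleep(0.05)
session.stop()
```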
Functions¶
instrumented_generate
¶
instrumented_generate(engine: InferenceEngine, messages: Sequence[Message], *, model: str, bus: EventBus, temperature: float = 0.7, max_tokens: int = 1024, **kwargs: Any) -> Dict[str, Any]
Call engine.generate() and publish telemetry events on bus.
Returns the raw result dict from the engine.
Source code in src/openjarvis/telemetry/wrapper.py
compute_efficiency
¶
compute_efficiency(param_count_b: float, active_params_b: float | None, gpu_peak_tflops: float, gpu_peak_bandwidth_gb_s: float, tokens_per_sec: float, num_gpus: int = 1, energy_joules: float = 0.0, accuracy: float = 0.0, bytes_per_param: float = 2.0) -> EfficiencyMetrics
Compute MFU, MBU, and derived efficiency metrics.
Args:

    param_count_b: Total parameter count in billions.
    active_params_b: Active parameters per token in billions (None for dense).
    gpu_peak_tflops: Peak theoretical TFLOPS per GPU (e.g. 312 for A100 SXM FP16).
    gpu_peak_bandwidth_gb_s: Peak memory bandwidth per GPU in GB/s (e.g. 2039 for A100 SXM).
    tokens_per_sec: Measured generation throughput (tokens/second).
    num_gpus: Number of GPUs used for inference.
    energy_joules: Total energy consumed in joules (for IPJ calculation).
    accuracy: Accuracy score in [0, 1] (for IPJ calculation).
    bytes_per_param: Bytes per parameter (default 2.0 for FP16).
Returns:
EfficiencyMetrics with all computed values.
Source code in src/openjarvis/telemetry/efficiency.py
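The core MFU/MBU arithmetic can be shown in a few lines: decode performs roughly 2 FLOPs per parameter per generated token, and each generated token streams the full weight set through memory once. This is an illustrative sketch using the A100 SXM numbers from the Args list; the 8B model size is made up for the example.

```python
def mfu_mbu(param_count_b: float, gpu_peak_tflops: float,
            gpu_peak_bandwidth_gb_s: float, tokens_per_sec: float,
            num_gpus: int = 1, bytes_per_param: float = 2.0):
    """Return (MFU %, MBU %) for decode-phase generation (illustrative)."""
    # Decode FLOPs: ~2 FLOPs per parameter per generated token.
    actual_tflops = 2 * param_count_b * 1e9 * tokens_per_sec / 1e12
    mfu_pct = 100 * actual_tflops / (gpu_peak_tflops * num_gpus)
    # Decode is bandwidth-bound: each token reads every weight once.
    actual_bw_gb_s = param_count_b * 1e9 * bytes_per_param * tokens_per_sec / 1e9
    mbu_pct = 100 * actual_bw_gb_s / (gpu_peak_bandwidth_gb_s * num_gpus)
    return mfu_pct, mbu_pct

# 8B dense model at 50 tok/s on one A100 SXM (312 TFLOPS FP16, 2039 GB/s).
mfu, mbu = mfu_mbu(8.0, 312.0, 2039.0, 50.0)
```

The asymmetry (sub-1% MFU vs ~40% MBU at batch size 1) is typical: single-stream decode saturates memory bandwidth long before compute.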
create_energy_monitor
¶
create_energy_monitor(poll_interval_ms: int = 50, prefer_vendor: Optional[str] = None) -> Optional[EnergyMonitor]
Factory — auto-detect and return the best available EnergyMonitor.
Detection order: NVIDIA > AMD > Apple > CPU RAPL. If prefer_vendor is set, try that vendor first.
Returns None if no energy monitoring is available.
Source code in src/openjarvis/telemetry/energy_monitor.py
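The detection-order logic amounts to probing vendors in a fixed priority list, with `prefer_vendor` promoted to the front. The sketch below uses plain probe callables as stand-ins for each backend's `available()` check.

```python
from typing import Callable, Dict, Optional

DETECTION_ORDER = ["nvidia", "amd", "apple", "cpu_rapl"]

def pick_vendor(probes: Dict[str, Callable[[], bool]],
                prefer_vendor: Optional[str] = None) -> Optional[str]:
    """Return the first vendor whose probe succeeds, or None."""
    order = list(DETECTION_ORDER)
    if prefer_vendor in order:
        order.remove(prefer_vendor)
        order.insert(0, prefer_vendor)   # preferred vendor is tried first
    for vendor in order:
        if probes.get(vendor, lambda: False)():
            return vendor
    return None                          # no energy monitoring available

# NVIDIA probe fails, so detection falls through to CPU RAPL.
vendor = pick_vendor({"nvidia": lambda: False, "cpu_rapl": lambda: True})
```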
compute_phase_metrics
¶
compute_phase_metrics(session: TelemetrySession, start_ns: int, end_ns: int, tokens: int) -> dict
Compute energy/power metrics for a phase window.
Source code in src/openjarvis/telemetry/phase_metrics.py
split_at_ttft
¶
split_at_ttft(session: TelemetrySession, start_ns: int, ttft_ns: int, end_ns: int, input_tokens: int, output_tokens: int) -> tuple[dict, dict]
Split energy at TTFT boundary into prefill and decode phases.
Source code in src/openjarvis/telemetry/phase_metrics.py
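The split works by integrating power over the prefill window [start, ttft) and the decode window [ttft, end], then dividing by the token count each phase handled. This standalone sketch takes raw (timestamp, power) pairs directly; the real function reads them from a TelemetrySession, and the returned dict keys here are illustrative.

```python
def split_energy_at_ttft(samples, start_ns, ttft_ns, end_ns,
                         input_tokens, output_tokens):
    """Return (prefill, decode) energy dicts from power samples (illustrative)."""
    def window_energy(lo_ns, hi_ns):
        pts = [p for t, p in samples if lo_ns <= t < hi_ns]
        if not pts:
            return 0.0
        # Mean power over the window times its duration in seconds.
        return sum(pts) / len(pts) * (hi_ns - lo_ns) / 1e9

    prefill_j = window_energy(start_ns, ttft_ns)
    decode_j = window_energy(ttft_ns, end_ns)
    prefill = {"energy_joules": prefill_j,
               "energy_per_token_joules": prefill_j / input_tokens if input_tokens else 0.0}
    decode = {"energy_joules": decode_j,
              "energy_per_token_joules": decode_j / output_tokens if output_tokens else 0.0}
    return prefill, decode

# 1 s of prefill at 400 W, then 2 s of decode at 300 W (10 ms sampling).
samples = [(t * 10_000_000, 400.0) for t in range(100)] + \
          [(1_000_000_000 + t * 10_000_000, 300.0) for t in range(200)]
prefill, decode = split_energy_at_ttft(
    samples, 0, 1_000_000_000, 3_000_000_000,
    input_tokens=200, output_tokens=100)
```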
compute_itl_stats
¶
Compute ITL statistics from token arrival timestamps (in ms).
Returns dict with p50_ms, p90_ms, p95_ms, p99_ms, mean_ms, min_ms, max_ms.
Source code in src/openjarvis/telemetry/itl.py
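ITL statistics are computed from the gaps between consecutive token arrivals. A standalone sketch of that pipeline (the percentile convention here, nearest-rank over sorted gaps, is one common choice and an assumption, not necessarily what the module uses):

```python
from statistics import mean

def itl_stats(arrival_ms):
    """Inter-token latency stats from token arrival timestamps in ms."""
    # Gaps between consecutive token arrivals, sorted for percentiles.
    gaps = sorted(b - a for a, b in zip(arrival_ms, arrival_ms[1:]))

    def pct(p):
        # Nearest-rank percentile over the sorted gaps.
        idx = min(len(gaps) - 1, max(0, round(p / 100 * (len(gaps) - 1))))
        return gaps[idx]

    return {"p50_ms": pct(50), "p95_ms": pct(95), "mean_ms": mean(gaps),
            "min_ms": gaps[0], "max_ms": gaps[-1]}

# 10 tokens arriving every 20 ms, with one 60 ms stall in the middle.
arrivals = [0, 20, 40, 60, 120, 140, 160, 180, 200, 220]
stats = itl_stats(arrivals)
```

The tail percentiles (p95/p99) surface stalls like the 60 ms gap above, which the mean alone would smooth over.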
compute_mfu
¶
Compute Model FLOPs Utilization.
MFU = actual_tflops / (peak_tflops * num_gpus)
Source code in src/openjarvis/telemetry/flops.py
estimate_flops
¶
Estimate FLOPs for an inference pass.
Uses the 2 * P * T approximation where P = params, T = total tokens. Returns (total_flops, flops_per_token).
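The 2 · P · T approximation and the MFU formula above chain together directly. A minimal sketch (parameter count and token total are illustrative):

```python
def estimate_flops(params: float, total_tokens: int):
    """Return (total_flops, flops_per_token) via the 2 * P * T rule."""
    flops_per_token = 2 * params              # ~2 FLOPs per parameter per token
    return flops_per_token * total_tokens, flops_per_token

def compute_mfu(actual_tflops: float, peak_tflops: float, num_gpus: int = 1) -> float:
    """MFU = actual_tflops / (peak_tflops * num_gpus)."""
    return actual_tflops / (peak_tflops * num_gpus)

# An 8B-parameter model generating 512 tokens.
total_flops, flops_per_token = estimate_flops(params=8e9, total_tokens=512)
```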