Inference Engine Primitive
The Engine primitive provides the inference runtime -- the layer that connects OpenJarvis to language model servers. All backends implement a uniform interface, making it straightforward to swap between local and cloud inference without changing application code.
InferenceEngine ABC
Every engine backend extends the InferenceEngine abstract base class:

from abc import ABC, abstractmethod
from typing import Any, AsyncIterator, Dict, List, Sequence

from openjarvis.core.types import Message


class InferenceEngine(ABC):
    engine_id: str

    @abstractmethod
    def generate(
        self,
        messages: Sequence[Message],
        *,
        model: str,
        temperature: float = 0.7,
        max_tokens: int = 1024,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Synchronous completion -- returns a dict with 'content' and 'usage'."""

    @abstractmethod
    async def stream(
        self,
        messages: Sequence[Message],
        *,
        model: str,
        temperature: float = 0.7,
        max_tokens: int = 1024,
        **kwargs: Any,
    ) -> AsyncIterator[str]:
        """Yield token strings as they are generated."""

    @abstractmethod
    def list_models(self) -> List[str]:
        """Return identifiers of models available on this engine."""

    @abstractmethod
    def health(self) -> bool:
        """Return True when the engine is reachable and healthy."""

    def prepare(self, model: str) -> None:
        """Optional warm-up hook called before the first request."""
Return Format
The generate() method returns a dictionary with the following structure:

{
    "content": "The model's response text",
    "usage": {
        "prompt_tokens": 42,
        "completion_tokens": 128,
        "total_tokens": 170,
    },
    "model": "qwen3:8b",
    "finish_reason": "stop",
    "tool_calls": [...]  # Optional, present if model requested tool calls
}
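To make the interface and return format concrete, here is a minimal sketch of a toy backend. It is not part of OpenJarvis: `EchoEngine` is hypothetical, the ABC is a trimmed stand-in so the example runs on its own, messages are plain dicts rather than Message objects, and "tokens" are counted by whitespace splitting purely for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, AsyncIterator, Dict, List, Sequence


class InferenceEngine(ABC):
    """Trimmed stand-in for the real ABC so this example is self-contained."""

    engine_id: str

    @abstractmethod
    def generate(self, messages: Sequence[dict], *, model: str, **kwargs: Any) -> Dict[str, Any]: ...

    @abstractmethod
    async def stream(self, messages: Sequence[dict], *, model: str, **kwargs: Any) -> AsyncIterator[str]: ...

    @abstractmethod
    def list_models(self) -> List[str]: ...

    @abstractmethod
    def health(self) -> bool: ...


class EchoEngine(InferenceEngine):
    """Hypothetical backend that echoes the last message back to the caller."""

    engine_id = "echo"

    def generate(self, messages, *, model, **kwargs):
        text = messages[-1]["content"]
        n = len(text.split())  # toy token count, not a real tokenizer
        return {
            "content": text,
            "usage": {"prompt_tokens": n, "completion_tokens": n, "total_tokens": 2 * n},
            "model": model,
            "finish_reason": "stop",
        }

    async def stream(self, messages, *, model, **kwargs):
        # An async generator: yields one whitespace-separated word at a time.
        for word in messages[-1]["content"].split():
            yield word

    def list_models(self):
        return ["echo-1"]

    def health(self):
        return True


async def collect(engine, messages):
    """Drain the stream into a list (convenience helper for the example)."""
    return [token async for token in engine.stream(messages, model="echo-1")]
```

Because callers only depend on the four abstract methods, a class like this can stand in for a real backend in unit tests.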
When the model requests tool calls, the engine extracts them and returns them as flat entries whose arguments field is a JSON-encoded string, as in the OpenAI format:

{
    "tool_calls": [
        {
            "id": "call_abc123",
            "name": "calculator",
            "arguments": "{\"expression\": \"2 + 2\"}"
        }
    ]
}
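Since arguments arrives as a JSON string rather than a dict, a consumer has to decode it before dispatching to a tool. A minimal sketch of that step follows; `decode_tool_calls` is a hypothetical helper name, not an OpenJarvis function.

```python
import json
from typing import Any, Dict, List, Tuple


def decode_tool_calls(result: Dict[str, Any]) -> List[Tuple[str, str, Dict[str, Any]]]:
    """Return (id, name, parsed_arguments) for each tool call in a generate() result."""
    decoded = []
    # "tool_calls" is optional, so default to an empty list.
    for call in result.get("tool_calls", []):
        decoded.append((call["id"], call["name"], json.loads(call["arguments"])))
    return decoded
```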
Multi-Provider Tool Call Extraction
Engine backends normalize tool calls from different providers into the standard flat format used by agents:
| Provider | Source Format | Extraction Logic |
|---|---|---|
| OpenAI | choices[0].message.tool_calls[].function.{name, arguments} | Direct extraction, add id from tool_calls[].id |
| Anthropic | content[] blocks with type: "tool_use" | Filter tool_use blocks, map input dict to JSON arguments |
| Google | candidates[0].content.parts[] with function_call | Extract function_call.name and function_call.args, serialize args to JSON |
| LiteLLM | Flat {id, name, arguments} dicts (proxy pre-normalizes) | Pass through directly |
| Ollama | message.tool_calls[].function.{name, arguments} | Extract from Ollama native format, serialize arguments dict to JSON |
All providers produce the same output format consumed by agents:
{
    "tool_calls": [
        {"id": "call_abc", "name": "calculator", "arguments": "{\"expression\": \"2+2\"}"}
    ]
}
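The extraction rules in the table can be sketched as a single dispatch function. This is an illustration under assumptions, not the engines' actual code: `normalize_tool_calls` and `_flat` are hypothetical names, responses are treated as plain dicts, and the positional ids synthesized when a response carries none are a choice made for this sketch (the text does not say how the real engines assign them).

```python
import json
from typing import Any, Dict, List


def _flat(call_id: str, name: str, arguments: Any) -> Dict[str, str]:
    """Build one flat tool call; non-string arguments are serialized to JSON."""
    if not isinstance(arguments, str):
        arguments = json.dumps(arguments)
    return {"id": call_id, "name": name, "arguments": arguments}


def normalize_tool_calls(provider: str, response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert a provider-specific response dict into flat tool calls."""
    if provider == "openai":
        calls = response["choices"][0]["message"].get("tool_calls") or []
        return [_flat(c["id"], c["function"]["name"], c["function"]["arguments"]) for c in calls]
    if provider == "anthropic":
        blocks = [b for b in response["content"] if b.get("type") == "tool_use"]
        return [_flat(b["id"], b["name"], b["input"]) for b in blocks]
    if provider == "google":
        parts = response["candidates"][0]["content"]["parts"]
        calls = [p["function_call"] for p in parts if "function_call" in p]
        # Assumption: synthesize a positional id for each call.
        return [_flat(f"call_{i}", fc["name"], fc["args"]) for i, fc in enumerate(calls)]
    if provider == "ollama":
        calls = response["message"].get("tool_calls") or []
        return [
            _flat(c.get("id", f"call_{i}"), c["function"]["name"], c["function"]["arguments"])
            for i, c in enumerate(calls)
        ]
    if provider == "litellm":
        # The proxy already returns flat dicts, so pass them through.
        return list(response.get("tool_calls") or [])
    raise ValueError(f"unknown provider: {provider}")
```

Keeping serialization in one helper means every branch produces arguments as a string, which is what lets agents treat all providers identically.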
Backend Comparison
| Backend | Registry Key | Protocol | Default Port | GPU Required | Best For |
|---|---|---|---|---|---|
| Ollama | ollama | Native HTTP API | 11434 | No (GPU optional) | Getting started, consumer GPUs, Apple Silicon |
| vLLM | vllm | OpenAI-compatible | 8000 | NVIDIA recommended | Datacenter GPUs (A100, H100), high throughput |
| SGLang | sglang | OpenAI-compatible | 30000 | NVIDIA recommended | Structured generation, speculative decoding |
| llama.cpp | llamacpp | OpenAI-compatible | 8080 | No (CPU-optimized) | CPU-only systems, GGUF models, edge devices |
| MLX | mlx | OpenAI-compatible | 8080 | Apple Silicon | Apple Silicon native inference via MLX |
| LM Studio | lmstudio | OpenAI-compatible | 1234 | No (GPU optional) | Desktop GUI, easy model management |
| Exo | exo | OpenAI-compatible | 52415 | No (distributed) | Distributed inference across heterogeneous devices |
| Nexa | nexa | OpenAI-compatible | 18181 | No (CPU/GPU) | On-device inference with GGUF models |
| Uzu | uzu | OpenAI-compatible | 8000 | Varies | Uzu inference runtime |
| Apple FM | apple_fm | OpenAI-compatible | 8079 | Apple Silicon | Apple Foundation Model on-device inference |
| LiteLLM | litellm | OpenAI-compatible | n/a | No | Unified proxy to 100+ LLM providers |
| Cloud | cloud | Provider SDKs | n/a | No | OpenAI, Anthropic, Google API access |
Ollama
The Ollama backend communicates via Ollama's native HTTP API at /api/chat and /api/tags. It is the default engine on Apple Silicon and consumer NVIDIA GPUs.
- Default host: http://localhost:11434
- Health check: GET /api/tags
- Model listing: GET /api/tags (extracts model names)
- Tool support: Passes tools in the request payload and extracts tool_calls from responses
vLLM
The vLLM backend uses the OpenAI-compatible /v1/chat/completions API. It is recommended for datacenter GPUs (A100, H100, L40, A10, A30) and AMD GPUs.
- Default host: http://localhost:8000
- Health check: GET /v1/models
- Tool fallback: If the server returns HTTP 400 when tools are included, the engine automatically retries without tools
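The tool fallback can be sketched roughly as follows. The names here are hypothetical: `post` stands in for whatever HTTP call the engine makes and is assumed to return a (status_code, json_body) pair, and `post_with_tool_fallback` is not an OpenJarvis function.

```python
from typing import Any, Callable, Dict, Tuple

# post(payload) -> (status_code, json_body); a stand-in for the real HTTP call.
PostFn = Callable[[Dict[str, Any]], Tuple[int, Dict[str, Any]]]


def post_with_tool_fallback(post: PostFn, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Send a chat request, retrying once without tools on HTTP 400."""
    status, body = post(payload)
    if status == 400 and "tools" in payload:
        # The server may not support function calling: drop the tools and retry.
        stripped = {k: v for k, v in payload.items() if k != "tools"}
        status, body = post(stripped)
    if status >= 400:
        raise RuntimeError(f"request failed with HTTP {status}")
    return body
```

One consequence of this design is that a request with tools can silently succeed as a plain completion, so callers that require tool calls should check whether tool_calls is present in the result.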
SGLang
The SGLang backend also uses the OpenAI-compatible API. It shares the same _OpenAICompatibleEngine base class as vLLM and llama.cpp.
- Default host: http://localhost:30000
- Health check: GET /v1/models
llama.cpp
The llama.cpp backend connects to a llama-server instance via the OpenAI-compatible API. It is recommended for CPU-only systems and GGUF-quantized models.
- Default host: http://localhost:8080
- Health check: GET /v1/models
Cloud
The Cloud backend provides access to OpenAI, Anthropic, and Google models via their respective Python SDKs. It automatically detects the provider based on the model name:
- Models containing "claude" route to the Anthropic client
- Models containing "gemini" route to the Google client
- All other models route to the OpenAI client
API Keys
Cloud models require API keys set as environment variables:
OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY (or GOOGLE_API_KEY).
The cloud engine is only registered if the corresponding SDK packages are installed.
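The model-name routing rule described in this section amounts to a substring check. A minimal sketch, where `detect_provider` is a hypothetical name and the case-insensitive comparison is an assumption of this example:

```python
def detect_provider(model: str) -> str:
    """Map a model name to a provider using the substring rule."""
    name = model.lower()  # assumption: matching ignores case
    if "claude" in name:
        return "anthropic"
    if "gemini" in name:
        return "google"
    return "openai"  # every other model name falls through to OpenAI
```

Because OpenAI is the catch-all, a misspelled model name is sent to the OpenAI client rather than rejected up front.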
MLX
The MLX backend serves models via the MLX framework on Apple Silicon. It uses the OpenAI-compatible /v1/chat/completions API.
- Default host: http://localhost:8080
- Health check: GET /v1/models
- Best for: Apple Silicon Macs (M1/M2/M3/M4) running MLX-format or GGUF models natively
LM Studio
The LM Studio backend connects to the LM Studio desktop application's built-in server, which exposes an OpenAI-compatible API.
- Default host: http://localhost:1234
- Health check: GET /v1/models
- Best for: Users who prefer a GUI for model management and want a zero-configuration local server
Exo
The Exo backend connects to the Exo distributed inference runtime, which partitions model layers across multiple heterogeneous devices (e.g., a Mac and a Linux box). Exo supports Apple Silicon, NVIDIA, and AMD GPUs.
- Default host: http://localhost:52415
- Health check: GET /v1/models
- Install: pip install exo or from source at github.com/exo-explore/exo
- Best for: Running models too large for a single device by distributing across multiple Apple Silicon or heterogeneous machines
Nexa
The Nexa backend connects to the Nexa SDK on-device inference server via a FastAPI shim (nexa_shim.py). It wraps nexaai.LLM as an OpenAI-compatible API on port 18181.
- Default host: http://localhost:18181
- Health check: GET /v1/models
- Install: pip install nexaai
- Best for: On-device inference with GGUF models on Apple Silicon or CPU
Uzu
The Uzu backend connects to the Uzu inference runtime. Unlike other OpenAI-compatible engines, Uzu serves its API at the root path (no /v1 prefix).
- Default host: http://localhost:8000
- API prefix: none (endpoints are /chat/completions and /models)
- Health check: GET /models
- Best for: Uzu-optimized inference workloads
Apple FM
The Apple FM backend connects to Apple's Foundation Model SDK via a FastAPI shim (apple_fm_shim.py). It wraps python-apple-fm-sdk as an OpenAI-compatible API. Requires macOS 15+ with Apple Silicon.
Token counts
The Apple FM SDK does not expose token counts, so the shim returns 0 for all of them. Benchmark throughput and energy-per-token metrics, which are derived from token counts, are therefore not meaningful for this backend.
- Default host: http://localhost:8079
- Health check: GET /v1/models
- Install: pip install python-apple-fm-sdk
- Best for: Running Apple Foundation Models natively on Apple Silicon hardware
LiteLLM
The LiteLLM backend connects to a LiteLLM proxy server, which provides a unified OpenAI-compatible interface to 100+ LLM providers (OpenAI, Anthropic, Google, Azure, AWS Bedrock, Groq, Together, and more).
- Registry key: litellm
- Best for: Teams that need a single endpoint to route across multiple cloud providers with unified logging and cost tracking
Hardware Auto-Detection
OpenJarvis automatically detects system hardware to recommend the best engine. Detection runs at config load time via detect_hardware():
| Detection | Method | Information Extracted |
|---|---|---|
| NVIDIA GPU | nvidia-smi | GPU name, VRAM (GB), count |
| AMD GPU | rocm-smi | GPU name |
| Apple Silicon | system_profiler SPDisplaysDataType | Chipset model name |
| CPU | /proc/cpuinfo or sysctl | Brand string |
| RAM | /proc/meminfo or sysctl hw.memsize | Total GB |
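As an illustration of the RAM row, the two sources can be parsed roughly like this. Both function names are hypothetical, and the sketch works on text that has already been read, so it says nothing about how detect_hardware() actually invokes the tools or rounds its result.

```python
import re
from typing import Optional


def parse_meminfo_total_gb(text: str) -> Optional[float]:
    """Extract total RAM in GiB from the contents of /proc/meminfo (Linux)."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", text, flags=re.MULTILINE)
    if match is None:
        return None
    # /proc/meminfo reports kB (units of 1024 bytes); divide by 1024^2 for GiB.
    return round(int(match.group(1)) / (1024 ** 2), 1)


def parse_hw_memsize_gb(output: str) -> float:
    """Convert the byte count printed by `sysctl -n hw.memsize` (macOS) to GiB."""
    return round(int(output.strip()) / (1024 ** 3), 1)
```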
Engine Recommendation Logic
The recommend_engine() function maps hardware to the best engine:

graph TD
    A["detect_hardware()"] --> B{"GPU detected?"}
    B -->|No| C["llamacpp"]
    B -->|Yes| D{"GPU vendor?"}
    D -->|Apple| E["ollama"]
    D -->|NVIDIA| F{"Datacenter card?<br/>(A100, H100, H200,<br/>L40, A10, A30)"}
    F -->|Yes| G["vllm"]
    F -->|No| H["ollama"]
    D -->|AMD| I["vllm"]
    D -->|Other| J["llamacpp"]
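The decision graph translates into a short chain of conditionals. The sketch below is one possible reading of it: the `Hardware` dataclass, its field names, and the substring match on the GPU name are assumptions of this example, and the real recommend_engine() signature may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Card families the graph treats as datacenter hardware.
DATACENTER_CARDS = ("A100", "H100", "H200", "L40", "A10", "A30")


@dataclass
class Hardware:
    """Hypothetical container for detection results."""

    gpu_vendor: Optional[str] = None  # "nvidia", "amd", "apple", other, or None
    gpu_name: str = ""


def recommend_engine(hw: Hardware) -> str:
    """Follow the decision graph from top to bottom."""
    if hw.gpu_vendor is None:
        return "llamacpp"  # no GPU detected
    if hw.gpu_vendor == "apple":
        return "ollama"
    if hw.gpu_vendor == "nvidia":
        # Naive substring match; it would also match names such as "A1000".
        is_datacenter = any(card in hw.gpu_name for card in DATACENTER_CARDS)
        return "vllm" if is_datacenter else "ollama"
    if hw.gpu_vendor == "amd":
        return "vllm"
    return "llamacpp"  # unknown vendor
```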
Engine Discovery
The _discovery.py module provides three functions for finding and instantiating engines at runtime.
get_engine(config, engine_key=None)
Returns a (key, engine_instance) tuple for the requested engine, or None if unavailable:
- If engine_key is specified, try to instantiate and health-check that specific engine
- Otherwise, try the default engine from config
- If the default is unhealthy, fall back to any healthy engine via discover_engines()
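The selection order above can be sketched with stand-in objects. Everything here is hypothetical: `select_engine` takes an already-built registry dict instead of a config, `FakeEngine` exists only for the demonstration, and returning None when an explicitly requested engine is unhealthy (rather than falling back) is an assumption the text does not settle.

```python
from typing import Dict, Optional, Tuple


class FakeEngine:
    """Stand-in engine whose health is fixed at construction time."""

    def __init__(self, healthy: bool) -> None:
        self._healthy = healthy

    def health(self) -> bool:
        return self._healthy


def select_engine(
    registry: Dict[str, FakeEngine],
    default_key: str,
    engine_key: Optional[str] = None,
) -> Optional[Tuple[str, FakeEngine]]:
    """Pick an engine: explicit key, then the default, then any healthy one."""
    if engine_key is not None:
        engine = registry.get(engine_key)
        if engine is not None and engine.health():
            return engine_key, engine
        return None  # assumption: no fallback for an explicit request
    default = registry.get(default_key)
    if default is not None and default.health():
        return default_key, default
    for key, engine in registry.items():
        if engine.health():
            return key, engine
    return None
```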
discover_engines(config)
Probes all registered engines for health and returns a sorted list of healthy (key, engine) pairs. The config default engine is sorted first.

from openjarvis.engine import discover_engines
from openjarvis.core.config import load_config

config = load_config()
healthy = discover_engines(config)
# [("ollama", OllamaEngine(...)), ("vllm", VLLMEngine(...))]
discover_models(engines)
Calls list_models() on each engine and returns a dictionary mapping engine keys to model ID lists:

from openjarvis.core.config import load_config
from openjarvis.engine import discover_engines, discover_models

config = load_config()
engines = discover_engines(config)
models = discover_models(engines)
# {"ollama": ["qwen3:8b", "llama3.2:3b"], "vllm": ["mistral:7b"]}
OpenAI Compatibility Layer
The _OpenAICompatibleEngine base class provides a shared implementation for engines that serve the standard /v1/chat/completions endpoint. vLLM, SGLang, and llama.cpp all extend this base class with minimal overrides -- typically just setting engine_id and _default_host.

import httpx


class _OpenAICompatibleEngine(InferenceEngine):
    engine_id: str = ""
    _default_host: str = "http://localhost:8000"

    def __init__(self, host: str | None = None, *, timeout: float = 120.0):
        # Fall back to the class default and normalize away a trailing slash.
        self._host = (host or self._default_host).rstrip("/")
        self._client = httpx.Client(base_url=self._host, timeout=timeout)
Key behaviors:
- Synchronous generation: POST /v1/chat/completions with stream=False
- Streaming: POST /v1/chat/completions with stream=True, parsing SSE data: lines
- Model listing: GET /v1/models, extracting data[].id
- Health check: GET /v1/models with a 2-second timeout
- Tool call fallback: On HTTP 400 with tools in the payload, retries without tools (handles engines that do not support function calling)
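The streaming behavior relies on the usual OpenAI-style server-sent events, where each event is a data: line carrying a JSON chunk and the stream ends with data: [DONE]. A minimal parsing sketch follows; `iter_sse_tokens` is a hypothetical name, and it consumes already-decoded text lines rather than a live HTTP response.

```python
import json
from typing import Iterable, Iterator


def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield content tokens from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        token = delta.get("content")
        if token:  # role-only or empty deltas carry no text
            yield token
```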
Configuration
Engine hosts and defaults are configured in ~/.openjarvis/config.toml using nested per-engine sub-sections:

[engine]
default = "ollama"

[engine.ollama]
host = "http://localhost:11434"

[engine.vllm]
host = "http://localhost:8000"

[engine.sglang]
host = "http://localhost:30000"

# [engine.llamacpp]
# host = "http://localhost:8080"
# binary_path = ""
The EngineConfig dataclass and its per-engine sub-dataclasses map these settings:
| Config Class | Field | Default | Description |
|---|---|---|---|
| EngineConfig | default | "ollama" (hardware-dependent) | Preferred engine backend |
| OllamaEngineConfig | host | http://localhost:11434 | Ollama server URL |
| VLLMEngineConfig | host | http://localhost:8000 | vLLM server URL |
| SGLangEngineConfig | host | http://localhost:30000 | SGLang server URL |
| LlamaCppEngineConfig | host | http://localhost:8080 | llama.cpp server URL |
| LlamaCppEngineConfig | binary_path | "" | Path to llama.cpp binary (for managed mode) |
Backward compatibility
The old flat field names ollama_host, vllm_host, llamacpp_host, llamacpp_path, and sglang_host under [engine] are still accepted as backward-compatible properties on EngineConfig. New configurations should use the nested sub-section format.
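One way such backward-compatible properties can work is to forward the flat name to the nested sub-config. The sketch below illustrates the pattern for a single field only; the class bodies are trimmed stand-ins, not the actual OpenJarvis definitions.

```python
from dataclasses import dataclass, field


@dataclass
class OllamaEngineConfig:
    host: str = "http://localhost:11434"


@dataclass
class EngineConfig:
    default: str = "ollama"
    ollama: OllamaEngineConfig = field(default_factory=OllamaEngineConfig)

    # Legacy flat name, forwarded to the nested sub-config.
    @property
    def ollama_host(self) -> str:
        return self.ollama.host

    @ollama_host.setter
    def ollama_host(self, value: str) -> None:
        self.ollama.host = value
```

Because the property is not an annotated field, it stays out of the generated __init__ and the nested value remains the single source of truth.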
Utility Functions
messages_to_dicts()
Converts a sequence of Message objects to OpenAI-format dictionaries, handling tool calls and tool call IDs:

from openjarvis.engine._base import messages_to_dicts
from openjarvis.core.types import Message, Role

messages = [Message(role=Role.USER, content="Hello")]
dicts = messages_to_dicts(messages)
# [{"role": "user", "content": "Hello"}]
EngineConnectionError
A custom exception raised when an engine is unreachable. All engine backends catch httpx.ConnectError and httpx.TimeoutException and re-raise as EngineConnectionError:
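The catch-and-re-raise pattern looks roughly like the sketch below. To keep it runnable without third-party packages it uses Python's built-in ConnectionError and TimeoutError as stand-ins for httpx.ConnectError and httpx.TimeoutException; the exception class is redefined locally, and `call_engine` is a hypothetical helper.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class EngineConnectionError(Exception):
    """Local stand-in for the OpenJarvis exception of the same name."""


def call_engine(request: Callable[[], T], engine_id: str) -> T:
    """Run a request, translating transport failures into EngineConnectionError."""
    try:
        return request()
    except (ConnectionError, TimeoutError) as exc:
        # `from exc` keeps the original error available as __cause__.
        raise EngineConnectionError(f"{engine_id} engine is unreachable: {exc}") from exc
```

Callers can then handle a single exception type regardless of which backend or transport error was involved.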