Skip to content

API Server

OpenJarvis includes an OpenAI-compatible API server built on FastAPI and uvicorn. It exposes chat completion, model listing, and health check endpoints, making it a drop-in replacement for the OpenAI API when working with local models.

Starting the Server

The server requires the [server] extra (FastAPI + uvicorn):

git clone https://github.com/open-jarvis/OpenJarvis.git
cd OpenJarvis
uv sync --extra server

Start with default settings:

jarvis serve

The server reads defaults from ~/.openjarvis/config.toml and auto-detects available engines and models. Override any option via CLI flags:

jarvis serve --host 0.0.0.0 --port 8000 --engine ollama --model qwen3:8b --agent orchestrator

CLI Options

Option Description Default
--host Network address to bind to From config (0.0.0.0)
--port Port number to listen on From config (8000)
-e / --engine Inference engine backend (ollama, vllm, llamacpp, sglang) Auto-detected
-m / --model Default model for completions First available
-a / --agent Agent for non-streaming requests (simple, orchestrator, react, openhands) From config (orchestrator)

On startup, the server prints a summary:

Starting OpenJarvis API server
  Engine: ollama
  Model:  qwen3:8b
  Agent:  orchestrator
  URL:    http://0.0.0.0:8000

Server dependency check

If the [server] extra is not installed, jarvis serve exits with a clear error message explaining how to install the required dependencies.

Endpoints

POST /v1/chat/completions

The primary endpoint for generating chat completions. Accepts the same request format as the OpenAI Chat Completions API.

Request Body

{
  "model": "qwen3:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "stream": false,
  "tools": null
}
Parameter Type Default Description
model string -- Required. Model identifier to use for generation.
messages array -- Required. Array of message objects with role and content.
temperature float 0.7 Sampling temperature (0.0 to 2.0).
max_tokens integer 1024 Maximum number of tokens to generate.
stream boolean false Whether to stream the response via SSE.
tools array or null null Tool definitions in OpenAI function-calling format.

Each message object:

Field Type Description
role string One of system, user, assistant, or tool.
content string The message content.
name string or null Optional name for the message author.
tool_calls array or null Tool calls made by the assistant (in assistant messages).
tool_call_id string or null ID of the tool call this message responds to (in tool messages).

Response (Non-Streaming)

{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1740100800,
  "model": "qwen3:8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}

When an agent is configured on the server, non-streaming requests are routed through the agent, which can perform multi-turn reasoning with tool calls before returning a final response. When no agent is configured, requests go directly to the inference engine.

Tool Calls

When tools are provided in the request, the engine may return tool_calls in the assistant message:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "calculator",
              "arguments": "{\"expression\": \"2 + 2\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}

GET /v1/models

Lists all models available on the configured inference engine.

Response

{
  "object": "list",
  "data": [
    {
      "id": "qwen3:8b",
      "object": "model",
      "created": 1740100800,
      "owned_by": "openjarvis"
    },
    {
      "id": "llama3.1:8b",
      "object": "model",
      "created": 1740100800,
      "owned_by": "openjarvis"
    }
  ]
}

GET /health

Health check endpoint that verifies the inference engine is responsive.

Response (Healthy)

HTTP 200:

{"status": "ok"}

Response (Unhealthy)

HTTP 503:

{"detail": "Engine unhealthy"}

GET /dashboard

Serves the built-in Savings Dashboard, an HTML page that displays real-time statistics on inference calls served locally and estimated cost savings compared to cloud API providers. The dashboard auto-refreshes every 5 seconds by polling the /v1/savings endpoint.

GET /v1/channels

List registered channel backends and their connection status.

Response

{
  "channels": ["slack", "discord", "telegram"]
}

POST /v1/channels/send

Send a message to a specific channel.

Request Body

{
  "target": "slack",
  "message": "Hello from Jarvis!"
}

Response

{
  "status": "sent",
  "target": "slack"
}

GET /v1/channels/status

Show connection status for all configured channels.

Response

{
  "channels": {
    "slack": "connected",
    "discord": "connected",
    "telegram": "disconnected"
  }
}

Channel endpoints

Channel endpoints require [channel] enabled = true in your config and platform-specific credentials configured in [channel.<platform>] sub-sections. When not configured, GET /v1/channels returns an empty list and other channel endpoints return 503.

Streaming via SSE

When "stream": true is set in the request, the server returns a text/event-stream response using Server-Sent Events (SSE). The response follows the same format as the OpenAI streaming API.

Each event is a data: line containing a JSON chunk, followed by a blank line:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

The stream follows this sequence:

  1. Role chunk -- first chunk contains "delta": {"role": "assistant"} with no content.
  2. Content chunks -- subsequent chunks each contain a "delta": {"content": "..."} with one or more tokens.
  3. Finish chunk -- a chunk with an empty delta and "finish_reason": "stop".
  4. Done signal -- the literal string data: [DONE] indicates the stream is complete.

Response headers include Cache-Control: no-cache and Connection: keep-alive for proper SSE behavior.

Client Examples

Non-streaming request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Streaming request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming."}
    ],
    "stream": true
  }'

List models:

curl http://localhost:8000/v1/models

Health check:

curl http://localhost:8000/health

The OpenAI Python library works as a drop-in client by pointing base_url at the local server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Required by the library but not validated
)

# Non-streaming
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "Write a short poem about AI."}
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# List models
models = client.models.list()
for model in models.data:
    print(model.id)

Using httpx for direct HTTP requests:

import httpx
import json

BASE_URL = "http://localhost:8000"

# Non-streaming request
response = httpx.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "qwen3:8b",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    },
)
data = response.json()
print(data["choices"][0]["message"]["content"])

# Streaming request
with httpx.stream(
    "POST",
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "qwen3:8b",
        "messages": [
            {"role": "user", "content": "Write a haiku about code."}
        ],
        "stream": True,
    },
) as response:
    for line in response.iter_lines():
        if line.startswith("data: ") and line != "data: [DONE]":
            chunk = json.loads(line[6:])
            content = chunk["choices"][0]["delta"].get("content", "")
            if content:
                print(content, end="", flush=True)
print()

# List models
response = httpx.get(f"{BASE_URL}/v1/models")
for model in response.json()["data"]:
    print(model["id"])

# Health check
response = httpx.get(f"{BASE_URL}/health")
print(response.json())

Configuration via config.toml

The [server] section of ~/.openjarvis/config.toml controls default server behavior:

[server]
host = "0.0.0.0"
port = 8000
agent = "orchestrator"
model = ""
workers = 1
Key Type Default Description
host string "0.0.0.0" Network address to bind to. Use "127.0.0.1" for localhost-only access.
port integer 8000 Port number.
agent string "orchestrator" Default agent for non-streaming requests. Set to "" for direct engine mode.
model string "" Default model name. When empty, falls back to [intelligence] default_model or the first model discovered on the engine.
workers integer 1 Number of uvicorn workers (for future use).

CLI flags override config file values. For example, jarvis serve --port 9000 overrides the port setting in the config file.

The server also reads from other config sections at startup:

  • [engine] -- determines which inference backend to connect to and its host URL.
  • [intelligence] -- provides the fallback default_model when no model is specified.
  • [agent] -- supplies max_turns for multi-turn agents like orchestrator.

Running Behind a Reverse Proxy

For production deployments, run OpenJarvis behind a reverse proxy like Nginx or Caddy for TLS termination, rate limiting, and authentication.

Nginx

server {
    listen 443 ssl;
    server_name jarvis.example.com;

    ssl_certificate /etc/ssl/certs/jarvis.pem;
    ssl_certificate_key /etc/ssl/private/jarvis.key;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}

Disable buffering for SSE

The proxy_buffering off directive is critical for streaming responses. Without it, Nginx buffers the SSE chunks and delivers them in batches, defeating the purpose of streaming.

Caddy

jarvis.example.com {
    reverse_proxy 127.0.0.1:8000 {
        flush_interval -1
    }
}

The flush_interval -1 setting disables response buffering, which is required for SSE streaming.

Bind to Localhost

When running behind a reverse proxy, bind the server to 127.0.0.1 so it only accepts connections from the proxy:

jarvis serve --host 127.0.0.1 --port 8000

Or in config.toml:

[server]
host = "127.0.0.1"
port = 8000