API Server¶

OpenJarvis includes an OpenAI-compatible API server built on FastAPI and uvicorn. It exposes chat completion, model listing, and health check endpoints, so clients written for the OpenAI Chat Completions API can target local models by changing only the base URL.

Starting the Server¶

The server requires the [server] extra (FastAPI + uvicorn):

git clone https://github.com/open-jarvis/OpenJarvis.git
cd OpenJarvis
uv sync --extra server

Start with default settings:

jarvis serve

The server reads defaults from ~/.openjarvis/config.toml and auto-detects available engines and models. Override any option via CLI flags:

jarvis serve --host 0.0.0.0 --port 8000 --engine ollama --model qwen3:8b --agent orchestrator

CLI Options¶

| Option | Description | Default |
|---|---|---|
| `--host` | Network address to bind to | From config (`0.0.0.0`) |
| `--port` | Port number to listen on | From config (`8000`) |
| `-e` / `--engine` | Inference engine backend (`ollama`, `vllm`, `llamacpp`, `sglang`) | Auto-detected |
| `-m` / `--model` | Default model for completions | First available |
| `-a` / `--agent` | Agent for non-streaming requests (`simple`, `orchestrator`, `react`, `openhands`) | From config (`orchestrator`) |

On startup, the server prints a summary:

Starting OpenJarvis API server
  Engine: ollama
  Model:  qwen3:8b
  Agent:  orchestrator
  URL:    http://0.0.0.0:8000

Server dependency check

If the [server] extra is not installed, jarvis serve exits with a clear error message explaining how to install the required dependencies.

Endpoints¶

`POST /v1/chat/completions`¶

The primary endpoint for generating chat completions. Accepts the same request format as the OpenAI Chat Completions API.

Request Body¶

{
  "model": "qwen3:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "stream": false,
  "tools": null
}

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `string` | -- | Required. Model identifier to use for generation. |
| `messages` | `array` | -- | Required. Array of message objects with `role` and `content`. |
| `temperature` | `float` | `0.7` | Sampling temperature (0.0 to 2.0). |
| `max_tokens` | `integer` | `1024` | Maximum number of tokens to generate. |
| `stream` | `boolean` | `false` | Whether to stream the response via SSE. |
| `tools` | `array` or `null` | `null` | Tool definitions in OpenAI function-calling format. |

Each message object:

| Field | Type | Description |
|---|---|---|
| `role` | `string` | One of `system`, `user`, `assistant`, or `tool`. |
| `content` | `string` | The message content. |
| `name` | `string` or `null` | Optional name for the message author. |
| `tool_calls` | `array` or `null` | Tool calls made by the assistant (in assistant messages). |
| `tool_call_id` | `string` or `null` | ID of the tool call this message responds to (in tool messages). |
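As an illustration of the request schema above, the following sketch assembles a request body with the documented defaults and checks a few of the documented constraints on the client side. The helper name `build_chat_request` and its validation rules are assumptions made for this example; they are not part of OpenJarvis, and the server may validate requests differently.

```python
from typing import Any, Optional

# Roles listed in the message object table above.
VALID_ROLES = {"system", "user", "assistant", "tool"}


def build_chat_request(
    model: str,
    messages: list[dict[str, Any]],
    temperature: float = 0.7,
    max_tokens: int = 1024,
    stream: bool = False,
    tools: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Assemble a request body using the documented fields and defaults.

    Hypothetical client-side helper: the checks mirror the tables above,
    not the server's actual validation logic.
    """
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages must contain at least one message")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
        "tools": tools,
    }


body = build_chat_request(
    "qwen3:8b",
    [{"role": "user", "content": "What is the capital of France?"}],
)
```

The resulting `body` dictionary can be passed as the JSON payload of a `POST /v1/chat/completions` request.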

Response (Non-Streaming)¶

{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1740100800,
  "model": "qwen3:8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}

When an agent is configured on the server, non-streaming requests are routed through the agent, which can perform multi-turn reasoning with tool calls before returning a final response. When no agent is configured, requests go directly to the inference engine.
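A minimal sketch of how a client might read the non-streaming response shown above. The function name `summarize_completion` is hypothetical, and the sample dictionary is a trimmed copy of the example response rather than real server output.

```python
from typing import Any


def summarize_completion(response: dict[str, Any]) -> dict[str, Any]:
    """Pull the reply text, finish reason, and token total out of a response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage.get("total_tokens"),
    }


# Trimmed copy of the documented example response.
sample_response = {
    "id": "chatcmpl-abc123def456",
    "object": "chat.completion",
    "model": "qwen3:8b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The capital of France is Paris.",
                "tool_calls": None,
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33},
}

summary = summarize_completion(sample_response)
print(summary["text"])
```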

Tool Calls¶

When tools are provided in the request, the engine may return tool_calls in the assistant message:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "calculator",
              "arguments": "{\"expression\": \"2 + 2\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
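When a client handles tool calls itself, one possible approach is to decode each call's `arguments` string, run a matching local function, and reply with a `tool` message that carries the call's `id` in `tool_call_id`. The sketch below assumes a hypothetical `calculator` function and `TOOLS` registry; neither is provided by OpenJarvis.

```python
import json
from typing import Any, Callable


def calculator(expression: str) -> str:
    """Hypothetical tool for this example: only handles 'a + b' with integers."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


# Hypothetical registry mapping tool names to local Python callables.
TOOLS: dict[str, Callable[..., str]] = {"calculator": calculator}


def run_tool_calls(assistant_message: dict[str, Any]) -> list[dict[str, Any]]:
    """Execute each tool call and build the matching tool-role messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = TOOLS[call["function"]["name"]]
        # "arguments" is a JSON-encoded string, not a nested object.
        arguments = json.loads(call["function"]["arguments"])
        results.append(
            {
                "role": "tool",
                "tool_call_id": call["id"],
                "content": function(**arguments),
            }
        )
    return results


# Assistant message copied from the example response above.
assistant_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "calculator",
                "arguments": "{\"expression\": \"2 + 2\"}",
            },
        }
    ],
}

tool_messages = run_tool_calls(assistant_message)
```

Appending `assistant_message` and then `tool_messages` to the conversation and sending it again lets the model produce its final answer.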

`GET /v1/models`¶

Lists all models available on the configured inference engine.

Response¶

{
  "object": "list",
  "data": [
    {
      "id": "qwen3:8b",
      "object": "model",
      "created": 1740100800,
      "owned_by": "openjarvis"
    },
    {
      "id": "llama3.1:8b",
      "object": "model",
      "created": 1740100800,
      "owned_by": "openjarvis"
    }
  ]
}

`GET /health`¶

Health check endpoint that verifies the inference engine is responsive.

Response (Healthy)¶

HTTP 200:

{"status": "ok"}

Response (Unhealthy)¶

HTTP 503:

{"detail": "Engine unhealthy"}
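The two responses above lend themselves to a readiness probe. The following sketch polls the endpoint with the standard library until it reports healthy; the helper names, the retry count, and the delay are assumptions chosen for illustration.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_check(base_url: str = "http://localhost:8000") -> bool:
    """Return True when GET /health answers 200 with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200 and json.load(resp).get("status") == "ok"
    except (urllib.error.URLError, OSError, ValueError):
        # urlopen raises HTTPError (a URLError) for the 503 response, and
        # URLError/OSError when the server is not reachable at all.
        return False


def wait_until_healthy(
    check: Callable[[], bool], attempts: int = 10, delay: float = 1.0
) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A startup script could call `wait_until_healthy(http_health_check)` before sending the first completion request.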

`GET /dashboard`¶

Serves the built-in Savings Dashboard, an HTML page that displays real-time statistics on inference calls served locally and estimated cost savings compared to cloud API providers. The dashboard auto-refreshes every 5 seconds by polling the /v1/savings endpoint.

`GET /v1/channels`¶

Lists the registered channel backends. Per-channel connection status is available from GET /v1/channels/status.

Response¶

{
  "channels": ["slack", "discord", "telegram"]
}

`POST /v1/channels/send`¶

Sends a message to a specific channel.

Request Body¶

{
  "target": "slack",
  "message": "Hello from Jarvis!"
}

Response¶

{
  "status": "sent",
  "target": "slack"
}

`GET /v1/channels/status`¶

Shows the connection status of every configured channel.

Response¶

{
  "channels": {
    "slack": "connected",
    "discord": "connected",
    "telegram": "disconnected"
  }
}

Channel endpoints

Channel endpoints require [channel] enabled = true in your config and platform-specific credentials configured in [channel.<platform>] sub-sections. When not configured, GET /v1/channels returns an empty list and other channel endpoints return 503.
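As a rough sketch of how a client might use these payloads, the helpers below find channels that are not connected in a `GET /v1/channels/status` response and build a `POST /v1/channels/send` body only for a connected target. Both function names are hypothetical, and treating any state other than `"connected"` as unavailable is an assumption.

```python
from typing import Any


def disconnected_channels(status_payload: dict[str, Any]) -> list[str]:
    """Return the names of channels whose state is not "connected"."""
    channels = status_payload.get("channels", {})
    return sorted(name for name, state in channels.items() if state != "connected")


def build_send_request(
    target: str, message: str, status_payload: dict[str, Any]
) -> dict[str, str]:
    """Build a /v1/channels/send body, refusing targets that are not connected."""
    if status_payload.get("channels", {}).get(target) != "connected":
        raise ValueError(f"channel {target!r} is not connected")
    return {"target": target, "message": message}


# Status payload copied from the example response above.
status = {
    "channels": {
        "slack": "connected",
        "discord": "connected",
        "telegram": "disconnected",
    }
}

offline = disconnected_channels(status)
payload = build_send_request("slack", "Hello from Jarvis!", status)
```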

Streaming via SSE¶

When "stream": true is set in the request, the server returns a text/event-stream response using Server-Sent Events (SSE). The response follows the same format as the OpenAI streaming API.

Each event is a data: line containing a JSON chunk, followed by a blank line:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1740100800,"model":"qwen3:8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

The stream follows this sequence:

1. Role chunk -- the first chunk contains "delta": {"role": "assistant"} with no content.
2. Content chunks -- subsequent chunks each contain a "delta": {"content": "..."} with one or more tokens.
3. Finish chunk -- a chunk with an empty delta and "finish_reason": "stop".
4. Done signal -- the literal string data: [DONE] indicates the stream is complete.

Response headers include Cache-Control: no-cache and Connection: keep-alive for proper SSE behavior.
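The chunk sequence described above can be consumed with a small line-based parser. The sketch below is one possible client-side reading of the stream, not the server's implementation; the function name `iter_sse_content` is an assumption, and the sample lines are shortened versions of the chunks shown earlier.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from SSE lines until the [DONE] signal."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # done signal: nothing after it belongs to this stream
        delta = json.loads(payload)["choices"][0]["delta"]
        content = delta.get("content")
        if content:  # role and finish chunks carry no content
            yield content


# Shortened sample stream: role chunk, two content chunks, finish chunk, done.
sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

text = "".join(iter_sse_content(sample_lines))
print(text)
```

Because the function accepts any iterable of strings, it can be fed from an HTTP client's line iterator as well as from a list.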

Client Examples¶

Examples are given for curl, the OpenAI Python library, and httpx.

Non-streaming request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one paragraph."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Streaming request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "qwen3:8b",
    "messages": [
      {"role": "user", "content": "Write a haiku about programming."}
    ],
    "stream": true
  }'

List models:

curl http://localhost:8000/v1/models

Health check:

curl http://localhost:8000/health

The OpenAI Python library works as a drop-in client by pointing base_url at the local server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Required by the library but not validated
)

# Non-streaming
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "Write a short poem about AI."}
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# List models
models = client.models.list()
for model in models.data:
    print(model.id)

Using httpx for direct HTTP requests:

import httpx
import json

BASE_URL = "http://localhost:8000"

# Non-streaming request. httpx times out after 5 seconds by default,
# which local generation can easily exceed, so allow more time.
response = httpx.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "qwen3:8b",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7,
        "max_tokens": 256,
    },
    timeout=120.0,
)
response.raise_for_status()  # surface HTTP errors before reading the body
data = response.json()
print(data["choices"][0]["message"]["content"])

# Streaming request. timeout=None disables the timeout while tokens arrive.
with httpx.stream(
    "POST",
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "qwen3:8b",
        "messages": [
            {"role": "user", "content": "Write a haiku about code."}
        ],
        "stream": True,
    },
    timeout=None,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # Skip blank separator lines and stop processing at the done signal
        if line.startswith("data: ") and line != "data: [DONE]":
            chunk = json.loads(line[6:])
            content = chunk["choices"][0]["delta"].get("content", "")
            if content:
                print(content, end="", flush=True)
print()

# List models
response = httpx.get(f"{BASE_URL}/v1/models")
for model in response.json()["data"]:
    print(model["id"])

# Health check
response = httpx.get(f"{BASE_URL}/health")
print(response.json())

Configuration via `config.toml`¶

The [server] section of ~/.openjarvis/config.toml controls default server behavior:

[server]
host = "0.0.0.0"
port = 8000
agent = "orchestrator"
model = ""
workers = 1

| Key | Type | Default | Description |
|---|---|---|---|
| `host` | `string` | `"0.0.0.0"` | Network address to bind to. Use `"127.0.0.1"` for localhost-only access. |
| `port` | `integer` | `8000` | Port number. |
| `agent` | `string` | `"orchestrator"` | Default agent for non-streaming requests. Set to `""` for direct engine mode. |
| `model` | `string` | `""` | Default model name. When empty, falls back to `[intelligence] default_model` or the first model discovered on the engine. |
| `workers` | `integer` | `1` | Number of uvicorn workers (for future use). |

CLI flags override config file values. For example, jarvis serve --port 9000 overrides the port setting in the config file.

The server also reads from other config sections at startup:

- [engine] -- determines which inference backend to connect to and its host URL.
- [intelligence] -- provides the fallback default_model when no model is specified.
- [agent] -- supplies max_turns for multi-turn agents like orchestrator.
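The precedence rules described above (CLI flags over config values, and the model fallback chain) can be sketched as follows. This is an illustration of the documented behavior, not the OpenJarvis implementation: the function names are hypothetical, the config is represented as an already-parsed dictionary, and placing `[server] model` ahead of `[intelligence] default_model` follows the table's description of the fallback.

```python
from typing import Any, Optional

# Documented defaults for the [server] section.
DEFAULTS: dict[str, Any] = {"host": "0.0.0.0", "port": 8000, "agent": "orchestrator"}


def resolve_server_settings(config: dict[str, Any], cli: dict[str, Any]) -> dict[str, Any]:
    """Layer settings: built-in defaults, then config.toml, then CLI flags."""
    settings = dict(DEFAULTS)
    server = config.get("server", {})
    settings.update({k: v for k, v in server.items() if k in DEFAULTS})
    # A CLI flag left unset (None) does not override the config value.
    settings.update({k: v for k, v in cli.items() if k in DEFAULTS and v is not None})
    return settings


def resolve_model(
    cli_model: Optional[str], config: dict[str, Any], discovered: list[str]
) -> Optional[str]:
    """Pick the first non-empty model name, else the first discovered model."""
    candidates = [
        cli_model,
        config.get("server", {}).get("model"),
        config.get("intelligence", {}).get("default_model"),
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return discovered[0] if discovered else None


# Parsed config, as it might look after reading ~/.openjarvis/config.toml.
config = {
    "server": {"host": "127.0.0.1", "port": 8000, "model": ""},
    "intelligence": {"default_model": "qwen3:8b"},
}

settings = resolve_server_settings(config, {"port": 9000, "host": None})
model = resolve_model(None, config, ["llama3.1:8b"])
```

Here the `--port 9000` flag wins over the config file, the host comes from the config file, and the model falls back to `[intelligence] default_model` because `[server] model` is empty.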

Running Behind a Reverse Proxy¶

For production deployments, run OpenJarvis behind a reverse proxy like Nginx or Caddy for TLS termination, rate limiting, and authentication.

Nginx¶

server {
    listen 443 ssl;
    server_name jarvis.example.com;

    ssl_certificate /etc/ssl/certs/jarvis.pem;
    ssl_certificate_key /etc/ssl/private/jarvis.key;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}

Disable buffering for SSE

The proxy_buffering off directive is critical for streaming responses. Without it, Nginx buffers the SSE chunks and delivers them in batches, defeating the purpose of streaming.

Caddy¶

jarvis.example.com {
    reverse_proxy 127.0.0.1:8000 {
        flush_interval -1
    }
}

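The flush_interval -1 setting makes Caddy flush each write to the client immediately instead of buffering the response, which SSE streaming depends on. Caddy already flushes responses with Content-Type: text/event-stream immediately, so the setting acts as an explicit safeguard.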

Bind to Localhost¶

When running behind a reverse proxy, bind the server to 127.0.0.1 so it only accepts connections from the proxy:

jarvis serve --host 127.0.0.1 --port 8000

Or in config.toml:

[server]
host = "127.0.0.1"
port = 8000
