toolorchestra

ToolOrchestraAgent — prompted port of NVlabs ToolOrchestra (arXiv:2511.21689).

The paper RL-trains an 8B Orchestrator (nvidia/Nemotron-Orchestrator-8B) to coordinate basic tools, specialist LLMs, and generalist LLMs in multi-turn agentic loops. The trained system ranked #1 on GAIA at release.

The hybrid harness adapter for ToolOrchestra is a documented stub — running the real thing needs a separate vLLM srun for the Orchestrator-8B checkpoint, a FAISS wiki retriever, a Tavily API key, and a refactor of the upstream eval scripts. None of that fits in our cluster allocation.

This port keeps the same scope discipline: inference-time only, prompted, no RL. A cloud model plays the role of the orchestrator, dispatching to a pool of (tool | specialist_llm | generalist_llm) workers in a reactive loop. The loop is the paradigm; the orchestrator weights are not.

Why ship this at all if it's not the "real" thing? Because the prompted upper-bound is useful as a reference point alongside the other paradigms, and because the OpenJarvis registry needs all six entries for the distillation pipeline to slot ToolOrchestra in alongside the rest.

Pipeline per task:

  1. Orchestrator (cloud) reads question + numbered worker pool.
  2. Each turn it emits {"action": "call_worker", "worker_id": int, "input": str} or {"action": "final_answer", "answer": str}.
  3. The orchestrator gets up to max_turns (default 6) worker calls, after which it is prompted to give a final answer; on a parse failure, the loop falls back to the strongest worker.
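
The steps above can be sketched as a plain loop. This is a minimal illustration, not the code in openjarvis: `run_orchestration`, the `Worker` and `Orchestrator` callable types, the transcript format, 0-based worker ids, and the `strongest_worker_id` parameter are all assumptions made for the example. It also assumes the parse-failure fallback applies on every turn, which the description above does not spell out.

```python
import json
from typing import Callable, List

# Hypothetical types: a worker maps an input string to an output string;
# the orchestrator maps the transcript so far to raw model text.
Worker = Callable[[str], str]
Orchestrator = Callable[[List[str]], str]


def run_orchestration(
    question: str,
    workers: List[Worker],
    orchestrator: Orchestrator,
    max_turns: int = 6,
    strongest_worker_id: int = 0,  # assumed: caller knows which worker is strongest
) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(max_turns):
        raw = orchestrator(transcript)
        try:
            action = json.loads(raw)
            kind = action["action"]
            if kind == "final_answer":
                return str(action["answer"])
            if kind != "call_worker":
                raise ValueError(f"unknown action: {kind!r}")
            worker_id = int(action["worker_id"])
            if not 0 <= worker_id < len(workers):
                raise ValueError(f"worker_id out of range: {worker_id}")
            worker_input = str(action["input"])
        except (ValueError, KeyError, TypeError):
            # Parse failure: hand the original question to the strongest worker.
            return workers[strongest_worker_id](question)
        result = workers[worker_id](worker_input)
        transcript.append(f"Worker {worker_id} -> {result}")

    # Turn budget exhausted: ask once more, this time demanding a final answer.
    transcript.append("Turn budget exhausted. Reply with a final_answer action now.")
    try:
        return str(json.loads(orchestrator(transcript))["answer"])
    except (ValueError, KeyError, TypeError):
        return workers[strongest_worker_id](question)
```

Catching `ValueError` covers `json.JSONDecodeError`, which is a subclass of it, so malformed JSON, missing keys, and wrongly typed fields all take the same fallback path.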

Workers come from cfg["workers"] or, failing that, a default pool: local Qwen (if vLLM is up), plus a web-search tool via Anthropic, Opus 4.7, and gpt-5-mini.
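
A sketch of that selection rule, under stated assumptions: `resolve_workers` and the `vllm_up` flag are hypothetical, the worker entries are plain dicts with placeholder `name` labels rather than real model ids or the project's actual worker schema, and treating an empty `cfg["workers"]` the same as a missing one is a choice made for the example.

```python
from typing import Any, Dict, List


def resolve_workers(cfg: Dict[str, Any], vllm_up: bool) -> List[Dict[str, str]]:
    """Return the configured worker pool, or a default pool if none is configured."""
    configured = cfg.get("workers")
    if configured:
        # Copy so later edits to the pool do not mutate the caller's config.
        return list(configured)

    pool: List[Dict[str, str]] = []
    if vllm_up:
        # The local model is only offered when a vLLM server is reachable.
        pool.append({"name": "local-qwen"})
    # Placeholder labels for the cloud-side defaults named in the text.
    pool.extend(
        [
            {"name": "web-search"},
            {"name": "opus-4.7"},
            {"name": "gpt-5-mini"},
        ]
    )
    return pool
```

Because the orchestrator addresses workers by number, the order of this list determines which `worker_id` each worker receives.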

Not yet validated end-to-end in the hybrid harness: the hybrid adapter raises NotImplementedError. Treat results from this paradigm as preliminary until we have a real ToolOrchestra-8B deployment.

Classes

ToolOrchestraAgent

ToolOrchestraAgent(engine: InferenceEngine, model: str, *, local_model: Optional[str] = None, local_endpoint: Optional[str] = None, cloud_endpoint: str = 'anthropic', cfg: Optional[Dict[str, Any]] = None, bus: Optional[Any] = None, temperature: Optional[float] = None, max_tokens: Optional[int] = None)

Bases: LocalCloudAgent

Prompted multi-turn dispatcher over a mixed worker pool.

Inference-only port — does NOT use the RL-trained Nemotron-Orchestrator-8B. See module docstring for what's missing relative to the published paper.

Source code in src/openjarvis/agents/hybrid/_base.py
def __init__(
    self,
    engine: InferenceEngine,
    model: str,
    *,
    local_model: Optional[str] = None,
    local_endpoint: Optional[str] = None,
    cloud_endpoint: str = "anthropic",
    cfg: Optional[Dict[str, Any]] = None,
    bus: Optional[Any] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> None:
    super().__init__(
        engine,
        model,
        bus=bus,
        temperature=temperature,
        max_tokens=max_tokens,
    )
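    self._cloud_model = model
    # Normalize the endpoint name; None or "" falls back to "anthropic".
    self._cloud_endpoint = (cloud_endpoint or "anthropic").lower()
    self._local_model = local_model
    self._local_endpoint = local_endpoint
    # Shallow-copy cfg so the agent never aliases the caller's dict.
    self._cfg: Dict[str, Any] = dict(cfg or {})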
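
The two normalizations in the constructor can be exercised on their own. The helper below is hypothetical (`normalize_agent_args` is not part of openjarvis); it only reproduces the `cloud_endpoint` and `cfg` expressions from the listing above so their edge cases are easy to see.

```python
from typing import Any, Dict, Optional, Tuple


def normalize_agent_args(
    cloud_endpoint: Optional[str] = "anthropic",
    cfg: Optional[Dict[str, Any]] = None,
) -> Tuple[str, Dict[str, Any]]:
    # Same expression as the constructor: None or "" becomes "anthropic",
    # and any other value is lowercased.
    endpoint = (cloud_endpoint or "anthropic").lower()
    # Same expression as the constructor: None becomes {}, and a supplied
    # dict is shallow-copied rather than stored by reference.
    config = dict(cfg or {})
    return endpoint, config


caller_cfg = {"workers": []}
endpoint, config = normalize_agent_args("OpenAI", caller_cfg)
config["extra"] = True  # does not leak back into caller_cfg
```

The copy is shallow: adding or removing top-level keys on the agent's side leaves the caller's dict alone, but nested objects such as the `workers` list are still shared.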

Functions