taubench_env ¶

TauBench task environment: runs a native OpenJarvis agent inside the tau2 simulation.

Plugs OpenJarvis's inference engine into tau2-bench's orchestrator as a HalfDuplexAgent. The multi-turn conversation loop, user simulator, domain tools, database, and evaluation all come from tau2-bench, while the agent's LLM calls go through OpenJarvis.

Classes¶

JarvisHalfDuplexAgent ¶

JarvisHalfDuplexAgent(tools: list, domain_policy: str, engine: Any, model: str, temperature: float = 0.7, max_tokens: int = 4096)

A tau2 HalfDuplexAgent backed by OpenJarvis's inference engine.

Replaces tau2's built-in LLMAgent while keeping the same interface so the Orchestrator, UserSimulator, and evaluation work unchanged.

Source code in src/openjarvis/evals/execution/taubench_env.py

def __init__(
    self,
    tools: list,
    domain_policy: str,
    engine: Any,
    model: str,
    temperature: float = 0.7,
    max_tokens: int = 4096,
) -> None:
    self.tools = tools
    self.domain_policy = domain_policy
    self._engine = engine
    self._model = model
    self._temperature = temperature
    self._max_tokens = max_tokens
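
As a minimal sketch of how the constructor stores its arguments, the stand-in below mirrors the `__init__` shown above without importing OpenJarvis or tau2. `SketchAgent`, `StubEngine`, the policy string, and the model name `"example-model"` are hypothetical names introduced for illustration. The real class also implements tau2's agent interface, which is omitted here.

```python
from typing import Any


class StubEngine:
    """Hypothetical placeholder for an OpenJarvis inference engine."""


class SketchAgent:
    """Stand-in that copies only the constructor shown above.

    Assumption: no base class and no tau2 behavior are modeled here.
    """

    def __init__(
        self,
        tools: list,
        domain_policy: str,
        engine: Any,
        model: str,
        temperature: float = 0.7,
        max_tokens: int = 4096,
    ) -> None:
        # Public attributes, stored without a leading underscore.
        self.tools = tools
        self.domain_policy = domain_policy
        # Inference settings, stored as private attributes.
        self._engine = engine
        self._model = model
        self._temperature = temperature
        self._max_tokens = max_tokens


# Only the four required arguments are passed; sampling settings use defaults.
agent = SketchAgent(
    tools=[],
    domain_policy="Be polite.",
    engine=StubEngine(),
    model="example-model",
)
```

With only the required arguments supplied, the sampling settings fall back to the defaults in the signature (`temperature=0.7`, `max_tokens=4096`).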

Functions¶

stop ¶

stop(*args, **kwargs) -> None

Cleanup hook called by the orchestrator.

Source code in src/openjarvis/evals/execution/taubench_env.py

def stop(self, *args, **kwargs) -> None:
    """Cleanup hook called by orchestrator."""
    pass
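
Because `stop` accepts `*args, **kwargs` and does nothing, it tolerates whatever arguments a caller supplies. The sketch below illustrates this with a stand-in class. `SketchAgent` and the example call arguments are hypothetical, since the arguments the orchestrator actually passes are not documented on this page.

```python
class SketchAgent:
    """Stand-in exposing only the stop() hook shown above."""

    def stop(self, *args, **kwargs) -> None:
        """No-op cleanup hook with the same signature as the original."""


agent = SketchAgent()

# Hypothetical call shapes: no arguments, positional, and keyword.
results = [
    agent.stop(),
    agent.stop("final message"),
    agent.stop(reason="done"),
]
```

Every call returns `None` without raising, so the hook stays compatible if the caller changes what it passes.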

TauBenchTaskEnv ¶

TauBenchTaskEnv(record: EvalRecord, engine_key: Optional[str] = None, model: Optional[str] = None, temperature: float = 0.7, max_tokens: int = 4096, user_model: Optional[str] = None, num_trials: int = 1, telemetry: bool = False, gpu_metrics: bool = False)

Per-task environment for TauBench evaluation.

Creates an OpenJarvis-powered agent, plugs it into tau2's orchestrator, runs the simulation, and stores results in record.metadata for the scorer.

Source code in src/openjarvis/evals/execution/taubench_env.py

def __init__(
    self,
    record: EvalRecord,
    engine_key: Optional[str] = None,
    model: Optional[str] = None,
    temperature: float = 0.7,
    max_tokens: int = 4096,
    user_model: Optional[str] = None,
    num_trials: int = 1,
    telemetry: bool = False,
    gpu_metrics: bool = False,
) -> None:
    self._record = record
    self._num_trials = num_trials
    self._engine_key = engine_key
    self._model = model or "claude-opus-4-6"
    self._temperature = temperature
    self._max_tokens = max_tokens
    self._user_model = user_model or "gpt-5-mini-2025-08-07"
    self._telemetry = telemetry
    self._gpu_metrics = gpu_metrics
    self._system = None
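
The constructor resolves its model defaults with the `value or default` idiom. A minimal sketch of that behavior follows. `resolve_model` is a hypothetical helper (the original code inlines the expression), and `"my-local-model"` is a made-up name. Note that `or` falls back on any falsy value, so an empty string selects the default just as `None` does.

```python
from typing import Optional

# Default agent model, taken from the constructor shown above.
DEFAULT_MODEL = "claude-opus-4-6"


def resolve_model(model: Optional[str], default: str = DEFAULT_MODEL) -> str:
    """Hypothetical helper reproducing the `model or default` idiom."""
    return model or default


resolved_none = resolve_model(None)                   # falls back to the default
resolved_empty = resolve_model("")                    # empty string is falsy, also falls back
resolved_explicit = resolve_model("my-local-model")   # explicit value is kept
```

If an empty string ever needed to be treated as a deliberate value, an explicit `is None` check would be required instead. The source shown here does not make that distinction.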