taubench
TauBench V2 dataset provider — multi-turn customer service benchmark.
Wraps the tau2-bench framework for evaluation within OpenJarvis. Supports airline, retail, and telecom domains.
Reference: https://github.com/sierra-research/tau2-bench
Classes
TauBenchDataset
Bases: DatasetProvider
TauBench V2 multi-turn customer service benchmark.
Wraps tau2-bench's task loading and evaluation infrastructure. Each EvalRecord represents a single customer service scenario.
Source code in src/openjarvis/evals/datasets/taubench.py
Functions
set_engine_config
set_engine_config(engine_key: Optional[str] = None, model: Optional[str] = None, temperature: float = 0.7, max_tokens: int = 4096, user_model: Optional[str] = None, num_trials: Optional[int] = None, telemetry: bool = False, gpu_metrics: bool = False) -> None
Inject the engine configuration for the agent. Called by the CLI.
Source code in src/openjarvis/evals/datasets/taubench.py
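The call shape of set_engine_config can be illustrated with a minimal, self-contained sketch. It does not import OpenJarvis: StubTauBenchDataset and EngineConfig are hypothetical stand-ins whose parameter names and defaults are copied from the signature above, and the assumption that the values are simply stored for later use is not confirmed by the source. The model name "example-model" is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineConfig:
    """Hypothetical container; fields and defaults mirror the documented signature."""
    engine_key: Optional[str] = None
    model: Optional[str] = None
    temperature: float = 0.7
    max_tokens: int = 4096
    user_model: Optional[str] = None
    num_trials: Optional[int] = None
    telemetry: bool = False
    gpu_metrics: bool = False


class StubTauBenchDataset:
    """Stand-in for TauBenchDataset. This is NOT the real implementation."""

    def __init__(self) -> None:
        self.engine_config: Optional[EngineConfig] = None

    def set_engine_config(
        self,
        engine_key: Optional[str] = None,
        model: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 4096,
        user_model: Optional[str] = None,
        num_trials: Optional[int] = None,
        telemetry: bool = False,
        gpu_metrics: bool = False,
    ) -> None:
        # Assumption: the real method keeps these values for the agent to use later.
        self.engine_config = EngineConfig(
            engine_key=engine_key,
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            user_model=user_model,
            num_trials=num_trials,
            telemetry=telemetry,
            gpu_metrics=gpu_metrics,
        )


# Override only what differs from the defaults; everything else keeps its default.
dataset = StubTauBenchDataset()
dataset.set_engine_config(model="example-model", temperature=0.0, num_trials=3)
```

Because every parameter has a default, passing keyword arguments keeps the call readable and leaves unspecified settings (for example max_tokens and telemetry) at their documented defaults.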
create_task_env
create_task_env(record: EvalRecord)
Create a TauBench task environment for evaluation.