environment
environment
¶
Environment provider ABC for benchmarks requiring external environments.
Classes¶
TaskEnvironmentError
¶
Bases: RuntimeError
A task execution environment failed to start or operate.
Raised when infrastructure backing a task (Docker container, docker
compose project, tmux session, recording binaries, ...) breaks. This is
a harness/environment failure, not a model failure: runners record
it distinctly (QueryTrace.error_kind == "harness_error") so scoring
can exclude the sample from resolve-rate instead of silently counting
it as a model miss.
EnvironmentProvider
¶
Bases: ABC
Manages an external environment for evaluation benchmarks.
Provides lifecycle management (setup/reset/teardown) and environment-state validation for benchmarks that need Docker containers, ServiceNow instances, or other live systems.
Functions¶
setup
abstractmethod
¶
Start the environment and return connection info.
Returns a dict with environment-specific connection details (e.g., URLs, ports, credentials).
reset
abstractmethod
¶
Reset environment state between tasks.
Called between records within an episode to restore the environment to a known state.
validate
abstractmethod
¶
validate(record: EvalRecord) -> Tuple[bool, Dict[str, Any]]
Check environment state against expected outcome.
Args: record: The eval record containing the expected state in metadata.
Returns: (is_correct, metadata) where is_correct indicates whether the environment is in the expected state.