mini_swe_agent

mini_swe_agent ¶

MiniSWEAgent — vendored, ~330-line port of mini-SWE-agent v2.

Single-LLM agent loop with a bash tool, run inside a per-task git clone. The model iterates: read files, grep, run tests, edit, retry — the environment-interaction loop that turns SWE-bench from "predict the patch blind" (~0.30) into "actually fix the bug" (~0.77 for frontier models).

Two ways to use this module:

Standalone agent — :class:MiniSWEAgent registered as mini_swe_agent. Use it directly as the agent for a cell.
As a worker subroutine inside another paradigm — call :func:run_swe_agent_loop(task, ...). Returns a dict with the final patch, token totals, cost, etc. This is how Minions / Conductor / Advisors / SkillOrchestra / ToolOrchestra / Archon swap their one-shot worker call for a real agent loop when running SWE-bench.

Differences vs. the upstream (https://github.com/swe-agent/mini-swe-agent):

No Docker sandbox. We clone the SWE-bench repo into a tempdir and exec bash there. Network is available (pip etc.). Treat outputs as untrusted — model can run rm -rf against its own workdir, but the workdir is disposable. Don't run this on a host with secrets in the CWD.
One tool, bash. No separate submit — the loop ends when the model produces a turn with no tool calls. We extract the patch from git diff in the workdir at that point.
Trace events captured via the LocalCloudAgent thread-local trace buffer so every bash invocation + result lands in experiments/<cell>/logs/<task_id>.json.

Classes¶

MiniSWEAgent ¶

MiniSWEAgent(engine: InferenceEngine, model: str, *, local_model: Optional[str] = None, local_endpoint: Optional[str] = None, cloud_endpoint: str = 'anthropic', cfg: Optional[Dict[str, Any]] = None, bus: Optional[Any] = None, temperature: Optional[float] = None, max_tokens: Optional[int] = None)

Bases: LocalCloudAgent

Single-model bash-loop agent for SWE-bench-shaped tasks.

Configurable knobs via cfg:

backbone (str, default "cloud"): "cloud" or "local".
max_turns (int, default 30): hard cap on tool turns.
bash_timeout_s (int, default 120): per-command timeout.
output_cap (int, default 10_000): per-command stdout/stderr cap.
turn_max_tokens (int, default 4096): max_tokens per LLM turn.

Source code in src/openjarvis/agents/hybrid/_base.py

def __init__(
    self,
    engine: InferenceEngine,
    model: str,
    *,
    local_model: Optional[str] = None,
    local_endpoint: Optional[str] = None,
    cloud_endpoint: str = "anthropic",
    cfg: Optional[Dict[str, Any]] = None,
    bus: Optional[Any] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> None:
    super().__init__(
        engine,
        model,
        bus=bus,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    self._cloud_model = model
    self._cloud_endpoint = (cloud_endpoint or "anthropic").lower()
    self._local_model = local_model
    self._local_endpoint = local_endpoint
    self._cfg: Dict[str, Any] = dict(cfg or {})

Functions¶

run_swe_agent_loop ¶

run_swe_agent_loop(task: Dict[str, Any], *, backbone: str, backbone_model: str, cloud_endpoint: str = 'anthropic', local_endpoint: Optional[str] = None, initial_prompt: Optional[str] = None, max_turns: int = 30, bash_timeout: int = 120, bash_timeout_s: Optional[int] = None, output_cap: int = 10000, turn_max_tokens: int = 4096, trace_prefix: str = 'mini_swe', workdir: Optional[Path] = None, compact_at_tokens: int = 24000, compact_keep_last: int = 4) -> Dict[str, Any]

Run a mini-SWE-agent loop for one SWE-bench task. Returns:

.. code-block:: python

{
  "answer":         str,   # final framed answer with ```diff fence
  "patch":          str,   # raw unified diff from git diff
  "final_summary":  str,   # the no-tool-call assistant text (may be empty)
  "tokens_in":      int,
  "tokens_out":     int,
  "tokens_local":   int,   # bookkeeping split for paradigms
  "tokens_cloud":   int,
  "cost_usd":       float,
  "turns":          int,
  "max_turns_hit":  bool,
  "workdir":        str,
}

Captures every bash invocation + LLM turn into the active trace buffer via :func:_record_event from the LocalCloudAgent base, so callers don't have to do their own per-call instrumentation.

Args: task: SWE-bench-shaped dict with repo + base_commit + task_id + (optional) problem_statement / hints_text. backbone: "cloud" to drive the loop with the cloud model (Anthropic only today), "local" for vLLM. backbone_model: model id for the loop's backbone. cloud_endpoint / local_endpoint: SDK targets. initial_prompt: if set, used as the first user message (paradigms embed orchestrator context in here). If None, falls back to the task's problem_statement. workdir: pre-cloned repo path. If None, this function clones the repo into a tempdir and cleans it up at the end. Paradigms that want to chain multiple subloops over the same working tree can manage their own workdir.

Source code in src/openjarvis/agents/hybrid/mini_swe_agent.py

def run_swe_agent_loop(
    task: Dict[str, Any],
    *,
    backbone: str,  # "cloud" or "local"
    backbone_model: str,
    cloud_endpoint: str = "anthropic",
    local_endpoint: Optional[str] = None,
    initial_prompt: Optional[str] = None,
    max_turns: int = 30,
    bash_timeout: int = 120,
    bash_timeout_s: Optional[int] = None,
    output_cap: int = 10_000,
    turn_max_tokens: int = 4096,
    trace_prefix: str = "mini_swe",
    workdir: Optional[Path] = None,
    compact_at_tokens: int = 24_000,
    compact_keep_last: int = 4,
) -> Dict[str, Any]:
    """Run a mini-SWE-agent loop for one SWE-bench task. Returns:

    .. code-block:: python

        {
          "answer":         str,   # final framed answer with ```diff fence
          "patch":          str,   # raw unified diff from git diff
          "final_summary":  str,   # the no-tool-call assistant text (may be empty)
          "tokens_in":      int,
          "tokens_out":     int,
          "tokens_local":   int,   # bookkeeping split for paradigms
          "tokens_cloud":   int,
          "cost_usd":       float,
          "turns":          int,
          "max_turns_hit":  bool,
          "workdir":        str,
        }

    Captures every bash invocation + LLM turn into the active trace buffer
    via :func:`_record_event` from the LocalCloudAgent base, so callers
    don't have to do their own per-call instrumentation.

    Args:
      task: SWE-bench-shaped dict with ``repo`` + ``base_commit`` + ``task_id``
        + (optional) ``problem_statement`` / ``hints_text``.
      backbone: ``"cloud"`` to drive the loop with the cloud model
        (Anthropic only today), ``"local"`` for vLLM.
      backbone_model: model id for the loop's backbone.
      cloud_endpoint / local_endpoint: SDK targets.
      initial_prompt: if set, used as the first user message (paradigms
        embed orchestrator context in here). If None, falls back to the
        task's problem_statement.
      workdir: pre-cloned repo path. If None, this function clones the
        repo into a tempdir and cleans it up at the end. Paradigms that
        want to chain multiple subloops over the same working tree can
        manage their own workdir.
    """
    if bash_timeout_s is not None:
        bash_timeout = int(bash_timeout_s)
    repo = task.get("repo") or ""
    base_commit = task.get("base_commit") or ""
    if not repo or not base_commit:
        raise ValueError(
            f"run_swe_agent_loop needs task['repo'] + task['base_commit']; "
            f"got repo={repo!r}, base_commit={base_commit!r}"
        )

    own_workdir = workdir is None
    if own_workdir:
        workdir = Path(tempfile.mkdtemp(prefix=f"mini-swe-{task.get('task_id', 'x')}-"))
        try:
            _clone_repo(repo, base_commit, workdir)
        except Exception:
            shutil.rmtree(workdir, ignore_errors=True)
            raise

    _record_event(
        {
            "kind": f"{trace_prefix}_setup",
            "repo": repo,
            "base_commit": base_commit,
            "workdir": str(workdir),
            "owns_workdir": own_workdir,
            "backbone": backbone,
            "backbone_model": backbone_model,
            "ts": time.time(),
        }
    )

    user_prompt = initial_prompt or task.get("problem_statement") or ""

    try:
        if backbone == "cloud":
            result = _loop_cloud(
                user_prompt,
                workdir,
                model=backbone_model,
                cloud_endpoint=cloud_endpoint,
                max_turns=max_turns,
                bash_timeout=bash_timeout,
                output_cap=output_cap,
                turn_max_tokens=turn_max_tokens,
                trace_prefix=trace_prefix,
            )
        elif backbone == "local":
            if not local_endpoint:
                raise ValueError(
                    "run_swe_agent_loop(backbone='local') needs local_endpoint"
                )
            result = _loop_local(
                user_prompt,
                workdir,
                model=backbone_model,
                endpoint=local_endpoint,
                max_turns=max_turns,
                bash_timeout=bash_timeout,
                output_cap=output_cap,
                turn_max_tokens=turn_max_tokens,
                trace_prefix=trace_prefix,
                compact_at_tokens=compact_at_tokens,
                compact_keep_last=compact_keep_last,
            )
        else:
            raise ValueError(f"unsupported backbone: {backbone!r}")

        patch = _extract_diff(workdir)
        framed = result["final_summary"] or "[mini-swe-agent produced no summary text]"
        if patch.strip():
            framed = f"{framed}\n\n```diff\n{patch}```"

        return {
            "answer": framed,
            "patch": patch,
            "final_summary": result["final_summary"],
            "tokens_in": result["tokens_in"],
            "tokens_out": result["tokens_out"],
            "tokens_local": result["tokens_in"] + result["tokens_out"]
            if backbone == "local"
            else 0,
            "tokens_cloud": result["tokens_in"] + result["tokens_out"]
            if backbone == "cloud"
            else 0,
            "cost_usd": (
                estimate_cost(backbone_model, result["tokens_in"], result["tokens_out"])
                if backbone == "cloud"
                else 0.0
            ),
            "turns": result["turns"],
            "max_turns_hit": result["max_turns_hit"],
            "workdir": str(workdir),
        }
    finally:
        if own_workdir:
            shutil.rmtree(workdir, ignore_errors=True)