PinchBench dataset provider — real-world agent task benchmark.
Clones the pinchbench/skill repo at runtime and parses task markdown files
into EvalRecords for use with AgenticRunner.
Reference: https://github.com/pinchbench/skill
Classes
PinchBenchDataset
PinchBenchDataset(path: Optional[str] = None)
Bases: DatasetProvider
PinchBench real-world agent benchmark.
Clones pinchbench/skill from GitHub (or uses a local path) and
parses task markdown files into EvalRecords.
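The clone-or-reuse step described above can be sketched as follows. This is a minimal illustration, not the provider's actual code: the helper names `clone_command` and `ensure_repo` are hypothetical, and the shallow `--depth 1` clone is an assumption made here to keep the download small.

```python
import subprocess
from pathlib import Path
from typing import List

# URL taken from the Reference line of this page.
REPO_URL = "https://github.com/pinchbench/skill"


def clone_command(dest: Path) -> List[str]:
    """Build the git invocation (hypothetical helper; shallow clone is an assumption)."""
    return ["git", "clone", "--depth", "1", REPO_URL, str(dest)]


def ensure_repo(dest: Path) -> Path:
    """Clone into dest only when it does not exist yet, then return dest."""
    if not dest.exists():
        # check=True raises CalledProcessError if git exits with a non-zero status.
        subprocess.run(clone_command(dest), check=True)
    return dest
```

Calling `ensure_repo` on a directory that already exists returns it untouched, which is roughly what using a local path instead of cloning amounts to.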
Source code in src/openjarvis/evals/datasets/pinchbench.py
def __init__(self, path: Optional[str] = None) -> None:
    self._local_path = Path(path) if path else None
    self._repo_dir: Path = self._local_path or CACHE_DIR
    self._records: List[EvalRecord] = []
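To make the fallback in the constructor concrete, here is a small standalone sketch of the same expression. The `CACHE_DIR` value and the function name `resolve_repo_dir` are placeholders invented for this example; the module's real cache location is not shown on this page.

```python
from pathlib import Path
from typing import Optional

# Placeholder value: the real CACHE_DIR is defined in the module and not shown here.
CACHE_DIR = Path.home() / ".cache" / "pinchbench-example"


def resolve_repo_dir(path: Optional[str] = None) -> Path:
    """Mirror the constructor's logic: an explicit path wins, else the cache dir."""
    local_path = Path(path) if path else None
    # Path objects are always truthy, so `or` only falls through when local_path is None.
    return local_path or CACHE_DIR


print(resolve_repo_dir("/tmp/skill"))  # the explicit local checkout
print(resolve_repo_dir())              # falls back to CACHE_DIR
print(resolve_repo_dir(""))            # an empty string also falls back to CACHE_DIR
```

Note that an empty string is treated the same as `None`, because the `if path` check tests truthiness rather than comparing against `None`.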
Functions
set_judge
set_judge(judge_backend: Any, judge_model: str) -> None
Set the judge backend/model for LLM-judge and hybrid grading.
Source code in src/openjarvis/evals/datasets/pinchbench.py
def set_judge(self, judge_backend: Any, judge_model: str) -> None:
    """Set the judge backend/model for LLM-judge and hybrid grading."""
    self._judge_backend = judge_backend
    self._judge_model = judge_model
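A minimal sketch of how `set_judge` might be used, written against a stand-in class so that it runs without the package installed. `JudgeConfigSketch`, `FakeBackend`, `has_judge`, and the model name `"example-judge-model"` are all hypothetical; only the body of `set_judge` is copied from the source above.

```python
from typing import Any


class FakeBackend:
    """Hypothetical placeholder for whatever backend object the judge expects."""


class JudgeConfigSketch:
    """Stand-in that reproduces only the set_judge body shown above."""

    def set_judge(self, judge_backend: Any, judge_model: str) -> None:
        self._judge_backend = judge_backend
        self._judge_model = judge_model

    def has_judge(self) -> bool:
        # Hypothetical helper, not part of the documented class. getattr with a
        # default avoids an AttributeError if set_judge was never called.
        return getattr(self, "_judge_backend", None) is not None


dataset = JudgeConfigSketch()
print(dataset.has_judge())  # no judge configured yet

dataset.set_judge(FakeBackend(), "example-judge-model")
print(dataset.has_judge())  # judge backend and model are now stored
```

Since the constructor shown earlier does not assign the judge attributes, a caller that relies on LLM-judge or hybrid grading would presumably need to call `set_judge` before grading begins.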