paperarena

PaperArena: scientific literature reasoning benchmark.

Evaluates agents on research paper comprehension with three question types: multiple choice (MC), closed answer (CA), and open answer (OA), each spanning easy, medium, and hard difficulty levels.

Source: https://github.com/Melmaphother/PaperArena
Paper: arXiv:2510.10909
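
As a rough sketch of how a benchmark item could be represented: the class, enum, and field names below are hypothetical and not taken from the PaperArena repository; only the MC/CA/OA question types and the three difficulty tiers come from the description above.

from dataclasses import dataclass
from enum import Enum


class QuestionType(str, Enum):
    MC = "multiple_choice"  # choose among given options
    CA = "closed_answer"    # short answer with a fixed reference
    OA = "open_answer"      # free-form answer


class Difficulty(str, Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass
class PaperQuestion:
    paper_id: str                 # paper the question is grounded in
    question: str
    question_type: QuestionType
    difficulty: Difficulty
    reference_answer: str         # answer options omitted for brevity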

Classes

PaperArenaDataset

PaperArenaDataset(cache_dir: Optional[str] = None)

Bases: DatasetProvider

PaperArena scientific literature reasoning benchmark.

Three question types (MC, CA, OA) across three difficulty levels.

Source code in src/openjarvis/evals/datasets/paperarena.py
def __init__(
    self,
    cache_dir: Optional[str] = None,
) -> None:
    # Use the given cache directory, or default to ~/.cache/paperarena.
    self._cache_dir = (
        Path(cache_dir) if cache_dir
        else Path.home() / ".cache" / "paperarena"
    )
    # Evaluation records parsed from the benchmark (starts empty).
    self._records: List[EvalRecord] = []
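
A minimal usage sketch, assuming the module is importable as openjarvis.evals.datasets.paperarena per the src/ layout shown above; only the constructor from the source excerpt is exercised.

from openjarvis.evals.datasets.paperarena import PaperArenaDataset

# Default cache location: ~/.cache/paperarena
dataset = PaperArenaDataset()

# Or point the cache at a custom directory, e.g. a shared scratch volume.
dataset = PaperArenaDataset(cache_dir="/data/cache/paperarena")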