{"title":"CORE-Bench:通过计算可重复性代理基准促进已发表研究的可信度","authors":"Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan","doi":"arxiv-2409.11363","DOIUrl":null,"url":null,"abstract":"AI agents have the potential to aid users on a variety of consequential\ntasks, including conducting scientific research. To spur the development of\nuseful agents, we need benchmarks that are challenging, but more crucially,\ndirectly correspond to real-world tasks of interest. This paper introduces such\na benchmark, designed to measure the accuracy of AI agents in tackling a\ncrucial yet surprisingly challenging aspect of scientific research:\ncomputational reproducibility. This task, fundamental to the scientific\nprocess, involves reproducing the results of a study using the provided code\nand data. We introduce CORE-Bench (Computational Reproducibility Agent\nBenchmark), a benchmark consisting of 270 tasks based on 90 scientific papers\nacross three disciplines (computer science, social science, and medicine).\nTasks in CORE-Bench consist of three difficulty levels and include both\nlanguage-only and vision-language tasks. We provide an evaluation system to\nmeasure the accuracy of agents in a fast and parallelizable way, saving days of\nevaluation time for each run compared to a sequential implementation. We\nevaluated two baseline agents: the general-purpose AutoGPT and a task-specific\nagent called CORE-Agent. We tested both variants using two underlying language\nmodels: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on\nthe hardest task, showing the vast scope for improvement in automating routine\nscientific tasks. Having agents that can reproduce existing work is a necessary\nstep towards building agents that can conduct novel research and could verify\nand improve the performance of other research agents. We hope that CORE-Bench\ncan improve the state of reproducibility and spur the development of future\nresearch agents.","PeriodicalId":501315,"journal":{"name":"arXiv - CS - Multiagent Systems","volume":"55 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark\",\"authors\":\"Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan\",\"doi\":\"arxiv-2409.11363\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"AI agents have the potential to aid users on a variety of consequential\\ntasks, including conducting scientific research. To spur the development of\\nuseful agents, we need benchmarks that are challenging, but more crucially,\\ndirectly correspond to real-world tasks of interest. This paper introduces such\\na benchmark, designed to measure the accuracy of AI agents in tackling a\\ncrucial yet surprisingly challenging aspect of scientific research:\\ncomputational reproducibility. This task, fundamental to the scientific\\nprocess, involves reproducing the results of a study using the provided code\\nand data. We introduce CORE-Bench (Computational Reproducibility Agent\\nBenchmark), a benchmark consisting of 270 tasks based on 90 scientific papers\\nacross three disciplines (computer science, social science, and medicine).\\nTasks in CORE-Bench consist of three difficulty levels and include both\\nlanguage-only and vision-language tasks. 
We provide an evaluation system to\\nmeasure the accuracy of agents in a fast and parallelizable way, saving days of\\nevaluation time for each run compared to a sequential implementation. We\\nevaluated two baseline agents: the general-purpose AutoGPT and a task-specific\\nagent called CORE-Agent. We tested both variants using two underlying language\\nmodels: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on\\nthe hardest task, showing the vast scope for improvement in automating routine\\nscientific tasks. Having agents that can reproduce existing work is a necessary\\nstep towards building agents that can conduct novel research and could verify\\nand improve the performance of other research agents. We hope that CORE-Bench\\ncan improve the state of reproducibility and spur the development of future\\nresearch agents.\",\"PeriodicalId\":501315,\"journal\":{\"name\":\"arXiv - CS - Multiagent Systems\",\"volume\":\"55 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multiagent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11363\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multiagent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11363","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging but, more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench span three difficulty levels and include both language-only and vision-language tasks.
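As a rough illustration of this structure (the benchmark's actual schema is not given in the abstract), a CORE-Bench-style task record could be sketched as below; the field names, the easy/medium/hard level labels, and the assumption that each paper contributes one task per difficulty level (consistent with 90 papers × 3 levels = 270 tasks) are ours, for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical sketch of a CORE-Bench-style task record; the field names and
# level/modality labels are illustrative assumptions, not the benchmark's schema.
@dataclass
class ReproTask:
    paper_id: str                  # one of the 90 source papers
    discipline: Literal["computer_science", "social_science", "medicine"]
    difficulty: Literal["easy", "medium", "hard"]          # the three difficulty levels
    modality: Literal["language_only", "vision_language"]  # text-only vs. figure-reading questions
    questions: List[str] = field(default_factory=list)     # questions about the paper's reported results

# If each paper contributes one task per difficulty level: 90 papers x 3 levels = 270 tasks.
```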
We provide an evaluation system that measures the accuracy of agents in a fast and parallelizable way, saving days of evaluation time on each run compared to a sequential implementation.
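The abstract does not detail the harness internals, but the speedup comes from running the independent task evaluations concurrently rather than one after another. A minimal sketch of that idea, using only Python's standard library and a stubbed evaluate_task placeholder of our own (not CORE-Bench's API):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_task(task_id: str) -> bool:
    """Placeholder for one task evaluation: launch an isolated environment,
    run the agent, and compare its answers to the ground-truth results.
    Returns a dummy verdict so the sketch runs end to end."""
    return False  # stand-in for a real pass/fail judgment

# Hypothetical identifiers for the 270 tasks.
task_ids = [f"task_{i:03d}" for i in range(270)]

# A sequential loop costs roughly the sum of all per-task runtimes; running the
# independent evaluations concurrently bounds wall-clock time by the slowest
# tasks in each batch of workers instead.
with ThreadPoolExecutor(max_workers=16) as pool:
    verdicts = list(pool.map(evaluate_task, task_ids))

accuracy = sum(verdicts) / len(verdicts)
print(f"accuracy: {accuracy:.1%}")
```

In practice each evaluation would likely run the agent inside its own container or virtual machine, so the workers spend most of their time waiting and wall-clock time drops roughly with the number of workers, assuming the tasks really are independent.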
We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step toward building agents that can conduct novel research and that could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.