DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu
{"title":"DSBench:数据科学代理离成为数据科学专家还有多远?","authors":"Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu","doi":"arxiv-2409.07703","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have\ndemonstrated impressive language/vision reasoning abilities, igniting the\nrecent trend of building agents for targeted applications such as shopping\nassistants or AI software engineers. Recently, many data science benchmarks\nhave been proposed to investigate their performance in the data science domain.\nHowever, existing data science benchmarks still fall short when compared to\nreal-world data science applications due to their simplified settings. To\nbridge this gap, we introduce DSBench, a comprehensive benchmark designed to\nevaluate data science agents with realistic tasks. This benchmark includes 466\ndata analysis tasks and 74 data modeling tasks, sourced from Eloquence and\nKaggle competitions. DSBench offers a realistic setting by encompassing long\ncontexts, multimodal task backgrounds, reasoning with large data files and\nmulti-table structures, and performing end-to-end data modeling tasks. Our\nevaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle\nwith most tasks, with the best agent solving only 34.12% of data analysis tasks\nand achieving a 34.74% Relative Performance Gap (RPG). These findings\nunderscore the need for further advancements in developing more practical,\nintelligent, and autonomous data science agents.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"7 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?\",\"authors\":\"Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu\",\"doi\":\"arxiv-2409.07703\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have\\ndemonstrated impressive language/vision reasoning abilities, igniting the\\nrecent trend of building agents for targeted applications such as shopping\\nassistants or AI software engineers. Recently, many data science benchmarks\\nhave been proposed to investigate their performance in the data science domain.\\nHowever, existing data science benchmarks still fall short when compared to\\nreal-world data science applications due to their simplified settings. To\\nbridge this gap, we introduce DSBench, a comprehensive benchmark designed to\\nevaluate data science agents with realistic tasks. This benchmark includes 466\\ndata analysis tasks and 74 data modeling tasks, sourced from Eloquence and\\nKaggle competitions. DSBench offers a realistic setting by encompassing long\\ncontexts, multimodal task backgrounds, reasoning with large data files and\\nmulti-table structures, and performing end-to-end data modeling tasks. Our\\nevaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle\\nwith most tasks, with the best agent solving only 34.12% of data analysis tasks\\nand achieving a 34.74% Relative Performance Gap (RPG). 
These findings\\nunderscore the need for further advancements in developing more practical,\\nintelligent, and autonomous data science agents.\",\"PeriodicalId\":501479,\"journal\":{\"name\":\"arXiv - CS - Artificial Intelligence\",\"volume\":\"7 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07703\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07703","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
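The abstract reports data modeling results through a Relative Performance Gap (RPG) metric but does not spell out its formula. As a rough illustration only (not the paper's exact definition), the sketch below shows one common way such a gap metric can be computed: normalize each agent score between a naive-baseline score and the best human leaderboard score, then average across tasks. The function name, the clamping behavior, and the example numbers are all hypothetical.

```python
# Hypothetical sketch of a Relative-Performance-Gap-style metric.
# Assumption: per-task scores share a higher-is-better metric and are
# normalized between a naive baseline and the best human submission.
# DSBench's actual definition may differ; consult the paper.

def relative_performance_gap(agent_scores, baseline_scores, best_scores):
    """Average normalized performance across data modeling tasks.

    Each argument is a list of per-task scores. Returns a value in
    [0, 1], where 1.0 means the agent matches the best human entry
    on every task.
    """
    gaps = []
    for agent, base, best in zip(agent_scores, baseline_scores, best_scores):
        denom = best - base
        if denom == 0:  # degenerate task: baseline already matches the best
            gaps.append(1.0)
        else:
            # Clamp so runs scoring below the baseline count as 0.
            gaps.append(max(0.0, min(1.0, (agent - base) / denom)))
    return sum(gaps) / len(gaps)

# Example with three made-up Kaggle-style tasks.
print(relative_performance_gap(
    agent_scores=[0.72, 0.58, 0.90],
    baseline_scores=[0.50, 0.50, 0.80],
    best_scores=[0.95, 0.85, 0.99],
))  # ~0.41: the agent closes roughly 41% of the baseline-to-best gap
```

Under this reading, a 34.74% RPG would mean the best agent recovers only about a third of the performance gap between a trivial baseline and the top human competitors, consistent with the abstract's conclusion that current agents remain far from expert level.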