BEAVER: An Enterprise Benchmark for Text-to-SQL
Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker
arXiv - CS - Databases, 2024-09-03. DOI: arxiv-2409.02038
Abstract
Existing text-to-SQL benchmarks have largely been constructed from publicly available web tables, with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to believe that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are used. As we will show, the poor performance is largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because these are largely in the "dark web"; (2) the schemas of enterprise tables are more complex than those of public data, which makes the SQL-generation task innately harder; and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. We therefore propose BEAVER, a new dataset sourced from real enterprise data warehouses, pairing natural language queries with their correct SQL statements collected from actual user history. We evaluated recent LLMs on this dataset and demonstrated their poor performance on this task. We hope this dataset will help future researchers build more sophisticated text-to-SQL systems that do better on this important class of data.
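
To ground the "standard prompt engineering and RAG techniques" the abstract refers to, here is a minimal sketch of a retrieval-augmented text-to-SQL baseline. It assumes a simple bag-of-words retriever over table DDL statements and a placeholder `call_llm` client; both are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def overlap_score(question: str, ddl: str) -> int:
    """Bag-of-words overlap between the question and one CREATE TABLE statement."""
    q = Counter(question.lower().split())
    d = Counter(ddl.lower().replace("(", " ").replace(",", " ").split())
    return sum(min(cnt, d[w]) for w, cnt in q.items())

def build_prompt(question: str, schema_ddls: list[str], k: int = 5) -> str:
    """Keep only the k table definitions most similar to the question (the RAG step)."""
    top = sorted(schema_ddls, key=lambda ddl: overlap_score(question, ddl), reverse=True)[:k]
    return (
        "Given the following table definitions:\n\n"
        + "\n\n".join(top)
        + f"\n\nWrite a SQL query that answers: {question}\nSQL:"
    )

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model client; substitute an actual API call."""
    raise NotImplementedError

def text_to_sql(question: str, schema_ddls: list[str]) -> str:
    return call_llm(build_prompt(question, schema_ddls))
```

Retrieval is the step that breaks down at enterprise scale: with thousands of tables, a lexical retriever often fails to surface the right fact and dimension tables, so the model never sees the schema it needs.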
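
To make characteristic (3) concrete, here is a hedged sketch of the execution-match check commonly used to score text-to-SQL predictions, together with a gold query of the multi-join, aggregating kind the abstract describes. The star-schema names (fct_orders, dim_customer, dim_region) are invented for illustration and are not taken from BEAVER.

```python
import sqlite3

# A business-style question such as "Total 2023 revenue per sales region,
# highest first?" already requires two joins plus an aggregation.
GOLD_SQL = """
SELECT r.region_name, SUM(o.amount) AS total_revenue
FROM fct_orders o
JOIN dim_customer c ON o.customer_id = c.customer_id
JOIN dim_region r ON c.region_id = r.region_id
WHERE o.order_year = 2023
GROUP BY r.region_name
ORDER BY total_revenue DESC;
"""

def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True iff the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid or non-executable prediction
    return predicted == conn.execute(gold_sql).fetchall()
```

Under a metric like this, a prediction that picks the wrong join key or misses one dimension table scores zero, which is one reason complex enterprise questions depress accuracy so sharply.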