BEAVER: An Enterprise Benchmark for Text-to-SQL
Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker
arXiv - CS - Databases, 2024-09-03. DOI: arxiv-2409.02038
Abstract
Existing text-to-SQL benchmarks have largely been constructed from publicly available web tables, with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to believe that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are used. As we will show, the poor performance is largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because these are largely in the "dark web"; (2) the schemas of enterprise tables are more complex than those of public data, which makes the SQL-generation task innately harder; and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. We therefore propose BEAVER, a new dataset sourced from real enterprise data warehouses, pairing natural language queries with their correct SQL statements collected from actual user history. We evaluated recent LLMs on this dataset and demonstrated their poor performance on this task. We hope this dataset will help future researchers build more sophisticated text-to-SQL systems that do better on this important class of data.
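
To ground the "standard prompt engineering and RAG techniques" the abstract refers to, here is a minimal sketch of a retrieval-augmented text-to-SQL baseline. It assumes a simple bag-of-words retriever over table DDL statements and a placeholder `call_llm` client; both are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def overlap_score(question: str, ddl: str) -> int:
    """Bag-of-words overlap between the question and one CREATE TABLE statement."""
    q = Counter(question.lower().split())
    d = Counter(ddl.lower().replace("(", " ").replace(",", " ").split())
    return sum(min(cnt, d[w]) for w, cnt in q.items())

def build_prompt(question: str, schema_ddls: list[str], k: int = 5) -> str:
    """Keep only the k table definitions most similar to the question (the RAG step)."""
    top = sorted(schema_ddls, key=lambda ddl: overlap_score(question, ddl), reverse=True)[:k]
    return (
        "Given the following table definitions:\n\n"
        + "\n\n".join(top)
        + f"\n\nWrite a SQL query that answers: {question}\nSQL:"
    )

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model client; substitute an actual API call."""
    raise NotImplementedError

def text_to_sql(question: str, schema_ddls: list[str]) -> str:
    return call_llm(build_prompt(question, schema_ddls))
```

Retrieval is the step that breaks down at enterprise scale: with thousands of tables, a lexical retriever often fails to surface the right fact and dimension tables, so the model never sees the schema it needs.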
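
To make characteristic (3) concrete, here is a hedged sketch of the execution-match check commonly used to score text-to-SQL predictions, together with a gold query of the multi-join, aggregating kind the abstract describes. The star-schema names (fct_orders, dim_customer, dim_region) are invented for illustration and are not taken from BEAVER.

```python
import sqlite3

# A business-style question such as "Total 2023 revenue per sales region,
# highest first?" already requires two joins plus an aggregation.
GOLD_SQL = """
SELECT r.region_name, SUM(o.amount) AS total_revenue
FROM fct_orders o
JOIN dim_customer c ON o.customer_id = c.customer_id
JOIN dim_region r ON c.region_id = r.region_id
WHERE o.order_year = 2023
GROUP BY r.region_name
ORDER BY total_revenue DESC;
"""

def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True iff the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid or non-executable prediction
    return predicted == conn.execute(gold_sql).fetchall()
```

Under a metric like this, a prediction that picks the wrong join key or misses one dimension table scores zero, which is one reason complex enterprise questions depress accuracy so sharply.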