Text2SQL is Not Enough: Unifying AI and Databases with TAG

arXiv - CS - Databases | Pub Date: 2024-08-27 | DOI: arxiv-2408.14717

Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia


Citation count: 0

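Abstract

AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.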
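As a rough illustration of the kind of LM and database interaction the abstract describes, the sketch below splits a question into three stages: the LM writes a query, the database executes it, and the LM reasons over the returned rows. This three-stage split, the `movies` table, and every function name here are assumptions made for illustration, not details taken from the abstract, and `fake_lm` is a hypothetical canned stand-in for a real language model call.

```python
import sqlite3


def fake_lm(prompt: str) -> str:
    """Hypothetical stand-in for a language model; returns canned text."""
    if prompt.startswith("SQL:"):
        # A real LM would translate the question into SQL here.
        return "SELECT title, review FROM movies ORDER BY revenue DESC LIMIT 2"
    # A real LM would reason over the rows (summarize, judge sentiment, ...).
    return "Answer based on: " + prompt.split("ROWS:", 1)[1].strip()


def synthesize_query(question: str) -> str:
    """Stage 1: ask the LM for a database query that fetches relevant rows."""
    return fake_lm("SQL: " + question)


def execute_query(conn: sqlite3.Connection, sql: str) -> list:
    """Stage 2: let the database do the scalable, exact computation."""
    return conn.execute(sql).fetchall()


def generate_answer(question: str, rows: list) -> str:
    """Stage 3: hand the question and the retrieved rows back to the LM."""
    table = "; ".join(f"{title}: {review}" for title, review in rows)
    return fake_lm(f"QUESTION: {question} ROWS: {table}")


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table with made-up data for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (title TEXT, revenue REAL, review TEXT)")
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?, ?)",
        [
            ("Alpha", 300.0, "moving and well paced"),
            ("Beta", 100.0, "forgettable"),
            ("Gamma", 200.0, "slow but rewarding"),
        ],
    )
    return conn


def answer(conn: sqlite3.Connection, question: str) -> str:
    """Run the three stages end to end."""
    sql = synthesize_query(question)
    rows = execute_query(conn, sql)
    return generate_answer(question, rows)


if __name__ == "__main__":
    db = build_demo_db()
    print(answer(db, "Summarize the reviews of the two top-grossing movies."))
```

The point of the split is that sorting and filtering stay in SQL, while the part that relational algebra cannot express (summarizing free-text reviews) is left to the LM in the final stage.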


