Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia
arXiv - CS - Databases, 2024-08-27. DOI: arxiv-2408.14717 (https://doi.org/arxiv-2408.14717)
Text2SQL is Not Enough: Unifying AI and Databases with TAG
AI systems that serve natural language questions over databases promise to
unlock tremendous value. Such systems would allow users to leverage the
powerful reasoning and knowledge capabilities of language models (LMs)
alongside the scalable computational power of data management systems. These
combined capabilities would empower users to ask arbitrary natural language
questions over custom data sources. However, existing methods and benchmarks
insufficiently explore this setting. Text2SQL methods focus solely on natural
language questions that can be expressed in relational algebra, representing a
small subset of the questions real users wish to ask. Likewise,
Retrieval-Augmented Generation (RAG) considers the limited subset of queries
that can be answered with point lookups to one or a few data records within the
database. We propose Table-Augmented Generation (TAG), a unified and
general-purpose paradigm for answering natural language questions over
databases. The TAG model represents a wide range of interactions between the LM
and database that have been previously unexplored and creates exciting research
opportunities for leveraging the world knowledge and reasoning capabilities of
LMs over data. We systematically develop benchmarks to study the TAG problem
and find that standard methods answer no more than 20% of queries correctly,
confirming the need for further research in this area. We release code for the
benchmark at https://github.com/TAG-Research/TAG-Bench.
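
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation or the TAG-Bench code, of a question that relational algebra alone cannot express: "How many positive reviews did each 2021 movie receive?" The year filter is ordinary SQL, while "positive" requires a semantic judgement over text. The table, its rows, and the function names are all hypothetical, and `toy_lm_is_positive` is a keyword-matching stand-in for a real LM call, used here only so the example runs without a model.

```python
import sqlite3


def toy_lm_is_positive(text: str) -> bool:
    """Hypothetical stand-in for an LM call.

    A real system would prompt a language model; this keyword check
    exists only to keep the sketch self-contained and deterministic.
    """
    positive_words = {"great", "loved", "excellent"}
    lowered = text.lower()
    return any(word in lowered for word in positive_words)


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, made-up reviews table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reviews (movie TEXT, year INTEGER, review TEXT)")
    conn.executemany(
        "INSERT INTO reviews VALUES (?, ?, ?)",
        [
            ("Alpha", 2021, "Loved every minute of it."),
            ("Alpha", 2021, "Too long and dull."),
            ("Beta", 2021, "An excellent, tense thriller."),
            ("Gamma", 1999, "Great soundtrack."),
        ],
    )
    return conn


def count_positive_reviews(conn: sqlite3.Connection, year: int) -> dict:
    """Combine a relational filter with a per-row semantic judgement."""
    # Step 1: the database handles the part expressible in SQL.
    rows = conn.execute(
        "SELECT movie, review FROM reviews WHERE year = ? ORDER BY movie",
        (year,),
    ).fetchall()

    # Step 2: the (stubbed) LM judges each retrieved review.
    counts = {}
    for movie, review in rows:
        if toy_lm_is_positive(review):
            counts[movie] = counts.get(movie, 0) + 1
    return counts


conn = build_demo_db()
result = count_positive_reviews(conn, 2021)
print(result)
```

A pure Text2SQL approach has no operator for "is this review positive", and a point-lookup RAG approach would need to retrieve every matching row to count them; the sketch shows only one of many possible ways to split the work between the database and the LM.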