A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?
Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang
arXiv - CS - Databases, 2024-08-09 (arXiv:2408.05109)
Abstract
Translating users' natural language queries (NL) into SQL queries (i.e.,
NL2SQL) can significantly reduce barriers to accessing relational databases and
support various commercial applications. The performance of NL2SQL has been
greatly enhanced with the emergence of Large Language Models (LLMs). In this
survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs,
covering the entire NL2SQL lifecycle from four aspects: (1) Model:
NL2SQL translation techniques that not only tackle NL ambiguity and
under-specification, but also properly map NL to the database schema and
instances; (2) Data: from the collection of training data and data synthesis
to address training data scarcity, to NL2SQL benchmarks; (3) Evaluation: evaluating
NL2SQL methods from multiple angles using different metrics and granularities;
and (4) Error Analysis: analyzing NL2SQL errors to find their root causes and
guide the evolution of NL2SQL models. Moreover, we provide a rule of thumb for
developing NL2SQL solutions. Finally, we discuss the research challenges and
open problems of NL2SQL in the LLM era.
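
To make the task and the evaluation aspect concrete, the following is a minimal sketch of an execution-based check: a predicted SQL query counts as correct if it returns the same rows as a reference query on the same database. The abstract does not name a specific metric, so this choice is an assumption made for illustration; the `employees` table, the example question, and the function names `run_query`, `execution_match`, and `build_demo_db` are likewise hypothetical and do not come from the survey.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset, so row order is ignored."""
    return Counter(conn.execute(sql).fetchall())


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows.

    A predicted query that fails to execute counts as a mismatch.
    """
    try:
        predicted_rows = run_query(conn, predicted_sql)
    except sqlite3.Error:
        return False
    return predicted_rows == run_query(conn, gold_sql)


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE employees (
            id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL
        );
        INSERT INTO employees VALUES
            (1, 'Ada', 'Eng', 120.0),
            (2, 'Bob', 'Eng', 95.0),
            (3, 'Cy', 'Ops', 80.0);
        """
    )
    return conn


conn = build_demo_db()

# NL question and a reference (gold) SQL translation of it.
question = "Which engineers earn more than 100?"
gold_sql = "SELECT name FROM employees WHERE dept = 'Eng' AND salary > 100"

# Three hypothetical model outputs for the same question.
pred_ok = "SELECT e.name FROM employees AS e WHERE e.salary > 100 AND e.dept = 'Eng'"
pred_bad = "SELECT name FROM employees WHERE salary > 100 OR dept = 'Eng'"
pred_invalid = "SELECT nme FROM employees"  # misspelled column name

for label, sql in [("ok", pred_ok), ("bad", pred_bad), ("invalid", pred_invalid)]:
    print(label, execution_match(conn, sql, gold_sql))
```

Comparing result sets rather than SQL strings lets syntactically different but equivalent queries (such as `pred_ok`) pass. A match on a single database instance does not prove that two queries are equivalent in general, which is one reason to evaluate with several metrics and granularities.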