SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik
Published in: arXiv - CS - Databases, 2024-08-22. DOI: arxiv-2408.12733 (https://doi.org/arxiv-2408.12733)

Abstract

Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress, primarily for the SQLite dialect. However, adapting these systems to other SQL dialects such as BigQuery and PostgreSQL remains a challenge because SQL syntax and functions vary across dialects. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach improves performance by up to 20% over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.
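
To make the dialect gap concrete, the following is a minimal sketch, not taken from the paper, of one question ("How many orders were placed in each year?") written for the three dialects the abstract names. The `orders` table, the `order_date` column, and the BigQuery path `my_project.my_dataset.orders` are hypothetical names invented for illustration. Only the SQLite variant is executed, using Python's standard `sqlite3` module; the other two are shown as strings.

```python
import sqlite3

# Hypothetical schema and question, invented for illustration only.
# Each query answers: "How many orders were placed in each year?"
DIALECT_QUERIES = {
    # SQLite has no EXTRACT; dates stored as text are handled with strftime().
    "sqlite": (
        "SELECT strftime('%Y', order_date) AS order_year, COUNT(*) AS n_orders "
        "FROM orders GROUP BY 1 ORDER BY 1"
    ),
    # PostgreSQL uses the standard EXTRACT(field FROM source) form.
    "postgresql": (
        "SELECT EXTRACT(YEAR FROM order_date) AS order_year, COUNT(*) AS n_orders "
        "FROM orders GROUP BY 1 ORDER BY 1"
    ),
    # BigQuery also supports EXTRACT, and tables are commonly referenced
    # with a backtick-quoted project.dataset.table path (hypothetical here).
    "bigquery": (
        "SELECT EXTRACT(YEAR FROM order_date) AS order_year, COUNT(*) AS n_orders "
        "FROM `my_project.my_dataset.orders` GROUP BY 1 ORDER BY 1"
    ),
}


def make_connection():
    """Create an in-memory SQLite database with three sample orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT)")
    conn.executemany(
        "INSERT INTO orders (order_date) VALUES (?)",
        [("2023-01-15",), ("2023-07-02",), ("2024-03-09",)],
    )
    return conn


def run_sqlite_example():
    """Run the SQLite variant and return (year, count) rows."""
    conn = make_connection()
    try:
        return conn.execute(DIALECT_QUERIES["sqlite"]).fetchall()
    finally:
        conn.close()


def fails_on_sqlite(query):
    """Return True if SQLite rejects the query text."""
    conn = make_connection()
    try:
        conn.execute(query)
        return False
    except sqlite3.OperationalError:
        return True
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_sqlite_example())
    print(fails_on_sqlite(DIALECT_QUERIES["postgresql"]))
```

The SQLite query runs as written, while the PostgreSQL form is rejected by SQLite's parser. This illustrates why a model trained mostly on SQLite data may not transfer directly to other dialects.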