SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

arXiv - CS - Databases, Pub Date: 2024-08-22, DOI: arxiv-2408.12733

Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik


Abstract

Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.
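The MoE initialization described in the abstract can be pictured with a toy sketch. Everything below is an illustrative assumption rather than the authors' implementation: the abstract does not say how the self-attention layers are merged (a plain element-wise average is used here), and the weight matrices, keyword embeddings, and function names (`merge_attention`, `init_gate`, `route`) are hypothetical.

```python
import math

def merge_attention(expert_weights):
    """Element-wise average of same-shaped weight matrices, one per dialect model.
    Assumption: simple averaging; the abstract does not specify the merging rule."""
    n = len(expert_weights)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    return [[sum(w[r][c] for w in expert_weights) / n for c in range(cols)]
            for r in range(rows)]

def init_gate(keyword_embeddings):
    """Build one gate row per dialect as the mean embedding of its keywords."""
    gate = {}
    for dialect, vectors in keyword_embeddings.items():
        dim = len(vectors[0])
        gate[dialect] = [sum(v[i] for v in vectors) / len(vectors)
                         for i in range(dim)]
    return gate

def route(hidden, gate):
    """Softmax over dot products between a hidden state and each gate row."""
    scores = {d: sum(h * g for h, g in zip(hidden, row))
              for d, row in gate.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp(s - top) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}

# Hypothetical 2x2 self-attention weights from two dialect-specific models.
attention = {
    "sqlite": [[1.0, 0.0], [0.0, 1.0]],
    "postgresql": [[3.0, 2.0], [2.0, 3.0]],
}
merged = merge_attention(list(attention.values()))

# Hypothetical 2-d embeddings of dialect-specific keywords.
keywords = {
    "sqlite": [[1.0, 0.0], [0.8, 0.2]],
    "postgresql": [[0.0, 1.0], [0.2, 0.8]],
}
gate = init_gate(keywords)

# A hidden state that resembles the SQLite keywords gets more SQLite weight.
probs = route([1.0, 0.0], gate)
```

In this sketch the shared attention weights come from the merged matrix, while the gate decides how much each dialect expert contributes to a given token. A real system would operate on transformer tensors rather than nested lists.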


