TabSAL: Synthesizing Tabular data with Small agent Assisted Language models

IF 7.2 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Knowledge-Based Systems Pub Date : 2024-08-30 DOI:10.1016/j.knosys.2024.112438

{"title":"TabSAL: Synthesizing Tabular data with Small agent Assisted Language models","authors":"","doi":"10.1016/j.knosys.2024.112438","DOIUrl":null,"url":null,"abstract":"<div><p>Tabular data are widely used in machine-learning tasks because of their prevalence in various fields; however, the potential risks of data breaches in tabular data and privacy protection regulations render such data almost unavailable. Tabular data generation methods alleviate data unavailability by synthesizing privacy-free data, and generating data using language models is a novel innovation. Language models can synthesize high-quality datasets by learning knowledge from nondestructive information and recognizing the semantics of table columns. However, when current language models function as generators, their encoding methods are hindered by complicated decoding processes, and the limited predictive ability of language models restricts their generative capability. To this end, we propose an encoding method based on interactive data structures such as JavaScript Object Notation for converting tabular data. We design TabSAL, which is a pluggable tabular data generation framework with small agent assisted language models, to boost the predictive capability, resulting in high-quality synthetic datasets with a much lower computational resource cost. In addition, a benchmark that integrates eight datasets, three methods, and three assessment directions has been issued, which indicates that TabSAL surpasses the state of the art by up to 60%.</p></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":null,"pages":null},"PeriodicalIF":7.2000,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705124010724","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Tabular data are widely used in machine-learning tasks because of their prevalence in various fields; however, the potential risks of data breaches in tabular data and privacy protection regulations render such data almost unavailable. Tabular data generation methods alleviate data unavailability by synthesizing privacy-free data, and generating data using language models is a novel innovation. Language models can synthesize high-quality datasets by learning knowledge from nondestructive information and recognizing the semantics of table columns. However, when current language models function as generators, their encoding methods are hindered by complicated decoding processes, and the limited predictive ability of language models restricts their generative capability. To this end, we propose an encoding method based on interactive data structures such as JavaScript Object Notation for converting tabular data. We design TabSAL, which is a pluggable tabular data generation framework with small agent assisted language models, to boost the predictive capability, resulting in high-quality synthetic datasets with a much lower computational resource cost. In addition, a benchmark that integrates eight datasets, three methods, and three assessment directions has been issued, which indicates that TabSAL surpasses the state of the art by up to 60%.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

TabSAL：利用小型代理辅助语言模型合成表格数据

表格式数据在各个领域都非常普遍，因此被广泛应用于机器学习任务中；然而，表格式数据潜在的数据泄露风险和隐私保护法规使得这些数据几乎不可用。表格数据生成方法通过合成无隐私数据来缓解数据不可用的问题，而使用语言模型生成数据则是一种新颖的创新。语言模型可以通过学习非破坏性信息中的知识和识别表格列的语义来合成高质量的数据集。然而，目前的语言模型在作为生成器时，其编码方法受到复杂解码过程的阻碍，语言模型有限的预测能力也限制了其生成能力。为此，我们提出了一种基于 JavaScript Object Notation 等交互式数据结构的编码方法，用于转换表格数据。我们设计的 TabSAL 是一个可插拔的表格数据生成框架，它具有小型代理辅助语言模型，可提高预测能力，从而以更低的计算资源成本生成高质量的合成数据集。此外，我们还发布了一项整合了八个数据集、三种方法和三个评估方向的基准测试，结果表明 TabSAL 比现有技术水平高出 60%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Knowledge-Based Systems 工程技术-计算机：人工智能

CiteScore

14.80

自引率

12.50%

发文量

1245

审稿时长

7.8 months

期刊介绍： Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.