Can Large Language Models Predict Data Correlations from Column Names?

Immanuel Trummer
{"title":"Can Large Language Models Predict Data Correlations from Column Names?","authors":"Immanuel Trummer","doi":"10.14778/3625054.3625066","DOIUrl":null,"url":null,"abstract":"Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14778/3625054.3625066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
大型语言模型能否从列名预测数据相关性?
最近有出版物建议使用数据库模式元素的自然语言分析来指导调整和剖析工作。其基本假设是,最先进的语言处理方法,即所谓的语言模型,能够从模式文本中提取有关数据属性的信息。 本文在数据相关性分析的背景下研究了这一假设:通过语言模型分析列名,是否有可能找到具有相关数据的列对?首先,本文介绍了数据相关性分析的新基准,该基准是通过分析数千个 Kaggle 数据集创建的(可供下载)。其次,论文利用这些数据研究了语言模型根据列名预测相关性的能力。该分析涵盖了不同的语言模型、各种相关性指标和多种准确性指标。它指出了有助于成功预测的因素,如列名的长度和单词比例。最后,研究分析了列类型对预测性能的影响。研究结果表明,模式文本可以成为有用的信息源,并为今后针对 NLP 增强型数据库调整和数据剖析的研究工作提供参考。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Cryptographically Secure Private Record Linkage Using Locality-Sensitive Hashing Utility-aware Payment Channel Network Rebalance Relational Query Synthesis ⋈ Decision Tree Learning Billion-Scale Bipartite Graph Embedding: A Global-Local Induced Approach Query Refinement for Diversity Constraint Satisfaction
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1