Can Large Language Models Predict Data Correlations from Column Names?

Proc. VLDB Endow., pp. 4310-4323. Pub Date: 2023-09-01. DOI: 10.14778/3625054.3625066

Immanuel Trummer



Abstract

Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.
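The benchmark construction described above (measure correlation in the data, then pair each result with the corresponding column names) can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the Pearson coefficient stands in for the "various correlation metrics" the abstract mentions, and the function names, the 0.9 threshold, and the toy table are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)


def label_column_pairs(table, threshold=0.9):
    """Return (name_a, name_b, is_correlated) for every column pair.

    `table` maps column names to numeric value lists. `threshold` is a
    hypothetical cutoff on |r|, not a value taken from the paper.
    """
    samples = []
    for name_a, name_b in combinations(sorted(table), 2):
        r = pearson(table[name_a], table[name_b])
        samples.append((name_a, name_b, abs(r) >= threshold))
    return samples


# Toy table with made-up values, for illustration only.
table = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "order_id": [3, 1, 5, 2, 4],
}
samples = label_column_pairs(table)
```

Each resulting tuple pairs two column names with a label derived from the data alone. A language model that sees only the names could then be evaluated on how well it recovers those labels, which is the prediction task the abstract describes.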


