CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

arXiv - CS - Databases, Pub Date: 2024-08-28, DOI: arxiv-2408.16170

Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan



Abstract

Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches, or to develop such approaches systematically. In this paper, we release a benchmark containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, ours is much more diverse and can be used to train and test learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that zero-shot cardinality estimation is promising on simple single-table queries, but accuracy drops as soon as joins are added. However, we show that with fine-tuning, pre-trained models can still be used for cardinality estimation, significantly reducing training overhead compared to instance-specific models. We are open-sourcing our scripts for collecting statistics and generating queries and training datasets to foster more extensive research, including from the ML community, on the important problem of cardinality estimation, and in particular to improve on recent directions such as pre-trained cardinality estimation.


