Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan
arXiv - CS - Databases, 2024-08-28. doi: arxiv-2408.16170
CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Cardinality estimation is crucial for enabling high query performance in
relational databases. Recently, learned cardinality estimation models have been
proposed to improve accuracy, but there is no systematic benchmark or dataset
that allows researchers to evaluate the progress made by new learned
approaches, or even to develop new learned approaches systematically. In this
paper, we release a benchmark for learned cardinality estimation containing
thousands of queries over 20 distinct real-world databases. In contrast
to other initial benchmarks, our benchmark is much more diverse and can be used
for training and testing learned models systematically. Using this benchmark,
we explored whether learned cardinality estimation can be transferred to an
unseen dataset in a zero-shot manner. We trained GNN-based and
transformer-based models to study the problem in three setups: (1)
instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that while
zero-shot cardinality estimation is promising on simple single-table
queries, accuracy drops as soon as we add joins. However, we show
that with fine-tuning, we can still utilize pre-trained models for cardinality
estimation, significantly reducing training overhead compared to
instance-specific models. We are open-sourcing our scripts to collect
statistics and generate queries and training datasets, to foster more
extensive research, including from the ML community, on the important problem
of cardinality estimation and, in particular, to improve on recent directions
such as pre-trained cardinality estimation.
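
The three setups named in the abstract differ mainly in which databases contribute training queries. The sketch below is one reading of that distinction, not the authors' code: the `Setup` type, the `make_setups` helper, and the database names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Setup:
    """Hypothetical description of one training/evaluation setup."""
    name: str
    pretrain_on: Tuple[str, ...]  # other databases used for pre-training
    train_on_target: bool         # whether target-database queries are used for training


def make_setups(databases: Sequence[str], target: str) -> Tuple[Setup, ...]:
    """Describe the three setups for one held-out target database.

    Assumed reading of the abstract:
      - instance-based: train only on queries from the target database
      - zero-shot: pre-train on the other databases, no training on the target
      - fine-tuned: pre-train on the other databases, then continue training
        on queries from the target
    In all three cases the model is evaluated on the target database.
    """
    if target not in databases:
        raise ValueError(f"unknown target database: {target!r}")
    others = tuple(d for d in databases if d != target)
    return (
        Setup("instance-based", (), True),
        Setup("zero-shot", others, False),
        Setup("fine-tuned", others, True),
    )


if __name__ == "__main__":
    # Placeholder database names; the benchmark's actual databases are not listed here.
    for setup in make_setups(["db_a", "db_b", "db_c"], target="db_c"):
        print(setup)
```

Under this reading, the zero-shot and fine-tuned setups share the same pre-training databases and differ only in whether the target database contributes any training queries.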