CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan
arXiv - CS - Databases, published 2024-08-28. DOI: arxiv-2408.16170 (https://doi.org/arxiv-2408.16170)
Abstract
Cardinality estimation is crucial for enabling high query performance in
relational databases. Recently, learned cardinality estimation models have been
proposed to improve accuracy, but there is no systematic benchmark or dataset
that allows researchers to evaluate the progress made by new learned
approaches or to develop such approaches systematically. In this
paper, we release a benchmark containing thousands of queries over 20
distinct real-world databases for learned cardinality estimation. In contrast
to other initial benchmarks, our benchmark is much more diverse and can be used
for training and testing learned models systematically. Using this benchmark,
we explored whether learned cardinality estimation can be transferred to an
unseen dataset in a zero-shot manner. We trained GNN-based and
transformer-based models to study the problem in three setups: (1)
instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that
zero-shot cardinality estimation is promising on simple single-table
queries, but accuracy drops as soon as joins are added. However, we show
that with fine-tuning, we can still use pre-trained models for cardinality
estimation, significantly reducing training overhead compared to
instance-specific models. We are open-sourcing our scripts for collecting
statistics and generating queries and training datasets to foster more
extensive research, including from the ML community, on the important problem
of cardinality estimation and, in particular, to improve on recent directions
such as pre-trained cardinality estimation.
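
The three setups named in the abstract differ mainly in which databases a model sees before it is evaluated on a target database. The sketch below is one reading of those setups, not the authors' code: the `Plan` structure, the `make_plans` helper, and the database names are hypothetical, and it assumes that "instance-based" means training only on the target database, "zero-shot" means pre-training on the other databases with no target data, and "fine-tuned" means pre-training on the other databases and then continuing training on the target.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Plan:
    """Which databases a model sees at each stage (hypothetical structure)."""
    setup: str
    pretrain_dbs: List[str]  # databases used for pre-training (may be empty)
    target_train_dbs: List[str]  # target databases used for training or fine-tuning
    test_dbs: List[str]  # databases used for evaluation


def make_plans(all_dbs: List[str], target: str) -> List[Plan]:
    """Build one plan per setup for a single held-out target database."""
    if target not in all_dbs:
        raise ValueError(f"unknown target database: {target}")
    others = [db for db in all_dbs if db != target]
    return [
        # Assumption: train from scratch on the target database only.
        Plan("instance-based", [], [target], [target]),
        # Assumption: pre-train elsewhere; the target is never seen in training.
        Plan("zero-shot", others, [], [target]),
        # Assumption: pre-train elsewhere, then fine-tune on the target.
        Plan("fine-tuned", others, [target], [target]),
    ]


if __name__ == "__main__":
    # Placeholder database names, not taken from the benchmark.
    for plan in make_plans(["db_a", "db_b", "db_c"], target="db_b"):
        print(plan)
```

Keeping the target out of `pretrain_dbs` in every plan is what makes the zero-shot result a test on an unseen database; the fine-tuned plan differs from it only by the extra training on the target.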