Sivaprasad Sudhir, Wenbo Tao, N. Laptev, Cyrille Habis, Michael J. Cafarella, S. Madden
{"title":"Pando: Enhanced Data Skipping with Logical Data Partitioning","authors":"Sivaprasad Sudhir, Wenbo Tao, N. Laptev, Cyrille Habis, Michael J. Cafarella, S. Madden","doi":"10.14778/3598581.3598601","DOIUrl":null,"url":null,"abstract":"With enormous volumes of data, quickly retrieving data that is relevant to a query is essential for achieving high performance. Modern cloud-based database systems often partition the data into blocks and employ various techniques to skip irrelevant blocks during query execution. Several algorithms, often based on historical properties of a workload of queries run over the data, have been proposed to tune the physical layout of data to reduce the number of blocks accessed. The effectiveness of these methods at skipping blocks depends on what metadata is stored and how well the physical data layout aligns with the queries. Existing work on automatic physical database design misses significant opportunities in skipping blocks because it ignores logical predicates in the workload that exhibit strongly correlated results. In this paper, we present Pando which enables significantly better block skipping than past methods by informing physical layout decisions with correlation-aware logical partitioning. Across a range of benchmark and real-world workloads, Pando attains up to 2.8X reduction in the number of blocks scanned and up to 2.3X speedup in end-to-end query execution time over the state-of-the-art techniques.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"20 1","pages":"2316-2329"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14778/3598581.3598601","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
With enormous volumes of data, quickly retrieving data that is relevant to a query is essential for achieving high performance. Modern cloud-based database systems often partition the data into blocks and employ various techniques to skip irrelevant blocks during query execution. Several algorithms, often based on historical properties of a workload of queries run over the data, have been proposed to tune the physical layout of data to reduce the number of blocks accessed. The effectiveness of these methods at skipping blocks depends on what metadata is stored and how well the physical data layout aligns with the queries. Existing work on automatic physical database design misses significant opportunities in skipping blocks because it ignores logical predicates in the workload that exhibit strongly correlated results. In this paper, we present Pando which enables significantly better block skipping than past methods by informing physical layout decisions with correlation-aware logical partitioning. Across a range of benchmark and real-world workloads, Pando attains up to 2.8X reduction in the number of blocks scanned and up to 2.3X speedup in end-to-end query execution time over the state-of-the-art techniques.