Andra Ionescu, A. Alexandridou, Leonidas Ikonomou, Kyriakos Psarakis, Kostas Patroumpas, Georgios Chatzigeorgakidis, Dimitrios Skoutas, Spiros Athanasiou, Rihan Hai, Asterios Katsifodimos
The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years but have yet to be put into practice. In this paper we propose to demonstrate the Topio Marketplace, an open-source data market platform that facilitates the search, exploration, discovery and augmentation of data assets. To support filtering, searching and discovery of data assets, we developed methods to extract and visualise a variety of metadata, as well as methods to discover related assets and mechanisms to augment them. This paper presents these methods on a real deployment of the Topio Marketplace, comprising hundreds of open and proprietary datasets.
{"title":"Topio Marketplace: Search and Discovery of Geospatial Data","authors":"Andra Ionescu, A. Alexandridou, Leonidas Ikonomou, Kyriakos Psarakis, Kostas Patroumpas, Georgios Chatzigeorgakidis, Dimitrios Skoutas, Spiros Athanasiou, Rihan Hai, Asterios Katsifodimos","doi":"10.48786/edbt.2023.73","DOIUrl":"https://doi.org/10.48786/edbt.2023.73","url":null,"abstract":"The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put to practice. In this paper we propose to demonstrate the Topio Marketplace, an open-source data market platform that facilitates the search, exploration, discovery and augmentation of data assets. To support filtering, searching and discovery of data assets, we developed methods to extract and visualise a variety of metadata, as well as methods to discover related assets and mechanism to augment them. This paper aims at presenting these methods with a real deployment of the Topio marketplace, comprising hundreds of open and proprietary datasets.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"114 1","pages":"819-822"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77595192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently managing and querying sliding windows is a key component in stream processing systems. Conventional index structures such as the B+Tree are not efficient for handling a stream of time-series data, where the data is very dynamic, and the indexes must be updated on a continuous basis. Stream processing structures such as queues can accommodate large volumes of updates (enqueue and dequeue); however, they are not efficient for fast retrieval. This paper proposes FLIRT, a parameter-free index structure that manages a sliding window over a high-velocity stream of data and simultaneously supports efficient range queries on the sliding window. FLIRT uses learned indexing to reduce the lookup time. This is enabled by organising the incoming stream of time-series data into linearly predictable segments, allowing fast queue operations such as enqueue, dequeue, and search. We further boost the search performance by introducing two multithreaded versions of FLIRT for different query workloads. Experimental results show up to 7× speedup over conventional indexes, 8× speedup over queues, and up to 109× speedup over learned indexes.
{"title":"FLIRT: A Fast Learned Index for Rolling Time frames","authors":"Guang Yang, Liang Liang, A. Hadian, T. Heinis","doi":"10.48786/edbt.2023.19","DOIUrl":"https://doi.org/10.48786/edbt.2023.19","url":null,"abstract":"Efficiently managing and querying sliding windows is a key com-ponent in stream processing systems. Conventional index structures such as the B+Tree are not efficient for handling a stream of time-series data, where the data is very dynamic, and the indexes must be updated on a continuous basis. Stream processing structures such as queues can accommodate large volumes of updates (enqueue and dequeue); however, they are not efficient for fast retrieval. This paper proposes FLIRT, a parameter-free index structure that manages a sliding window over a high-velocity stream of data and simultaneously supports efficient range queries on the sliding window. FLIRT uses learned indexing to reduce the lookup time. This is enabled by organising the incoming stream of time-series data into linearly predictable segments, allowing fast queue operations such as enqueue, dequeue, and search. We further boost the search performance by introducing two multithreaded versions of FLIRT for different query workloads. Experimental results show up to 7 × speedup over conventional indexes, 8 × speedup over queues, and up to 109 × speedup over learned indexes.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"39 1","pages":"234-246"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85503539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrea Colombo, Luigi Bellomarini, S. Ceri, Eleonora Laurenza
{"title":"Smart Derivative Contracts in DatalogMTL","authors":"Andrea Colombo, Luigi Bellomarini, S. Ceri, Eleonora Laurenza","doi":"10.48786/edbt.2023.65","DOIUrl":"https://doi.org/10.48786/edbt.2023.65","url":null,"abstract":"","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"773-781"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89342053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Lucchese, S. Orlando, R. Perego, Alberto Veneri
The most accurate machine learning models unfortunately produce black-box predictions, for which it is impossible to grasp the internal logic that leads to a specific decision. Unfolding the logic of such black-box models is of increasing importance, especially when they are used in sensitive decision-making processes. In this work we focus on forests of decision trees, which may include hundreds to thousands of decision trees to produce accurate predictions. Such complexity raises the need for developing explanations of the predictions generated by large forests. We propose a post hoc explanation method for large forests, named GAM-based Explanation of Forests (GEF), which builds a Generalized Additive Model (GAM) able to explain, both locally and globally, the impact on the predictions of a limited set of features and feature interactions. We evaluate GEF over both synthetic and real-world datasets and show that GEF can create a GAM model with high fidelity by analyzing the given forest only and without using any further information, not even the initial training dataset.
{"title":"GAM Forest Explanation","authors":"C. Lucchese, S. Orlando, R. Perego, Alberto Veneri","doi":"10.48786/edbt.2023.14","DOIUrl":"https://doi.org/10.48786/edbt.2023.14","url":null,"abstract":"Most accurate machine learning models unfortunately produce black-box predictions, for which it is impossible to grasp the internal logic that leads to a specific decision. Unfolding the logic of such black-box models is of increasing importance, especially when they are used in sensitive decision-making processes. In this work we focus on forests of decision trees, which may include hundreds to thousands of decision trees to produce accurate predictions. Such complexity raises the need of developing explanations for the predictions generated by large forests. We propose a post hoc explanation method of large forests, named GAM-based Explanation of Forests (GEF), which builds a Generalized Additive Model (GAM) able to explain, both locally and globally, the impact on the predictions of a limited set of features and feature interactions. We evaluate GEF over both synthetic and real-world datasets and show that GEF can create a GAM model with high fidelity by analyzing the given forest only and without using any further information, not even the initial training dataset.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"35 1","pages":"171-182"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80735581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Jibril, Hani Al-Sayeh, Alexander Baumstark, K. Sattler
Offloading graph analytics to the GPU yields significant performance speedups. In heterogeneous hybrid transactional/analytical graph processing (graph H2TAP), where each graph workload type is executed on the most suitable processor, transactions are executed on a CPU-based main graph and analytics are executed on a GPU-optimized graph replica. The problem that arises, as a result, is that updates made by transactions on the main graph must be handled specially with respect to the graph replica. In this paper, we present a fast and efficient approach to this update handling problem, based on a delta store optimized for graphs. The delta store is a differential graph store that captures the transactional updates, which are later propagated to the graph replica so that analytical queries are executed on the most recently committed version of the graph in accordance with freshness requirements. Our approach ensures consistency between the main graph and the replica. Our evaluation shows the performance advantage of our approach over existing HTAP approaches.
{"title":"Fast and Efficient Update Handling for Graph H2TAP","authors":"M. Jibril, Hani Al-Sayeh, Alexander Baumstark, K. Sattler","doi":"10.48786/edbt.2023.60","DOIUrl":"https://doi.org/10.48786/edbt.2023.60","url":null,"abstract":"Offloading graph analytics to GPU yields significant performance speedups. In heterogeneous hybrid transactional/analytical graph processing (graph H 2 TAP), where each graph workload type is executed on the most suitable processor, transactions are executed on a CPU-based main graph and analytics are executed on a GPU-optimized graph replica. The problem that arises, as a result, is that updates by transactions on the main graph have to be particularly handled with respect to the graph replica. In this paper, we present a fast and efficient approach to this update handling problem, based on a delta store optimized for graphs. The delta store is a differential graph store that captures the transactional updates, which are later propagated to the graph replica so that analytical queries are executed on the most recently committed version of the graph in accordance with freshness requirements. Our approach ensures consistency be-tween the main graph and the replica. Our evaluation shows the performance advantage of our approach over existing HTAP approaches.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"38 1","pages":"723-736"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75037683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering by synchronization (SynC) is a clustering method that is motivated by the natural phenomenon of synchronization and is based on the Kuramoto model. The idea is to iteratively drag similar objects closer to each other until they have synchronized. SynC has been adapted to solve several well-known data mining tasks such as subspace clustering, hierarchical clustering, and streaming clustering. This shows that the SynC model is very versatile. Sadly, SynC has an O(T × n² × d) complexity, which makes it impractical for larger datasets. E.g., Chen et al. [8] show runtimes of more than 10 hours for just n = 70,000 data points, but improve this to just above one hour by using R-Trees in their method FSynC. Both are still impractical in real-life scenarios. Furthermore, SynC uses a termination criterion that brings no guarantees that the points have synchronized but instead just stops when most points are close to synchronizing. In this paper, our contributions are manifold. We propose a new termination criterion that guarantees that all points have synchronized. To achieve a much-needed reduction in runtime, we propose a strategy to summarize partitions of the data into a grid structure, a GPU-friendly grid structure to support this and neighborhood queries, and a GPU-parallelized algorithm for clustering by synchronization (EGG-SynC) that utilizes these ideas. Furthermore, we provide an extensive evaluation against the state of the art showing 2 to 3 orders of magnitude speedup compared to SynC and FSynC.
{"title":"EGG-SynC: Exact GPU-parallelized Grid-based Clustering by Synchronization","authors":"Jakob Rødsgaard Jørgensen, I. Assent","doi":"10.48786/edbt.2023.16","DOIUrl":"https://doi.org/10.48786/edbt.2023.16","url":null,"abstract":"Clustering by synchronization (SynC) is a clustering method that is motivated by the natural phenomena of synchronization and is based on the Kuramoto model. The idea is to iteratively drag similar objects closer to each other until they have synchronized. SynC has been adapted to solve several well-known data mining tasks such as subspace clustering, hierarchical clustering, and streaming clustering. This shows that the SynC model is very versatile. Sadly, SynC has an 𝑂 ( 𝑇 × 𝑛 2 × 𝑑 ) complexity, which makes it impractical for larger datasets. E.g., Chen et al. [8] show runtimes of more than 10 hours for just 𝑛 = 70 , 000 data points, but improve this to just above one hour by using R-Trees in their method FSynC. Both are still impractical in real-life scenarios. Furthermore, SynC uses a termination criterion that brings no guarantees that the points have synchronized but instead just stops when most points are close to synchronizing. In this paper, our contributions are manifold. We propose a new termination criterion that guarantees that all points have synchronized. To achieve a much-needed reduction in runtime, we propose a strategy to summarize partitions of the data into a grid structure, a GPU-friendly grid structure to support this and neighborhood queries, and a GPU-parallelized algorithm for clustering by synchronization (EGG-SynC) that utilize these ideas. Furthermore, we provide an extensive evaluation against state-of-the-art showing 2 to 3 orders of magnitude speedup compared to SynC and FSynC.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"27 1","pages":"195-207"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75089729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stream processing is widely applied in industry as well as in research to process unbounded data streams. In many use cases, specific data streams are processed by multiple continuous queries. Current systems group events of an unbounded data stream into bounded windows to produce results of individual queries in a timely fashion. For multiple concurrent queries, multiple concurrent and usually overlapping windows are generated. To reduce redundant computations and share partial results, state-of-the-art solutions divide windows into slices and then share the results of those slices. However, this is only applicable for queries with the same aggregation function and window measure, as in the case of overlaps for sliding windows. For multiple queries on the same stream with different aggregation functions and window measures, partial results cannot be shared. Furthermore, data streams are produced from devices that are distributed in large decentralized networks. Current systems cannot process queries on decentralized data streams efficiently. All queries in a decentralized network are either computed centrally or processed individually without exploiting partial results across queries. We present Desis, a stream processing system that can efficiently process multiple stream aggregation queries. We propose an aggregation engine that can share partial results between multiple queries with different window types, measures, and aggregation functions. In decentralized networks, Desis moves computation to data sources and shares overlapping computation as early as possible between queries. Desis outperforms existing solutions by orders of magnitude in throughput when processing multiple queries and can scale to millions of queries. In a decentralized setup, Desis can save up to 99% of network traffic and scale performance linearly.
{"title":"Desis: Efficient Window Aggregation in Decentralized Networks","authors":"W. Yue, Lawrence Benson, T. Rabl","doi":"10.48786/edbt.2023.52","DOIUrl":"https://doi.org/10.48786/edbt.2023.52","url":null,"abstract":"Stream processing is widely applied in industry as well as in research to process unbounded data streams. In many use cases, specific data streams are processed by multiple continuous queries. Current systems group events of an unbounded data stream into bounded windows to produce results of individual queries in a timely fashion. For multiple concurrent queries, multiple concurrent and usually overlapping windows are generated. To reduce redundant computations and share partial results, state-of-the-art solutions divide windows into slices and then share the results of those slices. However, this is only applicable for queries with the same aggregation function and window measure, as in the case of overlaps for sliding windows. For multiple queries on the same stream with different aggregation functions and window measures, partial results cannot be shared. Furthermore, data streams are produced from devices that are distributed in large decentralized networks. Current systems cannot process queries on decentralized data streams efficiently. All queries in a decentralized network are either computed centrally or processed individually without exploiting partial results across queries. We present Desis, a stream processing system that can efficiently process multiple stream aggregation queries. We propose an aggregation engine that can share partial results between multiple queries with different window types, measures, and aggregation functions. In decentralized networks, Desis moves computation to data sources and shares overlapping computation as early as possible between queries. Desis outperforms existing solutions by orders of magnitude in throughput when processing multiple queries and can scale to millions of queries. In a decentralized setup, Desis can save up to 99% of network traffic and scale performance linearly.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"2 1","pages":"618-631"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78974443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The formulation of structured queries in Knowledge Graphs is a challenging task since it presupposes familiarity with the syntax of the query language and the contents of the knowledge graph. To alleviate this problem, for enabling plain users to formulate SPARQL queries, and advanced users to formulate queries with less effort, in this paper we introduce a novel method for “SPARQL by Example”. According to this method the user points to positive/negative entities, the system computes one query that describes these entities, and then the user refines the query interactively by providing positive/negative feedback on entities and suggested constraints. We shall demonstrate SPARQL-QBE, a tool that implements this approach, and we will briefly refer to the results of a task-based evaluation with users, which provided positive evidence about the usability of the approach.
{"title":"Demonstrating Interactive SPARQL Formulation through Positive and Negative Examples and Feedback","authors":"Akritas Akritidis, Yannis Tzitzikas","doi":"10.48786/edbt.2023.71","DOIUrl":"https://doi.org/10.48786/edbt.2023.71","url":null,"abstract":"The formulation of structured queries in Knowledge Graphs is a challenging task since it presupposes familiarity with the syntax of the query language and the contents of the knowledge graph. To alleviate this problem, for enabling plain users to formulate SPARQL queries, and advanced users to formulate queries with less effort, in this paper we introduce a novel method for “SPARQL by Example\". According to this method the user points to positive/negative entities, the system computes one query that describes these entities, and then the user refines the query interactively by providing positive/negative feedback on entities and suggested constraints. We shall demonstrate SPARQL-QBE , a tool that implements this approach, and we will briefly refer to the results of a task-based evaluation with users that provided positive evidence about the usability of the approach.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"49 1","pages":"811-814"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86139878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, we consider using deep learning models over a collection of sets to replace traditional approaches utilized in database systems. We propose solutions for data indexing, membership queries, and cardinality estimation. Unlike relational data, learned models over sets need to be permutation invariant and able to deal with variable set sizes. The proposed models are based on the DeepSets architecture and include per-element compression to achieve acceptable accuracy with modest model sizes. We further suggest a hybrid structure with bounded error guarantees using guided learning to mitigate the inherent challenges when working with set data. We outline challenges and opportunities when dealing with set data and demonstrate the suitability of the models through extensive experimental evaluation with one synthetic and two real-world datasets.
{"title":"Learning over Sets for Databases","authors":"Angjela Davitkova, Damjan Gjurovski, S. Michel","doi":"10.48786/edbt.2024.07","DOIUrl":"https://doi.org/10.48786/edbt.2024.07","url":null,"abstract":"In this work, we consider using deep learning models over a collection of sets to replace traditional approaches utilized in database systems. We propose solutions for data indexing, membership queries, and cardinality estimation. Unlike relational data, learned models over sets need to be permutation invariant and able to deal with variable set sizes. The proposed models are based on the DeepSets architecture and include per-element compression to achieve acceptable accuracy with modest model sizes. We further suggest a hybrid structure with bounded error guarantees using guided learning to mitigate the inherent challenges when working with set data. We outline challenges and opportunities when dealing with set data and demonstrate the suitability of the models through extensive experimental evaluation with one synthetic and two real-world datasets.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"45 1","pages":"68-80"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72821433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data partitioning is the key to parallel query processing in modern analytical database systems. Choosing the right partitioning key for a given dataset is a difficult task and crucial for query performance. Real-world data warehouses contain a large number of tables connected in complex schemas, resulting in an overwhelming number of partition-key candidates. In this paper, we present the approach of patched multi-key partitioning, allowing multiple partition keys to be defined simultaneously without data replication. The key idea is to map the relational table partitioning problem to a graph partitioning problem in order to use existing graph partitioning algorithms to find connected components in the data and to maintain exceptions (patches) to the partitioning separately. We show that patched multi-key partitioning offers opportunities for achieving robust query performance, i.e., reaching reasonably good performance for many queries instead of optimal performance for only a few queries.
{"title":"Patched Multi-Key Partitioning for Robust Query Performance","authors":"Steffen Kläbe, K. Sattler","doi":"10.48786/edbt.2023.26","DOIUrl":"https://doi.org/10.48786/edbt.2023.26","url":null,"abstract":"Data partitioning is the key for parallel query processing in modern analytical database systems. Choosing the right partitioning key for a given dataset is a difficult task and crucial for query performance. Real world data warehouses contain a large amount of tables connected in complex schemes resulting in an over-whelming amount of partition key candidates. In this paper, we present the approach of patched multi-key partitioning, allowing to define multiple partition keys simultaneously without data replication. The key idea is to map the relational table partitioning problem to a graph partition problem in order to use existing graph partitioning algorithms to find connectivity components in the data and maintain exceptions (patches) to the partitioning separately. We show that patched multi-key partitioning offer opportunities for achieving robust query performance, i.e. reaching reasonably good performance for many queries instead of optimal performance for only a few queries.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"9 1","pages":"324-336"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74353789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}