Pub Date: 2022-03-02. DOI: 10.48550/arXiv.2203.00812
Binbin Gu, Saeed Kargar, Faisal Nawab
Clustering aims to group unlabeled objects into clusters based on the similarity inherent among them. It is important for many tasks such as anomaly detection, database sharding, record linkage, and others. Many clustering methods are batch algorithms that incur a high overhead because they cluster all the objects in the database from scratch, or they assume a purely incremental workload. In practice, database objects are continuously updated, added, and removed, which makes previous clustering results stale. Running a batch algorithm in such scenarios is infeasible, as it would incur a significant overhead if performed continuously. This is particularly the case for high-velocity scenarios such as Internet of Things applications. In this paper, we tackle the problem of clustering in high-velocity dynamic scenarios, where objects are continuously updated, inserted, and deleted. Specifically, we propose a generally dynamic approach to clustering that utilizes previous clustering results. Our system, DynamicC, uses a machine learning model that is augmented with an existing batch algorithm. The DynamicC model is trained by observing the clustering decisions made by the batch algorithm. After training, the DynamicC model is used in cooperation with the batch algorithm to achieve both accurate and fast clustering decisions. Experimental results on four real-world datasets and one synthetic dataset show that our approach performs better than the state-of-the-art method while achieving clustering results that are similarly accurate to those of the baseline batch algorithm.
{"title":"Efficient Dynamic Clustering: Capturing Patterns from Historical Cluster Evolution","authors":"Binbin Gu, Saeed Kargar, Faisal Nawab","doi":"10.48550/arXiv.2203.00812","DOIUrl":"https://doi.org/10.48550/arXiv.2203.00812","url":null,"abstract":"Clustering aims to group unlabeled objects based on similarity inherent among them into clusters. It is important for many tasks such as anomaly detection, database sharding, record linkage, and others. Some clustering methods are taken as batch algorithms that incur a high overhead as they cluster all the objects in the database from scratch or assume an incremental workload. In practice, database objects are updated, added, and removed from databases continuously which makes previous results stale. Running batch algorithms is infeasible in such scenarios as it would incur a significant overhead if performed continuously. This is particularly the case for high-velocity scenarios such as ones in Internet of Things applications. In this paper, we tackle the problem of clustering in high-velocity dynamic scenarios, where the objects are continuously updated, inserted, and deleted. Specifically, we propose a generally dynamic approach to clustering that utilizes previous clustering results. Our system, DynamicC, uses a machine learning model that is augmented with an existing batch algorithm. The DynamicC model trains by observing the clustering decisions made by the batch algorithm. After training, the DynamicC model is usedin cooperation with the batch algorithm to achieve both accurate and fast clustering decisions. The experimental results on four real-world and one synthetic datasets show that our approach has a better performance compared to the state-of-the-art method while achieving similarly accurate clustering results to the baseline batch algorithm.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"75 1","pages":"2:351-2:363"},"PeriodicalIF":0.0,"publicationDate":"2022-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85515179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dejun Teng, Yanhui Liang, Furqan Baig, Jun Kong, Vo Hoang, Fusheng Wang
Large-scale three-dimensional spatial data has gained increasing attention with the development of self-driving vehicles, mineral exploration, CAD, and human atlases. Such 3D objects are often represented with a polygonal model at high resolution to preserve accuracy. This poses major challenges for 3D data management and spatial queries due to the massive number of 3D objects, e.g., trillions of 3D cells, and the high complexity of 3D geometric computation. Traditional spatial querying methods in the Filter-Refine paradigm focus mainly on indexing-based filtering using approximations such as minimal bounding boxes and largely neglect the heavy computation in the refinement step at the intra-geometry level, which often dominates the cost of query processing. In this paper, we introduce 3DPro, a system that supports efficient spatial queries for complex 3D objects. 3DPro uses progressive compression of 3D objects that preserves multiple levels of detail, which significantly reduces the size of the objects and allows the data to fit in memory. Through a novel Filter-Progressive-Refine paradigm, 3DPro can return query results early whenever possible to minimize decompression and geometric computation on higher-resolution representations of the 3D objects. Our experiments demonstrate that 3DPro outperforms state-of-the-art 3D data processing techniques by up to an order of magnitude for typical spatial queries.
{"title":"3DPro: Querying Complex Three-Dimensional Data with Progressive Compression and Refinement.","authors":"Dejun Teng, Yanhui Liang, Furqan Baig, Jun Kong, Vo Hoang, Fusheng Wang","doi":"10.48786/edbt.2022.02","DOIUrl":"10.48786/edbt.2022.02","url":null,"abstract":"<p><p>Large-scale three-dimensional spatial data has gained increasing attention with the development of self-driving, mineral exploration, CAD, and human atlases. Such 3D objects are often represented with a polygonal model at high resolution to preserve accuracy. This poses major challenges for 3D data management and spatial queries due to the massive amounts of 3D objects, e.g., trillions of 3D cells, and the high complexity of 3D geometric computation. Traditional spatial querying methods in the Filter-Refine paradigm have a major focus on indexing-based filtering using approximations like minimal bounding boxes and largely neglect the heavy computation in the refinement step at the intra-geometry level, which often dominates the cost of query processing. In this paper, we introduce <i>3DPro</i>, a system that supports efficient spatial queries for complex 3D objects. 3DPro uses progressive compression of 3D objects preserving multiple levels of details, which significantly reduces the size of the objects and has the data fit into memory. Through a novel Filter-Progressive-Refine paradigm, 3DPro can have query results returned early whenever possible to minimize decompression and geometric computations of 3D objects in higher resolution representations. Our experiments demonstrate that 3DPro out-performs the state-of-the-art 3D data processing techniques by up to an order of magnitude for typical spatial queries.</p>","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"25 2","pages":"104-117"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/7e/40/nihms-1827080.PMC9540604.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33501263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional origin-destination (OD) matrices record the count of trips between pairs of start and end locations, and have been used extensively in transportation, traffic planning, etc. More recently, driven by use cases such as COVID-19 pandemic spread modeling, it has become increasingly important to also record intermediate points along an individual's path, rather than only the trip start and end points. This can be achieved by using a multi-dimensional frequency matrix over a data space partitioning at the desired level of granularity. However, serious privacy concerns arise when releasing OD matrix data, especially when adding multiple intermediate points, which makes individual trajectories more distinguishable to an attacker. To address this threat, we propose a technique for privacy-preserving publication of multi-dimensional OD matrices that achieves differential privacy (DP), the de facto standard in private data release. We propose a family of approaches that factor in important data properties such as data density and homogeneity in order to build OD matrices that provide provable protection guarantees while preserving query accuracy. Extensive experiments on real and synthetic datasets show that the proposed approaches clearly outperform the existing state of the art.
{"title":"Differentially-Private Publication of Origin-Destination Matrices with Intermediate Stops","authors":"Sina Shaham, Gabriel Ghinita, C. Shahabi","doi":"10.48786/edbt.2022.04","DOIUrl":"https://doi.org/10.48786/edbt.2022.04","url":null,"abstract":"Conventional origin-destination (OD) matrices record the count of trips between pairs of start and end locations, and have been extensively used in transportation, traffic planning, etc. More recently, due to use case scenarios such as COVID-19 pandemic spread modeling, it is increasingly important to also record intermediate points along an individual's path, rather than only the trip start and end points. This can be achieved by using a multi-dimensional frequency matrix over a data space partitioning at the desired level of granularity. However, serious privacy constraints occur when releasing OD matrix data, and especially when adding multiple intermediate points, which makes individual trajectories more distinguishable to an attacker. To address this threat, we propose a technique for privacy-preserving publication of multi-dimensional OD matrices that achieves differential privacy (DP), the de-facto standard in private data release. We propose a family of approaches that factor in important data properties such as data density and homogeneity in order to build OD matrices that provide provable protection guarantees while preserving query accuracy. Extensive experiments on real and synthetic datasets show that the proposed approaches clearly outperform existing state-of-the-art.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"86 1","pages":"2:131-2:142"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75524222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luigi Bellomarini, Andrea Gentili, Eleonora Laurenza, Emanuel Sallinger
We propose a model-independent design framework for Knowledge Graphs (KGs), capitalizing on our experience in KGs and model management gained during the roll-out of a very large and complex financial KG for the Central Bank of Italy. KGs have recently garnered increasing attention from industry and are currently exploited in a variety of applications. Most common notions of a KG share the presence of an extensional component, typically implemented as a graph database storing the enterprise data, and an intensional component, used to derive new implicit knowledge in the form of new nodes and new edges. Our framework, KGModel, is based on a meta-level approach, where the data engineer designs the extensional and intensional components of the KG (the graph schema and the reasoning rules, respectively) at the meta-level. Then, in a model-driven fashion, this high-level specification is translated into schema definitions and reasoning rules that can be deployed onto the target database systems and state-of-the-art reasoners. Our framework offers a model-independent visual modeling language, a logic-based language for the intensional component, and a set of new complementary software tools for translating meta-level specifications for the target systems. We present the details of KGModel, illustrate the software tools we implemented, and show the suitability of the framework for real-world scenarios.
{"title":"Model-Independent Design of Knowledge Graphs - Lessons Learnt From Complex Financial Graphs","authors":"Luigi Bellomarini, Andrea Gentili, Eleonora Laurenza, Emanuel Sallinger","doi":"10.48786/edbt.2022.46","DOIUrl":"https://doi.org/10.48786/edbt.2022.46","url":null,"abstract":"We propose a model-independent design framework for Knowledge Graphs (KGs), capitalizing on our experience in KGs and model management for the roll out of a very large and complex financial KG for the Central Bank of Italy. KGs have recently garnered increasing attention from industry and are currently exploited in a variety of applications. Many of the common notions of KG share the presence of an extensional component, typically implemented as a graph database storing the enterprise data, and an intensional component, to derive new implicit knowledge in the form of new nodes and new edges. Our framework, KGModel, is based on a meta-level approach, where the data engineer designs the extensional and the intensional components of the KG—the graph schema and the reasoning rules, respectively—at meta-level. Then, in a model-driven fashion, such high-level specification is translated into schema definitions and reasoning rules that can be deployed into the target database systems and state-of-the-art reasoners. Our framework offers a model-independent visual modeling language, a logic-based language for the intensional component, and a set of new complementary software tools for the translation of metalevel specifications for the target systems. We present the details of KGModel, illustrate the software tools we implemented and show the suitability of the framework for real-world scenarios.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"2:524-2:526"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77815555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Antony S. Higginson, Clive Bostock, N. Paton, Suzanne M. Embury
Capacity planning is an essential activity in the procurement and daily running of any multi-server computer system. Workload placement is a well-known problem, and there are several solutions that help address the capacity planning questions of where, when, and how much resource is needed to place workloads of varying shapes (resources consumed). Bin-packing algorithms are used extensively in addressing workload placement problems; however, we propose that extensions to existing bin-packing algorithms are required when dealing with workloads from advanced computational architectures such as clustering and consolidation (pluggable), or workloads that exhibit complex data patterns in their signals, such as seasonality, trend, and/or shocks (exogenous or otherwise). These extensions are especially needed when consolidating workloads together, for example, consolidating multiple databases into one (pluggable databases) to reduce database server sprawl on estates. In this paper we address bin-packing for singular or clustered environments and propose new algorithms that introduce a time element, giving a richer understanding of the resources requested when workloads are consolidated together, ensuring High Availability (HA) for workloads obtained from advanced database configurations. An experimental evaluation shows that the approach we propose reduces the risk of provisioning wastage in pay-as-you-go cloud architectures.
{"title":"Placement of Workloads from Advanced RDBMS Architectures into Complex Cloud Infrastructure","authors":"Antony S. Higginson, Clive Bostock, N. Paton, Suzanne M. Embury","doi":"10.48786/edbt.2022.43","DOIUrl":"https://doi.org/10.48786/edbt.2022.43","url":null,"abstract":"Capacity planning is an essential activity in the procurement and daily running of any multi-server computer system. Workload placement is a well known problem and there are several solutions to help address capacity planning problems of knowing where , when and how much resource is needed to place work-loads of varying shapes (resources consumed). Bin-packing algorithms are used extensively in addressing workload placement problems, however, we propose that extensions to existing bin-packing algorithms are required when dealing with workloads from advanced computational architectures such as clustering and consolidation (pluggable), or workloads that exhibit complex data patterns in their signals , such as seasonality, trend and/or shocks (exogenous or otherwise). These extentions are especially needed when consolidating workloads together, for example, consolidation of multiple databases into one ( pluggable databases ) to reduce database server sprawl on estates. In this paper we address bin-packing for singular or clustered environments and propose new algorithms that introduce a time element, giving a richer understanding of the resources requested when workloads are consolidated together, ensuring High Availability (HA) for workloads obtained from advanced database configurations. An experimental evaluation shows that the approach we propose reduces the risk of provisioning wastage in pay-as-you-go cloud architectures.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"15 1","pages":"2:487-2:497"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82313561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rong Zhu, Ziniu Wu, Chengliang Chai, A. Pfadler, Bolin Ding, Guoliang Li, Jingren Zhou
Applying ML-based techniques to optimize traditional databases, or AI4DB, has become a hot research topic in recent years. Learned techniques for the query optimizer (QO) are at the forefront of AI4DB. The QO provides one of the most suitable testbeds for applying ML techniques, and learned QO has already exhibited superiority with substantial evidence. In this tutorial, we aim to provide a broad and deep review and analysis of learned QO, covering algorithm design, real-world applications, and system deployment. On the algorithm side, we introduce the advances in learning each individual component of the QO, as well as the whole QO module. On the system side, we analyze the challenges of deploying ML-based QO in an actual DBMS, as well as some existing attempts. Based on these, we summarize some design principles and point out several future directions. We hope this tutorial can inspire and guide researchers and engineers working on learned QO, as well as on other AI4DB topics.
{"title":"Learned Query Optimizer: At the Forefront of AI-Driven Databases","authors":"Rong Zhu, Ziniu Wu, Chengliang Chai, A. Pfadler, Bolin Ding, Guoliang Li, Jingren Zhou","doi":"10.48786/edbt.2022.56","DOIUrl":"https://doi.org/10.48786/edbt.2022.56","url":null,"abstract":"Applying ML-based techniques to optimize traditional databases, or AI4DB, has becoming a hot research spot in recent. Learned techniques for query optimizer(QO) is the forefront in AI4DB. QO provides the most suitable experimental plots for utilizing ML techniques and learned QO has exhibited superiority with enough evidence. In this tutorial, we aim at providing a wide and deep review and analysis on learned QO, ranging from algorithm design, real-world applications and system deployment. For algorithm, we would introduce the advances for learning each individual component in QO, as well as the whole QO module. For system, we would analyze the challenges, as well as some attempts, for deploying ML-based QO into actual DBMS. Based on them, we summarize some design principles and point out several future directions. We hope this tutorial could inspire and guide researchers and engineers working on learned QO, as well as other context in AI4DB.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"195 1","pages":"1-4"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80736466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lan Jiang, Gerardo Vitagliano, Mazhar Hameed, Felix Naumann
An aggregation is an arithmetic relationship between a single number and a set of numbers. Tables in raw CSV files often include various types of aggregations to summarize the data therein. Identifying aggregations in tables can help to understand file structures, detect data errors, and normalize tables. However, recognizing aggregations in CSV files is not trivial, as these files often organize information in an ad-hoc manner, with aggregations appearing in arbitrary positions and exhibiting rounding errors. We propose the three-stage approach AggreCol to recognize aggregations of five types: sum, difference, average, division, and relative change. The first stage detects aggregations of each type individually. The second stage uses a set of pruning rules to remove spurious candidates. The last stage employs rules that allow individual detectors to skip specific parts of the file and retrieve more aggregations. We evaluated our approach on two manually annotated datasets, showing that AggreCol achieves 0.95 precision and recall for 91.1% and 86.3% of the files, respectively. We obtained similar results on an unseen test dataset, demonstrating the generalizability of our proposed techniques.
{"title":"Aggregation Detection in CSV Files","authors":"Lan Jiang, Gerardo Vitagliano, Mazhar Hameed, Felix Naumann","doi":"10.48786/edbt.2022.10","DOIUrl":"https://doi.org/10.48786/edbt.2022.10","url":null,"abstract":"Aggregations are an arithmetic relationship between a single number and a set of numbers. Tables in raw CSV files often include various types of aggregations to summarize data therein. Identifying aggregations in tables can help understand file structures, detect data errors, and normalize tables. However, recognizing aggregations in CSV files is not trivial, as these files often organize information in an ad-hoc manner with aggregations appearing in arbitrary positions and displaying rounding errors. We propose the three-stage approach AggreCol to recognize aggregations of five types: sum, difference, average, division, and relative change. The first stage detects aggregations of each type individually. The second stage uses a set of pruning rules to remove spurious candidates. The last stage employs rules to allow individual detectors to skip specific parts of the file and retrieve more aggregations. We evaluated our approach with two manually annotated datasets, showing that AggreCol is capable of achieving 0.95 precision and recall for 91.1% and 86.3% of the files, respectively. We obtained similar results on an unseen test dataset, proving the generalizability of our proposed techniques.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"2:207-2:219"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77850703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Knowledge graphs are at the core of numerous consumer and enterprise applications, where learned graph embeddings are used to derive insights for the users of these applications. Since knowledge graphs can be very large, the process of learning embeddings is time and resource intensive and needs to be done in a distributed manner to leverage the compute resources of multiple machines. Therefore, these applications demand performance and scalability at the development and deployment stages, and require these models to be developed and deployed in frameworks that address these requirements. Ray is an example of such a framework that offers both ease of development and deployment, and enables running tasks in a distributed manner using simple APIs. In this work, we use Ray to build an end-to-end system for data preprocessing and distributed training of graph neural network based knowledge graph embedding models. We apply our system to the link prediction task, i.e., using knowledge graph embeddings to discover links between nodes in graphs. We evaluate our system on a real-world industrial dataset and demonstrate significant speedups of both distributed data preprocessing and distributed model training. Compared to non-distributed learning, we achieved a training speedup of 12× with 4 Ray workers without any deterioration in the evaluation metrics.
{"title":"Distributed Training of Knowledge Graph Embedding Models using Ray","authors":"Nasrullah Sheikh, Xiao Qin, B. Reinwald","doi":"10.48786/edbt.2022.48","DOIUrl":"https://doi.org/10.48786/edbt.2022.48","url":null,"abstract":"Knowledge graphs are at the core of numerous consumer and enterprise applications where learned graph embeddings are used to derive insights for the users of these applications. Since knowledge graphs can be very large, the process of learning embeddings is time and resource intensive and needs to be done in a distributed manner to leverage compute resources of multiple machines. Therefore, these applications demand performance and scalability at the development and deployment stages, and require these models to be developed and deployed in frameworks that address these requirements. Ray 1 is an example of such a framework that offers both ease of development and deployment, and enables running tasks in a distributed manner using simple APIs. In this work, we use Ray to build an end-to-end system for data preprocessing and distributed training of graph neural network based knowledge graph embedding models. We apply our system to link prediction task, i.e. using knowledge graph embedding to discover links between nodes in graphs. We evaluate our system on a real-world industrial dataset and demonstrate significant speedups of both, distributed data preprocessing and distributed model training. Compared to non-distributed learning, we achieved a training speedup of 12 × with 4 Ray workers without any deterioration in the evaluation metrics.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"29 1","pages":"2:549-2:553"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81603949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The variety feature of Big Data, represented by multi-model data, has brought a new dimension of complexity to data management. Processing a set of distinct but interlinked models is a challenging task. In our demonstration, we present our prototype implementation MM-infer, which infers a common schema for multi-model data. It supports popular data models and all three types of their mutual combinations, i.e., inter-model references, embedding of models, and cross-model redundancy. Following current trends, the implementation can efficiently process large amounts of data. To the best of our knowledge, ours is the first tool addressing schema inference in the world of multi-model databases.
{"title":"MM-infer: A Tool for Inference of Multi-Model Schemas","authors":"P. Koupil, Sebastián Hricko, I. Holubová","doi":"10.48786/edbt.2022.52","DOIUrl":"https://doi.org/10.48786/edbt.2022.52","url":null,"abstract":"The variety feature of Big Data, represented by multi-model data, has brought a new dimension of complexity to data management. The need to process a set of distinct but interlinked models is a challenging task. In our demonstration, we present our prototype implementation MM-infer that ensures inference of a common schema of multi-model data. It supports popular data models and all three types of their mutual combinations, i.e., inter-model references, the embedding of models, and cross-model redundancy. Following the current trends, the implementation can efficiently process large amounts of data. To the best of our knowledge, ours is the first tool addressing schema inference in the world of multi-model databases.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"31 1","pages":"2:566-2:569"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82357077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Marathe, S. Lin, Weidong Yu, Kareem El Gebaly, P. Larson, Calvin Sun
The MySQL query optimizer was designed for relatively simple, OLTP-type queries; for more complex queries its limitations quickly become apparent. Join order optimization, for example, considers only left-deep plans, and selects the join order using a greedy algorithm. Instead of continuing to patch the MySQL optimizer, why not delegate optimization of more complex queries to another more capable optimizer? This paper reports on our experience with integrating the Orca optimizer into MySQL. Orca is an extensible open-source query optimizer—originally used by Pivotal’s Greenplum DBMS—specifically designed for demanding analytical workloads. Queries submitted to MySQL are routed to Orca for optimization, and the resulting plans are returned to MySQL for execution. Metadata and statistical information needed during optimization is retrieved from MySQL’s data dictionary. Experimental results show substantial performance gains. On the TPC-DS benchmark, Orca’s plans were over 10X faster on 10 of the 99 queries, and over 100X faster on 3 queries.
{"title":"Integrating the Orca Optimizer into MySQL","authors":"A. Marathe, S. Lin, Weidong Yu, Kareem El Gebaly, P. Larson, Calvin Sun, Huawei, Calvin Sun","doi":"10.48786/edbt.2022.45","DOIUrl":"https://doi.org/10.48786/edbt.2022.45","url":null,"abstract":"The MySQL query optimizer was designed for relatively simple, OLTP-type queries; for more complex queries its limitations quickly become apparent. Join order optimization, for example, considers only left-deep plans, and selects the join order using a greedy algorithm. Instead of continuing to patch the MySQL optimizer, why not delegate optimization of more complex queries to another more capable optimizer? This paper reports on our experience with integrating the Orca optimizer into MySQL. Orca is an extensible open-source query optimizer—originally used by Pivotal’s Greenplum DBMS—specifically designed for demanding analytical workloads. Queries submitted to MySQL are routed to Orca for optimization, and the resulting plans are returned to MySQL for execution. Metadata and statistical information needed during optimization is retrieved from MySQL’s data dictionary. Experimental results show substantial performance gains. On the TPC-DS benchmark, Orca’s plans were over 10X faster on 10 of the 99 queries, and over 100X faster on 3 queries.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"2:511-2:523"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82456275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}