
Latest publications in Advances in database technology : proceedings. International Conference on Extending Database Technology

Efficient Dynamic Clustering: Capturing Patterns from Historical Cluster Evolution
Binbin Gu, Saeed Kargar, Faisal Nawab
Clustering aims to group unlabeled objects into clusters based on the similarity inherent among them. It is important for many tasks such as anomaly detection, database sharding, record linkage, and others. Many clustering methods are batch algorithms that incur a high overhead, as they either cluster all the objects in the database from scratch or assume an incremental workload. In practice, database objects are continuously updated, added, and removed, which makes previous results stale. Running batch algorithms in such scenarios is infeasible, as doing so continuously would incur a significant overhead. This is particularly the case in high-velocity scenarios such as Internet of Things applications. In this paper, we tackle the problem of clustering in high-velocity dynamic scenarios, where objects are continuously updated, inserted, and deleted. Specifically, we propose a generally dynamic approach to clustering that utilizes previous clustering results. Our system, DynamicC, uses a machine learning model that is augmented with an existing batch algorithm. The DynamicC model trains by observing the clustering decisions made by the batch algorithm. After training, the DynamicC model is used in cooperation with the batch algorithm to achieve both accurate and fast clustering decisions. Experimental results on four real-world datasets and one synthetic dataset show that our approach outperforms the state-of-the-art method while achieving clustering results similarly accurate to those of the baseline batch algorithm.
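To make the model-plus-batch cooperation concrete, here is a minimal sketch of the general idea, assuming a nearest-centroid model; it is our own illustration, not the DynamicC implementation. The model trains on the batch algorithm's decisions and, at query time, handles unambiguous assignments itself while deferring borderline cases to the batch algorithm (the batch_fallback callable is a hypothetical stand-in for that call).

```python
import numpy as np

class DynamicCSketch:
    """Toy model (not the authors' code): a nearest-centroid classifier
    that learns from batch clustering decisions."""

    def __init__(self):
        self.centroids = {}   # cluster id -> running-mean centroid
        self.counts = {}      # cluster id -> number of objects observed

    def observe_batch(self, objects, labels):
        """Training phase: record the decisions made by the batch algorithm."""
        for x, c in zip(objects, labels):
            n = self.counts.get(c, 0)
            mu = self.centroids.get(c, np.zeros_like(x, dtype=float))
            self.centroids[c] = (mu * n + x) / (n + 1)   # running mean
            self.counts[c] = n + 1

    def assign(self, x, batch_fallback):
        """Inference phase: fast decision when unambiguous, batch otherwise."""
        if not self.centroids:
            return batch_fallback(x)
        dists = {c: float(np.linalg.norm(x - mu))
                 for c, mu in self.centroids.items()}
        best = min(dists, key=dists.get)
        runner_up = min((d for c, d in dists.items() if c != best),
                        default=float("inf"))
        if dists[best] <= 0.5 * runner_up:   # clear winner: cheap decision
            return best
        return batch_fallback(x)             # borderline: ask the batch algorithm
```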
{"title":"Efficient Dynamic Clustering: Capturing Patterns from Historical Cluster Evolution","authors":"Binbin Gu, Saeed Kargar, Faisal Nawab","doi":"10.48550/arXiv.2203.00812","DOIUrl":"https://doi.org/10.48550/arXiv.2203.00812","url":null,"abstract":"Clustering aims to group unlabeled objects based on similarity inherent among them into clusters. It is important for many tasks such as anomaly detection, database sharding, record linkage, and others. Some clustering methods are taken as batch algorithms that incur a high overhead as they cluster all the objects in the database from scratch or assume an incremental workload. In practice, database objects are updated, added, and removed from databases continuously which makes previous results stale. Running batch algorithms is infeasible in such scenarios as it would incur a significant overhead if performed continuously. This is particularly the case for high-velocity scenarios such as ones in Internet of Things applications. In this paper, we tackle the problem of clustering in high-velocity dynamic scenarios, where the objects are continuously updated, inserted, and deleted. Specifically, we propose a generally dynamic approach to clustering that utilizes previous clustering results. Our system, DynamicC, uses a machine learning model that is augmented with an existing batch algorithm. The DynamicC model trains by observing the clustering decisions made by the batch algorithm. After training, the DynamicC model is usedin cooperation with the batch algorithm to achieve both accurate and fast clustering decisions. The experimental results on four real-world and one synthetic datasets show that our approach has a better performance compared to the state-of-the-art method while achieving similarly accurate clustering results to the baseline batch algorithm.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"75 1","pages":"2:351-2:363"},"PeriodicalIF":0.0,"publicationDate":"2022-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85515179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
3DPro: Querying Complex Three-Dimensional Data with Progressive Compression and Refinement
Dejun Teng, Yanhui Liang, Furqan Baig, Jun Kong, Vo Hoang, Fusheng Wang

Large-scale three-dimensional spatial data has gained increasing attention with the development of self-driving, mineral exploration, CAD, and human atlases. Such 3D objects are often represented with a polygonal model at high resolution to preserve accuracy. This poses major challenges for 3D data management and spatial queries due to the massive number of 3D objects, e.g., trillions of 3D cells, and the high complexity of 3D geometric computation. Traditional spatial querying methods in the Filter-Refine paradigm focus mainly on indexing-based filtering using approximations such as minimal bounding boxes, and largely neglect the heavy computation in the refinement step at the intra-geometry level, which often dominates the cost of query processing. In this paper, we introduce 3DPro, a system that supports efficient spatial queries for complex 3D objects. 3DPro uses progressive compression of 3D objects that preserves multiple levels of detail, which significantly reduces the size of the objects and allows the data to fit in memory. Through a novel Filter-Progressive-Refine paradigm, 3DPro can return query results early whenever possible, minimizing the decompression and geometric computation of 3D objects at higher-resolution representations. Our experiments demonstrate that 3DPro outperforms state-of-the-art 3D data processing techniques by up to an order of magnitude for typical spatial queries.
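The Filter-Progressive-Refine loop can be sketched as follows; this is our reading of the paradigm as described above, not the 3DPro code. The ProgressiveObject type and the three-valued predicate are assumptions for illustration: the predicate returns True or False when the current level of detail suffices to decide, and None to request a finer level.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ProgressiveObject:
    """Hypothetical stand-in for a progressively compressed 3D object:
    levels[i] is the geometry decoded up to level of detail i."""
    levels: List[object]

    def decode(self, level: int):
        return self.levels[level]   # a real system would decompress here

def progressive_refine(candidates: List[ProgressiveObject],
                       predicate: Callable[[object], Optional[bool]]):
    """Filter-Progressive-Refine: decide each candidate at the coarsest
    level of detail that suffices, decompressing finer levels only on demand."""
    results = []
    for obj in candidates:                  # survivors of the MBB filter step
        for level in range(len(obj.levels)):
            verdict = predicate(obj.decode(level))
            if verdict is not None:         # decided early: skip finer levels
                if verdict:
                    results.append(obj)
                break
    return results
```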

{"title":"3DPro: Querying Complex Three-Dimensional Data with Progressive Compression and Refinement.","authors":"Dejun Teng, Yanhui Liang, Furqan Baig, Jun Kong, Vo Hoang, Fusheng Wang","doi":"10.48786/edbt.2022.02","DOIUrl":"10.48786/edbt.2022.02","url":null,"abstract":"<p><p>Large-scale three-dimensional spatial data has gained increasing attention with the development of self-driving, mineral exploration, CAD, and human atlases. Such 3D objects are often represented with a polygonal model at high resolution to preserve accuracy. This poses major challenges for 3D data management and spatial queries due to the massive amounts of 3D objects, e.g., trillions of 3D cells, and the high complexity of 3D geometric computation. Traditional spatial querying methods in the Filter-Refine paradigm have a major focus on indexing-based filtering using approximations like minimal bounding boxes and largely neglect the heavy computation in the refinement step at the intra-geometry level, which often dominates the cost of query processing. In this paper, we introduce <i>3DPro</i>, a system that supports efficient spatial queries for complex 3D objects. 3DPro uses progressive compression of 3D objects preserving multiple levels of details, which significantly reduces the size of the objects and has the data fit into memory. Through a novel Filter-Progressive-Refine paradigm, 3DPro can have query results returned early whenever possible to minimize decompression and geometric computations of 3D objects in higher resolution representations. Our experiments demonstrate that 3DPro out-performs the state-of-the-art 3D data processing techniques by up to an order of magnitude for typical spatial queries.</p>","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"25 2","pages":"104-117"},"PeriodicalIF":0.0,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/7e/40/nihms-1827080.PMC9540604.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33501263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Differentially-Private Publication of Origin-Destination Matrices with Intermediate Stops
Sina Shaham, Gabriel Ghinita, C. Shahabi
Conventional origin-destination (OD) matrices record the count of trips between pairs of start and end locations, and have been used extensively in transportation, traffic planning, etc. More recently, due to use cases such as COVID-19 pandemic spread modeling, it is increasingly important to also record intermediate points along an individual's path, rather than only the trip start and end points. This can be achieved by using a multi-dimensional frequency matrix over a data-space partitioning at the desired level of granularity. However, serious privacy constraints arise when releasing OD matrix data, especially when adding multiple intermediate points, which makes individual trajectories more distinguishable to an attacker. To address this threat, we propose a technique for privacy-preserving publication of multi-dimensional OD matrices that achieves differential privacy (DP), the de facto standard in private data release. We propose a family of approaches that factor in important data properties such as density and homogeneity in order to build OD matrices that provide provable protection guarantees while preserving query accuracy. Extensive experiments on real and synthetic datasets show that the proposed approaches clearly outperform the existing state of the art.
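As background for what achieving DP means for a frequency matrix, the sketch below shows the textbook Laplace mechanism applied uniformly to every cell of a multi-dimensional OD matrix; this is the naive baseline that density-aware approaches like the paper's improve upon, not the proposed algorithms. The sensitivity value assumes each individual contributes at most one trip to one cell.

```python
import numpy as np

def dp_od_matrix(od_counts: np.ndarray, epsilon: float,
                 sensitivity: float = 1.0, rng=None) -> np.ndarray:
    """Release an OD frequency matrix under epsilon-DP via the Laplace
    mechanism. Rounding and clipping are post-processing, so DP is preserved."""
    rng = rng or np.random.default_rng()
    noise = rng.laplace(0.0, sensitivity / epsilon, size=od_counts.shape)
    return np.clip(np.rint(od_counts + noise), 0, None)

# Example: 4x4 origin/destination cells with a third dimension of size 3
# for an intermediate stop.
counts = np.random.default_rng(0).integers(0, 50, size=(4, 4, 3))
private = dp_od_matrix(counts, epsilon=0.5)
```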
{"title":"Differentially-Private Publication of Origin-Destination Matrices with Intermediate Stops","authors":"Sina Shaham, Gabriel Ghinita, C. Shahabi","doi":"10.48786/edbt.2022.04","DOIUrl":"https://doi.org/10.48786/edbt.2022.04","url":null,"abstract":"Conventional origin-destination (OD) matrices record the count of trips between pairs of start and end locations, and have been extensively used in transportation, traffic planning, etc. More recently, due to use case scenarios such as COVID-19 pandemic spread modeling, it is increasingly important to also record intermediate points along an individual's path, rather than only the trip start and end points. This can be achieved by using a multi-dimensional frequency matrix over a data space partitioning at the desired level of granularity. However, serious privacy constraints occur when releasing OD matrix data, and especially when adding multiple intermediate points, which makes individual trajectories more distinguishable to an attacker. To address this threat, we propose a technique for privacy-preserving publication of multi-dimensional OD matrices that achieves differential privacy (DP), the de-facto standard in private data release. We propose a family of approaches that factor in important data properties such as data density and homogeneity in order to build OD matrices that provide provable protection guarantees while preserving query accuracy. Extensive experiments on real and synthetic datasets show that the proposed approaches clearly outperform existing state-of-the-art.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"86 1","pages":"2:131-2:142"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75524222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Model-Independent Design of Knowledge Graphs - Lessons Learnt From Complex Financial Graphs
Luigi Bellomarini, Andrea Gentili, Eleonora Laurenza, Emanuel Sallinger
We propose a model-independent design framework for Knowledge Graphs (KGs), capitalizing on our experience in KGs and model management for the rollout of a very large and complex financial KG for the Central Bank of Italy. KGs have recently garnered increasing attention from industry and are currently exploited in a variety of applications. Most common notions of a KG share the presence of an extensional component, typically implemented as a graph database storing the enterprise data, and an intensional component that derives new implicit knowledge in the form of new nodes and new edges. Our framework, KGModel, is based on a meta-level approach, where the data engineer designs the extensional and the intensional components of the KG—the graph schema and the reasoning rules, respectively—at the meta-level. Then, in a model-driven fashion, this high-level specification is translated into schema definitions and reasoning rules that can be deployed into the target database systems and state-of-the-art reasoners. Our framework offers a model-independent visual modeling language, a logic-based language for the intensional component, and a set of new complementary software tools for the translation of meta-level specifications for the target systems. We present the details of KGModel, illustrate the software tools we implemented, and show the suitability of the framework for real-world scenarios.
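A toy example of the meta-level idea, entirely of our own invention (KGModel's actual modeling and logic-based languages are richer): a single meta-level description is translated into both an extensional artifact (property-graph creation templates in Cypher) and an intensional one (Datalog-style reasoning rules).

```python
# Hypothetical meta-level description of a fragment of a financial KG.
meta_schema = {
    "nodes": {"Company": ["name"], "Person": ["name"]},
    "edges": {"OWNS": ("Person", "Company", ["share"])},
    "rules": [
        # derived edge: control via majority ownership
        ("CONTROLS(x, y)", ["OWNS(x, y, s)", "s > 0.5"]),
    ],
}

def to_cypher(schema):
    """Emit node/edge creation templates for a property-graph backend."""
    stmts = []
    for label, props in schema["nodes"].items():
        plist = ", ".join(f"{p}: ${p}" for p in props)
        stmts.append(f"CREATE (:{label} {{{plist}}})")
    for rel, (src, dst, props) in schema["edges"].items():
        plist = ", ".join(f"{p}: ${p}" for p in props)
        stmts.append(
            f"MATCH (a:{src}), (b:{dst}) CREATE (a)-[:{rel} {{{plist}}}]->(b)")
    return stmts

def to_datalog(schema):
    """Emit the intensional component as Datalog-style rules."""
    return [f"{head} :- {', '.join(body)}." for head, body in schema["rules"]]

print(to_cypher(meta_schema))
print(to_datalog(meta_schema))   # ['CONTROLS(x, y) :- OWNS(x, y, s), s > 0.5.']
```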
{"title":"Model-Independent Design of Knowledge Graphs - Lessons Learnt From Complex Financial Graphs","authors":"Luigi Bellomarini, Andrea Gentili, Eleonora Laurenza, Emanuel Sallinger","doi":"10.48786/edbt.2022.46","DOIUrl":"https://doi.org/10.48786/edbt.2022.46","url":null,"abstract":"We propose a model-independent design framework for Knowledge Graphs (KGs), capitalizing on our experience in KGs and model management for the roll out of a very large and complex financial KG for the Central Bank of Italy. KGs have recently garnered increasing attention from industry and are currently exploited in a variety of applications. Many of the common notions of KG share the presence of an extensional component, typically implemented as a graph database storing the enterprise data, and an intensional component, to derive new implicit knowledge in the form of new nodes and new edges. Our framework, KGModel, is based on a meta-level approach, where the data engineer designs the extensional and the intensional components of the KG—the graph schema and the reasoning rules, respectively—at meta-level. Then, in a model-driven fashion, such high-level specification is translated into schema definitions and reasoning rules that can be deployed into the target database systems and state-of-the-art reasoners. Our framework offers a model-independent visual modeling language, a logic-based language for the intensional component, and a set of new complementary software tools for the translation of metalevel specifications for the target systems. We present the details of KGModel, illustrate the software tools we implemented and show the suitability of the framework for real-world scenarios.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"2:524-2:526"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77815555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Placement of Workloads from Advanced RDBMS Architectures into Complex Cloud Infrastructure
Antony S. Higginson, Clive Bostock, N. Paton, Suzanne M. Embury
Capacity planning is an essential activity in the procurement and daily running of any multi-server computer system. Workload placement is a well-known problem, and there are several solutions to help address the capacity-planning questions of where, when, and how much resource is needed to place workloads of varying shapes (resources consumed). Bin-packing algorithms are used extensively in addressing workload placement problems. However, we propose that extensions to existing bin-packing algorithms are required when dealing with workloads from advanced computational architectures such as clustering and consolidation (pluggable), or workloads that exhibit complex data patterns in their signals, such as seasonality, trend, and/or shocks (exogenous or otherwise). These extensions are especially needed when consolidating workloads together, for example, consolidating multiple databases into one (pluggable databases) to reduce database server sprawl on estates. In this paper we address bin-packing for singular or clustered environments and propose new algorithms that introduce a time element, giving a richer understanding of the resources requested when workloads are consolidated together and ensuring High Availability (HA) for workloads obtained from advanced database configurations. An experimental evaluation shows that the approach we propose reduces the risk of provisioning wastage in pay-as-you-go cloud architectures.
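A minimal sketch of what introducing a time element can look like, assuming workloads are given as per-time-slot resource profiles; this is our illustration of the concept, not the paper's algorithms. A server accepts a workload only if the combined profile stays under capacity in every slot, so workloads with complementary peaks (e.g., a daytime OLTP load and a nightly batch job) can safely share a server.

```python
import numpy as np

def first_fit_time_aware(workloads, capacity):
    """Toy time-aware first-fit-decreasing placement. Each workload is a
    resource profile over time slots (e.g. hourly CPU demand); a server
    accepts it only if the summed profile fits in every slot."""
    bins, assignment = [], {}
    # Place "large" workloads first (decreasing by peak demand).
    order = sorted(range(len(workloads)),
                   key=lambda i: workloads[i].max(), reverse=True)
    for i in order:
        w = workloads[i]
        for b, load in enumerate(bins):
            if np.all(load + w <= capacity):   # fits in every time slot
                bins[b] = load + w
                assignment[i] = b
                break
        else:                                  # no server fits: open a new one
            bins.append(w.astype(float).copy())
            assignment[i] = len(bins) - 1
    return assignment, bins

# Two workloads peaking at different hours share one server of capacity 10.
day, night = np.array([8, 2, 1]), np.array([1, 2, 8])
print(first_fit_time_aware([day, night], capacity=10))
```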
{"title":"Placement of Workloads from Advanced RDBMS Architectures into Complex Cloud Infrastructure","authors":"Antony S. Higginson, Clive Bostock, N. Paton, Suzanne M. Embury","doi":"10.48786/edbt.2022.43","DOIUrl":"https://doi.org/10.48786/edbt.2022.43","url":null,"abstract":"Capacity planning is an essential activity in the procurement and daily running of any multi-server computer system. Workload placement is a well known problem and there are several solutions to help address capacity planning problems of knowing where , when and how much resource is needed to place work-loads of varying shapes (resources consumed). Bin-packing algorithms are used extensively in addressing workload placement problems, however, we propose that extensions to existing bin-packing algorithms are required when dealing with workloads from advanced computational architectures such as clustering and consolidation (pluggable), or workloads that exhibit complex data patterns in their signals , such as seasonality, trend and/or shocks (exogenous or otherwise). These extentions are especially needed when consolidating workloads together, for example, consolidation of multiple databases into one ( pluggable databases ) to reduce database server sprawl on estates. In this paper we address bin-packing for singular or clustered environments and propose new algorithms that introduce a time element, giving a richer understanding of the resources requested when workloads are consolidated together, ensuring High Availability (HA) for workloads obtained from advanced database configurations. An experimental evaluation shows that the approach we propose reduces the risk of provisioning wastage in pay-as-you-go cloud architectures.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"15 1","pages":"2:487-2:497"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82313561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Learned Query Optimizer: At the Forefront of AI-Driven Databases
Rong Zhu, Ziniu Wu, Chengliang Chai, A. Pfadler, Bolin Ding, Guoliang Li, Jingren Zhou
Applying ML-based techniques to optimize traditional databases, or AI4DB, has become a hot research topic in recent years. Learned techniques for the query optimizer (QO) are at the forefront of AI4DB. The QO provides one of the most suitable testbeds for utilizing ML techniques, and learned QO has exhibited its superiority with ample evidence. In this tutorial, we aim to provide a wide and deep review and analysis of learned QO, ranging from algorithm design to real-world applications and system deployment. For algorithms, we introduce the advances in learning each individual component of the QO, as well as the whole QO module. For systems, we analyze the challenges, as well as some attempts, in deploying ML-based QO into actual DBMSs. Based on these, we summarize some design principles and point out several future directions. We hope this tutorial can inspire and guide researchers and engineers working on learned QO, as well as on other topics in AI4DB.
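As a flavor of what learning an individual QO component means, the sketch below fits a toy cardinality estimator on synthetic query features; it is our own minimal example, not taken from the tutorial. Real learned estimators use far richer featurizations and models.

```python
import numpy as np

# Toy learned cardinality estimator: regress log-cardinality on simple
# query features (here, per-predicate selectivities) from observed runs.
rng = np.random.default_rng(1)
n = 200
selectivities = rng.uniform(0, 1, size=(n, 3))   # synthetic training queries
log_card = selectivities @ np.array([2.0, 1.0, 3.0]) + rng.normal(0, 0.1, n)

X = np.hstack([selectivities, np.ones((n, 1))])  # add a bias column
weights, *_ = np.linalg.lstsq(X, log_card, rcond=None)

def estimate_cardinality(sel):
    """Predict a row count from predicate selectivities."""
    x = np.append(np.asarray(sel, dtype=float), 1.0)
    return float(np.exp(x @ weights))            # back to row counts

print(estimate_cardinality([0.5, 0.1, 0.9]))
```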
{"title":"Learned Query Optimizer: At the Forefront of AI-Driven Databases","authors":"Rong Zhu, Ziniu Wu, Chengliang Chai, A. Pfadler, Bolin Ding, Guoliang Li, Jingren Zhou","doi":"10.48786/edbt.2022.56","DOIUrl":"https://doi.org/10.48786/edbt.2022.56","url":null,"abstract":"Applying ML-based techniques to optimize traditional databases, or AI4DB, has becoming a hot research spot in recent. Learned techniques for query optimizer(QO) is the forefront in AI4DB. QO provides the most suitable experimental plots for utilizing ML techniques and learned QO has exhibited superiority with enough evidence. In this tutorial, we aim at providing a wide and deep review and analysis on learned QO, ranging from algorithm design, real-world applications and system deployment. For algorithm, we would introduce the advances for learning each individual component in QO, as well as the whole QO module. For system, we would analyze the challenges, as well as some attempts, for deploying ML-based QO into actual DBMS. Based on them, we summarize some design principles and point out several future directions. We hope this tutorial could inspire and guide researchers and engineers working on learned QO, as well as other context in AI4DB.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"195 1","pages":"1-4"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80736466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Aggregation Detection in CSV Files
Lan Jiang, Gerardo Vitagliano, Mazhar Hameed, Felix Naumann
An aggregation is an arithmetic relationship between a single number and a set of numbers. Tables in raw CSV files often include various types of aggregations to summarize the data therein. Identifying aggregations in tables can help understand file structures, detect data errors, and normalize tables. However, recognizing aggregations in CSV files is not trivial, as these files often organize information in an ad-hoc manner, with aggregations appearing in arbitrary positions and exhibiting rounding errors. We propose the three-stage approach AggreCol to recognize aggregations of five types: sum, difference, average, division, and relative change. The first stage detects aggregations of each type individually. The second stage uses a set of pruning rules to remove spurious candidates. The last stage employs rules that allow individual detectors to skip specific parts of the file and retrieve more aggregations. We evaluated our approach on two manually annotated datasets, showing that AggreCol achieves 0.95 precision and recall for 91.1% and 86.3% of the files, respectively. We obtained similar results on an unseen test dataset, proving the generalizability of our proposed techniques.
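A much-simplified version of the first stage for the sum type can be sketched as follows; this is our illustration of the detection idea (column direction only), not the AggreCol implementation. Note that the demo also produces a spurious hit (30 = 10 + 20), which is exactly the kind of candidate the second-stage pruning rules are there to remove.

```python
def find_sum_aggregations(rows, tol=0.5, min_terms=2):
    """Flag cells that equal the sum of a contiguous block of numeric
    cells directly above them in the same column, up to tolerance `tol`."""
    hits = []
    ncols = max(len(r) for r in rows)
    for col in range(ncols):
        cells = []                            # (row index, numeric value)
        for r, row in enumerate(rows):
            try:
                cells.append((r, float(row[col])))
            except (ValueError, IndexError):
                continue                      # skip non-numeric/missing cells
        for k, (r, v) in enumerate(cells):
            for start in range(k - min_terms, -1, -1):
                block = [x for _, x in cells[start:k]]
                if abs(sum(block) - v) <= tol:    # tolerate rounding errors
                    hits.append((r, col, [row for row, _ in cells[start:k]]))
    return hits

table = [["Q1", "10"], ["Q2", "20"], ["Q3", "30"], ["Total", "60.0"]]
print(find_sum_aggregations(table))
# [(2, 1, [0, 1]), (3, 1, [0, 1, 2])] -- the first hit is spurious
```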
{"title":"Aggregation Detection in CSV Files","authors":"Lan Jiang, Gerardo Vitagliano, Mazhar Hameed, Felix Naumann","doi":"10.48786/edbt.2022.10","DOIUrl":"https://doi.org/10.48786/edbt.2022.10","url":null,"abstract":"Aggregations are an arithmetic relationship between a single number and a set of numbers. Tables in raw CSV files often include various types of aggregations to summarize data therein. Identifying aggregations in tables can help understand file structures, detect data errors, and normalize tables. However, recognizing aggregations in CSV files is not trivial, as these files often organize information in an ad-hoc manner with aggregations appearing in arbitrary positions and displaying rounding errors. We propose the three-stage approach AggreCol to recognize aggregations of five types: sum, difference, average, division, and relative change. The first stage detects aggregations of each type individually. The second stage uses a set of pruning rules to remove spurious candidates. The last stage employs rules to allow individual detectors to skip specific parts of the file and retrieve more aggregations. We evaluated our approach with two manually annotated datasets, showing that AggreCol is capable of achieving 0.95 precision and recall for 91.1% and 86.3% of the files, respectively. We obtained similar results on an unseen test dataset, proving the generalizability of our proposed techniques.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"2:207-2:219"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77850703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Distributed Training of Knowledge Graph Embedding Models using Ray
Nasrullah Sheikh, Xiao Qin, B. Reinwald
Knowledge graphs are at the core of numerous consumer and enterprise applications where learned graph embeddings are used to derive insights for the users of these applications. Since knowledge graphs can be very large, the process of learning embeddings is time- and resource-intensive and needs to be done in a distributed manner to leverage the compute resources of multiple machines. Therefore, these applications demand performance and scalability at the development and deployment stages, and require these models to be developed and deployed in frameworks that address these requirements. Ray is an example of such a framework that offers both ease of development and deployment, and enables running tasks in a distributed manner using simple APIs. In this work, we use Ray to build an end-to-end system for data preprocessing and distributed training of graph neural network based knowledge graph embedding models. We apply our system to the link prediction task, i.e., using knowledge graph embeddings to discover links between nodes in graphs. We evaluate our system on a real-world industrial dataset and demonstrate significant speedups of both distributed data preprocessing and distributed model training. Compared to non-distributed learning, we achieved a training speedup of 12× with 4 Ray workers without any deterioration in the evaluation metrics.
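To illustrate the style of distributed training Ray enables, here is a self-contained sketch of synchronous data-parallel training of a TransE-like embedding model with Ray actors; it is a generic example of Ray's actor API, not the paper's system, and it trains on positive triples only (a real trainer adds negative sampling, minibatching, and evaluation).

```python
import numpy as np
import ray

ray.init()

@ray.remote
class EmbeddingWorker:
    """Actor holding one shard of (head, relation, tail) triples; computes
    gradients of a TransE-like loss ||E[h] + R[r] - E[t]||^2 over its shard."""

    def __init__(self, triples, n_entities, n_relations, dim=16):
        self.triples = triples
        self.E = np.zeros((n_entities, dim))
        self.R = np.zeros((n_relations, dim))

    def set_params(self, E, R):
        self.E, self.R = E, R

    def gradients(self):
        gE, gR = np.zeros_like(self.E), np.zeros_like(self.R)
        for h, r, t in self.triples:
            diff = self.E[h] + self.R[r] - self.E[t]
            gE[h] += 2 * diff          # d loss / d E[h]
            gR[r] += 2 * diff          # d loss / d R[r]
            gE[t] -= 2 * diff          # d loss / d E[t]
        return gE, gR

# Driver: shard the triples, then run synchronous data-parallel SGD.
triples = [(0, 0, 1), (1, 0, 2), (2, 1, 0), (3, 1, 1)]
workers = [EmbeddingWorker.remote(shard, n_entities=4, n_relations=2)
           for shard in (triples[:2], triples[2:])]

rng = np.random.default_rng(42)
E, R = rng.normal(0, 0.1, (4, 16)), rng.normal(0, 0.1, (2, 16))
for step in range(20):
    ray.get([w.set_params.remote(E, R) for w in workers])
    grads = ray.get([w.gradients.remote() for w in workers])
    E -= 0.05 * sum(g for g, _ in grads)      # aggregate and apply
    R -= 0.05 * sum(g for _, g in grads)
```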
{"title":"Distributed Training of Knowledge Graph Embedding Models using Ray","authors":"Nasrullah Sheikh, Xiao Qin, B. Reinwald","doi":"10.48786/edbt.2022.48","DOIUrl":"https://doi.org/10.48786/edbt.2022.48","url":null,"abstract":"Knowledge graphs are at the core of numerous consumer and enterprise applications where learned graph embeddings are used to derive insights for the users of these applications. Since knowledge graphs can be very large, the process of learning embeddings is time and resource intensive and needs to be done in a distributed manner to leverage compute resources of multiple machines. Therefore, these applications demand performance and scalability at the development and deployment stages, and require these models to be developed and deployed in frameworks that address these requirements. Ray 1 is an example of such a framework that offers both ease of development and deployment, and enables running tasks in a distributed manner using simple APIs. In this work, we use Ray to build an end-to-end system for data preprocessing and distributed training of graph neural network based knowledge graph embedding models. We apply our system to link prediction task, i.e. using knowledge graph embedding to discover links between nodes in graphs. We evaluate our system on a real-world industrial dataset and demonstrate significant speedups of both, distributed data preprocessing and distributed model training. Compared to non-distributed learning, we achieved a training speedup of 12 × with 4 Ray workers without any deterioration in the evaluation metrics.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"29 1","pages":"2:549-2:553"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81603949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MM-infer: A Tool for Inference of Multi-Model Schemas
P. Koupil, Sebastián Hricko, I. Holubová
The variety feature of Big Data, represented by multi-model data, has brought a new dimension of complexity to data management. The need to process a set of distinct but interlinked models is a challenging task. In our demonstration, we present our prototype implementation MM-infer, which infers a common schema for multi-model data. It supports popular data models and all three types of their mutual combinations, i.e., inter-model references, embedding of models, and cross-model redundancy. Following current trends, the implementation can efficiently process large amounts of data. To the best of our knowledge, ours is the first tool addressing schema inference in the world of multi-model databases.
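For intuition, schema inference for a single document collection can be as simple as the union of observed fields; the sketch below is a naive single-model toy of our own, whereas MM-infer additionally unifies several models and their inter-model references, embeddings, and redundancy.

```python
def infer_schema(docs):
    """Naive record-schema inference for one JSON-like collection:
    union the fields seen across documents, track their value types,
    and mark a field optional when some document lacks it."""
    fields = {}
    for doc in docs:
        for key, value in doc.items():
            entry = fields.setdefault(key, {"types": set(), "count": 0})
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
    return {k: {"types": sorted(v["types"]),
                "optional": v["count"] < len(docs)}
            for k, v in fields.items()}

docs = [{"_id": 1, "name": "Ada", "tags": ["db"]},
        {"_id": 2, "name": "Max", "age": 31}]
print(infer_schema(docs))
# {'_id': {...}, 'name': {...}, 'tags': {'optional': True, ...}, ...}
```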
{"title":"MM-infer: A Tool for Inference of Multi-Model Schemas","authors":"P. Koupil, Sebastián Hricko, I. Holubová","doi":"10.48786/edbt.2022.52","DOIUrl":"https://doi.org/10.48786/edbt.2022.52","url":null,"abstract":"The variety feature of Big Data, represented by multi-model data, has brought a new dimension of complexity to data management. The need to process a set of distinct but interlinked models is a challenging task. In our demonstration, we present our prototype implementation MM-infer that ensures inference of a common schema of multi-model data. It supports popular data models and all three types of their mutual combinations, i.e., inter-model references, the embedding of models, and cross-model redundancy. Following the current trends, the implementation can efficiently process large amounts of data. To the best of our knowledge, ours is the first tool addressing schema inference in the world of multi-model databases.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"31 1","pages":"2:566-2:569"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82357077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Integrating the Orca Optimizer into MySQL
A. Marathe, S. Lin, Weidong Yu, Kareem El Gebaly, P. Larson, Calvin Sun
The MySQL query optimizer was designed for relatively simple, OLTP-type queries; for more complex queries, its limitations quickly become apparent. Join order optimization, for example, considers only left-deep plans and selects the join order using a greedy algorithm. Instead of continuing to patch the MySQL optimizer, why not delegate optimization of more complex queries to another, more capable optimizer? This paper reports on our experience with integrating the Orca optimizer into MySQL. Orca is an extensible open-source query optimizer—originally used by Pivotal's Greenplum DBMS—specifically designed for demanding analytical workloads. Queries submitted to MySQL are routed to Orca for optimization, and the resulting plans are returned to MySQL for execution. Metadata and statistical information needed during optimization are retrieved from MySQL's data dictionary. Experimental results show substantial performance gains. On the TPC-DS benchmark, Orca's plans were over 10X faster on 10 of the 99 queries, and over 100X faster on 3 queries.
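The abstract does not say how queries are chosen for routing, so the snippet below is a purely hypothetical heuristic of our own: treat a statement as complex (and thus worth sending to the external optimizer) when it has several joins, grouping, or subqueries, and keep short OLTP statements on the native MySQL path.

```python
import re

def route_to_orca(sql: str, join_threshold: int = 2) -> bool:
    """Hypothetical routing rule, for illustration only: send analytical-
    looking statements to the external optimizer, keep the rest native."""
    s = sql.lower()
    n_joins = len(re.findall(r"\bjoin\b", s))
    has_grouping = bool(re.search(r"\bgroup\s+by\b", s))
    has_subquery = s.count("select") > 1
    return n_joins >= join_threshold or has_grouping or has_subquery

print(route_to_orca("SELECT * FROM t WHERE id = 1"))      # False -> MySQL
print(route_to_orca("SELECT a.x, SUM(b.y) FROM a JOIN b ON a.k = b.k "
                    "JOIN c ON b.k = c.k GROUP BY a.x"))  # True  -> Orca
```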
{"title":"Integrating the Orca Optimizer into MySQL","authors":"A. Marathe, S. Lin, Weidong Yu, Kareem El Gebaly, P. Larson, Calvin Sun, Huawei, Calvin Sun","doi":"10.48786/edbt.2022.45","DOIUrl":"https://doi.org/10.48786/edbt.2022.45","url":null,"abstract":"The MySQL query optimizer was designed for relatively simple, OLTP-type queries; for more complex queries its limitations quickly become apparent. Join order optimization, for example, considers only left-deep plans, and selects the join order using a greedy algorithm. Instead of continuing to patch the MySQL optimizer, why not delegate optimization of more complex queries to another more capable optimizer? This paper reports on our experience with integrating the Orca optimizer into MySQL. Orca is an extensible open-source query optimizer—originally used by Pivotal’s Greenplum DBMS—specifically designed for demanding analytical workloads. Queries submitted to MySQL are routed to Orca for optimization, and the resulting plans are returned to MySQL for execution. Metadata and statistical information needed during optimization is retrieved from MySQL’s data dictionary. Experimental results show substantial performance gains. On the TPC-DS benchmark, Orca’s plans were over 10X faster on 10 of the 99 queries, and over 100X faster on 3 queries.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"1 1","pages":"2:511-2:523"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82456275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4