
Proceedings of the 28th International Conference on Scientific and Statistical Database Management — Latest Publications

Demonstrating KDBMS: A Knowledge-based Database Management System
Mohamed E. Khalefa, Sameh S. El-Atawy
We demonstrate KDBMS, a prototype system that seamlessly integrates a knowledge base with a DBMS. In contrast, state-of-the-art approaches, i.e., ontology-based data access (OBDA), use ontologies only to query data stored in relational databases via SPARQL. In this demo, we present a high-level description of the proposed system, introduce a new knowledge-based query language, denoted KQL, and highlight query optimization opportunities that arise from employing knowledge across database layers in query optimization and query processing, while easing the administration of a complex database schema.
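The abstract does not show KQL itself, but it contrasts KDBMS with OBDA, where an ontology is used to answer queries over relational data. A minimal sketch of that OBDA idea, assuming a toy subclass hierarchy and table (all names here are illustrative, not from the paper):

```python
# Toy illustration of ontology-based data access (OBDA), the approach the
# abstract contrasts KDBMS with: a subclass hierarchy is used to expand a
# query over one class into a query over all of its subclasses.
# The ontology, table, and helper names are ours, not the paper's.

SUBCLASS_OF = {  # child class -> parent class
    "Professor": "Employee",
    "Engineer": "Employee",
    "Employee": "Person",
}

def subclasses(cls):
    """Return cls plus all classes that (transitively) specialize it."""
    out = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in out and child not in out:
                out.add(child)
                changed = True
    return out

# Relational data: each row carries the most specific class it belongs to.
PEOPLE = [
    {"name": "Ada", "type": "Professor"},
    {"name": "Bob", "type": "Engineer"},
    {"name": "Cyn", "type": "Person"},
]

def query_instances(cls):
    """Answer 'all instances of cls' by expanding cls over the ontology."""
    wanted = subclasses(cls)
    return [row["name"] for row in PEOPLE if row["type"] in wanted]

print(query_instances("Employee"))  # ['Ada', 'Bob']
```

Querying for `Person` would return all three rows, since every class specializes it.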
DOI: 10.1145/2949689.2949714 (published 2016-07-18)
Citations: 0
SciServer Compute: Bringing Analysis Close to the Data
Dmitry Medvedev, G. Lemson, M. Rippin
SciServer Compute uses Jupyter notebooks running within server-side Docker containers attached to large relational databases and file storage to bring advanced analysis capabilities close to the data. SciServer Compute is a component of SciServer, a big-data infrastructure project developed at Johns Hopkins University that will provide a common environment for computational research. SciServer Compute integrates with large existing databases in the fields of astronomy, cosmology, turbulence, genomics, oceanography and materials science. These are accessible through the CasJobs service for direct SQL queries. SciServer Compute adds interactive server-side computational capabilities through notebooks in Python, R and MATLAB, an API for running asynchronous tasks, and a very large (hundreds of terabytes) scratch space for storing intermediate results. Science-ready results can be stored on a Dropbox-like service, SciDrive, for sharing with collaborators and dissemination to the public. Notebooks and batch jobs run inside Docker containers owned by the users. This provides security and isolation and allows flexible configuration of computational contexts through domain specific images and mounting of domain specific data sets. We present a demo that illustrates the capabilities of SciServer Compute: using Jupyter notebooks, performing analyses on data selections from diverse scientific fields, and running asynchronous jobs in a Docker container. The demo will highlight the data flow between file storage, database, and compute components.
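One component the abstract describes is an API for running asynchronous tasks. As a conceptual stand-in (this is our sketch, not SciServer's actual interface), a minimal submit/poll/result job API can be expressed with a thread pool:

```python
# Minimal in-process stand-in for an asynchronous-task API of the kind the
# abstract describes (submit a job, poll its status, fetch the result).
# This sketch is ours; it does not reproduce SciServer's real API.
import uuid
from concurrent.futures import ThreadPoolExecutor

class AsyncJobAPI:
    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}

    def submit(self, fn, *args):
        """Start a job and return an opaque job id."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def status(self, job_id):
        fut = self._jobs[job_id]
        return "done" if fut.done() else "running"

    def result(self, job_id):
        return self._jobs[job_id].result()  # blocks until the job finishes

api = AsyncJobAPI()
job = api.submit(sum, range(1_000_000))
total = api.result(job)
print(total)   # 499999500000
```

In the real system the job body would be a notebook or SQL batch running in a user-owned Docker container rather than an in-process function.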
DOI: 10.1145/2949689.2949700 (published 2016-07-18)
Citations: 25
Regular Path Queries on Massive Graphs
Maurizio Nolé, C. Sartiani
Regular Path Queries (RPQs) are a powerful tool for querying graph databases and are of particular interest because they form the building blocks of other query languages and can be used in many theoretical and practical contexts for different purposes. In this paper we present a novel system for processing regular path queries on massive data graphs. As confirmed by an extensive experimental evaluation, our system scales linearly with the number of vertices and/or edges, and it can efficiently query graphs of up to a billion vertices and 100 billion edges.
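A standard way to evaluate an RPQ, which the paper's distributed system builds on at far larger scale, is a breadth-first search over the product of the data graph and an automaton for the path expression. A compact single-machine sketch (graph, automaton, and names are ours):

```python
# RPQ evaluation by BFS over the product of a labelled graph and an NFA.
# This is the textbook technique, not the paper's distributed algorithm.
from collections import deque

def eval_rpq(edges, nfa, start_state, accept_states, source):
    """Nodes reachable from `source` via a path whose label sequence the NFA
    accepts. edges: {node: [(label, node), ...]},
    nfa: {(state, label): {state, ...}}."""
    seen = {(source, start_state)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accept_states:
            answers.add(node)
        for label, nxt in edges.get(node, []):
            for nstate in nfa.get((state, label), ()):
                if (nxt, nstate) not in seen:
                    seen.add((nxt, nstate))
                    queue.append((nxt, nstate))
    return answers

# RPQ "a b*": one a-edge followed by any number of b-edges.
graph = {1: [("a", 2)], 2: [("b", 3)], 3: [("b", 4)], 4: [("a", 1)]}
nfa = {(0, "a"): {1}, (1, "b"): {1}}
print(eval_rpq(graph, nfa, 0, {1}, source=1))   # {2, 3, 4}
```

The product construction visits each (node, state) pair at most once, so the search is linear in the product size, which is consistent with the linear scaling the abstract reports.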
DOI: 10.1145/2949689.2949711 (published 2016-07-18)
Citations: 15
Framework for real-time clustering over sliding windows
Sobhan Badiozamany, Kjell Orsborn, T. Risch
Clustering queries over sliding windows require maintaining cluster memberships that change as windows slide. To address this, the Generic 2-phase Continuous Summarization framework (G2CS) uses a generation-based window maintenance approach in which windows are maintained over different time intervals. It provides algorithm-independent, efficient sliding mechanisms for clustering queries in which the clustering algorithms are defined as queries over cluster data represented as temporal tables. A particular challenge for real-time detection of a large number of fast-evolving clusters is efficiently supporting smooth re-clustering in real time, i.e., minimizing the sliding time as window sizes grow and strides shrink. To efficiently support such re-clustering for clustering algorithms that do not support deletion of expired data, e.g. BIRCH, G2CS includes a novel window maintenance mechanism called Sliding Binary Merge (SBM), which maintains several generations of intermediate window instances and does not require decremental cluster maintenance. To improve real-time sliding performance, G2CS uses generation-based multi-dimensional indexing. Extensive performance evaluation on both synthetic and real data shows that G2CS scales substantially better than related approaches.
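The abstract does not spell out SBM's bookkeeping, but the core idea of generation-based maintenance without deletion can be sketched with a mergeable summary. A toy version in the spirit of binary merging (the summary here is a simple count/sum pair; the actual SBM mechanism in G2CS is more involved):

```python
# Toy sketch of generation-based window maintenance: per-stride summaries are
# kept in generations of geometrically growing span and merged pairwise, like
# carries in binary addition, so no per-item deletion is ever needed.
# The details here are illustrative, not the paper's SBM algorithm.

class GenerationBuffer:
    def __init__(self):
        self.generations = []   # list of (stride_span, count, total)

    def add_stride(self, values):
        self.generations.append((1, len(values), sum(values)))
        # Merge neighbouring generations of equal span.
        while (len(self.generations) >= 2
               and self.generations[-1][0] == self.generations[-2][0]):
            s2, c2, t2 = self.generations.pop()
            s1, c1, t1 = self.generations.pop()
            self.generations.append((s1 + s2, c1 + c2, t1 + t2))

    def window_mean(self):
        count = sum(c for _, c, _ in self.generations)
        total = sum(t for _, _, t in self.generations)
        return total / count if count else 0.0

buf = GenerationBuffer()
for stride in ([1, 2], [3, 4], [5, 6], [7, 8]):
    buf.add_stride(stride)
print(len(buf.generations))   # 1: four strides collapsed into one generation
print(buf.window_mean())      # 4.5
```

Because merging only ever combines summaries, this pattern also fits summaries such as BIRCH cluster features, which support insertion and merge but not deletion.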
DOI: 10.1145/2949689.2949696 (published 2016-07-18)
Citations: 9
Efficient Feedback Collection for Pay-as-you-go Source Selection
Julio César Cortés Ríos, N. Paton, A. Fernandes, Khalid Belhajjame
Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Indeed, given these physical sources, it is then also possible to create further virtual sources that integrate, aggregate or summarise the data from the original sources. As a result, there is a plethora of data sources, from which a small subset may be able to provide the information required to support a task. The number and rate of change in the available sources is likely to make manual source selection and curation by experts impractical for many applications, leading to the need to pursue a pay-as-you-go approach, in which crowds or data consumers annotate results based on their correctness or suitability, with the resulting annotations used to inform, e.g., source selection algorithms. However, for pay-as-you-go feedback collection to be cost-effective, it may be necessary to select judiciously the data items on which feedback is to be obtained. This paper describes OLBP (Ordering and Labelling By Precision), a heuristics-based approach to the targeting of data items for feedback to support mapping and source selection tasks, where users express their preferences in terms of the trade-off between precision and recall. The proposed approach is then evaluated on two different scenarios, mapping selection with synthetic data, and source selection with real data produced by web data extraction. The results demonstrate a significant reduction in the amount of feedback required to reach user-provided objectives when using OLBP.
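The abstract names OLBP (Ordering and Labelling By Precision) but not its heuristics. As a rough sketch of the general idea only (the targeting rule below is ours, not the paper's): maintain a precision estimate per candidate mapping from the feedback collected so far, rank mappings by it, and direct the next feedback request where annotations are scarcest:

```python
# Toy sketch of precision-driven feedback targeting: candidate mappings are
# ordered by estimated precision, and the next feedback request goes to the
# least-annotated mapping, where one more label moves the estimate the most.
# The heuristic is illustrative, not OLBP's actual strategy.

class FeedbackCollector:
    def __init__(self, mappings):
        # mapping name -> [correct annotations, total annotations]
        self.stats = {m: [0, 0] for m in mappings}

    def record(self, mapping, is_correct):
        self.stats[mapping][1] += 1
        self.stats[mapping][0] += int(is_correct)

    def precision(self, mapping):
        correct, total = self.stats[mapping]
        return correct / total if total else 0.0

    def ranked(self):
        """Mappings ordered by estimated precision, best first."""
        return sorted(self.stats, key=self.precision, reverse=True)

    def next_target(self):
        """Ask for feedback where we know least: fewest annotations so far."""
        return min(self.stats, key=lambda m: self.stats[m][1])

fb = FeedbackCollector(["m1", "m2", "m3"])
for verdict in (True, True, False):
    fb.record("m1", verdict)
fb.record("m2", True)
print(fb.next_target())   # 'm3' — no annotations yet
print(fb.ranked()[0])     # 'm2' — estimated precision 1.0
```

The point of such targeting is exactly the cost argument in the abstract: fewer labels are needed to rank sources reliably than if feedback were spread uniformly.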
DOI: 10.1145/2949689.2949690 (published 2016-07-18)
Citations: 12
Novel Data Reduction Based on Statistical Similarity
Dongeun Lee, A. Sim, Jaesik Choi, Kesheng Wu
Applications such as scientific simulations and power grid monitoring are generating so much data, so quickly, that compression is essential to reduce storage requirements or transmission capacity. To achieve better compression, one is often willing to discard some repeated information. Existing lossy compression methods are primarily designed to minimize the Euclidean distance between the original data and the compressed data, but this measure of distance severely limits either reconstruction quality or compression performance. We propose a new class of compression methods by redefining the distance measure with a statistical concept known as exchangeability. This approach captures essential features of the data while reducing the storage requirement. In this paper, we report our design and implementation of such a compression method, named IDEALEM. To demonstrate its effectiveness, we apply it to a set of power grid monitoring data and show that it reduces the volume of data far more than the best known compression methods while maintaining the quality of the compressed data. In these tests, IDEALEM captures extraordinary events in the data, while its compression ratios can far exceed 100.
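The mechanism can be illustrated with a toy block-replacement scheme: a new block is stored only if no previously stored block is statistically similar to it, using a two-sample Kolmogorov-Smirnov distance as the similarity test. The threshold, block layout, and coding below are our assumptions, not IDEALEM's actual design:

```python
# Toy sketch of lossy reduction by statistical similarity: a block is replaced
# by a reference to an earlier block whenever a two-sample Kolmogorov-Smirnov
# distance says the two look exchangeable. Illustrative only.

def ks_distance(a, b):
    """Maximum gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def compress(blocks, threshold=0.3):
    """Encode each block as ('raw', data) or ('ref', index of a raw block)."""
    out, raws = [], []
    for block in blocks:
        match = next((k for k, r in enumerate(raws)
                      if ks_distance(block, r) <= threshold), None)
        if match is None:
            raws.append(block)
            out.append(("raw", block))
        else:
            out.append(("ref", match))
    return out

blocks = [[1, 2, 3, 4], [1.1, 2.0, 3.2, 3.9], [50, 60, 70, 80]]
coded = compress(blocks)
print([tag for tag, _ in coded])   # ['raw', 'ref', 'raw']
```

The second block is statistically close to the first and collapses to a reference, while the third, whose distribution is far away, is kept raw; events that look nothing like earlier data survive compression, which matches the abstract's claim that extraordinary events are captured.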
DOI: 10.1145/2949689.2949708 (published 2016-07-18)
Citations: 20
Array Database Scalability: Intercontinental Queries on Petabyte Datasets
A. Dumitru, Vlad Merticariu, P. Baumann
With the deluge of scientific big data affecting a wide variety of research institutions, support for large multidimensional arrays has gained traction in the database community over the past decade. Array databases aim to close the gap left by traditional relational database systems in the domain of large scientific data by enabling researchers to efficiently store and process their data through rich declarative query languages. Such large amounts of data need effective systems that can distribute the processing both at the local level, by exploiting heterogeneous hardware, and at the network level, enabling both intra-cloud and intra-federation distribution of data and processing. In this demonstration we showcase the capabilities of rasdaman by allowing users to execute queries that combine petabyte datasets stored at two institutions on different continents.
DOI: 10.1145/2949689.2949717 (published 2016-07-18)
Citations: 11
Functional Dependencies Unleashed for Scalable Data Exchange
A. Bonifati, Ioana Ileana, Michele Linardi
We address the problem of efficiently evaluating target functional dependencies (fds) in the Data Exchange (DE) process. Target fds naturally occur in many DE scenarios, including those in the Life Sciences in which multiple source relations need to be structured under a constrained target schema. However, despite their wide use, target fd evaluation is still a bottleneck in state-of-the-art DE engines. Systems relying on an all-SQL approach typically do not support target fds unless additional information is provided, while DE engines that do include these dependencies typically pay the price of a significant drop in performance and scalability. In this paper, we present a novel chase-based algorithm that can efficiently handle arbitrary fds on the target. Our approach relies on exploiting the interactions between source-to-target (s-t) tuple-generating dependencies (tgds) and target fds. This allows us to tame the size of the intermediate chase results through a careful ordering of chase steps that interleaves fds and (chosen) tgds. As a direct consequence, we significantly diminish the fd application scope, often a central cause of the dramatic overhead induced by target fds. Moreover, reasoning on dependency interaction leads us to interesting parallelization opportunities, yielding additional scalability gains. We provide a proof-of-concept implementation of our chase-based algorithm and an experimental study gauging its scalability and efficiency. Finally, we empirically compare with the latest DE engines and show that our algorithm outperforms them.
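The basic building block the paper optimizes, a chase step for a target fd, can be sketched in a few lines: tuples agreeing on the left-hand side must agree on the right-hand side, so labelled nulls are unified and a clash of two constants signals an fd violation. This shows only the naive fd-application step, not the paper's interleaving with tgds; null representation and names are ours:

```python
# Naive chase of one functional dependency lhs -> rhs over a relation given
# as a list of dicts. Labelled nulls are strings starting with '_'.
# Illustrative only: the paper's algorithm orders and prunes such steps.

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def chase_fd(rows, lhs, rhs):
    """Apply lhs -> rhs until fixpoint, unifying nulls; raise on violation."""
    changed = True
    while changed:
        changed = False
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                r1, r2 = rows[i], rows[j]
                if all(r1[a] == r2[a] for a in lhs) and r1[rhs] != r2[rhs]:
                    u, v = r1[rhs], r2[rhs]
                    if is_null(u):
                        u, v = v, u          # make v the null to replace
                    elif not is_null(v):
                        raise ValueError("fd violated by two constants")
                    for row in rows:         # substitute v := u everywhere
                        for k, val in row.items():
                            if val == v:
                                row[k] = u
                    changed = True
    return rows

rows = [
    {"ssn": "123", "name": "Ada", "dept": "_d1"},
    {"ssn": "123", "name": "Ada", "dept": "CS"},
]
chase_fd(rows, lhs=["ssn"], rhs="dept")
print(rows[0]["dept"])   # 'CS' — the labelled null _d1 was unified with CS
```

The quadratic tuple comparison here is exactly the kind of cost that explodes on large intermediate chase results, which is why shrinking the fd application scope, as the paper does, matters.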
DOI: 10.1145/2949689.2949698 (published 2016-02-01)
Citations: 20
Proceedings of the 28th International Conference on Scientific and Statistical Database Management
DOI: 10.1145/2949689
Citations: 0