
ACM Transactions on Database Systems (TODS): Latest Publications

Exact and Approximate Maximum Inner Product Search with LEMP
Pub Date: 2016-12-03, DOI: 10.1145/2996452
Christina Teflioudi, Rainer Gemulla
We study exact and approximate methods for maximum inner product search, a fundamental problem in a number of data mining and information retrieval tasks. We propose the LEMP framework, which supports both exact and approximate search with quality guarantees. At its heart, LEMP transforms a maximum inner product search problem over a large database of vectors into a number of smaller cosine similarity search problems. This transformation allows LEMP to prune large parts of the search space immediately and to select suitable search algorithms for each of the remaining problems individually. LEMP is able to leverage existing methods for cosine similarity search, but we also provide a number of novel search algorithms tailored to our setting. We conducted an extensive experimental study that provides insight into the performance of many state-of-the-art techniques—including LEMP—on multiple real-world datasets. We found that LEMP often was significantly faster or more accurate than alternative methods.
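To make the reduction described above concrete, here is a minimal Python sketch of the general idea, not the authors' LEMP implementation: database vectors are bucketed by norm, whole buckets are pruned with the bound q·p <= ||q|| * (largest norm in the bucket), and each surviving bucket is handled as a small cosine-similarity problem. The function name, threshold interface, and bucket size are illustrative assumptions.

```python
import numpy as np

def mips_above_threshold(queries, db, theta, bucket_size=1000):
    """For each query q, return indices of db vectors p with <q, p> >= theta (illustrative only)."""
    norms = np.linalg.norm(db, axis=1)
    order = np.argsort(-norms)                      # longest database vectors first
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
    unit_db = db / norms[:, None]                   # normalize once for the cosine subproblems
    results = []
    for q in queries:
        qn = np.linalg.norm(q)
        hits = []
        for bucket in buckets:
            max_norm = norms[bucket].max()
            if qn * max_norm < theta:               # no vector here (or in any later bucket) can qualify
                break
            # local cosine-similarity problem: keep p with cos(q, p) >= theta / (||q|| * ||p||)
            cos = unit_db[bucket] @ (q / qn)
            hits.extend(bucket[cos >= theta / (qn * norms[bucket])].tolist())
        results.append(hits)
    return results
```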
Citations: 27
UniAD
Pub Date: 2016-11-21, DOI: 10.1145/3009957
Xiaogang Shi, B. Cui, G. Dobbie, Beng Chin Ooi
Instead of constructing complex declarative queries, many users prefer to write their programs using procedural code embedded with simple queries. Since many users are not expert programmers or the programs are written in a rush, these programs usually exhibit poor performance in practice and it is a challenge to automatically and efficiently optimize these programs. In this article, we present UniAD, which stands for Unified execution for Ad hoc Data processing, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. We provide the background of program semantics and propose a novel intermediate representation, called Unified Intermediate Representation (UniIR), which utilizes a simple and expressive mechanism HOQ to describe the operations performed in programs. By combining both procedural and declarative logics with the proposed intermediate representation, we can perform various optimizations across the boundary between procedural and declarative code. We propose a transformation-based optimizer to automatically optimize programs and implement the UniAD system. The extensive experimental results on various benchmarks demonstrate that our techniques can significantly improve the performance of a wide range of data processing programs.
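As a purely hypothetical illustration (not UniAD's UniIR or HOQ mechanisms), the snippet below shows the kind of program the article targets, procedural code that issues a simple query per loop iteration, together with the set-oriented rewrite that an optimizer working across the procedural/declarative boundary could produce; sqlite3 and the orders table stand in for the actual engine and schema.

```python
import sqlite3

def total_revenue_procedural(conn: sqlite3.Connection, order_ids):
    # One embedded query per iteration: easy to write, typically slow in practice.
    total = 0.0
    for oid in order_ids:
        row = conn.execute("SELECT price * qty FROM orders WHERE id = ?", (oid,)).fetchone()
        if row is not None:
            total += row[0]
    return total

def total_revenue_optimized(conn: sqlite3.Connection, order_ids):
    # Equivalent logic pushed into a single set-oriented query (one round trip, one scan or index probe batch).
    placeholders = ",".join("?" * len(order_ids))
    row = conn.execute(
        f"SELECT COALESCE(SUM(price * qty), 0) FROM orders WHERE id IN ({placeholders})",
        list(order_ids),
    ).fetchone()
    return row[0]
```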
Citations: 2
Smart Meter Data Analytics
Pub Date: 2016-11-21, DOI: 10.1145/3004295
Xiufeng Liu, Lukasz Golab, W. Golab, I. Ilyas, Shichao Jin
Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (e.g., every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature. However, the focus has been on what can be done with the data rather than how to do it efficiently. In this article, we examine smart meter analytics from a software performance perspective. First, we design a performance benchmark that includes common smart meter analytics tasks. These include offline feature extraction and model building as well as a framework for online anomaly detection that we propose. Second, since obtaining real smart meter data is difficult due to privacy issues, we present an algorithm for generating large realistic datasets from a small seed of real data. Third, we implement the proposed benchmark using five representative platforms: a traditional numeric computing platform (Matlab), a relational DBMS with a built-in machine learning toolkit (PostgreSQL/MADlib), a main-memory column store (“System C”), and two distributed data processing platforms (Hive and Spark/Spark Streaming). We compare the five platforms in terms of application development effort and performance on a multicore machine as well as a cluster of 16 commodity servers.
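As a hedged, hypothetical example of the kind of offline feature-extraction task such a benchmark contains (the column names and the particular features below are our own choices, not the paper's benchmark specification), one might compute per-household summaries from hourly readings as follows:

```python
import pandas as pd

def extract_features(readings: pd.DataFrame) -> pd.DataFrame:
    """readings: columns household_id, timestamp (hourly datetimes), kwh."""
    readings = readings.copy()
    readings["date"] = readings["timestamp"].dt.date
    daily = readings.groupby(["household_id", "date"])["kwh"].sum()
    per_house = daily.groupby(level="household_id")
    features = pd.DataFrame({
        "avg_daily_kwh": per_house.mean(),                                      # average daily consumption
        "peak_daily_kwh": per_house.max(),                                      # highest single-day consumption
        "base_load": readings.groupby("household_id")["kwh"].quantile(0.05),    # proxy for always-on load
        "peak_to_avg": per_house.max() / per_house.mean(),
    })
    return features.reset_index()
```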
Citations: 39
Joins via Geometric Resolutions
Pub Date: 2016-11-08, DOI: 10.1145/2967101
Mahmoud Abo Khamis, H. Ngo, Christopher Ré, A. Rudra
We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, which generalizes classical and recent worst-case algorithmic results on computing joins. In addition, we use our framework and the same algorithm to show a series of what are colloquially known as beyond worst-case results. The framework allows us to prove results for data stored in BTrees, multidimensional data structures, and even multiple indices per table. A key idea in our framework is formalizing the inference one does with an index as a type of geometric resolution, transforming the algorithmic problem of computing joins to a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.
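The "backtracking search" reading of the algorithm can be illustrated, in heavily simplified form, on the triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c): bind one attribute at a time and only extend bindings that remain consistent with every relation seen so far. The sketch below is our own attribute-at-a-time illustration and omits the geometric machinery, memoization, and index structures the article actually relies on.

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """All (a, b, c) with (a, b) in R, (b, c) in S, (a, c) in T."""
    # Index each relation by the attribute bound first, so extension steps are cheap lookups.
    R_by_a, S_by_b, T_by_a = defaultdict(set), defaultdict(set), defaultdict(set)
    for a, b in R:
        R_by_a[a].add(b)
    for b, c in S:
        S_by_b[b].add(c)
    for a, c in T:
        T_by_a[a].add(c)

    out = []
    for a in R_by_a.keys() & T_by_a.keys():   # bind a: must appear in both R and T
        for b in R_by_a[a]:                   # extend with b consistent with R(a, b)
            for c in S_by_b[b] & T_by_a[a]:   # extend with c consistent with S(b, c) and T(a, c)
                out.append((a, b, c))
    return out

# triangle_join({(1, 2)}, {(2, 3)}, {(1, 3)}) == [(1, 2, 3)]
```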
Citations: 16
The Goal Behind the Action
Pub Date: 2016-11-08, DOI: 10.1145/2934666
Dimitra Papadimitriou, G. Koutrika, J. Mylopoulos, Yannis Velegrakis
Human activity is almost always intentional, be it in a physical context or as part of an interaction with a computer system. By understanding why user-generated events are happening and what purposes they serve, a system can offer a significantly improved and more engaging experience. However, goals cannot be easily captured. Analyzing user actions such as clicks and purchases can reveal patterns and behaviors, but understanding the goals behind these actions is a different and challenging issue. Our work presents a unified, multidisciplinary viewpoint for goal management that covers many different cases where goals can be used and techniques with which they can be exploited. Our purpose is to provide a common reference point to the concepts and challenging tasks that need to be formally defined when someone wants to approach a data analysis problem from a goal-oriented point of view. This work also serves as a springboard to discuss several open challenges and opportunities for goal-oriented approaches in data management, analysis, and sharing systems and applications.
Citations: 6
Skycube Materialization Using the Topmost Skyline or Functional Dependencies
Pub Date: 2016-11-02, DOI: 10.1145/2955092
S. Maabout, C. Ordonez, Patrick Kamnang Wanko, N. Hanusse
Given a table T(Id, D1, …, Dd), the skycube of T is the set of skylines with respect to all nonempty subsets (subspaces) of the set of all dimensions {D1, …, Dd}. To optimize the evaluation of any skyline query, the solutions proposed so far in the literature either (i) precompute all of the skylines or (ii) use compression techniques so that the derivation of any skyline can be done with little effort. Even though solutions (i) are appealing because skyline queries have optimal execution time, they suffer from time and space scalability because the number of skylines to be materialized is exponential with respect to d. On the other hand, solutions (ii) are attractive in terms of memory consumption, but as we show, they also have a high time complexity. In this article, we make contributions to both kinds of solutions. We first observe that skyline patterns are monotonic. This property leads to a simple yet efficient solution for full and partial skycube materialization when the skyline with respect to all dimensions, the topmost skyline, is small. On the other hand, when the topmost skyline is large relative to the size of the input table, it turns out that functional dependencies, a fundamental concept in databases, uncover a monotonic property between skylines. Equipped with this information, we show that closed attribute sets are fundamental for partial and full skycube materialization. Extensive experiments with real and synthetic datasets show that our solutions generally outperform state-of-the-art algorithms.
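For readers unfamiliar with the object being materialized, here is a deliberately naive baseline (our own illustration; the article's algorithms exist precisely to avoid this brute-force enumeration): a skyline keeps the tuples not dominated in the chosen subspace, and the skycube is one skyline per nonempty subset of the dimensions, assuming smaller values are preferred.

```python
from itertools import combinations

def dominates(p, q, dims):
    """p dominates q on dims: no worse everywhere, strictly better somewhere (minimization)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)

def skyline(rows, dims):
    return [r for r in rows if not any(dominates(o, r, dims) for o in rows)]

def skycube(rows, num_dims):
    cube = {}
    for k in range(1, num_dims + 1):
        for subspace in combinations(range(num_dims), k):
            cube[subspace] = skyline(rows, subspace)
    return cube

# rows = [(1, 5), (2, 2), (3, 1)]
# skycube(rows, 2) materializes the skylines for {D1}, {D2}, and {D1, D2};
# the skyline over all dimensions, {D1, D2}, is the "topmost" skyline.
```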
Citations: 5
Building a Hybrid Warehouse
Pub Date: 2016-11-02, DOI: 10.1145/2972950
Yuanyuan Tian, Fatma Özcan, Tao Zou, R. Goncalves, H. Pirahesh
The Hadoop Distributed File System (HDFS) has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. There are many applications that require correlating data stored in HDFS with EDW data, such as the analysis that associates click logs stored in HDFS with the sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side does not have efficient SQL support. In this article, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize the data movement and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases. We further develop a sophisticated cost model for the various join algorithms and show that it can facilitate query optimization in the hybrid warehouse to correctly choose the right algorithm under different predicate and join selectivities.
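The Bloom-filter idea can be sketched as a simple semi-join reduction (a minimal sketch under our own assumptions about the data layout, not the system's zigzag join): the database side builds a compact filter over its join keys, the filter is shipped to the HDFS side, and only rows that probably match are moved or joined. Bloom filters have no false negatives, so no matching row is lost.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive several bit positions per key from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

def bloom_semi_join(edw_keys, hdfs_rows, key_of):
    """Keep only the HDFS rows whose join key may appear on the database side."""
    bf = BloomFilter()
    for k in edw_keys:          # built where the database (EDW) data lives
        bf.add(k)
    # filter shipped to the HDFS side; only probable matches survive
    return [row for row in hdfs_rows if bf.might_contain(key_of(row))]
```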
Citations: 9
Exploiting Integrity Constraints for Cleaning Trajectories of RFID-Monitored Objects
Pub Date: 2016-11-02, DOI: 10.1145/2939368
Bettina Fazzinga, S. Flesca, F. Furfaro, F. Parisi
A probabilistic framework for cleaning the data collected by Radio-Frequency IDentification (RFID) tracking systems is introduced. What has to be cleaned is the set of trajectories that are the possible interpretations of the readings: a trajectory in this set is a sequence whose generic element is a location covered by the reader(s) that made the detection at the corresponding time point. The cleaning is guided by integrity constraints and consists of discarding the inconsistent trajectories and assigning to the others a suitable probability of being the actual one. The probabilities are evaluated by adopting probabilistic conditioning that logically consists of the following steps. First, the trajectories are assigned a priori probabilities that rely on the independence assumption between the time points. Then, these probabilities are revised according to the spatio-temporal correlations encoded by the constraints. This is done by conditioning the a priori probability of each trajectory to the event that the constraints are satisfied: this means taking the ratio of this a priori probability to the sum of the a priori probabilities of all the consistent trajectories. Instead of performing these steps by materializing all the trajectories and their a priori probabilities (which is infeasible, owing to the typically huge number of trajectories), our approach exploits a data structure called conditioned trajectory graph (ct-graph) that compactly represents the trajectories and their conditioned probabilities, and an algorithm for efficiently constructing the ct-graph, which progressively builds it while avoiding the construction of components encoding inconsistent trajectories.
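The conditioning step can be made concrete with a toy example (explicit enumeration is shown only for clarity; the article's ct-graph exists precisely to avoid materializing the trajectories): assign each candidate trajectory the product of its per-time-point probabilities, discard trajectories that violate the constraints, and renormalize over the survivors.

```python
from itertools import product

def condition_trajectories(per_time_candidates, is_consistent):
    """per_time_candidates: one {location: probability} dict per time point.
    is_consistent: predicate encoding the integrity constraints over a full trajectory."""
    posteriors = {}
    for traj in product(*(list(c) for c in per_time_candidates)):
        prior = 1.0
        for t, loc in enumerate(traj):
            prior *= per_time_candidates[t][loc]   # a priori probability: independence across time points
        if is_consistent(traj):                    # inconsistent trajectories are discarded
            posteriors[traj] = prior
    total = sum(posteriors.values())
    if total == 0:
        return {}
    return {traj: p / total for traj, p in posteriors.items()}   # conditioned probabilities

# Example constraint: consecutive locations must be adjacent on a site map.
# adjacent = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")}
# ok = lambda tr: all((x, y) in adjacent for x, y in zip(tr, tr[1:]))
# condition_trajectories([{"A": 0.6, "C": 0.4}, {"B": 1.0}], ok)
```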
Citations: 19
Guarded-Based Disjunctive Tuple-Generating Dependencies
Pub Date: 2016-11-02, DOI: 10.1145/2976736
P. Bourhis, M. Manna, Michael Morak, Andreas Pieris
We perform an in-depth complexity analysis of query answering under guarded-based classes of disjunctive tuple-generating dependencies (DTGDs), focusing on (unions of) conjunctive queries ((U)CQs). We show that the problem under investigation is very hard, namely 2ExpTime-complete, even for fixed sets of dependencies of a very restricted form. This is a surprising lower bound that demonstrates the enormous impact of disjunction on query answering under guarded-based tuple-generating dependencies, and also reveals the source of complexity for expressive logics such as the guarded fragment of first-order logic. We then proceed to investigate whether prominent subclasses of (U)CQs (i.e., queries of bounded treewidth and hypertree-width, and acyclic queries) have a positive impact on the complexity of the problem under consideration. We show that queries of bounded treewidth and bounded hypertree-width do not reduce the complexity of our problem, even if we focus on predicates of bounded arity or on fixed sets of DTGDs. Regarding acyclic queries, although the problem remains 2ExpTime-complete in general, in some relevant settings the complexity reduces to ExpTime-complete. Finally, with the aim of identifying tractable cases, we focus our attention on atomic queries. We show that atomic queries do not make the query answering problem easier under classes of guarded-based DTGDs that allow more than one atom to occur in the body of the dependencies. However, the complexity significantly decreases in the case of dependencies that can have only one atom in the body. In particular, we obtain a Ptime-completeness if we focus on predicates of bounded arity, and AC0-membership when the set of dependencies and the query are fixed. Interestingly, our results can be used as a generic tool for establishing complexity results for query answering under various description logics.
Citations: 43
Extending the Kernel of a Relational DBMS with Comprehensive Support for Sequenced Temporal Queries
Pub Date: 2016-11-02, DOI: 10.1145/2967608
Anton Dignös, Michael H. Böhlen, J. Gamper, Christian S. Jensen
Many databases contain temporal, or time-referenced, data and use intervals to capture the temporal aspect. While SQL-based database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offer ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data where queries are generally more complex exacerbates the difficulties and thus the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a step-by-step manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine. This article demonstrates how it is possible to extend the relational database engine to achieve a full-fledged, industrial-strength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of so-called scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS are covered that offer insights into pertinent design properties of the article's proposal. The new system is available as open-source software.
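As a minimal sketch of the reduction idea (our own illustration using half-open intervals and Python lists; the article's normalizer and aligner operators work inside the relational engine), a sequenced join can be obtained by adjusting each matching pair of tuples to the overlap of their intervals, after which the remaining logic is an ordinary, nontemporal join:

```python
def sequenced_join(r, s):
    """r, s: lists of (key, payload, start, end) with half-open validity intervals [start, end)."""
    out = []
    for k1, p1, b1, e1 in r:
        for k2, p2, b2, e2 in s:
            if k1 != k2:
                continue
            b, e = max(b1, b2), min(e1, e2)   # interval adjustment: overlap of the two validity periods
            if b < e:                          # non-empty overlap: the joined tuple is valid exactly there
                out.append((k1, p1, p2, b, e))
    return out

# Example:
# r = [("acct1", "dept=A", 1, 10)]
# s = [("acct1", "mgr=X", 5, 20)]
# sequenced_join(r, s) == [("acct1", "dept=A", "mgr=X", 5, 10)]
```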
Citations: 35