Proceedings of the 2018 International Conference on Management of Data最新文献

英文中文

Lightweight Cardinality Estimation in LSM-based Systems 基于lsm系统的轻量级基数估计

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183761

Ildar Absalyamov, M. Carey, V. Tsotras

Data sources, such as social media, mobile apps and IoT sensors, generate billions of records each day. Keeping up with this influx of data while providing useful analytics to the users is a major challenge for today's data-intensive systems. A popular solution that allows such systems to handle rapidly incoming data is to rely on log-structured merge (LSM) storage models. LSM-based systems provide a tunable trade-off between ingesting vast amounts of data at a high rate and running efficient analytical queries on top of that data. For queries, it is well-known that the query processing performance largely depends on the ability to generate efficient execution plans. Previous research showed that OLAP query workloads rely on having small, yet precise, statistical summaries of the underlying data, which can drive the cost-based query optimization. In this paper we address the problem of computing data statistics for workloads with rapid data ingestion and propose a lightweight statistics-collection framework that exploits the properties of LSM storage. Our approach is designed to piggyback on the events (flush and merge) of the LSM lifecycle. This allows us to easily create an initial statistics and then keep them in sync with rapidly changing data while minimizing the overhead to the existing system. We have implemented and adapted well-known algorithms to produce various types of statistical synopses, including equi-width histograms, equi-height histograms, and wavelets. We performed an in-depth empirical evaluation that considers both the cardinality estimation accuracy and runtime overheads of collecting and using statistics. The experiments were conducted by prototyping our approach on top of Apache AsterixDB, an open source Big Data management system that has an entirely LSM-based storage backend.

社交媒体、移动应用程序和物联网传感器等数据源每天产生数十亿条记录。在向用户提供有用分析的同时，跟上数据的涌入是当今数据密集型系统面临的主要挑战。允许此类系统快速处理传入数据的流行解决方案是依赖于日志结构合并(LSM)存储模型。基于lsm的系统在高速摄取大量数据和在这些数据之上运行高效的分析查询之间提供了一种可调的权衡。对于查询，众所周知，查询处理性能在很大程度上取决于生成高效执行计划的能力。以前的研究表明，OLAP查询工作负载依赖于底层数据的小而精确的统计摘要，这可以推动基于成本的查询优化。在本文中，我们解决了具有快速数据摄取的工作负载的数据统计计算问题，并提出了一个利用LSM存储属性的轻量级统计收集框架。我们的方法旨在利用LSM生命周期的事件(刷新和合并)。这使我们能够轻松地创建初始统计数据，然后使它们与快速变化的数据保持同步，同时最大限度地减少对现有系统的开销。我们已经实现并调整了众所周知的算法来产生各种类型的统计概要，包括等宽直方图，等高直方图和小波。我们进行了深入的经验评估，考虑了基数估计准确性和收集和使用统计数据的运行时开销。我们的实验是在Apache AsterixDB(一个开源的大数据管理系统，完全基于lsm的存储后端)上进行的。

{"title":"Lightweight Cardinality Estimation in LSM-based Systems","authors":"Ildar Absalyamov, M. Carey, V. Tsotras","doi":"10.1145/3183713.3183761","DOIUrl":"https://doi.org/10.1145/3183713.3183761","url":null,"abstract":"Data sources, such as social media, mobile apps and IoT sensors, generate billions of records each day. Keeping up with this influx of data while providing useful analytics to the users is a major challenge for today's data-intensive systems. A popular solution that allows such systems to handle rapidly incoming data is to rely on log-structured merge (LSM) storage models. LSM-based systems provide a tunable trade-off between ingesting vast amounts of data at a high rate and running efficient analytical queries on top of that data. For queries, it is well-known that the query processing performance largely depends on the ability to generate efficient execution plans. Previous research showed that OLAP query workloads rely on having small, yet precise, statistical summaries of the underlying data, which can drive the cost-based query optimization. In this paper we address the problem of computing data statistics for workloads with rapid data ingestion and propose a lightweight statistics-collection framework that exploits the properties of LSM storage. Our approach is designed to piggyback on the events (flush and merge) of the LSM lifecycle. This allows us to easily create an initial statistics and then keep them in sync with rapidly changing data while minimizing the overhead to the existing system. We have implemented and adapted well-known algorithms to produce various types of statistical synopses, including equi-width histograms, equi-height histograms, and wavelets. We performed an in-depth empirical evaluation that considers both the cardinality estimation accuracy and runtime overheads of collecting and using statistics. The experiments were conducted by prototyping our approach on top of Apache AsterixDB, an open source Big Data management system that has an entirely LSM-based storage backend.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89775183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 15

Persistent Bloom Filter: Membership Testing for the Entire History 持久布隆过滤器:整个历史的成员测试

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183737

Yanqing Peng, Jinwei Guo, Feifei Li, Weining Qian, Aoying Zhou

Membership testing is the problem of testing whether an element is in a set of elements. Performing the test exactly is expensive space-wise, requiring the storage of all elements in a set. In many applications, an approximate testing that can be done quickly using small space is often desired. Bloom filter (BF) was designed and has witnessed great success across numerous application domains. But there is no compact structure that supports set membership testing for temporal queries, e.g., has person A visited a web server between 9:30am and 9:40am? And has the same person visited the web server again between 9:45am and 9:50am? It is possible to support such "temporal membership testing" using a BF, but we will show that this is fairly expensive. To that end, this paper designs persistent bloom filter (PBF), a novel data structure for temporal membership testing with compact space.

成员测试是测试一个元素是否在一组元素中。准确地执行测试在空间方面是昂贵的，需要存储集合中的所有元素。在许多应用中，通常需要使用较小的空间快速完成近似测试。布隆过滤器(BF)的设计并在许多应用领域取得了巨大的成功。但是没有紧凑的结构来支持时间查询的集合成员测试，例如，A是否在上午9:30到9:40之间访问了web服务器?在上午9:45到9:50之间，同一个人是否再次访问了web服务器?使用BF支持这样的“临时成员测试”是可能的，但是我们将说明这是相当昂贵的。为此，本文设计了持久布隆滤波器(PBF)，这是一种新颖的紧凑空间时间隶属度测试数据结构。

引用次数: 33

Columnstore and B+ tree - Are Hybrid Physical Designs Important? 列存储和B+树-混合物理设计重要吗?

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3190660

Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, M. Syamala

Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied --- a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.

商业dbms(如Microsoft SQL Server)可以满足各种工作负载，包括事务处理、决策支持和操作分析。它们还支持多种物理设计结构，如B+树和columnstore。B+树对于OLTP工作负载和columnstore对于决策支持工作负载的好处是众所周知的。然而，混合物理设计(由同一数据库上的columnstore和B+树索引组成)的重要性并没有得到很好的研究——这是本文的一个重点。我们首先使用精心设计的微基准来量化权衡。这种微基准测试表明，根据工作负载的不同，混合物理设计可以带来数量级的性能提升。对于复杂的实际应用程序，为数据库工作负载选择合适的columnstore和B+树索引组合是一项挑战。我们扩展了Microsoft SQL Server的数据库引擎优化顾问，为给定的工作负载推荐B+树和columnstore索引的合适组合。通过使用行业标准基准测试和几个真实客户工作负载的大量实验，我们量化了能够推荐混合物理设计的物理设计工具如何比依赖仅列存储或仅B+树设计的方法带来更好的执行成本。

{"title":"Columnstore and B+ tree - Are Hybrid Physical Designs Important?","authors":"Adam Dziedzic, Jingjing Wang, Sudipto Das, Bolin Ding, Vivek R. Narasayya, M. Syamala","doi":"10.1145/3183713.3190660","DOIUrl":"https://doi.org/10.1145/3183713.3190660","url":null,"abstract":"Commercial DBMSs, such as Microsoft SQL Server, cater to diverse workloads including transaction processing, decision support, and operational analytics. They also support variety in physical design structures such as B+ tree and columnstore. The benefits of B+ tree for OLTP workloads and columnstore for decision support workloads are well-understood. However, the importance of hybrid physical designs, consisting of both columnstore and B+ tree indexes on the same database, is not well-studied --- a focus of this paper. We first quantify the trade-offs using carefully-crafted micro-benchmarks. This micro-benchmarking indicates that hybrid physical designs can result in orders of magnitude better performance depending on the workload. For complex real-world applications, choosing an appropriate combination of columnstore and B+ tree indexes for a database workload is challenging. We extend the Database Engine Tuning Advisor for Microsoft SQL Server to recommend a suitable combination of B+ tree and columnstore indexes for a given workload. Through extensive experiments using industry-standard benchmarks and several real-world customer workloads, we quantify how a physical design tool capable of recommending hybrid physical designs can result in orders of magnitude better execution costs compared to approaches that rely either on columnstore-only or B+ tree-only designs.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"204 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72562732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 14

Session details: Keynote 2 会议详情:主题演讲2

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3258012

Xinyue Dong

引用次数: 0

VALMOD: A Suite for Easy and Exact Detection of Variable Length Motifs in Data Series VALMOD:一套简单而准确地检测数据序列中变长模的工具

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193556

Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn J. Keogh

Data series motif discovery represents one of the most useful primitives for data series mining, with applications to many domains, such as robotics, entomology, seismology, medicine, and climatology, and others. The state-of-the-art motif discovery tools still require the user to provide the motif length. Yet, in several cases, the choice of motif length is critical for their detection. Unfortunately, the obvious brute-force solution, which tests all lengths within a given range, is computationally untenable, and does not provide any support for ranking motifs at different resolutions (i.e., lengths). We demonstrate VALMOD, our scalable motif discovery algorithm that efficiently finds all motifs in a given range of lengths, and outputs a length-invariant ranking of motifs. Furthermore, we support the analysis process by means of a newly proposed meta-data structure that helps the user to select the most promising pattern length. This demo aims at illustrating in detail the steps of the proposed approach, showcasing how our algorithm and corresponding graphical insights enable users to efficiently identify the correct motifs.

数据序列基元发现是数据序列挖掘中最有用的基元之一，应用于机器人、昆虫学、地震学、医学和气候学等许多领域。最先进的motif发现工具仍然需要用户提供motif长度。然而，在一些情况下，基序长度的选择对它们的检测至关重要。不幸的是，测试给定范围内所有长度的明显暴力解决方案在计算上是站不住脚的，并且不支持在不同分辨率(即长度)下对图案进行排序。我们展示了VALMOD，我们的可扩展motif发现算法，它可以有效地找到给定长度范围内的所有motif，并输出一个长度不变的motif排名。此外，我们通过新提出的元数据结构来支持分析过程，该结构可以帮助用户选择最有希望的模式长度。本演示旨在详细说明所提出方法的步骤，展示我们的算法和相应的图形见解如何使用户能够有效地识别正确的图案。

{"title":"VALMOD: A Suite for Easy and Exact Detection of Variable Length Motifs in Data Series","authors":"Michele Linardi, Yan Zhu, Themis Palpanas, Eamonn J. Keogh","doi":"10.1145/3183713.3193556","DOIUrl":"https://doi.org/10.1145/3183713.3193556","url":null,"abstract":"Data series motif discovery represents one of the most useful primitives for data series mining, with applications to many domains, such as robotics, entomology, seismology, medicine, and climatology, and others. The state-of-the-art motif discovery tools still require the user to provide the motif length. Yet, in several cases, the choice of motif length is critical for their detection. Unfortunately, the obvious brute-force solution, which tests all lengths within a given range, is computationally untenable, and does not provide any support for ranking motifs at different resolutions (i.e., lengths). We demonstrate VALMOD, our scalable motif discovery algorithm that efficiently finds all motifs in a given range of lengths, and outputs a length-invariant ranking of motifs. Furthermore, we support the analysis process by means of a newly proposed meta-data structure that helps the user to select the most promising pattern length. This demo aims at illustrating in detail the steps of the proposed approach, showcasing how our algorithm and corresponding graphical insights enable users to efficiently identify the correct motifs.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"201 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76998094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

SketchML: Accelerating Distributed Machine Learning with Data Sketches SketchML:使用数据草图加速分布式机器学习

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196894

Jiawei Jiang, Fangcheng Fu, Tong Yang, B. Cui

To address the challenge of explosive big data, distributed machine learning (ML) has drawn the interests of many researchers. Since many distributed ML algorithms trained by stochastic gradient descent (SGD) involve communicating gradients through the network, it is important to compress the transferred gradient. A category of low-precision algorithms can significantly reduce the size of gradients, at the expense of some precision loss. However, existing low-precision methods are not suitable for many cases where the gradients are sparse and nonuniformly distributed. In this paper, we study is there a compression method that can efficiently handle a sparse and nonuniform gradient consisting of key-value pairs? Our first contribution is a sketch based method that compresses the gradient values. Sketch is a class of algorithms using a probabilistic data structure to approximate the distribution of input data. We design a quantile-bucket quantification method that uses a quantile sketch to sort gradient values into buckets and encodes them with the bucket indexes. To further compress the bucket indexes, our second contribution is a sketch algorithm, namely MinMaxSketch. MinMaxSketch builds a set of hash tables and solves hash collisions with a MinMax strategy. The third contribution of this paper is a delta-binary encoding method that calculates the increment of the gradient keys and stores them with fewer bytes. We also theoretically discuss the correctness and the error bound of three proposed methods. To the best of our knowledge, this is the first effort combining data sketch with ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc., and show that our method is up to 10X faster than existing methods.

为了应对爆炸性大数据带来的挑战，分布式机器学习(ML)引起了许多研究人员的兴趣。由于许多由随机梯度下降(SGD)训练的分布式机器学习算法涉及通过网络通信梯度，因此压缩传输的梯度非常重要。一类低精度算法可以显著减小梯度的大小，但代价是一定的精度损失。然而，现有的低精度方法不适用于梯度稀疏和不均匀分布的许多情况。在本文中，我们研究了是否有一种压缩方法可以有效地处理由键值对组成的稀疏非均匀梯度。我们的第一个贡献是基于草图的方法来压缩梯度值。Sketch是一类使用概率数据结构来近似输入数据分布的算法。我们设计了一种分位数-桶量化方法，使用分位数草图将梯度值排序到桶中，并用桶索引对其进行编码。为了进一步压缩桶索引，我们的第二个贡献是一个草图算法，即MinMaxSketch。MinMaxSketch构建一组哈希表，并使用MinMax策略解决哈希冲突。本文的第三个贡献是一种增量二进制编码方法，该方法可以计算梯度键的增量并使用更少的字节存储它们。从理论上讨论了三种方法的正确性和误差范围。据我们所知，这是第一次将数据草图与机器学习相结合。我们在我们的工业合作伙伴腾讯公司的一个真实集群中实现了一个原型系统，并表明我们的方法比现有方法快10倍。

{"title":"SketchML: Accelerating Distributed Machine Learning with Data Sketches","authors":"Jiawei Jiang, Fangcheng Fu, Tong Yang, B. Cui","doi":"10.1145/3183713.3196894","DOIUrl":"https://doi.org/10.1145/3183713.3196894","url":null,"abstract":"To address the challenge of explosive big data, distributed machine learning (ML) has drawn the interests of many researchers. Since many distributed ML algorithms trained by stochastic gradient descent (SGD) involve communicating gradients through the network, it is important to compress the transferred gradient. A category of low-precision algorithms can significantly reduce the size of gradients, at the expense of some precision loss. However, existing low-precision methods are not suitable for many cases where the gradients are sparse and nonuniformly distributed. In this paper, we study is there a compression method that can efficiently handle a sparse and nonuniform gradient consisting of key-value pairs? Our first contribution is a sketch based method that compresses the gradient values. Sketch is a class of algorithms using a probabilistic data structure to approximate the distribution of input data. We design a quantile-bucket quantification method that uses a quantile sketch to sort gradient values into buckets and encodes them with the bucket indexes. To further compress the bucket indexes, our second contribution is a sketch algorithm, namely MinMaxSketch. MinMaxSketch builds a set of hash tables and solves hash collisions with a MinMax strategy. The third contribution of this paper is a delta-binary encoding method that calculates the increment of the gradient keys and stores them with fewer bytes. We also theoretically discuss the correctness and the error bound of three proposed methods. To the best of our knowledge, this is the first effort combining data sketch with ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc., and show that our method is up to 10X faster than existing methods.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80079774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 90

P-Store: An Elastic Database System with Predictive Provisioning P-Store:具有预测供应的弹性数据库系统

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3190650

Rebecca Taft, Nosayba El-Sayed, M. Serafini, Yu Lu, Ashraf Aboulnaga, M. Stonebraker, Ricardo Mayerhofer, Francisco Jose Andrade

OLTP database systems are a critical part of the operation of many enterprises. Such systems are often configured statically with sufficient capacity for peak load. For many OLTP applications, however, the maximum load is an order of magnitude larger than the minimum, and load varies in a repeating daily pattern. It is thus prudent to allocate computing resources dynamically to match demand. One can allocate resources reactively after a load increase is detected, but this places additional burden on the already-overloaded system to reconfigure. A predictive allocation, in advance of load increases, is clearly preferable. We present P-Store, the first elastic OLTP DBMS to use prediction, and apply it to the workload of B2W Digital (B2W), a large online retailer. Our study shows that P-Store outperforms a reactive system on B2W's workload by causing 72% fewer latency violations, and achieves performance comparable to static allocation for peak demand while using 50% fewer servers.

OLTP数据库系统是许多企业运营的关键部分。这样的系统通常静态配置，具有足够的峰值负载容量。但是，对于许多OLTP应用程序，最大负载比最小负载大一个数量级，并且负载以每天重复的模式变化。因此，动态分配计算资源以匹配需求是明智的。可以在检测到负载增加后响应性地分配资源，但这会给已经过载的系统带来额外的负担，需要重新配置。在负载增加之前进行预测分配显然是可取的。我们提出了P-Store，这是第一个使用预测的弹性OLTP DBMS，并将其应用于大型在线零售商B2W Digital (B2W)的工作负载。我们的研究表明，在B2W的工作负载上，P-Store的性能优于响应式系统，因为它导致的延迟违规减少了72%，并且在使用50%的服务器时实现了与峰值需求静态分配相当的性能。

引用次数: 43

AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics aqp++:将近似查询处理与交互分析的聚合预计算连接起来

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183747

Jinglin Peng, Dongxiang Zhang, Jiannan Wang, J. Pei

Interactive analytics requires database systems to be able to answer aggregation queries within interactive response times. As the amount of data is continuously growing at an unprecedented rate, this is becoming increasingly challenging. In the past, the database community has proposed two separate ideas, sampling-based approximate query processing (AQP) and aggregate precomputation (AggPre) such as data cubes, to address this challenge. In this paper, we argue for the need to connect these two separate ideas for interactive analytics. We propose AQP++, a novel framework to enable the connection. The framework can leverage both a sample as well as a precomputed aggregate to answer user queries. We discuss the advantages of having such a unified framework and identify new challenges to fulfill this vision. We conduct an in-depth study of these challenges for range queries and explore both optimal and heuristic solutions to address them. Our experiments using two public benchmarks and one real-world dataset show that AQP++ achieves a more flexible and better trade-off among preprocessing cost, query response time, and answer quality than AQP or AggPre.

交互式分析要求数据库系统能够在交互式响应时间内回答聚合查询。随着数据量以前所未有的速度持续增长，这变得越来越具有挑战性。过去，数据库社区提出了两种不同的思想，即基于抽样的近似查询处理(AQP)和聚合预计算(AggPre)，如数据立方体，来应对这一挑战。在本文中，我们认为需要将这两个独立的想法联系起来进行交互式分析。我们提出了aqp++，一个新的框架来实现连接。该框架既可以利用示例，也可以利用预先计算的聚合来回答用户查询。我们讨论了拥有这样一个统一框架的优势，并确定了实现这一愿景的新挑战。我们对范围查询的这些挑战进行了深入的研究，并探索了解决这些问题的最优和启发式解决方案。我们使用两个公共基准测试和一个真实数据集进行的实验表明，aqp++比AQP或AggPre在预处理成本、查询响应时间和回答质量之间实现了更灵活和更好的权衡。

{"title":"AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics","authors":"Jinglin Peng, Dongxiang Zhang, Jiannan Wang, J. Pei","doi":"10.1145/3183713.3183747","DOIUrl":"https://doi.org/10.1145/3183713.3183747","url":null,"abstract":"Interactive analytics requires database systems to be able to answer aggregation queries within interactive response times. As the amount of data is continuously growing at an unprecedented rate, this is becoming increasingly challenging. In the past, the database community has proposed two separate ideas, sampling-based approximate query processing (AQP) and aggregate precomputation (AggPre) such as data cubes, to address this challenge. In this paper, we argue for the need to connect these two separate ideas for interactive analytics. We propose AQP++, a novel framework to enable the connection. The framework can leverage both a sample as well as a precomputed aggregate to answer user queries. We discuss the advantages of having such a unified framework and identify new challenges to fulfill this vision. We conduct an in-depth study of these challenges for range queries and explore both optimal and heuristic solutions to address them. Our experiments using two public benchmarks and one real-world dataset show that AQP++ achieves a more flexible and better trade-off among preprocessing cost, query response time, and answer quality than AQP or AggPre.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"121 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90561826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 46

Workload-Aware CPU Performance Scaling for Transactional Database Systems 事务性数据库系统的工作负载感知CPU性能扩展

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196901

Mustafa Korkmaz, M. Karsten, K. Salem, S. Salihoglu

Natural short term fluctuations in the load of transactional data systems present an opportunity for power savings. For example, a system handling 1000 requests per second on average can expect more than 1000 requests in some seconds, fewer in others. By quickly adjusting processing capacity to match such fluctuations, power consumption can be reduced. Many systems do this already, using dynamic voltage and frequency scaling (DVFS) to reduce processor performance and power consumption when the load is low. DVFS is typically controlled by frequency governors in the operating system, or by the processor itself. In this paper, we show that transactional database systems can manage DVFS more effectively than the underlying operating system. This is because the database system has more information about the workload, and more control over that workload, than is available to the operating system. We present a technique called POLARIS for reducing the power consumption of transactional database systems. POLARIS directly manages processor DVFS and controls database transaction scheduling. Its goal is to minimize power consumption while ensuring the transactions are completed within a specified latency target. POLARIS is workload-aware, and can accommodate concurrent workloads with different characteristics and latency budgets. We show that POLARIS can simultaneously reduce power consumption and reduce missed latency targets, relative to operating-system-based DVFS governors.

事务性数据系统负载的自然短期波动为节省电力提供了机会。例如，平均每秒处理1000个请求的系统在某些秒内可能会有超过1000个请求，而在其他秒内则会更少。通过快速调整处理能力以适应这种波动，可以降低功耗。许多系统已经这样做了，当负载较低时，使用动态电压和频率缩放(DVFS)来降低处理器性能和功耗。DVFS通常由操作系统中的频率调节器控制，或者由处理器本身控制。在本文中，我们证明事务性数据库系统可以比底层操作系统更有效地管理DVFS。这是因为与操作系统相比，数据库系统拥有更多关于工作负载的信息，以及对工作负载的更多控制。我们提出了一种称为POLARIS的技术，用于降低事务性数据库系统的功耗。POLARIS直接管理处理器DVFS和控制数据库事务调度。它的目标是最小化功耗，同时确保事务在指定的延迟目标内完成。POLARIS是工作负载感知的，可以适应具有不同特征和延迟预算的并发工作负载。我们表明，相对于基于操作系统的DVFS调控器，POLARIS可以同时降低功耗并减少错过的延迟目标。

{"title":"Workload-Aware CPU Performance Scaling for Transactional Database Systems","authors":"Mustafa Korkmaz, M. Karsten, K. Salem, S. Salihoglu","doi":"10.1145/3183713.3196901","DOIUrl":"https://doi.org/10.1145/3183713.3196901","url":null,"abstract":"Natural short term fluctuations in the load of transactional data systems present an opportunity for power savings. For example, a system handling 1000 requests per second on average can expect more than 1000 requests in some seconds, fewer in others. By quickly adjusting processing capacity to match such fluctuations, power consumption can be reduced. Many systems do this already, using dynamic voltage and frequency scaling (DVFS) to reduce processor performance and power consumption when the load is low. DVFS is typically controlled by frequency governors in the operating system, or by the processor itself. In this paper, we show that transactional database systems can manage DVFS more effectively than the underlying operating system. This is because the database system has more information about the workload, and more control over that workload, than is available to the operating system. We present a technique called POLARIS for reducing the power consumption of transactional database systems. POLARIS directly manages processor DVFS and controls database transaction scheduling. Its goal is to minimize power consumption while ensuring the transactions are completed within a specified latency target. POLARIS is workload-aware, and can accommodate concurrent workloads with different characteristics and latency budgets. We show that POLARIS can simultaneously reduce power consumption and reduce missed latency targets, relative to operating-system-based DVFS governors.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90579089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 18

Deep Learning for Entity Matching: A Design Space Exploration 面向实体匹配的深度学习:一个设计空间探索

Proceedings of the 2018 International Conference on Management of Data

Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196926

Sidharth Mudgal, Han Li, Theodoros Rekatsinas, A. Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, V. Raghavendra

Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.

实体匹配(EM)查找引用相同现实世界实体的数据实例。在本文中，我们研究了将深度学习(DL)应用于EM，以了解DL的优点和局限性。我们回顾了许多为文本处理中的相关匹配任务(例如，实体链接，文本蕴涵等)开发的深度学习解决方案。我们对这些解决方案进行了分类，并定义了EM的深度学习解决方案空间，具体体现为四个具有不同表示能力的解决方案:SIF、RNN、Attention和Hybrid。接下来，我们将探讨深度学习可以帮助解决的EM问题类型。我们考虑三种这样的问题类型，它们分别匹配结构化数据实例、文本实例和脏实例。我们将上述四种深度学习解决方案与麦哲伦(最先进的基于学习的EM解决方案)进行了实证比较。结果表明，深度学习在结构化EM上的表现并不优于当前的解决方案，但在文本EM和脏EM上的表现明显优于它们。对于从业者来说，这表明他们应该认真考虑将深度学习用于文本EM和脏EM问题。最后，对深度学习的性能进行了分析，并对未来的研究方向进行了讨论。

{"title":"Deep Learning for Entity Matching: A Design Space Exploration","authors":"Sidharth Mudgal, Han Li, Theodoros Rekatsinas, A. Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, V. Raghavendra","doi":"10.1145/3183713.3196926","DOIUrl":"https://doi.org/10.1145/3183713.3196926","url":null,"abstract":"Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking, textual entailment, etc.). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83494672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 427

首页上一页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Proceedings of the 2018 International Conference on Management of Data

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀