
Proceedings of the 2016 International Conference on Management of Data: Latest Articles

SourceSight: Enabling Effective Source Selection
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899403
Theodoros Rekatsinas, A. Deshpande, X. Dong, L. Getoor, D. Srivastava
Recently there has been a rapid increase in the number of data sources and data services, such as cloud-based data markets and data portals, that facilitate the collection, publishing and trading of data. Data sources typically exhibit large heterogeneity in the type and quality of data they provide. Unfortunately, when the number of data sources is large, it is difficult for users to reason about the actual usefulness of sources for their applications and the trade-offs between the benefits and costs of acquiring and integrating sources. In this demonstration we present SourceSight, a system that allows users to interactively explore a large number of heterogeneous data sources, and discover valuable sets of sources for diverse integration tasks. SourceSight uses a novel multi-level source quality index that enables effective source selection at different granularity levels, and introduces a collection of new techniques to discover and evaluate relevant sources for integration.
Citations: 17
PerNav: A Route Summarization Framework for Personalized Navigation
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899384
Yaguang Li, Han Su, Ugur Demiryurek, Bolong Zheng, Kai Zeng, C. Shahabi
In this paper, we study PerNav, a route summarization framework for personalized navigation whose goal is to generate more intuitive and customized turn-by-turn directions based on user-generated content. The turn-by-turn directions provided by existing navigation applications are derived exclusively from the underlying road network topology, i.e., the connectivity of nodes to each other. The directions are therefore reduced to a metric translation of the physical world (e.g., distance or time to a turn) into spoken language. Such a translation, which ignores human cognition of geographic space, is often verbose and redundant for drivers who already know the area. PerNav uses the wealth of user-generated historical trajectory data to extract "landmarks" (e.g., points of interest or intersections) and the frequently visited routes between them from the road network. This extracted information is then used to produce cognitive turn-by-turn directions customized for each user.
Citations: 5
What Makes a Good Physical plan?: Experiencing Hardware-Conscious Query Optimization with Candomblé
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899410
H. Pirk, Oscar Moll, S. Madden
Query optimization is hard and the current proliferation of "modern" hardware does nothing to make it any easier. In addition, the tools that are commonly used by performance engineers, such as compiler intrinsics, static analyzers or hardware performance counters, are neither integrated with data management systems nor easy to learn. This fact makes it (unnecessarily) hard to educate engineers, to prototype and to optimize database query plans for modern hardware. To address this problem, we developed a system called Candomblé that lets database performance engineers interactively examine, optimize and evaluate query plans using a touch-based interface. Candomblé puts attendants in the place of a physical query optimizer that has to rewrite a physical query plan into a better equivalent plan. Attendants experience the challenges of ad-hoc optimizing a physical plan for processing devices such as GPUs and CPUs and capture their gained knowledge in rules to be used by a rule-based optimizer.
Citations: 5
Similarity Join over Array Data
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915247
Weijie Zhao, Florin Rusu, Bin Dong, Kesheng Wu
Scientific applications are generating an ever-increasing volume of multi-dimensional data that are largely processed inside distributed array databases and frameworks. Similarity join is a fundamental operation across scientific workloads that requires complex processing over an unbounded number of pairs of multi-dimensional points. In this paper, we introduce a novel distributed similarity join operator for multi-dimensional arrays. Unlike immediate extensions to array join and relational similarity join, the proposed operator minimizes the overall data transfer and network congestion while providing load-balancing, without completely repartitioning and replicating the input arrays. We formally define array similarity join and present the design, optimization strategies, and evaluation of the first array similarity join operator.
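To make the operation itself concrete, the following is a minimal single-machine sketch of a similarity join over two sets of multi-dimensional points. It is a naive quadratic baseline, not the distributed operator proposed in the paper; the function name `naive_similarity_join`, the choice of Euclidean distance, and the threshold `eps` are assumptions made only for illustration.

```python
from itertools import product
from math import dist


def naive_similarity_join(left, right, eps):
    """Return every (p, q) pair, p from left and q from right,
    whose Euclidean distance is at most eps.

    Hypothetical baseline: it compares all |left| * |right| pairs,
    which is exactly the cost a real operator tries to avoid.
    """
    return [(p, q) for p, q in product(left, right) if dist(p, q) <= eps]


# Two tiny sets of 2-D cell coordinates.
left = [(0, 0), (5, 5)]
right = [(0, 1), (9, 9)]
pairs = naive_similarity_join(left, right, eps=1.5)
print(pairs)
```

Only the pair (0, 0) and (0, 1) lies within the threshold here. In a distributed setting the two inputs are spread across nodes, which is why the amount of data moved between nodes becomes the dominant concern.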
Citations: 27
Robust Query Processing in Co-Processor-accelerated Databases
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882936
S. Breß, Henning Funke, J. Teubner
Technology limitations are making the use of heterogeneous computing devices much more than an academic curiosity. In fact, the use of such devices is widely acknowledged to be the only promising way to achieve application-speedups that users urgently need and expect. However, building a robust and efficient query engine for heterogeneous co-processor environments is still a significant challenge. In this paper, we identify two effects that limit performance when co-processor resources become scarce. Cache thrashing occurs when the working set of queries does not fit into the co-processor's data cache, resulting in performance degradations of up to a factor of 24. Heap contention occurs when multiple operators run in parallel on a co-processor and their accumulated memory footprint exceeds the main memory capacity of the co-processor, slowing down query execution by up to a factor of six. We propose solutions for both effects. Data-driven operator placement avoids data movements when they might be harmful; query chopping limits co-processor memory usage and thus avoids contention. The combined approach, data-driven query chopping, achieves robust and scalable performance on co-processors. We validate our proposal with our open-source GPU-accelerated database engine CoGaDB and the popular star schema and TPC-H benchmarks.
Citations: 58
High-Performance Geospatial Analytics in HyPerSpace
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899412
Varun Pandey, Andreas Kipf, Dimitri Vorona, Tobias Mühlbauer, Thomas Neumann, A. Kemper
In the past few years, massive amounts of location-based data have been captured. Numerous datasets containing user location information are readily available to the public. Analyzing such datasets can lead to fascinating insights into the mobility patterns and behaviors of users. Moreover, in recent times a number of geospatial data-driven companies like Uber, Lyft, and Foursquare have emerged. Real-time analysis of geospatial data is essential and enables an emerging class of applications. Database support for geospatial operations is turning into a necessity instead of a distinct feature provided by only a few databases. Even though a lot of database systems provide geospatial support nowadays, queries often do not consider the most current database state. Geospatial queries are inherently slow given the fact that some of these queries require a couple of geometric computations. Disk-based database systems that do support geospatial datatypes and queries provide rich features and functions, but they fall behind when performance is considered, specifically if real-time analysis of the latest transactional state is a requirement. In this demonstration, we present HyPerSpace, an extension to the high-performance main-memory database system HyPer developed at the Technical University of Munich, capable of processing geospatial queries with sub-second latencies.
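As an illustration of the kind of geometric computation a geospatial query may involve, here is a small sketch of a point-in-polygon test using the standard ray-casting technique. It is not taken from HyPerSpace; the function name `point_in_polygon` and the example coordinates are assumptions for illustration only.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    starting at `point` and going right crosses; odd means inside.

    `polygon` is a list of (x, y) vertices in order. Points exactly on
    an edge are not treated specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y-coordinate can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets the horizontal line at y.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))
print(point_in_polygon((5, 2), square))
```

Each test touches every edge of the polygon, so a query that checks many points against many polygons performs a large number of these computations.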
Citations: 25
Optimization of Nested Queries using the NF2 Algebra
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915241
Jürgen Hölsch, Michael Grossniklaus, M. Scholl
A key promise of SQL is that the optimizer will find the most efficient execution plan, regardless of how the query is formulated. In general, query optimizers of modern database systems are able to keep this promise, with the notable exception of nested queries. While several optimization techniques for nested queries have been proposed, their adoption in practice has been limited. In this paper, we argue that the NF2 (non-first normal form) algebra, which was originally designed to process nested tables, is a better approach to nested query optimization as it fulfills two key requirements. First, the NF2 algebra can represent all types of nested queries as well as both existing and novel optimization techniques based on its equivalences. Second, performance benefits can be achieved with few changes to existing transformation-based query optimizers as the NF2 algebra is an extension of the relational algebra.
Citations: 9
LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882930
Yuxin Zheng, Qi Guo, A. Tung, Sai Wu
Due to the "curse of dimensionality" problem, it is very expensive to process the nearest neighbor (NN) query in high-dimensional spaces; hence, approximate approaches, such as Locality-Sensitive Hashing (LSH), are widely used for their theoretical guarantees and empirical performance. Current LSH-based approaches target the L1 and L2 spaces, while, as shown in previous work, the fractional distance metrics (Lp metrics with 0 < p < 1) can provide more insightful results than the usual L1 and L2 metrics for data mining and multimedia applications. However, none of the existing work can support multiple fractional distance metrics using one index. In this paper, we propose LazyLSH, which answers approximate nearest neighbor queries for multiple Lp metrics with theoretical guarantees. Unlike previous LSH approaches, which need to build one dedicated index for every query space, LazyLSH uses a single base index to support the computations in multiple Lp spaces, significantly reducing the maintenance overhead. Extensive experiments show that LazyLSH provides more accurate results for approximate kNN search under fractional distance metrics.
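The Lp distances mentioned above can be sketched in a few lines. The snippet below only computes exact Lp distances by brute force and shows that the nearest neighbor of a query can change with p; it does not implement LSH or the LazyLSH index. The names `lp_distance` and `nearest` and the sample points are assumptions for illustration.

```python
def lp_distance(x, y, p):
    """Lp distance: (sum_i |x_i - y_i|^p)^(1/p), for any p > 0.

    For 0 < p < 1 this is the 'fractional' case; it no longer satisfies
    the triangle inequality, so it is not a metric in the strict sense.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def nearest(query, candidates, p):
    """Exact nearest neighbor by linear scan (illustrative baseline)."""
    return min(candidates, key=lambda c: lp_distance(query, c, p))


query = (0, 0)
candidates = [(3, 0), (2, 2)]
print(nearest(query, candidates, p=2))    # nearest under L2
print(nearest(query, candidates, p=0.5))  # nearest under L0.5
```

Under L2 the point (2, 2) is closer to the query (distance about 2.83 versus 3), while under p = 0.5 the point (3, 0) is closer (distance about 3 versus 8). This sensitivity to p is why an index built for one distance function does not directly serve another.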
Citations: 63
Research Contribution as a Measure of Influence
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2914834
Lais M. A. Rocha, Mirella M. Moro
We propose the 3c-index, which measures the degree of influence of researchers by evaluating the links they establish between communities. We evaluate its performance against well-known metrics. The results show that the 3c-index outperforms them in most cases and can be employed as a complementary metric to assess researchers' productivity.
Citations: 4
Automated Demand-driven Resource Scaling in Relational Database-as-a-Service
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2903733
Sudipto Das, Feng Li, Vivek R. Narasayya, A. König
Relational Database-as-a-Service (DaaS) platforms today support the abstraction of a resource container that guarantees a fixed amount of resources. Tenants are responsible for selecting a container size suitable for their workloads, which they can change to leverage the cloud's elasticity. However, automating this task is daunting for most tenants since estimating resource demands for arbitrary SQL workloads in an RDBMS is complex and challenging. In addition, workloads and resource requirements can vary significantly within minutes to hours, and container sizes vary by orders of magnitude in both the amount of resources and the monetary cost. We present a solution to enable a DaaS to auto-scale container sizes on behalf of its tenants. Approaches to auto-scale stateless services, such as web servers, that rely on historical resource utilization as the primary signal often perform poorly for stateful database servers, which are significantly more complex. Our solution derives a set of robust signals from database engine telemetry and combines them to significantly improve the accuracy of demand estimation for database workloads, resulting in more accurate scaling decisions. Our solution raises the abstraction by allowing tenants to reason about monetary budget and query latency rather than resources. We prototyped our approach in Microsoft Azure SQL Database and ran extensive experiments using workloads with realistic time-varying resource demand patterns obtained from production traces. Compared to an approach that uses only resource utilization to estimate demand, our approach results in 1.5x to 3x lower monetary costs while achieving comparable query latencies.
Citations: 54