
Latest publications: 2013 IEEE 29th International Conference on Data Engineering (ICDE)

Finding interesting correlations with conditional heavy hitters
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544898
Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, D. Srivastava
The notion of heavy hitters (items that make up a large fraction of the population) has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: when a particular item is frequent within the context of its parent item. In this work, we introduce and formalize the notion of Conditional Heavy Hitters to identify such items, with applications in network monitoring, and Markov chain modeling. We introduce several streaming algorithms that allow us to find conditional heavy hitters efficiently, and provide analytical results. Different algorithms are successful for different input characteristics. We perform experimental evaluations to demonstrate the efficacy of our methods, and to study which algorithms are most suited for different types of data.
Citations: 12
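The conditional-heavy-hitter notion described in the abstract above can be illustrated with a small exact-counting sketch: a (parent, child) pair qualifies when the child accounts for a large fraction of its parent's occurrences. This is a simplified illustration with hypothetical parameter names (`phi`, `min_parent`); the paper's contribution is bounded-memory streaming algorithms, which this exact version deliberately omits for clarity.

```python
from collections import Counter

def conditional_heavy_hitters(pairs, phi=0.5, min_parent=5):
    """Report (parent, child) pairs where the child accounts for at
    least a phi fraction of its parent's occurrences. Exact counting
    is used here for clarity; a streaming setting would replace the
    Counters with bounded-memory sketches."""
    parent_counts = Counter()
    pair_counts = Counter()
    for parent, child in pairs:
        parent_counts[parent] += 1
        pair_counts[(parent, child)] += 1
    return {
        (p, c): n / parent_counts[p]
        for (p, c), n in pair_counts.items()
        if parent_counts[p] >= min_parent and n / parent_counts[p] >= phi
    }
```

The `min_parent` cutoff guards against reporting spuriously high ratios for parents seen only a handful of times.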
Identifying hot and cold data in main-memory databases
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544811
Justin J. Levandoski, P. Larson, R. Stoica
Main memories are becoming sufficiently large that most OLTP databases can be stored entirely in main memory, but this may not be the best solution. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). It is more economical to store the coldest records on secondary storage such as flash. As a first step towards managing cold data in databases optimized for main memory, we investigate how to efficiently identify hot and cold data. We propose to log record accesses (possibly only a sample, to reduce overhead) and perform offline analysis to estimate record access frequencies. We present four estimation algorithms based on exponential smoothing and experimentally evaluate their efficiency and accuracy. We find that exponential smoothing produces very accurate estimates, leading to higher hit rates than the best caching techniques. Our most efficient algorithm is able to analyze a log of 1B accesses in sub-second time on a workstation-class machine.
Citations: 142
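The exponential-smoothing idea in the abstract above can be sketched compactly: on each access, a record's running frequency estimate is decayed by the number of time slots elapsed since its previous access, then bumped. This is a generic incremental form of exponential smoothing, not the paper's four specific algorithms; the function and parameter names are hypothetical.

```python
def hot_records(access_log, alpha=0.05, k=2):
    """Rank records by exponentially smoothed access frequency.
    access_log is a time-ordered list of record ids, one entry per
    time slot. Each access decays the record's previous estimate by
    (1 - alpha) ** gap, where gap is the number of slots since that
    record was last seen, then adds alpha for the current hit."""
    est, last_seen = {}, {}
    for t, rec in enumerate(access_log):
        prev = est.get(rec, 0.0)
        gap = t - last_seen.get(rec, t)
        est[rec] = alpha + prev * (1 - alpha) ** gap
        last_seen[rec] = t
    # The k highest estimates are the candidate "hot" set.
    return sorted(est, key=est.get, reverse=True)[:k]
```

Decaying lazily at access time (rather than touching every record each slot) is what makes a single pass over a sampled log cheap.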
RoadAlarm: A spatial alarm system on road networks
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544947
Kisung Lee, Emre Yigitoglu, Ling Liu, Binh Han, Balaji Palanisamy, C. Pu
Spatial alarms are one of the fundamental functionalities for many LBSs. We argue that spatial alarms should be road network aware as mobile objects travel on spatially constrained road networks or walk paths. In this software system demonstration, we will present the first prototype system of ROADALARM - a spatial alarm processing system for moving objects on road networks. The demonstration system of ROADALARM focuses on the three unique features of ROADALARM system design. First, we will show that the road network distance-based spatial alarm is best modeled using road network distance such as segment length-based and travel time-based distance. Thus, a road network spatial alarm is a star-like subgraph centered at the alarm target. Second, we will show the suite of ROADALARM optimization techniques to scale spatial alarm processing by taking into account spatial constraints on road networks and mobility patterns of mobile subscribers. Third, we will show that, by equipping the ROADALARM system with an activity monitoring-based control panel, we are able to enable the system administrator and the end users to visualize road network-based spatial alarms and mobility traces of moving objects, and to dynamically select or customize the ROADALARM techniques for spatial alarm processing through a graphical user interface. We show that the ROADALARM system provides both the general system architecture and the essential building blocks for location-based advertisements and location-based reminders.
Citations: 1
Hardware killed the software star
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544807
G. Alonso
Until relatively recently, the development of data processing applications took place largely ignoring the underlying hardware. Only in niche applications (supercomputing, embedded systems) or in special software (operating systems, database internals, language runtimes) did (some) programmers have to pay attention to the actual hardware where the software would run. In most cases, working atop the abstractions provided by either the operating system or by system libraries was good enough. The constant improvements in processor speed did the rest. The new millennium has radically changed the picture. Driven by multiple needs (e.g., scale, physical constraints, energy limitations, virtualization, business models), hardware architectures are changing at a speed and in ways that current development practices for data processing cannot accommodate. From now on, software will have to be developed paying close attention to the underlying hardware and following strict performance engineering principles. In this paper, several aspects of the ongoing hardware revolution and its impact on data processing are analysed, pointing to the need for new strategies to tackle the challenges ahead.
Citations: 12
Efficient direct search on compressed genomic data
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544889
Xiaochun Yang, Bin Wang, Chen Li, Jiaying Wang, Xiaohui Xie
The explosive growth in the amount of data produced by next-generation sequencing poses significant computational challenges on how to store, transmit and query these data, efficiently and accurately. A unique characteristic of the genomic sequence data is that many of them can be highly similar to each other, which has motivated the idea of compressing sequence data by storing only their differences to a reference sequence, thereby drastically cutting the storage cost. However, an unresolved question in this area is whether it is possible to perform search directly on the compressed data, and if so, how. Here we show that directly querying compressed genomic sequence data is possible and can be done efficiently. We describe a set of novel index structures and algorithms for this purpose, and present several optimization techniques to reduce the space requirement and query response time. We demonstrate the advantage of our method and compare it against existing ones through a thorough experimental study on real genomic data.
Citations: 26
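The reference-based compression scheme described above (storing only a sequence's differences to a reference) has a property worth illustrating: any region of the stored sequence can be materialized directly from the reference plus the substitutions falling in that window, without decompressing the whole sequence. The sketch below shows only this basic property with hypothetical names; it is not the paper's index structures or search algorithms.

```python
def extract(reference, diffs, start, end):
    """Materialize sequence[start:end] for a sequence stored as a list
    of (position, base) substitutions against a reference string.
    Only the requested window is reconstructed, which is the property
    that makes search on the compressed form feasible."""
    window = list(reference[start:end])
    for pos, base in diffs:
        if start <= pos < end:          # apply only in-window edits
            window[pos - start] = base
    return "".join(window)
```

A query algorithm can then probe candidate regions of many stored sequences while paying decompression cost proportional to the window size, not the genome size.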
Efficient distance-aware query evaluation on indoor moving objects
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544845
Xike Xie, Hua Lu, T. Pedersen
Indoor spaces accommodate large parts of people's lives. The increasing availability of indoor positioning, driven by technologies like Wi-Fi, RFID, and Bluetooth, enables a variety of indoor location-based services (LBSs). Efficient indoor distance-aware queries on indoor moving objects play an important role in supporting and boosting such LBSs. However, the distance-aware query evaluation on indoor moving objects is challenging because: (1) indoor spaces are characterized by many special entities and thus render distance calculation very complex; (2) the limitations of indoor positioning technologies create inherent uncertainties in indoor moving objects data. In this paper, we propose a complete set of techniques for efficient distance-aware queries on indoor moving objects. We define and categorize the indoor distances in relation to indoor uncertain objects, and derive different distance bounds that can facilitate query evaluation. Existing works often assume indoor floor plans are static, and require extensive pre-computation on indoor topologies. In contrast, we design a composite index scheme that integrates indoor geometries, indoor topologies, as well as indoor uncertain objects, and thus supports indoor distance-aware queries efficiently without time-consuming and volatile distance computation. We design algorithms for range query and k nearest neighbor query on indoor moving objects. The results of extensive experimental studies demonstrate that our proposals are efficient and scalable in evaluating distance-aware queries over indoor moving objects.
Citations: 55
TBF: A memory-efficient replacement policy for flash-based caches
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544902
C. Ungureanu, Biplob K. Debnath, Stephen A. Rago, Akshat Aranya
The performance and capacity characteristics of flash storage make it attractive to use as a cache. Recency-based cache replacement policies rely on an in-memory full index, typically a B-tree or a hash table, that maps each object to its recency information. Even though the recency information itself may take very little space, the full index for a cache holding N keys requires at least log N bits per key. This metadata overhead is undesirably high when used for very large flash-based caches, such as key-value stores with billions of objects. To solve this problem, we propose a new RAM-frugal cache replacement policy that approximates the least-recently-used (LRU) policy. It uses two in-memory Bloom sub-filters (TBF) for maintaining the recency information and leverages an on-flash key-value store to cache objects. TBF requires only one byte of RAM per cached object, making it suitable for implementing very large flash-based caches. We evaluate TBF through simulation on traces from several block stores and key-value stores, as well as evaluate it using the Yahoo! Cloud Serving Benchmark in a real system implementation. Evaluation results show that TBF achieves cache hit rate and operations per second comparable to those of LRU in spite of its much smaller memory requirements.
Citations: 16
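The two-sub-filter idea in the abstract above can be sketched as a pair of rotating recency sets: accesses are recorded in the current sub-filter, eviction decisions consult both, and rotation discards the oldest epoch, approximating LRU without a per-key index. In this sketch, Python sets stand in for the actual Bloom filters (so there are no false positives), and the class and parameter names are hypothetical, not the paper's API.

```python
class TBF:
    """Approximate-LRU recency tracking with two rotating sub-filters,
    in the spirit of TBF. Python sets stand in for fixed-size Bloom
    filters to keep the sketch self-contained."""

    def __init__(self, rotation_period):
        self.current, self.previous = set(), set()
        self.period = rotation_period
        self.ticks = 0

    def record_access(self, key):
        self.current.add(key)
        self.ticks += 1
        if self.ticks >= self.period:   # rotate: forget the oldest epoch
            self.previous, self.current = self.current, set()
            self.ticks = 0

    def recently_used(self, key):
        # A key absent from both sub-filters has not been accessed in
        # roughly the last two epochs and is an eviction candidate.
        return key in self.current or key in self.previous
```

With real Bloom filters the memory cost is a fixed number of bits per slot regardless of key size, which is how the paper reaches about one byte of RAM per cached object.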
Publicly verifiable grouped aggregation queries on outsourced data streams
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544852
Suman Nath, R. Venkatesan
Outsourcing data streams and desired computations to a third party such as the cloud is a desirable option to many companies. However, data outsourcing and remote computations intrinsically raise issues of trust, making it crucial to verify results returned by third parties. In this context, we propose a novel solution to verify outsourced grouped aggregation queries (e.g., histogram or SQL Group-by queries) that are common in many business applications. We consider a setting where a data owner employs an untrusted remote server to run continuous grouped aggregation queries on a data stream it forwards to the server. Untrusted clients then query the server for results and efficiently verify correctness of the results by using a small and easy-to-compute signature provided by the data owner. Our work complements previous works on authenticating remote computation of selection and aggregation queries. The most important aspect of our solution is that it is publicly verifiable - unlike most prior works, we support untrusted clients (who can collude with other clients or with the server). Experimental results on real and synthetic data show that our solution is practical and efficient.
Citations: 25
Data services for E-tailers leveraging web search engine assets
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544905
Tao Cheng, K. Chakrabarti, S. Chaudhuri, Vivek R. Narasayya, M. Syamala
Retail is increasingly moving online. There are only a few big e-tailers but there is a long tail of small-sized e-tailers. The big e-tailers are able to collect significant data on user activities at their websites. They use these assets to derive insights about their products and to provide superior experiences for their users. On the other hand, small e-tailers do not possess such user data and hence cannot match the rich user experiences offered by big e-tailers. Our key insight is that web search engines possess significant data on user behaviors that can be used to help smaller e-tailers mine the same signals that big e-tailers derive from their proprietary user data assets. These signals can be exposed as data services in the cloud; e-tailers can leverage them to enable similar user experiences as the big e-tailers. We present three such data services in the paper: entity synonym data service, query-to-entity data service and entity tagging data service. The entity synonym service is an in-production data service that is currently available while the other two are data services currently in development at Microsoft. Our experiments on product datasets show (i) these data services have high quality and (ii) they have significant impact on user experiences on e-tailer websites. To the best of our knowledge, this is the first paper to explore the potential of using search engine data assets for e-tailers.
Citations: 9
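The abstract above describes the entity-synonym and query-to-entity data services only at a high level, and the paper does not publish an API. As a toy illustration of how a small e-tailer might consume such mined mappings to expand product-search queries, here is a minimal Python sketch; the tables, query strings, and function name are invented for illustration:

```python
# Hypothetical, hand-built stand-ins for tables that such cloud data services
# might return (the real services mine these from search-engine logs).
SYNONYMS = {
    "iphone 5": {"apple iphone 5", "iphone5"},
    "galaxy s3": {"samsung galaxy s iii", "gs3"},
}

# Toy "query-to-entity" mapping: which catalog entity a search query refers to.
QUERY_TO_ENTITY = {
    "cheap iphone5 deals": "iphone 5",
    "gs3 case": "galaxy s3",
}

def expand_query(query: str) -> set[str]:
    """Rewrite a user query into all synonymous entity names, so a small
    e-tailer's product search can match the same items a large e-tailer
    would match from its proprietary logs."""
    entity = QUERY_TO_ENTITY.get(query)
    if entity is None:
        return {query}          # unknown query: fall back to literal matching
    return {entity} | SYNONYMS.get(entity, set())

print(sorted(expand_query("gs3 case")))
# → ['galaxy s3', 'gs3', 'samsung galaxy s iii']
```

In a real deployment the two dictionaries would be replaced by calls to the cloud services; the lookup-then-union structure of the rewrite would stay the same.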
Top-k query processing in probabilistic databases with non-materialized views
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544819
Maximilian Dylla, Iris Miliaraki, M. Theobald
We investigate a novel approach to computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike related approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities, without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules that we ground in a top-down fashion directly at query processing time. To our knowledge, this work is the first to address integrated data and confidence computations for intensional query evaluations in the context of probabilistic databases by considering confidence bounds over first-order lineage formulas. We extend our query processing techniques with a tool-suite of scheduling strategies based on selectivity estimation and the expected impact on confidence bounds. Further extensions to our query processing strategies include improved top-k bounds in the case when sorted relations are available as input, as well as the consideration of recursive rules. Experiments with large datasets demonstrate significant runtime improvements of our approach compared to both exact and sampling-based top-k methods over probabilistic data.
Citations: 50
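The pruning idea in this abstract — return the top-k answers by marginal probability without fully evaluating every candidate — can be illustrated with a generic interval-based loop. The sketch below is not the paper's algorithm (there, lower/upper bounds come from first-order lineage formulas); it only shows the bound-and-prune skeleton, with the bound-refinement step abstracted into a caller-supplied function:

```python
def top_k_by_bounds(bounds, k, refine):
    """bounds: dict mapping each candidate answer to a (lo, hi) interval
    enclosing its marginal probability. refine(answer) tightens that
    answer's interval and returns the new (lo, hi); it must strictly
    tighten, or the loop may not terminate. Returns the top-k answers
    once the intervals separate them, so pruned candidates never need
    their exact probabilities computed."""
    bounds = dict(bounds)
    while True:
        # k-th largest lower bound: any candidate whose upper bound falls
        # below it can never reach the top-k and is pruned.
        kth_lo = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
        bounds = {a: (lo, hi) for a, (lo, hi) in bounds.items() if hi >= kth_lo}
        if len(bounds) == k:
            return sorted(bounds, key=lambda a: bounds[a][0], reverse=True)
        # Otherwise tighten the widest surviving interval and retry.
        widest = max(bounds, key=lambda a: bounds[a][1] - bounds[a][0])
        bounds[widest] = refine(widest)
```

For example, with initial intervals {"a": (0.9, 0.95), "b": (0.5, 0.8), "c": (0.1, 0.6), "d": (0.0, 0.3)} and k = 2, candidate "d" is pruned immediately (0.3 < 0.5), only "c" ever needs refinement, and the loop returns ["a", "b"] without exact probabilities for the pruned answers.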
2013 IEEE 29th International Conference on Data Engineering (ICDE)