
Latest publications from the 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498298
Huiqi Hu, Guoliang Li, Z. Bao, Yan Cui, Jianhua Feng
Real-time urban traffic speed estimation provides significant benefits in many real-world applications. However, existing traffic information acquisition systems only obtain coarse-grained traffic information on a small number of roads and cannot acquire fine-grained traffic information on every road. To address this problem, in this paper we study the traffic speed estimation problem, which, given a budget K, identifies K roads (called seeds) whose real traffic speeds can be obtained using crowdsourcing, and infers the speeds of the other roads (called non-seed roads) from the speeds of these seeds. This problem includes two sub-problems: (1) Speed Inference - how to accurately infer the speeds of the non-seed roads; (2) Seed Selection - how to effectively select high-quality seeds. It is rather challenging to estimate the traffic speed accurately, because the traffic changes dynamically and the changes are hard to predict, as many factors can affect the traffic. To address these challenges, we propose effective algorithms to judiciously select high-quality seeds and devise inference models to infer the speeds of the non-seed roads. On the one hand, we observe that roads have correlations and correlated roads have similar traffic trends: the speeds of correlated roads simultaneously rise or fall relative to their historical average speeds. We utilize this property and propose a two-step model to estimate the traffic speed. The first step adopts a graphical model to infer the traffic trend, and the second step devises a hierarchical linear model to estimate the traffic speed based on the traffic trend. On the other hand, we formulate the seed selection problem, prove that it is NP-hard, and propose several greedy algorithms with approximation guarantees. Experimental results on two large real datasets show that our method outperforms baselines by two orders of magnitude in efficiency and by 40% in estimation accuracy.
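The seed-selection and trend-propagation ideas can be sketched in a few lines. This is a toy illustration under assumed data (the correlation graph, road names, and the plain max-coverage greedy are illustrative), not the paper's actual graphical or hierarchical linear models:

```python
# Toy sketch: pick K seed roads greedily by how many correlated roads they
# cover, then propagate each seed's observed trend (speed relative to its
# historical average) to the roads correlated with it.

def select_seeds(correlations, k):
    """Greedy max-coverage: pick k roads covering the most uncovered roads."""
    covered, seeds = set(), []
    for _ in range(k):
        best = max(correlations,
                   key=lambda r: len((correlations[r] | {r}) - covered))
        seeds.append(best)
        covered |= correlations[best] | {best}
    return seeds

def infer_speeds(seeds, observed, hist_avg, correlations):
    """Scale each non-seed road's historical average by its seed's trend."""
    est = dict(observed)
    for s in seeds:
        trend = observed[s] / hist_avg[s]        # rise/fall vs. history
        for r in correlations[s]:
            if r not in est:
                est[r] = hist_avg[r] * trend     # correlated roads share the trend
    return est
```

For example, with two correlated clusters {a, b, c} and {d, e} and a budget of 2, the greedy step picks one seed per cluster and the observed trend on each seed is propagated to its neighbors.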
{"title":"Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds","authors":"Huiqi Hu, Guoliang Li, Z. Bao, Yan Cui, Jianhua Feng","doi":"10.1109/ICDE.2016.7498298","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498298","url":null,"abstract":"Real-time urban traffic speed estimation provides significant benefits in many real-world applications. However, existing traffic information acquisition systems only obtain coarse-grained traffic information on a small number of roads but cannot acquire fine-grained traffic information on every road. To address this problem, in this paper we study the traffic speed estimation problem, which, given a budget K, identifies K roads (called seeds) where the real traffic speeds on these seeds can be obtained using crowdsourcing, and infers the speeds of other roads (called non-seed roads) based on the speeds of these seeds. This problem includes two sub-problems: (1) Speed Inference - How to accurately infer the speeds of the non-seed roads; (2) Seed Selection - How to effectively select high-quality seeds. It is rather challenging to estimate the traffic speed accurately, because the traffic changes dynamically and the changes are hard to be predicted as many possible factors can affect the traffic. To address these challenges, we propose effective algorithms to judiciously select high-quality seeds and devise inference models to infer the speeds of the non-seed roads. On the one hand, we observe that roads have correlations and correlated roads have similar traffic trend: the speeds of correlated roads rise or fall compared with their historical average speed simultaneously. We utilize this property and propose a two-step model to estimate the traffic speed. The first step adopts a graphical model to infer the traffic trend and the second step devises a hierarchical linear model to estimate the traffic speed based on the traffic trend. 
On the other hand, we formulate the seed selection problem, prove that it is NP-hard, and propose several greedy algorithms with approximation guarantees. Experimental results on two large real datasets show that our method outperforms baselines by 2 orders of magnitude in efficiency and 40% in estimation accuracy.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"29 1","pages":"883-894"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84127069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 47
Quality-driven disorder handling for m-way sliding window stream joins
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498265
Yuanzhen Ji, Jun Sun, A. Nica, Zbigniew Jerzak, Gregor Hackenbroich, C. Fetzer
Sliding window join is one of the most important operators for stream applications. To produce high-quality join results, a stream processing system must deal with the ubiquitous disorder within input streams, which is caused by network delay, parallel processing, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of produced join results. To meet different requirements of stream applications, it is desirable to provide a user-configurable result-latency vs. result-quality tradeoff. Existing disorder handling approaches either do not provide such configurability, or support only user-specified latency constraints. In this work, we advocate the idea of quality-driven disorder handling, and propose a buffer-based disorder handling approach for sliding window joins, which minimizes the sizes of input-sorting buffers, and thus the result latency, while respecting user-specified result-quality requirements. The core of our approach is an analytical model which directly captures the relationship between the sizes of input buffers and the produced result quality. Our approach is generic. It supports m-way sliding window joins with arbitrary join conditions. Experiments on real-world and synthetic datasets show that, compared to the state of the art, our approach can reduce the result latency incurred by disorder handling by up to 95% while providing the same level of result quality.
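The input-sorting buffer at the heart of the approach can be illustrated with a minimal K-slack-style sketch: a buffer of size k reorders late tuples before releasing them to the join, so a larger k tolerates more disorder (higher result quality) at the cost of higher latency. The fixed k here is an illustrative simplification; the paper's analytical model sizes the buffers dynamically from the quality requirement:

```python
# Minimal out-of-order input buffer: tuples may arrive out of timestamp
# order; a min-heap of capacity k reorders them before release.
import heapq

class SortingBuffer:
    def __init__(self, k):
        self.k, self.heap = k, []

    def insert(self, ts, value):
        """Push a (possibly late) tuple; release in-order tuples beyond the slack."""
        heapq.heappush(self.heap, (ts, value))
        released = []
        while len(self.heap) > self.k:
            released.append(heapq.heappop(self.heap))   # smallest timestamp first
        return released

    def flush(self):
        """Drain the remaining buffered tuples in timestamp order."""
        return [heapq.heappop(self.heap) for _ in range(len(self.heap))]
```

Feeding a disordered stream such as (3, 'c'), (1, 'a'), (2, 'b') through a buffer with k = 2 yields the tuples back in timestamp order, provided no tuple is late by more than the slack.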
{"title":"Quality-driven disorder handling for m-way sliding window stream joins","authors":"Yuanzhen Ji, Jun Sun, A. Nica, Zbigniew Jerzak, Gregor Hackenbroich, C. Fetzer","doi":"10.1109/ICDE.2016.7498265","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498265","url":null,"abstract":"Sliding window join is one of the most important operators for stream applications. To produce high quality join results, a stream processing system must deal with the ubiquitous disorder within input streams which is caused by network delay, parallel processing, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of produced join results. To meet different requirements of stream applications, it is desirable to provide a user-configurable result-latency vs. result-quality tradeoff. Existing disorder handling approaches either do not provide such configurability, or support only user-specified latency constraints. In this work, we advocate the idea of quality-driven disorder handling, and propose a buffer-based disorder handling approach for sliding window joins, which minimizes sizes of input-sorting buffers, thus the result latency, while respecting user-specified result-quality requirements. The core of our approach is an analytical model which directly captures the relationship between sizes of input buffers and the produced result quality. Our approach is generic. It supports m-way sliding window joins with arbitrary join conditions. 
Experiments on real-world and synthetic datasets show that, compared to the state of the art, our approach can reduce the result latency incurred by disorder handling by up to 95% while providing the same level of result quality.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"7 1","pages":"493-504"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85935076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Flexible hybrid stores: Constraint-based rewriting to the rescue
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498353
Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, I. Manolescu, Stamatis Zampetakis
Data management is going through interesting times, as the number of currently available data management systems (DMSs for short) is probably higher than ever before. This leads to unique opportunities for data-intensive applications, as some systems provide excellent performance on certain data processing operations. Yet, it also raises great challenges, as a system efficient on some tasks may perform poorly on, or not support, other tasks, making it impossible to use a single DMS for a given application. It is thus desirable to use different DMSs side by side in order to take advantage of their best performance, as advocated under terms such as hybrid or poly-stores. We present ESTOCADA, a novel system capable of exploiting side by side a practically unbounded variety of DMSs, all the while guaranteeing the soundness and completeness of the store, and striving to extract the best performance out of the various DMSs. Our system leverages recent advances in the area of query rewriting under constraints, which we use to capture the various data models and describe the fragments each DMS stores.
{"title":"Flexible hybrid stores: Constraint-based rewriting to the rescue","authors":"Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, I. Manolescu, Stamatis Zampetakis","doi":"10.1109/ICDE.2016.7498353","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498353","url":null,"abstract":"Data management goes through interesting times1, as the number of currently available data management systems (DMSs in short) is probably higher than ever before. This leads to unique opportunities for data-intensive applications, as some systems provide excellent performance on certain data processing operations. Yet, it also raises great challenges, as a system efficient on some tasks may perform poorly or not support other tasks, making it impossible to use a single DMS for a given application. It is thus desirable to use different DMSs side by side in order to take advantage of their best performance, as advocated under terms such as hybrid or poly-stores. We present ESTOCADA, a novel system capable of exploiting side-by-side a practically unbound variety of DMSs, all the while guaranteeing the soundness and completeness of the store, and striving to extract the best performance out of the various DMSs. 
Our system leverages recent advances in the area of query rewriting under constraints, which we use to capture the various data models and describe the fragments each DMS stores.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"47 1","pages":"1394-1397"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82593269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Computing Connected Components with linear communication cost in pregel-like systems
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498231
Xing Feng, Lijun Chang, Xuemin Lin, Lu Qin, W. Zhang
The paper studies two fundamental problems in graph analytics: computing Connected Components (CCs) and computing BiConnected Components (BCCs) of a graph. With the recent advent of Big Data, developing efficient distributed algorithms for computing CCs and BCCs of a big graph has received increasing interest. As with the existing research efforts, in this paper we focus on the Pregel programming model, while the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur O(m × #supersteps) total costs for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since the network communication speed is usually much slower than the computation speed, communication costs dominate the total running time in the existing techniques. In this paper, we propose a new paradigm based on graph decomposition to reduce the total communication costs from O(m × #supersteps) to O(m), for both computing CCs and computing BCCs. Moreover, the total computation costs of our techniques are smaller than those of the existing techniques in practice, though theoretically they are almost the same. Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by one order of magnitude in terms of total running time.
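The O(m × #supersteps) baseline that the paper improves on is the classic "hash-min" label propagation, which is easy to sketch in sequential Python standing in for the BSP supersteps (each while-loop iteration corresponds to one superstep, and each message to one unit of communication):

```python
# Toy hash-min Connected Components in Pregel style: every vertex
# repeatedly adopts the smallest label among its neighbors until no
# label changes. Communication per superstep is proportional to m.

def connected_components(adj):
    label = {v: v for v in adj}            # start with each vertex's own id
    changed = True
    while changed:                         # one iteration = one superstep
        changed = False
        # "messages": the smallest neighbor label seen by each vertex
        msgs = {v: min((label[u] for u in adj[v]), default=label[v])
                for v in adj}
        for v, m in msgs.items():
            if m < label[v]:
                label[v] = m
                changed = True
    return label                           # vertices in a CC share a label
```

On a graph with components {0, 1, 2} and {3, 4}, all vertices of each component converge to the smallest vertex id in that component. The number of supersteps is bounded by the largest component's diameter, which is exactly why total communication is O(m × #supersteps) for this baseline.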
{"title":"Computing Connected Components with linear communication cost in pregel-like systems","authors":"Xing Feng, Lijun Chang, Xuemin Lin, Lu Qin, W. Zhang","doi":"10.1109/ICDE.2016.7498231","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498231","url":null,"abstract":"The paper studies two fundamental problems in graph analytics: computing Connected Components (CCs) and computing BiConnected Components (BCCs) of a graph. With the recent advent of Big Data, developing effcient distributed algorithms for computing CCs and BCCs of a big graph has received increasing interests. As with the existing research efforts, in this paper we focus on the Pregel programming model, while the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur O(m × #supersteps) total costs for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since the network communication speed is usually much slower than the computation speed, communication costs are the dominant costs of the total running time in the existing techniques. In this paper, we propose a new paradigm based on graph decomposition to reduce the total communication costs from O(m×#supersteps) to O(m), for both computing CCs and computing BCCs. Moreover, the total computation costs of our techniques are smaller than that of the existing techniques in practice, though theoretically they are almost the same. 
Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by one order of magnitude regarding the total running time.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"63 1","pages":"85-96"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84335508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
MuVE: Efficient Multi-Objective View Recommendation for Visual Data Exploration
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498285
Humaira Ehsan, M. Sharaf, Panos K. Chrysanthis
To support effective data exploration, there is a well-recognized need for solutions that can automatically recommend interesting visualizations, which reveal useful insights into the analyzed data. However, such visualizations come at the expense of high data processing costs, as a large number of views must be generated to evaluate their usefulness. Those costs are further escalated in the presence of numerical dimensional attributes, due to the potentially large number of possible binning aggregations, which leads to a drastic increase in the number of possible visualizations. To address that challenge, in this paper we propose the MuVE scheme for Multi-Objective View Recommendation for Visual Data Exploration. MuVE introduces a hybrid multi-objective utility function, which captures the impact of binning on the utility of visualizations. Consequently, novel algorithms are proposed for the efficient recommendation of data visualizations that are based on numerical dimensions. The main idea underlying MuVE is to incrementally and progressively assess the different benefits provided by a visualization, which allows an early pruning of a large number of unnecessary operations. Our extensive experimental results show the significant gains provided by our proposed scheme.
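The flavor of a hybrid multi-objective utility can be conveyed with a toy score that mixes (a) how much a candidate view's aggregate deviates from a reference distribution with (b) a usability penalty for views with too many bins. The weighting, the L1 deviation measure, and the linear bin penalty are all illustrative assumptions, not MuVE's actual objectives:

```python
# Toy hybrid view utility: weighted mix of deviation-based
# interestingness and a binning usability penalty.

def view_utility(view_agg, ref_agg, num_bins, alpha=0.7, max_bins=20):
    # (a) deviation: total variation distance between the two
    #     normalized aggregate distributions (0 = identical, 1 = disjoint)
    s1, s2 = sum(view_agg), sum(ref_agg)
    deviation = sum(abs(a / s1 - b / s2)
                    for a, b in zip(view_agg, ref_agg)) / 2
    # (b) usability: fewer bins are easier to read
    usability = max(0.0, 1.0 - num_bins / max_bins)
    return alpha * deviation + (1 - alpha) * usability
```

A view whose aggregate matches the reference scores only its usability term, while a strongly deviating view with few bins scores highest; scoring candidate views this way and keeping the top ones mirrors the recommendation loop the paper accelerates with incremental pruning.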
{"title":"MuVE: Efficient Multi-Objective View Recommendation for Visual Data Exploration","authors":"Humaira Ehsan, M. Sharaf, Panos K. Chrysanthis","doi":"10.1109/ICDE.2016.7498285","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498285","url":null,"abstract":"To support effective data exploration, there is a well-recognized need for solutions that can automatically recommend interesting visualizations, which reveal useful insights into the analyzed data. However, such visualizations come at the expense of high data processing costs, where a large number of views are generated to evaluate their usefulness. Those costs are further escalated in the presence of numerical dimensional attributes, due to the potentially large number of possible binning aggregations, which lead to a drastic increase in the number of possible visualizations. To address that challenge, in this paper we propose the MuVE scheme for Multi-Objective View Recommendation for Visual Data Exploration. MuVE introduces a hybrid multi-objective utility function, which captures the impact of binning on the utility of visualizations. Consequently, novel algorithms are proposed for the efficient recommendation of data visualizations that are based on numerical dimensions. The main idea underlying MuVE is to incrementally and progressively assess the different benefits provided by a visualization, which allows an early pruning of a large number of unnecessary operations. 
Our extensive experimental results show the significant gains provided by our proposed scheme.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"41 1","pages":"731-742"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79845279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 55
Indoor data management
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498358
Hua Lu, M. A. Cheema
A large part of modern life is lived indoors, such as in homes, offices, shopping malls, universities, libraries and airports. However, almost all existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organizations, government agencies, technology giants, and enterprising start-ups to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years, and the research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing research in indoor data management, covering modeling, cleansing, indexing, querying, and other relevant topics. Finally, we discuss the future research directions in this important and growing research area, discussing spatial-textual search, integrating outdoor and indoor spaces, uncertain indoor data, and indoor trajectory mining.
{"title":"Indoor data management","authors":"Hua Lu, M. A. Cheema","doi":"10.1109/ICDE.2016.7498358","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498358","url":null,"abstract":"A large part of modern life is lived indoors such as in homes, offices, shopping malls, universities, libraries and airports. However, almost all of the existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify the locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organizations, government agencies, technology giants, and enterprizing start-ups - to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years and the research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing research in indoor data management, covering modeling, cleansing, indexing, querying, and other relevant topics. 
Finally, we discuss the future research directions in this important and growing research area, discussing spatial-textual search, integrating outdoor and indoor spaces, uncertain indoor data, and indoor trajectory mining.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"53 1","pages":"1414-1417"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84500609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
On main-memory flushing in microblogs data management systems
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498261
A. Magdy, Rami Alghamdi, M. Mokbel
Searching microblogs, e.g., tweets and comments, is practically supported through main-memory indexing for scalable data digestion and efficient query evaluation. Given the continuity and excessive volume of microblogs, it is infeasible to keep the data in main memory for long periods. Thus, once the allocated memory budget is filled, a portion of the data is flushed from memory to disk to continuously accommodate newly incoming data. Existing techniques come with either a low memory hit ratio, due to flushing items regardless of their relevance to incoming queries, or a significant overhead of tracking individual data items; both limit the scalability of microblogs systems. In this paper, we propose the kFlushing policy, which exploits the popularity of top-k queries in microblogs to smartly select a subset of microblogs to flush. kFlushing is mainly designed to increase the memory hit ratio. To this end, it identifies and flushes in-memory data that does not contribute to incoming queries. The freed memory space is used to accumulate more useful data that answers more queries from memory contents. When all memory is utilized for useful data, kFlushing flushes the data that is least likely to degrade the memory hit ratio. In addition, kFlushing comes with little overhead, which keeps system scalability high in terms of digestion rates of incoming fast data. Extensive experimental evaluation shows the effectiveness and scalability of kFlushing in improving the main-memory hit ratio by 26-330% while coping with fast microblog streams of up to 100K microblogs/second.
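The core selection step, flushing only data that cannot contribute to incoming top-k queries, can be sketched as follows. Ranking each microblog by a single per-item score and grouping by a topic key are illustrative simplifications of the paper's query workload model:

```python
# Toy victim selection in the spirit of kFlushing: when memory is full,
# flush the microblogs that cannot appear in a top-k answer for their
# topic, i.e. anything ranked below position k.
from collections import defaultdict

def choose_victims(memory, k):
    """memory: list of (topic, msg_id, score). Return ids safe to flush."""
    by_topic = defaultdict(list)
    for topic, mid, score in memory:
        by_topic[topic].append((score, mid))
    victims = []
    for items in by_topic.values():
        items.sort(reverse=True)                  # best-scored first
        victims += [mid for _, mid in items[k:]]  # below top-k: flush to disk
    return victims
```

With k = 2, a topic holding three messages keeps its two best-scored ones in memory and nominates the third for flushing; topics at or under k contribute no victims, so answers to top-k queries are still served entirely from memory.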
{"title":"On main-memory flushing in microblogs data management systems","authors":"A. Magdy, Rami Alghamdi, M. Mokbel","doi":"10.1109/ICDE.2016.7498261","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498261","url":null,"abstract":"Searching microblogs, e.g., tweets and comments, is practically supported through main-memory indexing for scalable data digestion and efficient query evaluation. With continuity and excessive numbers of microblogs, it is infeasible to keep data in main-memory for long periods. Thus, once allocated memory budget is filled, a portion of data is flushed from memory to disk to continuously accommodate newly incoming data. Existing techniques come with either low memory hit ratio due to flushing items regardless of their relevance to incoming queries or significant overhead of tracking individual data items, which limit scalability of microblogs systems in either cases. In this paper, we propose kFlushing policy that exploits popularity of top-k queries in microblogs to smartly select a subset of microblogs to flush. kFlushing is mainly designed to increase memory hit ratio. To this end, it identifies and flushes in-memory data that does not contribute to incoming queries. The freed memory space is utilized to accumulate more useful data that is used to answer more queries from memory contents. When all memory is utilized for useful data, kFlushing flushes data that is less likely to degrade memory hit ratio. In addition, kFlushing comes with a little overhead that keeps high system scalability in terms of high digestion rates of incoming fast data. 
Extensive experimental evaluation shows the effectiveness and scalability of kFlushing to improve main-memory hit by 26–330% while coping up with fast microblog streams of up to 100K microblog/second.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"10 1","pages":"445-456"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75790731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
ClEveR: Clustering events with high density of true-to-false occurrence ratio
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498301
G. Theodoridis, T. Benoist
Leveraging the ICT evolution, modern systems collect voluminous sets of monitoring data, which are analysed in order to increase the system's situational awareness. Apart from the regular activity, this bulk of monitoring information may also include instances of anomalous operation, which need to be detected and examined thoroughly so that their root causes can be identified. Hence, for an alert mechanism it is crucial to investigate the correlations of the suspicious monitoring traces not only with each other but also against the overall monitoring data, in order to discover any high spatio-temporal concentration of abnormal occurrences that could be considered evidence of an underlying system malfunction. To this end, this paper presents a novel clustering algorithm that groups instances of problematic behaviour not only according to their concentration but also with respect to the presence of normal activity. On this basis, the proposed algorithm operates at two proximity scales, so as to allow for combining more distant anomalous observations that are not, however, interrupted by regular feedback. Regardless of the initial motivation, the clustering algorithm is applicable to any set of objects that share a common feature and for which areas of high density, in comparison with the rest of the population, are to be examined.
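The "cluster only where the true-to-false ratio stays high" idea can be illustrated with a one-dimensional toy: anomalous timestamps are merged into a cluster across a gap only if the gap is short enough and not dominated by normal events. The gap bound, the ratio formula, and the thresholds are illustrative assumptions, not ClEveR's actual two-scale definitions:

```python
# Toy density clustering of anomalous events: extend the current cluster
# to the next anomaly only if the gap is small AND the true-to-false
# ratio across the gap (anomaly vs. interleaved normal events) is high.

def cluster_events(anomalies, normals, max_gap, min_ratio):
    anomalies = sorted(anomalies)
    clusters, current = [], [anomalies[0]]
    for t in anomalies[1:]:
        gap_normals = sum(1 for n in normals if current[-1] < n < t)
        ratio_ok = 1 / (1 + gap_normals) >= min_ratio  # true-to-false in the gap
        if t - current[-1] <= max_gap and ratio_ok:
            current.append(t)
        else:
            clusters.append(current)       # gap too wide or too much normal activity
            current = [t]
    clusters.append(current)
    return clusters
```

Note how a run of normal events between two anomalies splits the cluster even when the anomalies are close in time, which is the distinguishing behaviour versus purely distance-based clustering.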
Pages: 918-929
Citations: 0
DebEAQ - debugging empty-answer queries on large data graphs 调试大数据图上的空回答查询
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498355
E. Vasilyeva, Thomas S. Heinze, Maik Thiele, Wolfgang Lehner
The large volume of freely available graph data sets makes them hard for users to analyze. Typically, users pose plenty of pattern matching queries and study their answers. Without deep knowledge about the data graph, users can create `failing' queries that deliver empty answers. Analyzing the causes of these empty answers is a time-consuming and complicated task, especially for graph queries. To help users debug these `failing' queries, there are two common approaches: one focuses on discovering missing subgraphs of a data graph, while the other tries to rewrite the queries so that they deliver some results. In this demonstration, we combine both approaches and give users an opportunity to discover why the requested queries delivered empty results. To this end, we propose DebEAQ, a debugging tool for pattern matching queries, which allows comparing both approaches and also provides functionality to debug queries manually.
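A toy version of the query-rewriting side can be sketched in a few lines: given a pattern that returns no matches, drop one pattern edge at a time and report which relaxed patterns succeed. This is only an illustration under assumed names (`matches`, `debug_empty_answer`), with a brute-force matcher standing in for a real graph engine, not DebEAQ's implementation:

```python
from itertools import permutations

def matches(pattern_edges, graph_edges, nodes):
    # Brute force: does any injective assignment of pattern variables to
    # graph nodes realize every pattern edge?
    pvars = sorted({v for e in pattern_edges for v in e})
    for assign in permutations(nodes, len(pvars)):
        m = dict(zip(pvars, assign))
        if all((m[u], m[v]) in graph_edges for u, v in pattern_edges):
            return True
    return False

def debug_empty_answer(pattern_edges, graph_edges):
    nodes = {v for e in graph_edges for v in e}
    if matches(pattern_edges, graph_edges, nodes):
        return []  # query is not failing, nothing to debug
    culprits = []
    for i, edge in enumerate(pattern_edges):
        relaxed = pattern_edges[:i] + pattern_edges[i + 1:]
        if matches(relaxed, graph_edges, nodes):
            culprits.append(edge)  # dropping this edge makes the query succeed
    return culprits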
Pages: 1402-1405
Citations: 8
ICE: Managing cold state for big data applications ICE:管理大数据应用的冷状态
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498262
B. Chandramouli, Justin J. Levandoski, Eli Cortez C. Vilarinho
The use of big data in a business revolves around a monitor-mine-manage (M3) loop: data is monitored in real-time, while mined insights are used to manage the business and derive value. While mining has traditionally been performed offline, recent years have seen an increasing need to perform all phases of M3 in real-time. A stream processing engine (SPE) enables such a seamless M3 loop for applications such as targeted advertising, recommender systems, risk analysis, and call-center analytics. However, these M3 applications require the SPE to maintain massive amounts of state in memory, leading to resource usage skew: memory is scarce and over-utilized, whereas CPU and I/O are under-utilized. In this paper, we propose a novel solution to scaling SPEs for memory-bound M3 applications that leverages natural access skew in data-parallel subqueries, where a small fraction of the state is hot (frequently accessed) and most state is cold (infrequently accessed). We present ICE (incremental cold-state engine), a framework that allows an SPE to seamlessly migrate cold state to secondary storage (disk or flash). ICE uses a novel architecture that exploits the semantics of individual stream operators to efficiently manage cold state in an SPE using an incremental log-structured store. We implemented ICE inside an SPE. Experiments using real data show that ICE reduces memory usage significantly without sacrificing performance, and can sometimes even improve performance.
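The hot/cold split can be illustrated with a small dictionary-like store that keeps a bounded number of hot entries in memory and appends evicted (cold) entries to an append-only log, re-promoting them on access. This is only a sketch of the log-structured idea under assumed names (`ColdStateStore`, `capacity`), not ICE's actual design:

```python
import json
from collections import OrderedDict

class ColdStateStore:
    """Bounded in-memory (hot) state; least-recently-used entries are
    spilled to an append-only, log-structured file (cold state)."""

    def __init__(self, log_path, capacity=2):
        self.hot = OrderedDict()      # key -> value, in LRU order
        self.capacity = capacity
        self.log_path = log_path
        self.offsets = {}             # key -> offset of its latest log record
        open(log_path, "w").close()   # start with an empty log

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)     # mark as most recently used
        while len(self.hot) > self.capacity:
            cold_key, cold_value = self.hot.popitem(last=False)
            self._spill(cold_key, cold_value)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.offsets:       # cold: read back from the log
            with open(self.log_path) as f:
                f.seek(self.offsets[key])
                value = json.loads(f.readline())["v"]
            self.put(key, value)      # promote back to hot state
            return value
        raise KeyError(key)

    def _spill(self, key, value):
        # Log-structured: never overwrite in place; append a new record
        # and remember its offset.
        with open(self.log_path, "a") as f:
            f.seek(0, 2)              # position at end of file
            self.offsets[key] = f.tell()
            f.write(json.dumps({"k": key, "v": value}) + "\n")
```

Because the log is append-only, a later spill of the same key simply appends a newer record and updates the offset map, which mirrors the incremental log-structured flavor of the store the abstract describes.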
Pages: 457-468
Citations: 1