
2016 IEEE 32nd International Conference on Data Engineering (ICDE): Latest publications

Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498298
Huiqi Hu, Guoliang Li, Z. Bao, Yan Cui, Jianhua Feng
Real-time urban traffic speed estimation provides significant benefits in many real-world applications. However, existing traffic information acquisition systems only obtain coarse-grained traffic information on a small number of roads and cannot acquire fine-grained traffic information on every road. To address this problem, in this paper we study the traffic speed estimation problem, which, given a budget K, identifies K roads (called seeds) whose real traffic speeds can be obtained through crowdsourcing, and infers the speeds of the remaining roads (called non-seed roads) from the speeds of these seeds. This problem includes two sub-problems: (1) Speed Inference - how to accurately infer the speeds of the non-seed roads; (2) Seed Selection - how to effectively select high-quality seeds. Estimating traffic speed accurately is rather challenging, because traffic changes dynamically and the changes are hard to predict, as many factors can affect traffic. To address these challenges, we propose effective algorithms to judiciously select high-quality seeds and devise inference models to infer the speeds of the non-seed roads. On the one hand, we observe that roads are correlated and correlated roads exhibit similar traffic trends: their speeds rise or fall simultaneously relative to their historical averages. We utilize this property and propose a two-step model to estimate the traffic speed. The first step adopts a graphical model to infer the traffic trend, and the second step devises a hierarchical linear model to estimate the traffic speed based on that trend. On the other hand, we formulate the seed selection problem, prove that it is NP-hard, and propose several greedy algorithms with approximation guarantees. Experimental results on two large real datasets show that our method outperforms baselines by two orders of magnitude in efficiency and by 40% in estimation accuracy.
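The seed-selection subproblem is NP-hard, and the abstract mentions greedy algorithms with approximation guarantees. A minimal sketch of one plausible greedy strategy follows, assuming a hypothetical correlation map in which choosing a seed road "covers" the roads correlated with it; the coverage objective and all names are illustrative assumptions, not the paper's actual formulation.

```python
# Sketch of greedy seed selection: repeatedly pick the road that newly
# "covers" the most correlated roads. With a monotone submodular coverage
# objective, this classic greedy rule gives a (1 - 1/e) approximation.

def greedy_seed_selection(correlated, candidate_roads, k):
    """correlated: dict road -> set of roads whose speed it helps infer
    (hypothetical stand-in for the paper's correlation model)."""
    seeds, covered = [], set()
    for _ in range(k):
        best, best_gain = None, -1
        for road in candidate_roads:
            if road in seeds:
                continue
            gain = len(correlated.get(road, set()) - covered)
            if gain > best_gain:
                best, best_gain = road, gain
        if best is None:
            break
        seeds.append(best)
        covered |= correlated.get(best, set())
    return seeds

# Toy example: "r1" correlates with three roads, "r2" with two.
correlated = {"r1": {"r3", "r4", "r5"}, "r2": {"r5", "r6"}, "r3": {"r1"}}
print(greedy_seed_selection(correlated, ["r1", "r2", "r3"], k=2))  # ['r1', 'r2']
```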
Citations: 47
Flexible hybrid stores: Constraint-based rewriting to the rescue
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498353
Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, I. Manolescu, Stamatis Zampetakis
Data management is going through interesting times, as the number of currently available data management systems (DMSs for short) is probably higher than ever before. This leads to unique opportunities for data-intensive applications, as some systems provide excellent performance on certain data processing operations. Yet, it also raises great challenges, as a system efficient on some tasks may perform poorly on, or not support, other tasks, making it impossible to use a single DMS for a given application. It is thus desirable to use different DMSs side by side in order to take advantage of their best performance, as advocated under terms such as hybrid stores or poly-stores. We present ESTOCADA, a novel system capable of exploiting side-by-side a practically unbounded variety of DMSs, all the while guaranteeing the soundness and completeness of the store and striving to extract the best performance out of the various DMSs. Our system leverages recent advances in query rewriting under constraints, which we use to capture the various data models and describe the fragments each DMS stores.
Citations: 9
DebEAQ - debugging empty-answer queries on large data graphs
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498355
E. Vasilyeva, Thomas S. Heinze, Maik Thiele, Wolfgang Lehner
The sheer volume of freely available graph data sets makes it hard for users to analyze them. To do so, users typically pose many pattern matching queries and study their answers. Without deep knowledge of the data graph, they can create `failing' queries that deliver empty answers. Analyzing the causes of these empty answers is a time-consuming and complicated task, especially for graph queries. To help users debug these `failing' queries, there are two common approaches: one focuses on discovering missing subgraphs of the data graph; the other tries to rewrite the queries so that they deliver some results. In this demonstration, we combine both approaches and give users the opportunity to discover why the requested queries delivered empty results. To this end, we propose DebEAQ, a debugging tool for pattern matching queries that allows the two approaches to be compared and also provides functionality for debugging queries manually.
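Of the two debugging approaches the demo combines, query rewriting can be illustrated with a tiny relaxation loop: drop one edge of the pattern at a time and re-run, keeping relaxations that return answers. The graph representation and naive matcher below are simplifying assumptions for illustration, not DebEAQ's actual machinery.

```python
from itertools import combinations, permutations

def matches(graph_edges, pattern_edges):
    """Naive check: can the pattern nodes be mapped onto graph nodes so
    that every pattern edge exists? Exponential; for toy inputs only."""
    pattern_nodes = sorted({n for e in pattern_edges for n in e})
    graph_nodes = {n for e in graph_edges for n in e}
    for perm in permutations(graph_nodes, len(pattern_nodes)):
        m = dict(zip(pattern_nodes, perm))
        if all((m[a], m[b]) in graph_edges for a, b in pattern_edges):
            return True
    return False

def relax_failing_query(graph_edges, pattern_edges):
    """For a failing query, return the sub-patterns obtained by dropping
    one edge that do produce answers -- a hint at why the query fails."""
    if matches(graph_edges, pattern_edges):
        return [pattern_edges]                    # not failing at all
    return [list(sub)
            for sub in combinations(pattern_edges, len(pattern_edges) - 1)
            if matches(graph_edges, list(sub))]

graph = {("a", "b"), ("b", "c")}                  # a path of two edges
pattern = [("x", "y"), ("y", "z"), ("z", "x")]    # a triangle: no match
print(relax_failing_query(graph, pattern))        # each 2-edge path matches
```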
Citations: 8
pSCAN: Fast and exact structural graph clustering
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498245
Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, W. Zhang
In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering assigns the vertices in V to clusters and also identifies the sets of hub vertices and outlier vertices, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs because it exhaustively computes structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering that present opportunities for further optimization. Based on these observations, we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach that aims to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Notably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.
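The structural similarity that SCAN-family algorithms evaluate for adjacent vertices is the cosine similarity of their closed neighborhoods; pSCAN's contribution is to avoid computing it exhaustively. A minimal sketch of the similarity test itself, under the standard SCAN definition (the epsilon threshold and toy graph are illustrative assumptions):

```python
import math

def structural_similarity(adj, u, v):
    """sigma(u, v) = |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|), where N[x]
    is the closed neighborhood of x (x together with its neighbors)."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def is_eps_neighbor(adj, u, v, eps=0.7):
    # u and v are structure-similar if sigma reaches eps; SCAN-style
    # clusters then grow from "core" vertices with >= mu such neighbors.
    return structural_similarity(adj, u, v) >= eps

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(structural_similarity(adj, 1, 2))  # 1.0 inside the triangle
print(structural_similarity(adj, 3, 4))  # ~0.707 across the bridge
```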
Citations: 72
On main-memory flushing in microblogs data management systems
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498261
A. Magdy, Rami Alghamdi, M. Mokbel
Searching microblogs, e.g., tweets and comments, is in practice supported through main-memory indexing for scalable data digestion and efficient query evaluation. Given the continuous arrival and sheer volume of microblogs, it is infeasible to keep the data in main memory for long periods. Thus, once the allocated memory budget is filled, a portion of the data is flushed from memory to disk to continuously accommodate newly incoming data. Existing techniques suffer either from a low memory hit ratio, because they flush items regardless of their relevance to incoming queries, or from the significant overhead of tracking individual data items; both limit the scalability of microblogs systems. In this paper, we propose the kFlushing policy, which exploits the popularity of top-k queries in microblogs to smartly select a subset of microblogs to flush. kFlushing is mainly designed to increase the memory hit ratio. To this end, it identifies and flushes in-memory data that does not contribute to incoming queries. The freed memory space is used to accumulate more useful data for answering more queries from memory contents. When all memory is utilized for useful data, kFlushing flushes the data that is least likely to degrade the memory hit ratio. In addition, kFlushing incurs little overhead, keeping system scalability high in terms of digestion rates for incoming fast data. Extensive experimental evaluation shows the effectiveness and scalability of kFlushing, improving the main-memory hit ratio by 26–330% while keeping up with fast microblog streams of up to 100K microblogs/second.
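A minimal sketch of the idea behind a popularity-aware flush policy: when the memory budget fills, evict first the items that have not contributed to recent top-k answers. The per-item hit counter and budget interface are illustrative assumptions, not kFlushing's actual statistics.

```python
import heapq

class TopKAwareBuffer:
    """Toy in-memory store that, once over budget, flushes the items
    least useful to top-k queries. 'Usefulness' is a bare counter of
    appearances in recent top-k answers -- an assumption standing in
    for kFlushing's query-aware statistics."""

    def __init__(self, budget, flush_fraction=0.25):
        self.budget = budget
        self.flush_fraction = flush_fraction
        self.items = {}   # id -> record
        self.hits = {}    # id -> times served in a top-k answer

    def insert(self, item_id, record):
        self.items[item_id] = record
        self.hits.setdefault(item_id, 0)
        if len(self.items) > self.budget:
            self._flush_cold()

    def note_topk_answer(self, answer_ids):
        for i in answer_ids:
            if i in self.hits:
                self.hits[i] += 1

    def _flush_cold(self):
        n = max(1, int(len(self.items) * self.flush_fraction))
        victims = heapq.nsmallest(n, self.items, key=lambda i: self.hits[i])
        for i in victims:
            # A real engine would spill to disk; the sketch just drops it.
            del self.items[i]
            del self.hits[i]

buf = TopKAwareBuffer(budget=4)
for i in range(5):
    buf.insert(i, {"text": f"microblog {i}"})
    buf.note_topk_answer([i] if i % 2 else [])    # only odd ids get served
print(sorted(buf.items))   # id 0, never in a top-k answer, was flushed
```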
Citations: 12
ClEveR: Clustering events with high density of true-to-false occurrence ratio
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498301
G. Theodoridis, T. Benoist
Leveraging the ICT evolution, modern systems collect voluminous sets of monitoring data, which are analysed in order to increase the system's situational awareness. Apart from regular activity, this bulk of monitoring information may also include instances of anomalous operation, which need to be detected and examined thoroughly so that their root causes can be identified. Hence, for an alert mechanism it is crucial to investigate the cross-correlations among the suspicious monitoring traces, not only with each other but also against the overall monitoring data, in order to discover any high spatio-temporal concentration of abnormal occurrences that could be considered evidence of an underlying system malfunction. To this end, this paper presents a novel clustering algorithm that groups instances of problematic behaviour not only according to their concentration but also with respect to the presence of normal activity. On this basis, the proposed algorithm operates at two proximity scales, allowing more distant anomalous observations to be combined as long as they are not interrupted by regular feedback. Regardless of the initial motivation, the clustering algorithm is applicable to any set of objects that share a common feature and for which areas of high density, in comparison with the rest of the population, are of interest.
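A minimal sketch of the paper's core intuition: merge anomalous events into a cluster only while their local density relative to all occurrences stays high (a simplified stand-in for the paper's true-to-false ratio), so that anomalies diluted by normal activity do not coalesce even when they are close in time. The window width and ratio threshold below are illustrative assumptions.

```python
def cluster_by_ratio(events, window=5.0, min_ratio=0.5):
    """events: list of (timestamp, is_anomalous), sorted by timestamp.
    Extend a cluster to the next anomaly only if the stretch between
    them keeps the anomalous-to-total ratio above min_ratio."""
    anomalies = [t for t, bad in events if bad]
    clusters, current = [], []
    for t in anomalies:
        if not current:
            current = [t]
            continue
        # all events (normal or not) between the previous anomaly and t
        span = [e for e in events if current[-1] <= e[0] <= t]
        ratio = sum(1 for _, b in span if b) / len(span)
        if t - current[-1] <= window and ratio >= min_ratio:
            current.append(t)
        else:
            clusters.append(current)
            current = [t]
    if current:
        clusters.append(current)
    return clusters

events = [(0, True), (1, True), (2, False), (3, True),    # dense anomalies
          (9, True), (10, False), (11, False), (12, False),
          (13, True)]                                      # diluted by normals
print(cluster_by_ratio(events))   # [[0, 1, 3], [9], [13]]
```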
Citations: 0
Indoor data management
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498358
Hua Lu, M. A. Cheema
A large part of modern life is lived indoors, such as in homes, offices, shopping malls, universities, libraries and airports. However, almost all existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organizations, government agencies, technology giants, and enterprising start-ups to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years, and research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing research in indoor data management, covering modeling, cleansing, indexing, querying, and other relevant topics. Finally, we discuss future research directions in this important and growing research area, including spatial-textual search, integrating outdoor and indoor spaces, uncertain indoor data, and indoor trajectory mining.
Citations: 4
An interval join optimized for modern hardware
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498316
Danila Piatov, S. Helmer, Anton Dignös
We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which are commonly found, for example, in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses to exploit features of modern CPU architectures. In an experimental evaluation with real-world datasets, our algorithm outperforms the state-of-the-art by an order of magnitude.
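The overlap join itself can be stated compactly as a sweep over sorted endpoints with one active set per relation; the paper's gains come from a cache-friendly data structure and lazy evaluation layered on this idea. Below is a plain sweep-line sketch, not the paper's optimized algorithm:

```python
def interval_overlap_join(r, s):
    """r, s: lists of (id, start, end) with start <= end. Yields every
    (r_id, s_id) pair whose intervals overlap (endpoints inclusive)."""
    events = [(lo, 0, "r", i) for i, lo, _ in r]    # interval opens
    events += [(lo, 0, "s", i) for i, lo, _ in s]
    events += [(hi, 1, "r", i) for i, _, hi in r]   # interval closes
    events += [(hi, 1, "s", i) for i, _, hi in s]
    events.sort()       # at equal timestamps, opens (0) precede closes (1)

    active = {"r": set(), "s": set()}
    for _, kind, rel, ident in events:
        if kind == 0:
            other = "s" if rel == "r" else "r"
            for o in active[other]:
                yield (ident, o) if rel == "r" else (o, ident)
            active[rel].add(ident)
        else:
            active[rel].discard(ident)

r = [("r1", 1, 5), ("r2", 4, 9)]
s = [("s1", 2, 3), ("s2", 8, 12)]
print(sorted(interval_overlap_join(r, s)))  # [('r1', 's1'), ('r2', 's2')]
```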
Citations: 51
Joint repairs for web wrappers
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498320
Stefano Ortona, G. Orsi, Tim Furche, Marcello Buoncristiano
Automated web scraping is a popular means of acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under- or over-segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort, so jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites across a wide variety of application domains shows that joint repairs are able to increase the quality of wrappers by between 15% and 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.
Citations: 10
ICE: Managing cold state for big data applications
Pub Date: 2016-05-16 DOI: 10.1109/ICDE.2016.7498262
B. Chandramouli, Justin J. Levandoski, Eli Cortez C. Vilarinho
The use of big data in a business revolves around a monitor-mine-manage (M3) loop: data is monitored in real time, while mined insights are used to manage the business and derive value. While mining has traditionally been performed offline, recent years have seen an increasing need to perform all phases of M3 in real time. A stream processing engine (SPE) enables such a seamless M3 loop for applications such as targeted advertising, recommender systems, risk analysis, and call-center analytics. However, these M3 applications require the SPE to maintain massive amounts of state in memory, leading to resource usage skew: memory is scarce and over-utilized, whereas CPU and I/O are under-utilized. In this paper, we propose a novel solution for scaling SPEs for memory-bound M3 applications that leverages natural access skew in data-parallel subqueries, where a small fraction of the state is hot (frequently accessed) and most state is cold (infrequently accessed). We present ICE (incremental coldstate engine), a framework that allows an SPE to seamlessly migrate cold state to secondary storage (disk or flash). ICE uses a novel architecture that exploits the semantics of individual stream operators to efficiently manage cold state in an SPE using an incremental log-structured store. We implemented ICE inside an SPE. Experiments using real data show that ICE can reduce memory usage significantly without sacrificing performance, and can sometimes even improve performance.
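A minimal sketch of the hot/cold split that ICE exploits: track per-key access counts, periodically migrate the coldest keys to an append-only (log-structured) file, and read them back on demand. The file format, thresholds, and names are illustrative assumptions, not ICE's actual design.

```python
import json
import os

class ColdStateStore:
    """Toy key-value state with cold-state migration to an append-only
    log file. A real engine (and ICE) tracks state per stream operator
    and batches migrations; this sketch only shows the hot/cold split."""

    def __init__(self, log_path, hot_capacity=1000):
        self.log_path = log_path
        self.hot_capacity = hot_capacity
        self.hot = {}            # key -> value, kept in memory
        self.accesses = {}       # key -> access count
        self.cold_offsets = {}   # key -> byte offset in the log

    def put(self, key, value):
        self.hot[key] = value
        self.accesses[key] = self.accesses.get(key, 0) + 1
        if len(self.hot) > self.hot_capacity:
            self._migrate_cold()

    def get(self, key):
        if key in self.hot:
            self.accesses[key] += 1
            return self.hot[key]
        if key in self.cold_offsets:              # cold hit: read the log
            with open(self.log_path, "rb") as f:
                f.seek(self.cold_offsets[key])
                value = json.loads(f.readline().decode())["value"]
            self.put(key, value)                  # promote back to hot
            return value
        return None

    def _migrate_cold(self):
        n = max(1, len(self.hot) // 10)           # evict the coldest ~10%
        victims = sorted(self.hot, key=lambda k: self.accesses[k])[:n]
        with open(self.log_path, "ab") as f:
            for k in victims:
                self.cold_offsets[k] = f.tell()
                rec = json.dumps({"key": k, "value": self.hot.pop(k)}) + "\n"
                f.write(rec.encode())

store = ColdStateStore("state.log", hot_capacity=3)
for k in "abcd":
    store.put(k, k.upper())          # inserting "d" pushes "a" to the log
print(store.get("a"), sorted(store.hot))
os.remove("state.log")               # clean up the demo file
```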
Citations: 1