Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498298 | Pages: 883-894
Huiqi Hu, Guoliang Li, Z. Bao, Yan Cui, Jianhua Feng
Real-time urban traffic speed estimation provides significant benefits in many real-world applications. However, existing traffic information acquisition systems obtain only coarse-grained traffic information on a small number of roads and cannot acquire fine-grained traffic information on every road. To address this problem, in this paper we study the traffic speed estimation problem, which, given a budget K, identifies K roads (called seeds) whose real traffic speeds can be obtained using crowdsourcing, and infers the speeds of the other roads (called non-seed roads) from the speeds of these seeds. The problem includes two sub-problems: (1) Speed Inference - how to accurately infer the speeds of the non-seed roads; and (2) Seed Selection - how to effectively select high-quality seeds. Estimating traffic speed accurately is rather challenging, because traffic changes dynamically and the changes are hard to predict, as many factors can affect traffic. To address these challenges, we propose effective algorithms to judiciously select high-quality seeds and devise inference models to infer the speeds of the non-seed roads. On the one hand, we observe that roads are correlated and that correlated roads exhibit similar traffic trends: their speeds rise above or fall below their historical averages simultaneously. We utilize this property and propose a two-step model to estimate traffic speed. The first step adopts a graphical model to infer the traffic trend, and the second step devises a hierarchical linear model to estimate the traffic speed based on the traffic trend. On the other hand, we formulate the seed selection problem, prove that it is NP-hard, and propose several greedy algorithms with approximation guarantees. Experimental results on two large real datasets show that our method outperforms baselines by two orders of magnitude in efficiency and by 40% in estimation accuracy.
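As a rough illustration of the seed-selection side (NP-hard, solved greedily with approximation guarantees), the sketch below implements the textbook greedy for a coverage-style objective: each candidate seed "covers" the roads correlated with it, and the road adding the most uncovered roads is picked next. The data layout and the coverage objective are assumptions for illustration, not the authors' algorithm.

```python
# Hypothetical greedy seed selection under a budget K, assuming a coverage-style
# objective over road correlations. This mirrors the classic greedy for monotone
# submodular maximization; it is NOT the paper's exact algorithm.

def greedy_seed_selection(correlated, k):
    """correlated: dict mapping road -> set of roads correlated with it.
    Returns up to k seeds chosen greedily by marginal coverage gain."""
    covered = set()
    seeds = []
    candidates = set(correlated)
    for _ in range(k):
        best, best_gain = None, 0
        for road in candidates:
            gain = len(correlated[road] - covered)
            if gain > best_gain:
                best, best_gain = road, gain
        if best is None:  # no remaining candidate adds coverage
            break
        seeds.append(best)
        covered |= correlated[best]
        candidates.remove(best)
    return seeds

# Toy example with made-up road ids and correlation sets.
corr = {"r1": {"r1", "r2", "r3"}, "r2": {"r2", "r3"}, "r4": {"r4", "r5"}}
print(greedy_seed_selection(corr, 2))  # ['r1', 'r4']
```

For a monotone coverage objective with a cardinality budget, this greedy choice yields the classic (1 - 1/e) approximation, which is the flavor of guarantee such seed-selection algorithms typically target.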
{"title":"Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds","authors":"Huiqi Hu, Guoliang Li, Z. Bao, Yan Cui, Jianhua Feng","doi":"10.1109/ICDE.2016.7498298","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498298","url":null,"abstract":"Real-time urban traffic speed estimation provides significant benefits in many real-world applications. However, existing traffic information acquisition systems only obtain coarse-grained traffic information on a small number of roads but cannot acquire fine-grained traffic information on every road. To address this problem, in this paper we study the traffic speed estimation problem, which, given a budget K, identifies K roads (called seeds) where the real traffic speeds on these seeds can be obtained using crowdsourcing, and infers the speeds of other roads (called non-seed roads) based on the speeds of these seeds. This problem includes two sub-problems: (1) Speed Inference - How to accurately infer the speeds of the non-seed roads; (2) Seed Selection - How to effectively select high-quality seeds. It is rather challenging to estimate the traffic speed accurately, because the traffic changes dynamically and the changes are hard to be predicted as many possible factors can affect the traffic. To address these challenges, we propose effective algorithms to judiciously select high-quality seeds and devise inference models to infer the speeds of the non-seed roads. On the one hand, we observe that roads have correlations and correlated roads have similar traffic trend: the speeds of correlated roads rise or fall compared with their historical average speed simultaneously. We utilize this property and propose a two-step model to estimate the traffic speed. The first step adopts a graphical model to infer the traffic trend and the second step devises a hierarchical linear model to estimate the traffic speed based on the traffic trend. On the other hand, we formulate the seed selection problem, prove that it is NP-hard, and propose several greedy algorithms with approximation guarantees. Experimental results on two large real datasets show that our method outperforms baselines by 2 orders of magnitude in efficiency and 40% in estimation accuracy.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"29 1","pages":"883-894"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84127069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flexible hybrid stores: Constraint-based rewriting to the rescue
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498353 | Pages: 1394-1397
Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, I. Manolescu, Stamatis Zampetakis
Data management is going through interesting times, as the number of currently available data management systems (DMSs for short) is probably higher than ever before. This creates unique opportunities for data-intensive applications, as some systems provide excellent performance on certain data processing operations. Yet it also raises great challenges, as a system that is efficient on some tasks may perform poorly on, or not support, other tasks, making it impossible to use a single DMS for a given application. It is thus desirable to use different DMSs side by side in order to take advantage of their best performance, as advocated under terms such as hybrid or poly-stores. We present ESTOCADA, a novel system capable of exploiting, side by side, a practically unbounded variety of DMSs, all the while guaranteeing the soundness and completeness of the store and striving to extract the best performance out of the various DMSs. Our system leverages recent advances in the area of query rewriting under constraints, which we use to capture the various data models and describe the fragments each DMS stores.
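To make the hybrid-store idea concrete, here is a toy sketch (not ESTOCADA's interface) of describing which fragment of a logical dataset each store holds and checking which stores can answer a query on their own; real constraint-based rewriting also recombines fragments across stores. The store names, attributes, and the Fragment type are invented for illustration.

```python
# Illustrative fragment descriptions and a naive routing check, assuming a
# simple attribute-coverage view of what each store holds.

from dataclasses import dataclass

@dataclass
class Fragment:
    store: str               # e.g. a relational engine, a document store, ...
    attributes: frozenset    # attributes of the logical dataset this store holds

FRAGMENTS = [
    Fragment("relational", frozenset({"id", "name", "price"})),
    Fragment("document", frozenset({"id", "reviews", "ratings"})),
]

def candidate_stores(query_attrs):
    """Return stores whose fragment alone covers the query's attributes."""
    needed = frozenset(query_attrs)
    return [f.store for f in FRAGMENTS if needed <= f.attributes]

print(candidate_stores({"id", "price"}))    # ['relational']
print(candidate_stores({"id", "reviews"}))  # ['document']
```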
{"title":"Flexible hybrid stores: Constraint-based rewriting to the rescue","authors":"Francesca Bugiotti, Damian Bursztyn, Alin Deutsch, I. Manolescu, Stamatis Zampetakis","doi":"10.1109/ICDE.2016.7498353","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498353","url":null,"abstract":"Data management goes through interesting times1, as the number of currently available data management systems (DMSs in short) is probably higher than ever before. This leads to unique opportunities for data-intensive applications, as some systems provide excellent performance on certain data processing operations. Yet, it also raises great challenges, as a system efficient on some tasks may perform poorly or not support other tasks, making it impossible to use a single DMS for a given application. It is thus desirable to use different DMSs side by side in order to take advantage of their best performance, as advocated under terms such as hybrid or poly-stores. We present ESTOCADA, a novel system capable of exploiting side-by-side a practically unbound variety of DMSs, all the while guaranteeing the soundness and completeness of the store, and striving to extract the best performance out of the various DMSs. Our system leverages recent advances in the area of query rewriting under constraints, which we use to capture the various data models and describe the fragments each DMS stores.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"47 1","pages":"1394-1397"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82593269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DebEAQ - debugging empty-answer queries on large data graphs
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498355 | Pages: 1402-1405
E. Vasilyeva, Thomas S. Heinze, Maik Thiele, Wolfgang Lehner
The large volume of freely available graph data sets makes it difficult for users to analyze them. To do so, users typically pose many pattern matching queries and study their answers. Without deep knowledge about the data graph, users can create `failing' queries, which deliver empty answers. Analyzing the causes of these empty answers is a time-consuming and complicated task, especially for graph queries. To help users debug these `failing' queries, there are two common approaches: one focuses on discovering missing subgraphs of the data graph, while the other tries to rewrite the queries such that they deliver some results. In this demonstration, we combine both approaches and give users the opportunity to discover why the requested queries delivered empty results. To this end, we propose DebEAQ, a debugging tool for pattern matching queries, which allows users to compare both approaches and also provides functionality for debugging queries manually.
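As a toy illustration of the query-rewriting side of empty-answer debugging, the sketch below relaxes a failing pattern by dropping edge labels the data graph cannot satisfy and reports what had to be removed. The stand-in matcher and the whole example are assumptions for illustration, not DebEAQ's algorithm.

```python
# A stand-in "data graph summary" and matcher: a real matcher evaluates full
# graph patterns, not just edge labels.
DATA_EDGE_LABELS = {"knows", "worksAt"}

def evaluate(pattern_edges):
    """True iff every edge label the pattern requires occurs in the data graph."""
    return all(label in DATA_EDGE_LABELS for label in pattern_edges)

def relax_until_nonempty(pattern_edges):
    """Drop unsatisfiable edge labels one by one until the query succeeds,
    returning the relaxed pattern and the removed labels (likely causes)."""
    pattern = list(pattern_edges)
    removed = []
    while pattern and not evaluate(pattern):
        missing = next(l for l in pattern if l not in DATA_EDGE_LABELS)
        pattern.remove(missing)
        removed.append(missing)
    return pattern, removed

print(relax_until_nonempty(["knows", "livesIn", "worksAt"]))
# (['knows', 'worksAt'], ['livesIn']) -- 'livesIn' caused the empty answer
```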
{"title":"DebEAQ - debugging empty-answer queries on large data graphs","authors":"E. Vasilyeva, Thomas S. Heinze, Maik Thiele, Wolfgang Lehner","doi":"10.1109/ICDE.2016.7498355","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498355","url":null,"abstract":"The large volume of freely available graph data sets impedes the users in analyzing them. For this purpose, they usually pose plenty of pattern matching queries and study their answers. Without deep knowledge about the data graph, users can create `failing' queries, which deliver empty answers. Analyzing the causes of these empty answers is a time-consuming and complicated task especially for graph queries. To help users in debugging these `failing' queries, there are two common approaches: one is focusing on discovering missing subgraphs of a data graph, the other one tries to rewrite the queries such that they deliver some results. In this demonstration, we will combine both approaches and give the users an opportunity to discover why empty results were delivered by the requested queries. Therefore, we propose DebEAQ, a debugging tool for pattern matching queries, which allows to compare both approaches and also provides functionality to debug queries manually.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"1402-1405"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82182458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
pSCAN: Fast and exact structural graph clustering
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498245 | Pages: 253-264
Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, W. Zhang
In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering is to assign the vertices in V to clusters, and also to identify the sets of hub vertices and outlier vertices, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected to each other. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs because it exhaustively computes the structural similarity of every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering, which present opportunities for further optimization. Based on these observations, we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach that aims to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Notably, for the Twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.
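For context, the structural similarity that SCAN-family algorithms (including pSCAN) compute between adjacent vertices is the cosine-style overlap of their closed neighborhoods; a minimal sketch follows, with the paper's pruning and two-step optimizations omitted.

```python
# Structural similarity between adjacent vertices, as used by SCAN-style
# clustering: |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|) over closed neighborhoods.

import math

def structural_similarity(adj, u, v):
    """adj: dict mapping vertex -> set of neighbors (undirected graph)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

# Toy undirected graph as an adjacency dict.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(round(structural_similarity(adj, 1, 2), 3))  # 1.0  (identical closed neighborhoods)
print(round(structural_similarity(adj, 3, 4), 3))  # ~0.707 (vertex 4 shares little with 3)
```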
{"title":"pSCAN: Fast and exact structural graph clustering","authors":"Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, W. Zhang","doi":"10.1109/ICDE.2016.7498245","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498245","url":null,"abstract":"In this paper, we study the problem of structural graph clustering, a fundamental problem in managing and analyzing graph data. Given a large graph G = (V, E), structural graph clustering is to assign vertices in V to clusters and to identify the sets of hub vertices and outlier vertices as well, such that vertices in the same cluster are densely connected to each other while vertices in different clusters are loosely connected to each other. Firstly, we prove that the existing SCAN approach is worst-case optimal. Nevertheless, it is still not scalable to large graphs due to exhaustively computing structural similarity for every pair of adjacent vertices. Secondly, we make three observations about structural graph clustering, which present opportunities for further optimization. Based on these observations, in this paper we develop a new two-step paradigm for scalable structural graph clustering. Thirdly, following this paradigm, we present a new approach aiming to reduce the number of structural similarity computations. Moreover, we propose optimization techniques to speed up checking whether two vertices are structure-similar to each other. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our new approach outperforms the state-of-the-art approaches by over one order of magnitude. Noticeably, for the twitter graph with 1 billion edges, our approach takes 25 minutes while the state-of-the-art approach cannot finish even after 24 hours.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"12 1","pages":"253-264"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77053800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On main-memory flushing in microblogs data management systems
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498261 | Pages: 445-456
A. Magdy, Rami Alghamdi, M. Mokbel
Searching microblogs, e.g., tweets and comments, is practically supported through main-memory indexing for scalable data digestion and efficient query evaluation. With the continuous arrival of excessive numbers of microblogs, it is infeasible to keep the data in main memory for long periods. Thus, once the allocated memory budget is filled, a portion of the data is flushed from memory to disk to continuously accommodate newly incoming data. Existing techniques either suffer from a low memory hit ratio, because they flush items regardless of their relevance to incoming queries, or incur the significant overhead of tracking individual data items; both limit the scalability of microblog systems. In this paper, we propose the kFlushing policy, which exploits the popularity of top-k queries in microblogs to smartly select a subset of microblogs to flush. kFlushing is mainly designed to increase the memory hit ratio. To this end, it identifies and flushes in-memory data that does not contribute to incoming queries. The freed memory space is used to accumulate more useful data, which is used to answer more queries from memory contents. When all memory is utilized for useful data, kFlushing flushes the data that is less likely to degrade the memory hit ratio. In addition, kFlushing incurs little overhead, preserving high system scalability in terms of high digestion rates for incoming fast data. Extensive experimental evaluation shows the effectiveness and scalability of kFlushing, which improves the main-memory hit ratio by 26–330% while coping with fast microblog streams of up to 100K microblogs/second.
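A hypothetical sketch of the flushing idea follows: when the memory budget is exceeded, evict the microblogs that contribute least to incoming top-k queries (approximated here by a per-item hit counter). The class, the eviction batch size, and the scoring rule are illustrative assumptions, not the paper's kFlushing implementation.

```python
# Toy query-aware flushing: items that never contribute to top-k answers are
# the first to be migrated from memory to the disk-resident store.

import heapq

class MemoryStore:
    def __init__(self, budget):
        self.budget = budget
        self.items = {}   # id -> [microblog, number of top-k query hits]
        self.disk = []    # stand-in for the disk-resident store

    def insert(self, mid, blog):
        if len(self.items) >= self.budget:
            self.flush(max(1, self.budget // 10))  # free at least one slot
        self.items[mid] = [blog, 0]

    def record_query_hit(self, mid):
        if mid in self.items:
            self.items[mid][1] += 1

    def flush(self, n):
        # evict the n items that contributed least to incoming queries
        victims = heapq.nsmallest(n, self.items.items(), key=lambda kv: kv[1][1])
        for mid, (blog, _) in victims:
            self.disk.append((mid, blog))
            del self.items[mid]

store = MemoryStore(budget=3)
for i in range(3):
    store.insert(i, f"tweet {i}")
store.record_query_hit(2)      # item 2 contributes to a top-k answer
store.insert(3, "tweet 3")     # triggers a flush of the least useful item
print(sorted(store.items), [m for m, _ in store.disk])  # [1, 2, 3] [0]
```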
{"title":"On main-memory flushing in microblogs data management systems","authors":"A. Magdy, Rami Alghamdi, M. Mokbel","doi":"10.1109/ICDE.2016.7498261","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498261","url":null,"abstract":"Searching microblogs, e.g., tweets and comments, is practically supported through main-memory indexing for scalable data digestion and efficient query evaluation. With continuity and excessive numbers of microblogs, it is infeasible to keep data in main-memory for long periods. Thus, once allocated memory budget is filled, a portion of data is flushed from memory to disk to continuously accommodate newly incoming data. Existing techniques come with either low memory hit ratio due to flushing items regardless of their relevance to incoming queries or significant overhead of tracking individual data items, which limit scalability of microblogs systems in either cases. In this paper, we propose kFlushing policy that exploits popularity of top-k queries in microblogs to smartly select a subset of microblogs to flush. kFlushing is mainly designed to increase memory hit ratio. To this end, it identifies and flushes in-memory data that does not contribute to incoming queries. The freed memory space is utilized to accumulate more useful data that is used to answer more queries from memory contents. When all memory is utilized for useful data, kFlushing flushes data that is less likely to degrade memory hit ratio. In addition, kFlushing comes with a little overhead that keeps high system scalability in terms of high digestion rates of incoming fast data. Extensive experimental evaluation shows the effectiveness and scalability of kFlushing to improve main-memory hit by 26–330% while coping up with fast microblog streams of up to 100K microblog/second.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"10 1","pages":"445-456"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75790731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ClEveR: Clustering events with high density of true-to-false occurrence ratio
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498301 | Pages: 918-929
G. Theodoridis, T. Benoist
Leveraging the evolution of ICT, modern systems collect voluminous sets of monitoring data, which are analysed to increase the system's situational awareness. Apart from regular activity, this bulk of monitoring information may also include instances of anomalous operation, which need to be detected and examined thoroughly so that their root causes can be identified. Hence, it is crucial for an alert mechanism to investigate the correlations of suspicious monitoring traces not only with each other but also with the overall monitoring data, in order to discover any high spatio-temporal concentration of abnormal occurrences that could be considered evidence of an underlying system malfunction. To this end, this paper presents a novel clustering algorithm that groups instances of problematic behaviour not only according to their concentration but also with respect to the presence of normal activity. On this basis, the proposed algorithm operates at two proximity scales, allowing more distant anomalous observations to be combined as long as they are not interrupted by regular feedback. Regardless of the initial motivation, the clustering algorithm is applicable to any set of objects that share a common feature and for which areas of high density relative to the rest of the population are sought.
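As a simplified, hypothetical rendering of the idea of clustering by both concentration and the absence of interleaved normal activity, the sketch below merges consecutive anomalies on a timeline only if they are close enough and few normal events fall between them; the parameters and merging rule are assumptions, not the ClEveR algorithm.

```python
# Group anomalous events by proximity AND by whether normal activity
# "interrupts" the gap between them.

import bisect

def cluster_anomalies(anomalous, normal, max_gap, max_normal_between):
    """anomalous, normal: sorted lists of event timestamps.
    Two consecutive anomalies join the same cluster if they are within max_gap
    and at most max_normal_between normal events fall between them."""
    clusters, current = [], [anomalous[0]]
    for prev, cur in zip(anomalous, anomalous[1:]):
        normals_between = bisect.bisect_left(normal, cur) - bisect.bisect_right(normal, prev)
        if cur - prev <= max_gap and normals_between <= max_normal_between:
            current.append(cur)
        else:
            clusters.append(current)
            current = [cur]
    clusters.append(current)
    return clusters

anom = [1, 2, 3, 10, 11]
norm = [4, 5, 6, 7]
print(cluster_anomalies(anom, norm, max_gap=10, max_normal_between=1))
# [[1, 2, 3], [10, 11]] -- the run of normal events separates the two groups
```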
{"title":"ClEveR: Clustering events with high density of true-to-false occurrence ratio","authors":"G. Theodoridis, T. Benoist","doi":"10.1109/ICDE.2016.7498301","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498301","url":null,"abstract":"Leveraging the ICT evolution, the modern systems collect voluminous sets of monitoring data, which are analysed in order to increase the system's situational awareness. Apart from the regular activity this bulk of monitoring information may also include instances of anomalous operation, which need to be detected and examined thoroughly so as their root causes to be identified. Hence, for an alert mechanism it is crucial to investigate the cross-correlations among the suspicious monitoring traces not only with each other but also against the overall monitoring data, in order to discover any high spatio-temporal concentration of abnormal occurrences that could be considered as evidence of an underlying system malfunction. To this end, this paper presents a novel clustering algorithm that groups instances of problematic behaviour not only according to their concentration but also with respect to the presence of normal activity. On this basis, the proposed algorithm operates at two proximity scales, so as to allow for combining more distant anomalous observations that are not however interrupted by regular feedback. Regardless of the initial motivation, the clustering algorithm is applicable to any case of objects that share a common feature and for which areas of high density in comparison with the rest of the population are examined.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"53 1","pages":"918-929"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91235676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Indoor data management
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498358 | Pages: 1414-1417
Hua Lu, M. A. Cheema
A large part of modern life is lived indoors, in places such as homes, offices, shopping malls, universities, libraries and airports. However, almost all existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organizations, government agencies, technology giants, and enterprising start-ups to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years, and the research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing research in indoor data management, covering modeling, cleansing, indexing, querying, and other relevant topics. Finally, we discuss future research directions in this important and growing research area, including spatial-textual search, integrating outdoor and indoor spaces, uncertain indoor data, and indoor trajectory mining.
{"title":"Indoor data management","authors":"Hua Lu, M. A. Cheema","doi":"10.1109/ICDE.2016.7498358","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498358","url":null,"abstract":"A large part of modern life is lived indoors such as in homes, offices, shopping malls, universities, libraries and airports. However, almost all of the existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify the locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organizations, government agencies, technology giants, and enterprizing start-ups - to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years and the research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing research in indoor data management, covering modeling, cleansing, indexing, querying, and other relevant topics. Finally, we discuss the future research directions in this important and growing research area, discussing spatial-textual search, integrating outdoor and indoor spaces, uncertain indoor data, and indoor trajectory mining.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"53 1","pages":"1414-1417"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84500609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An interval join optimized for modern hardware
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498316 | Pages: 1098-1109
Danila Piatov, S. Helmer, Anton Dignös
We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses and exploiting features of modern CPU architectures. In an experimental evaluation with real-world datasets, our algorithm outperforms the state-of-the-art by an order of magnitude.
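For reference, a plain sort-and-sweep interval overlap join (not the paper's cache-optimized algorithm with its specialized data structure and lazy evaluation) can be sketched as follows; it illustrates only the overlap predicate being computed.

```python
# Sort both relations by interval start and sweep, keeping "active" intervals
# whose end point has not yet been passed.

def interval_join(r, s):
    """r, s: lists of (start, end) tuples with start <= end.
    Returns all index pairs (i, j) such that r[i] and s[j] overlap."""
    events = [(start, end, 'r', i) for i, (start, end) in enumerate(r)] + \
             [(start, end, 's', j) for j, (start, end) in enumerate(s)]
    events.sort()
    active_r, active_s, result = [], [], []
    for start, end, side, idx in events:
        if side == 'r':
            active_s[:] = [(e, j) for e, j in active_s if e >= start]  # expire old intervals
            result += [(idx, j) for _, j in active_s]
            active_r.append((end, idx))
        else:
            active_r[:] = [(e, i) for e, i in active_r if e >= start]
            result += [(i, idx) for _, i in active_r]
            active_s.append((end, idx))
    return result

R = [(1, 5), (10, 12)]
S = [(4, 8), (11, 11)]
print(interval_join(R, S))  # [(0, 0), (1, 1)]
```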
{"title":"An interval join optimized for modern hardware","authors":"Danila Piatov, S. Helmer, Anton Dignös","doi":"10.1109/ICDE.2016.7498316","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498316","url":null,"abstract":"We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses exploiting features of modern CPU architectures. In an experimental evaluation with real-world datasets our algorithm is able to outperform the state-of-the-art by an order of magnitude.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"140 1","pages":"1098-1109"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85191273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint repairs for web wrappers
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498320 | Pages: 1146-1157
Stefano Ortona, G. Orsi, Tim Furche, Marcello Buoncristiano
Automated web scraping is a popular means for acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under- or over-segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort, so jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge but requires no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites across a wide variety of application domains shows that joint repairs increase the quality of wrappers by between 15% and 60%, independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.
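As a loose illustration of the data-side half of such repairs, the sketch below validates extracted records against a simple domain constraint (a price pattern) and flags the wrapper rule whose values violate it most often; the constraint, field names, and threshold are invented for illustration, and the paper's joint repair model for XPath wrappers is considerably richer.

```python
# Flag wrapper rules whose extracted values mostly violate a domain constraint,
# a typical symptom of spurious content glued onto a field.

import re

CONSTRAINTS = {"price": re.compile(r"^\$\d+(\.\d{2})?$")}   # hypothetical constraint

def flag_wrapper_rules(records):
    """records: list of dicts mapping field name -> extracted string.
    Returns the fields whose extraction rule likely needs repair."""
    violations = {f: 0 for f in CONSTRAINTS}
    for rec in records:
        for field, pattern in CONSTRAINTS.items():
            if field in rec and not pattern.match(rec[field]):
                violations[field] += 1
    return {f for f, v in violations.items() if v > len(records) / 2}

data = [{"price": "$12.99 In stock"},      # spurious content glued to the price
        {"price": "$8.50 Free shipping"}]
print(flag_wrapper_rules(data))            # {'price'} -> repair the price rule
```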
{"title":"Joint repairs for web wrappers","authors":"Stefano Ortona, G. Orsi, Tim Furche, Marcello Buoncristiano","doi":"10.1109/ICDE.2016.7498320","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498320","url":null,"abstract":"Automated web scraping is a popular means for acquiring data from the web. Scrapers (or wrappers) are derived from either manually or automatically annotated examples, often resulting in under/over segmented data, together with missing or spurious content. Automatic repair and maintenance of the extracted data is thus a necessary complement to automatic wrapper generation. Moreover, the extracted data is often the result of a long-term data acquisition effort and thus jointly repairing wrappers together with the generated data reduces future needs for data cleaning. We study the problem of computing joint repairs for XPath-based wrappers and their extracted data. We show that the problem is NP-complete in general but becomes tractable under a few natural assumptions. Even tractable solutions to the problem are still impractical on very large datasets, but we propose an optimal approximation that proves effective across a wide variety of domains and sources. Our approach relies on encoded domain knowledge, but require no per-source supervision. An evaluation spanning more than 100k web pages from 100 different sites of a wide variety of application domains, shows that joint repairs are able to increase the quality of wrappers between 15% and 60% independently of the wrapper generation system, eliminating all errors in more than 50% of the cases.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"191 1","pages":"1146-1157"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74461790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ICE: Managing cold state for big data applications
Pub Date: 2016-05-16 | DOI: 10.1109/ICDE.2016.7498262 | Pages: 457-468
B. Chandramouli, Justin J. Levandoski, Eli Cortez C. Vilarinho
The use of big data in a business revolves around a monitor-mine-manage (M3) loop: data is monitored in real time, while mined insights are used to manage the business and derive value. While mining has traditionally been performed offline, recent years have seen an increasing need to perform all phases of M3 in real time. A stream processing engine (SPE) enables such a seamless M3 loop for applications such as targeted advertising, recommender systems, risk analysis, and call-center analytics. However, these M3 applications require the SPE to maintain massive amounts of state in memory, leading to resource usage skew: memory is scarce and over-utilized, whereas CPU and I/O are under-utilized. In this paper, we propose a novel solution for scaling SPEs for memory-bound M3 applications that leverages natural access skew in data-parallel subqueries, where a small fraction of the state is hot (frequently accessed) and most state is cold (infrequently accessed). We present ICE (incremental cold-state engine), a framework that allows an SPE to seamlessly migrate cold state to secondary storage (disk or flash). ICE uses a novel architecture that exploits the semantics of individual stream operators to efficiently manage cold state in an SPE using an incremental log-structured store. We implemented ICE inside an SPE. Experiments using real data show that ICE can reduce memory usage significantly without sacrificing performance, and can sometimes even improve performance.
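A toy sketch of the hot/cold separation behind ICE follows: keys with few accesses are appended to a log standing in for secondary storage and dropped from memory, while hot keys stay resident. The class, eviction rule, and log scan are illustrative assumptions, not ICE's incremental log-structured store.

```python
# Hot state stays in memory; cold state is appended to an append-only "log".

class ColdStateStore:
    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.mem = {}    # key -> (value, access_count)
        self.log = []    # append-only stand-in for disk/flash

    def get(self, key):
        if key in self.mem:
            value, count = self.mem[key]
            self.mem[key] = (value, count + 1)
            return value
        # cold read: scan the log newest-first (a real store would index it)
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

    def put(self, key, value):
        count = self.mem.get(key, (None, 0))[1]
        self.mem[key] = (value, count + 1)
        if len(self.mem) > self.memory_budget:
            self._migrate_cold()

    def _migrate_cold(self):
        # move the least-accessed half of the keys to the log
        by_heat = sorted(self.mem.items(), key=lambda kv: kv[1][1])
        for key, (value, _) in by_heat[: len(by_heat) // 2]:
            self.log.append((key, value))
            del self.mem[key]

store = ColdStateStore(memory_budget=4)
for i in range(5):
    store.put(i, f"state-{i}")
store.get(4); store.get(4)                # key 4 becomes hot
print(sorted(store.mem), store.get(0))    # [2, 3, 4] state-0 (cold key read from log)
```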
{"title":"ICE: Managing cold state for big data applications","authors":"B. Chandramouli, Justin J. Levandoski, Eli Cortez C. Vilarinho","doi":"10.1109/ICDE.2016.7498262","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498262","url":null,"abstract":"The use of big data in a business revolves around a monitor-mine-manage (M3) loop: data is monitored in real-time, while mined insights are used to manage the business and derive value. While mining has traditionally been performed offline, recent years have seen an increasing need to perform all phases of M3 in real-time. A stream processing engine (SPE) enables such a seamless M3 loop for applications such as targeted advertising, recommender systems, risk analysis, and call-center analytics. However, these M3 applications require the SPE to maintain massive amounts of state in memory, leading to resource usage skew: memory is scarce and over-utilized, whereas CPU and I/O are under-utilized. In this paper, we propose a novel solution to scaling SPEs for memory-bound M3 applications that leverages natural access skew in data-parallel subqueries, where a small fraction of the state is hot (frequently accessed) and most state is cold (infrequently accessed). We present ICE (incremental coldstate engine), a framework that allows an SPE to seamlessly migrate cold state to secondary storage (disk or flash). ICE uses a novel architecture that exploits the semantics of individual stream operators to efficiently manage cold state in an SPE using an incremental log-structured store. We implemented ICE inside an SPE. Experiments using real data show that ICE can reduce memory usage significantly without sacrificing performance, and can sometimes even improve performance.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"47 1","pages":"457-468"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87574565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}