Location-based services (LBS) have been widely adopted by mobile users. Many LBS users have a direction-aware search requirement: answers must lie in the search direction. However, to the best of our knowledge, no existing research investigates direction-aware search. A straightforward method first finds candidates without considering the direction constraint, and then generates the answers by pruning candidates that violate the direction constraint. However, this method is rather expensive, as it involves much useless computation on many unnecessary directions. To address this problem, we propose a direction-aware spatial keyword search method that inherently supports direction-aware search. We devise novel direction-aware indexing structures to prune unnecessary directions. We develop effective pruning techniques and search algorithms to efficiently answer a direction-aware query. As users may dynamically change their search directions, we propose to answer a query incrementally. Experimental results on real datasets show that our method achieves high performance and significantly outperforms existing methods.
{"title":"DESKS: Direction-Aware Spatial Keyword Search","authors":"Guoliang Li, Jianhua Feng, Jing Xu","doi":"10.1109/ICDE.2012.93","DOIUrl":"https://doi.org/10.1109/ICDE.2012.93","url":null,"abstract":"Location-based services (LBS) have been widely accepted by mobile users. Many LBS users have direction-aware search requirement that answers must be in the search direction. However to the best of our knowledge there is not yet any research available that investigates direction-aware search. A straightforward method first finds candidates without considering the direction constraint, and then generates the answers by pruning those candidates which invalidate the direction constraint. However this method is rather expensive as it involves a lot of useless computation on many unnecessary directions. To address this problem, we propose a direction-aware spatial keyword search method which inherently supports direction-aware search. We devise novel direction-aware indexing structures to prune unnecessary directions. We develop effective pruning techniques and search algorithms to efficiently answer a direction-aware query. As users may dynamically change their search directions, we propose to incrementally answer a query. Experimental results on real datasets show that our method achieves high performance and outperforms existing methods significantly.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128395435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Relational database management systems are general in the sense that they can handle arbitrary schemas, queries, and modifications; this generality is implemented using runtime metadata lookups and tests that ensure that control is channelled to the appropriate code in all cases. Unfortunately, these lookups and tests are carried out even when information is available that renders some of them superfluous, leading to unnecessary runtime overhead. This paper introduces micro-specialization, an approach that uses relation- and query-specific information to specialize DBMS code at runtime and thereby eliminate some of these overheads. We develop a taxonomy of approaches and specialization times and propose a general architecture that isolates most of the creation and execution of the specialized code sequences in a separate DBMS-independent module. Through three illustrative types of micro-specialization applied to PostgreSQL, we show that this approach requires minimal changes to a DBMS and can simultaneously improve performance across a wide range of queries, modifications, and bulk loading, in terms of storage, CPU usage, and I/O time, on the TPC-H and TPC-C benchmarks.
{"title":"Micro-Specialization in DBMSes","authors":"Rui Zhang, R. Snodgrass, S. Debray","doi":"10.1109/ICDE.2012.110","DOIUrl":"https://doi.org/10.1109/ICDE.2012.110","url":null,"abstract":"Relational database management systems are general in the sense that they can handle arbitrary schemas, queries, and modifications, this generality is implemented using runtime metadata lookups and tests that ensure that control is channelled to the appropriate code in all cases. Unfortunately, these lookups and tests are carried out even when information is available that renders some of these operations superfluous, leading to unnecessary runtime overheads. This paper introduces micro-specialization, an approach that uses relation- and query-specific information to specialize the DBMS code at runtime and thereby eliminate some of these overheads. We develop a taxonomy of approaches and specialization times and propose a general architecture that isolates most of the creation and execution of the specialized code sequences in a separate DBMS-independent module. Through three illustrative types of micro-specializations applied to PostgreSQL, we show that this approach requires minimal changes to a DBMS and can improve the performance simultaneously across a wide range of queries, modifications, and bulk-loading, in terms of storage, CPU usage, and I/O time of the TPC-H and TPC-C benchmarks.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133505873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider how to support, over a wide-area network, a large number of users whose interests are characterised by range top-k continuous queries. Given an object update, we need to notify users whose top-k results are affected. Simple solutions include using a content-driven network to notify all users whose interest ranges contain the update (ignoring top-k), or using a server to compute only the affected queries and notifying them individually. The former solution generates too much network traffic, while the latter overwhelms the server. We present a geometric framework for the problem that allows us to describe the set of affected queries succinctly with messages that can be efficiently disseminated using content-driven networks. We give fast algorithms to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all user interests. We also present extensions to our solution, including an approximate algorithm that trades off between the cost of server-side reformulation and that of user-side post-processing, as well as efficient techniques for batch updates.
{"title":"Processing and Notifying Range Top-k Subscriptions","authors":"Albert Yu, P. Agarwal, Jun Yang","doi":"10.1109/ICDE.2012.67","DOIUrl":"https://doi.org/10.1109/ICDE.2012.67","url":null,"abstract":"We consider how to support a large number of users over a wide-area network whose interests are characterised by range top-k continuous queries. Given an object update, we need to notify users whose top-k results are affected. Simple solutions include using a content-driven network to notify all users whose interest ranges contain the update (ignoring top-k), or using a server to compute only the affected queries and notifying them individually. The former solution generates too much network traffic, while the latter overwhelms the server. We present a geometric framework for the problem that allows us to describe the set of affected queries succinctly with messages that can be efficiently disseminated using content-driven networks. We give fast algorithms to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all user interests. We also present extensions to our solution, including an approximate algorithm that trades off between the cost of server-side reformulation and that of user-side post-processing, as well as efficient techniques for batch updates.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"453 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131474032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An understanding of information flow has many applications, including maximizing marketing impact on social media, limiting malware propagation, and managing undesired disclosure of sensitive information. This paper presents scalable methods both for learning models of information flow in networks from data, based on the Independent Cascade Model, and for predicting probabilities of unseen flows from these models. Our approach is based on a principled probabilistic construction, and its results compare favourably with existing methods in terms of prediction accuracy and scalable evaluation; in addition, we can evaluate a broader range of queries than previously shown, including probabilities of joint and/or conditional flow, while also reflecting model uncertainty. Exact evaluation of flow probabilities is exponential in the number of edges, and naive sampling can also be expensive, so we propose efficient Markov-chain Monte-Carlo sampling using the Metropolis-Hastings algorithm, with details described in the paper. We identify two types of data: attributed data, where the paths of past flows are known, and unattributed data, where only the endpoints are known. Both data types are addressed in this paper, including training methods, example real-world data sets, and experimental evaluation. In particular, we investigate flow data from the Twitter microblogging service, exploring the flow of messages through retweets (tweet forwards) for the attributed case, and the propagation of hash tags (metadata tags) and URLs for the unattributed case.
{"title":"Learning Stochastic Models of Information Flow","authors":"Luke Dickens, Ian Molloy, Jorge Lobo, P. Cheng, A. Russo","doi":"10.1109/ICDE.2012.103","DOIUrl":"https://doi.org/10.1109/ICDE.2012.103","url":null,"abstract":"An understanding of information flow has many applications, including for maximizing marketing impact on social media, limiting malware propagation, and managing undesired disclosure of sensitive information. This paper presents scalable methods for both learning models of information flow in networks from data, based on the Independent Cascade Model, and predicting probabilities of unseen flow from these models. Our approach is based on a principled probabilistic construction and results compare favourably with existing methods in terms of accuracy of prediction and scalable evaluation, with the addition that we are able to evaluate a broader range of queries than previously shown, including probability of joint and/or conditional flow, as well as reflecting model uncertainty. Exact evaluation of flow probabilities is exponential in the number of edges and naive sampling can also be expensive, so we propose sampling in an efficient Markov-Chain Monte-Carlo fashion using the Metropolis-Hastings algorithm -- details described in the paper. We identify two types of data, those where the paths of past flows are known -- attributed data, and those where only the endpoints are known -- unattributed data. Both data types are addressed in this paper, including training methods, example real world data sets, and experimental evaluation. In particular, we investigate flow data from the Twitter microblogging service, exploring the flow of messages through retweets (tweet forwards) for the attributed case, and the propagation of hash tags (metadata tags) and urls for the unattributed case.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133855220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The XQuery Update Facility (XQUF), which provides a declarative way of updating XML, has become a W3C Recommendation. The SQL/XML standard, on the other hand, defines XMLType as a column data type in the RDBMS environment and defines standard SQL/XML operators, such as XMLQuery(), to embed XQuery for querying XMLType columns. Based on this standard, XML-enabled RDBMSs have become industrial-strength platforms that host XML applications in a standards-compliant way by providing XML storage and query capabilities. Support for updating XML, however, remained proprietary in RDBMSs until XQUF became a Recommendation. XQUF is agnostic of how XML is stored, so propagating updates to any persistent XML store is beyond its scope. In this paper, we show how XQUF can be incorporated into XMLQuery() to effectively update XML stored in XMLType columns in an XML-enabled RDBMS such as Oracle XML DB. We present various compile-time and run-time optimisation techniques showing how XQUF can be implemented to efficiently and declaratively update XML stored in an RDBMS. We describe our approaches to optimising XQUF for two common physical XML storage models: native binary XML storage and relational decomposition of XML. Although our study is done using Oracle XML DB, all of the presented optimisation techniques are generic to XML stores that need to support updates of persistent XML, and are not specific to the Oracle XML DB implementation.
{"title":"Efficient Support of XQuery Update Facility in XML Enabled RDBMS","authors":"Z. Liu, Hui J. Chang, Balasubramanyam Sthanikam","doi":"10.1109/ICDE.2012.17","DOIUrl":"https://doi.org/10.1109/ICDE.2012.17","url":null,"abstract":"XQuery Update Facility (XQUF), which provides a declarative way of updating XML, has become recommendation by W3C. The SQL/XML standard, on the other hand, defines XMLType as a column data type in RDBMS environment and defines the standard SQL/XML operator, such as XML Query() to embed XQuery to query XMLType column in RDBMS. Based on this SQL/XML standard, XML enabled RDBMS becomes industrial strength platforms to host XML applications in a standard compliance way by providing XML store and query capability. However, updating XML capability support remains to be proprietary in RDBMS until XQUF becomes the recommendation. XQUF is agnostic of how XML is stored so that propagation of actual update to any persistent XML store is beyond the scope of XQUF. In this paper, we show how XQUF can be incorporated into XML Query() to effectively update XML stored in XMLType column in the environment of XML enabled RDBMS, such as Oracle XMLDB. We present various compile time and run time optimisation techniques to show how XQUF can be efficiently implemented to declaratively update XML stored in RDBMS. We present how our approaches of optimising XQUF for common physical XML storage models: native binary XML storage model and relational decomposition of XML storage model. Although our study is done using Oracle XMLDB, all of the presented optimisation techniques are generic to XML stores that need to support update of persistent XML store and not specific to Oracle XMLDB implementation.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132360183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents GeoFeed, a location-aware news feed system that provides a new platform for its users to get spatially related message updates from either their friends or favorite news sources. GeoFeed distinguishes itself from all existing news feed systems in that it takes into account the spatial extents of messages and user locations when deciding upon the selected news feed. GeoFeed is equipped with three different approaches for delivering the news feed to its users, namely spatial pull, spatial push, and shared push. The main challenge for GeoFeed is then deciding when to use each of these three approaches and for which users. GeoFeed employs a smart decision model that chooses among these approaches in a way that: (a) minimizes the system overhead for delivering the location-aware news feed, and (b) guarantees a certain response time for each user to obtain the requested location-aware news feed. Experimental results, based on real and synthetic data, show that GeoFeed outperforms existing news feed systems in terms of response time and maintenance cost.
{"title":"GeoFeed: A Location Aware News Feed System","authors":"Jie Bao, M. Mokbel, Chi-Yin Chow","doi":"10.1109/ICDE.2012.97","DOIUrl":"https://doi.org/10.1109/ICDE.2012.97","url":null,"abstract":"This paper presents the Geo Feed system, a location-aware news feed system that provides a new platform for its users to get spatially related message updates from either their friends or favorite news sources. Geo Feed distinguishes itself from all existing news feed systems in that it takes into account the spatial extents of messages and user locations when deciding upon the selected news feed. Geo Feed is equipped with three different approaches for delivering the news feed to its users, namely, spatial pull, spatial push, and shared push. Then, the main challenge of Geo Feed is to decide on when to use each of these three approaches to which users. Geo Feed is equipped with a smart decision model that decides about using these approaches in a way that: (a) minimizes the system overhead for delivering the location-aware news feed, and (b) guarantees a certain response time for each user to obtain the requested location-aware news feed. Experimental results, based on real and synthetic data, show that Geo Feed outperforms existing news feed systems in terms of response time and maintenance cost.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129263576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A facility for merging equivalent data streams can support multiple capabilities in a data stream management system (DSMS), such as query-plan switching and high availability. One can logically view a data stream as a temporal table of events, each associated with a lifetime (time interval) over which the event contributes to output. In many applications, the "same" logical stream may present itself in multiple physical forms, for example due to disorder arising in transmission or from combining multiple sources, and due to modifications of earlier events. Merging such streams correctly is challenging when the streams may differ physically in timing, order, and composition. This paper introduces a new stream operator called Logical Merge (LMerge) that takes multiple logically consistent streams as input and outputs a single stream that is compatible with all of them. LMerge can handle the dynamic attachment and detachment of input streams. We present a range of algorithms for LMerge that can exploit compile-time stream properties for efficiency. Experiments with StreamInsight, a commercial DSMS, show that LMerge is sometimes orders of magnitude more efficient than enforcing determinism on inputs, and that there is benefit to using specialized algorithms when stream variability is limited. We also show that LMerge and its extensions can provide performance benefits in several real-world applications.
{"title":"Physically Independent Stream Merging","authors":"B. Chandramouli, D. Maier, J. Goldstein","doi":"10.1109/ICDE.2012.25","DOIUrl":"https://doi.org/10.1109/ICDE.2012.25","url":null,"abstract":"A facility for merging equivalent data streams can support multiple capabilities in a data stream management system (DSMS), such as query-plan switching and high availability. One can logically view a data stream as a temporal table of events, each associated with a lifetime (time interval) over which the event contributes to output. In many applications, the \"same\" logical stream may present itself physically in multiple physical forms, for example, due to disorder arising in transmission or from combining multiple sources, and modifications of earlier events. Merging such streams correctly is challenging when the streams may differ physically in timing, order, and composition. This paper introduces a new stream operator called Logical Merge (LMerge) that takes multiple logically consistent streams as input and outputs a single stream that is compatible with all of them. LMerge can handle the dynamic attachment and detachment of input streams. We present a range of algorithms for LMerge that can exploit compile-time stream properties for efficiency. Experiments with Stream Insight, a commercial DSMS, show that LMerge is sometimes orders-of-magnitude more efficient than enforcing determinism on inputs, and that there is benefit to using specialized algorithms when stream variability is limited. We also show that LMerge and its extensions can provide performance benefits in several real-world applications.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117140483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes the meta-data warehouse of Credit Suisse, which has been in production since 2009. Like most other large organizations, Credit Suisse has a complex application landscape and several data warehouses in order to meet the information needs of its users. The problem addressed by the meta-data warehouse is to increase the agility and flexibility of the organization with regard to changes such as the development of a new business process, a new business-analytics report, or the implementation of a new regulatory requirement. The meta-data warehouse supports these changes by providing services to search for information items in the data warehouses and to extract the lineage of information items. One difficulty in the design of such a meta-data warehouse is that there is no standard or well-known meta-data model that supports such search services. Instead, the meta-data structures need to be flexible themselves and evolve with the changing IT landscape. This paper describes the current data structures and implementation of the Credit Suisse meta-data warehouse and shows how its services help to increase the flexibility of the whole organization. A series of example meta-data structures, use cases, and screenshots illustrate the concepts used and the lessons learned, based on feedback from real business and IT users within Credit Suisse.
{"title":"The Credit Suisse Meta-data Warehouse","authors":"Claudio Jossen, Lukas Blunschi, M. Mori, Donald Kossmann, Kurt Stockinger","doi":"10.1109/ICDE.2012.41","DOIUrl":"https://doi.org/10.1109/ICDE.2012.41","url":null,"abstract":"This paper describes the meta-data warehouse of Credit Suisse that is productive since 2009. Like most other large organizations, Credit Suisse has a complex application landscape and several data warehouses in order to meet the information needs of its users. The problem addressed by the meta-data warehouse is to increase the agility and flexibility of the organization with regards to changes such as the development of a new business process, a new business analytics report, or the implementation of a new regulatory requirement. The meta-data warehouse supports these changes by providing services to search for information items in the data warehouses and to extract the lineage of information items. One difficulty in the design of such a meta-data warehouse is that there is no standard or well-known meta-data model that can be used to support such search services. Instead, the meta-data structures need to be flexible themselves and evolve with the changing IT landscape. This paper describes the current data structures and implementation of the Credit Suisse meta-data warehouse and shows how its services help to increase the flexibility of the whole organization. A series of example meta-data structures, use cases, and screenshots are given in order to illustrate the concepts used and the lessons learned based on feedback of real business and IT users within Credit Suisse.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123848773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Big Data" in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Previous work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing, (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams. Our contributions are twofold. First, we propose a novel framework called TiMR (pronounced timer), that combines a time-oriented data processing system with a M-R framework. Users write and submit analysis algorithms as temporal queries - these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. Second, we show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.
{"title":"Temporal Analytics on Big Data for Web Advertising","authors":"B. Chandramouli, J. Goldstein, S. Duan","doi":"10.1109/ICDE.2012.55","DOIUrl":"https://doi.org/10.1109/ICDE.2012.55","url":null,"abstract":"\"Big Data\" in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. Previous work on BT has focused on techniques that scale well for offline data using M-R. However, this approach has limitations for BT-style applications that deal with temporal data: (1) many queries are temporal and not easily expressible in M-R, and moreover, the set-oriented nature of M-R front-ends such as SCOPE is not suitable for temporal processing, (2) as commercial systems mature, they may need to also directly analyze and react to real-time data feeds since a high turnaround time can result in missed opportunities, but it is difficult for current solutions to naturally also operate over real-time streams. Our contributions are twofold. First, we propose a novel framework called TiMR (pronounced timer), that combines a time-oriented data processing system with a M-R framework. Users write and submit analysis algorithms as temporal queries - these queries are succinct, scale-out-agnostic, and easy to write. They scale well on large-scale offline data using TiMR, and can work unmodified over real-time streams. We also propose new cost-based query fragmentation and temporal partitioning schemes for improving efficiency with TiMR. Second, we show the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities. Experiments using real data from a commercial ad platform show that TiMR is very efficient and incurs orders-of-magnitude lower development effort. Our BT solution is easy and succinct, and performs up to several times better than current schemes in terms of memory, learning time, and click-through-rate/coverage.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123415442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uncertainties in data can arise for a number of reasons: when data is incomplete, contains conflicting information, or has been deliberately perturbed or coarsened to remove sensitive details. An important case arising in many real applications is when the data describes a set of possibilities subject to cardinality constraints. These constraints represent correlations between tuples, encoding, for example, that at most two of the possible records are correct, or that there is an (unknown) one-to-one mapping between a set of tuples and attribute values. Although there has been much effort to handle uncertain data, current systems are not equipped to handle such correlations beyond simple mutual exclusion and co-existence constraints. Crucially, they have little support for efficiently handling aggregate queries on such data. In this paper, we address some of these deficiencies by introducing LICM (Linear Integer Constraint Model), which can succinctly represent many types of tuple correlations, particularly a class of cardinality constraints. We motivate and explain the model with examples from data cleaning and masking sensitive data, showing that it enables modeling and querying such data in ways that were not previously possible. We develop an efficient strategy to answer conjunctive and aggregate queries on possibilistic data by describing how to implement relational operators over data in the model. LICM compactly integrates the encoding of correlations, query answering, and lineage recording. In combination with off-the-shelf linear integer programming solvers, our approach provides exact bounds for aggregate queries. Our prototype implementation demonstrates that query answering with LICM can be effective and scalable.
{"title":"Aggregate Query Answering on Possibilistic Data with Cardinality Constraints","authors":"Graham Cormode, D. Srivastava, E. Shen, Ting Yu","doi":"10.1109/ICDE.2012.15","DOIUrl":"https://doi.org/10.1109/ICDE.2012.15","url":null,"abstract":"Uncertainties in data can arise for a number of reasons: when data is incomplete, contains conflicting information or has been deliberately perturbed or coarsened to remove sensitive details. An important case which arises in many real applications is when the data describes a set of possibilities, but with cardinality constraints. These constraints represent correlations between tuples encoding, e.g. that at most two possible records are correct, or that there is an (unknown) one-to-one mapping between a set of tuples and attribute values. Although there has been much effort to handle uncertain data, current systems are not equipped to handle such correlations, beyond simple mutual exclusion and co-existence constraints. Vitally, they have little support for efficiently handling aggregate queries on such data. In this paper, we aim to address some of these deficiencies, by introducing LICM (Linear Integer Constraint Model), which can succinctly represent many types of tuple correlations, particularly a class of cardinality constraints. We motivate and explain the model with examples from data cleaning and masking sensitive data, to show that it enables modeling and querying such data, which was not previously possible. We develop an efficient strategy to answer conjunctive and aggregate queries on possibilistic data by describing how to implement relational operators over data in the model. LICM compactly integrates the encoding of correlations, query answering and lineage recording. In combination with off-the-shelf linear integer programming solvers, our approach provides exact bounds for aggregate queries. Our prototype implementation demonstrates that query answering with LICM can be effective and scalable.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124144077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}