Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544863
Sarana Nutanong, H. Samet
Incrementally finding the k nearest neighbors (kNN) in a spatial network is an important problem in location-based services. One method (INE) simply applies Dijkstra's algorithm. Another method (IER) computes the k nearest neighbors using Euclidean distance, computes their corresponding network distances, and then incrementally finds the next nearest neighbors in order of increasing Euclidean distance until it finds one whose Euclidean distance is greater than the network distance of the current kth nearest neighbor. The LBC method improves on INE by using a Euclidean heuristic estimator to avoid visiting nodes that cannot possibly lead to the k nearest neighbors. It improves on IER by performing multiple instances of heuristic search, using a Euclidean heuristic estimator on candidate objects around the query point, which avoids repeated visits to nodes in the spatial network that appear on the shortest paths to different members of the k nearest neighbors. LBC's drawback is that maintaining multiple instances of heuristic search (called wavefronts) requires k priority queues, and the queue operations needed to maintain them incur a high in-memory processing cost. A method (SWH) is proposed that uses a novel heuristic function which considers the objects surrounding the query point together as a single unit, instead of one destination at a time as in LBC, thereby eliminating the need for multiple wavefronts and requiring just one priority queue. This results in a significant reduction in the in-memory processing cost while retaining the same reduced cost of access to the spatial network as LBC. SWH is also extended to support the incremental distance semi-join (IDSJ) query, which is a multiple-query-point generalization of the kNN query. In addition, SWH is shown to support landmark-based heuristic functions, thereby enabling it to be applied to non-spatial networks/graphs such as social networks.
Experiments on kNN queries show that SWH is 2.5 times faster than INE, the best single-wavefront method, and 3.5 times faster than LBC, the best existing heuristic search method. For IDSJ queries, SWH-IDSJ is 5 times faster than INE-IDSJ and 4 times faster than LBC-IDSJ.
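As a point of reference for the baseline, the INE idea of "simply applying Dijkstra's algorithm" can be sketched as an incremental generator that reports objects in order of increasing network distance. This is only an illustrative sketch, not the authors' implementation: the adjacency-list format, the function name `ine_knn`, the toy graph, and the assumption that objects sit exactly on network nodes are all choices made here for the example.

```python
import heapq
from itertools import islice
from typing import Dict, Iterator, List, Set, Tuple

# Assumed representation: node -> list of (neighbor, edge length).
Graph = Dict[str, List[Tuple[str, float]]]


def ine_knn(graph: Graph, objects: Set[str], query: str) -> Iterator[Tuple[str, float]]:
    """Yield (object node, network distance) in increasing distance order."""
    dist = {query: 0.0}
    heap = [(0.0, query)]  # a single priority queue, i.e. a single wavefront
    settled = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in settled:
            continue  # stale queue entry left over from an earlier relaxation
        settled.add(node)
        if node in objects:
            yield node, d  # the next nearest neighbor is final once settled
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))


# Hypothetical toy network: query point "q", objects located at nodes b, c, d.
graph = {
    "q": [("a", 2.0), ("b", 5.0)],
    "a": [("q", 2.0), ("b", 1.0), ("c", 4.0)],
    "b": [("q", 5.0), ("a", 1.0), ("d", 2.0)],
    "c": [("a", 4.0)],
    "d": [("b", 2.0)],
}
objects = {"b", "c", "d"}

two_nearest = list(islice(ine_knn(graph, objects, "q"), 2))
print(two_nearest)
```

Because the search is a generator, asking for one more neighbor resumes the expansion instead of restarting it, which is what makes the query incremental. The sketch expands nodes in every direction; the heuristic methods described in the abstract aim to prune that expansion.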
Title: Memory-efficient algorithms for spatial network queries. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544935
Dan Olteanu, Lampros Papageorgiou, Sebastiaan J. van Schaik
Πgora is an integration system for probabilistic data modelled using different formalisms such as pc-tables, Bayesian networks, and stochastic automata. User queries are expressed over a global relational layer and are evaluated by Πgora using a range of strategies, including data conversion into one probabilistic formalism followed by evaluation using a formalism-specific engine, and hybrid plans, where subqueries are evaluated using engines for different formalisms. This demonstration allows users to experience Πgora on real-world heterogeneous data sources from the medical domain.
Title: Πgora: An Integration System for Probabilistic Data. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544841
Bhargav Kanagal, Amr Ahmed, Sandeep Pandey, V. Josifovski, Lluis Garcia Pueyo, Jeffrey Yuan
Audience selection is a key problem in display advertising systems: we need to select a list of users who are interested in an advertising campaign (i.e., most likely to buy). The users' past feedback on this campaign can be leveraged to construct such a list using collaborative filtering techniques such as matrix factorization. However, the user-campaign interaction is typically extremely sparse, so conventional matrix factorization does not perform well. Moreover, simply combining the users' feedback from all campaigns does not address this, since it dilutes the focus on the target campaign under consideration. To resolve these issues, we propose a novel focused matrix factorization model (FMF) which learns users' preferences towards the specific campaign products, while also exploiting the information about related products. We exploit the product taxonomy to discover related campaigns, and design models to discriminate between the users' interest towards campaign products and non-campaign products. We develop a parallel multi-core implementation of the FMF model and evaluate its performance over a real-world advertising dataset spanning more than a million products. Our experiments demonstrate the benefits of using our models over existing approaches.
Title: Focused matrix factorization for audience selection in display advertising. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544927
Fabian Hueske, Mathias Peters, Aljoscha Krettek, M. Ringwald, K. Tzoumas, V. Markl, J. Freytag
Data flows are a popular abstraction to define data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates alone do not provide all the information needed to decide whether they can be reordered with relational operators and other UDFs. However, it is well known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude. We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs, which is used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers, including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties. We demonstrate our optimizer and a job submission client that allows users to peek step-by-step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs which highlight the salient features of our approach.
Title: Peeking into the optimization of data flow programs with MapReduce-style UDFs. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544868
Ye Wang, S. Parthasarathy, P. Sadayappan
With the increasing popularity of XML data stores, social networks and Web 2.0 and 3.0 applications, complex data formats, such as trees and graphs, are becoming ubiquitous. Managing and processing such large and complex data stores on modern computational ecosystems to realize actionable information efficiently is an important challenge. A critical element at the heart of this challenge relates to the placement, storage and access of such tera- and peta-scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification, which seeks to initially group structurally (or semantically) similar entities into strata. Subsequently, strata are partitioned within this ecosystem according to the needs of the application to maximize locality, balance load, or minimize data skew. Results on several real-world applications validate the efficacy and efficiency of our approach.
Title: Stratification driven placement of complex data: A framework for distributed data analytics. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544833
Haiquan Chen, Wei-Shinn Ku, Haixun Wang, L. Tang, Min-Te Sun
As one of the most important Semantic Web applications, social network analysis has attracted more and more interest from researchers due to the rapidly increasing availability of massive social network data. A desired solution for social network analysis should address the following issues. First, in many real-world applications, inference rules are partially correct. An ideal solution should be able to handle partially correct rules. Second, applications in practice often involve large amounts of data. The inference mechanism should scale up towards large-scale data. Third, inference methods should take into account probabilistic evidence data because these are domains abounding with uncertainty. Various solutions for social network analysis have existed for quite a few years; however, none of them support all the aforementioned features. In this paper, we design and implement LinkProbe, a prototype powered by Markov Logic Networks (MLNs) that quantitatively predicts the existence of links among nodes in large-scale social networks. MLN has been shown to be an effective inference model which can handle complex dependencies and partially correct rules. However, although MLN has shown acceptable performance in prior work, it is also reported to be impractical for large-scale data because of its high demands on inference time and memory consumption. In order to overcome these limitations, LinkProbe retrieves the k-backbone graphs and conducts the MLN inference on both the most globally influencing nodes and most locally related nodes. Our extensive experiments show that LinkProbe manages to provide a tunable balance between MLN inference accuracy and inference efficiency.
Title: LinkProbe: Probabilistic inference on large-scale social networks. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544816
Sumeet Bajaj, R. Sion
It has become apparent that in the digital world, data once stored is never truly deleted, even when such an expunction is desired either as a normal system function or for regulatory compliance purposes. Forensic analysis techniques on systems are often successful at recovering information said to have been deleted in the past. Efforts aimed at thwarting such forensic analysis of systems have focused either on (i) identifying the system components where deleted data lingers and performing a secure delete operation over these remnants, or (ii) designing history-independent data structures that hide information about the past operations which resulted in the current system state. Yet new data is constantly derived by processing existing (input) data, which makes it increasingly difficult to remove all traces of this existing data, e.g., for regulatory compliance purposes. Even after deletion, significant information can linger in and be recoverable from the side effects the deleted data records left on the currently available state. In this paper we address this aspect in the context of a relational database, such that when combined with (i) & (ii), complete erasure of data and its effects can be achieved (“un-traceable deletion”). We introduce Ficklebase, a relational database in which, once a tuple has been “expired”, any and all of its side effects are removed, thereby eliminating all its traces, rendering it unrecoverable, and also guaranteeing that the deletion itself is undetectable. We present the design and evaluation of Ficklebase, and then discuss several of the fundamental functional implications of un-traceable deletion.
Title: Ficklebase: Looking into the future to erase the past. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544821
Chunyao Song, Zheng Li, Tingjian Ge
Managing noisy and uncertain data is needed in a great number of modern applications. A major difficulty in managing such data is the sheer number of query result tuples with diverse probabilities. In many cases, users have a preference over the tuples in a deterministic world, determined by a scoring function. Yet it has been a challenging problem to return top-k for uncertain data. Various semantics have been proposed, and they have been shown to give wildly different tuple rankings. In this paper, we propose a completely different approach. Instead of returning users k tuples, which are merely one point in the complex distribution of top-k tuple vectors, we provide a so-called top-k oracle and users can arbitrarily query it. Intuitively, an oracle is a black box that, whenever given an SQL query, returns its result. Any information we give is based on faithful, best-effort estimates of the ground-truth top-k tuples. This is especially critical in emergency response applications and in monitoring top-k applications. Furthermore, we are the first to provide the nested query capability with the uncertain top-k result being a subquery. We devise various query processing algorithms for top-k oracles, and verify their efficiency and accuracy through a systematic evaluation over real-world and synthetic datasets.
Title: Top-K oracle: A new way to present top-k tuples for uncertain data. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date: 2013-04-08 DOI: 10.1109/ICDE.2013.6544931
C. Sayers, M. Hsu
To enable the interactive exploration of large social media datasets, we exploit the temporal distributions of word n-grams within the message stream to discover “interesting” concepts, determine “relatedness” between concepts, and find representative examples for display. We present a new algorithm for context-dependent “interestingness” using the coefficient of variation of the temporal distribution, apply the well-known technique of Pearson's correlation to tweets using equi-height histogramming to determine correlation, and employ an asymmetric variant for computing “relatedness” to encourage exploration. We further introduce techniques using interestingness, correlation, and relatedness to automatically discover concepts and select preferred word n-grams for display. These techniques are demonstrated on an 800,000-tweet dataset from the Academy Awards.
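The two statistics named in the abstract can be illustrated on per-time-bin counts of an n-gram. The sketch below is an assumption-laden illustration rather than the authors' algorithm: it treats the temporal distribution as a plain list of counts over fixed time bins, it does not reproduce the equi-height histogramming or the asymmetric relatedness variant, and the function names and sample counts are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import Sequence


def coefficient_of_variation(counts: Sequence[float]) -> float:
    """Population standard deviation divided by the mean (0.0 for an all-zero series)."""
    m = mean(counts)
    return pstdev(counts) / m if m else 0.0


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


# Hypothetical counts of three n-grams over four consecutive time bins.
steady = [5, 5, 5, 5]      # mentioned at a constant rate
bursty = [0, 0, 20, 0]     # mentioned only around one event
echo = [1, 1, 9, 1]        # spikes in the same bin as `bursty`

print(coefficient_of_variation(steady))  # no variation over time
print(coefficient_of_variation(bursty))  # strongly concentrated in time
print(pearson(bursty, echo))             # the two spikes coincide
```

Under this reading, an n-gram whose mentions are concentrated in time scores a high coefficient of variation, and two n-grams whose counts rise and fall together score a high correlation.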
Title: Extracting interesting related context-dependent concepts from social media streams using temporal distributions. Venue: 2013 IEEE 29th International Conference on Data Engineering (ICDE).
Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544890
Md. Saiful Islam, Rui Zhou, Chengfei Liu
This paper aims at answering the so-called why-not questions in reverse skyline queries. A reverse skyline query retrieves all data points whose dynamic skylines contain the query point. We outline the benefit and the semantics of answering why-not questions in reverse skyline queries. In connection with this, we show how to modify the why-not point and the query point to include the why-not point in the reverse skyline of the query point. We then show how a query point can be positioned safely anywhere within a region (called the safe region) without losing any of the existing reverse skyline points. We also show how to answer why-not questions considering the safe region of the query point. Our approach efficiently combines both query point and data point modification techniques to produce meaningful answers. Experimental results also demonstrate that our approach can produce high-quality explanations for why-not questions in reverse skyline queries.
{"title":"On answering why-not questions in reverse skyline queries","authors":"Md. Saiful Islam, Rui Zhou, Chengfei Liu","doi":"10.1109/ICDE.2013.6544890","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544890","journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)"},"publicationDate":"2013-04-08"}
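The reverse skyline definition in the abstract above (all data points whose dynamic skylines contain the query point) can be checked directly by brute force. The sketch below is an illustration of that definition only, not the paper's method: the function names and sample points are hypothetical, smaller per-dimension distance is assumed to be preferred, and the why-not modification and safe-region techniques are not shown.

```python
def dominates_wrt(x, q, p):
    """True if x dynamically dominates q with respect to p.

    x dominates q when x is at least as close to p as q in every
    dimension and strictly closer in at least one.
    """
    dist_x = [abs(a - c) for a, c in zip(x, p)]
    dist_q = [abs(b - c) for b, c in zip(q, p)]
    no_worse = all(a <= b for a, b in zip(dist_x, dist_q))
    strictly_better = any(a < b for a, b in zip(dist_x, dist_q))
    return no_worse and strictly_better


def reverse_skyline(points, q):
    """Data points p whose dynamic skyline contains q (quadratic-time check)."""
    result = []
    for i, p in enumerate(points):
        others = (x for j, x in enumerate(points) if j != i)
        # q is in the dynamic skyline of p if no other data point dominates it.
        if not any(dominates_wrt(x, q, p) for x in others):
            result.append(p)
    return result


# Hypothetical two-dimensional data set and query point.
data = [(1, 1), (2, 2), (10, 10)]
query = (0, 0)
rsl = reverse_skyline(data, query)
```

In this sample, (2, 2) is absent from the result because (1, 1) is closer to it than the query is in both dimensions. A why-not question in the sense of the abstract would ask why such a point is missing from the reverse skyline.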