Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544898
Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, D. Srivastava
The notion of heavy hitters - items that make up a large fraction of the population - has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: items that are frequent within the context of their parent item. In this work, we introduce and formalize the notion of Conditional Heavy Hitters to identify such items, with applications in network monitoring and Markov chain modeling. We introduce several streaming algorithms that find conditional heavy hitters efficiently, and provide analytical results. Different algorithms are successful for different input characteristics. We perform experimental evaluations to demonstrate the efficacy of our methods, and to study which algorithms are best suited for different types of data.
Title: Finding interesting correlations with conditional heavy hitters
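The conditional-frequency notion in the abstract can be made concrete with a small exact (offline) baseline - a sketch of the definition only, not one of the paper's streaming algorithms; the threshold name `phi` is our assumption:

```python
from collections import Counter

def conditional_heavy_hitters(pairs, phi):
    """Exact offline baseline: report (parent, child) pairs where the
    child accounts for at least a phi fraction of its parent's occurrences."""
    parent_counts = Counter(p for p, _ in pairs)
    pair_counts = Counter(pairs)
    return {
        (p, c): n / parent_counts[p]
        for (p, c), n in pair_counts.items()
        if n / parent_counts[p] >= phi
    }

# ("a", "x") occurs in 2 of 3 "a"-contexts; ("b", "z") in 1 of 1.
stream = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
result = conditional_heavy_hitters(stream, 0.5)
```

A streaming algorithm must approximate these ratios in sublinear space; this baseline only fixes the target semantics.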
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544811
Justin J. Levandoski, P. Larson, R. Stoica
Main memories are becoming sufficiently large that most OLTP databases can be stored entirely in main memory, but this may not be the best solution. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). It is more economical to store the coldest records on secondary storage such as flash. As a first step towards managing cold data in databases optimized for main memory, we investigate how to efficiently identify hot and cold data. We propose to log record accesses - possibly only a sample, to reduce overhead - and perform offline analysis to estimate record access frequencies. We present four estimation algorithms based on exponential smoothing and experimentally evaluate their efficiency and accuracy. We find that exponential smoothing produces very accurate estimates, leading to higher hit rates than the best caching techniques. Our most efficient algorithm can analyze a log of 1B accesses in sub-second time on a workstation-class machine.
Title: Identifying hot and cold data in main-memory databases
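The exponential-smoothing idea behind the estimators can be sketched as follows - a minimal per-time-slice version under our own assumptions (a single smoothing constant `alpha`, one update per slice), not one of the paper's four algorithms, which analyze a logged access trace offline:

```python
def update_estimate(est, accessed, alpha=0.05):
    """One smoothing step: blend the current slice's observation
    (1 if the record was accessed, else 0) into the running estimate."""
    return alpha * accessed + (1 - alpha) * est

# Replay a toy access log, one time slice per entry: "r1" is hot, "r2" cold.
log = ["r1", "r1", "r2", "r1", "r1"]
est = {"r1": 0.0, "r2": 0.0}
for rec in log:
    for key in est:
        est[key] = update_estimate(est[key], 1 if key == rec else 0)
```

Records with the highest resulting estimates are classified as hot; the rest are candidates for migration to flash.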
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544947
Kisung Lee, Emre Yigitoglu, Ling Liu, Binh Han, Balaji Palanisamy, C. Pu
Spatial alarms are one of the fundamental functionalities for many location-based services (LBSs). We argue that spatial alarms should be road network aware, as mobile objects travel on spatially constrained road networks or walk paths. In this software system demonstration, we present the first prototype of ROADALARM - a spatial alarm processing system for moving objects on road networks. The demonstration focuses on three unique features of the ROADALARM system design. First, we show that spatial alarms are best modeled using road network distances, such as segment-length-based and travel-time-based distance; a road network spatial alarm is thus a star-like subgraph centered at the alarm target. Second, we show a suite of ROADALARM optimization techniques that scale spatial alarm processing by taking into account spatial constraints on road networks and the mobility patterns of mobile subscribers. Third, we show that, by equipping ROADALARM with an activity monitoring-based control panel, the system administrator and end users can visualize road network-based spatial alarms and mobility traces of moving objects, and can dynamically select or customize the ROADALARM techniques for spatial alarm processing through a graphical user interface. We show that the ROADALARM system provides both the general system architecture and the essential building blocks for location-based advertisements and location-based reminders.
Title: RoadAlarm: A spatial alarm system on road networks
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544807
G. Alonso
Until relatively recently, the development of data processing applications took place largely ignoring the underlying hardware. Only in niche applications (supercomputing, embedded systems) or in special software (operating systems, database internals, language runtimes) did (some) programmers have to pay attention to the actual hardware where the software would run. In most cases, working atop the abstractions provided by either the operating system or by system libraries was good enough. The constant improvements in processor speed did the rest. The new millennium has radically changed the picture. Driven by multiple needs - e.g., scale, physical constraints, energy limitations, virtualization, business models - hardware architectures are changing at a speed and in ways that current development practices for data processing cannot accommodate. From now on, software will have to be developed paying close attention to the underlying hardware and following strict performance engineering principles. In this paper, several aspects of the ongoing hardware revolution and its impact on data processing are analysed, pointing to the need for new strategies to tackle the challenges ahead.
Title: Hardware killed the software star
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544889
Xiaochun Yang, Bin Wang, Chen Li, Jiaying Wang, Xiaohui Xie
The explosive growth in the amount of data produced by next-generation sequencing poses significant computational challenges in how to store, transmit, and query these data efficiently and accurately. A unique characteristic of genomic sequence data is that many sequences can be highly similar to each other, which has motivated the idea of compressing sequence data by storing only their differences from a reference sequence, thereby drastically cutting the storage cost. However, an unresolved question in this area is whether it is possible to perform search directly on the compressed data, and if so, how. Here we show that directly querying compressed genomic sequence data is possible and can be done efficiently. We describe a set of novel index structures and algorithms for this purpose, and present several optimization techniques to reduce the space requirement and query response time. We demonstrate the advantage of our method and compare it against existing ones through a thorough experimental study on real genomic data.
Title: Efficient direct search on compressed genomic data
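The reference-based storage idea can be illustrated with a deliberately simplified sketch - same-length sequences and single-character substitutions only, with hypothetical function names; the paper's index structures and optimizations go far beyond this:

```python
def compress(seq, ref):
    """Store only the positions where seq differs from the reference."""
    assert len(seq) == len(ref)
    return [(i, c) for i, (c, r) in enumerate(zip(seq, ref)) if c != r]

def char_at(diffs, ref, i):
    """Read one position of the compressed sequence without decompressing."""
    return dict(diffs).get(i, ref[i])

def matches_at(diffs, ref, pattern, start):
    """Check an exact match of `pattern` at `start`, directly on the diffs."""
    return all(char_at(diffs, ref, start + k) == ch
               for k, ch in enumerate(pattern))

ref = "ACGTACGT"
seq = "ACCTACGA"
diffs = compress(seq, ref)  # two substitutions suffice to encode seq
```

The point is that a query can be answered against `diffs` plus the shared reference, without ever materializing `seq`.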
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544845
Xike Xie, Hua Lu, T. Pedersen
Indoor spaces accommodate large parts of people's lives. The increasing availability of indoor positioning, driven by technologies like Wi-Fi, RFID, and Bluetooth, enables a variety of indoor location-based services (LBSs). Efficient indoor distance-aware queries on indoor moving objects play an important role in supporting and boosting such LBSs. However, distance-aware query evaluation on indoor moving objects is challenging because: (1) indoor spaces are characterized by many special entities and thus render distance calculation very complex; (2) the limitations of indoor positioning technologies create inherent uncertainties in indoor moving object data. In this paper, we propose a complete set of techniques for efficient distance-aware queries on indoor moving objects. We define and categorize the indoor distances in relation to indoor uncertain objects, and derive different distance bounds that can facilitate query evaluation. Existing works often assume indoor floor plans are static, and require extensive pre-computation on indoor topologies. In contrast, we design a composite index scheme that integrates indoor geometries, indoor topologies, and indoor uncertain objects, and thus supports indoor distance-aware queries efficiently without time-consuming and volatile distance computation. We design algorithms for range queries and k nearest neighbor queries on indoor moving objects.
Title: Efficient distance-aware query evaluation on indoor moving objects
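The role of distance bounds in query evaluation can be sketched generically - the standard filter step over lower/upper distance bounds, our own simplification rather than the paper's indoor-specific derivations:

```python
def range_filter(objects, r):
    """Split uncertain objects by their (lower, upper) distance bounds:
    upper <= r -> certainly within range; lower > r -> pruned;
    otherwise the object needs exact (expensive) evaluation."""
    certain, candidates = [], []
    for name, (lo, up) in objects.items():
        if up <= r:
            certain.append(name)
        elif lo <= r:
            candidates.append(name)
    return certain, candidates

# Hypothetical objects with precomputed indoor-distance bounds.
objects = {"o1": (1.0, 2.0), "o2": (3.0, 5.0), "o3": (6.0, 8.0)}
certain, candidates = range_filter(objects, 4.0)
```

Only the candidates in the middle band incur an exact distance computation, which is where cheap bounds pay off.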
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544902
C. Ungureanu, Biplob K. Debnath, Stephen A. Rago, Akshat Aranya
The performance and capacity characteristics of flash storage make it attractive to use as a cache. Recency-based cache replacement policies rely on an in-memory full index, typically a B-tree or a hash table, that maps each object to its recency information. Even though the recency information itself may take very little space, the full index for a cache holding N keys requires at least log N bits per key. This metadata overhead is undesirably high when used for very large flash-based caches, such as key-value stores with billions of objects. To solve this problem, we propose a new RAM-frugal cache replacement policy that approximates the least-recently-used (LRU) policy. It uses two in-memory Bloom sub-filters (TBF) for maintaining the recency information and leverages an on-flash key-value store to cache objects. TBF requires only one byte of RAM per cached object, making it suitable for implementing very large flash-based caches. We evaluate TBF through simulation on traces from several block stores and key-value stores, and in a real system implementation using the Yahoo! Cloud Serving Benchmark. Evaluation results show that TBF achieves cache hit rate and operations per second comparable to those of LRU in spite of its much smaller memory requirements.
Title: TBF: A memory-efficient replacement policy for flash-based caches
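The two-sub-filter recency idea can be sketched in memory only - a toy version with our own parameters and hash choice; the real TBF pairs the filters with an on-flash key-value store and careful sizing to reach one byte of RAM per object:

```python
import hashlib

class TwoBloom:
    """Toy two-sub-filter recency tracker: membership in either filter
    means 'recently used'; on rotation the older filter is dropped and
    a fresh one started, so stale entries age out automatically."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.cur = bytearray(m)  # bits touched in the current period
        self.old = bytearray(m)  # bits from the previous period

    def _bits(self, key):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def touch(self, key):
        for b in self._bits(key):
            self.cur[b] = 1

    def recent(self, key):
        return (all(self.cur[b] for b in self._bits(key))
                or all(self.old[b] for b in self._bits(key)))

    def rotate(self):
        self.old, self.cur = self.cur, bytearray(self.m)
```

Rotation replaces explicit LRU timestamps: anything untouched for two rotation periods stops being reported as recent (modulo the Bloom filters' small false-positive rate).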
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544852
Suman Nath, R. Venkatesan
Outsourcing data streams and desired computations to a third party such as the cloud is a desirable option for many companies. However, data outsourcing and remote computations intrinsically raise issues of trust, making it crucial to verify the results returned by third parties. In this context, we propose a novel solution to verify outsourced grouped aggregation queries (e.g., histogram or SQL Group-by queries) that are common in many business applications. We consider a setting where a data owner employs an untrusted remote server to run continuous grouped aggregation queries on a data stream it forwards to the server. Untrusted clients then query the server for results and efficiently verify correctness of the results by using a small and easy-to-compute signature provided by the data owner. Our work complements previous works on authenticating remote computation of selection and aggregation queries. The most important aspect of our solution is that it is publicly verifiable - unlike most prior works, we support untrusted clients (who can collude with other clients or with the server). Experimental results on real and synthetic data show that our solution is practical and efficient.
Title: Publicly verifiable grouped aggregation queries on outsourced data streams
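One building block such verification schemes can rest on is an additive (order-independent) multiset hash, AdHash-style: the owner folds every stream element into a single digest that anyone can recompute from a claimed result. The sketch below is our own illustration of that primitive, not the paper's protocol, and all names are assumptions:

```python
import hashlib

P = 2**127 - 1  # prime modulus for the additive digest

def item_hash(group, value):
    """Map one (group, value) stream element to an integer mod P."""
    h = hashlib.sha256(f"{group}:{value}".encode()).digest()
    return int.from_bytes(h, "big") % P

def digest(items):
    """Order-independent digest: the owner can maintain it incrementally
    as the stream flows by, and publish it for clients to check against."""
    return sum(item_hash(g, v) for g, v in items) % P

stream = [("g1", 3), ("g1", 4), ("g2", 5)]
sig = digest(stream)
```

Because the digest is a sum, it is insensitive to element order but sensitive to any dropped, added, or altered element.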
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544905
Tao Cheng, K. Chakrabarti, S. Chaudhuri, Vivek R. Narasayya, M. Syamala
Retail is increasingly moving online. There are only a few big e-tailers, but there is a long tail of small e-tailers. The big e-tailers are able to collect significant data on user activities at their websites. They use these assets to derive insights about their products and to provide superior experiences for their users. On the other hand, small e-tailers do not possess such user data and hence cannot match the rich user experiences offered by big e-tailers. Our key insight is that web search engines possess significant data on user behaviors that can be used to help smaller e-tailers mine the same signals that big e-tailers derive from their proprietary user data assets. These signals can be exposed as data services in the cloud; e-tailers can leverage them to enable user experiences similar to those of the big e-tailers. We present three such data services in the paper: the entity synonym data service, the query-to-entity data service, and the entity tagging data service. The entity synonym service is an in-production data service that is currently available, while the other two are data services currently in development at Microsoft. Our experiments on product datasets show (i) these data services have high quality and (ii) they have significant impact on user experiences on e-tailer websites. To the best of our knowledge, this is the first paper to explore the potential of using search engine data assets for e-tailers.
Title: Data services for E-tailers leveraging web search engine assets
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544819
Maximilian Dylla, Iris Miliaraki, M. Theobald
We investigate a novel approach to computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike related approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules that we ground in a top-down fashion directly at query processing time. To our knowledge, this work is the first to address integrated data and confidence computations for intensional query evaluations in the context of probabilistic databases by considering confidence bounds over first-order lineage formulas. We extend our query processing techniques with a suite of scheduling strategies based on selectivity estimation and the expected impact on confidence bounds. Further extensions to our query processing strategies include improved top-k bounds in the case when sorted relations are available as input, as well as the consideration of recursive rules. Experiments with large datasets demonstrate significant runtime improvements of our approach compared to both exact and sampling-based top-k methods over probabilistic data.
Title: Top-k query processing in probabilistic databases with non-materialized views
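The generic interval-pruning idea used by such bound-based top-k algorithms can be sketched as follows - our own simplification over explicit (lower, upper) probability bounds, not the paper's lineage-based computation:

```python
def prune_topk(candidates, k):
    """Drop every answer candidate whose upper confidence bound falls
    below the k-th largest lower bound: it can no longer reach the top-k."""
    if len(candidates) <= k:
        return dict(candidates)
    kth_lower = sorted((lo for lo, _ in candidates.values()),
                       reverse=True)[k - 1]
    return {a: (lo, up) for a, (lo, up) in candidates.items()
            if up >= kth_lower}

# Hypothetical answer candidates with marginal-probability bounds.
cands = {"a": (0.8, 0.9), "b": (0.5, 0.95), "c": (0.1, 0.3), "d": (0.4, 0.6)}
survivors = prune_topk(cands, 2)
```

Pruned candidates never need their exact marginal probability computed, which is where such algorithms save work over full materialization.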