Distributed and interactive cube exploration
N. Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816674
Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies at specified accuracy levels. Our framework combines three concepts: faceted exploration of data cubes, speculative execution of queries, and query execution over subsets of data. We discuss design considerations, implementation details, and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.
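The abstract does not describe DICE's internals, so the following Python sketch is only a minimal illustration of how its three ingredients (faceted exploration, speculative execution, and querying a subset of the data) could interact within a session; all function names, the neighbor model, and the sampling scheme are assumptions of this sketch, not DICE's actual design.

```python
# Minimal sketch (not DICE's implementation): answer the current facet from a
# sample, then speculatively pre-compute the facets the user is likely to ask next.
import random
from collections import defaultdict

def approx_group_sum(rows, group_by, sample_rate=0.1, seed=42):
    """Approximate SUM(measure) GROUP BY `group_by`, computed over a data subset."""
    rng = random.Random(seed)
    sums = defaultdict(float)
    for row in rows:
        if rng.random() < sample_rate:                 # query only a sample of the data
            sums[row[group_by]] += row["measure"] / sample_rate  # scale the estimate up
    return dict(sums)

def explore(rows, facet, all_facets, cache):
    """Answer the current facet, then speculatively warm the cache for sibling facets."""
    if facet not in cache:
        cache[facet] = approx_group_sum(rows, facet)
    for nxt in all_facets:                             # speculative execution of follow-ups
        if nxt != facet and nxt not in cache:
            cache[nxt] = approx_group_sum(rows, nxt)
    return cache[facet]

if __name__ == "__main__":
    data = [{"region": r, "year": y, "measure": 1.0}
            for r in ("EU", "US", "APAC") for y in (2012, 2013, 2014) for _ in range(1000)]
    cache = {}
    print(explore(data, "region", ("region", "year"), cache))  # sampled estimate per region
    print(sorted(cache))                                        # "year" is already pre-computed
```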
{"title":"Distributed and interactive cube exploration","authors":"N. Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi","doi":"10.1109/ICDE.2014.6816674","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816674","url":null,"abstract":"Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels. A novel framework is provided that combines three concepts: faceted exploration of data cubes, speculative execution of queries and query execution over subsets of data. We discuss design considerations, implementation details and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127851142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We can learn your #hashtags: Connecting tweets to explicit topics
W. Feng, Jianyong Wang
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816706
On Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. From the system's perspective, hashtags play an important role in tweet retrieval, event detection, topic tracking, advertising, and more. Annotating tweets with the right hashtags can lead to a better user experience. However, two problems remain unsolved during annotation: (1) Before the user decides to create a new hashtag, is there any way to help her/him find out whether some related hashtags have already been created and widely used? (2) Different users may have different preferences for categorizing tweets, yet little work has been done to study personalization in hashtag recommendation. To address these problems, we propose a statistical model for personalized hashtag recommendation. With millions of ⟨tweet, hashtag⟩ pairs being published every day, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Two questions are answered in the model: (1) Unlike traditional item recommendation data, users and tweets on Twitter carry rich auxiliary information such as URLs, mentions, locations, and social relations. How can we incorporate these features for hashtag recommendation? (2) Different hashtags have different temporal characteristics: hashtags related to breaking events in the physical world show strong rise-and-fall temporal patterns, while others remain stable in the system. How can we incorporate hashtag-related features into hashtag recommendation? With all of these factors considered, we show that our model outperforms existing methods on real datasets crawled from Twitter.
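As a toy illustration only: the paper's statistical model is not specified in the abstract, so the sketch below simply combines the three families of signals it mentions (content match, per-user preference, and temporal recency). Every weight, feature, and function name here is an assumption of this sketch rather than the authors' model.

```python
# Toy personalized hashtag scorer combining content, personalization, and recency signals.
import math, time
from collections import Counter

def score_hashtag(tweet_tokens, hashtag, user_history, last_used_ts,
                  now=None, half_life_hours=24.0):
    now = now or time.time()
    # 1) Content signal: token overlap between the tweet and the hashtag's past tweets.
    tag_profile = Counter(t for tw in hashtag["tweets"] for t in tw)
    content = sum(tag_profile[t] for t in tweet_tokens) / (1 + sum(tag_profile.values()))
    # 2) Personalization signal: how often this user picked this hashtag before.
    personal = user_history.get(hashtag["name"], 0) / (1 + sum(user_history.values()))
    # 3) Temporal signal: event-driven hashtags decay quickly, stable ones do not.
    age_hours = (now - last_used_ts) / 3600.0
    recency = math.exp(-math.log(2) * age_hours / half_life_hours)
    return 0.5 * content + 0.3 * personal + 0.2 * recency   # made-up weights

if __name__ == "__main__":
    tag = {"name": "#worldcup", "tweets": [["brazil", "match", "goal"], ["goal", "final"]]}
    history = {"#worldcup": 3, "#python": 1}
    print(score_hashtag(["amazing", "goal"], tag, history, last_used_ts=time.time() - 3600))
```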
{"title":"We can learn your #hashtags: Connecting tweets to explicit topics","authors":"W. Feng, Jianyong Wang","doi":"10.1109/ICDE.2014.6816706","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816706","url":null,"abstract":"In Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. From the system's perspective, hashtags play an important role in tweet retrieval, event detection, topic tracking, and advertising, etc. Annotating tweets with the right hashtags can lead to a better user experience. However, two problems remain unsolved during an annotation: (1) Before the user decides to create a new hashtag, is there any way to help her/him find out whether some related hashtags have already been created and widely used? (2) Different users may have different preferences for categorizing tweets. However, few work has been done to study the personalization issue in hashtag recommendation. To address the above problems, we propose a statistical model for personalized hashtag recommendation in this paper. With millions of <;tweet, hashtag> pairs being published everyday, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Two questions are answered in the model: (1) Different from traditional item recommendation data, users and tweets in Twitter have rich auxiliary information like URLs, mentions, locations, social relations, etc. How can we incorporate these features for hashtag recommendation? (2) Different hashtags have different temporal characteristics. Hashtags related to breaking events in the physical world have strong rise-and-fall temporal pattern while some other hashtags remain stable in the system. How can we incorporate hashtag related features to serve for hashtag recommendation? With all the above factors considered, we show that our model successfully outperforms existing methods on real datasets crawled from Twitter.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114648252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On masking topical intent in keyword search
Peng Wang, C. Ravishankar
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816656
Text-based search queries reveal user intent to the search engine, compromising privacy. Topical Intent Obfuscation (TIO) is a promising new approach to preserving user privacy: it masks topical intent by mixing real user queries with dummy queries matching a variety of other topics. Dummy queries are generated using a Dummy Query Generation Algorithm (DGA). We demonstrate various shortcomings in current TIO schemes and show how to correct them. Current schemes assume that DGA details are unknown to the adversary. We argue that this assumption is flawed, and show how knowledge of the DGA can be used to construct efficient attacks on TIO schemes, using an iterative DGA as an example. Our extensive experiments on real data sets show that our attacks can flag up to 80% of dummy queries. We also propose HDGA, a new DGA that we prove to be immune to the attacks based on DGA semantics that we describe.
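A rough sketch of the attack setting the abstract describes, under the paper's key assumption that the adversary knows the DGA: replay the generator against the observed query stream and flag anything it could have produced. The generator below is a hypothetical stand-in, not the iterative DGA or HDGA from the paper.

```python
# Sketch of a known-DGA attack: anything the generator could have derived from
# another observed query is flagged as a dummy. toy_dga is illustrative only.
def toy_dga(real_query, topics, k=3):
    """Stand-in deterministic DGA: derive k dummy queries from a real query."""
    return [f"{topic} {len(real_query.split())} facts" for topic in topics[:k]]

def flag_dummies(observed_stream, topics, k=3):
    """Adversary replays the known DGA over the stream and marks possible dummies."""
    flagged = set()
    for q in observed_stream:
        for candidate in observed_stream:
            if q != candidate and q in toy_dga(candidate, topics, k):
                flagged.add(q)
    return flagged

if __name__ == "__main__":
    topics = ["gardening", "astronomy", "cooking"]
    real = "cheap flights to paris"
    stream = [real] + toy_dga(real, topics)   # what the search engine observes
    print(flag_dummies(stream, topics))       # the three dummies are flagged, the real query is not
```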
{"title":"On masking topical intent in keyword search","authors":"Peng Wang, C. Ravishankar","doi":"10.1109/ICDE.2014.6816656","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816656","url":null,"abstract":"Text-based search queries reveal user intent to the search engine, compromising privacy. Topical Intent Obfuscation (TIO) is a promising new approach to preserving user privacy. TIO masks topical intent by mixing real user queries with dummy queries matching various different topics. Dummy queries are generated using a Dummy Query Generation Algorithm (DGA). We demonstrate various shortcomings in current TIO schemes, and show how to correct them. Current schemes assume that DGA details are unknown to the adversary. We argue that this is a flawed assumption, and show how DGA details can be used to construct efficient attacks on TIO schemes, using an iterative DGA as an example. Our extensive experiments on real data sets show that our attacks can flag up to 80% of dummy queries. We also propose HDGA, a new DGA that we prove to be immune to the attacks based on DGA semantics that we describe.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130125431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Vertica Query Optimizer: The case for specialized query optimizers
Nga Tran, Andrew Lamb, Lakshmikant Shrinivas, Sreenath Bodagala, J. Dave
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816727
The Vertica SQL Query Optimizer was written from the ground up for the Vertica Analytic Database. Its design and the tradeoffs we encountered during its implementation argue that the full power of novel database systems can only be realized with a carefully crafted custom Query Optimizer written specifically for the system in which it operates.
{"title":"The Vertica Query Optimizer: The case for specialized query optimizers","authors":"Nga Tran, Andrew Lamb, Lakshmikant Shrinivas, Sreenath Bodagala, J. Dave","doi":"10.1109/ICDE.2014.6816727","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816727","url":null,"abstract":"The Vertica SQL Query Optimizer was written from the ground up for the Vertica Analytic Database. Its design and the tradeoffs we encountered during its implementation argue that the full power of novel database systems can only be realized with a carefully crafted custom Query Optimizer written specifically for the system in which it operates.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123901674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLog: A high level graph analysis system using MapReduce
Jun Gao, Jiashuai Zhou, Chang Zhou, J. Yu
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816680
With the rapid growth of graphs in different applications, leveraging existing distributed data processing frameworks to manage large graphs has become inevitable. Although these frameworks reduce development cost, it remains cumbersome and error-prone for developers to implement complex graph analysis tasks in distributed environments. In addition, developers have to learn the details of these frameworks quite well, which is key to improving the performance of distributed jobs. This paper introduces a high-level query language called GLog and proposes an evaluation method that overcomes these limitations. Specifically, we first design an RG (Relational-Graph) data model that mixes relational and graph data, and extend Datalog to GLog over RG tables to support various graph analysis tasks. Second, we define operations on RG tables and present translation templates that convert a GLog query into a sequence of MapReduce jobs. Third, we propose two strategies, rule merging and iteration rewriting, to optimize the translated jobs. Experiments show that GLog not only expresses various graph analysis tasks more succinctly, but also achieves better performance than Pig, another high-level dataflow system, on most graph analysis tasks.
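GLog's syntax and its translation templates are not given in the abstract; the sketch below only illustrates, in plain Python, the general shape of translating one propagation-style graph rule (each vertex sends its value along its out-edges, and receivers sum what arrives) into a map phase and a reduce phase. The rule, table layout, and function names are assumptions of this sketch.

```python
# Illustrative map/reduce pair for one propagation rule over an edge table and a
# vertex-value table; not GLog's actual translation templates.
from itertools import groupby
from operator import itemgetter

def map_phase(edge_table, value_table):
    """Map: join each edge (src, dst) with src's current value and emit (dst, value)."""
    for src, dst in edge_table:
        yield dst, value_table[src]

def reduce_phase(mapped):
    """Reduce: group emitted pairs by destination vertex and aggregate with SUM."""
    result = {}
    for dst, group in groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0)):
        result[dst] = sum(v for _, v in group)
    return result

if __name__ == "__main__":
    edges = [("a", "b"), ("a", "c"), ("b", "c")]
    values = {"a": 1.0, "b": 2.0, "c": 0.0}
    print(reduce_phase(map_phase(edges, values)))   # {'b': 1.0, 'c': 3.0}
```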
{"title":"GLog: A high level graph analysis system using MapReduce","authors":"Jun Gao, Jiashuai Zhou, Chang Zhou, J. Yu","doi":"10.1109/ICDE.2014.6816680","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816680","url":null,"abstract":"With the rapid growth of graphs in different applications, it is inevitable to leverage existing distributed data processing frameworks in managing large graphs. Although these frameworks ease the developing cost, it is still cumbersome and error-prone for developers to implement complex graph analysis tasks in distributed environments. Additionally, developers have to learn the details of these frameworks quite well, which is a key to improve the performance of distributed jobs. This paper introduces a high level query language called GLog and proposes its evaluation method to overcome these limitations. Specifically, we first design a RG (Relational-Graph) data model to mix relational data and graph data, and extend Datalog to GLog on RG tables to support various graph analysis tasks. Second, we define operations on RG tables, and show translation templates to convert a GLog query into a sequence of MapReduce jobs. Third, we propose two strategies, namely rule merging and iteration rewriting, to optimize the translated jobs. The final experiments show that GLog can not only express various graph analysis tasks in a more succinct way, but also achieve a better performance for most of the graph analysis tasks than Pig, another high level dataflow system.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123370546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic generation of question answer pairs from noisy case logs
J. Ajmera, Sachindra Joshi, Ashish Verma, Amol Mittal
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816671
In a customer support scenario, a great deal of valuable information is recorded in the form of `case logs'. Case logs are primarily written for future reference or manual inspection, and are therefore written hastily and are very noisy. In this paper, we propose techniques that exploit these case logs to mine real customer concerns or problems and then map them to well-written knowledge articles for the enterprise. This mapping results in the generation of question-answer (QA) pairs. These QA pairs can be used for a variety of applications, such as dynamically updating the frequently-asked-questions (FAQs) or the knowledge repository. In this paper, we show the utility of these discovered QA pairs as training data for a question-answering system. Our approach for mining the case logs is based on a composite model consisting of two generative models: a hidden Markov model (HMM) and a latent Dirichlet allocation (LDA) model. The LDA model explains the long-range dependencies across words due to their semantic similarity, and the HMM models the sequential patterns present in the case logs. This processing yields crisp `problem statement' segments that are indicative of the real customer concerns. Our experiments show that this approach finds crisp problem statements in 56% of the cases and outperforms alternative segmentation methods such as HMM, LDA, and conditional random fields (CRF). After finding these crisp problem statements, appropriate answers are looked up from an existing knowledge repository index, forming candidate QA pairs. We show that considering only the problem-statement segments for which answers can be found further improves the segmentation performance to 82%. Finally, we show that when these QA pairs are used as training data, the performance of a question-answering system improves significantly.
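The HMM+LDA segmentation itself is too involved for a short sketch, but the filtering step the abstract describes (keep a candidate problem statement only if an answer can be found in the knowledge-article index, and emit the pair as training data) is easy to illustrate. The retrieval function below is a toy word-overlap stand-in, not the paper's index lookup.

```python
# Toy QA-pair construction: retain only problem-statement segments for which the
# knowledge index returns a sufficiently confident answer.
def retrieve_answer(problem_statement, article_index, min_overlap=2):
    """Toy retrieval: return the article title sharing the most words, above a floor."""
    words = set(problem_statement.lower().split())
    best_title, best_overlap = None, 0
    for title, body in article_index.items():
        overlap = len(words & set(body.lower().split()))
        if overlap > best_overlap:
            best_title, best_overlap = title, overlap
    return best_title if best_overlap >= min_overlap else None

def build_qa_pairs(problem_segments, article_index):
    """Emit (question, answer) training pairs, dropping segments with no match."""
    pairs = []
    for segment in problem_segments:
        answer = retrieve_answer(segment, article_index)
        if answer is not None:
            pairs.append((segment, answer))
    return pairs
```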
{"title":"Automatic generation of question answer pairs from noisy case logs","authors":"J. Ajmera, Sachindra Joshi, Ashish Verma, Amol Mittal","doi":"10.1109/ICDE.2014.6816671","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816671","url":null,"abstract":"In a customer support scenario, a lot of valuable information is recorded in the form of `case logs'. Case logs are primarily written for future references or manual inspections and therefore are written in a hasty manner and are very noisy. In this paper, we propose techniques that exploit these case logs to mine real customer concerns or problems and then map them to well written knowledge articles for that enterprise. This mapping results into generation of question-answer (QA) pairs. These QA pairs can be used for a variety of applications such as dynamically updating the frequently-asked-questions (FAQs), updating the knowledge repository etc. In this paper we show the utility of these discovered QA pairs as training data for a question-answering system. Our approach for mining the case logs is based on a composite model consisting of two generative models, viz, hidden Markov model (HMM) and latent Dirichlet allocation (LDA) model. The LDA model explains the long-range dependencies across words due to their semantic similarity and HMM models the sequential patterns present in these case logs. Such processing results in crisp `problem statement' segments which are indicative of the real customer concerns. Our experiments show that this approach finds crisp problem-statements in 56% of the cases and outperforms other alternate methods for segmentation such as HMM, LDA and conditional random field (CRF). After finding these crisp problem-statements, appropriate answers are looked up from an existing knowledge repository index forming candidate QA pairs. We show that considering only the problemstatement segments for which the answers can be found further improves the segmentation performance to 82%. Finally, we show that when these QA pairs are used as training data, the performance of a question-answering system can be improved significantly.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121459465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SILVERBACK: Scalable association mining for temporal data in columnar probabilistic databases
Yusheng Xie, Diana Palsetia, Goce Trajcevski, Ankit Agrawal, A. Choudhary
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816724
We address the problem of large-scale probabilistic association rule mining and consider the trade-offs between the accuracy of the mining results and the quest for scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated into an industrial application, and we present the commercially deployed SILVERBACK framework, developed at Voxsup Inc. SILVERBACK tackles the storage-efficiency problem with a probabilistic columnar infrastructure that uses Bloom filters and reservoir sampling. In addition, we introduce a probabilistic pruning technique, based on Apriori, for mining frequent itemsets; the proposed target-driven technique yields a significant reduction in the size of the frequent-itemset candidates. We present extensive experimental evaluations that demonstrate the benefits of a context-aware incorporation of infrastructure limitations into the corresponding research techniques. The experiments indicate that, compared to the traditional Hadoop-based approach of improving scalability by adding more hosts, SILVERBACK, which has been commercially deployed at Voxsup Inc. since May 2011, achieves much better run-time performance with negligible sacrifices in accuracy.
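A minimal sketch of the two storage ideas the abstract names: a Bloom-filter-backed column for cheap membership tests and a reservoir sample for count estimation. The sizes, hashing scheme, and class names are illustrative assumptions, not SILVERBACK's actual layout.

```python
# Illustrative probabilistic column: Bloom filter for membership, reservoir for sampling.
import hashlib
import random

class BloomColumn:
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)
    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m
    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, key):   # may give false positives, never false negatives
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

class Reservoir:
    """Classic reservoir sampling (Algorithm R) over an unbounded stream."""
    def __init__(self, size=1000, seed=7):
        self.size, self.sample, self.n = size, [], 0
        self.rng = random.Random(seed)
    def add(self, item):
        self.n += 1
        if len(self.sample) < self.size:
            self.sample.append(item)
        else:
            j = self.rng.randrange(self.n)
            if j < self.size:
                self.sample[j] = item
```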
{"title":"SILVERBACK: Scalable association mining for temporal data in columnar probabilistic databases","authors":"Yusheng Xie, Diana Palsetia, Goce Trajcevski, Ankit Agrawal, A. Choudhary","doi":"10.1109/ICDE.2014.6816724","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816724","url":null,"abstract":"We address the problem of large scale probabilistic association rule mining and consider the trade-offs between accuracy of the mining results and quest of scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated in an industrial application, and we present the commercially deployed SILVERBACK framework, developed at Voxsup Inc. SILVERBACK tackles the storage efficiency problem by proposing a probabilistic columnar infrastructure and using Bloom filters and reservoir sampling techniques. In addition, a probabilistic pruning technique has been introduced based on Apriori for mining frequent item-sets. The proposed target-driven technique yields a significant reduction on the size of the frequent item-set candidates. We present extensive experimental evaluations which demonstrate the benefits of a context-aware incorporation of infrastructure limitations into corresponding research techniques. The experiments indicate that, when compared to the traditional Hadoop-based approach for improving scalability by adding more hosts, SILVERBACK - which has been commercially deployed and developed at Voxsup Inc. since May 2011 - has much better run-time performance with negligible accuracy sacrifices.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115893655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data quality: The other face of Big Data
B. Saha, D. Srivastava
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816764
In our Big Data era, data is being generated, collected, and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences for the results of data analyses, the importance of veracity, the fourth `V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three `V's, volume, velocity, and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data speak for itself in order to discover its semantics. This tutorial presents recent results relevant to big data quality management, focusing on two major dimensions: (i) discovering quality issues from the data itself, and (ii) trading off accuracy versus efficiency. It also identifies a range of open problems for the community.
{"title":"Data quality: The other face of Big Data","authors":"B. Saha, D. Srivastava","doi":"10.1109/ICDE.2014.6816764","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816764","url":null,"abstract":"In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth `V' of big data is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three `V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data to speak for itself” in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading-off accuracy vs efficiency, and identifies a range of open problems for the community.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132775227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Locality-sensitive operators for parallel main-memory database clusters
Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816684
The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key, but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to a factor of 5 for fuzzy co-location and by a factor of 3 for inputs with value skew.
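The paper's optimal partition assignment is not spelled out in the abstract; the greedy sketch below only illustrates the stated objective: send each partition to the node that already stores most of its tuples so that less data crosses the network, while keeping per-node load roughly balanced. The histogram format and the slack factor are assumptions of this sketch.

```python
# Greedy locality-aware partition assignment (illustrative, not the paper's algorithm).
def assign_partitions(histogram, num_nodes, balance_slack=1.2):
    """histogram[p][n] = number of partition-p tuples already resident on node n."""
    total = sum(sum(counts) for counts in histogram.values())
    capacity = balance_slack * total / num_nodes        # soft per-node load limit
    load = [0] * num_nodes
    assignment = {}
    # Process the largest partitions first so they get their preferred node.
    for part, counts in sorted(histogram.items(), key=lambda kv: -sum(kv[1])):
        # Prefer nodes by how much of this partition they already hold locally.
        for node in sorted(range(num_nodes), key=lambda n: -counts[n]):
            if load[node] + sum(counts) <= capacity:
                break                                    # fits: keep this node
        # If no node fits, fall back to the last candidate (least local data).
        assignment[part] = node
        load[node] += sum(counts)
    return assignment

# Partition 0 lives mostly on node 1, partition 1 mostly on node 0.
print(assign_partitions({0: [10, 90], 1: [80, 20]}, num_nodes=2))  # {0: 1, 1: 0}
```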
{"title":"Locality-sensitive operators for parallel main-memory database clusters","authors":"Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann","doi":"10.1109/ICDE.2014.6816684","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816684","url":null,"abstract":"The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114389554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
History-aware query optimization with materialized intermediate views
L. Perez, C. Jermaine
Pub Date: 2014-05-19 | DOI: 10.1109/ICDE.2014.6816678
The use of materialized views derived from the intermediate results of frequently executed queries is a popular strategy for improving performance in query workloads. Optimizers capable of matching such views with inbound queries can generate alternative execution plans that read the materialized contents directly instead of re-computing the corresponding subqueries, which tends to reduce query execution times. In this paper, we introduce an architecture called Hawc that extends a cost-based logical optimizer with the capability to use history information to identify query plans that, if executed, produce intermediate result sets from which materialized views can be created with the potential to reduce the execution time of future queries. We present techniques for using knowledge of past queries to assist the query optimizer in matching, generating, and selecting useful materialized views. Experimental results indicate that these techniques provide substantial improvements in workload execution time.
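Hawc's plan enumeration and view-matching machinery are not detailed in the abstract; this sketch only shows the bookkeeping idea in miniature: track which logical subexpressions recur in the workload history, materialize the hot ones, and let later queries substitute the cached result for that subplan. All class and method names are hypothetical.

```python
# Toy history-aware view advisor keyed by normalized subexpression strings.
from collections import Counter

class ViewAdvisor:
    def __init__(self, materialize_after=3):
        self.freq = Counter()                # how often each subexpression was seen
        self.materialized = {}               # normalized subexpression -> cached rows
        self.materialize_after = materialize_after

    def observe(self, query_subexpressions):
        """Record the subexpressions of an executed plan (the history information)."""
        self.freq.update(query_subexpressions)

    def maybe_materialize(self, subexpr, compute_fn):
        """Materialize a subexpression once it has recurred often enough."""
        if self.freq[subexpr] >= self.materialize_after and subexpr not in self.materialized:
            self.materialized[subexpr] = compute_fn()

    def rewrite(self, query_subexpressions):
        """Return the parts of an inbound plan that can read a materialized view instead."""
        return {s: self.materialized[s] for s in query_subexpressions if s in self.materialized}
```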
{"title":"History-aware query optimization with materialized intermediate views","authors":"L. Perez, C. Jermaine","doi":"10.1109/ICDE.2014.6816678","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816678","url":null,"abstract":"The use of materialized views derived from the intermediate results of frequently executed queries is a popular strategy for improving performance in query workloads. Optimizers capable of matching such views with inbound queries can generate alternative execution plans that read the materialized contents directly instead of re-computing the corresponding subqueries, which tends to result in reduced query execution times. In this paper, we introduce an architecture called Hawc that extends a cost-based logical optimizer with the capability to use history information to identify query plans that, if executed, produce intermediate result sets that can be used to create materialized views with the potential to reduce the execution time of future queries. We present techniques for using knowledge of past queries to assist the query optimizer and match, generate and select useful materialized views. Experimental results indicate that these techniques provide substantial improvements in workload execution time.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115997595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}