
2014 IEEE 30th International Conference on Data Engineering: Latest Publications

SILVERBACK: Scalable association mining for temporal data in columnar probabilistic databases
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816724
Yusheng Xie, Diana Palsetia, Goce Trajcevski, Ankit Agrawal, A. Choudhary
We address the problem of large-scale probabilistic association rule mining and consider the trade-offs between the accuracy of the mining results and the quest for scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated into an industrial application, and we present the commercially deployed SILVERBACK framework, developed at Voxsup Inc. SILVERBACK tackles the storage-efficiency problem with a probabilistic columnar infrastructure that uses Bloom filters and reservoir sampling. In addition, a probabilistic pruning technique based on Apriori is introduced for mining frequent item-sets. The proposed target-driven technique yields a significant reduction in the size of the set of frequent item-set candidates. We present extensive experimental evaluations, which demonstrate the benefits of a context-aware incorporation of infrastructure limitations into the corresponding research techniques. The experiments indicate that, compared to the traditional Hadoop-based approach of improving scalability by adding more hosts, SILVERBACK, commercially deployed and developed at Voxsup Inc. since May 2011, achieves much better run-time performance with negligible sacrifices in accuracy.
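The abstract names two standard streaming primitives, Bloom filters and reservoir sampling. Below is a minimal self-contained sketch of both, not SILVERBACK's actual implementation; the bit-array size, hash count, and key format are illustrative only.

```python
import hashlib
import random

class BloomFilter:
    """Toy Bloom filter: k hash positions over a fixed bit array."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random(0)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, n)  # item kept with probability k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Both structures use constant memory regardless of stream length, which is the property that matters on modest hardware.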
Citations: 9
Automatic generation of question answer pairs from noisy case logs
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816671
J. Ajmera, Sachindra Joshi, Ashish Verma, Amol Mittal
In a customer support scenario, a lot of valuable information is recorded in the form of `case logs'. Case logs are primarily written for future reference or manual inspection, and are therefore written hastily and are very noisy. In this paper, we propose techniques that exploit these case logs to mine real customer concerns or problems and then map them to well-written knowledge articles for that enterprise. This mapping results in the generation of question-answer (QA) pairs. These QA pairs can be used for a variety of applications, such as dynamically updating frequently-asked-questions (FAQs) and updating the knowledge repository. In this paper we show the utility of these discovered QA pairs as training data for a question-answering system. Our approach for mining the case logs is based on a composite model consisting of two generative models: a hidden Markov model (HMM) and a latent Dirichlet allocation (LDA) model. The LDA model explains the long-range dependencies across words due to their semantic similarity, and the HMM models the sequential patterns present in these case logs. Such processing results in crisp `problem statement' segments which are indicative of the real customer concerns. Our experiments show that this approach finds crisp problem statements in 56% of the cases and outperforms alternative segmentation methods such as HMM, LDA, and conditional random fields (CRF). After finding these crisp problem statements, appropriate answers are looked up from an existing knowledge repository index, forming candidate QA pairs. We show that considering only the problem-statement segments for which answers can be found further improves the segmentation performance to 82%. Finally, we show that when these QA pairs are used as training data, the performance of a question-answering system can be improved significantly.
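The paper's segmentation model combines an HMM with LDA; as an illustration of just the HMM side, here is a toy two-state Viterbi decoder that tags each log token as coming from a `problem` state or a `noise` state. The states, transition matrix, and emission model are invented for illustration and are not taken from the paper.

```python
import math

# Hypothetical two-state HMM over log tokens.
STATES = ("problem", "noise")
START = {"problem": 0.5, "noise": 0.5}
TRANS = {"problem": {"problem": 0.8, "noise": 0.2},
         "noise": {"problem": 0.2, "noise": 0.8}}

def viterbi(tokens, emit_prob):
    """Most likely state sequence; emit_prob(state, token) must return
    a strictly positive probability (we work in log space)."""
    score = {s: math.log(START[s]) + math.log(emit_prob(s, tokens[0]))
             for s in STATES}
    back = []
    for tok in tokens[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(emit_prob(s, tok)))
            ptr[s] = prev
        back.append(ptr)
        score = new_score
    # Backtrack from the best final state.
    last = max(STATES, key=score.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The sticky 0.8 self-transitions are what produce contiguous "problem statement" segments rather than isolated flagged tokens.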
Citations: 6
We can learn your #hashtags: Connecting tweets to explicit topics
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816706
W. Feng, Jianyong Wang
In Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. From the system's perspective, hashtags play an important role in tweet retrieval, event detection, topic tracking, advertising, and more. Annotating tweets with the right hashtags can lead to a better user experience. However, two problems remain unsolved during annotation: (1) Before the user decides to create a new hashtag, is there any way to help her/him find out whether some related hashtags have already been created and widely used? (2) Different users may have different preferences for categorizing tweets; however, little work has been done to study the personalization issue in hashtag recommendation. To address the above problems, we propose a statistical model for personalized hashtag recommendation. With millions of ⟨tweet, hashtag⟩ pairs being published every day, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Two questions are answered in the model: (1) Unlike traditional item recommendation data, users and tweets in Twitter carry rich auxiliary information such as URLs, mentions, locations, and social relations. How can we incorporate these features for hashtag recommendation? (2) Different hashtags have different temporal characteristics. Hashtags related to breaking events in the physical world show a strong rise-and-fall temporal pattern, while other hashtags remain stable in the system. How can we incorporate hashtag-related features to serve hashtag recommendation? With all the above factors considered, we show that our model successfully outperforms existing methods on real datasets crawled from Twitter.
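As a rough illustration of combining a content feature with a temporal feature for hashtag ranking (the paper's statistical model is far richer), one might score each candidate by bag-of-words similarity to the tweet, discounted by an exponential recency decay. The function names, profile representation, and half-life parameter below are all hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_hashtags(tweet_words, profiles, last_used, now, half_life=24.0):
    """Rank candidate hashtags by content match, discounted by hours
    since the hashtag was last used (captures rise-and-fall behavior)."""
    scored = []
    for tag, profile in profiles.items():
        content = cosine(Counter(tweet_words), Counter(profile))
        age = now - last_used[tag]
        temporal = 0.5 ** (age / half_life)  # exponential decay
        scored.append((content * temporal, tag))
    return [tag for _, tag in sorted(scored, reverse=True)]
```

A stable hashtag would get a much longer half-life than an event hashtag; learning that per-tag weighting is the kind of thing the paper's model does from crowd data.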
Citations: 29
Distributed and interactive cube exploration
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816674
N. Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi
Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels. A novel framework is provided that combines three concepts: faceted exploration of data cubes, speculative execution of queries and query execution over subsets of data. We discuss design considerations, implementation details and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.
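Speculative execution over a faceted cube session can be illustrated by enumerating the lattice neighbors of the user's current group-by set, which a system could precompute while the user inspects the current result. This is only a sketch of the idea, not DICE's actual strategy, and the dimension names are invented.

```python
# Hypothetical cube dimensions.
ALL_DIMS = ("region", "product", "year", "channel")

def speculative_facets(current):
    """Neighboring group-by sets in the cube lattice: the current query
    with one dimension added (drill-down) or removed (roll-up)."""
    current = frozenset(current)
    neighbors = []
    for d in ALL_DIMS:
        if d not in current:
            neighbors.append(current | {d})  # drill-down candidates
    for d in current:
        neighbors.append(current - {d})      # roll-up candidates
    return neighbors
```

The session-oriented bet is that the user's next query is very likely one of these neighbors, so warming their results hides the distributed execution latency.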
Citations: 139
Incremental cluster evolution tracking from highly dynamic network data
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816635
Pei Lee, L. Lakshmanan, E. Milios
Dynamic networks are commonly found in the current web age. In scenarios like social networks and social media, dynamic networks are noisy, large-scale, and fast-evolving. In this paper, we focus on the cluster evolution tracking problem on highly dynamic networks, with a clear application to event evolution tracking. Several previous works on data stream clustering maintain clusters using a node-by-node approach. However, handling bulk updates, i.e., a subgraph at a time, is critical for achieving acceptable performance over very large, highly dynamic networks. We propose a subgraph-by-subgraph incremental tracking framework for cluster evolution. To illustrate the techniques in our framework, we consider the event evolution tracking task in social streams as an application, where a social stream and an event are modeled as a dynamic post network and a dynamic cluster, respectively. Monitoring through a fading time window, we introduce a skeletal graph to summarize the information in the dynamic network, and formalize cluster evolution patterns using a group of primitive evolution operations and their algebra. Two incremental computation algorithms are developed to maintain clusters and track evolution patterns as time rolls on and the network evolves. Our detailed experimental evaluation on large Twitter datasets demonstrates that our framework can effectively track the complete set of cluster evolution patterns from highly dynamic networks on the fly.
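Two of the primitive evolution operations the abstract alludes to, growing one cluster versus merging two, can be sketched with a union-find structure over arriving edges. The paper's framework handles much richer patterns (splits, decay via a fading window, skeletal graphs); this only shows the incremental flavor.

```python
class IncrementalClusters:
    """Union-find sketch: each new edge either lands inside an existing
    cluster ("grow") or joins two clusters into one ("merge")."""
    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_edge(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru == rv:
            return "grow"      # intra-cluster edge
        self.parent[ru] = rv
        return "merge"         # inter-cluster edge

    def clusters(self):
        groups = {}
        for node in list(self.parent):
            groups.setdefault(self._find(node), set()).add(node)
        return list(groups.values())
```

Processing a whole arriving subgraph is then a loop of `add_edge` calls, which is far cheaper than re-clustering the network from scratch at every time step.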
Citations: 77
An efficient sampling method for characterizing points of interests on maps
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816719
P. Wang, Wenbo He, Xue Liu
Recently, map services (e.g., Google Maps) and location-based online social networks (e.g., Foursquare) have attracted a lot of attention from users and businesses. With the increasing popularity of these location-based services, exploring and characterizing points of interest (PoIs) such as restaurants and hotels on maps provides valuable information for applications such as start-up marketing research. Due to the lack of full direct access to PoI databases, it is infeasible to exhaustively search and collect all PoIs within a large area using public APIs, which usually impose a limit on the maximum query rate. In this paper, we propose an effective and efficient method to sample PoIs on maps, and give unbiased estimators for PoI statistics such as sum and average aggregates. Experimental results based on real datasets show that our method is efficient, requiring six times fewer queries than state-of-the-art methods to achieve the same accuracy.
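One textbook way to build an unbiased estimator under a query-rate limit, in the spirit the abstract describes, is to sample map regions uniformly with replacement and scale the observed counts up (a Hansen-Hurwitz-style estimator). This sketch assumes a fixed grid of cells and a hypothetical `count_pois_in` callback standing in for one API query per cell; it is not the paper's actual sampling scheme.

```python
import random

def estimate_poi_total(cells, sample_size, count_pois_in, rng=None):
    """Uniformly sample cells with replacement and scale the mean count
    by the number of cells: E[estimate] equals the true total."""
    rng = rng or random.Random(0)
    n = len(cells)
    sampled = [rng.choice(cells) for _ in range(sample_size)]
    mean_count = sum(count_pois_in(c) for c in sampled) / sample_size
    return n * mean_count
```

Each API call resolves one sampled cell, so `sample_size` directly controls the query budget, and the estimator's variance shrinks as 1/sample_size.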
Citations: 13
Data quality: The other face of Big Data
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816764
B. Saha, D. Srivastava
In our Big Data era, data is being generated, collected, and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor-quality data is prevalent in large databases and on the Web. Since poor-quality data can have serious consequences for the results of data analyses, the importance of veracity, the fourth `V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three `V's, volume, velocity, and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data speak for itself in order to discover its semantics. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy versus efficiency, and identifies a range of open problems for the community.
Citations: 217
On masking topical intent in keyword search
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816656
Peng Wang, C. Ravishankar
Text-based search queries reveal user intent to the search engine, compromising privacy. Topical Intent Obfuscation (TIO) is a promising new approach to preserving user privacy. TIO masks topical intent by mixing real user queries with dummy queries matching various different topics. Dummy queries are generated using a Dummy Query Generation Algorithm (DGA). We demonstrate various shortcomings in current TIO schemes, and show how to correct them. Current schemes assume that DGA details are unknown to the adversary. We argue that this is a flawed assumption, and show how DGA details can be used to construct efficient attacks on TIO schemes, using an iterative DGA as an example. Our extensive experiments on real data sets show that our attacks can flag up to 80% of dummy queries. We also propose HDGA, a new DGA that we prove to be immune to the attacks based on DGA semantics that we describe.
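The TIO idea of mixing a real query with dummy queries drawn from distinct cover topics can be sketched as follows. The topic vocabularies and batch shape are invented for illustration and do not reproduce the paper's DGA; the paper's attack point is precisely that a generator this simple leaks a detectable signature.

```python
import random

# Hypothetical per-topic vocabularies for dummy query generation.
TOPIC_TERMS = {
    "sports":  ["score", "league", "playoff", "transfer"],
    "cooking": ["recipe", "bake", "simmer", "marinade"],
    "travel":  ["flight", "hostel", "itinerary", "visa"],
}

def obfuscate(real_query, num_dummies=3, rng=None):
    """Return a shuffled batch that mixes the real query with dummy
    queries, each dummy drawn from a different cover topic."""
    rng = rng or random.Random(0)
    topics = rng.sample(list(TOPIC_TERMS), k=min(num_dummies, len(TOPIC_TERMS)))
    batch = [" ".join(rng.sample(TOPIC_TERMS[t], k=2)) for t in topics]
    batch.append(real_query)
    rng.shuffle(batch)
    return batch
```

An adversary who knows `TOPIC_TERMS` can flag any query whose words all come from one cover vocabulary, which is exactly the style of DGA-aware attack the paper constructs.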
Citations: 12
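The TIO idea summarized above — hiding the real topical intent by batching each user query with dummy queries drawn from unrelated topics — can be sketched as follows. The topic pools, the `obfuscate` helper, and the parameter `k` are illustrative assumptions for exposition only, not the paper's actual DGA:

```python
import random

# Hypothetical topic -> query pools; a real DGA is far more sophisticated.
TOPIC_POOLS = {
    "sports":  ["playoff schedule", "transfer rumors", "match highlights"],
    "cooking": ["sourdough starter", "knife sharpening", "stock reduction"],
    "travel":  ["visa requirements", "rail passes", "packing list"],
}

def obfuscate(real_query, k=3, rng=random):
    """Mix one real query with k dummy queries, each drawn from a distinct topic."""
    topics = rng.sample(list(TOPIC_POOLS), k)            # k distinct decoy topics
    dummies = [rng.choice(TOPIC_POOLS[t]) for t in topics]
    batch = dummies + [real_query]
    rng.shuffle(batch)  # the search engine sees all k+1 queries in random order
    return batch

batch = obfuscate("symptoms of flu", k=3)
assert "symptoms of flu" in batch and len(batch) == 4
```

The paper's attack exploits exactly the weakness visible here: an adversary who knows the generation algorithm (the pools and the sampling scheme) can often distinguish dummies from the real query.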
Locality-sensitive operators for parallel main-memory database clusters
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816684
Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann
The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key, but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to a factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.
{"title":"Locality-sensitive operators for parallel main-memory database clusters","authors":"Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann","doi":"10.1109/ICDE.2014.6816684","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816684","url":null,"abstract":"The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to a factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114389554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
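Technique (i) from the abstract above — exploiting fuzzy co-location when assigning hash partitions to nodes — can be illustrated with a greedy simplification: send each partition to the node that already stores most of its tuples, so that fraction never crosses the network. The paper computes an optimal, load-balanced assignment; this argmax sketch and its `local_counts` data layout are illustrative assumptions:

```python
def assign_partitions(local_counts):
    """Locality-aware assignment. local_counts[p] maps node -> number of
    tuples of hash partition p already resident on that node."""
    assignment = {}
    for p, per_node in local_counts.items():
        # Keep the largest share of partition p local; only the remainder
        # of its tuples must be shuffled over the network.
        assignment[p] = max(per_node, key=per_node.get)
    return assignment

# 90% of partition 0 already sits on node A, 80% of partition 1 on node B.
counts = {0: {"A": 90, "B": 10}, 1: {"A": 20, "B": 80}}
assert assign_partitions(counts) == {0: "A", 1: "B"}
```

A pure argmax can overload one node if many partitions favor it, which is why the paper pairs locality with load balancing in its assignment step.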
RuleMiner: Data quality rules discovery
Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816746
Xu Chu, I. Ilyas, Paolo Papotti, Yin Ye
Integrity constraints (ICs) are valuable tools for enforcing correct application semantics. However, manually designing ICs requires experts and time, hence the need for automatic discovery. Previous automatic IC discovery approaches suffer from (1) limited IC language expressiveness; and (2) time-consuming manual verification of discovered ICs. We introduce RULEMINER, a system for discovering data quality rules that addresses the limitations of existing solutions.
{"title":"RuleMiner: Data quality rules discovery","authors":"Xu Chu, I. Ilyas, Paolo Papotti, Yin Ye","doi":"10.1109/ICDE.2014.6816746","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816746","url":null,"abstract":"Integrity constraints (ICs) are valuable tools for enforcing correct application semantics. However, manually designing ICs requires experts and time, hence the need for automatic discovery. Previous automatic IC discovery approaches suffer from (1) limited IC language expressiveness; and (2) time-consuming manual verification of discovered ICs. We introduce RULEMINER, a system for discovering data quality rules that addresses the limitations of existing solutions.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116909865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
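As a minimal example of the kind of data quality rule such a system discovers, the sketch below checks whether a candidate functional dependency X → Y holds on a table. RuleMiner's actual rule language (denial constraints) is far more expressive than FDs, and `fd_holds` is an illustrative helper, not the system's API:

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows,
    where rows is a list of dicts and lhs/rhs are lists of column names."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        # setdefault records the first value seen for this key; any later
        # row with the same key but a different value violates the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True

rows = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
]
assert fd_holds(rows, ["zip"], ["city"])
rows.append({"zip": "10001", "city": "Boston"})  # inject a violation
assert not fd_holds(rows, ["zip"], ["city"])
```

A discovery system enumerates many such candidate rules over the data and keeps those that (approximately) hold — which is exactly where the abstract's concerns about language expressiveness and verification effort arise.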