Latest publications from the 2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Mutual benefit aware task assignment in a bipartite labor market
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498230
Liu Zheng, Lei Chen
As one of the three major steps in crowdsourcing (question design, task assignment, and answer aggregation), task assignment directly affects the quality of the crowdsourcing result. A good assignment not only improves the quality of the answers but also boosts workers' willingness to participate. Although much work has been devoted to producing better assignments, most of it neglects one of the problem's most important properties: its bipartite structure, which is widespread in real-world scenarios. Ignoring this property greatly limits the applicability of existing methods in general settings.
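As a rough illustration of what a mutual-benefit objective can look like (this is not the paper's algorithm), the sketch below scores each worker-task pair from both the worker's and the requester's side and solves the resulting offline assignment with SciPy; the gain matrices, the trade-off parameter `alpha`, and the function name are illustrative assumptions.

```python
# Hypothetical sketch: combine worker-side and requester-side benefit and
# solve the offline assignment; not the paper's own algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mutual_benefit_assignment(worker_gain, requester_gain, alpha=0.5):
    """worker_gain[i, j]: benefit worker i gets from task j (e.g., payment fit);
    requester_gain[i, j]: benefit the requester gets (e.g., expected answer quality).
    alpha trades off the two sides; returns one task per worker plus the total score."""
    score = alpha * np.asarray(worker_gain) + (1 - alpha) * np.asarray(requester_gain)
    workers, tasks = linear_sum_assignment(score, maximize=True)
    return list(zip(workers, tasks)), score[workers, tasks].sum()

worker_gain = np.array([[0.9, 0.2, 0.4],
                        [0.3, 0.8, 0.5],
                        [0.6, 0.4, 0.7]])
requester_gain = np.array([[0.5, 0.9, 0.1],
                           [0.7, 0.2, 0.6],
                           [0.3, 0.8, 0.9]])
pairs, total = mutual_benefit_assignment(worker_gain, requester_gain)
print(pairs, round(total, 2))
```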
Citations: 15
GARNET: A holistic system approach for trending queries in microblogs
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498329
C. Jonathan, A. Magdy, M. Mokbel, A. Jonathan
The recent wide popularity of microblogs (e.g., tweets and online comments) has empowered various important applications, including news delivery, event detection, market analysis, and targeted advertising. A core module in all these applications is a frequent/trending query processor that aims to find the topics that are highly frequent or trending in social media, based on posted microblogs. Unfortunately, current attempts at such a core module suffer from several drawbacks, most importantly their narrow scope: they focus only on trending queries for the very special case of localized and very recent microblogs. This paper presents GARNET, a holistic system that offers a one-stop, efficient, and scalable solution for a generic form of context-aware frequent and trending queries on microblogs. GARNET supports both frequent and trending queries over any arbitrary time interval of fixed granularity, whether current, recent, or past, together with an arbitrary set of filters over contextual attributes. From a system point of view, GARNET is very appealing and industry-friendly, as it needs to be realized only once in the system; a myriad of forms of trending and frequent queries are then immediately supported. Experimental evidence based on a real system prototype of GARNET and billions of real tweets shows the scalability and efficiency of GARNET for various query types.
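As a minimal, hypothetical sketch of the kind of building block such a query processor rests on (not GARNET's actual design), the snippet below keeps per-time-bucket, per-context keyword counters so that frequent-topic queries over an arbitrary bucket range with a contextual filter reduce to summing a few counters; all class and method names are assumptions.

```python
from collections import Counter, defaultdict

class FrequentTopicIndex:
    """Toy time-bucketed counter: keyword frequencies are kept per
    (time bucket, context), so a frequent-topic query over a bucket range
    with a contextual filter is just a sum of pre-aggregated counters."""
    def __init__(self):
        self.buckets = defaultdict(Counter)   # (bucket_id, context) -> keyword counts

    def ingest(self, bucket_id, context, keywords):
        self.buckets[(bucket_id, context)].update(keywords)

    def top_k(self, bucket_range, context_filter, k=3):
        total = Counter()
        for (bucket_id, context), counts in self.buckets.items():
            if bucket_range[0] <= bucket_id <= bucket_range[1] and context_filter(context):
                total.update(counts)
        return total.most_common(k)

idx = FrequentTopicIndex()
idx.ingest(0, "NYC", ["storm", "traffic"])
idx.ingest(1, "NYC", ["storm", "subway"])
idx.ingest(1, "LA", ["game"])
print(idx.top_k((0, 1), lambda c: c == "NYC"))
```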
Citations: 18
Adaptive noise immune cluster ensemble using affinity propagation
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498371
Zhiwen Yu, Guoqiang Han, Le Li, Jiming Liu, Jun Zhang
Cluster ensembles, one of the important research directions in ensemble learning, are gaining more and more attention due to their powerful ability to integrate multiple clustering solutions into a more accurate, stable, and robust result. Cluster ensembles have many useful applications across a wide range of areas. Although most traditional cluster ensemble approaches obtain good results, few of them consider how to achieve good performance on noisy datasets. Some noisy datasets have noisy attributes that can degrade the performance of conventional cluster ensemble approaches; others contain noisy samples that affect the final result; still others may be sensitive to the choice of distance function.
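One common way to realize a consensus over multiple clusterings, sketched below purely for illustration (it is not the paper's adaptive noise-immune method), is to build a co-association matrix from several base clusterings and then cluster that matrix with scikit-learn's affinity propagation; the member count, cluster-number range, and function name are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation

def coassociation_ensemble(X, n_members=10, seed=0):
    """Generic consensus sketch: run several base clusterings, build the
    co-association matrix (how often two samples share a cluster), and
    cluster that similarity matrix with affinity propagation."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    co = np.zeros((n, n))
    for _ in range(n_members):
        k = int(rng.integers(2, 6))
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=int(rng.integers(10**6))).fit_predict(X)
        co += (labels[:, None] == labels[None, :]).astype(float)
    co /= n_members
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    return ap.fit_predict(co)

# Two well-separated toy clusters.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(3, 0.3, (20, 2))])
print(coassociation_ensemble(X))
```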
Citations: 7
Cross-layer betweenness centrality in multiplex networks with applications
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498257
Tanmoy Chakraborty, Ramasuri Narayanam
Several real-life social systems witness the presence of multiple interaction types (or layers) among the entities, thus establishing a collection of co-evolving networks known as multiplex networks. Recently, there has been significant interest in developing centrality measures for multiplex networks to understand the influential power of the entities (referred to as vertices or nodes hereafter). In this paper, we study how frequently nodes occur on the shortest paths between other nodes in a multiplex network. As opposed to simplex networks, shortest paths between nodes in multiplex networks can traverse multiple layers. Motivated by this phenomenon, we propose a new metric to address the above problem, which we call cross-layer betweenness centrality (CBC). Our definition of the CBC measure takes into account the interplay among multiple layers in determining the shortest paths in multiplex networks. We propose an efficient algorithm to compute CBC and show that it runs much faster than the naïve computation of this measure. We show the efficacy of the proposed algorithm through thorough experimentation on two real-world multiplex networks. We further demonstrate the practical utility of CBC by applying it in three application contexts: discovering non-overlapping community structure in multiplex networks, identifying interdisciplinary researchers from a multiplex co-authorship network, and selecting initiators for message spreading. In all these application scenarios, the solution methods based on the proposed CBC significantly outperform the corresponding benchmark approaches.
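A naïve baseline for cross-layer betweenness, shown below only to make the idea concrete (the paper's efficient algorithm is not reproduced), is to build one graph over (node, layer) copies, link copies of the same node across layers so that shortest paths may switch layers, and sum ordinary betweenness over each node's copies; the helper name and toy layers are assumptions.

```python
import itertools
import networkx as nx

def naive_cross_layer_betweenness(layers):
    """layers: list of edge lists, one per layer, over a shared node set.
    Builds a single graph on (node, layer) copies, connects copies of the
    same node across layers, and sums standard betweenness over each node's
    copies; paths may therefore switch layers, which is the cross-layer effect."""
    G = nx.Graph()
    for l, edges in enumerate(layers):
        G.add_edges_from(((u, l), (v, l)) for u, v in edges)
    nodes = {u for u, _ in G.nodes()}
    for u in nodes:
        copies = [(u, l) for l in range(len(layers)) if (u, l) in G]
        G.add_edges_from(itertools.combinations(copies, 2))  # inter-layer coupling
    bc = nx.betweenness_centrality(G)
    agg = {}
    for (u, _), val in bc.items():
        agg[u] = agg.get(u, 0.0) + val
    return agg

layers = [[(1, 2), (2, 3)], [(1, 3), (3, 4)]]
print(naive_cross_layer_betweenness(layers))
```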
Citations: 19
Online mobile Micro-Task Allocation in spatial crowdsourcing
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498228
Yongxin Tong, Jieying She, Bolin Ding, Libin Wang, Lei Chen
With the rapid development of smartphones, spatial crowdsourcing platforms are becoming popular. A foundational problem in spatial crowdsourcing is allocating micro-tasks to suitable crowd workers. Most existing studies focus on offline scenarios, where all the spatiotemporal information of micro-tasks and crowd workers is given in advance. However, such approaches are impractical, since micro-tasks and crowd workers in real applications appear dynamically and their spatiotemporal information cannot be known beforehand. In this paper, to address the shortcomings of existing offline approaches, we first identify a more practical micro-task allocation problem, called the Global Online Micro-task Allocation in spatial crowdsourcing (GOMA) problem. We first extend the state-of-the-art algorithm for the online maximum weighted bipartite matching problem to the GOMA problem as the baseline algorithm. Although the baseline algorithm provides a theoretical guarantee for the worst case, its average performance in practice is not good enough, since the worst case happens with very low probability in the real world. Thus, we consider the average performance of online algorithms under the online random-order model. We propose a two-phase-based framework, based on which we present the TGOA algorithm with a 1/4-competitive ratio under the online random-order model. To improve its efficiency, we further design the TGOA-Greedy algorithm following the same framework; it runs faster than TGOA but has a lower competitive ratio of 1/8. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.
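To make the online setting concrete, the sketch below implements a simple greedy matcher under random arrival order: each arriving task is immediately and irrevocably assigned to the best still-free worker. This is only an illustrative baseline, not the TGOA or TGOA-Greedy algorithms, and the utility dictionary and function name are assumptions.

```python
import random

def greedy_online_assignment(utility, seed=0):
    """utility[(worker, task)]: value of assigning a worker to a task.
    Tasks arrive one by one in random order (the random-order model); each is
    immediately and irrevocably assigned to the best still-free worker."""
    workers = {w for w, _ in utility}
    tasks = list({t for _, t in utility})
    random.Random(seed).shuffle(tasks)           # random arrival order
    free, matching, total = set(workers), [], 0.0
    for t in tasks:
        candidates = [(utility[(w, t)], w) for w in free if (w, t) in utility]
        if not candidates:
            continue
        value, w = max(candidates)               # best available worker for this task
        free.remove(w)
        matching.append((w, t))
        total += value
    return matching, total

utility = {("w1", "t1"): 3.0, ("w1", "t2"): 1.0,
           ("w2", "t1"): 2.0, ("w2", "t2"): 2.5}
print(greedy_online_assignment(utility))
```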
Citations: 254
Practical private shortest path computation based on Oblivious Storage
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498254
Dong Xie, Guanru Li, Bin Yao, Xuan Wei, Xiaokui Xiao, Yunjun Gao, M. Guo
As location-based services (LBSs) become popular, location-dependent queries have raised serious privacy concerns, since they may disclose sensitive information during query processing. Among the typical queries supported by LBSs, shortest path queries may reveal information not only about the clients' current locations, but also about their potential destinations and travel plans. Unfortunately, existing methods for private shortest path computation suffer from weak privacy guarantees, low performance, or poor scalability. In this paper, we aim at a strong privacy guarantee, under which the adversary can infer almost no information about the queries, while achieving better performance and scalability. To achieve this goal, we introduce a general system model based on the concept of Oblivious Storage (OS), which can handle queries requiring strong privacy properties. Furthermore, we propose a new oblivious shuffle algorithm to optimize an existing OS scheme. By making trade-offs among query performance, scalability, and privacy properties, we design different schemes for private shortest path computation. Finally, we comprehensively evaluate our schemes on real road networks in a practical environment and demonstrate their efficiency.
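To illustrate the idea of hiding access patterns (not the paper's OS scheme, which is far more efficient), the sketch below runs plain Dijkstra on top of a trivial "linear-scan" store that touches every block on every read, so the storage server learns nothing from which adjacency list is requested; all names are assumptions.

```python
import heapq

class LinearScanStore:
    """Toy oblivious store: every read scans all blocks, so the sequence of
    physical accesses is independent of which logical index is requested.
    (A real scheme would also encrypt blocks and avoid data-dependent branches.)"""
    def __init__(self, blocks):
        self._blocks = list(blocks)

    def read(self, idx):
        result = None
        for i, block in enumerate(self._blocks):   # always touch every block
            if i == idx:
                result = block
        return result

def private_shortest_path(adj_store, n, src, dst):
    """Plain Dijkstra whose only access to the graph goes through the store,
    so the access pattern reveals nothing about the query."""
    dist = [float("inf")] * n
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj_store.read(u):             # oblivious adjacency fetch
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist[dst]

adj = [[(1, 2.0), (2, 5.0)], [(2, 1.0)], []]       # adjacency lists for 3 nodes
store = LinearScanStore(adj)
print(private_shortest_path(store, 3, 0, 2))        # 3.0 via node 1
```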
Citations: 25
OLAP over probabilistic data cubes I: Aggregating, materializing, and querying
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498291
Xike Xie, Xingjun Hao, T. Pedersen, Peiquan Jin, Jinchuan Chen
On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain), and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube comprises a set of probabilistic cuboids that summarize the aggregated values in the form of probability mass functions (pmfs for short), thereby offering insight into the underlying data quality and enabling confidence-aware query evaluation and analysis. However, the probabilistic nature of the data poses computational challenges, as even simple operations are #P-hard under possible world semantics. Even worse, it is hard to share computation among different cuboids, because aggregation functions that are distributive for traditional data cubes, e.g., SUM and COUNT, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation. For aggregation, we focus on how to maximize the sharing of computation among cells and cuboids. We present two aggregation methods, convolution-based and sketch-based, which reduce the time complexity of building a probabilistic cuboid to polynomial and linear, respectively. Both support full as well as partial data cube materialization. We then devise a cost model that guides which aggregation methods to deploy and combine during cube materialization. We further provide algorithms for probabilistic slicing and dicing queries on the data cube. Extensive experiments over real and synthetic datasets show that the techniques are effective and scalable.
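The convolution-based aggregation can be made concrete for a SUM measure: assuming independent cells whose values have small non-negative integer support, the pmf of their sum is the convolution of the per-cell pmfs. The sketch below is a minimal illustration of that fact, not the paper's cuboid-construction algorithm.

```python
import numpy as np

def sum_pmf(pmfs):
    """pmf of the SUM of independent cell measures with non-negative integer
    support: repeatedly convolve the per-cell pmfs (index = measure value)."""
    acc = np.array([1.0])                    # pmf of the empty sum (always 0)
    for p in pmfs:
        acc = np.convolve(acc, np.asarray(p, dtype=float))
    return acc

cell_a = [0.2, 0.5, 0.3]                     # P(value = 0, 1, 2)
cell_b = [0.6, 0.4]                          # P(value = 0, 1)
print(sum_pmf([cell_a, cell_b]))             # pmf over sums 0..3
```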
Citations: 15
Indexing multi-metric data
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498318
Maximilian Franzke, Tobias Emrich, Andreas Züfle, M. Renz
The proliferation of Web 2.0 and the ubiquity of social media yield a huge flood of heterogeneous data that is voluntarily published and shared by billions of individual users all over the world. As a result, the representation of an entity (such as a real person) in this data may consist of various data types, including location and other numeric attributes, textual descriptions, images, videos, social network information, and other types of information. Searching for similar entities in this multi-enriched data, exploiting the information of multiple representations simultaneously, promises to yield more interesting and relevant results than searching each data type individually. While efficient similarity search on single representations is a well-studied problem, existing studies lack appropriate solutions for multi-enriched data that take the combination of all representations into account as a whole. In this paper, we address the problem of index-supported similarity search on multi-enriched (a.k.a. multi-represented) objects based on a set of metrics, one metric for each representation. We define multi-metric similarity search queries by employing a user-defined weight function that specifies the impact of each metric at query time. Our main contribution is an index structure that combines all metrics into a single multi-dimensional access method that works for arbitrary weight preferences. The experimental evaluation shows that our proposed index structure is more efficient than existing multi-metric access methods under different cost criteria and tremendously outperforms traditional approaches when querying very large sets of multi-enriched objects.
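As a baseline for the kind of query such an index accelerates, the sketch below combines one metric per representation with query-time weights and answers a k-nearest-neighbor query by linear scan; the metrics, weights, and object layout are illustrative assumptions, not the paper's index structure.

```python
import heapq
import math

def euclidean(a, b):
    return math.dist(a, b)

def jaccard(a, b):
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def multi_metric_knn(query, objects, metrics, weights, k=2):
    """Baseline linear scan: each object has one representation per metric,
    and the query supplies the weights that combine the metrics at query time."""
    def combined(obj):
        return sum(w * m(q_rep, o_rep)
                   for m, w, q_rep, o_rep in zip(metrics, weights, query, obj))
    return heapq.nsmallest(k, objects, key=combined)

# Objects with a spatial representation and a keyword-set representation.
objects = [((0.0, 0.0), {"beach", "sun"}),
           ((5.0, 5.0), {"beach", "bar"}),
           ((1.0, 1.0), {"museum"})]
query = ((0.5, 0.5), {"beach"})
print(multi_metric_knn(query, objects, [euclidean, jaccard], [0.7, 0.3]))
```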
Citations: 8
Platform-independent robust query processing
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498251
S. Karthik, J. Haritsa, Sreyash Kenkre, Vinayaka Pandit
To address the classical selectivity estimation problem in databases, a radically different approach called PlanBouquet was recently proposed in [3], wherein the estimation process is completely abandoned and replaced with a calibrated discovery mechanism. The beneficial outcome of this new construction is that, for the first time, provable guarantees are obtained on worst-case performance, thereby facilitating robust query processing.
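The abstract does not spell out the discovery mechanism, but one way to picture an estimation-free, calibrated discovery loop (a loose sketch under that assumption, not necessarily PlanBouquet's actual procedure) is to execute candidate plans under geometrically increasing cost budgets until one completes within its budget, which bounds the wasted work relative to the winning budget.

```python
def discover_plan(plans, budget0=1.0, growth=2.0, max_rounds=20):
    """plans: list of callables; each takes a cost budget and returns the query
    result if it can finish within that budget, else None. Budgets grow
    geometrically, so total work is within a constant factor of the final budget."""
    budget = budget0
    for _ in range(max_rounds):
        for run_plan in plans:
            result = run_plan(budget)
            if result is not None:
                return result, budget
        budget *= growth
    raise RuntimeError("no plan completed within the maximum budget")

# Toy plans whose true costs are unknown to the "optimizer".
def make_plan(true_cost, answer):
    return lambda budget: answer if true_cost <= budget else None

plans = [make_plan(7.0, "result-A"), make_plan(3.5, "result-B")]
print(discover_plan(plans))     # ('result-B', 4.0)
```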
Citations: 10
SCouT: Scalable coupled matrix-tensor factorization - algorithm and discoveries
Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498292
Byungsoo Jeon, Inah Jeon, Lee Sael, U. Kang
How can we analyze very large real-world tensors where additional information is coupled with certain modes of the tensor? Coupled matrix-tensor factorization is a useful tool for simultaneously analyzing matrices and a tensor, and has been used for important applications including collaborative filtering, multi-way clustering, and link prediction. However, existing single-machine or distributed algorithms for coupled matrix-tensor factorization do not scale to tensors with billions of elements in each mode. In this paper, we propose SCouT, a large-scale coupled matrix-tensor factorization algorithm running on the distributed MapReduce platform. By carefully reorganizing operations and reusing intermediate data, SCouT decomposes tensors up to 100× larger than existing methods can handle, and shows linear scalability in tensor order and in the number of machines, whereas other methods are limited in scalability. We also apply SCouT to real-world tensors and discover interesting hidden patterns, such as seasonal spikes and steady attention to healthy food, in a Yelp dataset containing a user-business-year/month tensor and two coupled matrices.
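To show the objective that such a system distributes, the sketch below is a single-machine ALS solver for a 3-way tensor coupled with a matrix along its first mode, written with NumPy; the rank, iteration count, and function name are assumptions, and none of SCouT's MapReduce-specific reorganization or data reuse is reproduced.

```python
import numpy as np

def cmtf_als(X, Y, rank=3, iters=30, seed=0):
    """ALS for min ||X - [[A,B,C]]||^2 + ||Y - A D^T||^2, where the 3-way
    tensor X (I x J x K) shares its first mode with the matrix Y (I x M)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    M = Y.shape[1]
    A, B, C, D = (rng.standard_normal((d, rank)) for d in (I, J, K, M))
    for _ in range(iters):
        # A sees both the tensor (via the mode-1 MTTKRP) and the coupled matrix.
        G = (B.T @ B) * (C.T @ C) + D.T @ D
        A = np.linalg.solve(G, (np.einsum('ijk,jr,kr->ir', X, B, C) + Y @ D).T).T
        B = np.linalg.solve((A.T @ A) * (C.T @ C),
                            np.einsum('ijk,ir,kr->jr', X, A, C).T).T
        C = np.linalg.solve((A.T @ A) * (B.T @ B),
                            np.einsum('ijk,ir,jr->kr', X, A, B).T).T
        D = np.linalg.solve(A.T @ A, (Y.T @ A).T).T
    return A, B, C, D

# Synthetic rank-2 data shared between a tensor and a coupled matrix.
rng = np.random.default_rng(1)
A0, B0, C0, D0 = (rng.standard_normal((d, 2)) for d in (6, 5, 4, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
Y = A0 @ D0.T
A, B, C, D = cmtf_als(X, Y, rank=2)
print(np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)))  # residual, typically near zero
```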
Citations: 57