2013 IEEE 29th International Conference on Data Engineering (ICDE): latest papers, page 4

Automatic extraction of top-k lists from the web

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544897

Zhixian Zhang, Kenny Q. Zhu, Haixun Wang, Hongsong Li

This paper is concerned with information extraction from top-k web pages, which are web pages that describe the top k instances of a topic of general interest. Examples include “the 10 tallest buildings in the world”, “the 50 hits of 2010 you don't want to miss”, etc. Compared to other structured information on the web (including web tables), information in top-k lists is larger and richer, of higher quality, and generally more interesting. Top-k lists are therefore highly valuable. For example, they can help enrich open-domain knowledge bases (to support applications such as search or fact answering). In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.7 million top-k lists from a web corpus of 1.6 billion pages with 92.0% precision and 72.3% recall.

Citations: 25

Towards efficient SimRank computation on large networks

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544859

Weiren Yu, Xuemin Lin, W. Zhang

SimRank has been a powerful model for assessing the similarity of pairs of vertices in a graph. It is based on the concept that two vertices are similar if they are referenced by similar vertices. Due to its self-referentiality, fast SimRank computation on large graphs poses significant challenges. The state-of-the-art work [17] exploits partial sums memorization for computing SimRank in O(Kmn) time on a graph with n vertices and m edges, where K is the number of iterations. Partial sums memorizing can reduce repeated calculations by caching part of similarity summations for later reuse. However, we observe that computations among different partial sums may have duplicate redundancy. Besides, for a desired accuracy ϵ, the existing SimRank model requires K = ⌈log_C ϵ⌉ iterations [17], where C is a damping factor. Nevertheless, such a geometric rate of convergence is slow in practice if a high accuracy is desirable. In this paper, we address these gaps. (1) We propose an adaptive clustering strategy to eliminate partial sums redundancy (i.e., duplicate computations occurring in partial sums), and devise an efficient algorithm for speeding up the computation of SimRank to O(Kd'n^2) time, where d' is typically much smaller than the average in-degree of a graph. (2) We also present a new notion of SimRank that is based on a differential equation and can be represented as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) Using real and synthetic data, we empirically verify that our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude, and that our revised notion of SimRank further achieves a 5X speedup on large graphs while also fairly preserving the relative order of original SimRank scores.
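As a point of reference for the abstract above, the conventional SimRank iteration can be sketched as follows. This is a naive baseline under stated assumptions, not the authors' algorithm: the function name `simrank`, the edge-list input format, and the defaults C = 0.6 and K = 5 are illustrative choices, and neither partial sums sharing nor the differential variant is implemented.

```python
from collections import defaultdict


def simrank(edges, C=0.6, K=5):
    """Naive SimRank over a directed edge list [(source, target), ...].

    Illustrative baseline only: s(a, a) = 1, and for a != b the score is
    C / (|I(a)| * |I(b)|) times the sum of s(i, j) over in-neighbours
    i of a and j of b (0 if either vertex has no in-neighbours).
    """
    nodes = sorted({v for edge in edges for v in edge})
    in_nbrs = defaultdict(list)
    for source, target in edges:
        in_nbrs[target].append(source)

    # Iteration 0: every vertex is similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}

    for _ in range(K):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    total = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                    new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim


# Two vertices referenced by the same vertex u become similar.
scores = simrank([("u", "a"), ("u", "b")])
```

Every vertex pair re-sums over all pairs of in-neighbours in each iteration, which is the kind of repeated work that caching and sharing partial sums is meant to reduce.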

Citations: 48

ASVTDECTOR: A practical near duplicate video retrieval system

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544941

Xiangmin Zhou, Lei Chen

In this paper, we present a system, named ASVT-DECTOR, that retrieves near-duplicate videos with large variations based on a 3D structure tensor model, named the ASVT series, over the local descriptors of video segments. Unlike traditional global feature-based video detection systems, which incur severe information loss, the ASVT model is built over the local descriptor set of each video segment, keeping the robustness of local descriptors. Meanwhile, unlike traditional local feature-based methods, which suffer from the high cost of pair-wise descriptor comparison, the ASVT model describes a video segment as a 3D structure tensor that is actually a 3×3 matrix, obtaining high retrieval efficiency. In this demonstration, we show that, given a clip, our ASVTDETECTOR system can effectively find near-duplicates with large variations from a large collection in real time.

Citations: 3

Secure nearest neighbor revisited

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544870

Bin Yao, Feifei Li, Xiaokui Xiao

In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a cloud service provider and asks for an encrypted data point in E(D) (the encrypted database) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist for existing SNN methods [21], [15], even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field [6], [5], and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models [6], [5], our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based on only E(q) and E(D). Given this hardness result, we design new SNN methods by asking the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that E(G) is guaranteed to contain the answer for the SNN query. Our methods provide a customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database, where E can be any well-established encryption scheme.

Citations: 240

Triples in the clouds

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544918

Zoi Kaoudi, I. Manolescu

The W3C's Resource Description Framework (or RDF, in short) is a promising candidate which may deliver many of the original semi-structured data promises: flexible structure, optional schema, and rich, flexible URIs as a basis for information sharing. Moreover, RDF is uniquely positioned to benefit from the efforts of scientific communities studying databases, knowledge representation, and Web technologies. Many RDF data collections are being published, going from scientific data to general-purpose ontologies to open government data, in particular in the Linked Data movement. Managing such large volumes of RDF data is challenging, due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault-tolerance and elasticity features it provides. This tutorial discusses the problems involved in efficiently handling massive amounts of RDF data in a cloud environment. We provide the necessary background, analyze and classify existing solutions, and discuss open problems and perspectives.

Citations: 6

Efficient tracking and querying for coordinated uncertain mobile objects

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544824

Nicholas D. Larusso, Ambuj K. Singh

Accurately estimating the current positions of moving objects is a challenging task due to the various forms of data uncertainty (e.g. limited sensor precision, periodic updates from continuously moving objects). However, in many cases, groups of objects tend to exhibit similarities in their movement behavior. For example, vehicles in a convoy or animals in a herd both exhibit tightly coupled movement behavior within the group. While such statistical dependencies often increase the computational complexity necessary for capturing this additional structure, they also provide useful information which can be utilized to provide more accurate location estimates. In this paper, we propose a novel model for accurately tracking coordinated groups of mobile uncertain objects. We introduce an exact and more efficient approximate inference algorithm for updating the current location of each object upon the arrival of new (uncertain) location observations. Additionally, we derive probability bounds over the groups in order to process probabilistic threshold range queries more efficiently. Our experimental evaluation shows that our proposed model can provide 4X improvements in tracking accuracy over competing models which do not consider group behavior. We also show that our bounds enable us to prune up to 50% of the database, resulting in more efficient processing over a linear scan.

Citations: 4

Towards efficient search for activity trajectories

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544828

K. Zheng, Shuo Shang, Nicholas Jing Yuan, Yi Yang

The advances in location positioning and wireless communication technologies have led to a myriad of spatial trajectories representing the mobility of a variety of moving objects. While processing trajectory data with a focus on spatio-temporal features has been widely studied in the last decade, the recent proliferation of location-based web applications (e.g., Foursquare, Facebook) has given rise to large amounts of trajectories associated with activity information, called activity trajectories. In this paper, we study the problem of efficient similarity search on activity trajectory databases. Given a sequence of query locations, each associated with a set of desired activities, an activity trajectory similarity query (ATSQ) returns k trajectories that cover the query activities and yield the shortest minimum match distance. An order-sensitive activity trajectory similarity query (OATSQ) is also proposed to take into account the order of the query locations. To process the queries efficiently, we first develop a novel hybrid grid index, GAT, to organize the trajectory segments and activities hierarchically, which enables us to prune the search space by location proximity and activity containment simultaneously. In addition, we propose algorithms for efficient computation of the minimum match distance and minimum order-sensitive match distance, respectively. The results of our extensive empirical studies based on real online check-in datasets demonstrate that our proposed index and methods are capable of achieving superior performance and good scalability.

Citations: 164

Voronoi-based nearest neighbor search for multi-dimensional uncertain databases

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544822

Peiwu Zhang, Reynold Cheng, N. Mamoulis, M. Renz, Andreas Züfle, Yu Tang, Tobias Emrich

In Voronoi-based nearest neighbor search, the Voronoi cell of every point p in a database can be used to check whether p is the closest to some query point q. We extend the notion of Voronoi cells to support uncertain objects, whose attribute values are inexact. Particularly, we propose the Possible Voronoi cell (or PV-cell). A PV-cell of a multi-dimensional uncertain object o is a region R, such that for any point p ∈ R, o may be the nearest neighbor of p. If the PV-cells of all objects in a database S are known, they can be used to identify objects that have a chance to be the nearest neighbor of q. However, there is no efficient algorithm for computing an exact PV-cell. We hence study how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell. We further develop the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data. An advantage of the PV-index is that upon updates on S, it can be incrementally updated. Extensive experiments on both synthetic and real datasets are carried out to validate the performance of the PV-index.
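The notion of objects that "have a chance to be the nearest neighbor of q" can be illustrated with a brute-force minimum/maximum distance filter. This is a sketch under assumptions, not the PV-cell or PV-index construction: each uncertain object is assumed to be modelled as an axis-parallel box (lo, hi) that contains its true position, and the function names are hypothetical.

```python
import math


def min_dist(q, lo, hi):
    """Smallest possible Euclidean distance from point q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def max_dist(q, lo, hi):
    """Largest possible Euclidean distance from point q to the box [lo, hi]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(q, lo, hi)))


def possible_nearest_neighbors(q, objects):
    """Return the sorted names of objects that may be the nearest neighbour of q.

    objects maps a name to its uncertainty box (lo, hi). An object is pruned
    when even its closest possible position is farther from q than the
    farthest possible position of some other object.
    """
    bound = min(max_dist(q, lo, hi) for lo, hi in objects.values())
    return sorted(name for name, (lo, hi) in objects.items() if min_dist(q, lo, hi) <= bound)


objects = {
    "A": ((1.0, 1.0), (2.0, 2.0)),
    "B": ((2.0, 0.0), (3.0, 1.0)),
    "C": ((5.0, 5.0), (6.0, 6.0)),
}
candidates = possible_nearest_neighbors((0.0, 0.0), objects)  # A and B survive, C is pruned
```

This filter scans every object for every query. A precomputed structure over regions of the query space, as the abstract describes, avoids that per-query scan.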

Citations: 36

Recycling in pipelined query evaluation

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544837

F. Nagel, P. Boncz, Stratis Viglas

Database systems typically execute queries in isolation. Sharing recurring intermediate and final results between successive query invocations is ignored or only exploited by caching final query results. The DBA is kept in the loop to make explicit sharing decisions by identifying and/or defining materialized views. Thus decisions are made only after a long time and sharing opportunities may be missed. Recycling intermediate results has been proposed as a method to make database query engines profit from opportunities to reuse fine-grained partial query results, that is fully autonomous and is able to continuously adapt to changes in the workload. The technique was recently revisited in the context of MonetDB, a system that by default materializes all intermediate results. Materializing intermediate results can consume significant system resources, therefore most other database systems avoid this where possible, following a pipelined query architecture instead. The novelty of this paper is to show how recycling can successfully be applied in pipelined query executors, by tracking the benefit of materializing possible intermediate results and then choosing the ones making best use of a limited intermediate result cache. We present ways to maximize the potential of recycling by leveraging subsumption and proactive query rewriting. We have implemented our approach in the Vectorwise database engine and have experimentally evaluated its potential using both synthetic and real-world datasets. Our results show that intermediate result recycling significantly improves performance.
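The idea of choosing which intermediate results to keep in a limited cache can be sketched as a greedy selection. The abstract does not state the actual benefit metric or policy, so everything here is an assumption for illustration: benefit is taken to be cost saved per reuse times observed reuses, candidates are ranked by benefit per unit of size, and the function and field names are hypothetical.

```python
def choose_results_to_cache(candidates, capacity):
    """Greedily pick intermediate results for a size-limited cache.

    candidates: list of (name, size, cost_saved_per_reuse, observed_reuses),
    with size > 0. Ranking by benefit density is an assumed heuristic,
    not the policy described in the paper.
    """
    ranked = sorted(candidates, key=lambda c: (c[2] * c[3]) / c[1], reverse=True)
    chosen, used = [], 0
    for name, size, _cost, _reuses in ranked:
        if used + size <= capacity:  # skip results that no longer fit
            chosen.append(name)
            used += size
    return chosen


candidates = [
    ("join_ab", 40, 10.0, 5),
    ("scan_c", 50, 2.0, 3),
    ("agg_d", 30, 8.0, 4),
    ("sort_e", 60, 20.0, 1),
]
kept = choose_results_to_cache(candidates, capacity=100)  # ["join_ab", "agg_d"]
```

A greedy density ranking is simple and fast but is not guaranteed to be optimal for a knapsack-style budget. It is used here only to make the trade-off between benefit and cache space concrete.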

Citations: 40

T-share: A large-scale dynamic taxi ridesharing service

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544843

Shuo Ma, Yu Zheng, O. Wolfson

Taxi ridesharing can bring significant social and environmental benefits, e.g. by reducing energy consumption and satisfying people's commute needs. Despite this potential, taxi ridesharing, especially with dynamic queries, is not well studied. In this paper, we formally define the dynamic ridesharing problem and propose a large-scale taxi ridesharing service. It efficiently serves real-time requests sent by taxi users and generates ridesharing schedules that significantly reduce the total travel distance. In our method, we first propose a taxi searching algorithm that uses a spatio-temporal index to quickly retrieve candidate taxis that are likely to satisfy a user query. A scheduling algorithm is then proposed. It checks each candidate taxi and inserts the query's trip into the schedule of the taxi that satisfies the query with the minimum additional travel distance. To tackle the heavy computational load, a lazy shortest path calculation strategy is devised to speed up the scheduling algorithm. We evaluated our service using a GPS trajectory dataset generated by over 33,000 taxis during a period of 3 months. By learning the spatio-temporal distributions of real user queries from this dataset, we built an experimental platform that simulates real user behaviour in taking a taxi. Tested on this platform with extensive experiments, our approach demonstrated its efficiency, effectiveness, and scalability. For example, our proposed service serves 25% additional taxi users while saving 13% travel distance compared with no ridesharing (when the ratio of the number of queries to the number of taxis is 6).
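
The scheduling step described in the abstract (insert the new trip into the candidate taxi's schedule where it adds the least travel distance) can be sketched as follows. This is a simplified illustration under stated assumptions, not the T-share implementation: straight-line distance stands in for road-network shortest paths, time-window and capacity checks are omitted, the candidate taxis are assumed to have already been retrieved by the search step, and the function names `route_length`, `best_insertion`, and `assign` are hypothetical.

```python
from math import dist


def route_length(stops):
    """Length of a route that visits `stops` in order.
    Assumption: straight-line distance instead of road-network distance."""
    return sum(dist(a, b) for a, b in zip(stops, stops[1:]))


def best_insertion(schedule, pickup, dropoff):
    """Try every position for the pickup and every later-or-equal position
    for the dropoff. `schedule[0]` is the taxi's current position and is
    never displaced. Returns (added_distance, new_schedule) for the cheapest
    insertion."""
    base = route_length(schedule)
    best = None
    for i in range(1, len(schedule) + 1):      # pickup goes before original stop i
        for j in range(i, len(schedule) + 1):  # dropoff goes before original stop j
            candidate = (schedule[:i] + [pickup] + schedule[i:j]
                         + [dropoff] + schedule[j:])
            added = route_length(candidate) - base
            if best is None or added < best[0]:
                best = (added, candidate)
    return best


def assign(taxis, pickup, dropoff):
    """Among the candidate taxis, pick the one whose schedule absorbs the
    new trip with the least additional distance."""
    results = {tid: best_insertion(s, pickup, dropoff) for tid, s in taxis.items()}
    tid = min(results, key=lambda t: results[t][0])
    return tid, results[tid]


# Two candidate taxis, each given as [current position, remaining stops...].
taxis = {"A": [(0, 0), (10, 0)], "B": [(0, 5), (0, 20)]}
tid, (added, schedule) = assign(taxis, pickup=(2, 0), dropoff=(8, 0))
print(tid, added, schedule)
```

Here the new trip lies directly on taxi A's existing route, so inserting it adds no distance and A is chosen. The exhaustive double loop recomputes whole route lengths for clarity; with road-network distances each evaluation is expensive, which is the load the abstract's lazy shortest path strategy is meant to reduce.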

Citations: 517