2011 IEEE 27th International Conference on Data Engineering: latest papers (page 9)

On dimensionality reduction of massive graphs for indexing and retrieval

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767834

C. Aggarwal, Haixun Wang

In this paper, we examine the problem of dimensionality reduction of massive disk-resident data sets. Graph mining has become important in recent years because of its numerous applications in community detection, social networking, and web mining. Many graph data sets are defined on massive node domains, in which the number of nodes is very large. As a result, it is often difficult to store the information needed to retrieve and index the data. Most known methods for dimensionality reduction are effective only for data sets defined on modest domains. Moreover, although dimensionality reduction is most relevant to massive data sets, these algorithms are not designed for disk-resident data, in particular with respect to the order in which the data is accessed on disk. This is a serious limitation that restricts the applicability of current dimensionality reduction methods. Furthermore, since dimensionality reduction methods are typically designed for database applications such as indexing, it is important to design the underlying data reduction method so that it can be used effectively for such applications. In this paper, we examine the problem of dimensionality reduction of graph data in the difficult case in which the number of nodes is very large and the data set is disk-resident. We propose an effective sampling algorithm for dimensionality reduction and show how to perform the reduction in a limited number of passes over the data on disk. We also design the technique to be highly interpretable and friendly to indexing applications. We illustrate the effectiveness and efficiency of the approach on a number of real data sets.

Citations: 8

Memory-constrained aggregate computation over data streams

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767860

K. Naidu, R. Rastogi, Scott Satkin, A. Srinivasan

In this paper, we study the problem of efficiently computing multiple aggregation queries over a data stream. To share computation, prior proposals have suggested instantiating certain intermediate aggregates, which are then used to generate the final answers for input queries. In this work, we make a number of contributions aimed at improving the generation and execution of query plans containing intermediate aggregates. These include: (1) a different hashing model, which has low eviction rates and also allows us to accurately estimate the number of evictions; (2) a comprehensive query execution cost model based on these estimates; (3) an efficient greedy heuristic for constructing good low-cost query plans; (4) provably near-optimal and optimal algorithms for allocating the available memory to aggregates in the query plan when the input data distribution is Zipf-like and Uniform, respectively; and (5) a detailed performance study with real-life IP flow data sets, which shows that our multiple-aggregate computation techniques consistently outperform the best-known approach.
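
The idea of sharing computation through an intermediate aggregate can be sketched as follows. This is a minimal illustration only, not the paper's method: it ignores the hashing model, evictions, the cost model, and memory allocation, and the flow records and the helper names `aggregate` and `roll_up` are assumptions made for the example.

```python
from collections import Counter

def aggregate(stream, key):
    """Count records per group, where key maps a record to its group."""
    counts = Counter()
    for record in stream:
        counts[key(record)] += 1
    return counts

def roll_up(intermediate, project):
    """Derive a coarser aggregate from a finer one without rescanning the stream."""
    counts = Counter()
    for group, n in intermediate.items():
        counts[project(group)] += n
    return counts

# Hypothetical flow records: (source IP, destination port).
flows = [("10.0.0.1", 80), ("10.0.0.1", 443), ("10.0.0.2", 80), ("10.0.0.1", 80)]

# One pass over the stream builds the intermediate aggregate on (source, port).
by_src_port = aggregate(flows, key=lambda r: r)
# Both input queries are then answered from the intermediate aggregate.
by_src = roll_up(by_src_port, project=lambda g: g[0])
by_port = roll_up(by_src_port, project=lambda g: g[1])
```

The stream is scanned once, and each final answer costs only a pass over the (usually much smaller) intermediate aggregate.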

Citations: 14

A unified approach for computing top-k pairs in multidimensional space

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767903

M. A. Cheema, Xuemin Lin, Haixun Wang, Jianmin Wang, W. Zhang

Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries, and their bichromatic variants are examples of top-k pairs queries that rank pairs by distance functions. While these queries have received significant research attention, there is no unified approach that can efficiently answer all of them. Moreover, no existing work supports top-k pairs queries based on generic scoring functions. In this paper, we present a unified approach that supports a broad class of top-k pairs queries, including the queries mentioned above. Our proposed approach allows users to define a local scoring function for each attribute involved in the query and a global scoring function that computes the final score of each pair by combining its scores on different attributes. We propose efficient internal and external memory algorithms, and our theoretical analysis shows that the expected performance of the algorithms is optimal when two or fewer attributes are involved. Our approach does not require any pre-built indexes, is easy to implement, and has low memory requirements. We conduct extensive experiments to demonstrate the efficiency of our proposed approach.
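
The query model with local and global scoring functions can be illustrated with a brute-force baseline. This sketch is not the paper's algorithm: it enumerates all pairs in quadratic time, and the function name `top_k_pairs` and the sample points are assumptions made for the example.

```python
import heapq
from itertools import combinations

def top_k_pairs(points, k, local_scores, global_score):
    """Return the k lowest-scoring pairs as (score, i, j), by brute force."""
    scored = []
    for (i, a), (j, b) in combinations(enumerate(points), 2):
        # One local score per attribute, combined by the global function.
        per_attribute = [f(a[d], b[d]) for d, f in enumerate(local_scores)]
        scored.append((global_score(per_attribute), i, j))
    return heapq.nsmallest(k, scored)

def abs_diff(x, y):
    return abs(x - y)

points = [(0, 0), (1, 1), (5, 5), (6, 5)]
# k closest pairs under Manhattan distance: local = |x - y|, global = sum.
closest = top_k_pairs(points, 2, [abs_diff, abs_diff], sum)
```

Negating the global score turns the same routine into a k furthest pairs query, which is the sense in which one scoring interface covers several query types.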

Citations: 15

Top-k keyword search over probabilistic XML data

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767875

Jianxin Li, Chengfei Liu, Rui Zhou, Wei Wang

Despite the proliferation of work on XML keyword query, supporting keyword query over probabilistic XML data remains an open problem. Compared with traditional keyword search, it is far more expensive to answer a keyword query over probabilistic XML data because possible world semantics must be considered. In this paper, we first define the new problem of top-k keyword search over probabilistic XML data, which is to retrieve the k SLCA results with the k highest probabilities of existence. We then propose two efficient algorithms. The first algorithm, PrStack, can find the k SLCA results with the k highest probabilities by scanning the relevant keyword nodes only once. To further improve efficiency, we propose a second algorithm, EagerTopK, based on a set of pruning properties that can quickly prune unsatisfied SLCA candidates. Finally, we implement the two algorithms and compare their performance through an analysis of extensive experimental results.

Citations: 74

CT-index: Fingerprint-based graph indexing combining cycles and trees

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767909

K. Klein, Nils M. Kriege, Petra Mutzel

Efficient subgraph queries in large databases are a time-critical task in many application areas, such as biology or chemistry, where biological networks or chemical compounds are modeled as graphs. The NP-completeness of the underlying subgraph isomorphism problem renders an exact subgraph test for each database graph infeasible. Therefore, efficient methods have to be found that avoid most of these tests but still identify all graphs containing the query pattern. We propose a new approach based on the filter-verification paradigm, using a new hash-key fingerprint technique with a combination of tree and cycle features for filtering and a new subgraph isomorphism test for verification. Our approach is able to cope with edge and vertex labels and also allows wildcard patterns in the search. We present an experimental comparison of our approach with state-of-the-art methods using a benchmark set of both real-world and generated graph instances that shows its practicability. Our approach is implemented as part of the Scaffold Hunter software, a tool for the visual analysis of chemical compound databases.
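
The filtering half of the filter-verification paradigm can be sketched with a generic bit fingerprint. This is an illustration under assumptions, not CT-index itself: features are given as plain strings (standing in for canonical tree and cycle descriptions), `zlib.crc32` is an arbitrary choice of hash, and the names `fingerprint` and `may_contain` are hypothetical.

```python
import zlib

def fingerprint(features, bits=64):
    """Hash each feature string to one bit position and OR the bits together."""
    fp = 0
    for feature in features:
        fp |= 1 << (zlib.crc32(feature.encode()) % bits)
    return fp

def may_contain(graph_fp, query_fp):
    """Filter step: every bit set for the query must also be set for the graph.

    False means the graph cannot contain the query pattern. True means the
    graph is only a candidate and still needs a subgraph isomorphism test.
    """
    return (query_fp & ~graph_fp) == 0

# Hypothetical feature strings for one database graph and one query.
graph_fp = fingerprint(["tree:C-C-O", "tree:C-N", "cycle:C6"])
query_fp = fingerprint(["tree:C-C-O", "cycle:C6"])
candidate = may_contain(graph_fp, query_fp)
```

Because the query's features are a subset of the graph's features, the filter never rejects a true match; hash collisions can only add false positives, which the verification step removes.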

Citations: 47

HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767867

A. Kemper, Thomas Neumann

The two areas of online transaction processing (OLTP) and online analytical processing (OLAP) present different challenges for database architectures. Currently, customers with high rates of mission-critical transactions have split their data into two separate systems: one database for OLTP and one so-called data warehouse for OLAP. While allowing for decent transaction rates, this separation has many disadvantages, including data freshness issues due to the delay caused by only periodically initiating the Extract-Transform-Load data staging, and excessive resource consumption due to maintaining two separate information systems. We present an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data. HyPer is a main-memory database system that guarantees the ACID properties of OLTP transactions and executes OLAP query sessions (multiple queries) on the same, arbitrarily current and consistent snapshot. The utilization of the processor-inherent support for virtual memory management (address translation, caching, copy on update) yields both at the same time: unprecedentedly high transaction rates of up to 100,000 per second and very fast OLAP query response times on a single system executing both workloads in parallel. The performance analysis is based on a combined TPC-C and TPC-H benchmark.
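
The effect of a virtual memory snapshot with copy on update can be demonstrated with a process fork. This is a minimal sketch under assumptions: on POSIX systems `os.fork()` gives the child a copy-on-write view of the parent's memory, which is one way to obtain such a snapshot; the abstract does not state that HyPer is implemented this way, and the account data and the function name `sum_on_snapshot` are invented for the example. It runs only on Unix-like systems.

```python
import os

def sum_on_snapshot(accounts):
    """Fork a child that sums balances over the state as of the fork."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child ("OLAP"): its address space is a copy-on-write snapshot.
        os.close(read_fd)
        os.write(write_fd, str(sum(accounts.values())).encode())
        os._exit(0)
    # Parent ("OLTP"): keeps updating its own copy while the query runs.
    os.close(write_fd)
    accounts["alice"] += 100
    result = int(os.read(read_fd, 64))
    os.close(read_fd)
    os.waitpid(pid, 0)
    return result

accounts = {"alice": 100, "bob": 200}
snapshot_total = sum_on_snapshot(accounts)
current_total = sum(accounts.values())
```

The child always reports the total as of the fork, regardless of when the parent applies its update, because the operating system copies a page only when one side writes to it.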

Citations: 651

Flexible use of cloud resources through profit maximization and price discrimination

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767932

Konstantinos Tsakalozos, H. Kllapi, Evangelia A. Sitaridi, M. Roussopoulos, Dimitris Paparas, A. Delis

Modern frameworks, such as Hadoop, combined with an abundance of computing resources from the cloud, offer a significant opportunity to address long-standing challenges in distributed processing. Infrastructure-as-a-Service clouds reduce the investment cost of renting a large data center, while distributed processing frameworks are capable of efficiently harvesting the rented physical resources. Yet the performance users get out of these resources varies greatly because the cloud hardware is shared by all users. The value for money that cloud consumers achieve makes resource sharing policies a key factor in both cloud performance and user satisfaction. In this paper, we employ microeconomics to direct the allotment of cloud resources for consumption in highly scalable master-worker virtual infrastructures. Our approach is developed on two premises: the cloud consumer always has a budget, and cloud physical resources are limited. Using our approach, the cloud administration is able to maximize per-user financial profit. We show that there is an equilibrium point at which our method achieves resource sharing proportional to each user's budget. Ultimately, this approach allows us to answer the question of how many resources a consumer should request from the seemingly endless pool provided by the cloud.
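
The outcome at the equilibrium point, resource sharing proportional to each user's budget, can be written down directly. The sketch below shows only that final allocation rule, not the pricing mechanism that reaches it; the function name `proportional_shares` and the sample budgets are assumptions.

```python
def proportional_shares(budgets, capacity):
    """Split a fixed capacity among users in proportion to their budgets."""
    total = sum(budgets.values())
    if total <= 0:
        raise ValueError("total budget must be positive")
    return {user: capacity * budget / total for user, budget in budgets.items()}

# Hypothetical example: 8 physical machines shared by two users.
shares = proportional_shares({"a": 10, "b": 30}, capacity=8)
```

With budgets of 10 and 30, user `a` receives a quarter of the capacity and user `b` the remaining three quarters.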

Citations: 117

T-verifier: Verifying truthfulness of fact statements

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767859

Xian Li, W. Meng, Clement T. Yu

The Web has become the most popular place for people to acquire information. Unfortunately, it is widely recognized that the Web contains a significant amount of untruthful information. As a result, good tools are needed to help Web users determine the truthfulness of certain information. In this paper, we propose a two-step method that aims to determine whether a given statement is truthful and, if it is not, to find the truthful statement most related to the given statement. In the first step, we try to find a small number of alternative statements on the same topic as the given statement and make sure that one of these statements is truthful. In the second step, we identify the truthful statement from among the given statement and the alternative statements. Both steps rely heavily on analysing various features extracted from the search results returned by a popular search engine for appropriate queries. Our experimental results show that the best variant of the proposed method can achieve a precision of about 90%.

Citations: 58

HashFile: An efficient index structure for multimedia data

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767837

Dongxiang Zhang, D. Agrawal, Gang Chen, A. Tung

Nearest neighbor (NN) search in high-dimensional space is an essential query in many multimedia retrieval applications. Due to the curse of dimensionality, existing index structures might perform even worse than a simple sequential scan of the data when answering exact NN queries. To improve the efficiency of NN search, locality sensitive hashing (LSH) and its variants have been proposed to find approximate NNs. They adopt hash functions that preserve the Euclidean distance, so that similar objects have a high probability of colliding in the same bucket. Given a query object, the candidates for the query result are obtained by accessing the points located in the same bucket. To improve the precision, each hash table is associated with m hash functions that recursively hash the data points into smaller buckets and remove false positives. On the other hand, multiple hash tables are required to guarantee high retrieval recall. Thus, tuning a good tradeoff between precision and recall becomes the main challenge for LSH. Recently, the locality sensitive B-tree (LSB-tree) has been proposed to ensure both quality and efficiency. However, that index uses random I/O access. When the multimedia database is large, it incurs considerable disk I/O cost to obtain an approximation ratio that works in practice. In this paper, we propose a novel index structure, named HashFile, for efficient retrieval of multimedia objects. It combines the advantages of random projection and linear scan. Unlike the LSH family, in which each bucket is associated with a concatenation of m hash values, we recursively partition only the dense buckets and organize them as a tree structure. Given a query point q, the search algorithm explores the buckets near the query object in a top-down manner. The candidate buckets in each node are stored sequentially in increasing order of hash value and can be efficiently loaded into memory for a linear scan. HashFile can support both exact and approximate NN queries. Experimental results show that HashFile performs better than existing indexes in answering both types of NN queries.
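
The basic LSH building block described above, a random projection followed by a linear scan of the matching bucket, can be sketched as follows. This shows a single hash function and a single table only; it is not HashFile (no recursive partitioning of dense buckets, no on-disk layout), and the names `make_hash`, `build_index`, and `approx_nn` are assumptions.

```python
import math
import random
from collections import defaultdict

def make_hash(dim, width, seed=0):
    """Random-projection hash: h(v) = floor((a . v + b) / width)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / width)

def build_index(points, h):
    """Group points by hash value so that nearby points tend to share a bucket."""
    buckets = defaultdict(list)
    for p in points:
        buckets[h(p)].append(p)
    return buckets

def approx_nn(query, buckets, h):
    """Linear scan over the query's bucket only; may miss the true NN."""
    candidates = buckets.get(h(query), [])
    return min(candidates, key=lambda p: math.dist(p, query), default=None)

points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.5), (10.0, 10.0)]
h = make_hash(dim=2, width=4.0)
index = build_index(points, h)
```

A larger `width` puts more points in each bucket, which raises recall at the cost of a longer scan; this is the precision and recall tradeoff the abstract refers to.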

Citations: 22

Secure and efficient in-network processing of exact SUM queries

2011 IEEE 27th International Conference on Data Engineering

Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767886

Stavros Papadopoulos, A. Kiayias, D. Papadias

In-network aggregation is a popular methodology adopted in wireless sensor networks, which reduces the energy expenditure of processing aggregate queries (such as SUM, MAX, etc.) over the sensor readings. Recently, research has focused on secure in-network aggregation, motivated (i) by the fact that the sensors are usually deployed in open and unsafe environments, and (ii) by new trends such as outsourcing, where the aggregation process is delegated to an untrustworthy service. This new paradigm necessitates the following key security properties: data confidentiality, integrity, authentication, and freshness. The majority of the existing work on the topic is either unsuitable for large-scale sensor networks or provides only approximate answers for SUM queries (as well as their derivatives, e.g., COUNT, AVG, etc.). Moreover, there is currently no approach offering both confidentiality and integrity at the same time. To this end, we propose a novel and efficient scheme called SIES. SIES is the first solution that supports Secure In-network processing of Exact SUM queries, satisfying all security properties. It achieves this goal through a combination of homomorphic encryption and secret sharing. Furthermore, SIES is lightweight (it relies on inexpensive hash operations and modular additions/multiplications) and features very small bandwidth consumption (on the order of a few bytes). Consequently, SIES constitutes an ideal method for resource-constrained sensors.
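
How an aggregator can add encrypted readings without seeing them can be shown with a simple additively homomorphic masking scheme. This is not the SIES construction: it provides confidentiality only (no integrity, authentication, or secret sharing), and the modulus, the key handling, and the names `pad`, `encrypt`, `aggregate`, and `decrypt_sum` are all assumptions made for the example.

```python
import hashlib

P = 2**61 - 1  # public prime modulus, assumed larger than any possible true sum

def pad(key, epoch):
    """Derive a per-epoch mask from a sensor's secret key with a hash."""
    digest = hashlib.sha256(key + epoch.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % P

def encrypt(value, key, epoch):
    """Sensor side: mask the reading with a modular addition."""
    return (value + pad(key, epoch)) % P

def aggregate(ciphertexts):
    """Intermediate nodes add ciphertexts without learning any reading."""
    return sum(ciphertexts) % P

def decrypt_sum(total, keys, epoch):
    """Querier side: remove the sum of all masks to recover the exact SUM."""
    return (total - sum(pad(k, epoch) for k in keys)) % P

keys = [b"sensor-1", b"sensor-2", b"sensor-3"]  # hypothetical shared secrets
readings = [21, 17, 30]
epoch = 7
ciphertexts = [encrypt(v, k, epoch) for v, k in zip(readings, keys)]
recovered = decrypt_sum(aggregate(ciphertexts), keys, epoch)
```

Each sensor performs one hash and one modular addition, and each message is a single value modulo `P`, which reflects the kind of lightweight cost profile the abstract describes. Including the epoch in the hash changes the mask for every query round.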

Citations: 30