Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544835
Peng Wang, C. Ravishankar
We show how to execute range queries securely and efficiently on encrypted databases in the cloud. Current methods provide either security or efficiency, but not both. Many schemes even reveal the ordering of encrypted tuples, which, as we show, allows adversaries to estimate plaintext values accurately. We present R̂-trees, hierarchical encrypted indexes that can be placed securely in the cloud and searched efficiently. They are based on a mechanism we design for encrypted halfspace range queries in ℝ^d, using Asymmetric Scalar-product Preserving Encryption. Data owners can tune R̂-tree parameters to achieve desired security-efficiency tradeoffs. We also present extensive experiments to evaluate R̂-tree performance. Our results show that R̂-tree queries are efficient on encrypted databases and reveal far less information than competing methods.
Title: Secure and efficient range queries on outsourced databases using Rp-trees
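The core mechanism above, Asymmetric Scalar-product Preserving Encryption (ASPE), transforms data points by a secret invertible matrix and queries by its inverse, so that only the sign of the halfspace test survives encryption. A minimal numpy sketch of that idea (the augmented-vector encoding of the offset is an illustrative assumption, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Secret invertible key matrix; (d+1)x(d+1) so the halfspace offset b can be
# embedded in an extra coordinate.
M = rng.random((d + 1, d + 1))
while abs(np.linalg.det(M)) < 1e-3:          # regenerate if near-singular
    M = rng.random((d + 1, d + 1))

def enc_point(p):
    """Encrypt a data point: append a constant 1, multiply by M^T."""
    return M.T @ np.append(p, 1.0)

def enc_halfspace(w, b):
    """Encrypt the halfspace query w·x - b >= 0: append -b, multiply by M^-1."""
    return np.linalg.inv(M) @ np.append(w, -b)

def enc_test(cp, cq):
    # (M^T p̂) · (M^-1 q̂) = p̂ · q̂ = w·p - b, so only the sign is revealed.
    return cp @ cq >= 0

p = np.array([1.0, 2.0, 3.0])
w, b = np.array([1.0, 0.0, 0.0]), 0.5        # halfspace x0 >= 0.5
assert enc_test(enc_point(p), enc_halfspace(w, b)) == (w @ p - b >= 0)
```

The server evaluating `enc_test` learns which side of the (hidden) hyperplane each encrypted point lies on, but not the plaintext coordinates, which is what makes encrypted hierarchical index traversal possible.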
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544944
Yuqing Zhu, Philip S. Yu, Jianmin Wang
Replication is critical to the scalability, availability, and reliability of large-scale systems. The trade-off between replica consistency and response latency is well understood for large-scale stores with replication. The weak consistency guaranteed by existing large-scale stores complicates application development, while strong consistency hurts application performance. Ideally, the best possible consistency would be guaranteed within a tolerable response latency, but none of the existing large-scale stores supports maximizing replica consistency within a given latency constraint. In this demonstration, we showcase RECODS (REplica Consistency-On-Demand Store), a NoSQL store implementation that can finely control this trade-off on a per-operation basis and thus facilitate application development with on-demand replica consistency. With RECODS, developers can specify the tolerable latency for each read/write operation. Within the specified latency constraint, a response is returned with replica consistency maximized. The RECODS implementation is based on Cassandra, an open-source NoSQL store, but uses a different operation execution process, replication process, and in-memory storage hierarchy.
Title: RECODS: Replica consistency-on-demand store
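The per-operation latency bound described above can be sketched as a read that gathers whichever replica responses arrive before a deadline and returns the freshest one, so consistency degrades gracefully instead of failing. This is a toy illustration with mock replicas, not RECODS' actual read path:

```python
import concurrent.futures as cf
import time

def read_with_deadline(replicas, deadline_s):
    """Collect the replica responses that arrive within the deadline and
    return the freshest value plus the number of replicas that answered
    in time (the achieved consistency level)."""
    with cf.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r) for r in replicas]
        done, _ = cf.wait(futures, timeout=deadline_s)
        responses = [f.result() for f in done]        # (timestamp, value) pairs
    if not responses:
        raise TimeoutError("no replica answered within the deadline")
    ts, value = max(responses)                        # freshest timestamp wins
    return value, len(responses)

def make_replica(latency_s, ts, value):
    def read():
        time.sleep(latency_s)                         # simulated network delay
        return ts, value
    return read

replicas = [make_replica(0.01, ts=2, value="new"),
            make_replica(0.02, ts=2, value="new"),
            make_replica(1.00, ts=1, value="old")]    # slow, stale replica
value, acks = read_with_deadline(replicas, deadline_s=0.3)
print(value, acks)    # new 2  (the slow replica misses the deadline)
```

A tighter deadline yields fewer acknowledgements and thus weaker consistency; a looser one lets more replicas answer, which is exactly the knob RECODS exposes per operation.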
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544901
Pit Fender, G. Moerkotte
Finding the optimal execution order of join operations is a crucial task of today's cost-based query optimizers. There are two approaches to identify the best plan: bottom-up and top-down join enumeration. But only the top-down approach allows for branch-and-bound pruning, which can improve compile time by several orders of magnitude while still preserving optimality. For both optimization strategies, efficient enumeration algorithms have been published. However, there are two severe limitations for the top-down approach: The published algorithms can handle only (1) simple (binary) join predicates and (2) inner joins. Since real queries may contain complex join predicates involving more than two relations, and outer joins as well as other non-inner joins, efficient top-down join enumeration cannot be used in practice yet. We develop a novel top-down join enumeration algorithm that overcomes these two limitations. Furthermore, we show that our new algorithm is competitive when compared with the state of the art in bottom-up processing even without playing out its advantage by making use of its branch-and-bound pruning capabilities.
Title: Top down plan generation: From theory to practice
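The top-down strategy in the abstract, recursively partitioning a relation set into two subplans with memoization and branch-and-bound pruning, can be sketched for a toy cost model (the cost model and relation cardinalities below are assumptions for illustration, and join-predicate connectivity is ignored):

```python
from itertools import combinations

# Toy cost model: a join node's cost surrogate is the product of the
# cardinalities of all base relations below it; plan cost sums this over
# all internal nodes.
CARD = {"A": 100, "B": 50, "C": 1000, "D": 10}

def best_plan(rels, memo=None):
    """Top-down enumeration: split `rels` into two non-empty parts and recurse.
    Splits whose left subplan plus this node's cost already exceed the best
    plan found so far at this node are pruned (simplified branch-and-bound)."""
    memo = {} if memo is None else memo
    rels = frozenset(rels)
    if rels in memo:
        return memo[rels]
    if len(rels) == 1:
        memo[rels] = (next(iter(rels)), 0)            # (plan tree, cost)
        return memo[rels]
    node_card = 1
    for r in rels:
        node_card *= CARD[r]
    best = (None, float("inf"))
    items = sorted(rels)
    for k in range(1, len(items) // 2 + 1):
        for left in map(frozenset, combinations(items, k)):
            lt, lc = best_plan(left, memo)
            if lc + node_card >= best[1]:             # prune: cannot improve
                continue
            rt, rc = best_plan(rels - left, memo)
            cost = lc + rc + node_card
            if cost < best[1]:
                best = ((lt, rt), cost)
    memo[rels] = best
    return best

tree, cost = best_plan({"A", "B", "C", "D"})
print(cost)   # 50015000: the bushy plan ((A join B) join (C join D)) wins here
```

The paper's contribution is making this style of enumeration handle complex (hyperedge) predicates and non-inner joins; this sketch shows only the enumeration-plus-pruning skeleton.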
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544934
Hyunsik Choi, Jihoon Son, Haemi Yang, Hyoseok Ryu, Byungnam Lim, Soohyung Kim, Y. Chung
The increasing volume of relational data calls for new alternatives to cope with it. Recently, several hybrid approaches between parallel databases and Hadoop (e.g., HadoopDB and Hive) have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid choosing suboptimal execution strategies. We believe this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system for shared-nothing clusters. It uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine, which we developed in place of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and coordinates the workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph (DAG) of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including existing methods studied in traditional database research. To give a deep understanding of the Tajo architecture and its behavior during query processing, the demonstration will allow users to submit TPC-H queries to a 32-node Tajo cluster. The web-based user interface will show (1) how the submitted queries are planned, (2) how queries are distributed across nodes, (3) the cluster and node status, and (4) details of relations and their physical information. We also provide a performance evaluation of Tajo compared with Hive.
Title: Tajo: A distributed data warehouse system on large clusters
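The per-worker DAG of pipelined physical operators can be illustrated with iterator-style operators (the operator names and tiny tables below are made up for illustration; Tajo's actual engine is a Java system and far more elaborate):

```python
# Iterator-style physical operators: each operator pulls rows from its
# children lazily, so the whole plan is pipelined with no materialization
# except the join's build side.

def scan(table):
    yield from table

def filter_op(child, pred):
    for row in child:
        if pred(row):
            yield row

def hash_join(left, right, key):
    # Build a hash table on the left input, then stream the right input.
    build = {}
    for row in left:
        build.setdefault(row[key], []).append(row)
    for row in right:
        for match in build.get(row[key], []):
            yield {**match, **row}

orders = [{"id": 1, "price": 10}, {"id": 2, "price": 99}]
items  = [{"id": 1, "sku": "x"}, {"id": 2, "sku": "y"}]
plan = filter_op(hash_join(scan(orders), scan(items), "id"),
                 lambda r: r["price"] > 50)
print(list(plan))   # [{'id': 2, 'price': 99, 'sku': 'y'}]
```

Because each operator is a generator, a multi-input DAG node (like the join) consumes several sources while its output still flows row-by-row into its parent, which is the pipelining the abstract describes.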
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544916
Fabian M. Suchanek, G. Weikum
The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications.
Title: Knowledge harvesting from text and Web sources
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544893
Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, G. Weikum
In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme, where the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes needed to answer a query is minimized. The resulting tunable index structure generates a better range labeling as the space budget is increased, providing direct control over the trade-off between index size and query processing performance. By using a fast recursive querying method in conjunction with our index structure, we show that, in practice, reachability queries can be answered in the order of microseconds on an off-the-shelf computer, even for massive-scale real-world graphs.
Title: FERRARI: Flexible and efficient reachability range assignment for graph indexing
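The node interval labeling idea can be illustrated for the exact tree case, where each vertex's reachable set is a single postorder-id range; FERRARI additionally merges ranges into approximate ones under a space budget and falls back to recursive probes, which this sketch omits:

```python
def postorder_label(root, children):
    """Assign postorder ids; in a tree, each node's reachable set is then
    the contiguous id range [lowest descendant id, own id]."""
    ids, ranges = {}, {}
    counter = [0]
    def dfs(u):
        lo = counter[0]
        for c in children.get(u, []):
            dfs(c)
        ids[u] = counter[0]
        counter[0] += 1
        ranges[u] = (lo, ids[u])
    dfs(root)
    return ids, ranges

children = {"a": ["b", "c"], "b": ["d"]}
ids, ranges = postorder_label("a", children)

def reaches(u, v):
    """u reaches v iff v's id falls in u's range: one probe, no traversal."""
    lo, hi = ranges[u]
    return lo <= ids[v] <= hi

print(reaches("a", "d"), reaches("b", "c"))   # True False
```

On general DAGs a node may need several ranges (one per tree-cover entry); FERRARI's contribution is bounding the total number of ranges and choosing which to keep exact versus approximate.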
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544903
Jaeyoung Do, Donghui Zhang, J. Patel, D. DeWitt
A promising use of flash SSDs in a DBMS is to extend the main memory buffer pool by caching selected pages that have been evicted from the buffer pool. Such a use has been shown to produce significant gains in the steady state performance of the DBMS. One strategy for using the SSD buffer pool is to throw away the data in the SSD when the system is restarted (either when recovering from a crash or restarting after a shutdown), and consequently a long “ramp-up” period to regain peak performance is needed. One approach to eliminate this limitation is to use a memory-mapped file to store the SSD buffer table in order to be able to restore its contents on restart. However, this design can result in lower sustained performance, because every update to the SSD buffer table may incur an I/O operation to the memory-mapped file. In this paper we propose two new alternative designs. One design reconstructs the SSD buffer table using transactional logs. The other design asynchronously flushes the SSD buffer table, and upon restart, lazily verifies the integrity of the data cached in the SSD buffer pool. We have implemented these three designs in SQL Server 2012. For each design, both the write-through and write-back SSD caching policies were implemented. Using two OLTP benchmarks (TPC-C and TPC-E), our experimental results show that our designs produce up to 3.8X speedup on the interval between peak-to-peak performance, with negligible performance loss; in contrast, the previous approach has a similar speedup but up to 54% performance loss.
Title: Fast peak-to-peak behavior with SSD buffer pool
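The log-based recovery design can be sketched as a replay of buffer-table log records after restart (the record format and field names below are assumptions for illustration, not SQL Server's actual log layout):

```python
# Rebuild the in-memory SSD buffer table (page id -> SSD slot) by replaying
# logged cache/evict events in order; the last event for a page wins.

def rebuild_buffer_table(log_records):
    table = {}
    for rec in log_records:
        if rec["op"] == "cache":
            table[rec["page_id"]] = rec["slot"]
        elif rec["op"] == "evict":
            table.pop(rec["page_id"], None)
    return table

log = [{"op": "cache", "page_id": 7, "slot": 0},
       {"op": "cache", "page_id": 9, "slot": 1},
       {"op": "evict", "page_id": 7}]
print(rebuild_buffer_table(log))    # {9: 1}
```

The alternative design in the abstract avoids even this replay by flushing the table asynchronously and lazily verifying each cached page's integrity on first access after restart.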
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544943
Sean R. Spillane, Jeremy Birnbaum, Daniel Bokser, Daniel Kemp, Alan G. Labouseur, Paul W. Olsen, Jayadevan Vijayan, Jeong-Hyon Hwang, Jun-Weon Yoon
The world is full of evolving networks, many of which can be represented by a series of large graphs. Neither current graph processing systems nor database systems can efficiently store and query these graphs, owing to their lack of support for managing multiple graphs and their lack of essential graph querying capabilities. We propose to demonstrate our system, G*, which meets these new challenges of managing multiple graphs and supporting fundamental graph querying capabilities. G* can store graphs on a large number of servers while compressing them based on their commonalities. G* also allows users to easily express queries on graphs and executes those queries efficiently by sharing computations across graphs. During our demonstrations, conference attendees will run various analytic queries on large, practical data sets. These demonstrations will highlight the convenience and performance benefits of G* over existing database and graph processing systems, the effectiveness of sharing in graph data storage and processing, as well as G*'s scalability.
Title: A demonstration of the G∗ graph database system
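The commonality-based compression idea, storing each edge once and tagging it with the snapshots that contain it, can be sketched as follows (an illustration of the sharing principle only, not G*'s actual storage layout or class names):

```python
# A series of graph snapshots shares most edges; store each edge once with
# the set of snapshot ids in which it appears, instead of one copy per graph.

class MultiGraphStore:
    def __init__(self):
        self.edges = {}                      # (u, v) -> set of snapshot ids

    def add_edge(self, u, v, snapshot):
        self.edges.setdefault((u, v), set()).add(snapshot)

    def neighbors(self, u, snapshot):
        return [v for (a, v), snaps in self.edges.items()
                if a == u and snapshot in snaps]

store = MultiGraphStore()
for g in (1, 2, 3):
    store.add_edge("a", "b", g)              # edge shared by all snapshots
store.add_edge("a", "c", 3)                  # edge only in the newest snapshot
print(store.neighbors("a", 3))               # ['b', 'c']
print(store.neighbors("a", 1))               # ['b']
```

Here three snapshots cost one stored copy of the shared edge plus one snapshot-specific edge; queries over several snapshots can likewise visit each shared edge once.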
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544857
Zhenjie Zhang, Hu Shu, Zhihong Chong, Hua Lu, Y. Yang
Continuous clustering analysis over a data stream reports clustering results incrementally as updates arrive. Such analysis has a wide spectrum of applications, including traffic monitoring and topic discovery on microblogs. A common characteristic of streaming applications is that the amount of workload fluctuates, often in an unpredictable manner. On the other hand, most existing solutions for continuous clustering assume either a central server or a distributed setting with a fixed number of dedicated servers. In other words, they are not elastic: they cannot dynamically adapt the amount of computational resources to the fluctuating workload. Consequently, they waste considerable resources, as the servers are under-utilized when the workload is low. This paper proposes C-Cube, the first elastic approach to continuous streaming clustering. Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each new record to a processing unit, e.g., a virtual machine, based on its hash value. Each processing unit performs the required computations and sends its results to a lightweight aggregator. This design enables dynamically adding/removing processing units, as well as replacing faulty ones and re-running their tasks. In addition to elasticity, C-Cube is also effective (it provides quality guarantees on the clustering results), efficient (it minimizes the computational workload at all times), and generally applicable to a large class of clustering criteria. We implemented C-Cube in a real system based on Twitter Storm and evaluated it using real and synthetic datasets. Extensive experimental results confirm our performance claims.
Title: C-Cube: Elastic continuous clustering in the cloud
Pub Date : 2013-04-08DOI: 10.1109/ICDE.2013.6544862
Sarah Masud, F. Choudhury, Mohammed Eunus Ali, Sarana Nutanong
Many real-world problems, such as placement of surveillance cameras and pricing of hotel rooms with a view, require the ability to determine the visibility of a given target object from different locations. Advances in large-scale 3D modeling (e.g., 3D virtual cities) provide us with data that can be used to solve these problems with high accuracy. In this paper, we investigate the problem of finding the location that provides the best view of a target object amid visual obstacles in 2D or 3D space, for example, finding the location that provides the best view of fireworks in a city with tall buildings. To solve this problem, we first define the quality measure of a view (i.e., visibility measure) as the visible angular size of the target object. Then, we propose a new query type called the k-Maximum Visibility (kMV) query, which finds the k locations from a given set that maximize the visibility of the target object. Our objective in this paper is to design a query solution capable of handling large-scale city models. This objective precludes approaches that rely on constructing a visibility graph of the entire data space. We therefore propose three approaches that incrementally consider relevant obstacles in order to determine the visibility of a target object from a given set of locations. These approaches differ in the order of obstacle retrieval, namely: query-centric distance-based, query-centric visible-region-based, and target-centric distance-based approaches. We have conducted an extensive experimental study on real 2D and 3D datasets to demonstrate the efficiency and effectiveness of our solutions.
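The visibility measure above — the visible angular size of the target — can be sketched in its simplest, obstacle-free 2D form. The function names are hypothetical and the target is modeled as a flat segment of a given width facing the viewer; the paper's actual measure additionally subtracts the angular extent occluded by obstacles, which this sketch omits.

```python
import math

def angular_size(viewpoint, target_center, target_width):
    """Angular size (in radians) of a flat target of the given width,
    seen face-on from viewpoint, ignoring obstacles."""
    dx = target_center[0] - viewpoint[0]
    dy = target_center[1] - viewpoint[1]
    dist = math.hypot(dx, dy)
    return 2.0 * math.atan(target_width / (2.0 * dist))

def k_max_visibility(candidates, target_center, target_width, k):
    """Obstacle-free kMV query: rank candidate locations by angular
    size of the target and return the top k."""
    return sorted(candidates,
                  key=lambda p: angular_size(p, target_center, target_width),
                  reverse=True)[:k]

# Three candidate viewpoints on a line; the target sits at (20, 0).
cands = [(0, 0), (5, 0), (10, 0)]
best = k_max_visibility(cands, (20, 0), 4.0, 1)
print(best)  # [(10, 0)]: the closest candidate sees the largest angle
```

With obstacles, the angular size is no longer a simple function of distance, which is why the paper's three approaches retrieve and subtract occluding obstacles incrementally instead of precomputing a global visibility graph.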
{"title":"Maximum visibility queries in spatial databases","authors":"Sarah Masud, F. Choudhury, Mohammed Eunus Ali, Sarana Nutanong","doi":"10.1109/ICDE.2013.6544862","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544862","url":null,"abstract":"Many real-world problems, such as placement of surveillance cameras and pricing of hotel rooms with a view, require the ability to determine the visibility of a given target object from different locations. Advances in large-scale 3D modeling (e.g., 3D virtual cities) provide us with data that can be used to solve these problems with high accuracy. In this paper, we investigate the problem of finding the location that provides the best view of a target object amid visual obstacles in 2D or 3D space, for example, finding the location that provides the best view of fireworks in a city with tall buildings. To solve this problem, we first define the quality measure of a view (i.e., visibility measure) as the visible angular size of the target object. Then, we propose a new query type called the k-Maximum Visibility (kMV) query, which finds the k locations from a given set that maximize the visibility of the target object. Our objective in this paper is to design a query solution capable of handling large-scale city models. This objective precludes approaches that rely on constructing a visibility graph of the entire data space. We therefore propose three approaches that incrementally consider relevant obstacles in order to determine the visibility of a target object from a given set of locations. These approaches differ in the order of obstacle retrieval, namely: query-centric distance-based, query-centric visible-region-based, and target-centric distance-based approaches. We have conducted an extensive experimental study on real 2D and 3D datasets to demonstrate the efficiency and effectiveness of our solutions.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124765553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}