Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544835
Peng Wang, C. Ravishankar
We show how to execute range queries securely and efficiently on encrypted databases in the cloud. Current methods provide either security or efficiency, but not both. Many schemes even reveal the ordering of encrypted tuples, which, as we show, allows adversaries to estimate plaintext values accurately. We present R̂-trees, hierarchical encrypted indexes that can be placed securely in the cloud and searched efficiently. They are based on a mechanism we design for encrypted halfspace range queries in ℝ^d, using Asymmetric Scalar-product Preserving Encryption. Data owners can tune R̂-tree parameters to achieve desired security-efficiency tradeoffs. We also present extensive experiments to evaluate R̂-tree performance. Our results show that R̂-tree queries are efficient on encrypted databases and reveal far less information than competing methods.
Title: Secure and efficient range queries on outsourced databases using Rp-trees
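The core mechanism above, Asymmetric Scalar-product Preserving Encryption (ASPE), transforms data points by a secret invertible matrix and queries by its inverse, so that only the sign of the halfspace test survives encryption. A minimal numpy sketch of that idea (the augmented-vector encoding of the offset is an illustrative assumption, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Secret invertible key matrix; (d+1)x(d+1) so the halfspace offset b can be
# embedded in an extra coordinate.
M = rng.random((d + 1, d + 1))
while abs(np.linalg.det(M)) < 1e-3:          # regenerate if near-singular
    M = rng.random((d + 1, d + 1))

def enc_point(p):
    """Encrypt a data point: append a constant 1, multiply by M^T."""
    return M.T @ np.append(p, 1.0)

def enc_halfspace(w, b):
    """Encrypt the halfspace query w·x - b >= 0: append -b, multiply by M^-1."""
    return np.linalg.inv(M) @ np.append(w, -b)

def enc_test(cp, cq):
    # (M^T p̂) · (M^-1 q̂) = p̂ · q̂ = w·p - b, so only the sign is revealed.
    return cp @ cq >= 0

p = np.array([1.0, 2.0, 3.0])
w, b = np.array([1.0, 0.0, 0.0]), 0.5        # halfspace x0 >= 0.5
assert enc_test(enc_point(p), enc_halfspace(w, b)) == (w @ p - b >= 0)
```

The server evaluating `enc_test` learns which side of the (hidden) hyperplane each encrypted point lies on, but not the plaintext coordinates, which is what makes encrypted hierarchical index traversal possible.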
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544944
Yuqing Zhu, Philip S. Yu, Jianmin Wang
Replication is critical to the scalability, availability, and reliability of large-scale systems. The trade-off between replica consistency and response latency is well understood for large-scale stores with replication. The weak consistency guaranteed by existing large-scale stores complicates application development, while strong consistency hurts application performance. Ideally, the best possible consistency would be guaranteed within a tolerable response latency, but none of the existing large-scale stores supports maximizing replica consistency within a given latency constraint. In this demonstration, we showcase RECODS (REplica Consistency-On-Demand Store), a NoSQL store implementation that can finely control this trade-off on a per-operation basis and thus facilitate application development with on-demand replica consistency. With RECODS, developers can specify the tolerable latency for each read/write operation. Within the specified latency constraint, a response is returned with replica consistency maximized. The RECODS implementation is based on Cassandra, an open-source NoSQL store, but uses a different operation execution process, replication process, and in-memory storage hierarchy.
Title: RECODS: Replica consistency-on-demand store
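The per-operation latency bound described above can be sketched as a read that gathers whichever replica responses arrive before a deadline and returns the freshest one, so consistency degrades gracefully instead of failing. This is a toy illustration with mock replicas, not RECODS' actual read path:

```python
import concurrent.futures as cf
import time

def read_with_deadline(replicas, deadline_s):
    """Collect the replica responses that arrive within the deadline and
    return the freshest value plus the number of replicas that answered
    in time (the achieved consistency level)."""
    with cf.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r) for r in replicas]
        done, _ = cf.wait(futures, timeout=deadline_s)
        responses = [f.result() for f in done]        # (timestamp, value) pairs
    if not responses:
        raise TimeoutError("no replica answered within the deadline")
    ts, value = max(responses)                        # freshest timestamp wins
    return value, len(responses)

def make_replica(latency_s, ts, value):
    def read():
        time.sleep(latency_s)                         # simulated network delay
        return ts, value
    return read

replicas = [make_replica(0.01, ts=2, value="new"),
            make_replica(0.02, ts=2, value="new"),
            make_replica(1.00, ts=1, value="old")]    # slow, stale replica
value, acks = read_with_deadline(replicas, deadline_s=0.3)
print(value, acks)    # new 2  (the slow replica misses the deadline)
```

A tighter deadline yields fewer acknowledgements and thus weaker consistency; a looser one lets more replicas answer, which is exactly the knob RECODS exposes per operation.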
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544901
Pit Fender, G. Moerkotte
Finding the optimal execution order of join operations is a crucial task of today's cost-based query optimizers. There are two approaches to identify the best plan: bottom-up and top-down join enumeration. But only the top-down approach allows for branch-and-bound pruning, which can improve compile time by several orders of magnitude while still preserving optimality. For both optimization strategies, efficient enumeration algorithms have been published. However, there are two severe limitations for the top-down approach: The published algorithms can handle only (1) simple (binary) join predicates and (2) inner joins. Since real queries may contain complex join predicates involving more than two relations, and outer joins as well as other non-inner joins, efficient top-down join enumeration cannot be used in practice yet. We develop a novel top-down join enumeration algorithm that overcomes these two limitations. Furthermore, we show that our new algorithm is competitive when compared with the state of the art in bottom-up processing even without playing out its advantage by making use of its branch-and-bound pruning capabilities.
Title: Top down plan generation: From theory to practice
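The top-down strategy in the abstract, recursively partitioning a relation set into two subplans with memoization and branch-and-bound pruning, can be sketched for a toy cost model (the cost model and relation cardinalities below are assumptions for illustration, and join-predicate connectivity is ignored):

```python
from itertools import combinations

# Toy cost model: a join node's cost surrogate is the product of the
# cardinalities of all base relations below it; plan cost sums this over
# all internal nodes.
CARD = {"A": 100, "B": 50, "C": 1000, "D": 10}

def best_plan(rels, memo=None):
    """Top-down enumeration: split `rels` into two non-empty parts and recurse.
    Splits whose left subplan plus this node's cost already exceed the best
    plan found so far at this node are pruned (simplified branch-and-bound)."""
    memo = {} if memo is None else memo
    rels = frozenset(rels)
    if rels in memo:
        return memo[rels]
    if len(rels) == 1:
        memo[rels] = (next(iter(rels)), 0)            # (plan tree, cost)
        return memo[rels]
    node_card = 1
    for r in rels:
        node_card *= CARD[r]
    best = (None, float("inf"))
    items = sorted(rels)
    for k in range(1, len(items) // 2 + 1):
        for left in map(frozenset, combinations(items, k)):
            lt, lc = best_plan(left, memo)
            if lc + node_card >= best[1]:             # prune: cannot improve
                continue
            rt, rc = best_plan(rels - left, memo)
            cost = lc + rc + node_card
            if cost < best[1]:
                best = ((lt, rt), cost)
    memo[rels] = best
    return best

tree, cost = best_plan({"A", "B", "C", "D"})
print(cost)   # 50015000: the bushy plan ((A join B) join (C join D)) wins here
```

The paper's contribution is making this style of enumeration handle complex (hyperedge) predicates and non-inner joins; this sketch shows only the enumeration-plus-pruning skeleton.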
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544934
Hyunsik Choi, Jihoon Son, Haemi Yang, Hyoseok Ryu, Byungnam Lim, Soohyung Kim, Y. Chung
The increasing volume of relational data calls for new alternatives to cope with it. Recently, several hybrid approaches between parallel databases and Hadoop (e.g., HadoopDB and Hive) have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid choosing suboptimal execution strategies. We believe this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system for shared-nothing clusters. It uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine, which we developed in place of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and coordinates the workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph (DAG) of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including existing methods studied in traditional database research. To give a deep understanding of the Tajo architecture and its behavior during query processing, the demonstration will allow users to submit TPC-H queries to a 32-node Tajo cluster. The web-based user interface will show (1) how the submitted queries are planned, (2) how queries are distributed across nodes, (3) the cluster and node status, and (4) details of relations and their physical information. We also provide a performance evaluation of Tajo compared with Hive.
Title: Tajo: A distributed data warehouse system on large clusters
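The per-worker DAG of pipelined physical operators can be illustrated with iterator-style operators (the operator names and tiny tables below are made up for illustration; Tajo's actual engine is a Java system and far more elaborate):

```python
# Iterator-style physical operators: each operator pulls rows from its
# children lazily, so the whole plan is pipelined with no materialization
# except the join's build side.

def scan(table):
    yield from table

def filter_op(child, pred):
    for row in child:
        if pred(row):
            yield row

def hash_join(left, right, key):
    # Build a hash table on the left input, then stream the right input.
    build = {}
    for row in left:
        build.setdefault(row[key], []).append(row)
    for row in right:
        for match in build.get(row[key], []):
            yield {**match, **row}

orders = [{"id": 1, "price": 10}, {"id": 2, "price": 99}]
items  = [{"id": 1, "sku": "x"}, {"id": 2, "sku": "y"}]
plan = filter_op(hash_join(scan(orders), scan(items), "id"),
                 lambda r: r["price"] > 50)
print(list(plan))   # [{'id': 2, 'price': 99, 'sku': 'y'}]
```

Because each operator is a generator, a multi-input DAG node (like the join) consumes several sources while its output still flows row-by-row into its parent, which is the pipelining the abstract describes.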
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544916
Fabian M. Suchanek, G. Weikum
The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications.
Title: Knowledge harvesting from text and Web sources
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544893
Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, G. Weikum
In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme, where the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes needed to answer a query is minimized. The resulting tunable index structure generates a better range labeling as the space budget is increased, providing direct control over the trade-off between index size and query processing performance. By using a fast recursive querying method in conjunction with our index structure, we show that, in practice, reachability queries can be answered in the order of microseconds on an off-the-shelf computer, even for massive-scale real-world graphs.
Title: FERRARI: Flexible and efficient reachability range assignment for graph indexing
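The node interval labeling idea can be illustrated for the exact tree case, where each vertex's reachable set is a single postorder-id range; FERRARI additionally merges ranges into approximate ones under a space budget and falls back to recursive probes, which this sketch omits:

```python
def postorder_label(root, children):
    """Assign postorder ids; in a tree, each node's reachable set is then
    the contiguous id range [lowest descendant id, own id]."""
    ids, ranges = {}, {}
    counter = [0]
    def dfs(u):
        lo = counter[0]
        for c in children.get(u, []):
            dfs(c)
        ids[u] = counter[0]
        counter[0] += 1
        ranges[u] = (lo, ids[u])
    dfs(root)
    return ids, ranges

children = {"a": ["b", "c"], "b": ["d"]}
ids, ranges = postorder_label("a", children)

def reaches(u, v):
    """u reaches v iff v's id falls in u's range: one probe, no traversal."""
    lo, hi = ranges[u]
    return lo <= ids[v] <= hi

print(reaches("a", "d"), reaches("b", "c"))   # True False
```

On general DAGs a node may need several ranges (one per tree-cover entry); FERRARI's contribution is bounding the total number of ranges and choosing which to keep exact versus approximate.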
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544903
Jaeyoung Do, Donghui Zhang, J. Patel, D. DeWitt
A promising use of flash SSDs in a DBMS is to extend the main memory buffer pool by caching selected pages that have been evicted from the buffer pool. Such a use has been shown to produce significant gains in the steady state performance of the DBMS. One strategy for using the SSD buffer pool is to throw away the data in the SSD when the system is restarted (either when recovering from a crash or restarting after a shutdown), and consequently a long “ramp-up” period to regain peak performance is needed. One approach to eliminate this limitation is to use a memory-mapped file to store the SSD buffer table in order to be able to restore its contents on restart. However, this design can result in lower sustained performance, because every update to the SSD buffer table may incur an I/O operation to the memory-mapped file. In this paper we propose two new alternative designs. One design reconstructs the SSD buffer table using transactional logs. The other design asynchronously flushes the SSD buffer table, and upon restart, lazily verifies the integrity of the data cached in the SSD buffer pool. We have implemented these three designs in SQL Server 2012. For each design, both the write-through and write-back SSD caching policies were implemented. Using two OLTP benchmarks (TPC-C and TPC-E), our experimental results show that our designs produce up to 3.8X speedup on the interval between peak-to-peak performance, with negligible performance loss; in contrast, the previous approach has a similar speedup but up to 54% performance loss.
Title: Fast peak-to-peak behavior with SSD buffer pool
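The log-based recovery design can be sketched as a replay of buffer-table log records after restart (the record format and field names below are assumptions for illustration, not SQL Server's actual log layout):

```python
# Rebuild the in-memory SSD buffer table (page id -> SSD slot) by replaying
# logged cache/evict events in order; the last event for a page wins.

def rebuild_buffer_table(log_records):
    table = {}
    for rec in log_records:
        if rec["op"] == "cache":
            table[rec["page_id"]] = rec["slot"]
        elif rec["op"] == "evict":
            table.pop(rec["page_id"], None)
    return table

log = [{"op": "cache", "page_id": 7, "slot": 0},
       {"op": "cache", "page_id": 9, "slot": 1},
       {"op": "evict", "page_id": 7}]
print(rebuild_buffer_table(log))    # {9: 1}
```

The alternative design in the abstract avoids even this replay by flushing the table asynchronously and lazily verifying each cached page's integrity on first access after restart.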
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544943
Sean R. Spillane, Jeremy Birnbaum, Daniel Bokser, Daniel Kemp, Alan G. Labouseur, Paul W. Olsen, Jayadevan Vijayan, Jeong-Hyon Hwang, Jun-Weon Yoon
The world is full of evolving networks, many of which can be represented by a series of large graphs. Neither current graph processing systems nor database systems can efficiently store and query these graphs, owing to their lack of support for managing multiple graphs and their lack of essential graph querying capabilities. We propose to demonstrate our system, G*, which meets these new challenges of managing multiple graphs and supporting fundamental graph querying capabilities. G* can store graphs on a large number of servers while compressing them based on their commonalities. G* also allows users to easily express queries on graphs and executes those queries efficiently by sharing computations across graphs. During our demonstrations, conference attendees will run various analytic queries on large, practical data sets. These demonstrations will highlight the convenience and performance benefits of G* over existing database and graph processing systems, the effectiveness of sharing in graph data storage and processing, as well as G*'s scalability.
Title: A demonstration of the G∗ graph database system
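The commonality-based compression idea, storing each edge once and tagging it with the snapshots that contain it, can be sketched as follows (an illustration of the sharing principle only, not G*'s actual storage layout or class names):

```python
# A series of graph snapshots shares most edges; store each edge once with
# the set of snapshot ids in which it appears, instead of one copy per graph.

class MultiGraphStore:
    def __init__(self):
        self.edges = {}                      # (u, v) -> set of snapshot ids

    def add_edge(self, u, v, snapshot):
        self.edges.setdefault((u, v), set()).add(snapshot)

    def neighbors(self, u, snapshot):
        return [v for (a, v), snaps in self.edges.items()
                if a == u and snapshot in snaps]

store = MultiGraphStore()
for g in (1, 2, 3):
    store.add_edge("a", "b", g)              # edge shared by all snapshots
store.add_edge("a", "c", 3)                  # edge only in the newest snapshot
print(store.neighbors("a", 3))               # ['b', 'c']
print(store.neighbors("a", 1))               # ['b']
```

Here three snapshots cost one stored copy of the shared edge plus one snapshot-specific edge; queries over several snapshots can likewise visit each shared edge once.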
Pub Date: 2013-04-08 | DOI: 10.1109/ICDE.2013.6544857
Zhenjie Zhang, Hu Shu, Zhihong Chong, Hua Lu, Y. Yang
Continuous clustering analysis over a data stream reports clustering results incrementally as updates arrive. Such analysis has a wide spectrum of applications, including traffic monitoring and topic discovery on microblogs. A common characteristic of streaming applications is that the amount of workload fluctuates, often in an unpredictable manner. On the other hand, most existing solutions for continuous clustering assume either a central server or a distributed setting with a fixed number of dedicated servers. In other words, they are not elastic: they cannot dynamically adapt the amount of computational resources to the fluctuating workload. Consequently, they waste considerable resources, as the servers are under-utilized when the workload is low. This paper proposes C-Cube, the first elastic approach to continuous streaming clustering. Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each new record to a processing unit, e.g., a virtual machine, based on its hash value. Each processing unit performs the required computations and sends its results to a lightweight aggregator. This design enables dynamically adding/removing processing units, as well as replacing faulty ones and re-running their tasks. In addition to elasticity, C-Cube is also effective (it provides quality guarantees on the clustering results), efficient (it minimizes the computational workload at all times), and generally applicable to a large class of clustering criteria. We implemented C-Cube in a real system based on Twitter Storm and evaluated it using real and synthetic datasets. Extensive experimental results confirm our performance claims.
Title: C-Cube: Elastic continuous clustering in the cloud
Pub Date : 2013-04-08DOI: 10.1109/ICDE.2013.6544862
Sarah Masud, F. Choudhury, Mohammed Eunus Ali, Sarana Nutanong
Many real-world problems, such as placement of surveillance cameras and pricing of hotel rooms with a view, require the ability to determine the visibility of a given target object from different locations. Advances in large-scale 3D modeling (e.g., 3D virtual cities) provide us with data that can be used to solve these problems with high accuracy. In this paper, we investigate the problem of finding the location that provides the best view of a target object amid visual obstacles in 2D or 3D space, for example, finding the location that provides the best view of fireworks in a city with tall buildings. To solve this problem, we first define the quality measure of a view (i.e., visibility measure) as the visible angular size of the target object. Then, we propose a new query type called the k-Maximum Visibility (kMV) query, which finds the k locations from a given set that maximize the visibility of the target object. Our objective in this paper is to design a query solution capable of handling large-scale city models. This objective precludes approaches that rely on constructing a visibility graph of the entire data space. We therefore propose three approaches that incrementally consider relevant obstacles in order to determine the visibility of a target object from a given set of locations. These approaches differ in the order of obstacle retrieval, namely: query-centric distance-based, query-centric visible-region-based, and target-centric distance-based approaches. We have conducted an extensive experimental study on real 2D and 3D datasets to demonstrate the efficiency and effectiveness of our solutions.
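The visibility measure above — the visible angular size of the target — can be sketched in its simplest, obstacle-free 2D form. The function names are hypothetical and the target is modeled as a flat segment of a given width facing the viewer; the paper's actual measure additionally subtracts the angular extent occluded by obstacles, which this sketch omits.

```python
import math

def angular_size(viewpoint, target_center, target_width):
    """Angular size (in radians) of a flat target of the given width,
    seen face-on from viewpoint, ignoring obstacles."""
    dx = target_center[0] - viewpoint[0]
    dy = target_center[1] - viewpoint[1]
    dist = math.hypot(dx, dy)
    return 2.0 * math.atan(target_width / (2.0 * dist))

def k_max_visibility(candidates, target_center, target_width, k):
    """Obstacle-free kMV query: rank candidate locations by angular
    size of the target and return the top k."""
    return sorted(candidates,
                  key=lambda p: angular_size(p, target_center, target_width),
                  reverse=True)[:k]

# Three candidate viewpoints on a line; the target sits at (20, 0).
cands = [(0, 0), (5, 0), (10, 0)]
best = k_max_visibility(cands, (20, 0), 4.0, 1)
print(best)  # [(10, 0)]: the closest candidate sees the largest angle
```

With obstacles, the angular size is no longer a simple function of distance, which is why the paper's three approaches retrieve and subtract occluding obstacles incrementally instead of precomputing a global visibility graph.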
{"title":"Maximum visibility queries in spatial databases","authors":"Sarah Masud, F. Choudhury, Mohammed Eunus Ali, Sarana Nutanong","doi":"10.1109/ICDE.2013.6544862","DOIUrl":"https://doi.org/10.1109/ICDE.2013.6544862","url":null,"abstract":"Many real-world problems, such as placement of surveillance cameras and pricing of hotel rooms with a view, require the ability to determine the visibility of a given target object from different locations. Advances in large-scale 3D modeling (e.g., 3D virtual cities) provide us with data that can be used to solve these problems with high accuracy. In this paper, we investigate the problem of finding the location that provides the best view of a target object amid visual obstacles in 2D or 3D space, for example, finding the location that provides the best view of fireworks in a city with tall buildings. To solve this problem, we first define the quality measure of a view (i.e., visibility measure) as the visible angular size of the target object. Then, we propose a new query type called the k-Maximum Visibility (kMV) query, which finds the k locations from a given set that maximize the visibility of the target object. Our objective in this paper is to design a query solution capable of handling large-scale city models. This objective precludes approaches that rely on constructing a visibility graph of the entire data space. We therefore propose three approaches that incrementally consider relevant obstacles in order to determine the visibility of a target object from a given set of locations. These approaches differ in the order of obstacle retrieval, namely: query-centric distance-based, query-centric visible-region-based, and target-centric distance-based approaches. We have conducted an extensive experimental study on real 2D and 3D datasets to demonstrate the efficiency and effectiveness of our solutions.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124765553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}