In peer-to-peer networks, indices map data IDs to the nodes that host the data. Data access performance can be improved by actively pushing indices to interested nodes. This paper proposes the Dynamic-tree based Update Propagation (DUP) scheme, which builds an update propagation tree to facilitate the propagation of indices. Because the update propagation tree involves only the nodes that are essential for update propagation, the overhead of DUP is very small and query latency is significantly reduced.
{"title":"DUP: Dynamic-Tree Based Update Propagation in Peer-to-Peer Networks","authors":"Liangzhong Yin, G. Cao","doi":"10.1109/ICDE.2005.52","DOIUrl":"https://doi.org/10.1109/ICDE.2005.52","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
We present "Pipe 'n Prune" (PnP), a new hybrid method for iceberg-cube query computation. The novelty of our method is that it achieves a tight integration of top-down piping for data aggregation with bottom-up a priori data pruning. A particular strength of PnP is that it is very efficient in all of the following scenarios: (1) sequential iceberg-cube queries, (2) external-memory iceberg-cube queries, and (3) parallel iceberg-cube queries on shared-nothing PC clusters with multiple disks.
{"title":"PnP: parallel and external memory iceberg cube computation","authors":"Ying Chen, F. Dehne, Todd Eavis, A. Rau-Chaplin","doi":"10.1109/ICDE.2005.107","DOIUrl":"https://doi.org/10.1109/ICDE.2005.107","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
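As a rough illustration of what an iceberg-cube query returns, the sketch below computes every group-by cell whose count meets a minimum support, and extends a cell only if it already meets that threshold (the bottom-up pruning idea). It is not the PnP algorithm and has no piping, external-memory, or parallel component; the function name `iceberg_cube`, the row layout, and the `min_support` parameter are assumptions made for this example.

```python
from collections import Counter

def iceberg_cube(rows, num_dims, min_support):
    """Return {cell: count} for every cell with count >= min_support.

    A cell is a tuple of (dimension index, value) pairs ordered by dimension.
    """
    result = {}

    def expand(cell, matching, next_dim):
        # Extend the current cell by one further dimension at a time.
        for d in range(next_dim, num_dims):
            counts = Counter(r[d] for r in matching)
            for value, count in counts.items():
                if count < min_support:
                    continue  # prune: no refinement of this cell can qualify
                new_cell = cell + ((d, value),)
                result[new_cell] = count
                expand(new_cell, [r for r in matching if r[d] == value], d + 1)

    if len(rows) >= min_support:
        result[()] = len(rows)  # the "all" cell
        expand((), rows, 0)
    return result

rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")]
cube = iceberg_cube(rows, num_dims=2, min_support=2)
```

With these four rows and a support of 2, the cells for value "b" and value "y" are discarded as soon as they are counted, so none of their refinements is ever examined.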
This paper introduces a practical index for approximate similarity queries of large multi-dimensional data sets: the spatial approximation sample hierarchy (SASH). A SASH is a multi-level structure of random samples, recursively constructed by building a SASH on a large randomly selected sample of data objects, and then connecting each remaining object to several of its approximate nearest neighbors from within the sample. Queries are processed by first locating approximate neighbors within the sample, and then using the pre-established connections to discover neighbors within the remainder of the data set. The SASH index relies on a pairwise distance measure, but otherwise makes no assumptions regarding the representation of the data. Experimental results are provided for query-by-example operations on protein sequence, image, and text data sets, including one consisting of more than 1 million vectors spanning more than 1.1 million terms, far in excess of what spatial search indices can handle efficiently. For sets of this size, the SASH can return a large proportion of the true neighbors roughly 2 orders of magnitude faster than sequential search.
{"title":"Fast approximate similarity search in extremely high-dimensional data sets","authors":"M. Houle, J. Sakuma","doi":"10.1109/ICDE.2005.66","DOIUrl":"https://doi.org/10.1109/ICDE.2005.66","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
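The sample-then-link idea in the SASH abstract can be sketched with a single level of sampling. This is a deliberately simplified two-level version, not the recursive multi-level SASH: the sample is searched exhaustively, and the names `build_two_level_index`, `approx_knn`, `links`, and `fanout` are assumptions introduced for the example. Only a pairwise distance is required, here Euclidean distance via `math.dist`.

```python
import math
import random

def build_two_level_index(points, sample_size, links, seed=0):
    """Pick a random sample and link every other point to its nearest sample points."""
    rng = random.Random(seed)
    sample_ids = set(rng.sample(range(len(points)), sample_size))
    children = {s: [] for s in sample_ids}
    for i, p in enumerate(points):
        if i in sample_ids:
            continue
        nearest = sorted(sample_ids, key=lambda s: math.dist(p, points[s]))[:links]
        for s in nearest:
            children[s].append(i)
    return children

def approx_knn(points, children, query, k, fanout):
    """Search the sample first, then follow stored links into the remaining points."""
    def by_dist(i):
        return math.dist(query, points[i])

    top_samples = sorted(children, key=by_dist)[:fanout]
    candidates = set(top_samples)
    for s in top_samples:
        candidates.update(children[s])
    return sorted(candidates, key=by_dist)[:k]

points = [(float(x), float(y)) for x in range(10) for y in range(10)]
children = build_two_level_index(points, sample_size=20, links=2)
neighbors = approx_knn(points, children, (3.2, 4.1), k=5, fanout=3)
```

The query touches only the sample and the points linked to the closest sample members, so its answer is approximate: a true neighbor that is linked only to a sample point outside the top `fanout` is missed.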
Qingchun Jiang, R. Adaikkalavan, Sharma Chakravarthy
Network fault management has long been an active research area because of its complexity and the returns it generates for service providers. However, most fault management systems are currently custom-developed for a particular domain. As communication service providers continuously add greater capabilities and sophistication to their systems to meet the demands of a growing user population, these systems have to manage a multi-layered network along with its built-in legacy logical processing procedures. Stream processing has received a lot of attention as a way to handle applications, such as network fault management, that generate large amounts of data in real time at varying input rates and compute functions over multiple streams. In this paper, we propose an integrated inter-domain network fault management system for such a multi-layered network based on data stream and event processing techniques. We discuss the various components of our system and how data stream processing techniques are used to build a flexible system for a sophisticated real-world application. In the course of discussing the proposed system, we further identify a number of important issues related to data stream processing, which will extend the boundaries of data stream processing.
{"title":"NFM^i: an inter-domain network fault management system","authors":"Qingchun Jiang, R. Adaikkalavan, Sharma Chakravarthy","doi":"10.1109/ICDE.2005.94","DOIUrl":"https://doi.org/10.1109/ICDE.2005.94","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
Order-based element labeling for tree-structured XML data is an important technique in XML processing. It lies at the core of many fundamental XML operations such as containment join and twig matching. While labeling for static XML documents is well understood, less is known about how to maintain accurate labeling for dynamic XML documents, in which elements and subtrees are inserted and deleted. Most existing approaches do not work well for arbitrary update patterns; they either produce unacceptably long labels or incur enormous relabeling costs. We present two novel I/O-efficient data structures, W-BOX and B-BOX, that efficiently maintain labeling for large, dynamic XML documents. We show analytically and experimentally that both structures, despite consuming minimal amounts of storage, gracefully handle arbitrary update patterns without sacrificing lookup efficiency. Together, the two structures provide a tradeoff between update and lookup costs: W-BOX has logarithmic amortized update cost and constant worst-case lookup cost, while B-BOX has constant amortized update cost and logarithmic worst-case lookup cost. We further propose techniques to eliminate the lookup cost for read-heavy workloads.
{"title":"BOXes: efficient maintenance of order-based labeling for dynamic XML data","authors":"Adam Silberstein, Hao He, K. Yi, Jun Yang","doi":"10.1109/ICDE.2005.29","DOIUrl":"https://doi.org/10.1109/ICDE.2005.29","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
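To make the term "order-based labeling" concrete, the sketch below shows the familiar static scheme: each element receives a (start, end) pair from a depth-first walk, and a containment test reduces to an interval comparison. This is background illustration only, not W-BOX or B-BOX; the dict-of-children tree encoding and the names `label_tree` and `is_ancestor` are assumptions of the example.

```python
def label_tree(tree, root):
    """Assign (start, end) labels by a depth-first walk over a dict-of-children tree."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        labels[node] = (start, counter)
        counter += 1

    visit(root)
    return labels

def is_ancestor(labels, a, d):
    """a is a proper ancestor of d iff a's interval strictly contains d's."""
    a_start, a_end = labels[a]
    d_start, d_end = labels[d]
    return a_start < d_start and d_end < a_end

doc = {"book": ["title", "chapter"], "chapter": ["section"]}
labels = label_tree(doc, "book")
```

Because the labels here are consecutive integers, inserting a new element between two existing ones leaves no free value and forces relabeling. That is the maintenance problem the abstract refers to for dynamic documents.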
SNAP is a novel high-performance snapshot system for object storage systems. The goal is to provide a snapshot service that is efficient enough to permit "back-in-time" read-only activities to run against application-specified snapshots. Such activities are often impossible to run against a rapidly evolving current state, either because of interference or because the required activity is determined in retrospect. A key innovation in SNAP is that it provides snapshots that are transactionally consistent, yet non-disruptive. Unlike earlier systems, we use novel in-memory data structures to ensure that frequent snapshots do not block applications from accessing the storage system and do not cause unnecessary disk operations. SNAP takes a novel approach to snapshot meta-data, using a new technique that supports both incremental meta-data creation and efficient meta-data reconstruction. We have implemented a SNAP prototype and analyzed its performance. Preliminary results show that providing snapshots for back-in-time activities has low impact on system performance even when snapshots are frequent.
{"title":"SNAP: efficient snapshots for back-in-time execution","authors":"L. Shrira, Hao Xu","doi":"10.1109/ICDE.2005.133","DOIUrl":"https://doi.org/10.1109/ICDE.2005.133","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
Extraction-transformation-loading (ETL) tools are pieces of software responsible for extracting data from several sources, then cleansing, customizing, and inserting it into a data warehouse. Usually, these processes must be completed within a certain time window; thus, it is necessary to optimize their execution time. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and construct the state space through a set of correct state transitions. Moreover, we provide algorithms for minimizing the execution cost of an ETL workflow.
{"title":"Optimizing ETL processes in data warehouses","authors":"A. Simitsis, Panos Vassiliadis, T. Sellis","doi":"10.1109/ICDE.2005.103","DOIUrl":"https://doi.org/10.1109/ICDE.2005.103","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
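The state-space view of ETL optimization can be illustrated with a toy model in which a state is an ordering of activities and the only transition is swapping two adjacent activities. Several things here are assumptions made for the example rather than details taken from the paper: the linear workflow, the claim that all three activities commute, the cost model (per-row cost times rows processed, with each activity passing on a fixed fraction of its rows), and the names `workflow_cost` and `best_workflow`.

```python
from collections import deque

def workflow_cost(order, activities, input_rows):
    """Sum of rows each activity must process, weighted by its per-row cost."""
    rows, total = input_rows, 0.0
    for name in order:
        per_row_cost, selectivity = activities[name]
        total += rows * per_row_cost
        rows *= selectivity
    return total

def best_workflow(start, activities, input_rows):
    """Breadth-first search over all states reachable by swapping adjacent activities."""
    seen = {start}
    queue = deque([start])
    best = start
    best_cost = workflow_cost(start, activities, input_rows)
    while queue:
        state = queue.popleft()
        cost = workflow_cost(state, activities, input_rows)
        if cost < best_cost:
            best, best_cost = state, cost
        for i in range(len(state) - 1):
            swapped = state[:i] + (state[i + 1], state[i]) + state[i + 2:]
            if swapped not in seen:
                seen.add(swapped)
                queue.append(swapped)
    return best

# Hypothetical activities: name -> (per-row cost, fraction of rows passed on).
activities = {
    "convert": (2.0, 1.0),
    "filter_null": (1.0, 0.5),
    "filter_date": (1.0, 0.1),
}
start = ("convert", "filter_null", "filter_date")
best = best_workflow(start, activities, input_rows=1000)
```

Exhaustive search is feasible only for tiny workflows, since the number of orderings grows factorially; that growth is why heuristics for exploring the state space matter.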
Web services, and more generally service-oriented architectures (SOAs), are emerging as the technologies and architectures of choice for implementing distributed systems and performing application integration within and across company boundaries. In this article we describe Web services from an evolutionary perspective, with an emphasis on their use for enterprise application integration and service-oriented architectures. The article also covers basic middleware problems and shows how the solutions to these problems have evolved into what we today call Web services.
{"title":"Web services and service-oriented architectures","authors":"G. Alonso, F. Casati","doi":"10.1109/ICDE.2005.154","DOIUrl":"https://doi.org/10.1109/ICDE.2005.154","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
We present the first complete translation of XPath into an algebra, paving the way for a comprehensive, state-of-the-art XPath (and later on, XQuery) compiler based on algebraic optimization techniques. Our translation includes all XPath features, such as nested expressions, position-based predicates, and node-set functions. The translated algebraic expressions can be executed using the proven, scalable, iterator-based approach, as we demonstrate in the form of a corresponding physical algebra in our native XML DBMS Natix. A first look at performance results shows that, even without further optimization of the expressions, we provide a competitive evaluation technique for XPath queries.
{"title":"Full-fledged algebraic XPath processing in Natix","authors":"M. Brantner, S. Helmer, C. Kanne, G. Moerkotte","doi":"10.1109/ICDE.2005.69","DOIUrl":"https://doi.org/10.1109/ICDE.2005.69","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
Efficient one-pass computation of $F_0$, the number of distinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider the problem of efficiently estimating $F_0$ of a data stream where each element of the stream is an interval of integers. We present a randomized algorithm which gives an $(\epsilon, \delta)$-approximation of $F_0$, with the following time complexity ($n$ is the size of the universe of the items): (1) the amortized processing time per interval is $O(\log\frac{1}{\delta} \log\frac{n}{\epsilon})$; (2) the time to answer a query for $F_0$ is $O(\log\frac{1}{\delta})$. The workspace used is $O(\frac{1}{\epsilon^2} \log\frac{1}{\delta} \log n)$ bits. Our algorithm improves upon a previous algorithm by Bar-Yossef, Kumar, and Sivakumar (2002), which requires $O(\frac{1}{\epsilon^5} \log\frac{1}{\delta} \log^5 n)$ processing time per item. Our algorithm can be used to compute the max-dominance norm of a stream of multiple signals, and significantly improves upon the current best bounds due to Cormode and Muthukrishnan (2003). It also provides efficient and novel solutions for data aggregation problems in sensor networks studied by Nath and Gibbons (2004) and Considine et al. (2004).
{"title":"Range-efficient computation of F_0 over massive data streams","authors":"A. Pavan, S. Tirthapura","doi":"10.1109/ICDE.2005.118","DOIUrl":"https://doi.org/10.1109/ICDE.2005.118","journal":{"name":"21st International Conference on Data Engineering (ICDE'05)"},"publicationDate":"2005-04-05"}
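For orientation, the following sketch estimates the number of distinct integers covered by a stream of intervals using a k-minimum-values sketch, a standard distinct-count estimator swapped in here for illustration. It is a naive baseline and not the range-efficient algorithm of the paper: it expands every interval element by element, so its per-interval time grows with the interval length, which is the cost a range-efficient method avoids. The names `hash01` and `estimate_f0` and the use of SHA-256 as the hash are assumptions of the example.

```python
import hashlib

def hash01(x):
    """Map an integer to a pseudo-random float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_f0(intervals, k):
    """Estimate the number of distinct integers covered by closed intervals [lo, hi].

    Keeps only the k smallest distinct hash values seen so far.
    """
    smallest = set()
    for lo, hi in intervals:
        for x in range(lo, hi + 1):  # naive expansion of the interval
            smallest.add(hash01(x))
            if len(smallest) > k:
                smallest.remove(max(smallest))
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct items: the count is exact
    # The k-th smallest of F0 uniform hashes sits near k / F0.
    return (k - 1) / max(smallest)

small = estimate_f0([(1, 5), (3, 8)], k=64)
large = estimate_f0([(1, 10000), (5000, 12000)], k=256)
```

Overlapping intervals are handled correctly because a repeated integer hashes to the same value and is stored only once. Larger `k` trades memory for a tighter estimate.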