We propose the notion of tight-coupling [K. Whang et al., 1999] for adding new data types into the DBMS engine. In this paper, we introduce the Odysseus ORDBMS and present its tightly-coupled IR features (US patented). We demonstrate a Web search engine capable of managing 20 million Web pages in a non-parallel configuration using Odysseus.
{"title":"Odysseus: a high-performance ORDBMS tightly-coupled with IR features","authors":"K. Whang, Min-Jae Lee, Jae-Gil Lee, Min-Soo Kim, Wook-Shin Han","doi":"10.1109/ICDE.2005.95","DOIUrl":"https://doi.org/10.1109/ICDE.2005.95","url":null,"abstract":"We propose the notion of tight-coupling [K. Whang et al., (1999)] to add new data types into the DBMS engine. In this paper, we introduce the Odysseus ORDBMS and present its tightly-coupled IR features (US patented). We demonstrate a Web search engine capable of managing 20 million Web pages in a non-parallel configuration using Odysseus.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121487947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A reverse nearest neighbor (RNN) query returns the data objects that have a query point as their nearest neighbor. Although such queries have been studied quite extensively in Euclidean spaces, there is no previous work in the context of large graphs. In this paper, we propose algorithms and optimization techniques for RNN queries that exploit characteristics of the underlying networks.
{"title":"Reverse nearest neighbors in large graphs","authors":"Man Lung Yiu, D. Papadias, N. Mamoulis, Yufei Tao","doi":"10.1109/ICDE.2005.124","DOIUrl":"https://doi.org/10.1109/ICDE.2005.124","url":null,"abstract":"A reverse nearest neighbor query returns the data objects that have a query point as their nearest neighbor. Although such queries have been studied quite extensively in Euclidean spaces, there is no previous work in the context of large graphs. In this paper, we propose algorithms and optimization techniques for RNN queries by utilizing some characteristics of networks.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122388789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a non-trivial dataset under a general model of the problem.
{"title":"Data privacy through optimal k-anonymization","authors":"R. Bayardo, R. Agrawal","doi":"10.1109/ICDE.2005.42","DOIUrl":"https://doi.org/10.1109/ICDE.2005.42","url":null,"abstract":"Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a non-trivial dataset under a general model of the problem.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122838802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes a novel filter-based replication model for lightweight directory access protocol (LDAP) directories. Instead of replicating entire subtrees of the directory information tree (DIT), only entries matching a filter specification are replicated. Advantages of the filter-based replication framework over existing subtree-based mechanisms have been demonstrated for a real enterprise directory using real workloads.
{"title":"Filter based directory replication and caching","authors":"Apurva Kumar","doi":"10.1109/ICDE.2005.67","DOIUrl":"https://doi.org/10.1109/ICDE.2005.67","url":null,"abstract":"This paper describes a novel filter based replication model for lightweight directory access protocol (LDAP) directories. Instead of replicating entire subtrees from the directory information tree (DIT), only entries matching a filter specification are replicated Advantages of the filter based replication framework over existing subtree based mechanisms have been demonstrated for a real enterprise directory using real workloads.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128075533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The query optimizer of a database system is confronted with two aspects when handling user-defined functions (UDFs) in query predicates: the vast differences in evaluation costs between UDFs (and other functions), and multiple calls of the same (expensive) UDF. The former is dealt with by ordering the evaluation of the predicates optimally, the latter by identifying common subexpressions and thereby avoiding costly recomputation. Current approaches order n predicates optimally (neglecting factorization) in O(n log n). Their result may deviate significantly from the optimal solution under factorization. We formalize the problem of finding optimal orderings under factorization and prove that it is NP-hard. Furthermore, we show how to improve on the run time of the brute-force algorithm (which computes all possible orderings) by presenting several enhanced algorithms. Although in the worst case these algorithms still behave exponentially, our experiments demonstrate that their performance on real-life examples is much better.
{"title":"On the optimal ordering of maps and selections under factorization","authors":"Thomas Neumann, S. Helmer, G. Moerkotte","doi":"10.1109/ICDE.2005.97","DOIUrl":"https://doi.org/10.1109/ICDE.2005.97","url":null,"abstract":"The query optimizer of a database system is confronted with two aspects when handling user-defined functions (UDFs) in query predicates: the vast differences in evaluation costs between UDFs (and other functions) and multiple calls of the same (expensive) UDF The former is dealt with by ordering the evaluation of the predicates optimally, the latter by identifying common subexpressions and thereby avoiding costly recomputation. Current approaches order n predicates optimally (neglecting factorization) in O(nlogn). Their result may deviate significantly from the optimal solution under factorization. We formalize the problem of finding optimal orderings under factorization and prove that it is NP-hard. Furthermore, we show how to improve on the run time of the brute-force algorithm (which computes all possible orderings) by presenting different enhanced algorithms. Although in the worst case these algorithms obviously still behave exponentially, our experiments demonstrate that for real-life examples their performance is much better.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117187879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a new data-placement method, named adaptive overlapped declustering, which can be applied to a parallel storage system using a value-range-partitioning-based distributed directory and primary-backup data replication, to improve space utilization by balancing access loads. The proposed method reduces the data skew generated by data migration for access-load balancing. While data-placement methods capable of balancing access load or reducing data skew have been proposed, none satisfies both requirements simultaneously. The proposed method also improves the reliability and availability of the system because it shortens the recovery time for damaged backups after a disk failure; it achieves this acceleration by reducing network communication and disk I/O. Mathematical analysis shows the efficiency of space utilization under skewed access workloads. Queuing simulations demonstrate that the proposed method halves backup restoration time compared with the traditional chained declustering method.
{"title":"Adaptive overlapped declustering: a highly available data-placement method balancing access load and space utilization","authors":"Akitsugu Watanabe, H. Yokota","doi":"10.1109/ICDE.2005.16","DOIUrl":"https://doi.org/10.1109/ICDE.2005.16","url":null,"abstract":"This paper proposes a new data-placement method named adaptive overlapped declustering, which can be applied to a parallel storage system using a value range partitioning-based distributed directory and primary-backup data replication, to improve the space utilization by balancing their access loads. The proposed method reduces data skews generated by data migration for balancing access load. While some data-placement methods capable of balancing access load or reducing data skew have been proposed, both requirements satisfied simultaneously. The proposed method also improves the reliability and availability of the system because it reduces recovery time for damaged backups after a disk failure. The method achieves this acceleration by reducing a large amount of network communications and disk I/O. Mathematical analysis shows the efficiency of space utilization under skewed access workloads. Queuing simulations demonstrated that the proposed method halves backup restoration time, compared with the traditional chained declustering method.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"261 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115012446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the ADEPT project we have been working on the design and implementation of next-generation process management software. Based on a conceptual framework for dynamic process changes, on novel process support functions, and on advanced implementation concepts, the developed system enables the realization of adaptive, process-aware information systems (PAIS). Process changes can take place at the type as well as the instance level: changes of single process instances may have to be carried out in an ad-hoc manner and must not affect system robustness and consistency. Process type changes, in turn, must be accomplished quickly in order to adapt the PAIS to changing business processes. ADEPT2 offers powerful concepts for modeling, analyzing, and verifying process schemas. In particular, it ensures schema correctness properties, such as the absence of deadlock-causing cycles and erroneous data flows. This, in turn, is an important prerequisite for dynamic process changes as well. ADEPT2 supports both ad-hoc changes of single process instances and the propagation of process type changes to running instances.
{"title":"Adaptive process management with ADEPT2","authors":"M. Reichert, S. Rinderle-Ma, U. Kreher, P. Dadam","doi":"10.1109/ICDE.2005.17","DOIUrl":"https://doi.org/10.1109/ICDE.2005.17","url":null,"abstract":"In the ADEPT project we have been working on the design and implementation of next generation process management software. Based on a conceptual framework for dynamic process changes, on novel process support functions, and on advanced implementation concepts, the developed system enables the realization of adaptive, process-aware information systems (PAIS). Basically, process changes can take place at the type as well as the instance level: changes of single process instances may have to be carried out in an ad-hoc manner and must not affect system robustness and consistency. Process type changes, in turn, must be quickly accomplished in order to adapt the PAIS to business process changes. ADEPT2 offers powerful concepts for modeling, analyzing, and verifying process schemes. Particularly, it ensures schema correctness, like the absence of deadlock-causing cycles or erroneous data flows. This, in turn, constitutes an important prerequisite for dynamic process changes as well. ADEPT2 supports both ad-hoc changes of single process instances and the propagation of process type changes to running instances.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115485948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Schema matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences or matches is inherently difficult to automate. Past solutions have proposed a principled combination of multiple algorithms. However, these solutions sometimes perform rather poorly due to the lack of sufficient evidence in the schemas being matched. In this paper we show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched, so they can be matched better. Such a corpus typically contains multiple schemas that model similar concepts and hence enables us to learn variations in the elements and their properties. We exploit such a corpus in two ways. First, we increase the evidence about each element being matched by including evidence from similar elements in the corpus. Second, we learn statistics about elements and their relationships and use them to infer constraints that we use to prune candidate mappings. We also describe how to use known mappings to learn the importance of domain and generic constraints. We present experimental results demonstrating that corpus-based matching outperforms direct matching (without the benefit of a corpus) in multiple domains.
{"title":"Corpus-based schema matching","authors":"J. Madhavan, P. Bernstein, A. Doan, A. Halevy","doi":"10.1109/ICDE.2005.39","DOIUrl":"https://doi.org/10.1109/ICDE.2005.39","url":null,"abstract":"Schema matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences or matches is inherently difficult to automate. Past solutions have proposed a principled combination of multiple algorithms. However, these solutions sometimes perform rather poorly due to the lack of sufficient evidence in the schemas being matched. In this paper we show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched, so they can be matched better. Such a corpus typically contains multiple schemas that model similar concepts and hence enables us to learn variations in the elements and their properties. We exploit such a corpus in two ways. First, we increase the evidence about each element being matched by including evidence from similar elements in the corpus. Second, we learn statistics about elements and their relationships and use them to infer constraints that we use to prune candidate mappings. We also describe how to use known mappings to learn the importance of domain and generic constraints. We present experimental results that demonstrate corpus-based matching outperforms direct matching (without the benefit of a corpus) in multiple domains.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123330488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The central goal of our research is to develop a general framework that can automatically analyze sports video, detect sports events, and ultimately offer an efficient, user-friendly system for sports video retrieval. In our earlier work, a novel multimedia data mining technique was proposed for automatic soccer event extraction using multimodal feature analysis. So far, this framework has been applied to the detection of goal and corner kick events, with impressive results. Building on it, in this work the detected video events are modeled and stored in the database. A temporal query model is designed to satisfy comprehensive temporal query requirements, and a corresponding graphical query language is developed. These characteristics make our model particularly well suited for searching events in a large-scale video database.
{"title":"An enhanced query model for soccer video retrieval using temporal relationships","authors":"Shu‐Ching Chen, M. Shyu, Na Zhao","doi":"10.1109/ICDE.2005.20","DOIUrl":"https://doi.org/10.1109/ICDE.2005.20","url":null,"abstract":"The focal goal of our research is to develop a general framework which can automatically analyze the sports video, detect the sports events, and finally offer an efficient and user-friendly system for sports video retrieval. In our earlier work, a novel multimedia data mining technique was proposed for automatic soccer event extraction by adopting multimodal feature analysis. Until now, this framework has been performed on the detection of goal and corner kick events and the results are quite impressive. Correspondingly, in this work, the detected video events are modeled and effectively stored in the database. A temporal query model is designed to satisfy the comprehensive temporal query requirements, and the corresponding graphical query language is developed. The advanced characteristics make our model particularly well suited for searching events in a large scale video database.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125959158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sensor networks and other distributed information systems (such as the Web) must frequently access data that has a high per-attribute acquisition cost, in terms of energy, latency, or computational resources. When executing queries that contain several predicates over such expensive attributes, we observe that it can be beneficial to use correlations to automatically introduce low-cost attributes whose observation will allow the query processor to better estimate the selectivity of these expensive predicates. In particular, we show how to build conditional plans that branch into one or more sub-plans, each with a different ordering for the expensive query predicates, based on the runtime observation of low-cost attributes. We frame the problem of constructing the optimal conditional plan for a given user query and set of candidate low-cost attributes as an optimization problem. We describe an exponential-time algorithm for finding such optimal plans, and a polynomial-time heuristic for identifying conditional plans that perform well in practice. We also show how to compactly model the conditional probability distributions needed to identify correlations and build these plans. We evaluate our algorithms against several real-world sensor-network data sets, showing several-fold performance increases for a variety of queries versus traditional optimization techniques.
{"title":"Exploiting correlated attributes in acquisitional query processing","authors":"A. Deshpande, Carlos Guestrin, W. Hong, S. Madden","doi":"10.1109/ICDE.2005.63","DOIUrl":"https://doi.org/10.1109/ICDE.2005.63","url":null,"abstract":"Sensor networks and other distributed information systems (such as the Web) must frequently access data that has a high per-attribute acquisition cost, in terms of energy, latency, or computational resources. When executing queries that contain several predicates over such expensive attributes, we observe that it can be beneficial to use correlations to automatically introduce low-cost attributes whose observation will allow the query processor to better estimate die selectivity of these expensive predicates. In particular, we show how to build conditional plans that branch into one or more sub-plans, each with a different ordering for the expensive query predicates, based on the runtime observation of low-cost attributes. We frame the problem of constructing the optimal conditional plan for a given user query and set of candidate low-cost attributes as an optimization problem. We describe an exponential time algorithm for finding such optimal plans, and describe a polynomial-time heuristic for identifying conditional plans that perform well in practice. We also show how to compactly model conditional probability distributions needed to identify correlations and build these plans. We evaluate our algorithms against several real-world sensor-network data sets, showing several-times performance increases for a variety of queries versus traditional optimization techniques.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129358476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}