Advanced process-based component integration in Telcordia's Cable OSS
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994762
A. Ngu, Dimitrios Georgakopoulos, D. Baker, A. Cichocki, J. Desmarais, Peter Bates
Operation support systems (OSSs) integrate software components and network elements to automate the provisioning and monitoring of telecommunications services. This paper illustrates Telcordia's Cable OSS and shows how customers may use this OSS to provision IP and telephone services over the cable infrastructure. Telcordia's Cable OSS is a process-based application, i.e. a collection of flows, specialized components (e.g. a billing system, a call agent soft switch, network services and elements, cable modems, etc.) and corresponding adaptors that are integrated, coordinated and monitored using CMI (Collaboration Management Infrastructure), Telcordia's advanced process-based integration technology. Customers interact with the Cable OSS by using Web or IVR (interactive voice response) interfaces.
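To make the flow-based integration idea concrete, here is a minimal Python sketch of a provisioning flow coordinating component adaptors. All class, function, and step names (ComponentAdaptor, provision_ip_and_phone, the step list) are hypothetical illustrations, not the actual CMI or Telcordia APIs.

```python
# Hypothetical sketch of a process-based provisioning flow: a flow invokes
# component adaptors in order and monitors each activity's outcome.

class ComponentAdaptor:
    """Wraps a specialized component (billing, call agent, cable modem, ...)."""
    def __init__(self, name):
        self.name = name

    def invoke(self, operation, order):
        # A real adaptor would translate the flow-level request into the
        # component's native protocol; here we just record the step.
        order.setdefault("log", []).append(f"{self.name}:{operation}")
        return True

def provision_ip_and_phone(order, adaptors):
    """A flow: coordinate the adaptors and record success or failure."""
    steps = [
        ("billing", "create_account"),
        ("cable_modem", "activate_modem"),
        ("call_agent", "assign_phone_number"),
        ("network", "enable_ip_service"),
    ]
    for component, operation in steps:
        if not adaptors[component].invoke(operation, order):
            order["status"] = f"failed at {component}"
            return order
    order["status"] = "provisioned"
    return order

adaptors = {n: ComponentAdaptor(n) for n in ("billing", "cable_modem", "call_agent", "network")}
print(provision_ip_and_phone({"customer": "c-42"}, adaptors))
```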
{"title":"Advanced process-based component integration in Telcordia's Cable OSS","authors":"A. Ngu, Dimitrios Georgakopoulos, D. Baker, A. Cichocki, J. Desmarais, Peter Bates","doi":"10.1109/ICDE.2002.994762","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994762","url":null,"abstract":"Operation support systems (OSSs) integrate software components and network elements to automate the provisioning and monitoring of telecommunications services. This paper illustrates Telcordia's Cable OSS and shows how customers may use this OSS to provision IP and telephone services over the cable infrastructure. Telcordia's Cable OSS is a process-based application, i.e. a collection of flows, specialized components (e.g. a billing system, a call agent soft switch, network services and elements, cable modems, etc.) and corresponding adaptors that are integrated, coordinated and monitored using CMI (Collaboration Management Infrastructure), Telcordia's advanced process-based integration technology. Customers interact with the Cable OSS by using Web or IVR (interactive voice response) interfaces.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121409064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Runtime data declustering over SAN-connected PC cluster system
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994729
M. Oguchi, M. Kitsuregawa
Personal computer/workstation (PC/WS) clusters have come to be studied intensively in the field of parallel and distributed computing. From the viewpoint of applications, data-intensive applications, including data mining and ad hoc query processing in databases, are considered very important for massively parallel processors, in addition to conventional scientific computation. Thus, investigating the feasibility of such applications on a PC cluster is worthwhile. A PC cluster connected with a storage area network (SAN) is built and evaluated with a data mining application. In a SAN-connected cluster, each node can access all shared disks directly without using a LAN; thus, SAN-connected clusters achieve much better performance than LAN-connected clusters for disk-to-disk copy operations. However, if many nodes access the same shared disk simultaneously, application performance degrades due to the resulting I/O bottleneck. A runtime data declustering method, in which data is declustered to several other disks dynamically during the execution of the application, is proposed to resolve this problem.
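A minimal Python sketch of the runtime-declustering idea: partitions on an overloaded shared disk are redistributed to other disks while the application runs. The threshold, data structures, and function name decluster_if_hot are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def decluster_if_hot(disk_partitions, access_counts, hot_threshold=100):
    """Move partitions off any disk whose access count exceeds the threshold."""
    disks = list(disk_partitions)
    for disk in disks:
        if access_counts[disk] <= hot_threshold:
            continue
        targets = sorted(d for d in disks if d != disk)
        if not targets:
            continue
        # Spread this disk's partitions round-robin over the other disks,
        # keeping one partition in place.
        moved = disk_partitions[disk][1:]
        disk_partitions[disk] = disk_partitions[disk][:1]
        for i, part in enumerate(moved):
            disk_partitions[targets[i % len(targets)]].append(part)
    return disk_partitions

partitions = {"disk0": ["p0", "p1", "p2", "p3"], "disk1": [], "disk2": []}
counts = defaultdict(int, {"disk0": 500})          # disk0 is the I/O bottleneck
print(decluster_if_hot(partitions, counts))
```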
{"title":"Runtime data declustering over SAN-connected PC cluster system","authors":"M. Oguchi, M. Kitsuregawa","doi":"10.1109/ICDE.2002.994729","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994729","url":null,"abstract":"Personal computer/workstation (PC/WS) clusters have come to be studied intensively in the field of parallel and distributed computing. From the viewpoint of applications, data intensive applications including data mining and ad-hoc query processing in databases are considered very important for massively parallel processors, in addition to the conventional scientific calculation. Thus, investigating the feasibility of such applications on a PC cluster is meaningful. A PC cluster connected with a storage area network (SAN) is built and evaluated with a data mining application. In the case of a SAN-connected cluster, each node can access all shared disks directly without using a LAN; thus, SAN-connected clusters achieve much better performance than LAN-connected clusters for disk-to-disk copy operations. However, if a lot of nodes access the same shared disk simultaneously, application performance degrades due to the I/O-bottleneck. A runtime data declustering method, in which data is declustered to several other disks dynamically during the execution of the application, is proposed to resolve this problem.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131201562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data cleaning and XML: the DBLP experience
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994723
Wai Lup Low, W. Tok, M. Lee, T. Ling
With the increasing popularity of data-centric XML, data warehousing and mining applications are being developed for rapidly burgeoning XML data repositories. Data quality will no doubt be a critical factor for the success of such applications. Data cleaning, which refers to the processes used to improve data quality, has been well researched in the context of traditional databases. In earlier work, we developed a knowledge-based framework for cleaning relational databases. In this work, we present a novel attempt to apply this framework to XML databases. Our experimental dataset is the DBLP database, a popular online XML bibliography database used by many researchers.
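To illustrate the kind of cleaning rule such a framework might apply to DBLP-style XML, here is a small sketch of rule-based duplicate detection. The rule (match on normalized title and year) and the sample data are assumptions for illustration, not the paper's actual rule set.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<dblp>
  <article><title>Data Cleaning and XML</title><year>2002</year></article>
  <article><title>Data  cleaning and XML.</title><year>2002</year></article>
  <article><title>Record Linkage</title><year>2001</year></article>
</dblp>"""

def normalize(title):
    """Lowercase, strip punctuation, and collapse whitespace."""
    t = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", t).strip()

def find_duplicates(xml_text):
    seen, dups = {}, []
    for art in ET.fromstring(xml_text).iter("article"):
        key = (normalize(art.findtext("title", "")), art.findtext("year"))
        if key in seen:
            dups.append(key)
        else:
            seen[key] = art
    return dups

print(find_duplicates(SAMPLE))   # the first two articles collide under the rule
```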
{"title":"Data cleaning and XML: the DBLP experience","authors":"Wai Lup Low, W. Tok, M. Lee, T. Ling","doi":"10.1109/ICDE.2002.994723","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994723","url":null,"abstract":"With the increasing popularity of data-centric XML, data warehousing and mining applications are being developed for rapidly burgeoning XML data repositories. Data quality will no doubt be a critical factor for the success of such applications. Data cleaning, which refers to the processes used to improve data quality, has been well researched in the context of traditional databases. In earlier work we developed a knowledge-based framework for data cleaning relational databases. In this work, we present a novel attempt to apply this framework to XML databases. Our experimental dataset is the DBLP database, a popular online XML bibliography database used by many researchers.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133587561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TAILOR: a record linkage toolbox
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994694
Mohamed G. Elfeky, A. Elmagarmid, Vassilios S. Verykios
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
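A minimal sketch of machine-learning-style record linkage: each record pair is turned into a comparison vector, and a simple learned rule classifies the pair as a match or a non-match. The features, weights, and the toy linear "classifier" are illustrative assumptions, not the actual TAILOR models.

```python
from difflib import SequenceMatcher

def comparison_vector(r1, r2):
    """Field-by-field similarities between two records."""
    name_sim = SequenceMatcher(None, r1["name"], r2["name"]).ratio()
    zip_eq = 1.0 if r1["zip"] == r2["zip"] else 0.0
    year_eq = 1.0 if r1["birth_year"] == r2["birth_year"] else 0.0
    return [name_sim, zip_eq, year_eq]

def classify(vec, weights=(0.6, 0.2, 0.2), threshold=0.75):
    """A linear scoring rule standing in for a trained classifier."""
    score = sum(w * f for w, f in zip(weights, vec))
    return "match" if score >= threshold else "non-match"

a = {"name": "Jon Smith",  "zip": "47907", "birth_year": 1970}
b = {"name": "John Smith", "zip": "47907", "birth_year": 1970}
print(classify(comparison_vector(a, b)))   # -> match
```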
{"title":"TAILOR: a record linkage toolbox","authors":"Mohamed G. Elfeky, A. Elmagarmid, Vassilios S. Verykios","doi":"10.1109/ICDE.2002.994694","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994694","url":null,"abstract":"Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for \"RecOrd LInkAge Toolbox\"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134426686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Indexing of moving objects for location-based services
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994759
Simonas Šaltenis, Christian S. Jensen
Visionaries predict that the Internet will soon extend to billions of wireless devices, or objects, a substantial fraction of which will offer their changing positions to location-based services. This paper assumes an Internet-service scenario where objects that have not reported their position within a specified duration of time are expected to no longer be interested in, or of interest to, the service. Due to the possibility of many "expiring" objects, a highly dynamic database results. The paper presents an R-tree based technique for the indexing of the current positions of such objects. Different types of bounding regions are studied, and new algorithms are provided for maintaining the tree structure. Performance experiments indicate that, when compared to the approach where the objects are not assumed to expire, the new indexing technique can improve search performance by a factor of two or more without sacrificing update performance.
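A toy sketch of the expiration idea: objects that have not reported within a maximum duration are ignored and pruned by queries. A flat dictionary stands in for the paper's R-tree variant; the class name, parameters, and pruning-on-query policy are assumptions for illustration only.

```python
import time

class ExpiringPositionIndex:
    def __init__(self, max_silence=60.0):
        self.max_silence = max_silence       # seconds of silence before an object expires
        self.objects = {}                    # id -> (x, y, last_report_time)

    def report(self, obj_id, x, y, now=None):
        self.objects[obj_id] = (x, y, now if now is not None else time.time())

    def range_query(self, x1, y1, x2, y2, now=None):
        now = now if now is not None else time.time()
        hits = []
        for obj_id, (x, y, t) in list(self.objects.items()):
            if now - t > self.max_silence:
                del self.objects[obj_id]     # expired: drop from the index
            elif x1 <= x <= x2 and y1 <= y <= y2:
                hits.append(obj_id)
        return hits

idx = ExpiringPositionIndex(max_silence=60.0)
idx.report("car-1", 10, 10, now=0.0)
idx.report("car-2", 20, 20, now=100.0)
print(idx.range_query(0, 0, 50, 50, now=110.0))  # car-1 has expired -> ['car-2']
```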
{"title":"Indexing of moving objects for location-based services","authors":"Simonas Šaltenis, Christian S. Jensen","doi":"10.1109/ICDE.2002.994759","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994759","url":null,"abstract":"Visionaries predict that the Internet will soon extend to billions of wireless devices, or objects, a substantial fraction of which will offer their changing positions to location-based services. This paper assumes an Internet-service scenario where objects that have not reported their position within a specified duration of time are expected to no longer be interested in, or of interest to, the service. Due to the possibility of many \"expiring\" objects, a highly dynamic database results. The paper presents an R-tree based technique for the indexing of the current positions of such objects. Different types of bounding regions are studied, and new algorithms are provided for maintaining the tree structure. Performance experiments indicate that, when compared to the approach where the objects are not assumed to expire, the new indexing technique can improve search performance by a factor of two or more without sacrificing update performance.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115610171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Smodels (declarative logic programming) to verify correctness of certain active rules
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994724
Mutsumi Nakamura, R. Elmasri
In this paper we show that the language of declarative logic programming (DLP) with answer sets and its extensions can be used to specify database evolution due to updates and active rules, and to verify the correctness of active rules with respect to a specification described using temporal logic and aggregate operators. We classify the specification of active rules into four kinds of constraints which can be expressed using a particular extension of DLP called Smodels. Smodels allows us to specify the evolution, to specify the constraints, and to enumerate all possible initial database states and initial updates. Together, these can be used to analyze all possible evolution paths of an active database system to verify whether they satisfy a set of given constraints.
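The paper encodes this in Smodels; purely as an illustration of the verification idea (enumerate every initial state and update, simulate the active rule, check a constraint on each evolution path), here is a Python sketch with a made-up rule and constraint.

```python
from itertools import product

# Toy database state: (stock, on_order). Active rule: when an update drops
# stock below 2 and no order is pending, the rule fires and places an order.
def apply_update(state, delta):
    stock, on_order = state
    stock += delta
    if stock < 2 and not on_order:       # active rule fires
        on_order = True
    return (stock, on_order)

def constraint_holds(final_state):
    """Property to verify: stock never stays below 2 without an order pending."""
    stock, on_order = final_state
    return stock >= 2 or on_order

violations = []
for stock0, delta in product(range(0, 5), [-2, -1, 0, 1]):
    final = apply_update((stock0, False), delta)
    if not constraint_holds(final):
        violations.append((stock0, delta))

print("violations:", violations)   # empty: the rule satisfies the constraint
```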
{"title":"Using Smodels (declarative logic programming) to verify correctness of certain active rules","authors":"Mutsumi Nakamura, R. Elmasri","doi":"10.1109/ICDE.2002.994724","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994724","url":null,"abstract":"In this paper we show that the language of declarative logic programming (DLP) with answer sets and its extensions can be used to specify database evolution due to updates and active rules, and to verify correctness of active rules with respect to a specification described using temporal logic and aggregate operators. We classify the specification of active rules into four kind of constraints which can be expressed using a particular extension of DLP called Smodels. Smodels allows us to specify the evolution, to specify the constraints, and to enumerate all possible initial database states and initial updates. Together, these can be used to analyze all possible evolution paths of an active database system to verify if they satisfy a set of given constraints.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121217971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An authorization system for temporal data
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994747
A. Gal, V. Atluri, Gang Xu
We present a system, called the Temporal Data Authorization Model (TDAM), for managing authorizations for temporal data. TDAM is capable of expressing access control policies based on the temporal characteristics of data. TDAM extends existing authorization models to allow the specification of temporal constraints on data, based on data validity, data capture time, and replication time, using either absolute or relative time references. The ability to specify access control based on such temporal aspects was not supported previously. The formulae are evaluated with respect to various temporal assignments to ensure the correctness of access control.
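A minimal sketch of temporal authorization in the spirit of the model above: access is granted only if the data's temporal attributes satisfy the policy's absolute or relative time constraints. The field names, policy form, and the function authorized are illustrative assumptions, not TDAM's actual syntax.

```python
from datetime import datetime, timedelta

def authorized(policy, data_item, now=None):
    now = now or datetime.utcnow()
    # Absolute constraint: the data's validity interval must cover the policy window.
    if "valid_during" in policy:
        start, end = policy["valid_during"]
        if not (data_item["valid_from"] <= start and end <= data_item["valid_to"]):
            return False
    # Relative constraint: the data must have been captured recently enough.
    if "captured_within" in policy:
        if now - data_item["captured_at"] > policy["captured_within"]:
            return False
    return True

item = {
    "valid_from": datetime(2002, 1, 1), "valid_to": datetime(2002, 12, 31),
    "captured_at": datetime(2002, 8, 1),
}
policy = {"valid_during": (datetime(2002, 6, 1), datetime(2002, 6, 30)),
          "captured_within": timedelta(days=30)}
print(authorized(policy, item, now=datetime(2002, 8, 7)))   # -> True
```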
{"title":"An authorization system for temporal data","authors":"A. Gal, V. Atluri, Gang Xu","doi":"10.1109/ICDE.2002.994747","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994747","url":null,"abstract":"We present a system, called the Temporal Data Authorization Model (TDAM), for managing authorizations for temporal data. TDAM is capable of expressing access control policies based on the temporal characteristics of data. TDAM extends existing authorization models to allow the specifications of temporal constraints on data, based on data validity, data capture time, and replication time, using either absolute or relative time references. The ability to specify access control based on such temporal aspects were not supported before. The formulae are evaluated with respect to various temporal assignments to ensure the correctness of access control.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"337 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122757973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reverse engineering for Web data: from visual to semantic structures
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994697
C. Chung, Michael Gertz, Neel Sundaresan
Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, constituting a huge amount of legacy data. In order to query Web-based data more efficiently and effectively than with keyword-based retrieval alone, it is necessary to enrich such Web documents with both structure and semantics. We describe a novel approach to the integration of topic-specific HTML documents into a repository of XML documents. In particular, we describe how topic-specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimum information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery. Finally, we demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.
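A small sketch of the visual-to-semantic conversion: HTML section headings are matched against topic concepts and re-tagged as semantic XML elements. The concept table, the tiny regex-based parser, and the resume example are assumptions for illustration, not the paper's actual restructuring rules.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical topic concepts mapping visual headings to semantic element names.
CONCEPTS = {"education": "education", "work experience": "experience",
            "skills": "skills"}

def html_resume_to_xml(html):
    xml = ["<resume>"]
    # Pair each <h2> heading with the paragraph that follows it.
    for heading, body in re.findall(r"<h2>(.*?)</h2>\s*<p>(.*?)</p>", html, re.S):
        tag = CONCEPTS.get(heading.strip().lower(), "section")
        xml.append(f"  <{tag}>{escape(body.strip())}</{tag}>")
    xml.append("</resume>")
    return "\n".join(xml)

html = """<h2>Education</h2><p>B.Sc. Computer Science, 1998</p>
<h2>Skills</h2><p>XML, Java, databases</p>"""
print(html_resume_to_xml(html))
```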
{"title":"Reverse engineering for Web data: from visual to semantic structures","authors":"C. Chung, Michael Gertz, Neel Sundaresan","doi":"10.1109/ICDE.2002.994697","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994697","url":null,"abstract":"Despite the advancement of XML, the majority of documents on the Web is still marked up with HTML for visual rendering purposes only, thus building a huge amount of legacy data. In order to facilitate querying Web based data in a way more efficient and effective than just keyword based retrieval, enriching such Web documents with both structure and semantics is necessary. We describe a novel approach to the integration of topic specific HTML documents into a repository of XML documents. In particular, we describe how topic specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimum information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques, and rules for document conversion and majority schema discovery. We finally demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124571706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A publish and subscribe architecture for distributed metadata management
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994739
M. Keidl, A. Kreutz, A. Kemper, D. Kossmann
The emergence of electronic marketplaces and other electronic services and applications on the Internet is creating a growing demand for the effective management of resources. Due to the nature of the Internet, such information changes rapidly. Furthermore, such information must be available for a large number of users and applications, and copies of pieces of information should be stored near those users that need this particular information. In this paper, we present the architecture of MDV ("Meta-Data Verwalter"), a distributed meta-data management system. MDV has a three-tier architecture and supports caching and replication in the middle tier so that queries can be evaluated locally. Users and applications specify the information they need and that is replicated using a specialized subscription language. In order to keep replicas up-to-date and to initiate the replication of new and relevant information, MDV implements a novel, scalable publish-and-subscribe algorithm. We describe this algorithm in detail, show how it can be implemented using a standard relational database system, and present the results of performance experiments conducted using our prototype implementation.
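A minimal sketch of the publish-and-subscribe matching step: each subscriber registers predicates over metadata attributes, and a newly published item is replicated to every subscriber whose predicates it satisfies. The equality-predicate form and the names used here are illustrative assumptions, not MDV's subscription language or algorithm.

```python
# Hypothetical subscriptions: caches subscribe to metadata matching their predicates.
subscriptions = {
    "cache-berlin":  {"service_type": "marketplace", "region": "eu"},
    "cache-seattle": {"service_type": "marketplace", "region": "us"},
}

def matching_subscribers(item, subs):
    """Return subscribers whose equality predicates all hold for the item."""
    return [sid for sid, pred in subs.items()
            if all(item.get(attr) == value for attr, value in pred.items())]

new_metadata = {"service_type": "marketplace", "region": "eu", "name": "shop-42"}
for subscriber in matching_subscribers(new_metadata, subscriptions):
    print(f"replicate {new_metadata['name']} to {subscriber}")   # -> cache-berlin
```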
{"title":"A publish and subscribe architecture for distributed metadata management","authors":"M. Keidl, A. Kreutz, A. Kemper, D. Kossmann","doi":"10.1109/ICDE.2002.994739","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994739","url":null,"abstract":"The emergence of electronic marketplaces and other electronic services and applications on the Internet is creating a growing demand for the effective management of resources. Due to the nature of the Internet, such information changes rapidly. Furthermore, such information must be available for a large number of users and applications, and copies of pieces of information should be stored near those users that need this particular information. In this paper, we present the architecture of MDV (\"Meta-Data Verwalter\"), a distributed meta-data management system. MDV has a three-tier architecture and supports caching and replication in the middle tier so that queries can be evaluated locally. Users and applications specify the information they need and that is replicated using a specialized subscription language. In order to keep replicas up-to-date and to initiate the replication of new and relevant information, MDV implements a novel, scalable publish-and-subscribe algorithm. We describe this algorithm in detail, show how it can be implemented using a standard relational database system, and present the results of performance experiments conducted using our prototype implementation.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122125771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
δ-clusters: capturing subspace correlation in a large data set
Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994771
Jiong Yang, Wei Wang, Haixun Wang, Philip S. Yu
Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace clusters) and have assumed that every object has an associated value on every dimension (e.g., biclusters). These existing cluster models may not always be adequate for capturing the coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications, including bioinformatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bioinformatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. We introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce near-optimal clustering results. The δ-cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
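A sketch of the coherence measure behind such cluster models and one FLOC-style move: the mean-squared residue of a submatrix is computed (skipping missing values), and a row is added only if it lowers the residue. The handling of missing values and the helper names are assumptions for illustration, not the authors' FLOC implementation.

```python
from statistics import mean

def residue(data, rows, cols):
    """Mean-squared residue of the submatrix (rows, cols); None marks a missing value."""
    cells = [(i, j) for i in rows for j in cols if data[i][j] is not None]
    overall = mean(data[i][j] for i, j in cells)
    row_avg = {i: mean(data[i][j] for j in cols if data[i][j] is not None) for i in rows}
    col_avg = {j: mean(data[i][j] for i in rows if data[i][j] is not None) for j in cols}
    return mean((data[i][j] - row_avg[i] - col_avg[j] + overall) ** 2 for i, j in cells)

def try_add_row(data, rows, cols, candidate):
    """One move: accept the candidate row only if it does not increase the residue."""
    if residue(data, rows + [candidate], cols) <= residue(data, rows, cols):
        return candidate
    return None

# Rows 0 and 1 shift by a constant (coherent); row 2 does not.
data = [[1.0, 2.0, 3.0],
        [2.0, 3.0, None],
        [5.0, 1.0, 4.0]]
print(residue(data, [0, 1], [0, 1, 2]))          # small: coherent submatrix
print(try_add_row(data, [0, 1], [0, 1, 2], 2))   # None: row 2 breaks coherence
```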
{"title":"/spl delta/-clusters: capturing subspace correlation in a large data set","authors":"Jiong Yang, Wei Wang, Haixun Wang, Philip S. Yu","doi":"10.1109/ICDE.2002.994771","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994771","url":null,"abstract":"Clustering has been an active research area of great practical importance for recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications including bio-informatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bio-informatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. We introduce a more general model, referred to as the /spl delta/-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce a near-optimal clustering results. The /spl delta/-cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far superior to the bicluster algorithm. We demonstrate the correctness and efficiency of the /spl delta/-cluster model and the FLOC algorithm on a number of real and synthetic data sets.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132191813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}