
Proceedings 18th International Conference on Data Engineering: Latest Publications

Advanced process-based component integration in Telcordia's Cable OSS
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994762
A. Ngu, Dimitrios Georgakopoulos, D. Baker, A. Cichocki, J. Desmarais, Peter Bates
Operation support systems (OSSs) integrate software components and network elements to automate the provisioning and monitoring of telecommunications services. This paper illustrates Telcordia's Cable OSS and shows how customers may use this OSS to provision IP and telephone services over the cable infrastructure. Telcordia's Cable OSS is a process-based application, i.e. a collection of flows, specialized components (e.g. a billing system, a call agent soft switch, network services and elements, cable modems, etc.) and corresponding adaptors that are integrated, coordinated and monitored using CMI (Collaboration Management Infrastructure), Telcordia's advanced process-based integration technology. Customers interact with the Cable OSS by using Web or IVR (interactive voice response) interfaces.
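The process-based integration the abstract describes — flows coordinating specialized components through adaptors — can be illustrated with a toy flow coordinator. CMI itself is proprietary; the component names and adaptor behavior below are assumptions for illustration only.

```python
# Toy sketch of a process-based flow: invoke each component adaptor in
# order and record the outcome of every step. Not CMI; names are invented.
def run_flow(steps, order):
    """Run each (name, adaptor) step against an order; return the step log."""
    log = []
    for name, adaptor in steps:
        result = adaptor(order)
        log.append((name, result))
    return log

# Hypothetical provisioning flow over three component adaptors.
steps = [
    ("billing", lambda o: f"account for {o}"),
    ("call-agent", lambda o: f"switch provisioned for {o}"),
    ("cable-modem", lambda o: f"modem configured for {o}"),
]
log = run_flow(steps, "order-42")
```

A real flow engine would add coordination, monitoring, and failure handling around each step; the point here is only the shape of the flow/adaptor split.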
Citations: 4
Runtime data declustering over SAN-connected PC cluster system
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994729
M. Oguchi, M. Kitsuregawa
Personal computer/workstation (PC/WS) clusters have come to be studied intensively in the field of parallel and distributed computing. From the viewpoint of applications, data intensive applications including data mining and ad-hoc query processing in databases are considered very important for massively parallel processors, in addition to the conventional scientific calculation. Thus, investigating the feasibility of such applications on a PC cluster is meaningful. A PC cluster connected with a storage area network (SAN) is built and evaluated with a data mining application. In the case of a SAN-connected cluster, each node can access all shared disks directly without using a LAN; thus, SAN-connected clusters achieve much better performance than LAN-connected clusters for disk-to-disk copy operations. However, if a lot of nodes access the same shared disk simultaneously, application performance degrades due to the I/O-bottleneck. A runtime data declustering method, in which data is declustered to several other disks dynamically during the execution of the application, is proposed to resolve this problem.
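The core idea — dynamically redistributing a hot disk's data across several other disks to relieve an I/O bottleneck — can be sketched with a simple round-robin placement. This is an illustrative assumption, not the paper's algorithm.

```python
# Hypothetical runtime-declustering sketch: pages of an I/O-bottlenecked
# shared disk are spread round-robin over n_disks target disks so that
# subsequent reads hit different spindles in parallel.
def decluster(pages, n_disks):
    """Map each page to a target disk; returns {disk_id: [pages]}."""
    placement = {}
    for i, page in enumerate(pages):
        placement.setdefault(i % n_disks, []).append(page)
    return placement

hot_disk_pages = [f"page{i}" for i in range(10)]
placement = decluster(hot_disk_pages, 3)
# Disk 0 receives 4 pages; disks 1 and 2 receive 3 each.
```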
Citations: 1
Data cleaning and XML: the DBLP experience
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994723
Wai Lup Low, W. Tok, M. Lee, T. Ling
With the increasing popularity of data-centric XML, data warehousing and mining applications are being developed for rapidly burgeoning XML data repositories. Data quality will no doubt be a critical factor for the success of such applications. Data cleaning, which refers to the processes used to improve data quality, has been well researched in the context of traditional databases. In earlier work we developed a knowledge-based framework for data cleaning relational databases. In this work, we present a novel attempt to apply this framework to XML databases. Our experimental dataset is the DBLP database, a popular online XML bibliography database used by many researchers.
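A rule-based cleaning pass of the kind the abstract alludes to can be sketched as below. The specific rules (whitespace normalization, trailing-period removal) and field names are illustrative assumptions, not the authors' knowledge-based framework.

```python
# Minimal data-cleaning sketch over DBLP-style bibliography records:
# each rule rewrites one field; rules here are invented examples.
import re

def clean_record(rec):
    """Apply simple normalization rules to a bibliography record."""
    cleaned = dict(rec)
    # Rule 1: collapse runs of whitespace in author names.
    cleaned["author"] = re.sub(r"\s+", " ", rec["author"]).strip()
    # Rule 2: trim titles and drop a trailing period.
    cleaned["title"] = rec["title"].strip().rstrip(".")
    return cleaned

raw = {"author": "  Wai  Lup   Low ", "title": " Data cleaning and XML. "}
result = clean_record(raw)
```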
Citations: 9
TAILOR: a record linkage toolbox
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994694
Mohamed G. Elfeky, A. Elmagarmid, Vassilios S. Verykios
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
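The record-linkage task itself — deciding whether two records refer to the same entity — can be sketched with a field-similarity comparison and a threshold. This is a generic illustration, not one of TAILOR's models; the threshold and similarity function are assumptions.

```python
# Generic record-linkage sketch: compare records field by field and
# classify the pair as a duplicate when the mean similarity clears a
# (hypothetical) threshold.
from difflib import SequenceMatcher

def field_sim(a, b):
    """Similarity of two field values in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(r1, r2, threshold=0.85):
    """Declare a duplicate when the average field similarity >= threshold."""
    sims = [field_sim(r1[k], r2[k]) for k in r1]
    return sum(sims) / len(sims) >= threshold

r1 = {"name": "Jon Smith", "city": "Boston"}
r2 = {"name": "John Smith", "city": "Boston"}
r3 = {"name": "Mary Jones", "city": "Austin"}
print(is_duplicate(r1, r2))  # → True
print(is_duplicate(r1, r3))  # → False
```

TAILOR's point is precisely that no single such model dominates, which is why it lets users plug in and tune their own.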
Citations: 337
Indexing of moving objects for location-based services
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994759
Simonas Šaltenis, Christian S. Jensen
Visionaries predict that the Internet will soon extend to billions of wireless devices, or objects, a substantial fraction of which will offer their changing positions to location-based services. This paper assumes an Internet-service scenario where objects that have not reported their position within a specified duration of time are expected to no longer be interested in, or of interest to, the service. Due to the possibility of many "expiring" objects, a highly dynamic database results. The paper presents an R-tree based technique for the indexing of the current positions of such objects. Different types of bounding regions are studied, and new algorithms are provided for maintaining the tree structure. Performance experiments indicate that, when compared to the approach where the objects are not assumed to expire, the new indexing technique can improve search performance by a factor of two or more without sacrificing update performance.
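The expiration idea — objects that have not reported within a time window are dropped from consideration — can be sketched independently of the R-tree machinery. The flat index, field layout, and `max_age` value below are assumptions, not the paper's structure.

```python
# Sketch of searching current positions with expiration: an object whose
# last report is older than max_age is treated as no longer interested
# in the service and skipped.
def search(index, region, now, max_age=60):
    """Return ids of fresh objects whose position lies in the query region."""
    x1, y1, x2, y2 = region
    return sorted(
        oid
        for oid, (x, y, t_report) in index.items()
        if now - t_report <= max_age and x1 <= x <= x2 and y1 <= y <= y2
    )

# oid -> (x, y, time of last position report)
index = {"a": (1, 1, 100), "b": (2, 2, 10), "c": (9, 9, 100)}
# At time 120: "b" has expired, "c" is fresh but outside the region.
print(search(index, (0, 0, 5, 5), now=120))  # → ['a']
```

The paper's contribution is doing this inside an R-tree, where bounding regions must account for expiring entries without hurting update performance.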
Citations: 214
Using Smodels (declarative logic programming) to verify correctness of certain active rules
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994724
Mutsumi Nakamura, R. Elmasri
In this paper we show that the language of declarative logic programming (DLP) with answer sets and its extensions can be used to specify database evolution due to updates and active rules, and to verify correctness of active rules with respect to a specification described using temporal logic and aggregate operators. We classify the specification of active rules into four kinds of constraints, which can be expressed using a particular extension of DLP called Smodels. Smodels allows us to specify the evolution, to specify the constraints, and to enumerate all possible initial database states and initial updates. Together, these can be used to analyze all possible evolution paths of an active database system to verify whether they satisfy a set of given constraints.
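The verification strategy — enumerate every initial database state, apply the active rule, and check a constraint over all resulting states — can be sketched by brute force. This is not Smodels or answer-set semantics; the rule and constraint below are invented examples of the shape of the check.

```python
# Brute-force sketch of the enumerate-and-verify idea (not Smodels):
# run a hypothetical active rule on every possible initial state and
# confirm a safety constraint holds in each resulting state.
from itertools import combinations

def apply_rule(state):
    # Hypothetical active rule: whenever 'order' holds, derive 'invoice'.
    new = set(state)
    if "order" in new:
        new.add("invoice")
    return frozenset(new)

def constraint(state):
    # Constraint to verify: every state containing 'order' has 'invoice'.
    return "order" not in state or "invoice" in state

facts = ["order", "paid"]
# All 2^2 = 4 possible initial states over the fact base.
initial_states = [frozenset(c) for r in range(len(facts) + 1)
                  for c in combinations(facts, r)]
verified = all(constraint(apply_rule(s)) for s in initial_states)
print(verified)  # → True
```

Answer-set solvers make this enumeration declarative and tractable for far larger rule sets than brute force allows.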
Citations: 0
An authorization system for temporal data
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994747
A. Gal, V. Atluri, Gang Xu
We present a system, called the Temporal Data Authorization Model (TDAM), for managing authorizations for temporal data. TDAM is capable of expressing access control policies based on the temporal characteristics of data. TDAM extends existing authorization models to allow the specification of temporal constraints on data, based on data validity, data capture time, and replication time, using either absolute or relative time references. The ability to specify access control based on such temporal aspects was not supported before. The formulae are evaluated with respect to various temporal assignments to ensure the correctness of access control.
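A temporal authorization check of this flavor can be sketched as an interval test: access is granted only while the request time falls inside the grant's validity interval. The field names and absolute-time representation are assumptions, not TDAM's actual model.

```python
# Hypothetical temporal-authorization sketch: a grant is valid only for
# requests whose timestamp lies in [valid_from, valid_until).
def authorized(grant, request_time):
    """True iff the request time falls inside the grant's validity interval."""
    return grant["valid_from"] <= request_time < grant["valid_until"]

grant = {
    "subject": "alice",
    "object": "sensor-readings",
    "valid_from": 100,   # absolute time references, per the model
    "valid_until": 200,
}
print(authorized(grant, 150))  # → True
print(authorized(grant, 250))  # → False
```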
Citations: 4
Reverse engineering for Web data: from visual to semantic structures
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994697
C. Chung, Michael Gertz, Neel Sundaresan
Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, creating a huge amount of legacy data. In order to query Web-based data more efficiently and effectively than keyword-based retrieval allows, enriching such Web documents with both structure and semantics is necessary. We describe a novel approach to the integration of topic-specific HTML documents into a repository of XML documents. In particular, we describe how topic-specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimum information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery.
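A restructuring rule of the kind the abstract describes maps visual HTML markup to semantic XML elements. The single rule below (bold/italic pair → person name and title) is an invented example, not one of the authors' rules.

```python
# Illustrative visual-to-semantic restructuring rule: re-tag a visually
# marked-up HTML fragment with semantic XML elements. The concept table
# here is a made-up example of the rule shape.
import re

CONCEPT_RULES = {
    # <b>name</b> <i>title</i>  →  semantic person element
    r"<b>(.*?)</b>\s*<i>(.*?)</i>":
        r"<person><name>\1</name><title>\2</title></person>",
}

def to_semantic_xml(html):
    """Apply each restructuring rule to the HTML fragment in turn."""
    for pattern, replacement in CONCEPT_RULES.items():
        html = re.sub(pattern, replacement, html)
    return html

result = to_semantic_xml("<b>Jane Doe</b> <i>Engineer</i>")
print(result)  # → <person><name>Jane Doe</name><title>Engineer</title></person>
```

Real-world conversion needs a proper HTML parser rather than regexes; the sketch only shows the rule-driven mapping from layout tags to concepts.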
Citations: 64
A publish and subscribe architecture for distributed metadata management
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994739
M. Keidl, A. Kreutz, A. Kemper, D. Kossmann
The emergence of electronic marketplaces and other electronic services and applications on the Internet is creating a growing demand for the effective management of resources. Due to the nature of the Internet, such information changes rapidly. Furthermore, such information must be available to a large number of users and applications, and copies of pieces of information should be stored near the users that need them. In this paper, we present the architecture of MDV ("Meta-Data Verwalter"), a distributed meta-data management system. MDV has a three-tier architecture and supports caching and replication in the middle tier so that queries can be evaluated locally. Using a specialized subscription language, users and applications specify the information they need and that should be replicated. In order to keep replicas up-to-date and to initiate the replication of new and relevant information, MDV implements a novel, scalable publish-and-subscribe algorithm. We describe this algorithm in detail, show how it can be implemented using a standard relational database system, and present the results of performance experiments conducted using our prototype implementation.
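The publish-and-subscribe core — subscribers register predicates, and each published metadata update is replicated to every subscriber whose predicate it satisfies — can be sketched in a few lines. This is a generic in-memory broker, not MDV's scalable algorithm; the subscription predicates are assumptions.

```python
# Minimal publish/subscribe sketch: subscriptions are predicates over
# update attributes; publish() delivers an update to every matching
# subscriber's replica. Not MDV's algorithm.
class Broker:
    def __init__(self):
        self.subs = []          # list of (subscriber, predicate)
        self.delivered = {}     # subscriber -> replicated updates

    def subscribe(self, name, predicate):
        self.subs.append((name, predicate))
        self.delivered[name] = []

    def publish(self, update):
        for name, predicate in self.subs:
            if predicate(update):
                self.delivered[name].append(update)

broker = Broker()
# Hypothetical subscriptions: a regional cache and a service catalog.
broker.subscribe("eu-cache", lambda u: u["region"] == "eu")
broker.subscribe("all-services", lambda u: u["kind"] == "service")
broker.publish({"region": "eu", "kind": "service", "id": 7})
broker.publish({"region": "us", "kind": "service", "id": 8})
```

MDV's contribution is making this matching and replication scale over a standard relational database; the sketch only shows the delivery semantics.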
Citations: 7
δ-clusters: capturing subspace correlation in a large data set
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994771
Jiong Yang, Wei Wang, Haixun Wang, Philip S. Yu
Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications including bio-informatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bio-informatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. We introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce near-optimal clustering results. The δ-cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
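The coherence notion behind bicluster-style models, which the δ-cluster model generalizes, can be sketched via the mean-squared residue of a submatrix: rows that differ only by additive shifts have residue zero even though their raw values differ. The sketch below computes that residue; it is illustrative of the coherence measure, not the FLOC algorithm.

```python
# Mean-squared residue of a submatrix (rows x cols): zero residue means
# perfect additive coherence among the selected objects and attributes.
def mean_squared_residue(matrix, rows, cols):
    row_means = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_means = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(matrix[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return sum(
        (matrix[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in rows for j in cols
    ) / (len(rows) * len(cols))

# Rows 0 and 1 differ by a constant shift of 10, so they are perfectly
# coherent on all three attributes despite very different values.
m = [[1, 2, 3], [11, 12, 13], [5, 0, 9]]
print(mean_squared_residue(m, rows=[0, 1], cols=[0, 1, 2]))  # → 0.0
```

A move-based search like FLOC repeatedly adds or removes a row/column, keeping the move that best lowers this kind of residue.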
Citations: 364
Journal
Proceedings 18th International Conference on Data Engineering