Reverse engineering for Web data: from visual to semantic structures

Proceedings 18th International Conference on Data Engineering. Pub Date: 2002-08-07. DOI: 10.1109/ICDE.2002.994697

C. Chung, Michael Gertz, Neel Sundaresan


Citations: 64

Abstract

Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, thus building up a huge amount of legacy data. To query Web-based data more efficiently and effectively than keyword-based retrieval alone allows, such Web documents must be enriched with both structure and semantics. We describe a novel approach to the integration of topic-specific HTML documents into a repository of XML documents. In particular, we describe how topic-specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimal information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery. Finally, we demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.
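
The idea of a majority schema can be illustrated with a small sketch. The abstract does not give the paper's actual discovery algorithm, so the Python below is only an assumption-laden illustration: it treats a "common structure" as a root-to-element tag path and keeps the paths that occur in more than a chosen fraction of the documents. The function names, the `threshold` parameter, and the sample resume documents are all hypothetical, and the output is a list of paths rather than a DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths found in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def majority_paths(xml_docs, threshold=0.5):
    """Keep the paths that occur in more than `threshold` of the documents.

    This is an illustrative stand-in for majority schema discovery,
    not the method described in the paper.
    """
    counts = Counter()
    for doc in xml_docs:
        # Each path is counted at most once per document.
        counts.update(element_paths(doc))
    return sorted(p for p, c in counts.items() if c / len(xml_docs) > threshold)


# Hypothetical XML documents, as might result from converting resume pages.
docs = [
    "<resume><contact/><education><degree/></education></resume>",
    "<resume><contact/><education/><hobbies/></resume>",
    "<resume><education><degree/></education><experience/></resume>",
]

print(majority_paths(docs))
```

With the three sample documents, the paths for `hobbies` and `experience` are dropped because each appears in only one document, while the `contact`, `education`, and `degree` paths survive. Turning the surviving paths into DTD element declarations would be a separate step that this sketch does not attempt.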
