Semi-Automated Extraction of Targeted Data from Web Pages

22nd International Conference on Data Engineering Workshops (ICDEW'06). Pub Date: 2006-04-03. DOI: 10.1109/ICDEW.2006.135

Fabrice Estiévenart, Jean-Roch Meurisse, Jean-Luc Hainaut, Philippe Thiran

Citations: 3

Abstract

The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, although the main publication standard on the Web (HTML) is well suited to human reading, its poor semantics make it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents, together with their location in those documents. Relying on these rules, HTML-embedded data can be extracted into a more computable format. The definition of mapping rules is based on direct user input, mainly for the interpretation part, and on automatic computing for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
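
The abstract does not give the concrete format of a mapping rule, so the following is only an illustrative sketch of the idea: a rule pairs a user-supplied interpretation with a location in the document tree, and applying a set of rules to a page yields a structured record. The names `MappingRule`, `extract`, `concept`, and `path`, the sample page, and the use of Python's `xml.etree.ElementTree` path syntax are all assumptions made for this example; they are not taken from the paper or from Retrozilla.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRule:
    """Hypothetical rule: a semantic label plus a location in the page tree."""
    concept: str  # interpretation, supplied by the user
    path: str     # location, expressed in ElementTree's limited XPath subset


def extract(page: str, rules: list[MappingRule]) -> dict[str, list[str]]:
    """Apply each rule to a well-formed page and collect the matching text."""
    root = ET.fromstring(page)
    record: dict[str, list[str]] = {}
    for rule in rules:
        record[rule.concept] = [
            "".join(node.itertext()).strip() for node in root.findall(rule.path)
        ]
    return record


# Invented sample page; it must be well-formed XML for ElementTree to parse it.
PAGE = """<html><body>
<h1>Blue Kettle</h1>
<table>
<tr><td>Price</td><td>19.90</td></tr>
<tr><td>Stock</td><td>12</td></tr>
</table>
</body></html>"""

RULES = [
    MappingRule("product_name", "./body/h1"),
    MappingRule("price", "./body/table/tr[1]/td[2]"),
]

result = extract(PAGE, RULES)
print(result)
```

Because the rules address positions in the tree rather than the text itself, the same rule set could in principle be reused on any page of the cluster that shares this layout. Real-world HTML is often not well-formed, so a tolerant HTML parser would be needed in place of `ET.fromstring` outside of a sketch like this.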
