{"title":"Developer-friendly annotation-based HTML-to-XML transformation technology","authors":"Lendle Tseng","doi":"10.1145/2034691.2034707","DOIUrl":null,"url":null,"abstract":"Nowadays, the amount of information accessible on the web is huge. Although web users today expect a more integrated way to access information on the web, it is still rather difficult to \"integrate\" information from different web sites since most web pages are authored in HTML format, which is actually a presentation-oriented language and is usually considered unstructured. Today, there are many research works aiming at extracting information from web pages. Existing works typically transform the extracting results into structured or semi-structured data formats, thus other applications can further process the results to discover more useful information. Nevertheless, the unstructured nature of HTML makes the transformation process complex and can hardly be widely adopted. In this paper, an annotation-based HTML-to-XML ransformation technology is proposed. The mechanism is developed with both usability and simplicity in mind. With the proposed mechanism, ordinary web site developers simply add annotations to their web pages. Annotated web pages can then be processed by our software libraries and transformed into XML documents, which are machine-understandable. Software agents thus can be developed based on our technology.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"27 5 1","pages":"73-76"},"PeriodicalIF":0.0000,"publicationDate":"2011-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2034691.2034707","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Nowadays, the amount of information accessible on the web is huge. Although web users today expect a more integrated way to access information on the web, it is still rather difficult to "integrate" information from different web sites since most web pages are authored in HTML format, which is actually a presentation-oriented language and is usually considered unstructured. Today, there are many research works aiming at extracting information from web pages. Existing works typically transform the extracting results into structured or semi-structured data formats, thus other applications can further process the results to discover more useful information. Nevertheless, the unstructured nature of HTML makes the transformation process complex and can hardly be widely adopted. In this paper, an annotation-based HTML-to-XML ransformation technology is proposed. The mechanism is developed with both usability and simplicity in mind. With the proposed mechanism, ordinary web site developers simply add annotations to their web pages. Annotated web pages can then be processed by our software libraries and transformed into XML documents, which are machine-understandable. Software agents thus can be developed based on our technology.