Print-friendly page extraction for web printing service

Proceedings of the ACM Symposium on Document Engineering, pp. 89-92. Published: 2011-09-19. DOI: 10.1145/2034691.2034711

Sam Liu, Conglun Yao


Abstract

Printing Web pages from browsers usually produces unsatisfactory printouts because the pages are typically poorly formatted and contain non-informative content such as navigation menus and ads. For this reason, print-worthy Web pages such as articles generally contain hyperlinks (or links) that lead to print-friendly pages containing the salient content. For a more desirable Web printing experience, the main Web content should be extracted to produce well-formatted pages. This paper describes a cloud service for Web printing based on automatic content extraction and repurposing from print-friendly pages. Content extraction from print-friendly pages is simpler and more reliable than extraction from the original pages, but the many variations of print-link representations in HTML make robust print-link detection more difficult than it first appears. First, the link can be text-based, image-based, or both. For example, there is a lexicon of phrases used to indicate print-friendly pages, such as "print", "print article", and "print-friendly version". In addition, some links use printer-resembling image icons, with or without a print phrase present. To complicate matters further, not all of the links contain a valid URL; in some cases the print-friendly page is generated dynamically, either by client-side JavaScript or by the server, so that no URL is present. Experimental results suggest that our solution is capable of achieving over 99% precision and 97% recall for print-friendly link extraction.
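The link variations described in the abstract (text-based, image-based, and links without a usable URL) can be illustrated with a minimal sketch. This is not the authors' detector: the regular expression, the substring test on image attributes, and the names `PRINT_WORD`, `Candidate`, `PrintLinkFinder`, and `find_print_links` are all assumptions made for illustration, using only Python's standard `html.parser` module.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical stand-in for the lexicon: matches "print", "printer", "printable"
# as whole words, so "Print article" matches but "Blueprints" does not.
PRINT_WORD = re.compile(r"\bprint(er|able)?\b", re.IGNORECASE)


@dataclass
class Candidate:
    """One <a> element that looks like a print link (illustrative only)."""
    href: str = ""
    onclick: str = ""
    text: str = ""
    image_hints: list = field(default_factory=list)  # alt/src of nested <img>

    @property
    def text_based(self) -> bool:
        return bool(PRINT_WORD.search(self.text))

    @property
    def image_based(self) -> bool:
        # Assumption: printer icons carry "print" in their alt text or file name.
        return any("print" in hint.lower() for hint in self.image_hints)

    @property
    def has_url(self) -> bool:
        # Links driven by client-side script often have no fetchable URL.
        href = self.href.strip().lower()
        return bool(href) and not href.startswith(("javascript:", "#"))


class PrintLinkFinder(HTMLParser):
    """Collects anchors whose text or nested image suggests a print link."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._open = None  # anchor currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "a":
            self._open = Candidate(href=attr.get("href", ""),
                                   onclick=attr.get("onclick", ""))
        elif tag == "img" and self._open is not None:
            self._open.image_hints += [attr.get("alt", ""), attr.get("src", "")]

    def handle_data(self, data):
        if self._open is not None:
            self._open.text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            candidate, self._open = self._open, None
            if candidate.text_based or candidate.image_based:
                self.candidates.append(candidate)


def find_print_links(html: str) -> list:
    finder = PrintLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates


if __name__ == "__main__":
    sample = (
        '<a href="/story?print=1">Print article</a>'
        '<a href="javascript:window.print()"><img src="/img/printer_icon.gif" alt=""></a>'
        '<a href="/about">About us</a>'
    )
    for link in find_print_links(sample):
        print(link.href, link.text_based, link.image_based, link.has_url)
```

A candidate for which `has_url` is false corresponds to the dynamically generated case the abstract mentions; the abstract does not say how the service handles it, so the sketch only flags such links rather than resolving them.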

