Print-friendly page extraction for web printing service

Sam Liu, Conglun Yao
Proceedings of the ACM Symposium on Document Engineering, pp. 89-92. Published 2011-09-19.
DOI: 10.1145/2034691.2034711 (https://doi.org/10.1145/2034691.2034711)

Abstract

Printing Web pages from browsers usually results in unsatisfactory printouts because the pages are typically ill-formatted and contain non-informative content such as navigation menus and ads. For this reason, print-worthy Web pages such as articles generally contain hyperlinks (or links) that lead to print-friendly pages containing the salient content. For a more desirable Web printing experience, the main Web content should be extracted to produce well-formatted pages. This paper describes a cloud service for Web printing based on automatic content extraction and repurposing from print-friendly pages. Content extraction from print-friendly pages is simpler and more reliable than from the original pages, but the many variations in how print links are represented in HTML make robust print-link detection more difficult than it first appears. First, the link can be text-based, image-based, or both. For example, there is a lexicon of phrases used to indicate print-friendly pages, such as "print", "print article", "print-friendly version", etc. In addition, some links use printer-resembling image icons, with or without a print phrase present. To complicate matters further, not all of the links contain a valid URL; in some cases the print-friendly page is generated dynamically, either by client-side JavaScript or by the server, so that no URL is present. Experimental results suggest that our solution is capable of achieving over 99% precision and 97% recall for print-friendly link extraction.
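
The link variations described in the abstract (text phrases, printer icons, and links with no usable URL) can be illustrated with a minimal sketch. This is not the authors' detector: the phrase set contains only the three examples quoted above, the icon test is a naive substring check on the image's `alt` and `src` attributes, and the names `PRINT_PHRASES`, `PrintLinkFinder`, and `find_print_links` are hypothetical. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: a tiny lexicon built from the three phrases quoted in the abstract.
PRINT_PHRASES = {"print", "print article", "print-friendly version"}


class PrintLinkFinder(HTMLParser):
    """Collects <a> elements whose text or image hints at a print-friendly page."""

    def __init__(self):
        super().__init__()
        self.candidates = []   # one dict per matched link
        self._current = None   # state of the <a> element currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._current = {"href": attrs.get("href") or "", "text": [], "icon": False}
        elif tag == "img" and self._current is not None:
            # Image-based cue (assumed heuristic): alt text or file name mentions "print".
            hint = ((attrs.get("alt") or "") + " " + (attrs.get("src") or "")).lower()
            if "print" in hint:
                self._current["icon"] = True

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._current is None:
            return
        link, self._current = self._current, None
        # Normalize whitespace and case before the lexicon lookup.
        text = " ".join("".join(link["text"]).lower().split())
        text_match = text in PRINT_PHRASES
        if text_match or link["icon"]:
            href = link["href"].strip()
            # Links driven by script or an empty anchor carry no fetchable URL.
            has_url = bool(href) and href != "#" and not href.lower().startswith("javascript:")
            self.candidates.append({
                "href": href,
                "text_match": text_match,
                "icon_match": link["icon"],
                "has_url": has_url,
            })


def find_print_links(html):
    """Return candidate print links found in an HTML string."""
    finder = PrintLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

Candidates with `has_url` set to false correspond to the dynamically generated case, where the print-friendly page cannot be fetched by URL alone. A substring test like this would also flag unrelated images such as `blueprint.png`, which is one reason robust detection is harder than it first appears.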