Development of Browser Extension for HTML Web Page Content Extraction

Murat Karabulut, İslam Mayda
Published in: 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 2020-06-01
DOI: 10.1109/HORA49412.2020.9152891 (https://doi.org/10.1109/HORA49412.2020.9152891)
Citations: 2

Abstract

As the amount of content on websites grows, automatic content extraction from Web pages becomes more important. Although the problem has been studied extensively, the flexible structure of HTML has so far prevented any method from solving it completely. Methods that achieve partial success also lose performance over time as the structure of the Web changes and develops. In this study, a browser extension was developed to automatically download the text content of Web pages. The extension uses a parser built on the Document Object Model (DOM) structure to strip all tags and code from the page, and its output achieves a 100% recall rate. The extension operates independently of language; it was tested on different types of popular websites in Turkey and was shown to work successfully.
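
The abstract does not include the extension's source, so the following is only a minimal sketch of the general idea it describes: walk the parsed markup, keep the text nodes, and drop tags and embedded code. It uses Python's standard `html.parser` as a stand-in for the browser DOM the authors rely on. The names `TextExtractor`, `extract_text`, `recall`, and the `SKIPPED_TAGS` list are hypothetical, and the token-level recall measure is an assumption about how such a rate could be checked, not the paper's evaluation procedure.

```python
from html.parser import HTMLParser

# Elements whose contents are code or markup templates rather than
# readable text. This list is an assumption made for the sketch.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class TextExtractor(HTMLParser):
    """Collects text nodes while ignoring tags and skipped elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0          # > 0 while inside a skipped element
        self._chunks: list[str] = []  # one entry per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace left over from source formatting.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the text content of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def recall(extracted: str, reference: str) -> float:
    """Fraction of whitespace-separated reference tokens found in the output."""
    ref_tokens = reference.split()
    if not ref_tokens:
        return 1.0
    found = set(extracted.split())
    return sum(tok in found for tok in ref_tokens) / len(ref_tokens)


if __name__ == "__main__":
    page = (
        "<html><head><title>Haber</title><style>p{color:red}</style></head>"
        "<body><h1>Başlık</h1><p>Merhaba <b>dünya</b></p>"
        "<script>var x = 1;</script></body></html>"
    )
    output = extract_text(page)
    print(output)
    print(recall(output, "Başlık Merhaba dünya"))
```

Because this approach keeps every text node rather than trying to separate main content from navigation or advertisements, it favors recall over precision, which is consistent with the recall figure the abstract reports. Nothing in the sketch depends on the language of the text, since it only distinguishes text nodes from markup.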