Text Extraction and Categorization from Watermark Scientific Document in Bulk

Wai Chong Chia, P. Teh, C. M. Gill
Published in: 2018 3rd International Conference on Computational Intelligence and Applications (ICCIA)
Publication date: 2018-07-01
DOI: 10.1109/ICCIA.2018.00017 (https://doi.org/10.1109/ICCIA.2018.00017)

Abstract

Extracting information from a large number of scientific documents in Portable Document Format (PDF) is time-consuming without the help of an automated system. However, the lack of structural information in PDF files can cause many problems during extraction. Watermarks are one kind of object that can interfere with the process. When a PDF extraction tool is applied to a watermarked PDF, the watermark can disturb the order of the text and is often extracted as part of the text. If the text is later used for analysis, the watermark text, which should not be taken into consideration, may reduce the accuracy of the results. This paper proposes an approach to overcome this issue. The proposed approach uses direct text recognition from the PDF and optical character recognition (OCR) to produce two versions of the digital text, which can be combined for better accuracy. The results show that the proposed approach is capable of extracting text from PDFs with different watermark patterns.
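
The abstract does not say how the two text versions are combined, so the following is only a minimal sketch of one plausible strategy, not the authors' method. It assumes that the OCR pass does not pick up the watermark (for example, because a faint watermark disappears when the page image is binarized) but may misread characters, while the direct extraction has exact characters but includes stray watermark lines. Under that assumption, a directly extracted line is kept only if it closely matches some OCR line. The function name `combine_extractions`, the `threshold` value, and the sample strings are all hypothetical.

```python
import difflib


def combine_extractions(direct_lines, ocr_lines, threshold=0.8):
    """Keep directly extracted lines that have a close counterpart in the OCR text.

    Assumption (not from the paper): watermark text appears in the direct
    extraction but not in the OCR output, so lines without an OCR match
    are treated as watermark residue and dropped.
    """
    kept = []
    for line in direct_lines:
        # Best fuzzy similarity between this line and any OCR line (0.0 to 1.0).
        best = max(
            (
                difflib.SequenceMatcher(None, line.lower(), ocr.lower()).ratio()
                for ocr in ocr_lines
            ),
            default=0.0,
        )
        if best >= threshold:
            kept.append(line)
    return kept


# Hypothetical inputs: direct extraction with a watermark line mixed in,
# and OCR output with typical "m" -> "rn" misreads but no watermark.
direct = [
    "Extracting information from PDF is",
    "CONFIDENTIAL",
    "a time-consuming process.",
]
ocr = [
    "Extracting inforrnation from PDF is",
    "a time-consurning process.",
]

cleaned = combine_extractions(direct, ocr)
print(cleaned)
```

The output keeps the exact characters of the direct extraction and uses the OCR text only as a filter, so OCR misreads do not propagate into the result. The threshold of 0.8 is an arbitrary starting point and would need tuning on real documents.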