Title: An Immuno-Inspired Approach Towards Post-Processing of OCR Errors
Author: Puberun Boruah
Published in: 2023 Fifth International Conference on Electrical, Computer and Communication Technologies (ICECCT)
Publication date: 2023-02-22
DOI: 10.1109/ICECCT56650.2023.10179692 (https://doi.org/10.1109/ICECCT56650.2023.10179692)
Citations: 0

Abstract

Errors are part and parcel of Computer Vision applications such as Optical Character Recognition (OCR). Unfortunately, the noise produced by these errors only proliferates further down the stages of Natural Language Processing pipelines. Among the reported works on post-processing of OCR text, most involve lexical approaches, feature-based machine learning models, merging of OCR outputs, or the use of other language models. This paper proposes an isolated-word approach to detecting OCR errors that relies on the principles of the Artificial Immune System (AIS). OCR error detection is treated as a classification problem in which OCR errors are pathogens and correct words are host cells. The Negative Selection Algorithm is used to classify any new token as an OCR error (pathogen) or a good term (host cell). A series of experiments illustrates that it is possible to construct such a system to help identify OCR errors independently of the language.
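The abstract does not say how tokens are encoded or how detectors are matched, so the following is only a minimal sketch of the general Negative Selection idea, not the paper's method. It assumes, purely for illustration, that detectors are random character bigrams and that a detector "matches" a word when it occurs among the word's boundary-padded bigrams. The names `ngrams`, `train_detectors`, and `is_error`, the word list, and all parameters are hypothetical.

```python
import random
import string

ALPHABET = string.ascii_lowercase
BOUNDARY = "#"  # assumed marker for word start/end so edge n-grams are distinct


def ngrams(word, n=2):
    """Return the set of character n-grams of a boundary-padded, lowercased word."""
    padded = BOUNDARY + word.lower() + BOUNDARY
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def train_detectors(self_words, n=2, num_candidates=5000, seed=0):
    """Negative selection: keep only random candidates that match no self word."""
    rng = random.Random(seed)
    self_grams = set()
    for word in self_words:
        self_grams |= ngrams(word, n)

    symbols = ALPHABET + BOUNDARY
    detectors = set()
    for _ in range(num_candidates):
        candidate = "".join(rng.choice(symbols) for _ in range(n))
        if candidate not in self_grams:  # censoring: discard self-matching candidates
            detectors.add(candidate)
    return detectors


def is_error(token, detectors, n=2):
    """Flag a token as a pathogen if any detector occurs among its n-grams."""
    return not ngrams(token, n).isdisjoint(detectors)


# Hypothetical "host cell" vocabulary of known-good words.
host_cells = ["modern", "morning", "corner", "number"]
detectors = train_detectors(host_cells)

print(is_error("modern", detectors))   # False by construction: no detector matches a self word
print(is_error("rnodern", detectors))  # likely True, depending on which detectors were sampled
```

Because every surviving detector was checked against the known-good vocabulary, a word from that vocabulary can never be flagged. Whether an unseen token is flagged depends on how well the sampled detectors cover the non-self space, which is why the candidate count is a tunable assumption here. Nothing in the sketch depends on a dictionary or grammar of a particular language beyond the chosen alphabet.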