基于子词嵌入的高融合印度语OCR校正

2019 International Conference on Document Analysis and Recognition (ICDAR) Pub Date : 2019-09-01 DOI:10.1109/ICDAR.2019.00034

Rohit Saluja, Mayur Punjabi, Mark J. Carman, Ganesh Ramakrishnan, P. Chaudhuri

{"title":"基于子词嵌入的高融合印度语OCR校正","authors":"Rohit Saluja, Mayur Punjabi, Mark J. Carman, Ganesh Ramakrishnan, P. Chaudhuri","doi":"10.1109/ICDAR.2019.00034","DOIUrl":null,"url":null,"abstract":"Texts in Indic Languages contain a large proportion of out-of-vocabulary (OOV) words due to frequent fusion using conjoining rules (of which there are around 4000 in Sanskrit). OCR errors further accentuate this complexity for the error correction systems. Variations of sub-word units such as n-grams, possibly encapsulating the context, can be extracted from the OCR text as well as the language text individually. Some of the sub-word units that are derived from the texts in such languages highly correlate to the word conjoining rules. Signals such as frequency values (on a corpus) associated with such sub-word units have been used previously with log-linear classifiers for detecting errors in Indic OCR texts. We explore two different encodings to capture such signals and augment the input to Long Short Term Memory (LSTM) based OCR correction models, that have proven useful in the past for jointly learning the language as well as OCR-specific confusions. The first type of encoding makes direct use of sub-word unit frequency values, derived from the training data. The formulation results in faster convergence and better accuracy values of the error correction model on four different languages with varying complexities. The second type of encoding makes use of trainable sub-word embeddings. We introduce a new procedure for training fastText embeddings on the sub-word units and further observe a large gain in F-Scores, as well as word-level accuracy values.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Sub-Word Embeddings for OCR Corrections in Highly Fusional Indic Languages\",\"authors\":\"Rohit Saluja, Mayur Punjabi, Mark J. Carman, Ganesh Ramakrishnan, P. Chaudhuri\",\"doi\":\"10.1109/ICDAR.2019.00034\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Texts in Indic Languages contain a large proportion of out-of-vocabulary (OOV) words due to frequent fusion using conjoining rules (of which there are around 4000 in Sanskrit). OCR errors further accentuate this complexity for the error correction systems. Variations of sub-word units such as n-grams, possibly encapsulating the context, can be extracted from the OCR text as well as the language text individually. Some of the sub-word units that are derived from the texts in such languages highly correlate to the word conjoining rules. Signals such as frequency values (on a corpus) associated with such sub-word units have been used previously with log-linear classifiers for detecting errors in Indic OCR texts. We explore two different encodings to capture such signals and augment the input to Long Short Term Memory (LSTM) based OCR correction models, that have proven useful in the past for jointly learning the language as well as OCR-specific confusions. The first type of encoding makes direct use of sub-word unit frequency values, derived from the training data. The formulation results in faster convergence and better accuracy values of the error correction model on four different languages with varying complexities. The second type of encoding makes use of trainable sub-word embeddings. We introduce a new procedure for training fastText embeddings on the sub-word units and further observe a large gain in F-Scores, as well as word-level accuracy values.\",\"PeriodicalId\":325437,\"journal\":{\"name\":\"2019 International Conference on Document Analysis and Recognition (ICDAR)\",\"volume\":\"64 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 International Conference on Document Analysis and Recognition (ICDAR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2019.00034\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Document Analysis and Recognition (ICDAR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2019.00034","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

由于频繁使用连接规则进行融合，印度语的文本中包含了很大比例的词汇外(OOV)单词(梵语中大约有4000个)。OCR误差进一步加剧了纠错系统的复杂性。子词单位的变化，如n-gram，可能封装上下文，可以分别从OCR文本和语言文本中提取。从这些语言的文本中衍生出来的一些子词单位与单词连接规则高度相关。与这些子词单位相关联的频率值等信号(在语料库上)以前已与对数线性分类器一起用于检测印度OCR文本中的错误。我们探索了两种不同的编码来捕获这些信号，并将输入增强到基于长短期记忆(LSTM)的OCR校正模型，这些模型在过去被证明对共同学习语言以及OCR特异性混淆很有用。第一种编码直接使用从训练数据中导出的子词单位频率值。在不同复杂程度的四种语言下，该模型的收敛速度更快，精度值更高。第二种类型的编码使用可训练的子词嵌入。我们引入了一种新的过程，在子词单元上训练快速文本嵌入，并进一步观察到F-Scores和词级精度值的大幅提高。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Sub-Word Embeddings for OCR Corrections in Highly Fusional Indic Languages

Texts in Indic Languages contain a large proportion of out-of-vocabulary (OOV) words due to frequent fusion using conjoining rules (of which there are around 4000 in Sanskrit). OCR errors further accentuate this complexity for the error correction systems. Variations of sub-word units such as n-grams, possibly encapsulating the context, can be extracted from the OCR text as well as the language text individually. Some of the sub-word units that are derived from the texts in such languages highly correlate to the word conjoining rules. Signals such as frequency values (on a corpus) associated with such sub-word units have been used previously with log-linear classifiers for detecting errors in Indic OCR texts. We explore two different encodings to capture such signals and augment the input to Long Short Term Memory (LSTM) based OCR correction models, that have proven useful in the past for jointly learning the language as well as OCR-specific confusions. The first type of encoding makes direct use of sub-word unit frequency values, derived from the training data. The formulation results in faster convergence and better accuracy values of the error correction model on four different languages with varying complexities. The second type of encoding makes use of trainable sub-word embeddings. We introduce a new procedure for training fastText embeddings on the sub-word units and further observe a large gain in F-Scores, as well as word-level accuracy values.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 International Conference on Document Analysis and Recognition (ICDAR)

自引率

0.00%

发文量