Encoder-Decoder Language Model for Khmer Handwritten Text Recognition in Historical Documents

2022 14th International Conference on Software, Knowledge, Information Management and Applications (SKIMA). Pub Date: 2022-12-02. DOI: 10.1109/SKIMA57145.2022.10029532

Seanghort Born, Dona Valy, Phutphalla Kong




Encoder-Decoder Language Model for Khmer Handwritten Text Recognition in Historical Documents

Correcting spelling errors in text extracted from Khmer palm leaf manuscripts by handwritten text recognition (HTR) systems is challenging. This study develops a Khmer language model to support that task. The model uses long short-term memory (LSTM) modules to improve text recognition performance by predicting a sequence of characters as output. Its architecture follows an encoder-decoder design with two parts: an encoder that captures the context of the erroneous input word, and a decoder that predicts the correctly spelled output word. Experimental evaluations are conducted on a text corpus of Khmer words extracted from the Sleuk-Rith set.
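The encoder-decoder data flow described in the abstract can be illustrated with a minimal character-level sketch. This is not the authors' implementation: the class names (`LSTMCell`, `Seq2SeqCorrector`), the hidden size, the one-hot character encoding, the greedy decoding loop, and the three-letter alphabet are all assumptions made for illustration, and the weights are random and untrained, so the output is not a meaningful correction. The sketch only shows how an encoder LSTM compresses the input word into a state that a decoder LSTM then unrolls character by character.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

class LSTMCell:
    """One LSTM cell; the four gates share a weight matrix over [x; h]."""
    def __init__(self, rng, input_size, hidden_size):
        self.n = hidden_size
        self.w = rand_matrix(rng, 4 * hidden_size, input_size + hidden_size)
        self.b = [0.0] * (4 * hidden_size)

    def step(self, x, h, c):
        n = self.n
        z = [a + b for a, b in zip(matvec(self.w, x + h), self.b)]
        i = [sigmoid(v) for v in z[0:n]]              # input gate
        f = [sigmoid(v) for v in z[n:2 * n]]          # forget gate
        o = [sigmoid(v) for v in z[2 * n:3 * n]]      # output gate
        g = [math.tanh(v) for v in z[3 * n:4 * n]]    # candidate cell state
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

class Seq2SeqCorrector:
    """Hypothetical word-level corrector: encoder LSTM -> decoder LSTM."""
    def __init__(self, alphabet, hidden_size=16, seed=0):
        rng = random.Random(seed)
        # Index 0 is the start marker, index 1 the end marker.
        self.symbols = ["<sos>", "<eos>"] + sorted(set(alphabet))
        self.index = {s: k for k, s in enumerate(self.symbols)}
        self.hidden_size = hidden_size
        vocab = len(self.symbols)
        self.encoder = LSTMCell(rng, vocab, hidden_size)
        self.decoder = LSTMCell(rng, vocab, hidden_size)
        self.out = rand_matrix(rng, vocab, hidden_size)  # hidden -> char scores

    def encode(self, word):
        # Read the (possibly misspelled) word one character at a time.
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for ch in word:
            x = one_hot(self.index[ch], len(self.symbols))
            h, c = self.encoder.step(x, h, c)
        return h, c

    def correct(self, word, max_len=20):
        # The decoder starts from the encoder's final state and greedily
        # emits the highest-scoring character until <eos> or max_len.
        h, c = self.encode(word)
        token = self.index["<sos>"]
        output = []
        for _ in range(max_len):
            x = one_hot(token, len(self.symbols))
            h, c = self.decoder.step(x, h, c)
            scores = matvec(self.out, h)
            # Skip index 0 so the start marker is never emitted.
            token = max(range(1, len(scores)), key=scores.__getitem__)
            if token == self.index["<eos>"]:
                break
            output.append(self.symbols[token])
        return "".join(output)

# Illustrative alphabet of three Khmer consonants; weights are untrained.
model = Seq2SeqCorrector(alphabet="កខគ")
print(model.correct("កខ"))
```

In a trained system, the encoder, decoder, and output weights would be fitted on pairs of erroneous and correct words, and the decoder would typically be trained with teacher forcing before greedy or beam-search decoding is used at inference time.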
