A statistical and rule-based spelling and grammar checker for Indonesian text

Asanilta Fahda, A. Purwarianti
Published in: 2017 International Conference on Data and Software Engineering (ICoDSE), 2017-11-01
DOI: 10.1109/ICODSE.2017.8285846 (https://doi.org/10.1109/ICODSE.2017.8285846)
Citations: 14

Abstract

Spelling and grammar checkers are widely used tools that help detect and correct various writing errors. However, there are currently no proofreading systems capable of checking both spelling and grammar errors in Indonesian text. This paper proposes an Indonesian spelling and grammar checker prototype that combines rules with statistical methods. The rule matcher module currently uses 38 rules that detect, correct, and explain common errors in punctuation, word choice, and spelling. The spelling checker module looks up every word in a dictionary trie to find misspellings and uses Damerau-Levenshtein distance neighbors as correction candidates. Morphological analysis is also applied to certain word forms. A bigram/co-occurrence Hidden Markov Model ranks and selects the candidates. The grammar checker uses a trigram language model over tokens, POS tags, or phrase chunks to identify sentences with incorrect structures. In experiments, the co-occurrence HMM with an emission probability weight coefficient of 0.95 was selected as the most suitable model for the spelling checker. For the grammar checker, the phrase chunk model that normalizes by chunk length and uses a threshold score of −0.4 gave the best results. The document evaluation of this system showed an overall accuracy of 83.18%. The prototype is implemented as a web application.
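
The spelling-checker step described in the abstract (dictionary lookup in a trie, then Damerau-Levenshtein neighbors as correction candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the toy word list, and the `max_distance` parameter are all hypothetical. The sketch also assumes the restricted (optimal string alignment) variant of the distance and scans every dictionary word linearly instead of pruning the search along the trie. HMM ranking and morphological analysis are left out.

```python
from typing import Dict, List


class Trie:
    """Minimal dictionary trie supporting insertion and exact lookup."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.is_word = False

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word: str) -> bool:
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def words(self, prefix: str = "") -> List[str]:
        """Return every stored word in sorted depth-first order."""
        found = [prefix] if self.is_word else []
        for ch, child in sorted(self.children.items()):
            found.extend(child.words(prefix + ch))
        return found


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[rows - 1][cols - 1]


def candidates(word: str, trie: Trie, max_distance: int = 1) -> List[str]:
    """Return dictionary words within max_distance of a misspelled word."""
    if trie.contains(word):
        return []  # correctly spelled, nothing to suggest
    return [w for w in trie.words() if osa_distance(word, w) <= max_distance]


# Toy dictionary (hypothetical word list, for illustration only).
trie = Trie()
for w in ["makan", "makna", "malam", "minum"]:
    trie.insert(w)

print(candidates("maakn", trie))  # ['makan']
```

Counting an adjacent transposition as a single edit is what separates this distance from plain Levenshtein: "maakn" is one swap away from "makan", so it qualifies as a candidate at distance 1. A ranking model such as the HMM mentioned in the abstract would then choose among candidates when more than one survives.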