{"title":"印度尼西亚文本的统计和基于规则的拼写和语法检查器","authors":"Asanilta Fahda, A. Purwarianti","doi":"10.1109/ICODSE.2017.8285846","DOIUrl":null,"url":null,"abstract":"Spelling and grammar checkers are widely-used tools which aim to help in detecting and correcting various writing errors. However, there are currently no proofreading systems capable of checking both spelling and grammar errors in Indonesian text. This paper proposes an Indonesian spelling and grammar checker prototype which uses a combination of rules and statistical methods. The rule matcher module currently uses 38 rules which detect, correct, and explain common errors in punctuation, word choice, and spelling. The spelling checker module examines every word using a dictionary trie to find misspellings and Damerau-Levenshtein distance neighbors as correction candidates. Morphological analysis is also added for certain word forms. A bigram/co-occurrence Hidden Markov Model is used for ranking and selecting the candidates. The grammar checker uses a trigram language model from tokens, POS tags, or phrase chunks for identifying sentences with incorrect structures. By experiment, the co-occurrence HMM with an emission probability weight coefficient of 0.95 is selected as the most suitable model for the spelling checker. As for the grammar checker, the phrase chunk model which normalizes by chunk length and uses a threshold score of −0.4 gave the best results. The document evaluation of this system showed an overall accuracy of 83.18%. This prototype is implemented as a web application.","PeriodicalId":366005,"journal":{"name":"2017 International Conference on Data and Software Engineering (ICoDSE)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"A statistical and rule-based spelling and grammar checker for Indonesian text\",\"authors\":\"Asanilta Fahda, A. Purwarianti\",\"doi\":\"10.1109/ICODSE.2017.8285846\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Spelling and grammar checkers are widely-used tools which aim to help in detecting and correcting various writing errors. However, there are currently no proofreading systems capable of checking both spelling and grammar errors in Indonesian text. This paper proposes an Indonesian spelling and grammar checker prototype which uses a combination of rules and statistical methods. The rule matcher module currently uses 38 rules which detect, correct, and explain common errors in punctuation, word choice, and spelling. The spelling checker module examines every word using a dictionary trie to find misspellings and Damerau-Levenshtein distance neighbors as correction candidates. Morphological analysis is also added for certain word forms. A bigram/co-occurrence Hidden Markov Model is used for ranking and selecting the candidates. The grammar checker uses a trigram language model from tokens, POS tags, or phrase chunks for identifying sentences with incorrect structures. By experiment, the co-occurrence HMM with an emission probability weight coefficient of 0.95 is selected as the most suitable model for the spelling checker. As for the grammar checker, the phrase chunk model which normalizes by chunk length and uses a threshold score of −0.4 gave the best results. The document evaluation of this system showed an overall accuracy of 83.18%. This prototype is implemented as a web application.\",\"PeriodicalId\":366005,\"journal\":{\"name\":\"2017 International Conference on Data and Software Engineering (ICoDSE)\",\"volume\":\"50 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 International Conference on Data and Software Engineering (ICoDSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICODSE.2017.8285846\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on Data and Software Engineering (ICoDSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICODSE.2017.8285846","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A statistical and rule-based spelling and grammar checker for Indonesian text
Spelling and grammar checkers are widely-used tools which aim to help in detecting and correcting various writing errors. However, there are currently no proofreading systems capable of checking both spelling and grammar errors in Indonesian text. This paper proposes an Indonesian spelling and grammar checker prototype which uses a combination of rules and statistical methods. The rule matcher module currently uses 38 rules which detect, correct, and explain common errors in punctuation, word choice, and spelling. The spelling checker module examines every word using a dictionary trie to find misspellings and Damerau-Levenshtein distance neighbors as correction candidates. Morphological analysis is also added for certain word forms. A bigram/co-occurrence Hidden Markov Model is used for ranking and selecting the candidates. The grammar checker uses a trigram language model from tokens, POS tags, or phrase chunks for identifying sentences with incorrect structures. By experiment, the co-occurrence HMM with an emission probability weight coefficient of 0.95 is selected as the most suitable model for the spelling checker. As for the grammar checker, the phrase chunk model which normalizes by chunk length and uses a threshold score of −0.4 gave the best results. The document evaluation of this system showed an overall accuracy of 83.18%. This prototype is implemented as a web application.