I. M. Krisna Dwitama, Muhammad Salman Al Farisi, Ika Alfina, A. Dinakaramani
2022 International Conference on Advanced Computer Science and Information Systems (ICACSIS). Published 2022-10-01. DOI: 10.1109/ICACSIS56558.2022.9923494 (https://doi.org/10.1109/ICACSIS56558.2022.9923494)
Building Morphological Analyzer for Informal Text in Indonesian
Informal text is heavily used by Indonesians on social media. However, NLP tools that can process such text are still very limited. In this work, we built a morphological analyzer for informal Indonesian text by adding new rules for informal words to an existing Indonesian morphological analyzer named Aksara. We also enriched the Aksara lexicon with informal words. The tool performs tokenization, lemmatization, and part-of-speech (POS) tagging. Aksara is rule-based: it uses a finite-state transducer compiled with Foma. To evaluate the tool, we created a gold standard of 102 sentences with 1434 tokens, of which around 30% are informal. The test results show that our tool achieves a tokenization accuracy of 97.21% and a case-insensitive lemmatization accuracy of 90.37%, while POS tagging reaches an F1-score of 80%.
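
The lemmatization and POS-tagging metrics reported above can be illustrated with a minimal Python sketch. It is not the authors' evaluation script: the function names `lemma_accuracy` and `macro_f1` are hypothetical, the sample tokens are invented for illustration, and macro-averaging is an assumption, since the abstract does not say how the F1-score was averaged. Tokenization accuracy is left out because the abstract does not define how it was computed.

```python
def lemma_accuracy(gold, pred, case_sensitive=False):
    """Fraction of aligned tokens whose predicted lemma equals the gold lemma."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned token-for-token")
    if not gold:
        return 0.0
    # Case-insensitive comparison folds case on both sides before matching.
    norm = (lambda s: s) if case_sensitive else str.casefold
    hits = sum(norm(g) == norm(p) for g, p in zip(gold, pred))
    return hits / len(gold)


def macro_f1(gold_tags, pred_tags):
    """Unweighted mean of per-tag F1 over every tag seen in gold or pred.

    Assumption: macro-averaging. The source does not state the averaging used.
    """
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold_tags and pred_tags must be aligned")
    labels = sorted(set(gold_tags) | set(pred_tags))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(gold_tags, pred_tags))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented sample data, for illustration only.
gold_lemmas = ["Saya", "tidak", "makan"]
pred_lemmas = ["saya", "tidak", "makanan"]
gold_pos = ["PRON", "ADV", "VERB", "NOUN"]
pred_pos = ["PRON", "ADV", "NOUN", "NOUN"]

print(lemma_accuracy(gold_lemmas, pred_lemmas))
print(lemma_accuracy(gold_lemmas, pred_lemmas, case_sensitive=True))
print(macro_f1(gold_pos, pred_pos))
```

Both functions assume the predicted tokens are already aligned one-to-one with the gold tokens; when tokenization differs between the two, an alignment step would be needed before scoring.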