Effective Use of Augmentation Degree and Language Model for Synonym-based Text Augmentation on Indonesian Text Classification
Authors: Abdurrahman, A. Purwarianti
Published in: 2019 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2019-10-01
DOI: 10.1109/ICACSIS47736.2019.8979733 (https://doi.org/10.1109/ICACSIS47736.2019.8979733)
Citations: 3
Abstract
Machine learning based text processing relies on a high-quality text dataset. Text augmentation research aims to enrich a text dataset so that models trained on it perform better than those trained on the original dataset. We applied text augmentation to Indonesian text classification by replacing certain words with their synonyms. The process consists of two steps: determining the number of words to be substituted in a sentence, and selecting the substitute word from the synonym list. The first step is controlled by the augmentation degree. The second step, selecting the best substitute word, uses a language model. The synonym list is built from a thesaurus. We compared several options for building the language model. The statistical models are built using combinations of n-gram order and smoothing, while the simple neural models are built using n-gram sizes of 3 and 5. The neural models take pre-trained word embeddings as input. The 5-gram neural model outperforms the other language model setups by a significant margin in perplexity. Using the best language model, an augmented dataset is generated and applied to two classification tasks of aspect-based sentiment analysis: aspect categorization and sentiment classification. Experiments were conducted with augmentation degrees from 0.1 to 1. The best augmentation degree improves the classification models' performance by 3-4%.
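The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the augmentation degree is read as the fraction of words with a thesaurus entry that get replaced, a toy add-one-smoothed bigram model (`BigramLM`) stands in for the paper's statistical and neural language models, and the names `augment`, `corpus`, and `synonyms` as well as the tiny Indonesian examples are hypothetical.

```python
import math
import random
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram model (a stand-in, not the paper's model)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sent):
        """Natural-log probability of a tokenized sentence."""
        tokens = ["<s>"] + sent + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total

    def perplexity(self, sent):
        """Per-token perplexity; lower means the model finds the sentence likelier."""
        return math.exp(-self.log_prob(sent) / (len(sent) + 1))


def augment(tokens, synonyms, degree, lm, rng=None):
    """Replace a share of the replaceable words with their best-scoring synonym.

    Assumption: `degree` (0 to 1) is the fraction of words that have a
    synonym entry and should be substituted.
    """
    rng = rng or random.Random(0)
    candidates = [i for i, tok in enumerate(tokens) if synonyms.get(tok)]
    n_replace = min(len(candidates), round(degree * len(candidates)))
    result = list(tokens)
    for i in rng.sample(candidates, n_replace):
        # Step 2: keep the synonym whose sentence the language model scores highest.
        result[i] = max(
            synonyms[tokens[i]],
            key=lambda s: lm.log_prob(result[:i] + [s] + result[i + 1:]),
        )
    return result


# Hypothetical toy data for illustration only.
corpus = [
    ["makanan", "ini", "sangat", "enak"],
    ["makanan", "itu", "sangat", "enak"],
    ["tempat", "ini", "sangat", "bagus"],
]
synonyms = {"lezat": ["enak", "sedap"]}
lm = BigramLM(corpus)

print(augment(["makanan", "ini", "sangat", "lezat"], synonyms, 1.0, lm))
# → ['makanan', 'ini', 'sangat', 'enak']
```

In this sketch, a degree of 1.0 replaces every word that has a synonym entry, while a degree of 0 leaves the sentence unchanged. Swapping in a stronger language model only requires an object that exposes the same `log_prob` method.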