Effective Use of Augmentation Degree and Language Model for Synonym-based Text Augmentation on Indonesian Text Classification

2019 International Conference on Advanced Computer Science and information Systems (ICACSIS) Pub Date : 2019-10-01 DOI:10.1109/ICACSIS47736.2019.8979733

Abdurrahman, A. Purwarianti


Citations: 3

Abstract

Machine-learning-based text processing relies on a high-quality text dataset. Text augmentation research aims to enrich a text dataset so that models trained on it outperform those trained on the original dataset alone. We performed text augmentation for Indonesian text classification by replacing certain words with their synonyms. The process consists of two steps: determining the number of words to substitute in a sentence, and selecting each substitute word from a synonym list. The first step is controlled by an augmentation degree. The second step, selecting the best substitute word, uses a language model. The synonym list is built from a thesaurus. We compared several options for building the language model. The statistical models use combinations of n-gram order and smoothing, while the simple neural models use gram values of 3 and 5. The neural models take pre-trained word embeddings as input. The 5-gram neural model outperforms the other language model setups by a significant margin in perplexity. Using the best language model, an augmented dataset is generated and applied to two classification tasks of aspect-based sentiment analysis: aspect categorization and sentiment classification. Experiments used augmentation degrees from 0.1 to 1. The best augmentation degree improves the classification models' performance by 3-4%.
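The two-step process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the augmentation degree is the fraction of replaceable words (words that have a thesaurus entry) to substitute, and it uses an add-one smoothed bigram model as a stand-in for the language models compared in the paper. The thesaurus entries, the corpus, and the names `BigramLM` and `augment` are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical toy resources; the paper's thesaurus and corpus are not reproduced here.
THESAURUS = {
    "bagus": ["baik", "indah"],
    "makanan": ["hidangan", "santapan"],
}
CORPUS = [
    "makanan ini baik sekali",
    "hidangan ini baik sekali",
    "pelayanan di sini baik",
]


class BigramLM:
    """Add-one smoothed bigram model (an assumed stand-in, not the paper's model)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, tokens):
        """Natural-log probability of a token sequence, with sentence padding."""
        padded = ["<s>"] + list(tokens) + ["</s>"]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

    def perplexity(self, tokens):
        """Per-token perplexity; lower means the model finds the text more likely."""
        n_predictions = len(tokens) + 1  # each token plus the end marker
        return math.exp(-self.log_prob(tokens) / n_predictions)


def augment(tokens, degree, thesaurus, lm, rng):
    """Replace a `degree` fraction of replaceable words with LM-selected synonyms."""
    candidates = [i for i, tok in enumerate(tokens) if tok in thesaurus]
    # Step 1: the augmentation degree fixes how many words are substituted.
    n_replace = min(len(candidates), math.ceil(degree * len(candidates)))
    out = list(tokens)
    for i in rng.sample(candidates, n_replace):
        # Step 2: keep the synonym whose sentence the language model scores highest.
        def score(synonym, i=i):
            return lm.log_prob(out[:i] + [synonym] + out[i + 1:])

        out[i] = max(thesaurus[tokens[i]], key=score)
    return out


if __name__ == "__main__":
    lm = BigramLM(CORPUS)
    sentence = "makanan ini bagus sekali".split()
    print(augment(sentence, 1.0, THESAURUS, lm, random.Random(0)))
    # prints ['hidangan', 'ini', 'baik', 'sekali']
```

In this sketch a degree of 1 replaces every word that has a thesaurus entry, while smaller degrees replace a randomly chosen subset. The `perplexity` method shows the metric the abstract uses to compare language models, here computed per sentence for the toy model only.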


