Effective Use of Augmentation Degree and Language Model for Synonym-based Text Augmentation on Indonesian Text Classification

Abdurrahman, A. Purwarianti
DOI: 10.1109/ICACSIS47736.2019.8979733
Published in: 2019 International Conference on Advanced Computer Science and information Systems (ICACSIS)
Publication date: 2019-10-01
URL: https://doi.org/10.1109/ICACSIS47736.2019.8979733
Citations: 3

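Abstract

Machine-learning-based text processing depends on a text dataset of adequate quality. Text augmentation research aims to enrich a dataset so that models trained on it perform better than models trained on the original dataset alone. We apply text augmentation to Indonesian text classification by replacing selected words with their synonyms. The process has two steps: determining how many words in a sentence to substitute, and selecting each substitute word from a synonym list. The first step is controlled by an augmentation degree. The second step uses a language model to choose the best substitute word. The synonym list is built from a thesaurus. We compared several options for building the language model: statistical models built from combinations of n-gram order and smoothing method, and simple neural models built with gram values of 3 and 5. The neural models take pre-trained word embeddings as input. The 5-gram neural model outperforms the other language model setups by a significant margin in perplexity. Using the best language model, we generate an augmented dataset and apply it to two classification tasks in aspect-based sentiment analysis: aspect categorization and sentiment classification. Experiments used augmentation degrees from 0.1 to 1. The best augmentation degree improves the classification model's performance by 3-4%.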
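The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It makes several assumptions: the augmentation degree is read as the fraction of replaceable words (those with a thesaurus entry) to substitute, positions are chosen at random, and an add-one smoothed bigram model stands in for the paper's 5-gram neural language model. The function names, toy corpus, and toy thesaurus are all hypothetical.

```python
import math
import random
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


def perplexity(tokens, unigrams, bigrams):
    """Per-token perplexity; lower means the model finds the sequence more likely."""
    n_predictions = len(tokens) + 1  # every token plus the end marker
    return math.exp(-log_prob(tokens, unigrams, bigrams) / n_predictions)


def augment(tokens, thesaurus, degree, unigrams, bigrams, rng):
    """Step 1: pick how many words to replace from the augmentation degree.
    Step 2: for each chosen word, keep the synonym the language model scores highest."""
    candidates = [i for i, tok in enumerate(tokens) if tok in thesaurus]
    # Assumption: degree is a fraction of the replaceable words, rounded to an integer.
    n_replace = min(len(candidates), round(degree * len(candidates)))
    result = list(tokens)
    for i in rng.sample(candidates, n_replace):
        def score(synonym, i=i):
            trial = result[:i] + [synonym] + result[i + 1:]
            return log_prob(trial, unigrams, bigrams)
        result[i] = max(thesaurus[tokens[i]], key=score)
    return result


# Toy data, for illustration only.
corpus = [
    ["makanan", "ini", "lezat", "sekali"],
    ["makanan", "di", "sini", "lezat"],
    ["pelayanan", "di", "sini", "baik"],
    ["rasanya", "sedap"],
]
thesaurus = {"enak": ["sedap", "lezat"]}
unigrams, bigrams = train_bigram_lm(corpus)

sentence = ["makanan", "ini", "enak", "sekali"]
augmented = augment(sentence, thesaurus, 1.0, unigrams, bigrams, random.Random(0))
print(augmented)  # -> ['makanan', 'ini', 'lezat', 'sekali']
```

In this toy run the model prefers "lezat" over "sedap" because the bigrams "ini lezat" and "lezat sekali" occur in the training corpus. Sweeping `degree` from 0.1 to 1 would mirror the range of augmentation degrees reported in the abstract.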