Age Group Based Document Classification in Bahasa Indonesia
M. I. D. Putra, Budi Irmawati, W. Wedashwara, Dita Pramesti, Siti Oryza Khairunnisa
DOI: 10.1109/ICADEIS49811.2020.9277104
Published: 2020-10-20
Citations: 2
Abstract
The Internet provides articles that can be categorized by target readership according to gender, age, hobbies, and so on. To help ensure that readers consume articles appropriate to their age group, we propose a classification method and collect training data for it. This paper reports document classification by age group for Indonesian documents using a binary classification method. The classification uses term frequency-inverse document frequency (TF-IDF) features with a Multinomial Naïve Bayes classifier. The dataset was crowdsourced from three sites, bobo.grid.id, hai.grid.id, and www.detik.com, covering three reader age groups: elementary school children, teenagers, and adults. The experiments obtained a precision of 0.9406, a recall of 0.9341, and an F-score of 0.9374. They also showed that the datasets that were not stemmed performed better than those that were stemmed. This suggests that stemming, which is usually applied in document classification, discards some information in Indonesian texts. However, because this behavior did not occur for nouns, our future work is to investigate further the role of affixation in documents for the lower age groups.
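The pipeline described in the abstract (TF-IDF features fed to a Multinomial Naïve Bayes classifier, with no stemming) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `TfidfNaiveBayes`, the whitespace tokenizer, the smoothed IDF formula, the Laplace smoothing value, the direct three-class setup (rather than the binary method the abstract mentions), and the six toy sentences with their labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, with no stemming applied.
    return text.lower().split()


class TfidfNaiveBayes:
    """Hypothetical minimal Multinomial Naive Bayes over TF-IDF weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def _tfidf(self, tokens):
        # Raw term count times IDF; terms unseen in training are ignored.
        counts = Counter(t for t in tokens if t in self.idf)
        return {t: n * self.idf[t] for t, n in counts.items()}

    def fit(self, docs, labels):
        tokenized = [tokenize(d) for d in docs]
        n_docs = len(tokenized)
        doc_freq = Counter()
        for tokens in tokenized:
            doc_freq.update(set(tokens))
        # Smoothed IDF (one common variant, chosen here as an assumption).
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        self.classes = sorted(set(labels))
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(class_counts[c] / n_docs)
                          for c in self.classes}
        # Accumulate TF-IDF mass per class and term.
        mass = {c: defaultdict(float) for c in self.classes}
        for tokens, label in zip(tokenized, labels):
            for term, weight in self._tfidf(tokens).items():
                mass[label][term] += weight
        vocab_size = len(self.idf)
        self.log_likelihood = {}
        for c in self.classes:
            total = sum(mass[c].values()) + self.alpha * vocab_size
            self.log_likelihood[c] = {
                t: math.log((mass[c].get(t, 0.0) + self.alpha) / total)
                for t in self.idf
            }
        return self

    def predict(self, doc):
        feats = self._tfidf(tokenize(doc))
        scores = {
            c: self.log_prior[c]
            + sum(w * self.log_likelihood[c][t] for t, w in feats.items())
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Invented toy sentences and labels; not taken from the paper's dataset.
train_docs = [
    "kelinci kecil bermain di kebun",
    "dongeng kancil dan buaya",
    "konser musik band sekolah",
    "tips gaya remaja sekolah",
    "harga saham dan inflasi ekonomi",
    "pemerintah membahas anggaran ekonomi",
]
train_labels = ["children", "children", "teen", "teen", "adult", "adult"]

model = TfidfNaiveBayes().fit(train_docs, train_labels)
print(model.predict("dongeng kelinci kecil"))  # children
print(model.predict("saham ekonomi"))          # adult
```

Because the tokenizer keeps surface forms intact, affixed words stay distinct features, which is the condition under which the abstract reports better results. Comparing against a stemmed variant would only require swapping `tokenize` for one that applies an Indonesian stemmer.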