Gilsiley Henrique Darú, Felipe Daltrozo da Motta Motta, Antonio Castelo, G. Loch
{"title":"Short text classification applied to item description: Some methods evaluation","authors":"Gilsiley Henrique Darú, Felipe Daltrozo da Motta Motta, Antonio Castelo, G. Loch","doi":"10.5433/1679-0375.2022v43n2p189","DOIUrl":null,"url":null,"abstract":"The increasing demand for information classification based on content in the age of social media and e-commerce has led to the need for automated product classification using their descriptions. This study aims to evaluate various techniques for this task, with a focus on descriptions written in Portuguese. A pipeline is implemented to preprocess the data, including lowercasing, accent removal, and unigram tokenization. The bag of words method is then used to convert text into numerical data, and five classification techniques are applied: argmaxtf, argmaxtfnorm, argmaxtfidf from information retrieval, and two machine learning methods logistic regression and support vector machines. The performance of each technique is evaluated using simple accuracy via thirty-fold cross validation. The results show that logistic regression achieves the highest mean accuracy among the evaluated techniques.","PeriodicalId":30173,"journal":{"name":"Semina Ciencias Exatas e Tecnologicas","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Semina Ciencias Exatas e Tecnologicas","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5433/1679-0375.2022v43n2p189","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
The increasing demand for information classification based on content in the age of social media and e-commerce has led to the need for automated product classification using their descriptions. This study aims to evaluate various techniques for this task, with a focus on descriptions written in Portuguese. A pipeline is implemented to preprocess the data, including lowercasing, accent removal, and unigram tokenization. The bag of words method is then used to convert text into numerical data, and five classification techniques are applied: argmaxtf, argmaxtfnorm, argmaxtfidf from information retrieval, and two machine learning methods logistic regression and support vector machines. The performance of each technique is evaluated using simple accuracy via thirty-fold cross validation. The results show that logistic regression achieves the highest mean accuracy among the evaluated techniques.