Gilsiley Henrique Darú, Felipe Daltrozo da Motta Motta, Antonio Castelo, G. Loch
Semina Ciencias Exatas e Tecnologicas, published 2022-12-30. DOI: 10.5433/1679-0375.2022v43n2p189 (https://doi.org/10.5433/1679-0375.2022v43n2p189)
Short text classification applied to item description: Some methods evaluation
The growing demand for content-based information classification in the age of social media and e-commerce has created a need to classify products automatically from their descriptions. This study evaluates several techniques for this task, focusing on descriptions written in Portuguese. A pipeline preprocesses the data by lowercasing, removing accents, and tokenizing into unigrams. The bag-of-words method then converts the text into numerical data, and five classification techniques are applied: three from information retrieval (argmaxtf, argmaxtfnorm, and argmaxtfidf) and two machine learning methods (logistic regression and support vector machines). Each technique is evaluated by simple accuracy under thirty-fold cross-validation. The results show that logistic regression achieves the highest mean accuracy among the evaluated techniques.
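The preprocessing and bag-of-words steps named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokenization, the function names, and the toy descriptions are all hypothetical. In particular, the abstract does not define argmaxtf, so `predict_argmax_tf` below is only one plausible reading of the name (pick the class whose summed term frequencies best match the description).

```python
import unicodedata
from collections import Counter, defaultdict


def preprocess(text):
    """Lowercase, strip accents, and split into unigram tokens."""
    lowered = text.lower()
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: whitespace tokenization; the abstract only says "unigram".
    return no_accents.split()


def bag_of_words(tokens):
    """Map each token to its count within one description."""
    return Counter(tokens)


def fit_class_term_counts(descriptions, labels):
    """Sum term counts per class (assumed basis for a tf-style scorer)."""
    counts = defaultdict(Counter)
    for text, label in zip(descriptions, labels):
        counts[label].update(preprocess(text))
    return counts


def predict_argmax_tf(counts, text):
    """Hypothetical reading of 'argmaxtf': class with the highest summed tf."""
    tokens = preprocess(text)
    scores = {label: sum(tf[t] for t in tokens) for label, tf in counts.items()}
    return max(scores, key=scores.get)


# Made-up toy data for illustration; not taken from the study.
train_texts = [
    "Café torrado e moído 500g",
    "Açúcar refinado 1kg",
    "Caneta esferográfica azul",
]
train_labels = ["food", "food", "stationery"]

model = fit_class_term_counts(train_texts, train_labels)
prediction = predict_argmax_tf(model, "CAFÉ moído extra forte")
```

Because accents are removed and case is folded before counting, "CAFÉ" and "Café" map to the same token, which matters for short product descriptions where each word carries much of the signal.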