A. Patil, Shreekant Jere, Reshma Ram, Shruthi Srinarasi
2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT). Published 2022-07-08. DOI: https://doi.org/10.1109/CONECCT55679.2022.9865812
T5W: A Paraphrasing Approach to Oversampling for Imbalanced Text Classification
Imbalanced datasets contain one or more classes that are underrepresented relative to the others. Such datasets pose problems for classification because too little data is available to train on the minority classes. To handle this imbalance in text data, this paper proposes an oversampling technique that combines the T5 Transformer and the WordNet corpus to balance the dataset by paraphrasing text in minority classes. WordNet is used to extract synonyms of "relevant" words, and substituting these synonyms into the sentences augments the dataset by increasing the data in minority classes. Combining these augmented sentences with the paraphrased sentences generated by the T5 Transformer yields a larger, balanced dataset. Standard classifiers such as Logistic Regression are used to compare performance metrics before and after resampling the dataset. The results show that oversampling with the proposed approach significantly improves the performance of text classification algorithms. To automate oversampling with a paraphrasing tool, the integration of the model with a Robotic Process Automation tool is also detailed.
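The synonym-substitution half of the approach can be illustrated with a minimal sketch. This is not the authors' implementation: the small `SYNONYMS` dictionary is a hypothetical stand-in for WordNet lookups, every word with an entry is treated as "relevant", and the function names `synonym_variants` and `oversample` are invented for illustration. The T5 paraphrasing step is omitted; in a fuller version, paraphrases produced by a T5 model would simply be added to the same candidate pool.

```python
import random
from collections import Counter

# Hypothetical stand-in for WordNet synonym lookups (assumption, not from the paper).
SYNONYMS = {
    "good": ["fine", "decent"],
    "bad": ["poor", "awful"],
    "movie": ["film", "picture"],
}


def synonym_variants(sentence, synonyms):
    """Return new sentences made by replacing one word at a time with a synonym."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for syn in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [syn] + words[i + 1:]))
    return variants


def oversample(texts, labels, synonyms, seed=0):
    """Add synonym-substituted sentences to minority classes.

    Each minority class is grown toward the size of the largest class.
    A class stays short of that size if too few distinct variants exist.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)

    for label, count in counts.items():
        if count >= target:
            continue
        originals = [t for t, l in zip(texts, labels) if l == label]
        seen = set(originals)
        pool = []
        for text in originals:
            for variant in synonym_variants(text, synonyms):
                if variant not in seen:  # skip duplicates of existing sentences
                    seen.add(variant)
                    pool.append(variant)
        rng.shuffle(pool)
        extra = pool[: target - count]
        out_texts.extend(extra)
        out_labels.extend([label] * len(extra))

    return out_texts, out_labels
```

Calling `oversample(texts, labels, SYNONYMS)` on a labeled corpus returns the original sentences followed by the generated ones, so a classifier such as Logistic Regression can be trained on both versions and the metrics compared before and after resampling.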