Text Document Classification Using Deep Learning Techniques
Authors: Safia Rehman, Aun Irtaza, Marriam Nawaz, Hareem Kibriya
Published in: 2022 International Conference on Emerging Trends in Electrical, Control, and Telecommunication Engineering (ETECTE), 2022-12-02
DOI: 10.1109/ETECTE55893.2022.10007316 (https://doi.org/10.1109/ETECTE55893.2022.10007316)
Citations: 1
Abstract
Advances in technology have greatly expanded the volume of online text documents. More electronic data has been created in the past two years alone than in all of prior human history. As a result, it is now essential to classify this data accurately according to its content, which also supports further processing and the extraction of valuable features. Text-based document classification is one of the most important problems in Natural Language Processing (NLP). Manual document classification relies heavily on human effort to examine and label documents based on their content. Traditional Machine Learning (ML) algorithms, in turn, require manual feature extraction before classification, which means choosing the best algorithm for extracting handcrafted features. Both strategies are time-consuming and prone to error. Deep Learning (DL) algorithms, on the other hand, do not require human intervention: they perform deep feature extraction and classification automatically, and they perform much better than traditional ML-based frameworks. In this paper, we present a fully automated and robust method for classifying online digital documents using DL-based models, namely BERT and RoBERTa. The proposed technique achieved a highest accuracy of 98.9% and can be deployed to classify digital text documents with high performance.
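The accuracy reported above is, in its usual definition, the fraction of documents whose predicted class matches the true class. The following is a minimal sketch of that metric only; the function name `accuracy` and the example labels are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Return the fraction of documents whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty label set")
    # Count positions where the prediction agrees with the ground truth.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for five documents (illustrative only, not from the paper).
true_labels = ["sports", "politics", "tech", "sports", "tech"]
predicted_labels = ["sports", "politics", "tech", "tech", "tech"]

print(f"accuracy = {accuracy(true_labels, predicted_labels):.1%}")  # prints "accuracy = 80.0%"
```

In this toy example four of the five predictions are correct, giving 80.0%. Accuracy alone can be misleading when classes are imbalanced, so per-class metrics are often reported alongside it.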