Urdu Text Classification: A comparative study using machine learning techniques
Imran Rasheed, Vivek Gupta, H. Banka, C. Kumar
2018 Thirteenth International Conference on Digital Information Management (ICDIM), published 2018-09-01
DOI: 10.1109/ICDIM.2018.8847044 (https://doi.org/10.1109/ICDIM.2018.8847044)
Citations: 16
Abstract
In the last decade, online content has reached a stage where news organizations are reluctant to invest in offline operations because of excessive aberrations in content distribution. However, the proliferation of digital data in unstructured, disordered form, particularly for languages like Urdu, has made information harder to access. This paper therefore addresses the peculiarities of classifying Urdu news text. The performance of three classifiers, Decision Tree (J48), Support Vector Machine (SVM), and k-nearest neighbor (KNN), was measured on Urdu text classification using the WEKA (Waikato Environment for Knowledge Analysis) tool. The assessment was carried out on a relatively large collection of over 16,678 Urdu documents, mainly news articles from The Daily Roshni, an Urdu newspaper. The TF-IDF weighting scheme was used for feature selection and extraction. The SVM classifier performed better than the other two classifiers, with promising accuracy and superior efficiency. The dataset was prepared according to the TREC (Text REtrieval Conference) community standard.
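To make the TF-IDF weighting and nearest-neighbor classification named in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's WEKA pipeline: the weighting is the textbook tf × log(N/df) form (WEKA's own filter may compute a different variant), the function names `idf_table`, `tfidf`, `cosine`, and `nearest_label` are hypothetical, and the four tokenized documents are invented placeholders rather than data from the study.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf(doc, idf):
    """Weight each term by relative frequency times IDF; unseen terms get 0."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf.get(t, 0.0) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(query, train_vectors, train_labels):
    """1-nearest-neighbor: return the label of the most similar training vector."""
    best = max(range(len(train_vectors)),
               key=lambda i: cosine(query, train_vectors[i]))
    return train_labels[best]


# Invented toy corpus (assumption): already tokenized, not from the paper.
train_docs = [
    ["cricket", "match", "team"],
    ["election", "vote", "party"],
    ["cricket", "team", "score"],
    ["party", "vote", "minister"],
]
train_labels = ["sports", "politics", "sports", "politics"]

idf = idf_table(train_docs)
train_vectors = [tfidf(doc, idf) for doc in train_docs]
query_vector = tfidf(["cricket", "score"], idf)
predicted = nearest_label(query_vector, train_vectors, train_labels)
```

The IDF table is built from the training documents only and reused for the query, so terms absent from training contribute nothing to the query vector. A real Urdu pipeline would also need tokenization and, typically, stop-word handling, which this sketch leaves out.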