Text Classification and Topic Modelling of Web Extracted Data
Niraj Kumar, R. Suman, Sanjay Kumar
2021 2nd Global Conference for Advancement in Technology (GCAT), published 2021-10-01
DOI: 10.1109/GCAT52182.2021.9587459 (https://doi.org/10.1109/GCAT52182.2021.9587459)
Citations: 3
Abstract
Text classification and topic modelling are the backbone of text analysis for large corpora. As the volume of unstructured data grows, analysing it becomes increasingly difficult, so methods are needed that can extract sensitive and semantic information from a corpus. Text classification is the organised categorisation of text so that sensitive information in it can be interpreted, while topic modelling finds the abstract topics in a collection of texts or documents and is frequently used to recover semantic information from textual data. In this paper, we apply parsing techniques to various websites to extract HTML and XML data, including the textual content, and apply preprocessing techniques to clean the data. For text classification, the machine-learning classifiers used in our experiments are Naive Bayes and Logistic Regression. Document models are built using three topic modelling methods: Latent Semantic Analysis, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. We then analyse and compare the performance of these models and classifiers on the processed textual data.
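The first stages named in the abstract (HTML parsing, preprocessing, and a Naive Bayes classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify their parser, cleaning steps, features, or data, so the class and function names (`TextExtractor`, `html_to_text`, `preprocess`, `NaiveBayes`), the tiny stopword list, the add-one smoothing, and the four toy pages below are all assumptions made for illustration. Logistic Regression and the three topic models are not covered here.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Parsing step: reduce an HTML page to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical stopword list, far smaller than one a real pipeline would use.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Preprocessing step: lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy pages invented for this sketch; they are not data from the paper.
pages = [
    ("<html><body><h1>Match report</h1><p>The team won the football match.</p></body></html>", "sport"),
    ("<html><body><p>The striker scored a goal in the match.</p></body></html>", "sport"),
    ("<html><body><p>The bank raised interest rates.</p><script>var x = 1;</script></body></html>", "finance"),
    ("<html><body><p>Stock market and bank shares fell.</p></body></html>", "finance"),
]
docs = [preprocess(html_to_text(html)) for html, _ in pages]
labels = [label for _, label in pages]
model = NaiveBayes().fit(docs, labels)

print(model.predict(preprocess("football match goal")))
print(model.predict(preprocess("bank interest rates")))
```

The sketch uses only the Python standard library so that it runs anywhere. A realistic pipeline would more likely rely on a dedicated HTML/XML parser and a machine-learning library, and would evaluate on held-out data instead of hand-picked queries.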