Christos N. Karras, Aristeidis Karras, D. Tsolis, K. Giotopoulos, S. Sioutas
Title: Distributed Gibbs Sampling and LDA Modelling for Large Scale Big Data Management on PySpark
DOI: 10.1109/SEEDA-CECNSM57760.2022.9932990
Published: 2022-09-23
Journal: 计算机工程与设计 (Computer Engineering and Design), volume 119 1, pages 1-8
Citations: 5
Abstract
Big data management methods are paramount in the modern era, as applications generate massive amounts of data from many different sources. There is therefore a pressing need for adaptive, fast and robust frameworks that can handle massive datasets effectively. Distributed environments such as Apache Spark are of particular interest: they handle such data by forming clusters in which each node stores a portion of the data locally, and results are returned through Resilient Distributed Datasets (RDDs). In this paper, a method for distributed marginal Gibbs sampling for the widely used latent Dirichlet allocation (LDA) model is implemented on PySpark, along with a Metropolis-Hastings random walker. The Distributed LDA (DLDA) algorithm splits a given dataset into P partitions and performs local LDA on each partition, treating each document independently. Every nth iteration, the local LDA models, which were trained on distinct partitions, are combined to ensure that the model can converge. Experimental results are promising: the proposed system achieves final model quality comparable to that of sequential LDA, and delivers a significant speedup on massive datasets.
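The partition-then-merge scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' PySpark implementation: it is plain Python with no Spark dependency, the partitions are processed one after another rather than in parallel, the Metropolis-Hastings step is omitted, and the merge rule (adding each partition's count changes to the shared topic-word counts) is an assumption borrowed from common approximate distributed LDA schemes. All names (`init_state`, `local_sweep`, `distributed_lda`, `sync_every`) and the hyperparameter defaults are hypothetical.

```python
import random


def init_state(docs, K, V, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    n_dk = [[0] * K for _ in docs]        # document-topic counts
    n_kw = [[0] * V for _ in range(K)]    # topic-word counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dk[d][z[d][i]] += 1
            n_kw[z[d][i]][w] += 1
    return z, n_dk, n_kw


def local_sweep(doc_ids, docs, z, n_dk, kw, k_tot, alpha, beta, rng):
    """One collapsed Gibbs sweep over a partition, updating its private counts."""
    K, V = len(kw), len(kw[0])
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            old = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d][old] -= 1
            kw[old][w] -= 1
            k_tot[old] -= 1
            # Unnormalised conditional probability of each topic.
            weights = [
                (n_dk[d][k] + alpha) * (kw[k][w] + beta) / (k_tot[k] + V * beta)
                for k in range(K)
            ]
            new = rng.choices(range(K), weights=weights)[0]
            z[d][i] = new
            n_dk[d][new] += 1
            kw[new][w] += 1
            k_tot[new] += 1


def distributed_lda(docs, K, V, P=2, iters=20, sync_every=5,
                    alpha=0.1, beta=0.01, seed=0):
    """Simulate P partitions that sample locally and merge every sync_every sweeps."""
    rng = random.Random(seed)
    z, n_dk, n_kw = init_state(docs, K, V, rng)
    parts = [list(range(p, len(docs), P)) for p in range(P)]

    def fresh_copies():
        # Every partition starts from the same snapshot of the global counts.
        return [([row[:] for row in n_kw], [sum(row) for row in n_kw])
                for _ in parts]

    local = fresh_copies()
    for it in range(1, iters + 1):
        for p, ids in enumerate(parts):
            kw, k_tot = local[p]
            local_sweep(ids, docs, z, n_dk, kw, k_tot, alpha, beta, rng)
        if it % sync_every == 0 or it == iters:
            # Merge: add each partition's change since the last snapshot.
            n_kw = [
                [n_kw[k][w] + sum(local[p][0][k][w] - n_kw[k][w]
                                  for p in range(P))
                 for w in range(V)]
                for k in range(K)
            ]
            local = fresh_copies()
    return z, n_dk, n_kw


if __name__ == "__main__":
    # Toy corpus: each document is a list of word ids in [0, V).
    toy_docs = [[0, 1, 2, 0], [1, 1, 3], [4, 5, 4, 5], [5, 4, 3]]
    _, doc_topic, topic_word = distributed_lda(toy_docs, K=2, V=6)
    print(topic_word)
```

Between merges each partition samples against a stale view of the other partitions' counts, which is what makes the scheme approximate; a smaller `sync_every` trades extra communication for fresher counts.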
About the journal:
Computer Engineering and Design is supervised by China Aerospace Science and Industry Corporation and sponsored by the 706th Institute of the Second Academy of China Aerospace Science and Industry Corporation. It was founded in 1980. The purpose of the journal is to disseminate new technologies and promote academic exchanges. Since its inception, it has adhered to the principle of combining depth and breadth, theory and application, and focused on reporting cutting-edge and hot computer technologies. The journal accepts academic papers with innovative and independent academic insights, including papers on fund projects, award-winning research papers, outstanding papers at academic conferences, doctoral and master's theses, etc.