Title: A service-oriented framework for large-scale documents processing and application via 3D models and feature extraction
Authors: Qiang Chen, Yinong Chen, Cheng Zhan, Wu Chen, Zili Zhang, Sheng Wu
Journal: Simulation Modelling Practice and Theory, volume 133, Article 102903 (impact factor 3.5; JCR Q2, Computer Science, Interdisciplinary Applications)
Publication date: 2024-02-09
Publication type: Journal Article
DOI: 10.1016/j.simpat.2024.102903
URL: https://www.sciencedirect.com/science/article/pii/S1569190X24000170
Citations: 0
Abstract
Educational big data analysis is facilitated by the large amount of unstructured data held by educational institutions. Python offers many toolkits for processing both structured and unstructured data, but its capacity for processing large-scale data is limited. Spark, by contrast, is a big data processing framework, but it lacks the toolkits needed to process unstructured rich text documents, 3D models, and images. In this study, we develop a generic framework that integrates Python toolkits with Spark on the basis of a service-oriented architecture. The framework automatically extends a serial algorithm written in Python into a distributed algorithm, so that parallel processing tasks run seamlessly. First, we focus on non-intrusive deployment to Spark servers and on running Python code in the Spark environment to process rich text documents. Second, we propose a compression-based schema to address the poor performance of small files in HDFS. Finally, we design a generic model that can process different types of poly-structured data, such as 3D models and images. We publish the services used in the system over HTTPS so that they can be shared and reused to construct other systems. The framework is evaluated through simulation experiments on large-scale rich text documents, 3D models, and images. With eight physical nodes and 128 cores, extracting 232 GB of docx files achieves a speedup of 49 times over standalone Python-docx, rising to about 89 times once the compression schema is applied. Simulations of 3D model descriptor extraction achieve a speedup of about 116 times. In the large-scale HOG feature extraction task, covering up to 256.7 GB of images (6,861,024 images), a speedup of up to 110 times is achieved.
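As a rough illustration of two ideas in the abstract (lifting a serial per-document routine to a parallel map, and bundling small files into larger archives before processing), the following Python sketch uses only the standard library. It is not the authors' implementation: a local thread pool stands in for Spark executors, in-memory bytes stand in for HDFS, the zip-based bundling is only an assumed form of the compression schema, and the names `extract_text`, `pack`, `process_archive`, `make_docx`, and `run` are hypothetical.

```python
from __future__ import annotations

import io
import zipfile
from concurrent.futures import ThreadPoolExecutor
from xml.etree import ElementTree
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml inside a .docx package.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text(docx_bytes: bytes) -> str:
    """Serial per-document step: return the paragraph text of one .docx file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ElementTree.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def pack(files: dict[str, bytes]) -> bytes:
    """Bundle many small files into one archive (assumed form of the schema)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def process_archive(archive_bytes: bytes) -> dict[str, str]:
    """Unit of parallel work: unpack one archive and extract every member."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {name: extract_text(archive.read(name)) for name in archive.namelist()}


def make_docx(text: str) -> bytes:
    """Hypothetical demo helper: a minimal package holding only document.xml.

    A real .docx contains further parts (for example [Content_Types].xml).
    """
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body><w:p><w:r><w:t>'
        + escape(text)
        + "</w:t></w:r></w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)
    return buffer.getvalue()


def run(archives: list[bytes], workers: int = 4) -> dict[str, str]:
    """Apply the unchanged serial routine to every archive in parallel."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(process_archive, archives):
            results.update(part)
    return results


if __name__ == "__main__":
    demo = [
        pack({"a.docx": make_docx("hello"), "b.docx": make_docx("world")}),
        pack({"c.docx": make_docx("spark")}),
    ]
    print(run(demo))
```

The point of the design is that `extract_text` stays an ordinary serial function; only the driver decides how work is spread out. In a Spark deployment, the `pool.map` call would presumably be replaced by a map over a distributed collection of archives, with `process_archive` left unchanged.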
About the journal:
The journal Simulation Modelling Practice and Theory provides a forum for original, high-quality papers dealing with any aspect of systems simulation and modelling.
The journal aims at being a reference and a powerful tool to all those professionally active and/or interested in the methods and applications of simulation. Submitted papers will be peer reviewed and must significantly contribute to modelling and simulation in general or use modelling and simulation in application areas.
Paper submission is solicited on:
• theoretical aspects of modelling and simulation, including formal modelling, model-checking, random number generators, sensitivity analysis, variance reduction techniques, experimental design, meta-modelling, methods and algorithms for validation and verification, selection and comparison procedures, etc.;
• methodology and application of modelling and simulation in any area, including computer systems, networks, real-time and embedded systems, mobile and intelligent agents, manufacturing and transportation systems, management, engineering, biomedical engineering, economics, ecology and environment, education, transaction handling, etc.;
• simulation languages and environments, including those specific to distributed computing, grid computing, high performance computers or computer networks, etc.;
• distributed and real-time simulation, simulation interoperability;
• tools for high performance computing simulation, including dedicated architectures and parallel computing.