A service-oriented framework for large-scale documents processing and application via 3D models and feature extraction

Simulation Modelling Practice and Theory (IF 3.5; Zone 2, Computer Science; JCR Q2, Computer Science, Interdisciplinary Applications). Pub Date: 2024-02-09. DOI: 10.1016/j.simpat.2024.102903
Qiang Chen, Yinong Chen, Cheng Zhan, Wu Chen, Zili Zhang, Sheng Wu
Volume 133, Article 102903. Full text: https://www.sciencedirect.com/science/article/pii/S1569190X24000170

Abstract

The large amount of unstructured data found in educational institutions facilitates educational big data analysis. Python offers many toolkits for processing both structured and unstructured data, but its ability to process large-scale data is limited. Spark, by contrast, is a big data processing framework, but it lacks the toolkits needed to process unstructured rich text documents, 3D models, and images. In this study, we develop a generic framework that integrates Python toolkits with Spark on the basis of a service-oriented architecture. The framework automatically extends a serial algorithm written in Python into a distributed algorithm, so that parallel processing tasks are accomplished seamlessly. First, we focus on non-intrusive deployment to Spark servers and on running Python code in the Spark environment to process rich text documents. Second, we propose a compression-based scheme to address the poor performance of small files in HDFS. Finally, we design a generic model that can process different types of poly-structured data, such as 3D models and images. We publish the services used in the system at the HTTPS level so that they can be shared and used to construct different systems. The framework is evaluated through simulation experiments on large-scale rich text documents, 3D models, and images. In the simulation that extracts 232 GB of docx files on eight physical nodes with 128 cores, the framework achieves a speedup of 49 times over standalone Python-docx, which rises to about 89 times once the compression scheme is applied. Simulations of 3D model descriptor extraction show a speedup of about 116 times. In the large-scale image HOG feature extraction task, with up to 256.7 GB of data (6,861,024 images), a speedup of up to 110 times is achieved.
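The two central ideas in the abstract, packing many small files into larger archives and lifting an unchanged serial routine into a parallel map, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (extract_text, pack, process_archive, make_docx, run_demo) are hypothetical, the serial routine parses docx XML with the Python standard library instead of python-docx so that the example stays self-contained, and a local thread pool stands in for Spark executors.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# WordprocessingML namespace used by the main part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text(docx_bytes: bytes) -> str:
    """Serial routine: return the paragraph text of one .docx file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as docx:
        root = ET.fromstring(docx.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def pack(files: Dict[str, bytes]) -> bytes:
    """Bundle many small files into one compressed archive (hypothetical scheme)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buf.getvalue()


def process_archive(archive_bytes: bytes) -> Dict[str, str]:
    """Apply the unchanged serial routine to every member of one archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {name: extract_text(archive.read(name))
                for name in archive.namelist()}


def make_docx(text: str) -> bytes:
    """Build a tiny in-memory .docx-like file, used only for the demo."""
    xml = ('<w:document xmlns:w="http://schemas.openxmlformats.org/'
           'wordprocessingml/2006/main"><w:body><w:p><w:r><w:t>'
           + text + '</w:t></w:r></w:p></w:body></w:document>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", xml)
    return buf.getvalue()


def run_demo() -> Dict[str, str]:
    # Two archives of three documents each, instead of six separate small files.
    archives: List[bytes] = [
        pack({f"doc{i}_{j}.docx": make_docx(f"text {i}-{j}") for j in range(3)})
        for i in range(2)
    ]
    results: Dict[str, str] = {}
    # The thread pool is a local stand-in for a cluster-wide map over archives.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for part in pool.map(process_archive, archives):
            results.update(part)
    return results


if __name__ == "__main__":
    print(run_demo())
```

On a Spark cluster, the same process_archive function could plausibly be mapped over archives read as binary records, for example with PySpark's SparkContext.binaryFiles, although the abstract does not say which mechanism the framework uses. Grouping files into fewer, larger archives reduces the number of objects the file system and scheduler must track, which is the usual motivation for small-file mitigation in HDFS.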

Source journal
Simulation Modelling Practice and Theory (Engineering and Technology: Computer Science, Interdisciplinary Applications)
CiteScore: 9.80
Self-citation rate: 4.80%
Articles published: 142
Review time: 21 days
Journal description: The journal Simulation Modelling Practice and Theory provides a forum for original, high-quality papers dealing with any aspect of systems simulation and modelling. The journal aims at being a reference and a powerful tool to all those professionally active and/or interested in the methods and applications of simulation. Submitted papers will be peer reviewed and must significantly contribute to modelling and simulation in general or use modelling and simulation in application areas. Paper submission is solicited on:
• theoretical aspects of modelling and simulation including formal modelling, model-checking, random number generators, sensitivity analysis, variance reduction techniques, experimental design, meta-modelling, methods and algorithms for validation and verification, selection and comparison procedures etc.;
• methodology and application of modelling and simulation in any area, including computer systems, networks, real-time and embedded systems, mobile and intelligent agents, manufacturing and transportation systems, management, engineering, biomedical engineering, economics, ecology and environment, education, transaction handling, etc.;
• simulation languages and environments including those specific to distributed computing, grid computing, high performance computers or computer networks, etc.;
• distributed and real-time simulation, simulation interoperability;
• tools for high performance computing simulation, including dedicated architectures and parallel computing.