Ding Zhang, Ze Yi Dai, Xue Ping Sun, Xue Ting Wu, Hui Li, Lin Tang, Jian Hua He
A distributed data processing scheme based on Hadoop for synchrotron radiation experiments.
Journal of Synchrotron Radiation (impact factor 2.4, JCR Q2, Instruments & Instrumentation). Published 2024-04-24. DOI: 10.1107/S1600577524002637 (https://doi.org/10.1107/S1600577524002637)
Citations: 0
Abstract
With the development of synchrotron radiation sources and high-frame-rate detectors, the amount of experimental data collected at synchrotron radiation beamlines has increased exponentially. As a result, data processing for synchrotron radiation experiments has entered the era of big data. Beamlines increasingly need the capability to process large-scale data in parallel to keep up with this growth. Currently, there is no established set of data processing solutions for beamlines built on a big data technology framework. Apache Hadoop is a widely used distributed system architecture for massive data storage and computation. This paper presents a set of distributed data processing schemes for beamline experimental data using Hadoop. The Hadoop Distributed File System is used as the distributed file storage system, and Hadoop YARN serves as the resource scheduler for the distributed computing cluster. A distributed data processing pipeline that can carry out massively parallel computation is designed and developed using Apache Spark on Hadoop. The entire data processing platform adopts a distributed microservice architecture, which makes the system easy to expand, reduces module coupling and improves reliability.
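The abstract does not show the pipeline's code, so the following is only a minimal sketch of the partitioned map-then-reduce pattern that such a pipeline would distribute across a cluster. It is not the authors' implementation and it does not use Spark, HDFS or YARN: a standard-library thread pool stands in for the cluster's workers, and the synthetic frames, the `Frame` type and every function name here are hypothetical, chosen purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Tuple

# Hypothetical stand-in for one detector image: a 2D grid of pixel counts.
Frame = List[List[int]]


def make_frames(n_frames: int, size: int = 4) -> List[Frame]:
    """Build synthetic frames; every pixel of frame i has the value i."""
    return [[[i] * size for _ in range(size)] for i in range(n_frames)]


def partition(frames: List[Frame], n_parts: int) -> List[List[Frame]]:
    """Split the frame list into n_parts chunks (round-robin)."""
    return [frames[k::n_parts] for k in range(n_parts)]


def process_partition(frames: List[Frame]) -> Tuple[int, int]:
    """Map step: return (pixel count, summed intensity) for one chunk."""
    count = sum(len(row) for frame in frames for row in frame)
    total = sum(sum(row) for frame in frames for row in frame)
    return count, total


def combine(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Reduce step: merge two partial (count, total) results."""
    return a[0] + b[0], a[1] + b[1]


def mean_intensity(frames: List[Frame], n_parts: int = 4) -> float:
    """Compute the mean pixel intensity over all frames, chunk by chunk."""
    parts = partition(frames, n_parts)
    # Each chunk is processed independently; in a real cluster this is the
    # work a scheduler would hand out to separate executors.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(process_partition, parts))
    count, total = reduce(combine, partials, (0, 0))
    return total / count if count else 0.0


if __name__ == "__main__":
    print(mean_intensity(make_frames(8)))
```

Because each chunk yields only a small `(count, total)` pair, the reduce step stays cheap however large the frames are, which is the property that lets this kind of computation scale out across many workers.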
About the journal:
Synchrotron radiation research is rapidly expanding with many new sources of radiation being created globally. Synchrotron radiation plays a leading role in pure science and in emerging technologies. The Journal of Synchrotron Radiation provides comprehensive coverage of the entire field of synchrotron radiation and free-electron laser research including instrumentation, theory, computing and scientific applications in areas such as biology, nanoscience and materials science. Rapid publication ensures an up-to-date information resource for scientists and engineers in the field.