Scalable, reproducible, and cost-effective processing of large-scale medical imaging datasets
Michael E. Kim, Karthik Ramadass, Chenyu Gao, Praitayini Kanakaraj, Nancy R. Newlin, Gaurav Rudravaram, Kurt G. Schilling, Blake E. Dewey, Derek Archer, Timothy J. Hohman, Zhiyuan Li, Shunxing Bao, Bennett A. Landman, Nazirah Mohd Khairi
arXiv - CS - Distributed, Parallel, and Cluster Computing, arXiv:2408.14611, published 2024-08-26
Abstract
Curating, processing, and combining large-scale medical imaging datasets from
national studies is a non-trivial task due to the intense computation and data
throughput required, variability of acquired data, and associated financial
overhead. Existing platforms and tools for large-scale data curation,
processing, and storage struggle to balance cost against computation speed
at the scale research requires: they are either too slow or too
expensive. Additionally, keeping the processing of large datasets consistent
across a team is itself difficult. We design a BIDS-compliant method
for an efficient and robust data processing pipeline of large-scale
diffusion-weighted and T1-weighted MRI data compatible with low-cost,
high-efficiency computing systems. Our method automatically queries which
data are available for processing and runs processes in a consistent,
reproducible manner with long-term stability, while using heterogeneous
low-cost computational resources and storage systems for efficient processing
and data transfer. We demonstrate how our organizational structure permits
efficiency in a semi-automated data processing pipeline and show that our method
is comparable in processing time to cloud-based computation while being almost
20 times more cost-effective. Our design provides high data throughput
and low latency, reducing the time for data transfer between storage servers
and computation servers: it averages 0.60 Gb/s, compared to 0.33
Gb/s for cloud-based processing methods. The design of our workflow
engine permits rapid process execution while remaining flexible enough to
adapt to newly acquired data.
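
The automated querying of data available for processing can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `find_unprocessed`, the `derivatives/<pipeline>/` output location, and the rule that a non-empty output directory means "already processed" are all assumptions made for illustration. The sketch scans a BIDS-style tree for sessions that contain raw scans of a given modality but no pipeline output.

```python
from pathlib import Path


def find_unprocessed(bids_root, pipeline, modality="dwi"):
    """Return (subject, session) pairs that have raw data but no pipeline output.

    Assumed layout (hypothetical, for illustration only):
      raw:     <root>/sub-*/ses-*/<modality>/*_<modality>.nii.gz
      outputs: <root>/derivatives/<pipeline>/sub-*/ses-*/
    """
    root = Path(bids_root)
    pending = []
    pattern = f"sub-*/ses-*/{modality}/*_{modality}.nii.gz"
    for scan in sorted(root.glob(pattern)):
        session_dir = scan.parent.parent
        subject, session = session_dir.parent.name, session_dir.name
        out_dir = root / "derivatives" / pipeline / subject / session
        # Assumption: a session counts as processed only if its output
        # directory exists and contains at least one entry.
        done = out_dir.is_dir() and any(out_dir.iterdir())
        # Several scans may belong to one session; report each session once.
        if not done and (subject, session) not in pending:
            pending.append((subject, session))
    return pending
```

Calling `find_unprocessed("/path/to/dataset", "my_pipeline")` (both arguments hypothetical) yields the sessions still waiting for that pipeline. Because the query reads only the directory structure, it can be rerun whenever new data arrive, which is one way a consistent organizational scheme makes semi-automated processing possible.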