
Latest Publications: IEEE Transactions on Parallel and Distributed Systems

Richie: A Framework for Agile Design and Exploration of RISC-V-Based Accelerator-Rich Heterogeneous SoCs
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-23 | DOI: 10.1109/TPDS.2025.3624958
Gianluca Bellocchi;Alessandro Capotondi;Luca Benini;Andrea Marongiu
Modern Heterogeneous Systems-on-Chip (HeSoCs) rely on the accelerator-rich paradigm to achieve performance and energy efficiency through the on-chip integration of many application-specific functional units. However, the lack of a standard System-Level Design (SLD) methodology and the heterogeneity of the HW/SW components complicate the costly and time-consuming process of deploying accelerator-rich systems and applications. In this work, we present Richie, an open-source research SLD framework featuring a modular and composable RISC-V-based accelerator-rich platform and a support toolchain to automate the assembly and specialization of accelerator-rich HeSoCs. Richie exploits Field Programmable Gate Arrays (FPGAs) to deploy full-stack applications and explore the HeSoC design space. We show how Richie facilitates the investigation of platform non-idealities as the system scales up in accelerator count, identifying key design solutions and exploring platform costs, such as area usage. This yields trade-off improvements comparable to manually optimized designs. Finally, we assess the methodology's ease of use and extensibility by deploying a real-world workload and adding support for a Network-on-Chip (NoC) architecture.
Citations: 0
CGA: Accelerating BFS Through a Sparsity-Aware Adaptive Framework on Heterogeneous Platforms
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-22 | DOI: 10.1109/TPDS.2025.3624289
Lei Xu;Haipeng Jia;Yunquan Zhang
Direction optimization determines whether to use Sparse Matrix-Sparse Vector Multiplication (SpMSpV) or Sparse Matrix-Dense Vector Multiplication (SpMV) based on the input vector’s sparsity at each iteration of Breadth-First Search (BFS), aiming to achieve the fastest graph traversal. Although prior work on direction optimization has achieved state-of-the-art performance on either CPUs or GPUs, it has not fully leveraged the capabilities of modern heterogeneous platforms. This is because SpMSpV/SpMV execution times on GPUs do not consistently outperform those on CPUs, particularly for SpMSpV. In response, this paper introduces CGA, a machine learning-based adaptive framework for BFS that optimally selects between CPU and GPU kernels, effectively Adapting to diverse real-world graphs, vectors, and computing platforms. Our contributions include a novel set of bucket-based SpMSpV algorithms that significantly enhance kernel performance in high-sparsity scenarios, along with a low-overhead decision tree model and reduced CPU-GPU data transfers. Experimental results show that our framework outperforms previous state-of-the-art methods, achieving up to a 4.91x speedup over the CPU-only baseline and a 3.27x speedup over the GPU-only baseline.
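To make the direction-optimization decision concrete, the sketch below switches per BFS level between a sparse (SpMSpV-style) and a dense (SpMV-style) frontier expansion based on frontier size. The fixed 5% threshold and the function name `bfs_direction_optimized` are illustrative assumptions; CGA itself makes this choice per iteration with a learned decision-tree model over CPU and GPU kernels.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bfs_direction_optimized(adj: csr_matrix, source: int, threshold: float = 0.05):
    """BFS that switches per level between a sparse (SpMSpV-style) and a
    dense (SpMV-style) frontier expansion. The fixed threshold is a
    placeholder for CGA's learned decision-tree selector."""
    n = adj.shape[0]
    level = np.full(n, -1, dtype=np.int64)
    level[source] = 0
    frontier = np.array([source])
    depth = 0
    while frontier.size > 0:
        depth += 1
        if frontier.size < threshold * n:
            # Sparse path: touch only the frontier's adjacency rows.
            candidates = np.unique(adj[frontier].indices)
        else:
            # Dense path: one full matrix-vector product over all rows.
            x = np.zeros(n)
            x[frontier] = 1.0
            candidates = np.nonzero(adj @ x)[0]
        frontier = candidates[level[candidates] == -1]  # keep unvisited only
        level[frontier] = depth
    return level
```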
Citations: 0
PHIDE: A Parallel Hybrid Direct–Iterative Eigensolver for Hermitian Eigenvalue Problems
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-20 | DOI: 10.1109/TPDS.2025.3623188
Shengguo Li;Xinzhe Wu;Jose E. Roman;Ziyang Yuan;Ruibo Wang;Tiejun Li;Yi Xie;Bo Yang;Xuguang Chen
In this paper, we propose a Parallel Hybrid Direct–Iterative Eigensolver for Hermitian Eigenvalue Problems without tridiagonalization, denoted by PHIDE, which combines direct and iterative methods. PHIDE first reduces a Hermitian matrix to banded form, then applies a spectrum slicing algorithm to the banded matrix, and finally computes the eigenvectors of the original matrix via backtransformation. Compared with conventional direct eigensolvers, PHIDE avoids tridiagonalization, which involves many memory-bound operations. In PHIDE, the banded eigenvalue problem is solved using the contour integral method implemented in FEAST, which may yield slightly lower accuracy than tridiagonalization-based approaches. For sequences of correlated Hermitian eigenvalue problems arising in density functional theory (DFT), PHIDE achieves an average speedup of $1.22\times$ over the state-of-the-art direct solver in ELPA when using 1024 processes. Numerical experiments are conducted on dense Hermitian matrices from real applications as well as large sparse matrices from the SuiteSparse and ELSES collections.
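The spectrum-slicing idea can be illustrated with off-the-shelf tools: partition the spectrum into intervals and solve each slice independently, so slices can run in parallel. The sketch below uses SciPy's shift-invert `eigsh` per slice as a stand-in; PHIDE instead applies FEAST's contour-integral method to the band-reduced matrix and distributes slices across process groups.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectrum_slicing(A, intervals, k_per_slice=32):
    """Toy spectrum slicing: compute eigenpairs near the center of each
    interval with shift-invert, then keep only those inside the slice.
    Each slice is independent, which is what makes the approach parallel."""
    results = []
    for lo, hi in intervals:
        sigma = 0.5 * (lo + hi)             # shift at the slice center
        vals, vecs = eigsh(A, k=k_per_slice, sigma=sigma)
        keep = (vals >= lo) & (vals < hi)   # discard spill-over into neighbors
        results.append((vals[keep], vecs[:, keep]))
    return results
```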
Citations: 0
Accelerating Point Cloud Sampling by Parallel Structure Deconstruction
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-17 | DOI: 10.1109/TPDS.2025.3622691
Hengzhe Chi;Jihe Wang;Jinzhe Zhang;Danghui Wang
Point cloud processing is fundamental to applications such as autonomous driving, robotic navigation, and 3D reconstruction. Sampling is a crucial step in point cloud processing, but current sampling algorithms perform poorly when deployed on edge devices due to limited computational resources. This stems from three main causes. First, sampling robustness is weak: even after denoising, residual noise remains, and existing edge sampling methods struggle to ignore noisy sampling nodes. Second, structural sampling is lacking: popular algorithms such as FPS tend to cover global information without considering special structures that deserve focused sampling. Third, system efficiency is poor: traditional Farthest Point Sampling (FPS) and k-Nearest Neighbor Search (kNN) have $O(n^{2})$ time and space complexity with low parallelism, limiting system throughput, especially for large-scale point cloud processing. To bridge this gap, we propose DPS (Deconstruction-based Point Cloud Sampling), a novel point cloud sampling algorithm based on parallel graph deconstruction, together with its multi-scale variant DPS_MS, which offers a better denoising effect. First, during graph construction, we employ multi-scale graphing and add a preprocessing stage that filters out noise nodes, reducing the noise rate. Second, in the sampling stage, we categorize important structures into edge nodes and dense nodes based on the reverse-neighbor counts of graph nodes, assigning them higher sampling weights to supplement structural information. Finally, we use a highly parallel locality-sensitive hashing algorithm to accelerate neighbor search and reduce memory consumption, achieving data-level parallelism through comprehensive parallelization of the algorithm. Through rigorous qualitative and quantitative validation on classification and segmentation tasks, we demonstrate that DPS, while maintaining accuracy, increases point cloud sampling speed by 22.08x over the FPS algorithm; the implementation achieves a 75.1-fold improvement in system throughput and a maximum parallel speedup of 10.32x, and improves the Accuracy/FLOPs (M) ratio to 8.68.
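The reverse-neighbor weighting step can be sketched in a few lines: points whose reverse-kNN count deviates strongly from the mean (edge-like or dense-like structures) receive higher sampling probability. This serial sketch uses scikit-learn's exact kNN, and the names `structure_weighted_sample` and `boost` are invented for illustration; DPS itself builds multi-scale graphs and accelerates neighbor search with parallel locality-sensitive hashing.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def structure_weighted_sample(points, n_samples, k=16, boost=3.0):
    """Weight points by how atypical their reverse-kNN count is: nodes that
    appear in unusually few neighbor lists (edge-like) or unusually many
    (dense-like) get boosted sampling probability."""
    knn = NearestNeighbors(n_neighbors=k).fit(points)
    _, idx = knn.kneighbors(points)                        # idx[i]: i's k nearest neighbors
    rev = np.bincount(idx.ravel(), minlength=len(points))  # reverse-neighbor counts
    dev = np.abs(rev - rev.mean())                         # distance from the typical count
    w = 1.0 + boost * dev / (dev.max() + 1e-9)             # boost structurally salient nodes
    w /= w.sum()
    return np.random.choice(len(points), size=n_samples, replace=False, p=w)
```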
Citations: 0
Guest Editorial: New Tools and Techniques for the Distributed Computing Continuum
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-16 | DOI: 10.1109/TPDS.2025.3612151
Jesus Carretero;Javier García-Blas;Sameer Shende
{"title":"Guest Editorial: New Tools and Techniques for the Distributed Computing Continuum","authors":"Jesus Carretero;Javier García-Blas;Sameer Shende","doi":"10.1109/TPDS.2025.3612151","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3612151","url":null,"abstract":"","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2451-2454"},"PeriodicalIF":6.0,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11205817","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Large-Scale Neural Network Quantum States Calculation for Quantum Chemistry on a New Sunway Supercomputer
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-15 | DOI: 10.1109/TPDS.2025.3620251
Yangjun Wu;Wenhao Zhou;Li Shen;Hong Qian;Honghui Shang
Quantum many-body systems can be solved with neural-network methods. Nonetheless, the practical deployment of neural network quantum states (NNQS) in large-scale electronic structure analyses faces challenges, chiefly the high sampling cost and the complexity of local energy computations. To overcome these computational barriers, we present an innovative data-parallel NNQS-Transformer implementation. It introduces a hybrid multi-layer workload balancing strategy that effectively addresses previous load imbalance issues while leveraging Julia’s portability to achieve targeted performance optimizations. Through extensive testing, we validate our approach using comprehensive quantum chemistry calculations on systems containing up to 120 spin orbitals, where previous methods were limited to much smaller scales. The implementation demonstrates exceptional scalability on the Sunway platform, achieving 92% strong-scaling and 98% weak-scaling efficiency when utilizing up to 37 million processor cores. These significant performance improvements mark a crucial step toward making NNQS calculations practical for real-world quantum chemistry applications.
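As a rough illustration of workload balancing for sampling, the sketch below applies greedy longest-processing-time scheduling: each remaining sample, weighted by its local-energy cost, goes to the currently least-loaded worker. This is an assumed simplification for exposition, not the paper's hybrid multi-layer strategy, which also balances within Sunway core groups.

```python
import heapq

def balance_local_energy_tasks(costs, n_workers):
    """Greedy longest-processing-time scheduling: repeatedly hand the most
    expensive remaining sample to the least-loaded worker, keeping worker
    loads in a min-heap."""
    heap = [(0.0, w) for w in range(n_workers)]   # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for cost in sorted(costs, reverse=True):      # largest tasks first
        load, w = heapq.heappop(heap)
        assignment[w].append(cost)
        heapq.heappush(heap, (load + cost, w))
    return assignment
```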
Citations: 0
Autonomous Model Aggregation for Decentralized Learning on Edge Devices
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-14 | DOI: 10.1109/TPDS.2025.3621058
Jinru Chen;Jingke Tu;Lei Yang;Jiannong Cao
Edge AI applications enable edge devices to collaboratively learn a model via repeated model aggregations, aiming to utilize the distributed data on the devices to achieve high model accuracy. Existing methods either leverage a centralized server to directly aggregate the model updates from edge devices or need a central coordinator to group the edge devices for localized model aggregations. The centralized server (or coordinator) is a performance bottleneck and incurs a high cost of collecting the global state needed to make grouping decisions in large-scale networks. In this paper, we propose an Autonomous Model Aggregation (AMA) method for large-scale decentralized learning on edge devices. Instead of needing a central coordinator to group the edge devices, AMA allows the edge devices to autonomously form groups using a highly efficient protocol, according to model functional similarity and historical grouping information. Moreover, AMA adopts a reinforcement learning approach to optimize the size of each group. Evaluation results on our self-developed edge computing testbed demonstrate that AMA outperforms the benchmark approaches by up to 20.71% in accuracy and reduces convergence time by 75.58%.
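A minimal sketch of similarity-based group formation, assuming cosine similarity between flattened model updates as the functional-similarity measure and a greedy first-fit rule; both assumptions are ours. AMA performs this grouping with a decentralized protocol that also uses historical grouping information, and tunes group sizes via reinforcement learning.

```python
import numpy as np

def autonomous_groups(updates, threshold=0.9):
    """Greedy grouping by update similarity: a device joins the first group
    whose representative update it is cosine-similar to, otherwise it starts
    a new group. Centralized here purely for clarity."""
    reps, groups = [], []
    for dev, u in enumerate(updates):
        u = u / (np.linalg.norm(u) + 1e-12)      # normalize so dot = cosine
        for g, r in enumerate(reps):
            if float(u @ r) >= threshold:        # similar enough: join group g
                groups[g].append(dev)
                break
        else:
            reps.append(u)                       # no match: found a new group
            groups.append([dev])
    return groups
```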
Citations: 0
FEditor: Consecutive Task Placement With Adjustable Shapes Using FPGA State Frames
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-13 | DOI: 10.1109/TPDS.2025.3620384
Yanyan Li;Yu Chen;Zhiqian Xu;Yawen Wang;Hai Jiang;Keqin Li
Field Programmable Gate Arrays (FPGAs) are widely adopted in datacenters, where each FPGA is exclusively assigned to a task. This strategy results in significant resource waste and increased task rejections. To address this issue, placement algorithms adjust the locations and shapes of tasks based on Dynamic Partial Reconfiguration, which partitions an FPGA into multiple rectangular areas for sharing. However, existing schemes are designed for static task sets without adjustable shapes, incapable of optimizing the placement problem in datacenters. In this paper, FEditor is proposed as the first consecutive task placement scheme with adjustable shapes. It expands the planar FPGA models into three-dimensional ones with timestamps to accommodate consecutive tasks. To reduce the complexity of three-dimensional resource management, State Frames (SFs) are designed to compress the models losslessly. Three metrics and a nested heuristic algorithm are used for task placement. Experimental results demonstrate that FEditor has improved resource utilization by at least 19.8% and acceptance rate by at least 10% compared to the referenced algorithms. SFs and the nested algorithm accelerate the task placement by up to $10.26\times$. The suitability of FEditor in datacenter environments is verified by its time-efficiency trends.
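The time-extended placement view can be sketched as a first-fit search over a (time, x, y) occupancy volume: a task with a w x h footprint that occupies the fabric for `dur` timesteps needs a box that is free in all three dimensions. This brute-force sketch is an assumption for illustration only; FEditor compresses the volume losslessly with State Frames, ranks candidates with three metrics, and can also reshape tasks.

```python
import numpy as np

def first_fit_3d(occ, w, h, dur):
    """First-fit search for a w x h region free for `dur` consecutive
    timesteps in a (T, W, H) boolean occupancy volume.
    Returns (t, x, y) of the placement, or None if nothing fits."""
    T, W, H = occ.shape
    for t in range(T - dur + 1):
        for x in range(W - w + 1):
            for y in range(H - h + 1):
                if not occ[t:t+dur, x:x+w, y:y+h].any():
                    occ[t:t+dur, x:x+w, y:y+h] = True   # commit the placement
                    return t, x, y
    return None
```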
Citations: 0
Chorus: Robust Multitasking Local Client-Server Collaborative Inference With Wi-Fi 6 for AIoT Against Stochastic Congestion Delay
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-09 | DOI: 10.1109/TPDS.2025.3619775
Yuzhe Luo;Ji Qi;Ling Li;Ruizhi Chen;Xiaoyu Wu;Limin Cheng;Chen Zhao
The rapid growth of AIoT devices brings huge demand for DNNs deployed on resource-constrained devices. However, the intensive computation and high memory footprint of DNN inference make it difficult for AIoT devices to execute inference tasks efficiently. In many widely deployed AIoT use cases, multiple local AIoT devices launch DNN inference tasks randomly. Although local collaborative inference has been proposed to accelerate DNN inference on local devices with limited resources, multitasking local collaborative inference, which is common in AIoT scenarios, has not been fully studied in previous works. We consider multitasking local client-server collaborative inference (MLCCI), which achieves efficient DNN inference by offloading inference tasks from multiple AIoT devices to a more powerful local server with parallel pipelined execution streams over Wi-Fi 6. Our optimization goal is to minimize the mean end-to-end latency of MLCCI. Based on the experiment results, we identify three key challenges: high communication costs, high model initialization latency, and congestion delay caused by task interference. We analyze congestion delay in MLCCI and its stochastic fluctuations with queuing theory and propose Chorus, a high-performance adaptive MLCCI framework for AIoT devices, to minimize the mean end-to-end latency of MLCCI against stochastic congestion delay. Chorus generates communication-efficient model partitions with heuristic search, uses a prefetch-enabled two-level LRU cache to accelerate model initialization on the server, reduces congestion delay and its short-term fluctuations with execution stream allocation based on the cross-entropy method, and finally achieves efficient computation offloading with reinforcement learning. To evaluate Chorus on real devices, we built a system prototype that statistically simulates many virtual clients using a limited number of physical client devices. Evaluation results across various workload levels show that Chorus achieved average speedups of $1.4\times$ over client-only inference, and $1.3\times$ and $2\times$ over server-only inference with LRU and MLSH, respectively.
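As a first-order illustration of the queuing view, the sketch below models each execution stream as an independent M/M/1 server, whose mean sojourn time is $T = 1/(\mu - \lambda)$. This toy model and the function names are our assumptions for exposition; the paper's analysis also characterizes stochastic fluctuations and drives a cross-entropy-based stream allocation.

```python
from collections import defaultdict

def mm1_latency(arrival_rate, service_rate):
    """Mean sojourn time of an M/M/1 queue: T = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must stay below service rate")
    return 1.0 / (service_rate - arrival_rate)

def mean_congestion_latency(task_rates, assignment, service_rate):
    """Average latency when tasks share execution streams, with each stream
    treated as an independent M/M/1 server; `assignment[i]` names the stream
    that task i is mapped onto."""
    load = defaultdict(float)
    for rate, stream in zip(task_rates, assignment):
        load[stream] += rate                      # aggregate arrivals per stream
    return sum(mm1_latency(l, service_rate) for l in load.values()) / len(load)
```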
Citations: 0
Cache Partition Management for Improving Fairness and I/O Responsiveness in NVMe SSDs
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-09 | DOI: 10.1109/TPDS.2025.3619866
Jiaojiao Wu;Fan Yang;Zhibing Sha;Li Cai;Zhigang Cai;Balazs Gerofi;Yuanquan Shi;Jianwei Liao
NVMe SSDs have become mainstream storage devices thanks to their compact size and ultra-low latency. It has been observed that the impact of interference among concurrently running streams (i.e., I/O workloads) on their overall responsiveness differs significantly, leading to unfairness. The intensity and access locality of streams are the primary factors contributing to interference. A small data cache is commonly placed in the front end of SSDs to improve I/O performance and extend the device’s lifetime. The degree of parallelism at this level, however, is limited compared to that of the SSD back end, which consists of multiple channels, chips, and planes. Therefore, the impact of interference can be more significant at the data cache level. In this paper, we propose a cache division management scheme that not only contributes to fairness but also boosts I/O responsiveness across all workloads in NVMe SSDs. Specifically, our proposal supports long-term data cache partitioning and short-term cache adjustment with global sharing, ensuring better fairness and further enhancing cache utilization efficiency in multi-stream scenarios. Trace-driven simulation experiments show that our proposal improves fairness by an average of 66.0% and reduces overall I/O response time by between 3.8% and 18.0%, compared to existing cache management schemes for NVMe SSDs.
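The partition-plus-sharing idea can be sketched as per-stream LRU quotas backed by a small globally shared pool: a stream that overflows its quota spills into the shared region instead of evicting immediately. The quota and pool sizes here are hypothetical inputs, and the class name is invented; the proposed scheme instead sets long-term partitions and short-term adjustments adaptively from stream intensity and locality.

```python
from collections import OrderedDict

class PartitionedCache:
    """Per-stream LRU partitions plus a small globally shared LRU pool,
    sketching the long-term-partition / short-term-sharing split."""
    def __init__(self, quotas, shared_size):
        self.parts = {s: OrderedDict() for s in quotas}  # one LRU per stream
        self.quotas = quotas                             # {stream: max pages}
        self.shared = OrderedDict()                      # globally shared pool
        self.shared_size = shared_size

    def access(self, stream, page):
        """Return True on a cache hit, False on a miss (which also fills)."""
        part = self.parts[stream]
        if page in part:                       # hit in the private partition
            part.move_to_end(page)
            return True
        if (stream, page) in self.shared:      # hit in the shared pool
            self.shared.move_to_end((stream, page))
            return True
        part[page] = True                      # miss: fill the partition
        if len(part) > self.quotas[stream]:    # overflow spills into shared pool
            victim, _ = part.popitem(last=False)
            self.shared[(stream, victim)] = True
            if len(self.shared) > self.shared_size:
                self.shared.popitem(last=False)
        return False
```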
Citations: 0