
Journal of Parallel and Distributed Computing: Latest Articles

DH_Aligner: A fast short-read aligner on multicore platforms with AVX vectorization
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-04 | DOI: 10.1016/j.jpdc.2025.105142
Qiao Sun, Feng Chen, Leisheng Li, Huiyuan Li
The rapid development of NGS (Next-Generation Sequencing) technology produces massive genome data at a much higher throughput than before, creating great demand for fast and accurate downstream genetic analysis. As one of the first steps of the bioinformatics workflow, read alignment makes an educated guess about where and how a read maps to a given reference sequence. In this paper, we propose DH_Aligner, a fast and accurate short-read aligner designed and optimized for x86 multi-core platforms with avx2/avx512 SIMD instruction sets. It is based on a three-phase alignment workflow (seeding-filtering-extension) and provides an end-to-end solution for read alignment from Fastq to SAM files. Thanks to a fast seeding scheme and a seed filtering procedure, DH_Aligner avoids both a time-consuming seeding phase and the redundant work of aligning reads at seemingly wrong locations. With the introduction of a batched-processing methodology, parallelism is easily exploited at the data, instruction and thread level. The performance-critical kernels in DH_Aligner are implemented with both avx2 and avx512 intrinsics for better performance and portability. On two typical x86-based platforms, Intel Xeon-6154 and Hygon C86-7285, DH_Aligner produces near-best accuracy/sensitivity while outperforming state-of-the-art parallel implementations, with average speedups of 7.8x, 3.4x, 2.8x-6.7x and 1.5x over bwa-mem, bwa-mem2, bowtie2 and minimap2 respectively.
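As a concrete illustration of the SIMD parallelism the abstract describes, the sketch below counts base mismatches between a 32-bp read chunk and a candidate reference window with AVX2, one plausible building block of a seed-filtering step. It is not DH_Aligner's actual kernel; the function names and the mismatch-threshold filter are hypothetical. Compile with -mavx2.

```cpp
// Hedged sketch of an AVX2 seed-filtering kernel; not the paper's code.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Count mismatching bases in a 32-byte window (one AVX2 register).
static inline int mismatches32(const uint8_t* read, const uint8_t* ref) {
    __m256i r  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(read));
    __m256i g  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ref));
    __m256i eq = _mm256_cmpeq_epi8(r, g);             // 0xFF where bases match
    uint32_t mask = static_cast<uint32_t>(_mm256_movemask_epi8(eq));
    return 32 - __builtin_popcount(mask);             // mismatch count
}

// A candidate seed location passes the filter only if the mismatch count is
// low, so the expensive extension phase runs only at plausible positions.
bool passes_filter(const uint8_t* read, const uint8_t* ref,
                   std::size_t len, int max_mismatches) {
    int mm = 0;
    for (std::size_t i = 0; i + 32 <= len; i += 32)
        mm += mismatches32(read + i, ref + i);
    return mm <= max_mismatches;
}
```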
Citations: 0
Integration framework for online thread throttling with thread and page mapping on NUMA systems
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-04 | DOI: 10.1016/j.jpdc.2025.105145
Janaina Schwarzrock, Hiago Mayk G. de A. Rocha, Arthur F. Lorenzon, Samuel Xavier de Souza, Antonio Carlos S. Beck
Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread-to-core allocation and page placement are crucial for enhancing performance and minimizing energy usage. Moreover, considering that NUMA systems have hardware support for a large number of hardware threads and many parallel applications have limited scalability, artificially decreasing the number of threads by using Dynamic Concurrency Throttling (DCT) may bring further improvements. However, the optimal configuration (thread mapping, page mapping, number of threads) for energy and performance, quantified by the Energy-Delay Product (EDP), varies with the system hardware, application and input set, even during execution. Because of this dynamic nature, adaptability is essential, making offline strategies much less effective. Despite their effectiveness, online strategies introduce additional execution overhead, stemming from run-time learning and from the cost of transitioning between configurations (cache warm-ups, thread and data reallocation). Thus, balancing learning time and solution quality becomes increasingly significant. In this scenario, this work proposes a framework that finds such optimal configurations with a single, online, and efficient approach. Our experimental evaluation shows that our framework improves EDP and performance compared to online state-of-the-art techniques for thread/page mapping (up to 69.3% and 43.4%) and DCT (up to 93.2% and 74.9%), while being fully adaptive and requiring minimal user intervention.
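As a rough illustration of the optimization target, the sketch below ranks sampled configurations by their Energy-Delay Product and returns the best one. The Config fields and the sampling mechanism are hypothetical placeholders, not the paper's actual framework.

```cpp
// Hedged sketch: choosing the configuration with the lowest EDP.
#include <vector>
#include <string>
#include <limits>
#include <cstddef>

struct Config {
    int threads;            // concurrency level chosen by DCT
    std::string thread_map; // e.g. "compact" or "scatter"
    std::string page_map;   // e.g. "first-touch" or "interleave"
    double time_s;          // measured execution time of a sampled region
    double energy_j;        // measured energy of the same region
};

// EDP = energy * delay; lower is better for the joint energy/performance goal.
static double edp(const Config& c) { return c.energy_j * c.time_s; }

std::size_t best_configuration(const std::vector<Config>& sampled) {
    std::size_t best = 0;
    double best_edp = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < sampled.size(); ++i) {
        if (edp(sampled[i]) < best_edp) { best_edp = edp(sampled[i]); best = i; }
    }
    return best; // index of the configuration the runtime would switch to
}
```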
Citations: 0
Complexity analysis and scalability of a matrix-free extrapolated geometric multigrid solver for curvilinear coordinates representations from fusion plasma applications
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-03 | DOI: 10.1016/j.jpdc.2025.105143
Philippe Leleux, Christina Schwarz, Martin J. Kühn, Carola Kruse, Ulrich Rüde
Tokamak fusion reactors are promising alternatives for future energy production. Gyrokinetic simulations are important tools to understand physical processes inside tokamaks and to improve the design of future plants. In gyrokinetic codes such as Gysela, these simulations involve at each time step the solution of a gyrokinetic Poisson equation defined on disk-like cross sections. The authors of [14], [15] proposed to discretize a simplified differential equation using symmetric finite differences derived from the resulting energy functional and to use an implicitly extrapolated geometric multigrid scheme tailored to problems in curvilinear coordinates. In this article, we extend the discretization to a more realistic partial differential equation and demonstrate the optimal linear complexity of the proposed solver, in terms of computation and memory. We provide a general framework to analyze floating point operations and memory usage of matrix-free approaches for stencil-based operators. Finally, we give an efficient matrix-free implementation for the considered solver exploiting a task-based multithreaded parallelism which takes advantage of the disk-shaped geometry of the problem. We demonstrate the parallel efficiency for the solution of problems of size up to 50 million unknowns.
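To make the matrix-free notion concrete, the following sketch applies a standard 5-point Poisson stencil without ever forming the matrix; this is the style of operator whose floating point operations and memory traffic such an analysis framework would count. It uses the textbook Poisson coefficients, not the paper's curvilinear-coordinate discretization.

```cpp
// Hedged sketch: matrix-free application of a 5-point stencil, y = A*x,
// on an n-by-n grid stored row-major (x and y have n*n entries).
#include <vector>
#include <cstddef>

void apply_stencil(const std::vector<double>& x, std::vector<double>& y, std::size_t n) {
    // Interior points only; boundary handling omitted for brevity.
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            std::size_t k = i * n + j;
            // 5 loads and 5 multiply-adds per grid point, with no stored matrix.
            y[k] = 4.0 * x[k] - x[k - 1] - x[k + 1] - x[k - n] - x[k + n];
        }
    }
}
```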
Citations: 0
Towards efficient program execution on edge-cloud computing platforms
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-02 | DOI: 10.1016/j.jpdc.2025.105135
Jean-François Dollinger, Vincent Vauchey
This paper investigates techniques dedicated to the performance of edge-cloud infrastructures and identifies the challenges to address to maximize their efficiency. Unlike traditional cloud-only processing, edge-cloud platforms meet the stringent requirements of real-time applications via additional computing resources close to the data source. Yet, due to numerous performance factors, performing efficient computations on such platforms is a complex task. Thus, we identify the main performance bottlenecks induced by traditional approaches and extensively discuss the performance characteristics of edge computing platforms. Based on these insights, we design an automated framework capable of achieving end-to-end efficacy for edge-cloud applications. We argue that achieving performance on edge-cloud infrastructures requires adaptive offloading of programs based on computational requirements. Thus, we comprehensively study three performance-critical aspects forming the performance workflow of applications: i) performance modelling, ii) program optimization, and iii) task scheduling. First, we explore performance modelling techniques, which form the foundation of most cost models and enable accurate prediction for robust code optimization and scheduling. We then cover the whole program optimization chain, from hotspot detection to code optimization, focusing on memory locality, code parallelization, and acceleration. Finally, we discuss task scheduling techniques for selecting the best computing resource and ensuring a balanced workload distribution. Overall, our study provides insights by covering the above performance workflow with reference to prominent state-of-the-art works, particularly focusing on those not yet applied in the context of edge-cloud computing. Additionally, we conducted experiments to further validate our findings. Finally, for each topic of interest, we identify the scientific obstacles addressed and outline the open research challenges yet to be overcome.
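A minimal, hypothetical illustration of the adaptive-offloading argument: given model-predicted times, a task is offloaded to the cloud only when the compute gain outweighs the transfer cost. The estimates and the decision rule below are illustrative assumptions, not the paper's framework.

```cpp
// Hedged sketch: cost-model-driven placement of a single task.
struct TaskEstimate {
    double edge_compute_s;   // predicted compute time on the edge device
    double cloud_compute_s;  // predicted compute time in the cloud
    double transfer_s;       // predicted time to move inputs/outputs to the cloud
};

enum class Placement { Edge, Cloud };

// Offload only when the cloud's compute advantage outweighs the transfer cost.
Placement choose_placement(const TaskEstimate& t) {
    return (t.cloud_compute_s + t.transfer_s < t.edge_compute_s)
               ? Placement::Cloud
               : Placement::Edge;
}
```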
Citations: 0
MM-AutoSolver: A multimodal machine learning method for the auto-selection of iterative solvers and preconditioners
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-01 | DOI: 10.1016/j.jpdc.2025.105144
Hantao Xiong, Wangdong Yang, Weiqing He, Shengle Lin, Keqin Li, Kenli Li
The solution of large-scale sparse linear systems of the form Ax=b is an important research problem in the field of High-Performance Computing (HPC). With the increasing scale of these systems and the development of both HPC software and hardware, iterative solvers along with appropriate preconditioners have become mainstream methods for efficiently solving the sparse linear systems that arise from real-world HPC applications. Among the abundant combinations of iterative solvers and preconditioners, automatically selecting the optimal one has become a vital problem for accelerating the solution of these sparse linear systems. Previous work has utilized machine learning or deep learning algorithms to tackle this problem, but fails to extract and exploit sufficient features from sparse linear systems and is thus unable to obtain satisfactory results. In this work, we propose to address the automatic selection of the optimal combination of iterative solvers and preconditioners through a powerful multimodal machine learning framework, in which features of different modalities can be fully extracted and utilized to improve the results. Based on this framework, we put forward a multimodal machine learning model called MM-AutoSolver for the auto-selection of the optimal combination for a given sparse linear system. The experimental results based on a new large-scale matrix collection show that the proposed MM-AutoSolver outperforms state-of-the-art methods in predictive performance and has the capability to significantly accelerate the solution of large-scale sparse linear systems in HPC applications.
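The decision MM-AutoSolver automates can be pictured as scoring candidate (solver, preconditioner) pairs and picking the highest-scoring one, as in the hedged sketch below. The Candidate fields and the scoring model stand in for the paper's multimodal network and are purely illustrative.

```cpp
// Hedged sketch: selecting the best-scored solver/preconditioner combination.
#include <vector>
#include <string>

struct Candidate {
    std::string solver;         // e.g. "GMRES", "BiCGSTAB"
    std::string preconditioner; // e.g. "ILU(0)", "Jacobi"
    double predicted_score;     // model output; higher means faster expected solve
};

// Assumes `candidates` is non-empty.
Candidate select_combination(const std::vector<Candidate>& candidates) {
    Candidate best = candidates.front();
    for (const auto& c : candidates)
        if (c.predicted_score > best.predicted_score) best = c;
    return best; // combination handed to the HPC solver library
}
```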
Citations: 0
Parallel watershed partitioning: GPU-based hierarchical image segmentation
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-27 | DOI: 10.1016/j.jpdc.2025.105140
Varduhi Yeghiazaryan, Yeva Gabrielyan, Irina Voiculescu
Many image processing applications rely on partitioning an image into disjoint regions whose pixels are ‘similar.’ The watershed and waterfall transforms are established mathematical morphology pixel clustering techniques. They are both relevant to modern applications where groups of pixels are to be decided upon in one go, or where adjacency information is relevant. We introduce three new parallel partitioning algorithms for GPUs. By repeatedly applying watershed algorithms, we produce waterfall results which form a hierarchy of partition regions over an input image. Our watershed algorithms attain competitive execution times in both 2D and 3D, processing an 800 megavoxel image in less than 1.4 sec. We also show how to use this fully deterministic image partitioning as a pre-processing step to machine-learning-based semantic segmentation. This replaces the role of superpixel algorithms, and results in comparable accuracy and faster training times. The code is publicly available at https://github.com/hamemm/PRUF-watershed.git.
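For readers unfamiliar with the transform, the sketch below shows the watershed idea in its simplest sequential form on a 1-D signal: every element drains to the local minimum it descends to, and elements sharing a minimum share a label. This toy CPU version only illustrates the clustering principle; the paper's contribution is parallel 2-D/3-D GPU algorithms.

```cpp
// Hedged sketch: steepest-descent watershed labelling on a 1-D signal.
#include <vector>
#include <cstddef>

std::vector<int> watershed_1d(const std::vector<int>& height) {
    std::size_t n = height.size();
    std::vector<std::size_t> parent(n);
    // Each element points to its strictly lower neighbour, if any;
    // local minima (and flat elements) point to themselves.
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t best = i;
        if (i > 0 && height[i - 1] < height[best]) best = i - 1;
        if (i + 1 < n && height[i + 1] < height[best]) best = i + 1;
        parent[i] = best;
    }
    // Follow drainage pointers down to a minimum; its index becomes the label.
    std::vector<int> label(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t p = i;
        while (parent[p] != p) p = parent[p];
        label[i] = static_cast<int>(p);
    }
    return label;
}
```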
Citations: 0
Topology-aware GPU job scheduling with deep reinforcement learning and heuristics
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-26 | DOI: 10.1016/j.jpdc.2025.105138
Hajer Ayadi, Aijun An, Yiming Shao, Hossein Pourmedheji, Junjie Deng, Jimmy X. Huang, Michael Feiman, Hao Zhou
Deep neural networks (DNNs) have gained popularity in many fields such as computer vision, and natural language processing. However, the increasing size of data and complexity of models have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among GPUs assigned to a DNN training job. High communication costs—arising from factors such as inter-rack or inter-machine data transfers—can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs facilitates more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often lack considerations for hardware topology. Moreover, most proposed GPU schedulers ignore the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce TopDRL, an innovative hybrid scheduler that integrates deep reinforcement learning (DRL) and heuristic methods to enhance GPU job scheduling. TopDRL uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and then a heuristic method selects available GPUs closest to each other from the cluster. Reinforcement learning (RL) is used to train the CNN model to select the job that maximizes throughput-based rewards. Extensive evaluation, conducted on datasets with real jobs, shows that TopDRL significantly outperforms six baseline schedulers that use heuristics or other DRL models for job picking and resource allocation.
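The heuristic half of such a scheduler can be sketched as choosing, among the free GPUs, the subset with the smallest total pairwise distance in the hardware topology. The brute-force subset search and the hop-count distance matrix below are illustrative assumptions, not TopDRL's actual allocation routine, and are only practical for small GPU counts.

```cpp
// Hedged sketch: topology-aware GPU allocation by minimizing pairwise distance.
#include <vector>
#include <algorithm>
#include <limits>
#include <cassert>
#include <cstddef>

using Dist = std::vector<std::vector<int>>; // dist[i][j]: hops between GPU i and GPU j

// Cost of placing one job on a subset of GPUs: sum of pairwise distances.
static int placement_cost(const std::vector<std::size_t>& gpus, const Dist& dist) {
    int cost = 0;
    for (std::size_t a = 0; a < gpus.size(); ++a)
        for (std::size_t b = a + 1; b < gpus.size(); ++b)
            cost += dist[gpus[a]][gpus[b]];
    return cost;
}

// Choose `need` GPUs out of the free ones, minimizing communication cost.
std::vector<std::size_t> allocate_gpus(const std::vector<std::size_t>& free_gpus,
                                       std::size_t need, const Dist& dist) {
    assert(need <= free_gpus.size());
    std::vector<bool> pick(free_gpus.size(), false);
    std::fill(pick.end() - static_cast<long>(need), pick.end(), true); // selection mask
    std::vector<std::size_t> best;
    int best_cost = std::numeric_limits<int>::max();
    do { // enumerate all combinations via permutations of the mask
        std::vector<std::size_t> chosen;
        for (std::size_t i = 0; i < free_gpus.size(); ++i)
            if (pick[i]) chosen.push_back(free_gpus[i]);
        int c = placement_cost(chosen, dist);
        if (c < best_cost) { best_cost = c; best = chosen; }
    } while (std::next_permutation(pick.begin(), pick.end()));
    return best;
}
```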
Citations: 0
Edge metric basis and its fault tolerance over certain interconnection networks
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-23 | DOI: 10.1016/j.jpdc.2025.105141
S. Prabhu, T. Jenifer Janany, M. Arulperumjothi, I.G. Yero
The surveillance of elements in an interconnection network is a classical problem in computer engineering. In addition, it is a problem closely related to uniquely identifying the elements of the network, which is indeed a classical distance-related problem in graph theory. This surveillance can be considered for different styles of elements in the network. The classical version centers the attention on the nodes, while some recent variations of it consider monitoring also the edges or both, vertices and edges at the same time. The first style gave rise to graph structures, called edge resolving set and edge metric basis, which is used to uniquely identify the edges of a given network by means of distance vectors. A vertex x in a graph G uniquely recognizes (resolves or identifies) two edges e and f in G if dG[e,x] ≠ dG[f,x], where dG[e,x] stands for the distance between a vertex x and an edge e of G. A set S with the smallest number of vertices, such that every couple of edges is uniquely recognized by a minimum of one vertex in S, is an edge metric basis, and the edge metric dimension refers to the cardinality of such S. Fault tolerance of a working system is the ability of such a system to keep functioning even if one of its parts stops working properly. The fault tolerance property of the edge metric basis is considered in this work. This results in a concept called fault-tolerant edge metric basis. That is, an edge metric basis S of a graph G is fault-tolerant if every pair of edges of G are resolved by a minimum of two vertices in S, and the minimum possible cardinality of such sets is coined as the fault-tolerant edge metric dimension of G. In this work, we present bounds for the edge metric dimension of graphs and its fault tolerance version. In addition, we investigate these parameters for butterfly, Beneš and fractal cubic networks, and find the exact value for their (fault-tolerant) edge metric dimensions.
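For reference, the definitions used above can be written out as follows, taking the vertex-to-edge distance as the usual minimum over the edge's endpoints (an assumption consistent with the standard literature, though not spelled out in the abstract).

```latex
% dG(u,x) denotes the ordinary shortest-path distance between vertices.
\[
  d_G[e,x] \;=\; \min\{\, d_G(u,x),\ d_G(v,x) \,\} \qquad \text{for an edge } e = uv,
\]
\[
  S \subseteq V(G) \text{ resolves all edges} \iff
  \forall\, e \neq f \in E(G)\ \ \exists\, x \in S:\ d_G[e,x] \neq d_G[f,x].
\]
% The edge metric dimension edim(G) is the minimum |S| over such sets; the
% fault-tolerant variant requires every pair e != f to be distinguished by at
% least two vertices of S.
```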
Citations: 0
Dispersion of mobile robots on directed anonymous graphs
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-20 | DOI: 10.1016/j.jpdc.2025.105139
Giuseppe F. Italiano, Debasish Pattanayak, Gokarna Sharma
Given any arbitrary initial configuration of k ≤ n robots positioned on the nodes of an n-node anonymous graph, the problem of dispersion is to autonomously reposition the robots such that each node contains at most one robot. This problem has gained significant interest due to its resemblance to several fundamental problems such as exploration, scattering, load balancing, relocation of electric cars to charging stations, etc. The objective is to solve dispersion while simultaneously minimizing (or providing a trade-off between) the time and the memory requirement at each robot. The literature has mainly dealt with dispersion on undirected anonymous graphs. In this paper, we initiate the study of dispersion on directed anonymous graphs. We first show that it may not always be possible to solve dispersion when the directed graph is not strongly connected. We then establish lower bounds on both the time and the memory requirement at each robot for solving dispersion on a strongly connected directed graph. Finally, we provide three deterministic algorithms solving dispersion on any strongly connected directed graph. Let D be the graph diameter, Δout be its maximum out-degree, and d be the deficiency (the minimum number of edges that need to be added to the graph to make it Eulerian). The first algorithm solves dispersion in O(d·k²) time with O(k·log(k+Δout)) bits at each robot. The second algorithm solves dispersion in O(k²·Δout) time with O(log(k+Δout)) bits at each robot. The third algorithm solves dispersion in O(k·D) time with O(k·log(k+Δout)) bits at each robot, provided that robots in the 1-hop neighborhood can communicate. All three algorithms extend to handle crash faults.
Citations: 0
The CAMINOS interconnection networks simulator
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-18 | DOI: 10.1016/j.jpdc.2025.105136
Cristóbal Camarero, Daniel Postigo, Pablo Fuentes
This work presents CAMINOS, a new interconnection network simulator focusing on router microarchitecture. It was developed in Rust, a novel programming language with a syntax similar to C/C++ and strong memory protection.
The architecture of CAMINOS emphasizes the composition of components. This allows new designs to be defined in a configuration file without modifying source code, greatly reducing effort and time.
In addition to simulation functionality, CAMINOS assists in managing a collection of simulations as an experiment. This includes integration with SLURM to support executing batches of simulations and generating PDFs with results and diagnostics.
We show that CAMINOS makes good use of computing resources. Its memory usage is dominated by in-flight messages, indicating low memory overhead. We attest that CAMINOS uses CPU time effectively, as scenarios with little contention execute faster.
Citations: 0