
Latest Publications in Parallel Computing

Detecting chaotic regions of recurrent equations in parallel environments
IF 2.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-01 | DOI: 10.1016/j.parco.2025.103163
Athanasios Margaris, Stavros Souravlas
This paper investigates how parallel computing techniques, such as OpenMP and CUDA, can be optimized to enhance the computational efficiency of detecting chaotic regions in the parameter space of recurrent equations, a critical task in chaos theory. Leveraging the embarrassingly parallel nature of maximum Lyapunov exponent calculations, our method targets systems with known recurrence relations, where governing equations are analytically defined. Applied to a discretized recurrent neural model, the proposed approach achieves significant speedups, addressing the computational intensity of chaos detection. While building on established parallel techniques, this work fills a gap in their systematic application to chaos detection in high-dimensional systems, offering a scalable solution with potential for real-time analysis. We provide detailed performance metrics, parallel I/O guidelines, and visualization strategies, demonstrating adaptability to other analytically defined chaotic systems.
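The computational pattern the abstract describes, an independent maximum Lyapunov exponent estimate at every point of a parameter grid, can be illustrated with a minimal OpenMP sketch. The one-dimensional logistic map, the grid bounds, and all names below are illustrative assumptions; they are not the recurrent neural model or the code used in the paper.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Maximum Lyapunov exponent of the logistic map x_{n+1} = r*x_n*(1-x_n),
// estimated by averaging log|f'(x_n)| along a trajectory (illustrative only).
double lyapunov_logistic(double r, int transient = 1000, int iters = 10000) {
    double x = 0.5, sum = 0.0;
    for (int i = 0; i < transient; ++i) x = r * x * (1.0 - x);
    for (int i = 0; i < iters; ++i) {
        x = r * x * (1.0 - x);
        sum += std::log(std::fabs(r * (1.0 - 2.0 * x)));
    }
    return sum / iters;
}

int main() {
    const int n = 100000;
    const double r_min = 2.5, r_max = 4.0;
    std::vector<double> lambda(n);

    // Each parameter value is independent: an embarrassingly parallel sweep.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double r = r_min + (r_max - r_min) * i / (n - 1);
        lambda[i] = lyapunov_logistic(r);
    }

    // A positive exponent flags a chaotic region of parameter space.
    int chaotic = 0;
    for (int i = 0; i < n; ++i) if (lambda[i] > 0.0) ++chaotic;
    std::printf("chaotic grid points: %d of %d\n", chaotic, n);
    return 0;
}
```

On a GPU, the same loop body would map one grid point (or a block of points) to each thread, which is the CUDA setting the abstract refers to.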
Citations: 0
A dependency-aware task offloading in IoT-based edge computing system using an optimized deep learning approach
IF 2.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-01 | DOI: 10.1016/j.parco.2025.103161
Shiva Shankar Reddy, Silpa Nrusimhadri, Gadiraju Mahesh, Veeranki Venkata Rama Maheswara Rao
Internet of Things (IoT) devices produce large volumes of data that can be difficult to process on resource-limited computing systems. Edge computing addresses this by providing localized processing power at the edge of IoT networks, reducing communication delays and network bandwidth usage. Because of their limited resources and the dependencies among tasks, edge computing systems face growing computational pressure as the use of IoT devices expands. To address these issues, this research proposes an efficient task-offloading system that combines the Fire Hawk Optimizer (FHO) and Deep Reinforcement Learning (DRL), leveraging deep learning techniques to prioritize and offload computational tasks from IoT applications to edge computing systems while accounting for task interdependencies and resource constraints. The proposed method consists of two components. The first uses Petri-Net modelling to analyze interdependencies among tasks, identify subtasks, and map their relationships. The second uses a residual neural network-based actor-critic deep reinforcement learning (ResNet-ACDRL) decision-making model to offload tasks: the ResNet-ACDRL model assesses task dependencies and resource availability and dynamically learns and improves task-offloading strategies. The FHO then refines these learned policies to ensure optimal task allocation across local, edge, and cloud computing resources; here, a "policy" is the strategy the system uses to decide the most suitable resource for executing a task. This dual approach drastically reduces energy usage and execution delays. Experimental data show that the proposed framework outperforms existing methods in time delay and energy consumption, especially when managing task interdependencies and a variety of computational loads.
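As a rough illustration of what "dependency-aware offloading" means in code, the sketch below walks a small task DAG in topological order and greedily places each task on the tier (local, edge, or cloud) with the earliest estimated finish time, accounting for predecessor completion and data transfer. All numbers, tiers, and the greedy rule are hypothetical; the paper's FHO-refined ResNet-ACDRL policy is not reproduced here.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative task model: workload in MI, input data in MB, and predecessor ids.
struct Task { double mi, mb; std::vector<int> preds; };

enum Tier { LOCAL = 0, EDGE = 1, CLOUD = 2 };
const double speed[3] = {1e3, 5e3, 2e4};        // MIPS per tier (assumed)
const double link_mbps[3] = {1e9, 100.0, 20.0}; // LOCAL bandwidth set huge: data is already on the device

int main() {
    // Small diamond-shaped DAG: 0 -> {1,2} -> 3 (hypothetical workload).
    std::vector<Task> dag = {
        {200, 1, {}}, {800, 5, {0}}, {1200, 8, {0}}, {300, 2, {1, 2}}};
    std::vector<double> finish(dag.size(), 0.0);

    // Tasks are listed in topological order, so predecessors are already placed.
    for (size_t i = 0; i < dag.size(); ++i) {
        double ready = 0.0;
        for (int p : dag[i].preds) ready = std::max(ready, finish[p]);
        int best = LOCAL; double best_t = 1e18;
        for (int t = LOCAL; t <= CLOUD; ++t) {
            double transfer = dag[i].mb * 8.0 / link_mbps[t];   // offload cost
            double done = ready + transfer + dag[i].mi / speed[t];
            if (done < best_t) { best_t = done; best = t; }
        }
        finish[i] = best_t;
        std::printf("task %zu -> tier %d, finish %.3f s\n", i, best, best_t);
    }
    return 0;
}
```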
Citations: 0
GPU/CUDA-Accelerated gradient growth optimizer for efficient complex numerical global optimization
IF 2.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-10 | DOI: 10.1016/j.parco.2025.103160
Qingke Zhang, Wenliang Chen, Shuzhao Pang, Sichen Tao, Conglin Li, Xin Yin
Efficiently solving high-dimensional and complex numerical optimization problems remains a critical challenge in high-performance computing. This paper presents the GPU/CUDA-Accelerated Gradient Growth Optimizer (GGO)—a novel parallel metaheuristic algorithm that combines gradient-guided local search with GPU-enabled large-scale parallelism. Building upon the Growth Optimizer (GO), GGO incorporates a dimension-wise gradient-guiding strategy based on central difference approximations, which improves solution precision without requiring differentiable objective functions. To address the computational bottlenecks of high-dimensional problems, a hybrid CUDA-based framework is developed, integrating both fine-grained and coarse-grained parallel strategies to fully exploit GPU resources and minimize memory access latency. Extensive experiments on the CEC2017 and CEC2022 benchmark suites demonstrate the superior performance of GGO in terms of both convergence accuracy and computational speed. Compared to 49 state-of-the-art optimization algorithms, GGO achieves top-ranked results in 67% of test cases and delivers up to 7.8× speedup over its CPU-based counterpart. Statistical analyses using the Wilcoxon signed-rank test further confirm its robustness across 28 out of 29 functions in high-dimensional scenarios. Additionally, in-depth analysis reveals that GGO maintains high scalability and performance even as the problem dimension and population size increase, providing a generalizable solution for high-dimensional global optimization that is well-suited for parallel computing applications in scientific and engineering domains.
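The dimension-wise gradient-guiding step rests on the central-difference approximation g_i ≈ (f(x + h·e_i) - f(x - h·e_i)) / (2h), which needs only objective evaluations, not analytic derivatives. A minimal CPU sketch of that scheme follows; the sphere objective, step size, and OpenMP loop are assumptions for illustration, not the GGO kernels, where dimensions and individuals would instead map to GPU threads.

```cpp
#include <cstdio>
#include <vector>

// Placeholder objective; any black-box function works with this scheme.
double sphere(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return s;
}

// Dimension-wise central-difference gradient estimate: each dimension is independent.
std::vector<double> central_diff_grad(const std::vector<double>& x, double h = 1e-6) {
    std::vector<double> g(x.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (int)x.size(); ++i) {
        std::vector<double> xp = x, xm = x;
        xp[i] += h; xm[i] -= h;
        g[i] = (sphere(xp) - sphere(xm)) / (2.0 * h);
    }
    return g;
}

int main() {
    std::vector<double> x = {1.0, -2.0, 0.5};
    std::vector<double> g = central_diff_grad(x);
    // One gradient-guided refinement step on a candidate solution.
    const double step = 0.1;
    for (size_t i = 0; i < x.size(); ++i) x[i] -= step * g[i];
    std::printf("f after step: %.6f\n", sphere(x));
    return 0;
}
```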
Citations: 0
Software acceleration of multi-user MIMO uplink detection on GPU
IF 2.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-09-01 | DOI: 10.1016/j.parco.2025.103150
Ali Nada, Hazem Ismail Ali, Liang Liu, Yousra Alkabani
This paper explores GPU-accelerated block-wise decompositions for zero-forcing (ZF) based QR and Cholesky methods applied to massive multiple-input multiple-output (MIMO) uplink detection algorithms. Three algorithms are evaluated: ZF with block Cholesky decomposition, ZF with block QR decomposition (QRD), and minimum mean square error (MMSE) with block Cholesky decomposition. The latter is the only one explored in prior work, which used standard rather than block Cholesky decomposition. Our approach achieves an 11% improvement over that previous GPU-accelerated MMSE study.
Through performance analysis, we observe a trade-off between precision and execution time. Reducing precision from FP64 to FP32 improves execution time but increases bit error rate (BER), with ZF-based QRD reducing execution time from 2.04μs to 1.24μs for a 128 × 8 MIMO size. The study also highlights that larger MIMO sizes, particularly 2048 × 32, require GPUs to fully utilize their computational and memory capabilities, especially under FP64 precision. In contrast, smaller matrices are compute-bound.
Our results recommend GPUs for larger MIMO sizes, as they offer the parallelism and memory resources necessary to efficiently handle the computational demands of next-generation networks. This work paves the way for scalable, GPU-based massive MIMO uplink detection systems.
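A compact way to see the ZF-with-Cholesky combination is the normal-equations form x_hat = (H^H H)^(-1) H^H y: the Gram matrix H^H H is Hermitian positive definite, so a Cholesky factorization solves the system. The sketch below uses Eigen on the CPU purely for clarity; the paper's block-wise decompositions, GPU kernels, and noise modeling are not reproduced. Switching the complex single-precision types (MatrixXcf/VectorXcf) to double precision (MatrixXcd/VectorXcd) mirrors the FP32 versus FP64 trade-off discussed above.

```cpp
#include <Eigen/Dense>
#include <cstdio>

int main() {
    const int rx = 128, users = 8;                              // 128 x 8 MIMO size from the text
    Eigen::MatrixXcf H = Eigen::MatrixXcf::Random(rx, users);   // channel matrix
    Eigen::VectorXcf s = Eigen::VectorXcf::Random(users);       // transmitted symbols
    Eigen::VectorXcf y = H * s;                                 // noiseless receive vector (toy)

    // Zero-forcing via the normal equations: solve (H^H H) x = H^H y with Cholesky (LLT).
    Eigen::MatrixXcf gram = H.adjoint() * H;                    // Hermitian positive definite
    Eigen::VectorXcf x_hat = gram.llt().solve(H.adjoint() * y);

    std::printf("max detection error: %e\n", (x_hat - s).cwiseAbs().maxCoeff());
    return 0;
}
```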
Citations: 0
Enable cross-iteration parallelism for PIM-based graph processing with vertex-level synchronization
IF 2.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-08-11 | DOI: 10.1016/j.parco.2025.103149
Xiang Zhao, Haitao Du, Yi Kang
Processing-in-memory (PIM) architectures have emerged as a promising solution for accelerating graph processing by enabling computation in memory and minimizing data movement. However, most existing PIM-based graph processing systems rely on the Bulk Synchronous Parallel (BSP) model, which frequently enforces global barriers that limit cross-iteration computational parallelism and introduce significant synchronization and communication overheads.
To address these limitations, we propose the Cross Iteration Parallel (CIP) model, a novel vertex-level synchronization approach that eliminates global barriers by independently tracking the synchronization states of vertices. The CIP model enables concurrent execution across iterations, enhancing computational parallelism, overlapping communication and computation, improving core utilization, and increasing resilience to workload imbalance. We implement the CIP model in a PIM-based graph processing system, GraphDF, which features a few specially designed function units to support vertex-level synchronization. Evaluated on a PyMTL3-based cycle-accurate simulator using four real-world graphs and four graph algorithms, CIP running on GraphDF achieves an average speedup of 1.8× and a maximum of 2.3× compared to Dalorex, the state-of-the-art PIM-based graph processing system.
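The CIP model itself is tied to the GraphDF hardware, but the general idea of replacing a global per-iteration barrier with per-vertex state can be sketched in software: every vertex keeps its own tentative BFS level and is updated whenever a neighbor exposes a better one, with no frontier-wide synchronization. The label-correcting sketch below is a generic illustration of that barrier-free, vertex-level pattern, not the paper's CIP implementation or its function units.

```cpp
#include <atomic>
#include <climits>
#include <cstdio>
#include <vector>

int main() {
    // Tiny undirected graph as an adjacency list (illustrative).
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};
    const int n = (int)adj.size(), src = 0;

    std::vector<std::atomic<int>> level(n);
    for (auto& l : level) l.store(INT_MAX);
    level[src].store(0);

    // Label-correcting relaxation: no global barrier, each vertex advances
    // as soon as one of its neighbors exposes a smaller level.
    bool changed = true;
    while (changed) {
        changed = false;
        #pragma omp parallel for schedule(dynamic) reduction(||:changed)
        for (int u = 0; u < n; ++u) {
            int lu = level[u].load(std::memory_order_relaxed);
            if (lu == INT_MAX) continue;
            for (int v : adj[u]) {
                int lv = level[v].load(std::memory_order_relaxed);
                while (lu + 1 < lv &&
                       !level[v].compare_exchange_weak(lv, lu + 1)) {}
                if (lu + 1 < lv) changed = true;
            }
        }
    }
    for (int u = 0; u < n; ++u)
        std::printf("vertex %d: level %d\n", u, level[u].load());
    return 0;
}
```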
Citations: 0
ALBBA: An efficient ALgebraic Bypass BFS Algorithm on long vector architectures
IF 2 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-11 | DOI: 10.1016/j.parco.2025.103147
Yuyao Niu, Marc Cacas
Breadth First Search (BFS) is a fundamental algorithm in scientific computing, databases, and network analysis applications. In the algebraic BFS paradigm, each BFS iteration is expressed as a sparse matrix–vector multiplication, allowing BFS to be accelerated and analyzed through well-established linear algebra primitives. Although much effort has been made to optimize algebraic BFS on parallel platforms such as CPUs, GPUs, and distributed memory systems, vector architectures that exploit Single Instruction Multiple Data (SIMD) parallelism, particularly with their high performance on sparse workloads, remain relatively underexplored for BFS.
In this paper, we propose the ALgebraic Bypass BFS Algorithm (ALBBA), a novel and efficient algebraic BFS implementation optimized for long vector architectures. ALBBA utilizes a customized variant of the SELL-C-σ data structure to fully exploit the SIMD capabilities. By integrating a vectorization-friendly search method alongside a two-level bypass strategy, we enhance both sparse matrix-sparse vector multiplication (SpMSpV) and sparse matrix-dense vector multiplication (SpMV) algorithms, which are crucial for algebraic BFS operations. We further incorporate merge primitives and adopt an efficient selection method for each BFS iteration. Our experiments on an NEC VE20B processor demonstrate that ALBBA achieves average speedups of 3.91×, 2.88×, and 1.46× over Enterprise, GraphBLAST, and Gunrock running on an NVIDIA H100 GPU, respectively.
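SELL-C-σ is a published SIMD-friendly sparse format: rows are grouped into chunks of height C, each chunk is padded to the length of its longest row, and values are stored column-major inside the chunk so one vector instruction can process C rows at a time (the σ parameter sorts rows within a window by length to limit padding). The sketch below builds a simplified SELL-C structure with the σ sorting omitted and runs a scalar SpMV over it; it illustrates the layout only and is not ALBBA's vector-engine kernel.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Simplified SELL-C container: per chunk, values/column indices stored column-major.
struct SellC {
    int C, n;                    // chunk height, number of rows
    std::vector<int> chunk_len;  // widest row length per chunk
    std::vector<int> chunk_off;  // start offset of each chunk in val/col
    std::vector<double> val;
    std::vector<int> col;        // padded entries use col = -1
};

SellC build(int C, const std::vector<std::vector<std::pair<int,double>>>& rows) {
    SellC s; s.C = C; s.n = (int)rows.size();
    for (int base = 0; base < s.n; base += C) {
        int w = 0;
        for (int r = base; r < base + C && r < s.n; ++r)
            w = std::max(w, (int)rows[r].size());
        s.chunk_off.push_back((int)s.val.size());
        s.chunk_len.push_back(w);
        for (int j = 0; j < w; ++j)                   // column-major within the chunk
            for (int r = base; r < base + C; ++r) {
                bool ok = r < s.n && j < (int)rows[r].size();
                s.val.push_back(ok ? rows[r][j].second : 0.0);
                s.col.push_back(ok ? rows[r][j].first : -1);
            }
    }
    return s;
}

void spmv(const SellC& s, const std::vector<double>& x, std::vector<double>& y) {
    for (int c = 0; c * s.C < s.n; ++c) {
        int base = c * s.C, off = s.chunk_off[c];
        for (int j = 0; j < s.chunk_len[c]; ++j)
            for (int r = 0; r < s.C && base + r < s.n; ++r) {   // vectorizable over r
                int idx = off + j * s.C + r;
                if (s.col[idx] >= 0) y[base + r] += s.val[idx] * x[s.col[idx]];
            }
    }
}

int main() {
    // 4x4 matrix given row-wise as (column, value) pairs.
    std::vector<std::vector<std::pair<int,double>>> rows = {
        {{0,4.0},{2,1.0}}, {{1,3.0}}, {{0,2.0},{2,5.0},{3,1.0}}, {{3,6.0}}};
    SellC s = build(2, rows);
    std::vector<double> x = {1,1,1,1}, y(4, 0.0);
    spmv(s, x, y);
    for (double v : y) std::printf("%.1f ", v);   // expected: 5.0 3.0 8.0 6.0
    std::printf("\n");
    return 0;
}
```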
Citations: 0
Using Java to create and analyze models of parallel computing systems
IF 2 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-14 | DOI: 10.1016/j.parco.2025.103146
Harish Padmanaban, Nurkasym Arkabaev, Maher Ali Rusho, Vladyslav Kozub, Yurii Kozub
The purpose of the study is to develop optimal solutions for models of parallel computing systems using the Java language. During the study, programs were written for the examined models of parallel computing systems. The parallel sorting program outputs a sorted array of random numbers. The parallel data-processing program reports the processing time and the first elements of the resulting list of squared numbers. The asynchronous request-processing program reports a completion message for each task after a slight delay. The main results include the development of optimization methods for algorithms and processes, such as the division of tasks into subtasks, the use of non-blocking algorithms, effective memory management, and load balancing, as well as the construction of diagrams and a comparison of these methods by their characteristics, including descriptions, implementation examples, and advantages. In addition, various specialized libraries were analyzed to improve the performance and scalability of the models. The results showed substantial improvements in response time, bandwidth, and resource efficiency in parallel computing systems. Scalability and load analyses were conducted, demonstrating how the system responds to increases in data volume or the number of threads. Profiling tools were used to analyze performance in detail and identify bottlenecks in the models, which improved the architecture and implementation of the parallel computing systems. The obtained results emphasize the importance of choosing the right methods and tools for optimizing parallel computing systems, which can substantially improve their performance and efficiency.
Citations: 0
FPGA-based accelerator for YOLOv5 object detection with optimized computation and data access for edge deployment
IF 2 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-05-05 | DOI: 10.1016/j.parco.2025.103138
Wei Qian, Zhengwei Zhu, Chenyang Zhu, Yanping Zhu
In the realm of object detection, advances in convolutional neural networks have been substantial. However, their high computational and data-access demands complicate the deployment of these algorithms on edge devices. To mitigate these challenges, field-programmable gate arrays have emerged as an ideal hardware platform for executing the parallel computations inherent in convolutional neural networks, owing to their low power consumption and rapid response capabilities. We have developed a field-programmable gate array-based accelerator for the You Only Look Once version 5 (YOLOv5) object detection network, implemented in Verilog Hardware Description Language on the Xilinx XCZU15EG chip. The accelerator efficiently processes the convolutional layers, batch normalization fusion layers, and tensor addition operations of the YOLOv5 network. Our architecture separates the convolution computations into two computing units: multiplication and addition. The addition operations are significantly accelerated by the introduction of compressor adders and ternary adder trees. Additionally, off-chip bandwidth pressure is alleviated through the use of dual-input single-output buffers and dedicated data access units. Experimental results show that the accelerator consumes 13.021 watts at a clock frequency of 200 megahertz and that it outperforms Amazon Web Services Graviton2 central processing units and Jetson Nano graphics processing units. Ablation experiments validate the enhancements provided by our innovative designs. Ultimately, our approach boosts the inference speed of the YOLOv5 network by 61.88%, 69.1%, 59.36%, 64.07%, and 65.92%, dramatically enhancing the performance of the accelerator and surpassing existing methods.
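One of the arithmetic ideas mentioned, ternary adder trees, can be illustrated in software even though the design itself is Verilog: summing N partial results through nodes that each take three inputs needs ceil(log3 N) levels instead of ceil(log2 N), and all nodes within a level operate in parallel in hardware. The sketch below demonstrates only that depth reduction; the values and structure are illustrative and do not model the XCZU15EG implementation.

```cpp
#include <cstdio>
#include <vector>

// Number of tree levels needed to reduce n inputs with adders of a given arity.
int tree_levels(int n, int arity) {
    int levels = 0;
    while (n > 1) { n = (n + arity - 1) / arity; ++levels; }
    return levels;
}

// Software reduction tree: each level sums groups of `arity` partial results.
long long reduce_tree(std::vector<long long> v, int arity) {
    while (v.size() > 1) {
        std::vector<long long> next;
        for (size_t i = 0; i < v.size(); i += arity) {
            long long s = 0;
            for (size_t j = i; j < i + arity && j < v.size(); ++j) s += v[j];
            next.push_back(s);   // one adder node per group; a whole level runs in parallel in hardware
        }
        v.swap(next);
    }
    return v.empty() ? 0 : v[0];
}

int main() {
    std::vector<long long> partials(27, 1);   // 27 partial products, value 1 each
    std::printf("sum = %lld\n", reduce_tree(partials, 3));
    std::printf("binary tree depth: %d, ternary tree depth: %d\n",
                tree_levels(27, 2), tree_levels(27, 3));
    return 0;
}
```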
Citations: 0
EESF: Energy-efficient scheduling framework for deadline-constrained workflows with computation speed estimation method in cloud
IF 2 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-05-05 | DOI: 10.1016/j.parco.2025.103139
Rupinder Kaur, Gurjinder Kaur, Major Singh Goraya
The substantial amount of energy consumed by rapidly growing cloud data centers is a major hindrance to sustainable cloud computing. This paper therefore proposes a scheduling framework named EESF that aims to minimize the energy consumption and makespan of workflow execution under deadline and dependency constraints. The novel aspects of EESF are as follows: 1) it first estimates the computation speed requirements of the entire workflow application before execution begins, and then estimates the computation speed requirements of individual tasks dynamically during execution. 2) Unlike existing approaches that mainly assign tasks to virtual machines (VMs) with lower energy consumption, or use DVFS to lower the frequency or voltage of hosts/VMs at the cost of a longer makespan, EESF considers the degree of dependency of the tasks along with the estimated speed when assigning tasks to VMs. 3) Since scheduling dependent tasks on the same VM is not always energy-efficient, a new concept of virtual task clustering is introduced to schedule dependent tasks in an energy-efficient manner. 4) EESF deploys VMs dynamically according to the computation speed required by the tasks, preventing over- or under-provisioning of computational power. 5) Task reassignment generally causes large data transfers, which also consume energy, but EESF reassigns tasks to more energy-efficient VMs running on the same host, reducing the data transfer time to zero. Experiments performed using four real-world scientific workflows and 10 random workflows show that EESF reduces energy consumption by 6%-44% compared with related algorithms while significantly reducing the makespan.
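The speed-estimation idea, deriving the computation speed the whole workflow needs from its remaining workload and the time left before the deadline and then choosing the smallest adequate VM, can be written in a few lines. The workload figures and the VM catalogue below are hypothetical; the sketch covers only the estimate itself, not EESF's virtual task clustering or reassignment logic.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical workflow: remaining task lengths in millions of instructions.
    std::vector<double> remaining_mi = {4000, 2500, 6000, 1500};
    double deadline_s = 20.0, elapsed_s = 5.0;

    double total_mi = 0.0;
    for (double mi : remaining_mi) total_mi += mi;

    // Required speed = remaining work / time left before the deadline.
    double required_mips = total_mi / (deadline_s - elapsed_s);

    // Pick the slowest (least energy-hungry) VM type that still meets the requirement.
    std::vector<double> vm_mips = {500, 1000, 2000, 4000};   // assumed catalogue
    int chosen = -1;
    for (int i = 0; i < (int)vm_mips.size(); ++i)
        if (vm_mips[i] >= required_mips) { chosen = i; break; }

    std::printf("required %.1f MIPS -> VM type %d\n", required_mips, chosen);
    return 0;
}
```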
Citations: 0
Multi-level parallelism optimization for two-dimensional convolution vectorization method on multi-core vector accelerator
IF 2 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-04-29 | DOI: 10.1016/j.parco.2025.103137
Siyang Xing, Youmeng Li, Zikun Deng, Qijun Zheng, Zeyu Lu, Qinglin Wang
The widespread application of convolutional neural networks across diverse domains has highlighted the growing significance of accelerating convolutional computations. In this work, we design a multi-level parallelism optimization method for a direct convolution vectorization algorithm based on a channel-first data layout on a multi-core vector accelerator. The method computes from an input row and a weight column within a single core and processes more elements simultaneously, thereby effectively hiding instruction latency and improving instruction-level parallelism. It also substantially eliminates the data overlap caused by sliding convolution windows. Across cores, the data flow is optimized with data-reuse methods tailored to different situations. Experimental results show that multi-core computational efficiency can be improved greatly, up to 80.2%. For the typical network ResNet18, a performance acceleration of 4.42-5.63 times is achieved compared with the existing method on the accelerator.
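A channel-first (CHW) direct convolution makes the layout in question concrete: for each output row, the innermost loop runs over contiguous output columns and is the natural target for vector instructions. The sketch below is a plain scalar reference implementation in that layout; the accelerator's mapping of input rows and weight columns onto vector lanes and cores is not modeled.

```cpp
#include <cstdio>
#include <vector>

// Direct convolution, channel-first (CHW) layout, stride 1, no padding.
// in: Cin x H x W    w: Cout x Cin x K x K    out: Cout x (H-K+1) x (W-K+1)
void conv2d_chw(const std::vector<float>& in, const std::vector<float>& w,
                std::vector<float>& out, int Cin, int H, int W, int Cout, int K) {
    int Ho = H - K + 1, Wo = W - K + 1;
    for (int co = 0; co < Cout; ++co)
        for (int ci = 0; ci < Cin; ++ci)
            for (int kh = 0; kh < K; ++kh)
                for (int kw = 0; kw < K; ++kw) {
                    float wv = w[((co * Cin + ci) * K + kh) * K + kw];
                    for (int oh = 0; oh < Ho; ++oh) {
                        const float* irow = &in[(ci * H + oh + kh) * W + kw];
                        float* orow = &out[(co * Ho + oh) * Wo];
                        for (int ow = 0; ow < Wo; ++ow)   // contiguous, vectorizable
                            orow[ow] += wv * irow[ow];
                    }
                }
}

int main() {
    int Cin = 2, H = 5, W = 5, Cout = 3, K = 3;
    std::vector<float> in(Cin * H * W, 1.0f), w(Cout * Cin * K * K, 1.0f);
    std::vector<float> out(Cout * (H - K + 1) * (W - K + 1), 0.0f);
    conv2d_chw(in, w, out, Cin, H, W, Cout, K);
    std::printf("out[0] = %.1f (expected %.1f)\n", out[0], (float)(Cin * K * K));
    return 0;
}
```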
Citations: 0