Detecting chaotic regions of recurrent equations in parallel environments
Pub Date: 2025-11-01 | DOI: 10.1016/j.parco.2025.103163
Athanasios Margaris, Stavros Souravlas
This paper investigates how parallel computing techniques, such as OpenMP and CUDA, can be optimized to enhance the computational efficiency of detecting chaotic regions in the parameter space of recurrent equations, a critical task in chaos theory. Leveraging the embarrassingly parallel nature of maximum Lyapunov exponent calculations, our method targets systems with known recurrence relations, where governing equations are analytically defined. Applied to a discretized recurrent neural model, the proposed approach achieves significant speedups, addressing the computational intensity of chaos detection. While building on established parallel techniques, this work fills a gap in their systematic application to chaos detection in high-dimensional systems, offering a scalable solution with potential for real-time analysis. We provide detailed performance metrics, parallel I/O guidelines, and visualization strategies, demonstrating adaptability to other analytically defined chaotic systems.
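The parameter-space scan described here is embarrassingly parallel because each parameter value yields an independent maximum Lyapunov exponent estimate. The sketch below illustrates that structure with OpenMP on the one-dimensional logistic map, used only as a stand-in for the paper's discretized recurrent neural model; the grid size, iteration counts, and parameter range are illustrative assumptions.

```cpp
// Minimal sketch (not the paper's neural model): estimate the maximum
// Lyapunov exponent of the logistic map x_{n+1} = r*x_n*(1-x_n) over a
// grid of r values, parallelized with OpenMP. Each grid point is
// independent, which is the embarrassingly parallel structure the paper exploits.
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int grid = 10000;          // number of parameter samples
    const int transient = 1000;      // iterations discarded before averaging
    const int iters = 10000;         // iterations used for the exponent
    std::vector<double> lyap(grid);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < grid; ++i) {
        double r = 2.5 + 1.5 * i / (grid - 1);   // r in [2.5, 4.0]
        double x = 0.5, sum = 0.0;
        for (int n = 0; n < transient; ++n) x = r * x * (1.0 - x);
        for (int n = 0; n < iters; ++n) {
            x = r * x * (1.0 - x);
            sum += std::log(std::fabs(r * (1.0 - 2.0 * x)));  // ln|f'(x)|
        }
        lyap[i] = sum / iters;       // lambda > 0 flags a chaotic region
    }

    // Report how much of the parameter range is chaotic.
    int chaotic = 0;
    for (double l : lyap) if (l > 0.0) ++chaotic;
    std::printf("chaotic fraction: %.3f\n", double(chaotic) / grid);
    return 0;
}
```

On a GPU the same loop body maps naturally onto one thread per parameter value, which is the kind of mapping the CUDA variant of such a scan relies on.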
{"title":"Detecting chaotic regions of recurrent equations in parallel environments","authors":"Athanasios Margaris , Stavros Souravlas","doi":"10.1016/j.parco.2025.103163","DOIUrl":"10.1016/j.parco.2025.103163","url":null,"abstract":"<div><div>This paper investigates how parallel computing techniques, such as OpenMP and CUDA, can be optimized to enhance the computational efficiency of detecting chaotic regions in the parameter space of recurrent equations, a critical task in chaos theory. Leveraging the embarrassingly parallel nature of maximum Lyapunov exponent calculations, our method targets systems with known recurrence relations, where governing equations are analytically defined. Applied to a discretized recurrent neural model, the proposed approach achieves significant speedups, addressing the computational intensity of chaos detection. While building on established parallel techniques, this work fills a gap in their systematic application to chaos detection in high-dimensional systems, offering a scalable solution with potential for real-time analysis. We provide detailed performance metrics, parallel I/O guidelines, and visualization strategies, demonstrating adaptability to other analytically defined chaotic systems.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103163"},"PeriodicalIF":2.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A dependency-aware task offloading in IoT-based edge computing system using an optimized deep learning approach
Pub Date: 2025-11-01 | DOI: 10.1016/j.parco.2025.103161
Shiva Shankar Reddy, Silpa Nrusimhadri, Gadiraju Mahesh, Veeranki Venkata Rama Maheswara Rao
Internet of Things (IoT) devices produce large volumes of data that are difficult to process on resource-constrained computing systems. Edge computing addresses this issue by providing localized processing power at the edge of IoT networks, reducing communication delays and network bandwidth consumption. However, because of their limited resources and the dependencies among tasks, edge computing systems face growing computational pressure as IoT deployments expand. This research proposes an efficient task-offloading system that combines the Fire Hawk Optimizer (FHO) with Deep Reinforcement Learning (DRL) to address these issues, using deep learning to prioritize and offload computational tasks from IoT applications to edge resources while respecting task interdependencies and resource constraints. The proposed method consists of two components. The first uses Petri-net modelling to analyze interdependencies among tasks, identify subtasks, and map their relationships. The second uses a residual neural network-based actor-critic deep reinforcement learning (ResNet-ACDRL) decision-making model to offload tasks: the DRL component assesses task dependencies and resource availability and dynamically learns and refines task-offloading strategies. The FHO then refines these learned policies to ensure optimal task allocation across local, edge, and cloud computing resources; here, a "policy" is the strategy the system uses to decide the most suitable resource for executing a task. This dual strategy substantially reduces energy usage and execution delays. Experimental results show that the proposed framework outperforms existing methods, particularly when managing task interdependencies and varied computational loads, with significant improvements in time delay and energy consumption.
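The first component derives task interdependencies before any offloading decision is made. As a minimal, generic stand-in for that dependency analysis (not the paper's Petri-net model), the sketch below uses Kahn's topological sort to release tasks for offloading decisions only once all of their predecessors have finished; the task graph is hypothetical.

```cpp
// Generic sketch of dependency-aware task release (not the paper's
// Petri-net model): Kahn's topological sort over a task DAG yields an
// order in which tasks become ready for offloading decisions.
// Task IDs and edges are hypothetical.
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    const int n = 5;                               // tasks 0..4
    std::vector<std::vector<int>> succ(n);
    std::vector<int> indeg(n, 0);
    auto edge = [&](int a, int b) { succ[a].push_back(b); ++indeg[b]; };
    edge(0, 2); edge(1, 2); edge(2, 3); edge(2, 4); // 0,1 -> 2 -> {3,4}

    std::queue<int> ready;
    for (int t = 0; t < n; ++t) if (indeg[t] == 0) ready.push(t);

    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        std::printf("task %d is ready for an offloading decision\n", t);
        for (int s : succ[t])
            if (--indeg[s] == 0) ready.push(s);    // all predecessors done
    }
    return 0;
}
```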
{"title":"A dependency-aware task offloading in IoT-based edge computing system using an optimized deep learning approach","authors":"Shiva Shankar Reddy , Silpa Nrusimhadri , Gadiraju Mahesh , Veeranki Venkata Rama Maheswara Rao","doi":"10.1016/j.parco.2025.103161","DOIUrl":"10.1016/j.parco.2025.103161","url":null,"abstract":"<div><div>Internet of Things (IoT) devices produce a lot of data, which can be difficult to process on limited computing systems. Edge computing aims to solve this issue by providing localized processing power at the edge of IoT networks to reduce communication delays and network bandwidth. Because of their limited resources and task dependencies, edge computing systems are facing computational issues as a result of the growing usage of IoT devices. An efficient task-offloading system that combines the Fire Hawk Optimizer (FHO) and Deep Reinforcement Learning (DRL) is proposed in this research to address these issues. This paper proposes leveraging deep learning techniques to prioritize and offload computational tasks from IoT applications to edge computing systems, addressing task interdependencies and resource constraints to enhance efficiency. The proposed method consists of two components. The first component uses Petri-Net modelling to analyze interdependencies among tasks, identify subtasks, and map their relationships. The second component uses a residual neural network-based actor-critic deep reinforcement learning (ResNet-ACDRL) decision-making model to offload tasks. Task dependencies and resource availability are assessed by the DRL component, namely a ResNet-ACDRL model, which is utilized to dynamically learn and enhance task-offloading strategies. In order to ensure optimal task allocation across local, edge, and cloud computing resources, the FHO is then used to refine these learned policies. Here, the term \"policy\" refers to the strategy used by the system to decide the most suitable resource for task execution. This dual approach strategy drastically reduces energy usage and execution delays. The suggested framework outperforms existing methods, according to experimental data, especially when managing task interdependencies and a variety of computational loads. The proposed method has been shown to significantly improve time delay and energy consumption compared to existing methods.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103161"},"PeriodicalIF":2.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145418403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU/CUDA-Accelerated gradient growth optimizer for efficient complex numerical global optimization
Pub Date: 2025-10-10 | DOI: 10.1016/j.parco.2025.103160
Qingke Zhang, Wenliang Chen, Shuzhao Pang, Sichen Tao, Conglin Li, Xin Yin
Efficiently solving high-dimensional and complex numerical optimization problems remains a critical challenge in high-performance computing. This paper presents the GPU/CUDA-Accelerated Gradient Growth Optimizer (GGO)—a novel parallel metaheuristic algorithm that combines gradient-guided local search with GPU-enabled large-scale parallelism. Building upon the Growth Optimizer (GO), GGO incorporates a dimension-wise gradient-guiding strategy based on central difference approximations, which improves solution precision without requiring differentiable objective functions. To address the computational bottlenecks of high-dimensional problems, a hybrid CUDA-based framework is developed, integrating both fine-grained and coarse-grained parallel strategies to fully exploit GPU resources and minimize memory access latency. Extensive experiments on the CEC2017 and CEC2022 benchmark suites demonstrate the superior performance of GGO in terms of both convergence accuracy and computational speed. Compared to 49 state-of-the-art optimization algorithms, GGO achieves top-ranked results in 67% of test cases and delivers up to 7.8× speedup over its CPU-based counterpart. Statistical analyses using the Wilcoxon signed-rank test further confirm its robustness across 28 out of 29 functions in high-dimensional scenarios. Additionally, in-depth analysis reveals that GGO maintains high scalability and performance even as the problem dimension and population size increase, providing a generalizable solution for high-dimensional global optimization that is well-suited for parallel computing applications in scientific and engineering domains.
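The dimension-wise gradient-guiding strategy rests on central-difference approximations, which need only objective evaluations rather than analytic derivatives. The sketch below shows that idea in plain C++ on a sphere objective; the step size, difference width h, and iteration count are illustrative assumptions, and GGO applies such guidance inside a GPU-parallel population rather than to a single point.

```cpp
// Minimal sketch of dimension-wise central-difference gradient guidance:
// the gradient of a black-box objective is approximated coordinate by
// coordinate and used to nudge a candidate solution. The sphere
// objective, step size, and h are illustrative choices, not values
// from the paper.
#include <cstdio>
#include <vector>

double objective(const std::vector<double>& x) {        // f(x) = sum x_i^2
    double s = 0.0;
    for (double v : x) s += v * v;
    return s;
}

int main() {
    std::vector<double> x = {1.0, -2.0, 0.5};
    const double h = 1e-4, step = 0.1;

    for (int iter = 0; iter < 50; ++iter) {
        std::vector<double> grad(x.size());
        for (size_t d = 0; d < x.size(); ++d) {          // per-dimension estimate
            std::vector<double> xp = x, xm = x;
            xp[d] += h; xm[d] -= h;
            grad[d] = (objective(xp) - objective(xm)) / (2.0 * h);
        }
        for (size_t d = 0; d < x.size(); ++d)            // gradient-guided move
            x[d] -= step * grad[d];
    }
    std::printf("f(x) after guidance: %.6f\n", objective(x));
    return 0;
}
```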
{"title":"GPU/CUDA-Accelerated gradient growth optimizer for efficient complex numerical global optimization","authors":"Qingke Zhang , Wenliang Chen , Shuzhao Pang , Sichen Tao , Conglin Li , Xin Yin","doi":"10.1016/j.parco.2025.103160","DOIUrl":"10.1016/j.parco.2025.103160","url":null,"abstract":"<div><div>Efficiently solving high-dimensional and complex numerical optimization problems remains a critical challenge in high-performance computing. This paper presents the GPU/CUDA-Accelerated Gradient Growth Optimizer (GGO)—a novel parallel metaheuristic algorithm that combines gradient-guided local search with GPU-enabled large-scale parallelism. Building upon the Growth Optimizer (GO), GGO incorporates a dimension-wise gradient-guiding strategy based on central difference approximations, which improves solution precision without requiring differentiable objective functions. To address the computational bottlenecks of high-dimensional problems, a hybrid CUDA-based framework is developed, integrating both fine-grained and coarse-grained parallel strategies to fully exploit GPU resources and minimize memory access latency. Extensive experiments on the CEC2017 and CEC2022 benchmark suites demonstrate the superior performance of GGO in terms of both convergence accuracy and computational speed. Compared to 49 state-of-the-art optimization algorithms, GGO achieves top-ranked results in 67% of test cases and delivers up to 7.8× speedup over its CPU-based counterpart. Statistical analyses using the Wilcoxon signed-rank test further confirm its robustness across 28 out of 29 functions in high-dimensional scenarios. Additionally, in-depth analysis reveals that GGO maintains high scalability and performance even as the problem dimension and population size increase, providing a generalizable solution for high-dimensional global optimization that is well-suited for parallel computing applications in scientific and engineering domains.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"126 ","pages":"Article 103160"},"PeriodicalIF":2.1,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145271446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software acceleration of multi-user MIMO uplink detection on GPU
Pub Date: 2025-09-01 | DOI: 10.1016/j.parco.2025.103150
Ali Nada, Hazem Ismail Ali, Liang Liu, Yousra Alkabani
This paper explores GPU-accelerated block-wise decompositions for zero-forcing (ZF)-based QR and Cholesky methods applied to massive multiple-input multiple-output (MIMO) uplink detection algorithms. Three algorithms are evaluated: ZF with block Cholesky decomposition, ZF with block QR decomposition (QRD), and minimum mean square error (MMSE) with block Cholesky decomposition. Of the three, only the MMSE variant had been explored previously, and that work used standard Cholesky decomposition. Our approach achieves an 11% improvement over the previous GPU-accelerated MMSE study.
Through performance analysis, we observe a trade-off between precision and execution time. Reducing precision from FP64 to FP32 improves execution time but increases bit error rate (BER), with ZF-based QRD reducing execution time from 2.04 μs to 1.24 μs for a 128 × 8 MIMO size. The study also highlights that larger MIMO sizes, particularly 2048 × 32, require GPUs to fully utilize their computational and memory capabilities, especially under FP64 precision. In contrast, smaller matrices are compute-bound.
Our results recommend GPUs for larger MIMO sizes, as they offer the parallelism and memory resources necessary to efficiently handle the computational demands of next-generation networks. This work paves the way for scalable, GPU-based massive MIMO uplink detection systems.
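For readers unfamiliar with the linear-algebra core, zero-forcing detection amounts to solving the normal equations (H^H H) x = H^H y, which a Cholesky factorization handles efficiently. The sketch below is a minimal real-valued, scalar version of that step (actual MIMO channels are complex-valued and the paper works block-wise on the GPU); the 4 × 2 channel matrix and received vector are toy values.

```cpp
// Minimal real-valued sketch of Cholesky-based zero-forcing detection:
// solve (H^T H) x = H^T y via A = L L^T. Real MIMO systems are
// complex-valued and the paper uses block-wise GPU decompositions;
// this only illustrates the linear-algebra core. Sizes are toy values.
#include <cmath>
#include <cstdio>

int main() {
    const int m = 4, n = 2;                        // 4 antennas, 2 users
    double H[m][n] = {{1, 2}, {3, 1}, {0, 1}, {2, 2}};
    double y[m]   = {5, 5, 2, 6};                  // received vector

    // A = H^T H (n x n, symmetric positive definite), b = H^T y
    double A[n][n] = {}, b[n] = {};
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < m; ++k) A[i][j] += H[k][i] * H[k][j];
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < m; ++k) b[i] += H[k][i] * y[k];

    // Cholesky factorization A = L L^T (L lower triangular)
    double L[n][n] = {};
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j) {
            double s = A[i][j];
            for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? std::sqrt(s) : s / L[j][j];
        }

    // Forward solve L z = b, then back solve L^T x = z
    double z[n], x[n];
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int k = 0; k < i; ++k) s -= L[i][k] * z[k];
        z[i] = s / L[i][i];
    }
    for (int i = n - 1; i >= 0; --i) {
        double s = z[i];
        for (int k = i + 1; k < n; ++k) s -= L[k][i] * x[k];
        x[i] = s / L[i][i];
    }
    std::printf("ZF estimate: x = (%.3f, %.3f)\n", x[0], x[1]);
    return 0;
}
```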
{"title":"Software acceleration of multi-user MIMO uplink detection on GPU","authors":"Ali Nada , Hazem Ismail Ali , Liang Liu , Yousra Alkabani","doi":"10.1016/j.parco.2025.103150","DOIUrl":"10.1016/j.parco.2025.103150","url":null,"abstract":"<div><div>This paper presents the exploration of GPU-accelerated block-wise decompositions for zero-forcing (ZF) based QR and Cholesky methods applied to massive multiple-input multiple-output (MIMO) uplink detection algorithms. Three algorithms are evaluated: ZF with block Cholesky decomposition, ZF with block QR decomposition (QRD), and minimum mean square error (MMSE) with block Cholesky decomposition. The latter was the only one previously explored, but it used standard Cholesky decomposition. Our approach achieves an 11% improvement over the previous GPU-accelerated MMSE study.</div><div>Through performance analysis, we observe a trade-off between precision and execution time. Reducing precision from FP64 to FP32 improves execution time but increases bit error rate (BER), with ZF-based QRD reducing execution time from <span><math><mrow><mn>2</mn><mo>.</mo><mn>04</mn><mspace></mspace><mi>μ</mi><mi>s</mi></mrow></math></span> to <span><math><mrow><mn>1</mn><mo>.</mo><mn>24</mn><mspace></mspace><mi>μ</mi><mi>s</mi></mrow></math></span> for a 128 × 8 MIMO size. The study also highlights that larger MIMO sizes, particularly 2048 × 32, require GPUs to fully utilize their computational and memory capabilities, especially under FP64 precision. In contrast, smaller matrices are compute-bound.</div><div>Our results recommend GPUs for larger MIMO sizes, as they offer the parallelism and memory resources necessary to efficiently handle the computational demands of next-generation networks. This work paves the way for scalable, GPU-based massive MIMO uplink detection systems.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103150"},"PeriodicalIF":2.1,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144922663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enable cross-iteration parallelism for PIM-based graph processing with vertex-level synchronization
Pub Date: 2025-08-11 | DOI: 10.1016/j.parco.2025.103149
Xiang Zhao, Haitao Du, Yi Kang
Processing-in-memory (PIM) architectures have emerged as a promising solution for accelerating graph processing by enabling computation in memory and minimizing data movement. However, most existing PIM-based graph processing systems rely on the Bulk Synchronous Parallel (BSP) model, which frequently enforces global barriers that limit cross-iteration computational parallelism and introduce significant synchronization and communication overheads.
To address these limitations, we propose the Cross Iteration Parallel (CIP) model, a novel vertex-level synchronization approach that eliminates global barriers by independently tracking the synchronization states of vertices. The CIP model enables concurrent execution across iterations, enhancing computational parallelism, overlapping communication and computation, improving core utilization, and increasing resilience to workload imbalance. We implement the CIP model in a PIM-based graph processing system, GraphDF, which features a few specially designed function units to support vertex-level synchronization. Evaluated on a PyMTL3-based cycle-accurate simulator using four real-world graphs and four graph algorithms, CIP running on GraphDF achieves an average speedup of 1.8× and a maximum of 2.3× compared to Dalorex, the state-of-the-art PIM-based graph processing system.
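A minimal software sketch of the vertex-level synchronization idea, assuming a thread-per-vertex rendering rather than the paper's GraphDF hardware units: each vertex keeps an atomic count of the iterations it has completed and advances as soon as its in-neighbors have caught up, so no global barrier is ever executed. The graph, thread mapping, and iteration count are illustrative assumptions.

```cpp
// Sketch of vertex-level synchronization in the spirit of the CIP model
// (not the GraphDF hardware): instead of a global barrier per iteration,
// each vertex advances once all in-neighbors have finished the previous
// iteration, tracked with per-vertex atomic counters.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int iters = 4;
    // in_nbrs[v] = in-neighbors of v (a 4-vertex cycle here)
    std::vector<std::vector<int>> in_nbrs = {{3}, {0}, {1}, {2}};
    const int n = (int)in_nbrs.size();
    std::vector<std::atomic<int>> done(n);           // iterations finished per vertex
    for (auto& d : done) d.store(0);

    auto worker = [&](int v) {
        while (done[v].load() < iters) {
            int k = done[v].load();                  // iteration v wants to run next
            bool ready = true;
            for (int u : in_nbrs[v])                 // each in-neighbor must have finished k iterations
                if (done[u].load() < k) { ready = false; break; }
            if (!ready) { std::this_thread::yield(); continue; }
            // ... a real system would read versioned neighbor state from
            //     iteration k-1 and apply the per-vertex update here ...
            done[v].store(k + 1);                    // publish completion, no global barrier
        }
    };

    std::vector<std::thread> pool;
    for (int v = 0; v < n; ++v) pool.emplace_back(worker, v);
    for (auto& t : pool) t.join();
    std::printf("all %d vertices finished %d iterations without global barriers\n", n, iters);
    return 0;
}
```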
{"title":"Enable cross-iteration parallelism for PIM-based graph processing with vertex-level synchronization","authors":"Xiang Zhao, Haitao Du, Yi Kang","doi":"10.1016/j.parco.2025.103149","DOIUrl":"10.1016/j.parco.2025.103149","url":null,"abstract":"<div><div>Processing-in-memory (PIM) architectures have emerged as a promising solution for accelerating graph processing by enabling computation in memory and minimizing data movement. However, most existing PIM-based graph processing systems rely on the Bulk Synchronous Parallel (BSP) model, which frequently enforces global barriers that limit cross-iteration computational parallelism and introduce significant synchronization and communication overheads.</div><div>To address these limitations, we propose the Cross Iteration Parallel (CIP) model, a novel vertex-level synchronization approach that eliminates global barriers by independently tracking the synchronization states of vertices. The CIP model enables concurrent execution across iterations, enhancing computational parallelism, overlapping communication and computation, improving core utilization, and increasing resilience to workload imbalance. We implement the CIP model in a PIM-based graph processing system, GraphDF, which features a few specially designed function units to support vertex-level synchronization. Evaluated on a PyMTL3-based cycle-accurate simulator using four real-world graphs and four graph algorithms, CIP running on GraphDF achieves an average speedup of 1.8<span><math><mo>×</mo></math></span> and a maximum of 2.3<span><math><mo>×</mo></math></span> compared to Dalorex, the state-of-the-art PIM-based graph processing system.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103149"},"PeriodicalIF":2.1,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144860808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ALBBA: An efficient ALgebraic Bypass BFS Algorithm on long vector architectures
Pub Date: 2025-07-11 | DOI: 10.1016/j.parco.2025.103147
Yuyao Niu, Marc Cacas
Breadth First Search (BFS) is a fundamental algorithm in scientific computing, databases, and network analysis applications. In the algebraic BFS paradigm, each BFS iteration is expressed as a sparse matrix–vector multiplication, allowing BFS to be accelerated and analyzed through well-established linear algebra primitives. Although much effort has been made to optimize algebraic BFS on parallel platforms such as CPUs, GPUs, and distributed memory systems, vector architectures that exploit Single Instruction Multiple Data (SIMD) parallelism, particularly with their high performance on sparse workloads, remain relatively underexplored for BFS.
In this paper, we propose the ALgebraic Bypass BFS Algorithm (ALBBA), a novel and efficient algebraic BFS implementation optimized for long vector architectures. ALBBA utilizes a customized variant of the SELL-C-σ data structure to fully exploit the SIMD capabilities. By integrating a vectorization-friendly search method alongside a two-level bypass strategy, we enhance both sparse matrix-sparse vector multiplication (SpMSpV) and sparse matrix-dense vector multiplication (SpMV) algorithms, which are crucial for algebraic BFS operations. We further incorporate merge primitives and adopt an efficient selection method for each BFS iteration. Our experiments on an NEC VE20B processor demonstrate that ALBBA achieves average speedups of 3.91×, 2.88×, and 1.46× over Enterprise, GraphBLAST, and Gunrock running on an NVIDIA H100 GPU, respectively.
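In the algebraic formulation this work builds on, one BFS level is a sparse matrix-sparse vector product over a Boolean semiring, masked by the set of already-visited vertices. The sketch below shows that formulation with plain scalar CSR code; it is not ALBBA's SELL-C-σ vectorized kernel, and the graph and source vertex are toy values.

```cpp
// Minimal sketch of algebraic BFS (scalar CSR code, not the SELL-C-sigma
// kernels of ALBBA): each level is a sparse matrix-sparse vector product
// of the adjacency matrix with the frontier over a Boolean semiring,
// masked by the visited set.
#include <cstdio>
#include <vector>

int main() {
    // CSR adjacency of a small undirected graph: edges 0-1, 0-2, 1-3, 2-3, 3-4
    std::vector<int> rowptr = {0, 2, 4, 6, 9, 10};
    std::vector<int> colidx = {1, 2, 0, 3, 0, 3, 1, 2, 4, 3};
    const int n = 5, src = 0;

    std::vector<int> level(n, -1);
    std::vector<int> frontier = {src};
    level[src] = 0;

    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;
        for (int u : frontier)                      // y = A * frontier (Boolean semiring)
            for (int e = rowptr[u]; e < rowptr[u + 1]; ++e) {
                int v = colidx[e];
                if (level[v] == -1) {               // mask out visited vertices
                    level[v] = depth;
                    next.push_back(v);
                }
            }
        frontier.swap(next);                        // new frontier = newly reached vertices
    }
    for (int v = 0; v < n; ++v) std::printf("vertex %d: level %d\n", v, level[v]);
    return 0;
}
```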
{"title":"ALBBA: An efficient ALgebraic Bypass BFS Algorithm on long vector architectures","authors":"Yuyao Niu, Marc Cacas","doi":"10.1016/j.parco.2025.103147","DOIUrl":"10.1016/j.parco.2025.103147","url":null,"abstract":"<div><div>Breadth First Search (BFS) is a fundamental algorithm in scientific computing, databases, and network analysis applications. In the algebraic BFS paradigm, each BFS iteration is expressed as a sparse matrix–vector multiplication, allowing BFS to be accelerated and analyzed through well-established linear algebra primitives. Although much effort has been made to optimize algebraic BFS on parallel platforms such as CPUs, GPUs, and distributed memory systems, vector architectures that exploit Single Instruction Multiple Data (SIMD) parallelism, particularly with their high performance on sparse workloads, remain relatively underexplored for BFS.</div><div>In this paper, we propose the ALgebraic Bypass BFS Algorithm (ALBBA), a novel and efficient algebraic BFS implementation optimized for long vector architectures. ALBBA utilizes a customized variant of the SELL-<span><math><mi>C</mi></math></span>-<span><math><mi>σ</mi></math></span> data structure to fully exploit the SIMD capabilities. By integrating a vectorization-friendly search method alongside a two-level bypass strategy, we enhance both sparse matrix-sparse vector multiplication (SpMSpV) and sparse matrix-dense vector multiplication (SpMV) algorithms, which are crucial for algebraic BFS operations. We further incorporate merge primitives and adopt an efficient selection method for each BFS iteration. Our experiments on an NEC VE20B processor demonstrate that ALBBA achieves average speedups of 3.91<span><math><mo>×</mo></math></span> , 2.88<span><math><mo>×</mo></math></span> , and 1.46<span><math><mo>×</mo></math></span> over Enterprise, GraphBLAST, and Gunrock running on an NVIDIA H100 GPU, respectively.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103147"},"PeriodicalIF":2.0,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144634453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Java to create and analyze models of parallel computing systems
Pub Date: 2025-06-14 | DOI: 10.1016/j.parco.2025.103146
Harish Padmanaban, Nurkasym Arkabaev, Maher Ali Rusho, Vladyslav Kozub, Yurii Kozub
The purpose of this study is to develop optimal solutions for models of parallel computing systems using the Java language. Programs were written for each of the examined models: the parallel sorting code outputs a sorted array of random numbers; the parallel data-processing code reports the processing time and the first elements of a list of squared numbers; and the asynchronous request-processing code prints a completion message for each task after a slight delay. The main results include optimization methods for algorithms and processes, such as dividing tasks into subtasks, using non-blocking algorithms, effective memory management, and load balancing, together with diagrams and a comparison of these methods by their descriptions, implementation examples, and advantages. Various specialized libraries were also analyzed to improve the performance and scalability of the models. The work showed substantial improvements in response time, bandwidth, and resource efficiency in parallel computing systems. Scalability and load analyses demonstrated how the systems respond to increases in data volume or the number of threads, and profiling tools were used to analyze performance in detail and identify bottlenecks, leading to improvements in the architecture and implementation of the models. The results emphasize the importance of choosing appropriate methods and tools for optimizing parallel computing systems, which can substantially improve their performance and efficiency.
{"title":"Using Java to create and analyze models of parallel computing systems","authors":"Harish Padmanaban , Nurkasym Arkabaev , Maher Ali Rusho , Vladyslav Kozub , Yurii Kozub","doi":"10.1016/j.parco.2025.103146","DOIUrl":"10.1016/j.parco.2025.103146","url":null,"abstract":"<div><div>The purpose of the study is to develop optimal solutions for models of parallel computing systems using the Java language. During the study, programs were written for the examined models of parallel computing systems. The result of the parallel sorting code is the output of a sorted array of random numbers. When processing data in parallel, the time spent on processing and the first elements of the list of squared numbers are displayed. When processing requests asynchronously, processing completion messages are displayed for each task with a slight delay. The main results include the development of optimization methods for algorithms and processes, such as the division of tasks into subtasks, the use of non-blocking algorithms, effective memory management, and load balancing, as well as the construction of diagrams and comparison of these methods by characteristics, including descriptions, implementation examples, and advantages. In addition, various specialized libraries were analyzed to improve the performance and scalability of the models. The results of the work performed showed a substantial improvement in response time, bandwidth, and resource efficiency in parallel computing systems. Scalability and load analysis assessments were conducted, demonstrating how the system responds to an increase in data volume or the number of threads. Profiling tools were used to analyze performance in detail and identify bottlenecks in models, which improved the architecture and implementation of parallel computing systems. The obtained results emphasize the importance of choosing the right methods and tools for optimizing parallel computing systems, which can substantially improve their performance and efficiency.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"125 ","pages":"Article 103146"},"PeriodicalIF":2.0,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144322927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FPGA-based accelerator for YOLOv5 object detection with optimized computation and data access for edge deployment
Pub Date: 2025-05-05 | DOI: 10.1016/j.parco.2025.103138
Wei Qian, Zhengwei Zhu, Chenyang Zhu, Yanping Zhu
In the realm of object detection, advancements in convolutional neural networks have been substantial. However, their high computational and data access demands complicate the deployment of these algorithms on edge devices. To mitigate these challenges, field-programmable gate arrays have emerged as an ideal hardware platform for executing the parallel computations inherent in convolutional neural networks, owing to their low power consumption and rapid response capabilities. We have developed a field-programmable gate array-based accelerator for the You Only Look Once version 5 (YOLOv5) object detection network, implemented using Verilog Hardware Description Language on the Xilinx XCZU15EG chip. This accelerator efficiently processes the convolutional layers, batch normalization fusion layers, and tensor addition operations of the YOLOv5 network. Our architecture separates the convolution computations into two computing units: multiplication and addition. The addition operations are significantly accelerated by the introduction of compressor adders and ternary adder trees. Additionally, off-chip bandwidth pressure is alleviated through the use of dual-input single-output buffers and dedicated data access units. Experimental results demonstrate that the power consumption of the accelerator is 13.021 watts at an operating frequency of 200 megahertz, and that our accelerator outperforms Amazon Web Services Graviton2 central processing units and Jetson Nano graphics processing units. Ablation experiments validate the enhancements provided by our innovative designs. Ultimately, our approach significantly boosts the inference speed of the YOLOv5 network, with improvements of 61.88%, 69.1%, 59.36%, 64.07%, and 65.92%, thereby dramatically enhancing the performance of the accelerator and surpassing existing methods.
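One of the layer types the accelerator handles is the batch-normalization fusion layer. The generic folding rule, shown below under the assumption of a BN layer that directly follows a convolution, rescales the convolution weights by γ/√(σ²+ε) and adjusts the bias so that only a plain convolution remains to be executed; the shapes and parameter values are illustrative, not taken from YOLOv5.

```cpp
// Generic sketch of batch-normalization fusion (values and shapes are
// illustrative, not from the YOLOv5 model): a BN layer with parameters
// (gamma, beta, mean, var) following a convolution is folded into the
// convolution's weights and bias.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int out_ch = 2, weights_per_ch = 3;                  // toy sizes
    std::vector<double> w = {0.2, -0.5, 0.1, 0.7, 0.3, -0.2};  // [out_ch][weights_per_ch]
    std::vector<double> bias = {0.05, -0.10};
    std::vector<double> gamma = {1.1, 0.9}, beta = {0.02, -0.03};
    std::vector<double> mean = {0.4, -0.1}, var = {0.25, 0.09};
    const double eps = 1e-5;

    for (int c = 0; c < out_ch; ++c) {
        double scale = gamma[c] / std::sqrt(var[c] + eps);     // per-channel scale
        for (int k = 0; k < weights_per_ch; ++k)
            w[c * weights_per_ch + k] *= scale;                // w' = w * scale
        bias[c] = (bias[c] - mean[c]) * scale + beta[c];       // b' = (b - mu)*scale + beta
    }
    for (int c = 0; c < out_ch; ++c)
        std::printf("channel %d: fused bias %.4f\n", c, bias[c]);
    return 0;
}
```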
{"title":"FPGA-based accelerator for YOLOv5 object detection with optimized computation and data access for edge deployment","authors":"Wei Qian , Zhengwei Zhu , Chenyang Zhu , Yanping Zhu","doi":"10.1016/j.parco.2025.103138","DOIUrl":"10.1016/j.parco.2025.103138","url":null,"abstract":"<div><div>In the realm of object detection, advancements in convolutional neural networks have been substantial. However, their high computational and data access demands complicate the deployment of these algorithms on edge devices. To mitigate these challenges, field-programmable gate arrays have emerged as an ideal hardware platform for executing the parallel computations inherent in convolutional neural networks, owing to their low power consumption and rapid response capabilities. We have developed a field-programmable gate array-based accelerator for the You Only Look Once version 5 (YOLOv5) object detection network, implemented using Verilog Hardware Description Language on the Xilinx XCZU15EG chip. This accelerator efficiently processes the convolutional layers, batch normalization fusion layers, and tensor addition operations of the Yolov5 network. Our architecture segregates the convolution computations into two computing units: multiplication and addition. The addition operations are significantly accelerated by the introduction of compressor adders and ternary adder trees. Additionally, off-chip bandwidth pressure is alleviated through the use of dual-input single-output buffers and dedicated data access units. Experimental results demonstrate that the power consumption of the accelerator is 13.021 watts at a central frequency of 200 megahertz. Experiment results indicate that our accelerator outperforms Amazon Web Services Graviton2 central processing units and Jetson Nano graphics processing units. Ablation experiments validate the enhancements provided by our innovative designs. Ultimately, our approach significantly boosts the inference speed of the Yolov5 network, with improvements of 61.88%, 69.1%, 59.36%, 64.07%, and 65.92%, thereby dramatically enhancing the performance of the accelerator and surpassing existing methods.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103138"},"PeriodicalIF":2.0,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143912054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EESF: Energy-efficient scheduling framework for deadline-constrained workflows with computation speed estimation method in cloud
Pub Date: 2025-05-05 | DOI: 10.1016/j.parco.2025.103139
Rupinder Kaur, Gurjinder Kaur, Major Singh Goraya
The substantial amount of energy consumed by rapidly growing cloud data centers is a major hindrance to sustainable cloud computing. This paper therefore proposes a scheduling framework named EESF that aims to minimize the energy consumption and makespan of workflow execution under deadline and dependency constraints. The novel aspects of the proposed EESF are as follows: 1) It first estimates the computation speed requirements of the entire workflow application before execution begins, and then estimates the computation speed requirements of individual tasks dynamically during execution. 2) Unlike existing approaches that mainly assign tasks to virtual machines (VMs) with lower energy consumption or use DVFS to lower the frequency or voltage of hosts/VMs at the cost of a longer makespan, EESF considers the degree of dependency of the tasks along with the estimated speed for task-VM assignment. 3) Based on the observation that scheduling dependent tasks on the same VM is not always energy-efficient, a new concept of virtual task clustering is introduced to schedule tasks with dependencies in an energy-efficient manner. 4) EESF deploys VMs dynamically according to the computation speed required by the tasks, preventing over-provisioning or under-provisioning of computational power. 5) In general, task reassignment causes large data transfers that also consume energy, but EESF reassigns tasks to more energy-efficient VMs running on the same host, reducing the data transfer time to zero. Experiments performed using four real-world scientific workflows and 10 random workflows show that EESF reduces energy consumption by 6%–44% compared with related algorithms while significantly reducing the makespan.
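The speed-estimation idea in point 1) can be read as dividing a task's remaining work by the time left before its sub-deadline and then choosing a VM that meets that speed at the lowest estimated energy. The sketch below illustrates that selection logic only; the VM catalogue, power model, and task numbers are hypothetical and not EESF's actual formulas.

```cpp
// Hedged sketch of speed estimation and VM selection (hypothetical VM
// catalogue and power model, not EESF's formulas): required speed =
// task length / time remaining before the sub-deadline; among VMs that
// are fast enough, pick the one with the lowest estimated energy.
#include <cstdio>
#include <vector>

struct VmType { const char* name; double mips; double watts; };

int main() {
    std::vector<VmType> catalogue = {
        {"small", 1000, 60}, {"medium", 2000, 95}, {"large", 4000, 160}};
    double task_length_mi = 9000;        // millions of instructions
    double time_to_deadline_s = 6.0;     // seconds left before sub-deadline

    double required_mips = task_length_mi / time_to_deadline_s;   // 1500 MIPS

    const VmType* best = nullptr;
    for (const auto& vm : catalogue)
        if (vm.mips >= required_mips &&                           // fast enough
            (!best || vm.watts * (task_length_mi / vm.mips) <     // lowest energy = power * runtime
                      best->watts * (task_length_mi / best->mips)))
            best = &vm;

    if (best)
        std::printf("required %.0f MIPS -> choose %s (%.1f J estimated)\n",
                    required_mips, best->name,
                    best->watts * (task_length_mi / best->mips));
    return 0;
}
```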
{"title":"EESF: Energy-efficient scheduling framework for deadline-constrained workflows with computation speed estimation method in cloud","authors":"Rupinder Kaur, Gurjinder Kaur, Major Singh Goraya","doi":"10.1016/j.parco.2025.103139","DOIUrl":"10.1016/j.parco.2025.103139","url":null,"abstract":"<div><div>Substantial amount of energy consumed by rapidly growing cloud data centers is a major hindrance to sustainable cloud computing. Therefore, this paper proposes a scheduling framework named EESF aiming at minimizing the energy consumption and makespan of workflow execution under deadline and dependency constraints. The novel aspects of the proposed EESF are outlined as follows: 1) it first estimates the computation speed requirements of the entire workflow application before beginning the execution. Then, it estimates the computation speed requirements of individual tasks dynamically during execution. 2) Different from existing approaches that mainly assign tasks to virtual machines (VMs) with lower energy consumption or use DVFS to lower the frequency or voltage of hosts/VMs leading to longer makespan, EESF considers the degree of dependency of the tasks along with estimated speed for task-VM assignment. 3) Based on the fact that scheduling dependent tasks on same VM is not always energy-efficient, a new concept of virtual task clustering is introduced to schedule the tasks with dependencies in an energy-efficient manner. 4) EESF deploys VMs dynamically as per the necessary computation speed requirements of the tasks to prevent over-provisioning/under-provisioning of computational power. 5) In general, task reassignment causes huge data transfer which also consumes energy, but EESF reassigns tasks to more-energy efficient VMs running on the same host, thereby zeroing the data transfer time. Experiments performed using four real-world scientific workflows and 10 random workflows illustrate that EESF reduces energy consumption by 6%-44% than related algorithms while significantly reducing the makespan.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103139"},"PeriodicalIF":2.0,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143935399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-level parallelism optimization for two-dimensional convolution vectorization method on multi-core vector accelerator
Pub Date: 2025-04-29 | DOI: 10.1016/j.parco.2025.103137
Siyang Xing, Youmeng Li, Zikun Deng, Qijun Zheng, Zeyu Lu, Qinglin Wang
The widespread application of convolutional neural networks across diverse domains has highlighted the growing significance of accelerating convolutional computations. In this work, we design a multi-level parallelism optimization method for a direct convolution vectorization algorithm based on a channel-first data layout on a multi-core vector accelerator. Within a single core, the method computes from the input rows and weight columns and evaluates more elements simultaneously, effectively hiding instruction latency and improving instruction-level parallelism. The method also substantially eliminates the data overlap caused by sliding convolution windows. Across cores, the data flow is optimized with different data-reuse schemes for different situations. Experimental results show that the computational efficiency on multiple cores can be greatly improved, reaching up to 80.2%. For the typical network ResNet18, a performance acceleration of 4.42-5.63 times is achieved compared with the existing method on the accelerator.
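A channel-first layout keeps the input channels of each pixel contiguous, so the innermost loop of a direct convolution runs over data a vector unit can load and multiply in one instruction. The sketch below shows that loop structure in scalar C++ under that layout assumption; the shapes are toy values, and it omits the accelerator-specific row/column blocking and multi-core data reuse the paper describes.

```cpp
// Sketch of a direct convolution with input channels stored contiguously
// (a "channel-first" layout assumption, not taken from the paper): the
// innermost loop runs over contiguous channel data, the dimension a
// vector unit would process per instruction. No padding, stride 1.
#include <cstdio>
#include <vector>

int main() {
    const int H = 4, W = 4, C = 8;       // input height, width, channels
    const int R = 3, S = 3, K = 2;       // kernel size, output channels
    const int OH = H - R + 1, OW = W - S + 1;

    std::vector<float> in(H * W * C, 1.0f);          // layout: [h][w][c], c contiguous
    std::vector<float> wt(K * R * S * C, 0.5f);      // layout: [k][r][s][c]
    std::vector<float> out(OH * OW * K, 0.0f);       // layout: [oh][ow][k]

    for (int oh = 0; oh < OH; ++oh)
        for (int ow = 0; ow < OW; ++ow)
            for (int k = 0; k < K; ++k) {
                float acc = 0.0f;
                for (int r = 0; r < R; ++r)
                    for (int s = 0; s < S; ++s) {
                        const float* ip = &in[((oh + r) * W + (ow + s)) * C];
                        const float* wp = &wt[((k * R + r) * S + s) * C];
                        for (int c = 0; c < C; ++c)   // contiguous: maps to SIMD lanes
                            acc += ip[c] * wp[c];
                    }
                out[(oh * OW + ow) * K + k] = acc;
            }
    std::printf("out[0] = %.1f (expected %.1f)\n", out[0], 0.5f * R * S * C);
    return 0;
}
```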
{"title":"Multi-level parallelism optimization for two-dimensional convolution vectorization method on multi-core vector accelerator","authors":"Siyang Xing , Youmeng Li , Zikun Deng , Qijun Zheng , Zeyu Lu , Qinglin Wang","doi":"10.1016/j.parco.2025.103137","DOIUrl":"10.1016/j.parco.2025.103137","url":null,"abstract":"<div><div>The widespread application of convolutional neural network across diverse domains has highlighted the growing significance of accelerating convolutional computations. In this work, we design a multi-level parallelism optimization method for direct convolution vectorization algorithm based on a channel-first data layout on a multi-core vector accelerator. This method calculates based on the input row and weight column in a single core, and achieves the simultaneous computation of more elements, thereby effectively hiding the latency of instructions and improving the degree of parallelism at instruction-level. This method can also substantially eliminates data overlap caused by convolutional windows sliding. Among multiple cores, the data flow is optimized with various data reuse methods for different situations. Experimental results show that the computational efficiency on multi-core can be improved greatly, up to 80.2%. For the typical network ResNet18, compared with existing method on the accelerator, a performance acceleration of 4.42-5.63 times can be achieved.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"124 ","pages":"Article 103137"},"PeriodicalIF":2.0,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143894768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}