Pub Date: 2024-03-09; DOI: 10.1016/S0743-7315(24)00038-8
Title: Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues). Journal of Parallel and Distributed Computing, vol. 188, Article 104874. PDF: https://www.sciencedirect.com/science/article/pii/S0743731524000388/pdfft?md5=ef4b0c5d74636a75840725db69cf440c&pid=1-s2.0-S0743731524000388-main.pdf
Pub Date: 2024-03-06; DOI: 10.1016/j.jpdc.2024.104869
Tian Chen, Yu-an Tan, Zheng Zhang, Nan Luo, Bin Li, Yuanzhang Li
As convolution layers have been shown to be the most time-consuming operations in convolutional neural network (CNN) algorithms, many efficient CNN accelerators have been designed to boost the performance of convolution operations. Previous work on CNN acceleration usually uses fixed design variables for diverse convolutional layers, which leads to inefficient data movement and low utilization of computing resources. We tackle this issue by proposing a flexible dataflow optimization method with design variable estimation for different layers. The optimization method first narrows the design space using a priori constraints, and then enumerates all legal solutions to select the optimal design variables. We demonstrate the effectiveness of the proposed optimization method by implementing representative CNN models (VGG-16, ResNet-18 and MobileNet V1) on Enflame Technology's programmable CNN accelerator, the General Computing Unit (GCU). The results indicate that our optimization can significantly enhance the throughput of the convolution layers in ResNet, VGG and MobileNet on GCU, with an improvement of up to 1.84×. Furthermore, it achieves up to a 2.08× improvement in GCU utilization specifically for the convolution layers of ResNet.
Title: Dataflow optimization with layer-wise design variables estimation method for enflame CNN accelerators. Journal of Parallel and Distributed Computing, vol. 189, Article 104869.
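The narrow-then-enumerate procedure described in this abstract can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: the two design variables (channel tile sizes), the candidate values, the buffer capacity, and the cost model are hypothetical and are not the paper's design variables or GCU parameters.

```python
from itertools import product

def best_tiling(out_channels, in_channels, buffer_words, candidates=(8, 16, 32, 64)):
    """Pick (tile_oc, tile_ic) minimizing a toy data-movement cost.

    All quantities are illustrative assumptions, not GCU parameters.
    Returns (cost, tile_oc, tile_ic), or None if no candidate is legal.
    """
    best = None
    for t_oc, t_ic in product(candidates, repeat=2):
        # A priori constraints: tiles must divide the layer and fit in the buffer.
        if out_channels % t_oc or in_channels % t_ic:
            continue
        if t_oc * t_ic > buffer_words:
            continue
        # Toy cost: inputs are reloaded once per output tile,
        # partial sums are moved once per input tile.
        cost = (in_channels * (out_channels // t_oc)
                + out_channels * (in_channels // t_ic))
        if best is None or cost < best[0]:
            best = (cost, t_oc, t_ic)
    return best

print(best_tiling(out_channels=64, in_channels=64, buffer_words=1024))
```

Running the search per layer, rather than once for the whole network, is what makes the choice layer-wise: a layer with different channel counts can end up with a different legal set and a different optimum.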
Pub Date: 2024-03-04; DOI: 10.1016/j.jpdc.2024.104868
Emerson A. Macedo, Alba C.M.A. Melo
The definition of protein structures is currently an important research topic in molecular biology, since there is a direct relationship between the function of the protein in the organism and the 3D geometric configuration it adopts. The transformations that occur in the protein structure from the 1D configuration to the 3D form are called protein folding. Ab initio protein folding methods use physical forces to model the interactions among the atoms that compose the protein. To accelerate those methods, parallel tools such as NAMD were proposed. In this paper, we propose two contributions to parallel protein folding simulations: (a) adaptive patch grid (APG) and (b) the addition of atomic burials (AB) to the traditional forces used in the simulation. With APG, we are able to adapt the simulation box (patch grid) to the current shape of the protein during the folding process. AB forces relate the 3D protein structure to its geometric center and are adequate for modeling globular proteins. Thus, adding AB to the forces used in parallel protein folding potentially increases the quality of the result for this class of proteins. APG and AB were implemented in NAMD and tested in supercomputer environments. Our results show that, with APG, we are able to reduce the execution time of the folding simulation of protein 4LNZ (5,714 atoms, 15 million time steps) from 12 hours and 36 minutes to 11 hours and 8 minutes, using 16 nodes (256 CPU cores). We also show that our APG+AB strategy was successfully used in a realistic protein folding simulation (1.7 billion time steps).
Title: Adaptive patch grid strategy for parallel protein folding using atomic burials with NAMD. Journal of Parallel and Distributed Computing, vol. 189, Article 104868.
Pub Date: 2024-02-20; DOI: 10.1016/j.jpdc.2024.104866
Mohammed M. Alani
With the promise of higher throughput and better response times, 6G networks are a significant enabler for the evolution of smart cities. The rapidly growing reliance on connected devices within the smart city context encourages malicious actors to target these devices to achieve various malicious goals. In this paper, we present a novel defense technique that creates a cloud-based virtualized honeypot/twin designed to receive malicious traffic through an edge-based, machine learning-enabled detection system. The proposed system performs early identification of malicious traffic at a software-defined-network-enabled edge routing point to divert that traffic away from the 6G-enabled smart city endpoints. Testing of the proposed system showed an accuracy exceeding 99.8%, with an F1 score of 0.9984.
Title: HoneyTwin: Securing smart cities with machine learning-enabled SDN edge and cloud-based honeypots. Journal of Parallel and Distributed Computing, vol. 188, Article 104866.
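For readers unfamiliar with the metric reported in this abstract, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion counts; the counts are made up for illustration and are not taken from the paper's evaluation.

```python
def f1_score(tp, fp, fn):
    """F1 = 2PR / (P + R), from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up confusion counts, for illustration only.
print(round(f1_score(tp=90, fp=10, fn=10), 4))
```

Unlike accuracy, F1 ignores true negatives, which is why it is commonly reported alongside accuracy when benign traffic far outnumbers malicious traffic.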
Pub Date: 2024-02-18; DOI: 10.1016/j.jpdc.2024.104867
Wenjie Tang, Yiping Yao, Lizhen Ou, Kai Chen
Publish–subscribe communication is a fundamental service used for message passing between decoupled applications in distributed simulation. When abundant unnecessary data transfer is introduced, interest-matching services are needed to filter irrelevant message traffic. Frequent demands during simulation execution make interest matching a bottleneck as simulation scale increases. Contemporary algorithms built for serial processing inadequately leverage multicore processor-based parallel resources, and existing parallel algorithmic improvements are insufficient for large-scale simulations. Therefore, we propose a hierarchical sort-based parallel algorithm for dynamic interest matching that embeds all update and subscription regions into two full binary trees, thereby transforming the region-matching task into one of node matching. It utilizes the association between adjacent nodes and the hierarchical relation between parent‒child nodes to eliminate redundant operations, and achieves incremental parallel matching that only compares changed regions. We analyze the time and space complexity of this process. The new algorithm performs better and is more scalable than state-of-the-art algorithms.
Title: Hierarchical sort-based parallel algorithm for dynamic interest matching. Journal of Parallel and Distributed Computing, vol. 188, Article 104867.
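As background for this entry, the sketch below shows plain serial sort-based interest matching in one dimension: region endpoints are sorted and swept once, and every update region is paired with the subscription regions active at that point. This is the classic baseline idea only, not the hierarchical tree-based parallel algorithm the paper proposes; the function name and the dictionary-based region format are assumptions made for illustration.

```python
def sort_based_matching(updates, subscriptions):
    """Return the set of (update_id, subscription_id) pairs whose 1-D
    closed intervals [lo, hi] overlap, using a single endpoint sweep."""
    events = []
    for uid, (lo, hi) in updates.items():
        events.append((lo, 0, "U", uid))   # flag 0 = start, sorts before 1 = end
        events.append((hi, 1, "U", uid))
    for sid, (lo, hi) in subscriptions.items():
        events.append((lo, 0, "S", sid))
        events.append((hi, 1, "S", sid))
    events.sort()

    active_u, active_s, matches = set(), set(), set()
    for _, is_end, kind, rid in events:
        if is_end:
            (active_u if kind == "U" else active_s).discard(rid)
        elif kind == "U":
            # A starting update overlaps every subscription still open.
            matches.update((rid, sid) for sid in active_s)
            active_u.add(rid)
        else:
            # A starting subscription overlaps every update still open.
            matches.update((uid, rid) for uid in active_u)
            active_s.add(rid)
    return matches

updates = {"u1": (0, 5), "u2": (6, 9)}
subscriptions = {"s1": (4, 7), "s2": (10, 12)}
print(sorted(sort_based_matching(updates, subscriptions)))
```

Because start events sort before end events at the same coordinate, intervals that merely touch are reported as overlapping. Multi-dimensional regions are usually handled by matching each dimension and intersecting the results.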
Pub Date: 2024-02-15; DOI: 10.1016/j.jpdc.2024.104863
Anne Benoit, Thomas Herault, Lucas Perotin, Yves Robert, Frédéric Vivien
This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations or fair-sharing the bandwidth across them (FairShare). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms and some more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely steady-state windows, which enables us to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. The main conclusion is that two of our simple and low-complexity greedy strategies significantly outperform serialization, FairShare and I/O-Sets, and we recommend that the I/O community implement them for further assessment.
Title: Revisiting I/O bandwidth-sharing strategies for HPC applications. Journal of Parallel and Distributed Computing, vol. 188, Article 104863.
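The two baseline approaches named in this abstract, serializing concurrent I/O operations and fair-sharing the bandwidth, can be compared with a small sketch. It assumes that all operations are posted at time 0 and that the I/O volumes and total bandwidth (made-up numbers below) are known; it models neither I/O-Sets nor the paper's new strategies.

```python
def serialized_finish_times(volumes, bandwidth):
    """Operations run one at a time, in posting order, at full bandwidth."""
    finish, t = [], 0.0
    for v in volumes:
        t += v / bandwidth
        finish.append(t)
    return finish

def fair_share_finish_times(volumes, bandwidth):
    """All pending operations split the bandwidth equally at every instant."""
    finish = [0.0] * len(volumes)
    order = sorted(range(len(volumes)), key=lambda i: volumes[i])
    t, done = 0.0, 0.0  # done = volume already transferred by each active operation
    for k, i in enumerate(order):
        active = len(volumes) - k
        # The smallest remaining operation ends after moving (volumes[i] - done)
        # more units at rate bandwidth / active.
        t += (volumes[i] - done) * active / bandwidth
        done = volumes[i]
        finish[i] = t
    return finish

volumes = [10, 30]  # made-up I/O volumes
print(serialized_finish_times(volumes, bandwidth=10))
print(fair_share_finish_times(volumes, bandwidth=10))
```

In this toy case both policies finish the last operation at the same time, but fair sharing delays the small operation, which hints at why the choice of strategy affects per-application objectives such as minimum yield.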
Pub Date: 2024-02-12; DOI: 10.1016/S0743-7315(24)00023-6
Title: Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues). Journal of Parallel and Distributed Computing, vol. 187, Article 104859. PDF: https://www.sciencedirect.com/science/article/pii/S0743731524000236/pdfft?md5=8661326c859cab793505056ef1edee51&pid=1-s2.0-S0743731524000236-main.pdf
Pub Date: 2024-02-08; DOI: 10.1016/j.jpdc.2024.104855
Ricardo Quislant, Eladio Gutierrez, Oscar Plata
Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, and economics. Matrix Profile, a state-of-the-art algorithm for time series analysis, finds the most similar and dissimilar subsequences in a time series exactly and in deterministic time. Matrix Profile has low arithmetic intensity and operates on large amounts of time series data, which can be an issue in terms of memory requirements. On the other hand, Hardware Transactional Memory (HTM) is an alternative optimistic synchronization method that executes transactions speculatively in parallel while keeping track of memory accesses to detect and resolve conflicts.
This work evaluates one of the best implementations of Matrix Profile, exploring multiple multiprocessor variants and proposing new implementations that consider a variety of synchronization methods (HTM, locks, barriers) as well as algorithm organizations. We analyze these variants using real datasets, both short and large, in terms of speedup and memory requirements, the latter being a major issue when dealing with very large time series. The experimental evaluation shows that our proposals can achieve up to 100× speedup over the sequential algorithm for 128 threads, and up to 3× over the baseline, while keeping memory requirements low and even independent of the number of threads.
Title: Exploring multiprocessor approaches to time series analysis. Journal of Parallel and Distributed Computing, vol. 188, Article 104855. PDF: https://www.sciencedirect.com/science/article/pii/S0743731524000194/pdfft?md5=a25b14cc13a327c9c4b6c5f9abde8126&pid=1-s2.0-S0743731524000194-main.pdf
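To make concrete what a matrix profile contains, here is a brute-force, single-threaded sketch: for every subsequence of length m it records the z-normalized Euclidean distance to its nearest non-trivial neighbor. It is not the optimized implementation evaluated in the paper, and the exclusion-zone width of m // 2 is an assumption chosen for illustration.

```python
import math

def znorm(seq):
    """Z-normalize a subsequence (zero mean, unit standard deviation)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)  # flat subsequence: treat as all zeros
    return [(x - mean) / std for x in seq]

def matrix_profile(series, m):
    """Return (profile, index): nearest-neighbor distance and position
    for each length-m subsequence, skipping trivial self-matches."""
    n = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n)]
    excl = max(1, m // 2)  # assumed exclusion zone around each position
    profile, index = [math.inf] * n, [-1] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) < excl:
                continue
            d = math.dist(subs[i], subs[j])
            if d < profile[i]:
                profile[i], index[i] = d, j
    return profile, index

profile, index = matrix_profile([0, 1, 0, 5, 9, 2, 0, 1, 0], m=3)
print(index[0], profile[0])
```

The smallest profile values mark the most similar subsequence pairs and the largest mark the most dissimilar ones. In a parallel version, threads covering different (i, j) ranges may update the same profile entry, which is where a synchronization method such as locks or HTM becomes relevant.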
Pub Date: 2024-02-07; DOI: 10.1016/j.jpdc.2024.104854
Qiliang Li, Min Lyu, Liangliang Xu, Yinlong Xu
The RAID2.0 architecture, which uses dozens or even hundreds of disks, is widely adopted for large-capacity data storage. However, limited resources such as memory and CPU force RAID2.0 to perform recovery from disk failures in batches. The traditional random data placement and recovery schemes result in highly skewed I/O access within a batch, which slows down recovery. To address this issue, we propose DR-RAID, an efficient reconstruction scheme that balances local rebuilding workloads across all surviving disks within a batch. We dynamically select a batch of tasks with almost balanced read loads and make intra-batch adjustments for tasks that have multiple options for reading source chunks. Furthermore, we use a bipartite graph model to achieve a uniform distribution of write loads. DR-RAID can be applied with homogeneous or heterogeneous disk rebuilding bandwidth. Experimental results demonstrate that in offline rebuilding, DR-RAID enhances the rebuilding throughput by up to 61.90% compared to the random data placement scheme. With varied rebuilding bandwidth, the improvement can reach up to 65.00%.
Title: Fast recovery for large disk enclosures based on RAID2.0: Algorithms and evaluation. Journal of Parallel and Distributed Computing, vol. 188, Article 104854.
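The idea of balancing read loads when a rebuild task can read its source chunks from more than one set of disks can be sketched with a simple greedy heuristic. This is an illustrative stand-in, not DR-RAID's actual selection or adjustment algorithm; the task format (a list of alternative source-disk tuples) and the one-read-per-disk cost are assumptions.

```python
def assign_sources(tasks, num_disks):
    """tasks: list of tasks, each a list of alternative source-disk tuples.

    Greedily pick, per task, the alternative that keeps the busiest
    involved disk as lightly loaded as possible (ties: lowest total load).
    Returns (chosen alternatives, per-disk read load).
    """
    load = [0] * num_disks
    chosen = []
    for alternatives in tasks:
        best = min(
            alternatives,
            key=lambda opt: (max(load[d] + 1 for d in opt),
                             sum(load[d] for d in opt)),
        )
        for d in best:
            load[d] += 1  # each source chunk costs one read on its disk
        chosen.append(best)
    return chosen, load

# Three hypothetical rebuild tasks over four surviving disks.
tasks = [[(0, 1), (2, 3)], [(0, 1), (2, 3)], [(0, 2), (1, 3)]]
chosen, load = assign_sources(tasks, num_disks=4)
print(chosen, load)
```

With heterogeneous rebuilding bandwidth, one natural variation would be to divide each disk's load by its bandwidth before comparing alternatives.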
Pub Date: 2024-02-05; DOI: 10.1016/j.jpdc.2024.104853
B. Naresh Kumar Reddy, Aruru Sai Kumar
Adaptive routing is effective in maintaining high processor performance in a multiprocessor system on chip: it steers packets over minimal or non-minimal alternate routes to avoid congestion. However, many systems cannot cope with the fact that sending packets over an alternative path rather than the shorter, fixed-priority route can result in packets arriving at the destination node out of order. This can occur if packets belonging to the same communication flow are adaptively routed through different paths. In real-world network systems, there are strategies and algorithms to efficiently handle out-of-order packets without requiring infinite memory. Techniques like buffering, sliding windows, and sequence number management are used to reorder packets while considering the practical constraints of available memory and processing power. The specific method used depends on the network protocol and the requirements of the application. We propose a novel technique aimed at improving the performance of multiprocessor systems on chip by implementing adaptive routing based on the Bat algorithm. The framework employs a 5-stage pipeline router that receives each packet in full and forwards it in the best direction in an adaptive mode. The Bat algorithm is used to enhance performance by optimizing the route used to transmit packets to the destination. Tests were carried out on various NoC sizes (6 × 6 and 8 × 8) under multimedia benchmarks, compared with other related algorithms, and implemented on a Kintex-7 FPGA board. The simulation results show that the proposed algorithm reduces delay and improves throughput compared with other traditional adaptive algorithms.
Title: Evaluating the effectiveness of Bat optimization in an adaptive and energy-efficient network-on-chip routing framework. Journal of Parallel and Distributed Computing, vol. 188, Article 104853.
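The abstract mentions buffering, sliding windows, and sequence numbers as the usual way to restore packet order with bounded memory. The sketch below combines the three for a single flow; the class name, the window policy, and the choice to drop packets outside the window are illustrative assumptions, not the router design from the paper.

```python
class ReorderBuffer:
    """Deliver packets of one flow in sequence order using a bounded buffer."""

    def __init__(self, capacity):
        self.capacity = capacity  # window size: how far ahead a packet may arrive
        self.next_seq = 0         # next sequence number owed to the consumer
        self.pending = {}         # out-of-order packets keyed by sequence number

    def receive(self, seq, payload):
        """Accept a packet; return the payloads that are now deliverable in order."""
        if seq < self.next_seq or seq >= self.next_seq + self.capacity:
            return []             # duplicate or outside the window: drop
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

rb = ReorderBuffer(capacity=4)
print(rb.receive(1, "b"))  # arrives early, held back
print(rb.receive(0, "a"))  # fills the gap, releases both
```

The window bounds memory use: at most capacity packets are ever held, at the cost of dropping (or, in a real protocol, asking the sender to resend) packets that arrive too far ahead.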