
Journal of Parallel and Distributed Computing: Latest Publications

Neuron grouping and mapping methods for 2D-mesh NoC-based DNN accelerators
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-02 | DOI: 10.1016/j.jpdc.2024.104949
Furkan Nacar , Alperen Cakin , Selma Dilek , Suleyman Tosun , Krishnendu Chakrabarty

Deep Neural Networks (DNNs) have gained widespread adoption in various fields; however, their computational cost is often prohibitively high due to the large number of layers and neurons communicating with each other. Furthermore, DNNs can consume a significant amount of energy due to the large volume of data movement and computation they require. To address these challenges, there is a need for new architectures to accelerate DNNs. In this paper, we propose novel neuron grouping and mapping methods for 2D-mesh Network-on-Chip (NoC)-based DNN accelerators considering both fully connected and partially connected DNN models. We present Integer Linear Programming (ILP) and simulated annealing (SA)-based neuron grouping solutions with the objective of minimizing the total volume of data communication among the neuron groups. After determining a suitable graph representation of the DNN, we also apply ILP and SA methods to map the neurons onto a 2D-mesh NoC fabric with the objective of minimizing the total communication cost of the system. We conducted several experiments on various benchmarks and DNN models with different pruning ratios and achieved an average of 40-50% improvement in communication cost.
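The abstract names simulated annealing (SA) as one of the two neuron-grouping solvers. As a rough illustration only (this is not the authors' implementation; the cost model, cooling schedule, and parameter names are assumptions), a minimal SA partitioner that minimizes inter-group communication volume might look like:

```python
import random
import math

def comm_cost(groups, weights):
    # Total data volume exchanged between neurons placed in different groups.
    # weights: dict mapping neuron pairs (i, j) to communication volume.
    return sum(w for (i, j), w in weights.items() if groups[i] != groups[j])

def sa_group(n_neurons, n_groups, weights, steps=20000, t0=1.0, alpha=0.9995, seed=0):
    rng = random.Random(seed)
    groups = [i % n_groups for i in range(n_neurons)]  # balanced initial assignment
    cost = comm_cost(groups, weights)
    t = t0
    for _ in range(steps):
        i = rng.randrange(n_neurons)
        old, new = groups[i], rng.randrange(n_groups)
        if new == old:
            continue
        groups[i] = new
        new_cost = comm_cost(groups, weights)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost = new_cost
        else:
            groups[i] = old
        t *= alpha  # geometric cooling
    return groups, cost
```

A real accelerator-mapping objective would also constrain group sizes to the NoC tile capacity; this sketch only shows the annealing loop itself.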

Citations: 0
Reliable communication in dynamic networks with locally bounded byzantine faults
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-02 | DOI: 10.1016/j.jpdc.2024.104952
Silvia Bonomi , Giovanni Farina , Sébastien Tixeuil

The Byzantine tolerant reliable communication primitive is a fundamental building block in distributed systems that guarantees the authenticity, integrity, and delivery of information exchanged between processes.

We study the implementability of such a primitive in a distributed system with a dynamic communication network (i.e., where the set of available communication channels changes over time). We assume the f-locally bounded Byzantine fault model and identify the conditions on the dynamic communication networks that allow reliable communication between all pairs of processes. In addition, we investigate its implementability on several classes of dynamic networks and provide insights into its use in asynchronous distributed systems.
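For intuition about the f-locally bounded model: the classic Certified Propagation Algorithm (CPA) for static networks accepts a value heard directly from the source, or relayed by at least f+1 distinct neighbors, so that at least one relay is guaranteed correct. The toy simulation below illustrates that acceptance rule only; it is not the paper's dynamic-network protocol.

```python
def cpa_broadcast(adj, source, f):
    """Simulate Certified Propagation on a static graph.
    adj: dict node -> set of neighbors; returns the set of accepting nodes,
    assuming at most f Byzantine nodes in any neighborhood (f-locally bounded)."""
    accepted = {source}
    relays = {v: set() for v in adj}  # accepted neighbors that relayed to v
    frontier = [source]
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v in accepted:
                continue
            relays[v].add(u)
            # Accept if heard directly from the source, or from f+1 distinct relays.
            if u == source or len(relays[v]) >= f + 1:
                accepted.add(v)
                frontier.append(v)
    return accepted
```

On a 4-cycle, a node two hops from the source has two disjoint paths, so it accepts with f=1 but not with f=2 — the kind of connectivity condition the paper generalizes to time-varying channel sets.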

Citations: 0
PiPar: Pipeline parallelism for collaborative machine learning
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-02 | DOI: 10.1016/j.jpdc.2024.104947
Zihan Zhang , Philip Rodgers , Peter Kilpatrick , Ivor Spence , Blesson Varghese

Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques preserve privacy because each device shares a locally trained model with the server rather than its raw data. However, CML training is inefficient due to low resource utilization. We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization. A novel framework PiPar that leverages pipeline parallelism for CML techniques is developed to substantially improve resource utilization. A new training pipeline is designed to parallelize the computations on different hardware resources and communication on different bandwidth resources, thereby accelerating the training process in CML. A low-overhead automated parameter selection method is proposed to optimize the pipeline, maximizing the utilization of available resources. The experimental results confirm the validity of the underlying approach of PiPar and highlight that when compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1×, and (ii) the overall training time can be accelerated by up to 34.6× under varying network conditions for a collection of six small and large popular deep neural networks and four datasets without sacrificing accuracy. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.
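The gain from overlapping computation and communication can be seen with a back-of-the-envelope timing model (my simplification for illustration, not PiPar's actual scheduler or cost model): sequential execution pays both stage times per batch, while a two-stage pipeline pays only the slower stage per batch once the pipeline is full.

```python
def sequential_time(n_batches, t_compute, t_comm):
    # No overlap: every batch pays compute time plus communication time.
    return n_batches * (t_compute + t_comm)

def pipelined_time(n_batches, t_compute, t_comm):
    # Two-stage pipeline: after the first batch fills the pipeline,
    # compute and communication overlap, so the slower stage dominates.
    return t_compute + t_comm + (n_batches - 1) * max(t_compute, t_comm)
```

With 10 batches, 2 s of compute and 3 s of communication per batch, the pipeline needs 32 s versus 50 s sequentially; the benefit grows as the two stage times approach each other.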

Citations: 0
Staleness aware semi-asynchronous federated learning
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-01 | DOI: 10.1016/j.jpdc.2024.104950
Miri Yu, Jiheon Choi, Jaehyun Lee, Sangyoon Oh

As attempts to distribute deep learning using personal data have increased, the importance of federated learning (FL) has also grown. Attempts have been made to overcome the core challenges of federated learning (i.e., statistical and system heterogeneity) using synchronous or asynchronous protocols. However, stragglers reduce training efficiency in both: they increase latency under synchronous protocols and degrade accuracy under asynchronous ones. To address straggler issues, a semi-asynchronous protocol that combines the two can be applied to FL; however, effectively handling the staleness of local models is a difficult problem. We propose SASAFL to solve the training inefficiency caused by staleness in semi-asynchronous FL. SASAFL enables stable training by considering the quality of the global model when synchronising the server and clients. In addition, it achieves high accuracy and low latency by adjusting the number of participating clients in response to changes in global loss and by immediately processing clients that did not participate in the previous round. An evaluation was conducted under various conditions to verify the effectiveness of SASAFL. SASAFL achieved 19.69 percentage points higher accuracy than the baseline, 2.32 times better round-to-accuracy, and 2.24 times better latency-to-accuracy. Additionally, SASAFL always achieved target accuracies that the baseline could not reach.
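A common way semi-asynchronous aggregators handle staleness is to down-weight updates computed against an old global model. The decay function and names below are illustrative assumptions, not SASAFL's actual rule:

```python
def staleness_weight(staleness, a=0.5):
    # Polynomial decay: a fresh update (staleness 0) gets full weight,
    # an update computed s rounds ago gets 1 / (1 + a * s).
    return 1.0 / (1.0 + a * staleness)

def aggregate(global_model, updates, lr=1.0):
    """Staleness-weighted averaging of client deltas.
    updates: list of (delta_vector, staleness) pairs."""
    n = len(global_model)
    total_w = sum(staleness_weight(s) for _, s in updates)
    merged = [0.0] * n
    for delta, s in updates:
        w = staleness_weight(s)
        for i in range(n):
            merged[i] += w * delta[i]
    return [g + lr * m / total_w for g, m in zip(global_model, merged)]
```

With two opposing deltas, the fresh one dominates the stale one, which is the stabilizing behavior staleness-aware schemes aim for.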

Citations: 0
3D DFT by block tensor-matrix multiplication via a modified Cannon's algorithm: Implementation and scaling on distributed-memory clusters with fat tree networks
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-28 | DOI: 10.1016/j.jpdc.2024.104945
Nitin Malapally , Viacheslav Bolnykh , Estela Suarez , Paolo Carloni , Thomas Lippert , Davide Mandelli

A known scalability bottleneck of the parallel 3D FFT is its use of all-to-all communications. Here, we present S3DFT, a library that circumvents this by using point-to-point communication – albeit at a higher arithmetic complexity. This approach exploits three variants of Cannon's algorithm with adaptations for block tensor-matrix multiplications. We demonstrate S3DFT's efficient use of hardware resources, and its scaling using up to 16,464 cores of the JUWELS Cluster. However, in a comparison with well-established 3D FFT libraries, its parallel efficiency and performance were found to fall behind. A detailed analysis identifies the cause in two of its component algorithms, which scale poorly owing to how their communication patterns are mapped in subsets of the fat tree topology. This result exposes a potential drawback of running block-wise parallel algorithms on systems with fat tree networks caused by increased communication latencies along specific directions of the mesh of processing elements.
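Cannon's algorithm, which S3DFT adapts for block tensor-matrix products, aligns blocks with an initial skew and then cycles them so that each process only ever exchanges data with its grid neighbors (the point-to-point pattern the abstract contrasts with all-to-all). A scalar-per-process toy version (block size 1, no MPI) conveys the data movement:

```python
def cannon_matmul(A, B):
    """Multiply n x n matrices with Cannon's algorithm, one 'process'
    per element for clarity. Initial skew: shift row i of A left by i
    and column j of B up by j; then n rounds of multiply-and-shift."""
    n = len(A)
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]  # skewed A
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]  # skewed B
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]  # local multiply-accumulate
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]  # shift A left
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]  # shift B up
    return C
```

The nearest-neighbor shifts are exactly what maps awkwardly onto some fat-tree subsets, as the abstract's scaling analysis observes: a logical mesh direction can cross high-latency links of the physical topology.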

Citations: 0
Deep reinforcement learning based controller placement and optimal edge selection in SDN-based multi-access edge computing environments
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-27 | DOI: 10.1016/j.jpdc.2024.104948
Chunlin Li , Jun Liu , Ning Ma , Qingzhe Zhang , Zhengwei Zhong , Lincheng Jiang , Guolei Jia

Multi-Access Edge Computing (MEC) can provide computing capability close to clients to decrease response time and enhance Quality of Service (QoS). However, complex wireless networks consist of diverse network hardware with different communication protocols and Application Programming Interfaces (APIs), which results in high running costs and low running efficiency for the MEC system. To this end, Software-Defined Networking (SDN) is applied to MEC, which can support access to massive numbers of network devices and provide flexible, efficient management. A well-designed SDN controller scheme is crucial to enhancing the performance of SDN-assisted MEC. First, we use a Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM) model to predict network traffic and compute the load. Then, the optimization objective is formulated to ensure load balance and minimize system cost. Finally, a Deep Reinforcement Learning (DRL) algorithm is used to obtain the optimal value. Building on the controller placement algorithm that ensures load balancing, a dynamic edge selection method based on Channel State Information (CSI) is proposed to optimize task offloading, and a task-queue execution strategy is designed according to CSI. The task offloading problem is then modeled using queuing theory. Finally, dynamic edge selection based on Lyapunov optimization is introduced to solve the model. In the experimental studies, performance was evaluated against two sets of baseline algorithms, including SAPKM, PSO, K-means, LADMA, LATA, and OAOP. Compared to these baselines, the proposed algorithms effectively reduce the average communication delay and total system energy consumption and improve the utilization of the SDN controller.
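Lyapunov-based edge selection of the kind the abstract mentions typically reduces to a drift-plus-penalty rule: each decision greedily minimizes queue backlog plus V times the instantaneous cost, where V trades queue stability against cost. A generic sketch (parameter names and the exact rule are assumptions, not the paper's formulation):

```python
def choose_edge(queues, costs, V):
    """Drift-plus-penalty decision: pick the edge server minimizing
    Q_i + V * cost_i. Small V balances load; large V chases low cost."""
    scores = [q + V * c for q, c in zip(queues, costs)]
    return scores.index(min(scores))

def update_queue(q, arrivals, service):
    # Standard Lyapunov queue dynamics: Q(t+1) = max(Q(t) - service, 0) + arrivals
    return max(q - service, 0.0) + arrivals
```

The single knob V is why this family of methods needs no traffic forecast at decision time; the CNN-LSTM prediction in the paper instead feeds the controller-placement stage.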

Citations: 0
Experimental evaluation of a multi-installment scheduling strategy based on divisible load paradigm for SAR image reconstruction on a distributed computing infrastructure
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-26 | DOI: 10.1016/j.jpdc.2024.104942
Gokul Madathupalyam Chinnappan , Bharadwaj Veeravalli , Koen Mouthaan , John Wen-Hao Lee

Radar workloads, especially Synthetic Aperture Radar (SAR) image reconstruction, use a large volume of data collected from satellites to create a high-resolution image of the earth. To design near-real-time applications that utilise SAR data, speeding up the image reconstruction algorithm is imperative. This can be achieved by deploying a set of distributed computing resources connected through a network. Schedules for such complex, large divisible loads on a distributed platform can be designed using the Divisible Load Theory (DLT) framework. We performed distributed SAR image reconstruction experiments using the SLURM library on a cloud virtual-machine network with two scheduling strategies, namely the Multi-Installment Scheduling with Result Retrieval (MIS-RR) strategy and the traditional EQual-partitioning Strategy (EQS). The DLT model proposed in the MIS-RR strategy is incorporated to make the load divisible. Based on the experimental results and performance analysis carried out using different pixel lengths, pulse set sizes, and numbers of virtual machines, we observe that the time performance of MIS-RR is much superior to that of EQS. Hence the MIS-RR strategy is of practical significance in reducing the overall processing time and cost, and in improving the utilisation of the compute infrastructure. Furthermore, we note that the DLT-based theoretical analysis of MIS-RR coincides well with the experimental data, demonstrating the relevance of DLT in the real world.
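The core DLT principle behind such schedules is that the optimal split makes all workers finish simultaneously; with negligible communication cost this means each worker's fraction is proportional to its speed. The sketch below shows only that single-installment principle, not the MIS-RR multi-installment schedule itself, which additionally models communication and result retrieval:

```python
def dlt_split(total_load, speeds):
    """Divisible Load Theory split ignoring communication cost:
    fractions proportional to speed give equal finish times load_i / speed_i."""
    s = sum(speeds)
    return [total_load * v / s for v in speeds]
```

For a 60-unit load over workers with speeds 1, 2 and 3, the split is 10/20/30 and every worker finishes in 10 time units; an equal (EQS-style) split of 20 each would leave the fast workers idle while the slowest one finishes in 20.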

Citations: 0
PPB-MCTS: A novel distributed-memory parallel partial-backpropagation Monte Carlo tree search algorithm
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-26 | DOI: 10.1016/j.jpdc.2024.104944
Yashar Naderzadeh , Daniel Grosu , Ratna Babu Chinnam

Monte-Carlo Tree Search (MCTS) is an adaptive, heuristic tree-search algorithm designed to uncover near-optimal actions at each decision-making point. It progressively constructs a search tree by gathering samples throughout its execution. Predominantly applied in gaming, MCTS has achieved exceptional results, and it has also shown promise on NP-hard combinatorial optimization problems. MCTS has been adapted for distributed-memory parallel platforms, where the primary challenges are the substantial communication overhead and the need to balance the computational load among processes. In this work, we introduce a novel distributed-memory parallel MCTS algorithm with partial backpropagations, referred to as Parallel Partial-Backpropagation MCTS (PPB-MCTS). Our design aims to significantly reduce the communication overhead while maintaining, or even slightly improving, performance on combinatorial optimization problems. To address the communication overhead, we propose a strategy that transmits an additional backpropagation message; this avoids attaching an information table to the messages exchanged by the processes, reducing the communication overhead, and it also improves decision-making accuracy during the selection phase. The load-balancing issue is addressed by a transposition table shared among the parallel processes. We further introduce two methods for managing duplicate states in distributed-memory parallel MCTS, drawing on techniques used for duplicate states in sequential MCTS; duplicate states can transform the conventional search tree into a Directed Acyclic Graph (DAG).
To evaluate the proposed parallel algorithm, we conduct an extensive series of experiments on instances of the Job-Shop Scheduling Problem (JSSP) and the Weighted Set-Cover Problem (WSCP), both NP-hard combinatorial optimization problems of considerable relevance in industrial applications. The experiments are performed on a many-core compute cluster. The empirical results highlight the enhanced scalability of our algorithm compared to existing distributed-memory parallel MCTS algorithms: as the number of processes increases, our algorithm demonstrates higher rollout efficiency while maintaining an improved load balance across processes.
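The four MCTS phases the abstract builds on (selection, expansion, simulation, backpropagation) can be sketched in a minimal single-process form. The distributed messaging, shared transposition table, and partial backpropagations of the paper are not reproduced here; the toy bit-string objective, the UCB1 constant, and the iteration counts are illustrative assumptions only:

```python
import math
import random

class Node:
    """A search-tree node tracking visit count and accumulated reward."""
    def __init__(self, state, parent=None):
        self.state = state          # partial solution: tuple of chosen bits
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # sum of rollout rewards

    def ucb1(self, c=1.4):
        # Upper-confidence bound used during the selection phase.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def rollout(state, length, rng):
    # Complete the partial solution randomly and score it
    # (toy objective: number of 1-bits in the full string).
    tail = [rng.randint(0, 1) for _ in range(length - len(state))]
    return sum(state) + sum(tail)

def mcts(length=8, iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(state=())
    for _ in range(iterations):
        # 1. Selection: descend via UCB1 while nodes are fully expanded.
        node = root
        while len(node.children) == 2 and len(node.state) < length:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add one untried child, unless at a leaf.
        if len(node.state) < length:
            taken = {c.state[-1] for c in node.children}
            move = rng.choice([m for m in (0, 1) if m not in taken])
            node = Node(node.state + (move,), parent=node)
            node.parent.children.append(node)
        # 3. Simulation (rollout) from the new node.
        reward = rollout(node.state, length, rng)
        # 4. Backpropagation up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Extract the first move with the most visits.
    return max(root.children, key=lambda c: c.visits).state[0]
```

On this toy objective the visit counts concentrate on the action that sets the first bit to 1; the paper's contribution lies in parallelizing the backpropagation step of this loop across distributed-memory processes.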

Blockchain-assisted full-session key agreement for secure data sharing in cloud computing
IF 3.4 CAS Zone 3, Computer Science Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-06-25 DOI : 10.1016/j.jpdc.2024.104943
Yangyang Long , Changgen Peng , Weijie Tan , Yuling Chen

Data sharing in cloud computing allows multiple data owners to freely share their data resources, yet security and privacy remain inevitable challenges. As a foundation of secure communication, authenticated key agreement (AKA) schemes have been recognized as a promising approach to such problems. However, most existing AKA schemes rest on a centralized cloud architecture, so privacy and security issues inevitably arise once the central authority is attacked. Moreover, most previous schemes require an online registration authority for authentication, which can consume significant resources. To address these drawbacks, a blockchain-assisted full-session key agreement scheme is proposed for secure data sharing in cloud computing. After the registration phase, the registration authority does not participate in the authentication and key-agreement process. By utilizing blockchain technology, a common session key between a remote user and the cloud server can be negotiated, and a shared group key among multiple remote users can be negotiated, without leaking private information. Formal and informal security proofs demonstrate that the proposed scheme meets the security and privacy requirements. A detailed performance evaluation shows that the scheme has lower computation costs and acceptable communication overheads while ensuring superior security.
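The abstract does not give the concrete protocol, so the following is only a minimal sketch of the core idea: two parties deriving a common session key with no online authority in the loop, here via a plain Diffie-Hellman exchange hashed into a session key. The group parameters (the Mersenne prime 2^127 - 1 and base 3) and all names are illustrative assumptions, far too small for real security; a real deployment would use a standardized group or an elliptic curve.

```python
import hashlib
import secrets

# Toy public group parameters, published to both parties in advance
# (e.g., anchored on the blockchain). NOT secure at this size.
P = 2**127 - 1   # a Mersenne prime, illustrative only
G = 3

def keypair():
    """One party's ephemeral secret and the share it publishes."""
    secret = secrets.randbelow(P - 2) + 1
    return secret, pow(G, secret, P)

def session_key(own_secret, peer_public):
    """Derive the common session key from the peer's published share."""
    shared = pow(peer_public, own_secret, P)
    return hashlib.sha256(
        shared.to_bytes((P.bit_length() + 7) // 8, "big")
    ).hexdigest()

# User and cloud server each publish a share (e.g., via a blockchain
# transaction) and derive the same key locally, with no authority involved.
user_secret, user_public = keypair()
server_secret, server_public = keypair()
assert session_key(user_secret, server_public) == session_key(server_secret, user_public)
```

A shared group key among multiple users could be built by iterating the same exponentiation over each member's share (as in group Diffie-Hellman variants), but that is beyond this sketch.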

SpChar: Characterizing the sparse puzzle via decision trees
IF 3.4 CAS Zone 3, Computer Science Q1 Mathematics Pub Date : 2024-06-17 DOI : 10.1016/j.jpdc.2024.104941
Francesco Sgherzi , Marco Siracusa , Ivan Fernandez , Adrià Armejach , Miquel Moretó

Sparse matrix computation is crucial in various modern applications, including large-scale graph analytics, deep learning, and recommender systems. The performance of sparse kernels varies greatly depending on the structure of the input matrix, making it difficult to gain a comprehensive understanding of sparse computation and its relationship to inputs, algorithms, and target machine architecture. Despite extensive research on certain sparse kernels, such as Sparse Matrix-Vector Multiplication (SpMV), the overall family of sparse algorithms has yet to be investigated as a whole. This paper introduces SpChar, a workload characterization methodology for general sparse computation. SpChar employs tree-based models to identify the most relevant hardware and input characteristics, starting from metrics gathered from Performance Monitoring Counters (PMCs) and the input matrices. Our analysis enables a characterization loop that facilitates the optimization of sparse computation by mapping the impact of architectural features to inputs and algorithmic choices. We apply SpChar to more than 600 matrices from the SuiteSparse Matrix collection and three state-of-the-art Arm Central Processing Units (CPUs) to determine the critical hardware and software characteristics that affect sparse computation. In our analysis, we find that the biggest limiting factors for high-performance sparse computation are (1) the latency of the memory system, (2) the pipeline-flush overhead resulting from branch misprediction, and (3) the poor reuse of cached elements. Additionally, we propose software and hardware optimizations that designers can implement to create a platform suitable for sparse computation. We then investigate these optimizations using the gem5 simulator and achieve a significant speedup of up to 2.63× over a CPU to which the optimizations are not applied.
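The tree-based characterization idea can be illustrated with a single decision-stump importance score: for each feature, split the samples at its median and measure how much the split reduces the variance of the target metric. The PMC-like feature names and the synthetic runtime model below are invented for illustration and are not SpChar's actual inputs:

```python
from statistics import pvariance

def variance_reduction(samples, targets, feature):
    """Score one feature by the target-variance drop of a median split."""
    values = [s[feature] for s in samples]
    threshold = sorted(values)[len(values) // 2]
    left = [t for s, t in zip(samples, targets) if s[feature] < threshold]
    right = [t for s, t in zip(samples, targets) if s[feature] >= threshold]
    if not left or not right:
        return 0.0
    before = pvariance(targets)
    after = (len(left) * pvariance(left)
             + len(right) * pvariance(right)) / len(targets)
    return before - after

def rank_features(samples, targets):
    """Order features by how strongly they explain the target metric."""
    features = list(samples[0].keys())
    return sorted(features,
                  key=lambda f: variance_reduction(samples, targets, f),
                  reverse=True)

# Synthetic workload: runtime dominated by memory latency, echoing the
# paper's finding that memory-system latency is the biggest limiter.
samples = [{"mem_latency": i % 7,
            "branch_miss": (i * 3) % 5,
            "cache_reuse": (i * 2) % 4} for i in range(40)]
targets = [3.0 * s["mem_latency"] + 0.2 * s["branch_miss"] for s in samples]
```

On this synthetic data, rank_features(samples, targets) places mem_latency first; SpChar fits full decision trees over many PMCs and matrix properties, but the ranking principle is the same.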
