Pub Date : 2024-07-09, DOI: 10.1016/j.jpdc.2024.104953
Sahar Maleki, Hassan Zarabadipour, Mehdi Rahmani
To address the complexity of large-scale systems, decomposition into smaller subsystems is crucial for distributed estimation and control. This paper proposes novel optimization-based approaches to decompose a large-scale system into subsystems that are either weakly coupled or weakly coupled with overlapping components. To achieve this goal, the epsilon decomposition of large-scale systems is first examined. Then, optimization frameworks are presented for disjoint and overlapping decompositions utilizing bipartite graphs. Next, the proposed decomposition algorithms are presented for particular cases of large-scale systems using directed graphs. In contrast to existing user-based techniques, the proposed optimization-based methods reach the solution rapidly and systematically. Finally, the capability and efficiency of the proposed algorithms are investigated through simulations on three case studies: a practical distillation column, a modified benchmark model, and the IEEE 118-bus power system.
{"title":"Optimization-based disjoint and overlapping epsilon decompositions of large-scale dynamical systems via graph theory","authors":"Sahar Maleki, Hassan Zarabadipour, Mehdi Rahmani","doi":"10.1016/j.jpdc.2024.104953","DOIUrl":"10.1016/j.jpdc.2024.104953","url":null,"abstract":"<div><p>To address the complexity challenge of a large-scale system, the decomposition into smaller subsystems is very crucial and demanding for distributed estimation and control purposes. This paper proposes novel optimization-based approaches to decompose a large-scale system into subsystems that are either weakly coupled or weakly coupled with overlapping components. To achieve this goal, first, the epsilon decomposition of large-scale systems is examined. Then, optimization frameworks are presented for disjoint and overlapping decompositions utilizing bipartite graphs. Next, the proposed decomposition algorithms are represented for particular cases of large-scale systems using directed graphs. In contrast to the existing user-based techniques, the proposed optimization-based methods can reach the solution rapidly and systematically. At last, the capability and efficiency of the proposed algorithms are investigated by conducting simulations on three case studies, which include a practical distillation column, a modified benchmark model, and the IEEE 118-bus power system.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104953"},"PeriodicalIF":3.4,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141623229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-04, DOI: 10.1016/j.jpdc.2024.104951
Rakesh Shrestha, Mohammadreza Mohammadi, Sima Sinaei, Alberto Salcines, David Pampliega, Raul Clemente, Ana Lourdes Sanz, Ehsan Nowroozi, Anders Lindgren
In smart electric grid systems, various sensors and Internet of Things (IoT) devices are used to collect electrical data at substations. In a traditional system, a large volume of energy-related data from substations must be migrated to central storage, such as cloud or edge devices, for knowledge extraction, which can expose it to data misuse, data manipulation, or privacy leakage. This motivates an anomaly detection system to detect threats and Federated Learning (FL) to resolve the issues of data silos and data privacy. In this article, we present a framework to identify anomalies in industrial data gathered from the remote terminal devices deployed at substations in the smart electric grid system. The anomaly detection system is based on Long Short-Term Memory (LSTM) autoencoders and employs Mean Standard Deviation (MSD) and Median Absolute Deviation (MAD) approaches for detecting anomalies. We deploy FL to preserve the privacy of the data generated by the substations; FL enables energy providers to train shared AI models cooperatively without disclosing the data to the server. To further enhance the security and privacy properties of the proposed framework, we implemented homomorphic encryption based on the Paillier algorithm for preserving data privacy. The proposed security model performs best with the MSD approach using a 128-bit HE key, providing a 97% F1-score and 98% accuracy for K=5, with low computation overhead compared with a 256-bit HE key.
{"title":"Anomaly detection based on LSTM and autoencoders using federated learning in smart electric grid","authors":"Rakesh Shrestha , Mohammadreza Mohammadi , Sima Sinaei , Alberto Salcines , David Pampliega , Raul Clemente , Ana Lourdes Sanz , Ehsan Nowroozi , Anders Lindgren","doi":"10.1016/j.jpdc.2024.104951","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104951","url":null,"abstract":"<div><p>In smart electric grid systems, various sensors and Internet of Things (IoT) devices are used to collect electrical data at substations. In a traditional system, a multitude of energy-related data from substations needs to be migrated to central storage, such as Cloud or edge devices, for knowledge extraction that might impose severe data misuse, data manipulation, or privacy leakage. This motivates to propose anomaly detection system to detect threats and Federated Learning to resolve the issues of data silos and privacy of data. In this article, we present a framework to identify anomalies in industrial data that are gathered from the remote terminal devices deployed at the substations in the smart electric grid system. The anomaly detection system is based on Long Short-Term Memory (LSTM) and autoencoders that employs Mean Standard Deviation (MSD) and Median Absolute Deviation (MAD) approaches for detecting anomalies. We deploy Federated Learning (FL) to preserve the privacy of the data generated by the substations. FL enables energy providers to train shared AI models cooperatively without disclosing the data to the server. In order to further enhance the security and privacy properties of the proposed framework, we implemented homomorphic encryption based on the Paillier algorithm for preserving data privacy. The proposed security model performs better with MSD approach using HE-128 bit key providing 97% F1-score and 98% accuracy for K=5 with low computation overhead as compared with HE-256 bit key.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104951"},"PeriodicalIF":3.4,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524001151/pdfft?md5=8b26b7d7db2b8eb9c771f42fd6536e0c&pid=1-s2.0-S0743731524001151-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141606835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-03, DOI: 10.1016/j.jpdc.2024.104946
Khalid Javeed, Yasir Ali Shah, David Gregg
Elliptic Curve Cryptography (ECC) is the front-runner among available public key cryptography (PKC) schemes due to its potential to offer higher security per key bit. All ECC-based cryptosystems rely heavily on the point multiplication operation, whose efficient realization has attracted notable attention in the research community. Low-latency implementation of point multiplication is frequently required in high-speed applications such as online authentication and web server certification. This paper presents a low-latency ECC point multiplication architecture for Montgomery curves over a generic prime field GF(p). The proposed architecture can operate for a general prime modulus without any constraints on its structure. It is based on a novel pipelined modular multiplier developed using Montgomery multiplication and the Karatsuba-Ofman technique with a four-part splitting methodology. The Montgomery ladder approach is adopted at the system level, where a high-speed scheduling strategy to efficiently execute GF(p) operations is also presented. Due to these circuit- and system-level optimizations, the proposed design delivers low-latency results without a significant increase in resource consumption. The proposed architecture is described in Verilog-HDL for 256-bit key lengths and implemented on Virtex-7 and Virtex-6 FPGA platforms using the Xilinx ISE Design Suite. On the Virtex-7 FPGA platform, it performs a 256-bit point multiplication operation in just 110.9 μs with a throughput of almost 9017 operations per second. The implementation results demonstrate that, despite its generic nature, it achieves lower latency than the state-of-the-art. Therefore, it has prominent prospects for use in high-speed authentication and certification applications.
{"title":"GMC-crypto: Low latency implementation of ECC point multiplication for generic Montgomery curves over GF(p)","authors":"Khalid Javeed , Yasir Ali Shah , David Gregg","doi":"10.1016/j.jpdc.2024.104946","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104946","url":null,"abstract":"<div><p>Elliptic Curve Cryptography (ECC) is the front-runner among available public key cryptography (PKC) schemes due to its potential to offer higher security per key bit. All ECC-based cryptosystems heavily rely on point multiplication operation where its efficient realization has attained notable focus in the research community. Low latency implementation of the point multiplication operation is frequently required in high-speed applications such as online authentication and web server certification. This paper presents a low latency ECC point multiplication architecture for Montgomery curves over generic prime filed <span><math><mi>G</mi><mi>F</mi><mo>(</mo><mi>p</mi><mo>)</mo></math></span>. The proposed architecture is able to operate for a general prime modulus without any constraints on its structure. It is based on a new novel pipelined modular multiplier developed using the Montgomery multiplication and the Karatsuba-Offman technique with a four-part splitting methodology. The Montgomery ladder approach is adopted on a system level, where a high-speed scheduling strategy to efficiently execute <span><math><mi>G</mi><mi>F</mi><mo>(</mo><mi>p</mi><mo>)</mo></math></span> operations is also presented. Due to these circuit and system-level optimizations, the proposed design delivers low-latency results without a significant increase in resource consumption. The proposed architecture is described in Verilog-HDL for 256-bit key lengths and implemented on Virtex-7 and Virtex-6 FPGA platforms using Xilinx ISE Design Suite. On the Virtex-7 FPGA platform, it performs a 256-bit point multiplication operation in just 110.9 <em>u</em>s with a throughput of almost 9017 operations per second. The implementation results demonstrate that despite its generic nature, it produces low latency as compared to the state-of-the-art. Therefore, it has prominent prospects to be used in high-speed authentication and certification applications.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104946"},"PeriodicalIF":3.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-02, DOI: 10.1016/j.jpdc.2024.104949
Furkan Nacar, Alperen Cakin, Selma Dilek, Suleyman Tosun, Krishnendu Chakrabarty
Deep Neural Networks (DNNs) have gained widespread adoption in various fields; however, their computational cost is often prohibitively high due to the large number of layers and neurons communicating with each other. Furthermore, DNNs can consume a significant amount of energy due to the large volume of data movement and computation they require. To address these challenges, there is a need for new architectures to accelerate DNNs. In this paper, we propose novel neuron grouping and mapping methods for 2D-mesh Network-on-Chip (NoC)-based DNN accelerators considering both fully connected and partially connected DNN models. We present Integer Linear Programming (ILP) and simulated annealing (SA)-based neuron grouping solutions with the objective of minimizing the total volume of data communication among the neuron groups. After determining a suitable graph representation of the DNN, we also apply ILP and SA methods to map the neurons onto a 2D-mesh NoC fabric with the objective of minimizing the total communication cost of the system. We conducted several experiments on various benchmarks and DNN models with different pruning ratios and achieved an average of 40-50% improvement in communication cost.
{"title":"Neuron grouping and mapping methods for 2D-mesh NoC-based DNN accelerators","authors":"Furkan Nacar , Alperen Cakin , Selma Dilek , Suleyman Tosun , Krishnendu Chakrabarty","doi":"10.1016/j.jpdc.2024.104949","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104949","url":null,"abstract":"<div><p>Deep Neural Networks (DNNs) have gained widespread adoption in various fields; however, their computational cost is often prohibitively high due to the large number of layers and neurons communicating with each other. Furthermore, DNNs can consume a significant amount of energy due to the large volume of data movement and computation they require. To address these challenges, there is a need for new architectures to accelerate DNNs. In this paper, we propose novel neuron grouping and mapping methods for 2D-mesh Network-on-Chip (NoC)-based DNN accelerators considering both fully connected and partially connected DNN models. We present Integer Linear Programming (ILP) and simulated annealing (SA)-based neuron grouping solutions with the objective of minimizing the total volume of data communication among the neuron groups. After determining a suitable graph representation of the DNN, we also apply ILP and SA methods to map the neurons onto a 2D-mesh NoC fabric with the objective of minimizing the total communication cost of the system. We conducted several experiments on various benchmarks and DNN models with different pruning ratios and achieved an average of 40-50% improvement in communication cost.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104949"},"PeriodicalIF":3.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-02, DOI: 10.1016/j.jpdc.2024.104952
Silvia Bonomi, Giovanni Farina, Sébastien Tixeuil
The Byzantine-tolerant reliable communication primitive is a fundamental building block in distributed systems that guarantees the authenticity, integrity, and delivery of information exchanged between processes.
We study the implementability of such a primitive in a distributed system with a dynamic communication network (i.e., where the set of available communication channels changes over time). We assume the f-locally bounded Byzantine fault model and identify the conditions on the dynamic communication networks that allow reliable communication between all pairs of processes. In addition, we investigate its implementability on several classes of dynamic networks and provide insights into its use in asynchronous distributed systems.
{"title":"Reliable communication in dynamic networks with locally bounded byzantine faults","authors":"Silvia Bonomi , Giovanni Farina , Sébastien Tixeuil","doi":"10.1016/j.jpdc.2024.104952","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104952","url":null,"abstract":"<div><p>The Byzantine tolerant reliable communication primitive is a fundamental building block in distributed systems that guarantees the authenticity, integrity, and delivery of information exchanged between processes.</p><p>We study the implementability of such a primitive in a distributed system with a dynamic communication network (i.e., where the set of available communication channels changes over time). We assume the <em>f</em>-locally bounded Byzantine fault model and identify the conditions on the dynamic communication networks that allow reliable communication between all pairs of processes. In addition, we investigate its implementability on several classes of dynamic networks and provide insights into its use in asynchronous distributed systems.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104952"},"PeriodicalIF":3.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141542677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-02, DOI: 10.1016/j.jpdc.2024.104947
Zihan Zhang, Philip Rodgers, Peter Kilpatrick, Ivor Spence, Blesson Varghese
Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques are privacy-preserving because each device shares a locally trained model with the server rather than its raw data. However, CML training is inefficient due to low resource utilization. We identify resources idling on the server and devices because of sequential computation and communication as the principal cause of low resource utilization. A novel framework, PiPar, that leverages pipeline parallelism for CML techniques is developed to substantially improve resource utilization. A new training pipeline is designed to parallelize the computations on different hardware resources and communication on different bandwidth resources, thereby accelerating the training process in CML. A low-overhead automated parameter selection method is proposed to optimize the pipeline, maximizing the utilization of available resources. The experimental results confirm the validity of the underlying approach of PiPar and highlight that, compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1×, and (ii) the overall training time can be accelerated by up to 34.6× under varying network conditions for a collection of six small and large popular deep neural networks and four datasets, without sacrificing accuracy. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.
{"title":"PiPar: Pipeline parallelism for collaborative machine learning","authors":"Zihan Zhang , Philip Rodgers , Peter Kilpatrick , Ivor Spence , Blesson Varghese","doi":"10.1016/j.jpdc.2024.104947","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104947","url":null,"abstract":"<div><p>Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques are privacy-preserving as a local model that is trained on each device instead of the raw data from the device is shared with the server. However, CML training is inefficient due to low resource utilization. We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization. A novel framework <span>PiPar</span> that leverages pipeline parallelism for CML techniques is developed to substantially improve resource utilization. A new training pipeline is designed to parallelize the computations on different hardware resources and communication on different bandwidth resources, thereby accelerating the training process in CML. A low overhead automated parameter selection method is proposed to optimize the pipeline, maximizing the utilization of available resources. The experimental results confirm the validity of the underlying approach of <span>PiPar</span> and highlight that when compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1×, and (ii) the overall training time can be accelerated by up to 34.6× under varying network conditions for a collection of six small and large popular deep neural networks and four datasets without sacrificing accuracy. It is also experimentally demonstrated that <span>PiPar</span> achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104947"},"PeriodicalIF":3.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524001114/pdfft?md5=589f02b2eaa1e2c9523c4d2a0434e4e1&pid=1-s2.0-S0743731524001114-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-01, DOI: 10.1016/j.jpdc.2024.104950
Miri Yu, Jiheon Choi, Jaehyun Lee, Sangyoon Oh
As attempts to distribute deep learning over personal data have increased, the importance of federated learning (FL) has also grown. Attempts have been made to overcome the core challenges of federated learning (i.e., statistical and system heterogeneity) using synchronous or asynchronous protocols. However, stragglers reduce training efficiency in both: latency in the synchronous case and accuracy in the asynchronous case. To solve straggler issues, a semi-asynchronous protocol that combines the two can be applied to FL; however, effectively handling the staleness of local models is a difficult problem. We propose SASAFL to solve the training inefficiency caused by staleness in semi-asynchronous FL. SASAFL enables stable training by considering the quality of the global model when synchronising the server and clients. In addition, it achieves high accuracy and low latency by adjusting the number of participating clients in response to changes in global loss and by immediately processing clients that did not participate in the previous round. An evaluation was conducted under various conditions to verify the effectiveness of SASAFL. SASAFL achieved 19.69%p higher accuracy than the baseline, 2.32 times better round-to-accuracy, and 2.24 times better latency-to-accuracy. Additionally, SASAFL always reached target accuracies that the baseline could not.
{"title":"Staleness aware semi-asynchronous federated learning","authors":"Miri Yu, Jiheon Choi, Jaehyun Lee, Sangyoon Oh","doi":"10.1016/j.jpdc.2024.104950","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104950","url":null,"abstract":"<div><p>As the attempts to distribute deep learning using personal data have increased, the importance of federated learning (FL) has also increased. Attempts have been made to overcome the core challenges of federated learning (i.e., statistical and system heterogeneity) using synchronous or asynchronous protocols. However, stragglers reduce training efficiency in terms of latency and accuracy in each protocol, respectively. To solve straggler issues, a semi-asynchronous protocol that combines the two protocols can be applied to FL; however, effectively handling the staleness of the local model is a difficult problem. We proposed SASAFL to solve the training inefficiency caused by staleness in semi-asynchronous FL. SASAFL enables stable training by considering the quality of the global model to synchronise the servers and clients. In addition, it achieves high accuracy and low latency by adjusting the number of participating clients in response to changes in global loss and immediately processing clients that did not to participate in the previous round. An evaluation was conducted under various conditions to verify the effectiveness of the SASAFL. SASAFL achieved 19.69%p higher accuracy than the baseline, 2.32 times higher round-to-accuracy and 2.24 times higher latency-to-accuracy. Additionally, SASAFL always achieved target accuracy that the baseline can't reach.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104950"},"PeriodicalIF":3.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-28, DOI: 10.1016/j.jpdc.2024.104945
Nitin Malapally, Viacheslav Bolnykh, Estela Suarez, Paolo Carloni, Thomas Lippert, Davide Mandelli
A known scalability bottleneck of the parallel 3D FFT is its use of all-to-all communications. Here, we present S3DFT, a library that circumvents this by using point-to-point communication – albeit at a higher arithmetic complexity. This approach exploits three variants of Cannon's algorithm with adaptations for block tensor-matrix multiplications. We demonstrate S3DFT's efficient use of hardware resources, and its scaling using up to 16,464 cores of the JUWELS Cluster. However, in a comparison with well-established 3D FFT libraries, its parallel efficiency and performance were found to fall behind. A detailed analysis identifies the cause in two of its component algorithms, which scale poorly owing to how their communication patterns are mapped in subsets of the fat tree topology. This result exposes a potential drawback of running block-wise parallel algorithms on systems with fat tree networks caused by increased communication latencies along specific directions of the mesh of processing elements.
{"title":"3D DFT by block tensor-matrix multiplication via a modified Cannon's algorithm: Implementation and scaling on distributed-memory clusters with fat tree networks","authors":"Nitin Malapally , Viacheslav Bolnykh , Estela Suarez , Paolo Carloni , Thomas Lippert , Davide Mandelli","doi":"10.1016/j.jpdc.2024.104945","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104945","url":null,"abstract":"<div><p>A known scalability bottleneck of the parallel 3D FFT is its use of all-to-all communications. Here, we present S3DFT, a library that circumvents this by using point-to-point communication – albeit at a higher arithmetic complexity. This approach exploits three variants of Cannon's algorithm with adaptations for block tensor-matrix multiplications. We demonstrate S3DFT's efficient use of hardware resources, and its scaling using up to 16,464 cores of the JUWELS Cluster. However, in a comparison with well-established 3D FFT libraries, its parallel efficiency and performance were found to fall behind. A detailed analysis identifies the cause in two of its component algorithms, which scale poorly owing to how their communication patterns are mapped in subsets of the fat tree topology. This result exposes a potential drawback of running block-wise parallel algorithms on systems with fat tree networks caused by increased communication latencies along specific directions of the mesh of processing elements.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104945"},"PeriodicalIF":3.4,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524001096/pdfft?md5=a6e4f3cba9286a71b7d82fe7347d295b&pid=1-s2.0-S0743731524001096-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141542676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-27, DOI: 10.1016/j.jpdc.2024.104948
Chunlin Li, Jun Liu, Ning Ma, Qingzhe Zhang, Zhengwei Zhong, Lincheng Jiang, Guolei Jia
Multi-Access Edge Computing (MEC) can provide computing capability close to clients to decrease response time and enhance Quality of Service (QoS). However, a complex wireless network consists of various network hardware facilities with different communication protocols and Application Programming Interfaces (APIs), which results in high running costs and low running efficiency for the MEC system. To this end, Software-Defined Networking (SDN) is applied to MEC, as it can support access to massive numbers of network devices and provide flexible and efficient management. A well-designed SDN controller placement scheme is crucial to enhancing the performance of SDN-assisted MEC. First, we use a Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM) model to predict network traffic and calculate the load. Then, the optimization objective is formulated to ensure load balance and minimize system cost. Next, a Deep Reinforcement Learning (DRL) algorithm is used to obtain the optimal value. Building on the load-balancing controller placement algorithm, a dynamic edge selection method based on Channel State Information (CSI) is proposed to optimize task offloading, and a task-queue execution strategy is designed according to the CSI. The task offloading problem is then modeled using queuing theory, and dynamic edge selection based on Lyapunov optimization is introduced to solve the model. In the experimental studies, the proposed algorithms are evaluated against two sets of baseline algorithms, including SAPKM, PSO, K-means, LADMA, LATA, and OAOP. Compared to the baselines, the proposed algorithms effectively reduce the average communication delay and total system energy consumption and improve the utilization of the SDN controller.
{"title":"Deep reinforcement learning based controller placement and optimal edge selection in SDN-based multi-access edge computing environments","authors":"Chunlin Li , Jun Liu , Ning Ma , Qingzhe Zhang , Zhengwei Zhong , Lincheng Jiang , Guolei Jia","doi":"10.1016/j.jpdc.2024.104948","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104948","url":null,"abstract":"<div><p>Multi-Access Edge Computing (MEC) can provide computility close to the clients to decrease response time and enhance Quality of Service (QoS). However, the complex wireless network consists of various network hardware facilities with different communication protocols and Application Programming Interface (API), which result in the MEC system's high running costs and low running efficiency. To this end, Software-defined networking (SDN) is applied to MEC, which can support access to massive network devices and provide flexible and efficient management. The reasonable SDN controller scheme is crucial to enhance the performance of SDN-assisted MEC. At First, we used the Convolutional Neural Networks (CNN)-Long Short-Term Memory (LSTM) model to predict the network traffic to calculate the load. Then, the optimization objective is formulated by ensuring the load balance and minimizing the system cost. Finally, the Deep Reinforcement Learning (DRL) algorithm is used to obtain the optimal value. Based on the controller placement algorithm ensuring the load balancing, the dynamical edge selection method based on the Channel State Information (CSI) is proposed to optimize the task offloading, and according to CSI, the strategy of task queue execution is designed. Then, the task offloading problem is modeled by using queuing theory. Finally, dynamical edge selection based on Lyapunov's optimization is introduced to get the model solution. In the experiment studies, the assessment method evaluated the performance of two sets of baseline algorithms, including SAPKM, the PSO, the K-means, the LADMA, the LATA, and the OAOP. Compared to the baseline algorithms, the proposed algorithms can effectively reduce the average communication delay and total system energy consumption and improve the utilization of the SDN controller.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104948"},"PeriodicalIF":3.4,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141606834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-26, DOI: 10.1016/j.jpdc.2024.104942
Gokul Madathupalyam Chinnappan, Bharadwaj Veeravalli, Koen Mouthaan, John Wen-Hao Lee
Radar loads, especially Synthetic Aperture Radar (SAR) image reconstruction loads, use a large volume of data collected from satellites to create a high-resolution image of the earth. To design near-real-time applications that utilise SAR data, speeding up the image reconstruction algorithm is imperative. This can be achieved by deploying a set of distributed computing infrastructures connected through a network. The scheduling of such complex and large divisible loads on a distributed platform can be designed using the Divisible Load Theory (DLT) framework. We performed distributed SAR image reconstruction experiments using the SLURM library on a cloud virtual machine network, using two scheduling strategies: the Multi-Installment Scheduling with Result Retrieval (MIS-RR) strategy and the traditional EQual-partitioning Strategy (EQS). The DLT model proposed in the MIS-RR strategy is incorporated to make the load divisible. Based on the experimental results and performance analysis carried out with different pixel lengths, pulse set sizes, and numbers of virtual machines, we observe that the time performance of MIS-RR is much superior to that of EQS. Hence, the MIS-RR strategy is of practical significance in reducing overall processing time and cost, and in improving the utilisation of the compute infrastructure. Furthermore, we note that the DLT-based theoretical analysis of MIS-RR coincides well with the experimental data, demonstrating the relevance of DLT in the real world.
{"title":"Experimental evaluation of a multi-installment scheduling strategy based on divisible load paradigm for SAR image reconstruction on a distributed computing infrastructure","authors":"Gokul Madathupalyam Chinnappan , Bharadwaj Veeravalli , Koen Mouthaan , John Wen-Hao Lee","doi":"10.1016/j.jpdc.2024.104942","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104942","url":null,"abstract":"<div><p>Radar loads, especially Synthetic Aperture Radar (SAR) image reconstruction loads use a large volume of data collected from satellites to create a high-resolution image of the earth. To design near-real-time applications that utilise SAR data, speeding up the image reconstruction algorithm is imperative. This can be achieved by deploying a set of distributed computing infrastructures connected through a network. Scheduling such complex and large divisible loads on a distributed platform can be designed using the Divisible Load Theory (DLT) framework. We performed distributed SAR image reconstruction experiments using the SLURM library on a cloud virtual machine network using two scheduling strategies, namely the Multi-Installment Scheduling with Result Retrieval (MIS-RR) strategy and the traditional EQual-partitioning Strategy (EQS). The DLT model proposed in the MIS-RR strategy is incorporated to make the load divisible. Based on the experimental results and performance analysis carried out using different pixel lengths, pulse set sizes, and the number of virtual machines, we observe that the time performance of MIS-RR is much superior to that of EQS. Hence the MIS-RR strategy is of practical significance in reducing the overall processing time, and cost, and in improving the utilisation of the compute infrastructure. Furthermore, we note that the DLT-based theoretical analysis of MIS-RR coincides well with the experimental data, demonstrating the relevance of DLT in the real world.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193 ","pages":"Article 104942"},"PeriodicalIF":3.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}