Pub Date : 2024-07-19 DOI: 10.1016/j.jpdc.2024.104956
Lixue Sun, Chunxiang Xu, Fugeng Zeng
Cloud computing is a promising service architecture that enables a data owner to share data economically and efficiently. To ensure data privacy, a data owner encrypts the data before outsourcing it. Attribute-based encryption (ABE) provides an elegant way for a data owner to enforce fine-grained access control on the outsourced data. However, ABE does not support ciphertext transformation when the underlying data must be further shared with a public-key infrastructure (PKI) user. In addition, an untrusted cloud server may return random ciphertexts to the PKI user to avoid the expensive computation of ciphertext transformation. To address these issues, we introduce a novel cryptographic primitive, namely verifiable and hybrid attribute-based proxy re-encryption (VHABPRE). VHABPRE provides a transformation mechanism that re-encrypts an ABE ciphertext into a PKI-based public key encryption (PKE) ciphertext, so that the PKI user can access the underlying data and can also verify the validity of the transformed ciphertext. By leveraging a key blinding technique and computing a commitment to the data, we construct two VHABPRE schemes to achieve flexible data sharing. We give formal security proofs and a comprehensive performance evaluation to show the security and efficiency of the VHABPRE schemes.
Verifiable and hybrid attribute-based proxy re-encryption for flexible data sharing in cloud storage. Journal of Parallel and Distributed Computing, vol. 193, Article 104956.
Pub Date : 2024-07-14 DOI: 10.1016/j.jpdc.2024.104954
Nicolas Bousquet, Laurent Feuilloley, Théo Pierron
Local certification consists in assigning labels to the vertices of a network to certify that some given property is satisfied, in such a way that the labels can be checked locally. In the last few years, certification of graph classes has received considerable attention. The goal is to certify that a graph $G$ belongs to a given graph class $\mathcal{G}$. Such certifications with labels of size $O(\log n)$ (where $n$ is the size of the network) exist for trees, planar graphs and graphs embedded on surfaces. Feuilloley et al. ask whether this can be extended to any class of graphs defined by a finite set of forbidden minors.
In this work, we develop new decomposition tools for graph certification, and apply them to show that for every small enough minor $H$, $H$-minor-free graphs can indeed be certified with labels of size $O(\log n)$. We also show matching lower bounds using a new proof technique.
Local certification of graph decompositions and applications to minor-free classes. Journal of Parallel and Distributed Computing, vol. 193, Article 104954.
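To make the notion of a locally checkable $O(\log n)$ label concrete, here is a minimal sketch of the classical certification of trees, one of the classes the abstract lists as already known. It is not the decomposition technique developed in the paper. The function names and the adjacency-dict representation are assumptions made for this sketch, and the network is assumed to be connected, as is usual in local certification.

```python
from collections import deque


def make_labels(adj, root):
    """Prover: label every vertex with (root id, distance to root) via BFS."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v: (root, d) for v, d in dist.items()}


def vertex_accepts(v, adj, labels):
    """Verifier at v: reads only its own label and its neighbours' labels."""
    root, d = labels[v]
    if any(labels[u][0] != root for u in adj[v]):
        return False  # neighbours must agree on the root identifier
    if d == 0 and v != root:
        return False  # only the root itself may claim distance 0
    parents = sum(1 for u in adj[v] if labels[u][1] == d - 1)
    children = sum(1 for u in adj[v] if labels[u][1] == d + 1)
    if parents + children != len(adj[v]):
        return False  # every edge must join consecutive distance levels
    return parents == (0 if d == 0 else 1)


def certified_as_tree(adj, labels):
    """The certificate is accepted only if every vertex accepts."""
    return all(vertex_accepts(v, adj, labels) for v in adj)
```

Each label holds one identifier and one distance, both bounded by $n$, so it fits in $O(\log n)$ bits. On a graph containing a cycle, some vertex necessarily sees an edge that does not join consecutive levels or sees two parents, so no labeling is accepted everywhere.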
Pub Date : 2024-07-09 DOI: 10.1016/j.jpdc.2024.104953
Sahar Maleki, Hassan Zarabadipour, Mehdi Rahmani
To address the complexity of a large-scale system, decomposing it into smaller subsystems is crucial for distributed estimation and control. This paper proposes novel optimization-based approaches to decompose a large-scale system into subsystems that are either weakly coupled or weakly coupled with overlapping components. To this end, the epsilon decomposition of large-scale systems is first examined. Then, optimization frameworks are presented for disjoint and overlapping decompositions using bipartite graphs. Next, the proposed decomposition algorithms are presented for particular cases of large-scale systems using directed graphs. In contrast to existing user-based techniques, the proposed optimization-based methods reach the solution rapidly and systematically. Finally, the capability and efficiency of the proposed algorithms are investigated through simulations on three case studies: a practical distillation column, a modified benchmark model, and the IEEE 118-bus power system.
Optimization-based disjoint and overlapping epsilon decompositions of large-scale dynamical systems via graph theory. Journal of Parallel and Distributed Computing, vol. 193, Article 104953.
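The basic idea behind a disjoint epsilon decomposition can be sketched as follows: couplings whose magnitude does not exceed a threshold epsilon are treated as weak and ignored, and the states that remain connected through strong couplings form one subsystem. This is only the classical thresholding view, not the optimization-based formulation proposed in the paper. The function name and the list-of-lists matrix format are assumptions made for this sketch.

```python
def epsilon_decomposition(A, eps):
    """Group state indices into subsystems that are coupled to each other
    only through entries of magnitude <= eps (thresholding sketch)."""
    n = len(A)
    parent = list(range(n))  # union-find forest over state indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if i != j and abs(A[i][j]) > eps:
                parent[find(i)] = find(j)  # strong coupling: same subsystem

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical 4-state system matrix with two strongly coupled pairs.
A = [
    [-1.0, 0.9, 0.01, 0.0],
    [0.8, -2.0, 0.0, 0.02],
    [0.01, 0.0, -3.0, 0.7],
    [0.0, 0.03, 0.6, -1.0],
]
print(epsilon_decomposition(A, eps=0.05))   # [[0, 1], [2, 3]]
print(epsilon_decomposition(A, eps=0.005))  # [[0, 1, 2, 3]]
```

Raising epsilon yields more and smaller subsystems at the price of stronger neglected couplings, which is the trade-off an optimization-based method would need to balance.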
Pub Date : 2024-07-04 DOI: 10.1016/j.jpdc.2024.104951
Rakesh Shrestha, Mohammadreza Mohammadi, Sima Sinaei, Alberto Salcines, David Pampliega, Raul Clemente, Ana Lourdes Sanz, Ehsan Nowroozi, Anders Lindgren
In smart electric grid systems, various sensors and Internet of Things (IoT) devices are used to collect electrical data at substations. In a traditional system, a multitude of energy-related data from substations needs to be migrated to central storage, such as cloud or edge devices, for knowledge extraction, which may expose it to severe data misuse, data manipulation, or privacy leakage. This motivates an anomaly detection system to detect threats, and Federated Learning to resolve the issues of data silos and data privacy. In this article, we present a framework to identify anomalies in industrial data gathered from the remote terminal devices deployed at the substations of the smart electric grid system. The anomaly detection system is based on Long Short-Term Memory (LSTM) and autoencoders, and employs Mean Standard Deviation (MSD) and Median Absolute Deviation (MAD) approaches for detecting anomalies. We deploy Federated Learning (FL) to preserve the privacy of the data generated by the substations. FL enables energy providers to train shared AI models cooperatively without disclosing the data to the server. To further enhance the security and privacy properties of the proposed framework, we implemented homomorphic encryption (HE) based on the Paillier algorithm. The proposed security model performs better with the MSD approach using a 128-bit HE key, providing a 97% F1-score and 98% accuracy for K=5 with low computation overhead compared with a 256-bit HE key.
Anomaly detection based on LSTM and autoencoders using federated learning in smart electric grid. Journal of Parallel and Distributed Computing, vol. 193, Article 104951.
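The abstract does not spell out how MSD and MAD are turned into a decision rule. A common reading, assumed here, is that a sample is flagged when its autoencoder reconstruction error exceeds a threshold of the form mean + k·std (MSD) or median + k·MAD (MAD), fitted on errors from normal data. The function names, the multiplier `k`, and the sample error values below are hypothetical, and `k` is not claimed to be the K=5 mentioned in the abstract.

```python
import statistics


def msd_threshold(errors, k=3.0):
    """Mean + k * (population) standard deviation of reconstruction errors."""
    return statistics.fmean(errors) + k * statistics.pstdev(errors)


def mad_threshold(errors, k=3.0):
    """Median + k * median absolute deviation of reconstruction errors."""
    med = statistics.median(errors)
    mad = statistics.median(abs(e - med) for e in errors)
    return med + k * mad


def flag_anomalies(errors, threshold):
    """Indices of samples whose reconstruction error exceeds the threshold."""
    return [i for i, e in enumerate(errors) if e > threshold]


# Hypothetical reconstruction errors: thresholds are fitted on normal data
# and then applied to new measurements.
train_errors = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11]
new_errors = [0.10, 0.12, 0.95, 0.11]
print(flag_anomalies(new_errors, msd_threshold(train_errors)))  # [2]
print(flag_anomalies(new_errors, mad_threshold(train_errors)))  # [2]
```

The MAD variant is less sensitive to outliers in the data used to fit the threshold, because the median and the median absolute deviation barely move when a few extreme errors are present.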
Pub Date : 2024-07-03 DOI: 10.1016/j.jpdc.2024.104946
Khalid Javeed, Yasir Ali Shah, David Gregg
Elliptic Curve Cryptography (ECC) is the front-runner among available public key cryptography (PKC) schemes because it offers higher security per key bit. All ECC-based cryptosystems rely heavily on the point multiplication operation, whose efficient realization has attracted notable attention in the research community. Low-latency implementation of point multiplication is frequently required in high-speed applications such as online authentication and web server certification. This paper presents a low-latency ECC point multiplication architecture for Montgomery curves over a generic prime field GF(p). The proposed architecture operates for a general prime modulus without any constraints on its structure. It is based on a novel pipelined modular multiplier developed using Montgomery multiplication and the Karatsuba-Ofman technique with a four-part splitting methodology. The Montgomery ladder approach is adopted at the system level, where a high-speed scheduling strategy to efficiently execute GF(p) operations is also presented. Owing to these circuit- and system-level optimizations, the proposed design delivers low-latency results without a significant increase in resource consumption. The proposed architecture is described in Verilog-HDL for 256-bit key lengths and implemented on Virtex-7 and Virtex-6 FPGA platforms using Xilinx ISE Design Suite. On the Virtex-7 FPGA platform, it performs a 256-bit point multiplication in just 110.9 µs, a throughput of almost 9017 operations per second. The implementation results demonstrate that, despite its generic nature, it achieves low latency compared with the state of the art. Therefore, it is a strong candidate for high-speed authentication and certification applications.
GMC-crypto: Low latency implementation of ECC point multiplication for generic Montgomery curves over GF(p). Journal of Parallel and Distributed Computing, vol. 193, Article 104946.
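Montgomery multiplication is what lets such a multiplier work for any odd prime modulus without relying on a special form of p. The following is a minimal software sketch of the textbook reduction (REDC), not the pipelined hardware design of the paper, and it omits the Karatsuba-Ofman splitting. The function names and the choice of R = 2^bits are assumptions made for this sketch. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_setup(p, bits):
    """Precompute constants for an odd modulus p < 2**bits."""
    R = 1 << bits
    p_inv_neg = (-pow(p, -1, R)) % R  # p * p_inv_neg == -1 (mod R)
    return R, p_inv_neg


def redc(T, p, bits, p_inv_neg):
    """Montgomery reduction: return T * R^-1 mod p for 0 <= T < p * R."""
    mask = (1 << bits) - 1
    m = ((T & mask) * p_inv_neg) & mask  # makes T + m*p divisible by R
    t = (T + m * p) >> bits              # exact division by R
    return t - p if t >= p else t


def mont_mul(a, b, p, bits):
    """Compute a * b mod p by passing through the Montgomery domain."""
    R, p_inv_neg = montgomery_setup(p, bits)
    a_bar = (a * R) % p                               # to Montgomery form
    b_bar = (b * R) % p
    c_bar = redc(a_bar * b_bar, p, bits, p_inv_neg)   # equals a*b*R mod p
    return redc(c_bar, p, bits, p_inv_neg)            # back to normal form


p = 2**255 - 19  # example prime; any odd modulus below 2**bits works
a, b = 1234567890123456789, 9876543210987654321
assert mont_mul(a, b, p, 256) == (a * b) % p
```

In a real design the conversions into and out of Montgomery form are done once per point multiplication, so each field multiplication inside the ladder costs a single reduction using only shifts, masks, and multiplications.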
Deep Neural Networks (DNNs) have gained widespread adoption in various fields; however, their computational cost is often prohibitively high due to the large number of layers and neurons communicating with each other. Furthermore, DNNs can consume a significant amount of energy due to the large volume of data movement and computation they require. To address these challenges, there is a need for new architectures to accelerate DNNs. In this paper, we propose novel neuron grouping and mapping methods for 2D-mesh Network-on-Chip (NoC)-based DNN accelerators considering both fully connected and partially connected DNN models. We present Integer Linear Programming (ILP) and simulated annealing (SA)-based neuron grouping solutions with the objective of minimizing the total volume of data communication among the neuron groups. After determining a suitable graph representation of the DNN, we also apply ILP and SA methods to map the neurons onto a 2D-mesh NoC fabric with the objective of minimizing the total communication cost of the system. We conducted several experiments on various benchmarks and DNN models with different pruning ratios and achieved an average of 40-50% improvement in communication cost.
Furkan Nacar, Alperen Cakin, Selma Dilek, Suleyman Tosun, Krishnendu Chakrabarty. Neuron grouping and mapping methods for 2D-mesh NoC-based DNN accelerators. Journal of Parallel and Distributed Computing, vol. 193, Article 104949. Pub Date : 2024-07-02 DOI: 10.1016/j.jpdc.2024.104949
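The mapping objective can be illustrated with a small simulated annealing sketch. It assumes a cost model that the abstract does not state explicitly: the communication cost is the traffic volume between two neuron groups multiplied by their Manhattan hop distance on the mesh, with exactly one group per tile. The function names, cooling schedule, and traffic figures are hypothetical.

```python
import math
import random


def comm_cost(mapping, traffic, width):
    """Sum of volume * Manhattan hop distance over all communicating pairs."""
    total = 0
    for (g, h), volume in traffic.items():
        x1, y1 = mapping[g] % width, mapping[g] // width
        x2, y2 = mapping[h] % width, mapping[h] // width
        total += volume * (abs(x1 - x2) + abs(y1 - y2))
    return total


def anneal_mapping(traffic, n_groups, width, steps=5000, t0=10.0, seed=0):
    """Simulated annealing over tile swaps; returns the best mapping seen."""
    rng = random.Random(seed)
    mapping = list(range(n_groups))  # group i is placed on tile mapping[i]
    cost = comm_cost(mapping, traffic, width)
    best, best_cost = mapping[:], cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling schedule
        i, j = rng.sample(range(n_groups), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]  # propose a swap
        new_cost = comm_cost(mapping, traffic, width)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost  # accept the move
            if cost < best_cost:
                best, best_cost = mapping[:], cost
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the swap
    return best, best_cost


# Hypothetical traffic between four neuron groups on a 2x2 mesh.
traffic = {(0, 3): 10, (1, 2): 10, (0, 1): 1}
initial = comm_cost(list(range(4)), traffic, width=2)
best, best_cost = anneal_mapping(traffic, n_groups=4, width=2)
assert best_cost <= initial
```

An ILP formulation would search the same assignment space exactly. The annealing variant trades optimality guarantees for running time, which matters as the number of groups grows.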
Pub Date : 2024-07-02 DOI: 10.1016/j.jpdc.2024.104952
Silvia Bonomi, Giovanni Farina, Sébastien Tixeuil
The Byzantine tolerant reliable communication primitive is a fundamental building block in distributed systems that guarantees the authenticity, integrity, and delivery of information exchanged between processes.
We study the implementability of such a primitive in a distributed system with a dynamic communication network (i.e., where the set of available communication channels changes over time). We assume the f-locally bounded Byzantine fault model and identify the conditions on the dynamic communication networks that allow reliable communication between all pairs of processes. In addition, we investigate its implementability on several classes of dynamic networks and provide insights into its use in asynchronous distributed systems.
Reliable communication in dynamic networks with locally bounded byzantine faults. Journal of Parallel and Distributed Computing, vol. 193, Article 104952.
Pub Date : 2024-07-02 DOI: 10.1016/j.jpdc.2024.104947
Zihan Zhang, Philip Rodgers, Peter Kilpatrick, Ivor Spence, Blesson Varghese
Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques are privacy-preserving because a local model trained on each device, rather than the device's raw data, is shared with the server. However, CML training is inefficient due to low resource utilization. We identify idling resources on the server and devices, caused by sequential computation and communication, as the principal cause of low resource utilization. A novel framework, PiPar, that leverages pipeline parallelism for CML techniques is developed to substantially improve resource utilization. A new training pipeline is designed to parallelize the computations on different hardware resources and the communication on different bandwidth resources, thereby accelerating the training process in CML. A low-overhead automated parameter selection method is proposed to optimize the pipeline, maximizing the utilization of available resources. The experimental results confirm the validity of the underlying approach of PiPar and highlight that, when compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1×, and (ii) the overall training time can be accelerated by up to 34.6× under varying network conditions for a collection of six small and large popular deep neural networks and four datasets without sacrificing accuracy. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.
PiPar: Pipeline parallelism for collaborative machine learning. Journal of Parallel and Distributed Computing, vol. 193, Article 104947.
Pub Date : 2024-07-01 DOI: 10.1016/j.jpdc.2024.104950
Miri Yu, Jiheon Choi, Jaehyun Lee, Sangyoon Oh
As attempts to distribute deep learning using personal data have increased, so has the importance of federated learning (FL). Attempts have been made to overcome the core challenges of federated learning (i.e., statistical and system heterogeneity) using synchronous or asynchronous protocols. However, stragglers reduce training efficiency in terms of latency and accuracy in each protocol, respectively. To solve straggler issues, a semi-asynchronous protocol that combines the two protocols can be applied to FL; however, effectively handling the staleness of the local model is a difficult problem. We propose SASAFL to solve the training inefficiency caused by staleness in semi-asynchronous FL. SASAFL enables stable training by considering the quality of the global model when synchronising the server and clients. In addition, it achieves high accuracy and low latency by adjusting the number of participating clients in response to changes in global loss and by immediately processing clients that did not participate in the previous round. An evaluation was conducted under various conditions to verify the effectiveness of SASAFL. SASAFL achieved 19.69 percentage points higher accuracy than the baseline, 2.32 times higher round-to-accuracy and 2.24 times higher latency-to-accuracy. Additionally, SASAFL always achieved the target accuracy, which the baseline cannot reach.
Staleness aware semi-asynchronous federated learning. Journal of Parallel and Distributed Computing, vol. 193, Article 104950.
Pub Date : 2024-06-28 DOI: 10.1016/j.jpdc.2024.104945
Nitin Malapally, Viacheslav Bolnykh, Estela Suarez, Paolo Carloni, Thomas Lippert, Davide Mandelli
A known scalability bottleneck of the parallel 3D FFT is its use of all-to-all communications. Here, we present S3DFT, a library that circumvents this by using point-to-point communication – albeit at a higher arithmetic complexity. This approach exploits three variants of Cannon's algorithm with adaptations for block tensor-matrix multiplications. We demonstrate S3DFT's efficient use of hardware resources, and its scaling using up to 16,464 cores of the JUWELS Cluster. However, in a comparison with well-established 3D FFT libraries, its parallel efficiency and performance were found to fall behind. A detailed analysis identifies the cause in two of its component algorithms, which scale poorly owing to how their communication patterns are mapped in subsets of the fat tree topology. This result exposes a potential drawback of running block-wise parallel algorithms on systems with fat tree networks caused by increased communication latencies along specific directions of the mesh of processing elements.
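To make the communication pattern concrete, here is a serial sketch of the textbook Cannon's algorithm for a plain matrix product on a q × q grid of blocks. It is not S3DFT's code and does not reproduce the three tensor-matrix variants the abstract mentions; the helper names are hypothetical, and the matrix size is assumed to be divisible by q. Each "process" only ever passes a block to its left or upper neighbour, which is the point-to-point pattern that replaces the all-to-all exchange, and a DFT along one axis is written as a dense product with the DFT matrix, which is where the higher arithmetic cost comes from.

```python
import cmath


def matmul(A, B):
    """Dense product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matadd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def split_blocks(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    s = len(M) // q
    return [[[row[j * s:(j + 1) * s] for row in M[i * s:(i + 1) * s]]
             for j in range(q)] for i in range(q)]


def join_blocks(blocks):
    """Inverse of split_blocks."""
    out = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            out.append([x for blk in block_row for x in blk[r]])
    return out


def cannon_matmul(A, B, q):
    """Simulate Cannon's algorithm: grid cell (i, j) plays the role of a process."""
    s = len(A) // q
    a, b = split_blocks(A, q), split_blocks(B, q)
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    a = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Local multiply-accumulate on every grid cell.
        for i in range(q):
            for j in range(q):
                c[i][j] = matadd(c[i][j], matmul(a[i][j], b[i][j]))
        # Neighbour-only communication: A moves one step left, B one step up.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join_blocks(c)


def dft_matrix(n):
    """n x n DFT matrix with entries exp(-2*pi*i*r*c/n)."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)] for r in range(n)]
```

Calling `cannon_matmul(dft_matrix(4), X, 2)` on a 4 × 4 matrix `X` transforms every column of `X`, and the result agrees with `matmul(dft_matrix(4), X)` up to rounding. In a distributed run each list rotation would become a send/receive pair between neighbouring ranks.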
{"title":"3D DFT by block tensor-matrix multiplication via a modified Cannon's algorithm: Implementation and scaling on distributed-memory clusters with fat tree networks","authors":"Nitin Malapally, Viacheslav Bolnykh, Estela Suarez, Paolo Carloni, Thomas Lippert, Davide Mandelli","doi":"10.1016/j.jpdc.2024.104945","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104945","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"193","pages":"Article 104945"},"PeriodicalIF":3.4,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524001096/pdfft?md5=a6e4f3cba9286a71b7d82fe7347d295b&pid=1-s2.0-S0743731524001096-main.pdf","platform":"Semanticscholar","paperid":"141542676","RegionNum":3,"RegionCategory":"Computer Science","PMCID":"OA"}