
Journal of Parallel and Distributed Computing: Latest Articles

DH_Aligner: A fast short-read aligner on multicore platforms with AVX vectorization
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-04 | DOI: 10.1016/j.jpdc.2025.105142
Qiao Sun, Feng Chen, Leisheng Li, Huiyuan Li
The rapid development of NGS (Next-Generation Sequencing) technology produces massive genome data at a much higher throughput than before, creating great demand for fast and accurate downstream genetic analysis. As one of the first steps of the bioinformatics workflow, read alignment makes an educated guess about where and how a read maps to a given reference sequence. In this paper, we propose DH_Aligner, a fast and accurate short-read aligner designed and optimized for x86 multi-core platforms with avx2/avx512 SIMD instruction sets. It is based on a three-phase alignment workflow (seeding-filtering-extension) and provides an end-to-end solution for read alignment from Fastq to SAM files. Thanks to a fast seeding scheme and a seed filtering procedure, DH_Aligner avoids both a time-consuming seeding phase and the redundant work of aligning reads at seemingly wrong locations. With the introduction of a batched-processing methodology, parallelism is easily exploited at the data, instruction and thread level. The performance-critical kernels in DH_Aligner are implemented with both avx2 and avx512 intrinsics for better performance and portability. On two typical x86-based platforms, Intel Xeon-6154 and Hygon C86-7285, DH_Aligner produces near-best accuracy/sensitivity while outperforming state-of-the-art parallel implementations, with average speedups of 7.8x, 3.4x, 2.8x-6.7x and 1.5x over bwa-mem, bwa-mem2, bowtie2 and minimap2 respectively.
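As a concrete illustration of the SIMD parallelism the abstract describes, the sketch below counts base mismatches between a 32-bp read chunk and a candidate reference window with AVX2, one plausible building block of a seed-filtering step. It is not DH_Aligner's actual kernel; the function names and the mismatch-threshold filter are hypothetical. Compile with -mavx2.

```cpp
// Hedged sketch of an AVX2 seed-filtering kernel; not the paper's code.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Count mismatching bases in a 32-byte window (one AVX2 register).
static inline int mismatches32(const uint8_t* read, const uint8_t* ref) {
    __m256i r  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(read));
    __m256i g  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ref));
    __m256i eq = _mm256_cmpeq_epi8(r, g);             // 0xFF where bases match
    uint32_t mask = static_cast<uint32_t>(_mm256_movemask_epi8(eq));
    return 32 - __builtin_popcount(mask);             // mismatch count
}

// A candidate seed location passes the filter only if the mismatch count is
// low, so the expensive extension phase runs only at plausible positions.
bool passes_filter(const uint8_t* read, const uint8_t* ref,
                   std::size_t len, int max_mismatches) {
    int mm = 0;
    for (std::size_t i = 0; i + 32 <= len; i += 32)
        mm += mismatches32(read + i, ref + i);
    return mm <= max_mismatches;
}
```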
Citations: 0
Integration framework for online thread throttling with thread and page mapping on NUMA systems
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-04 | DOI: 10.1016/j.jpdc.2025.105145
Janaina Schwarzrock, Hiago Mayk G. de A. Rocha, Arthur F. Lorenzon, Samuel Xavier de Souza, Antonio Carlos S. Beck
Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread-to-core allocation and page placement are crucial for enhancing performance and minimizing energy usage. Moreover, considering that NUMA systems have hardware support for a large number of hardware threads and many parallel applications have limited scalability, artificially decreasing the number of threads by using Dynamic Concurrency Throttling (DCT) may bring further improvements. However, the optimal configuration (thread mapping, page mapping, number of threads) for energy and performance, quantified by the Energy-Delay Product (EDP), varies with the system hardware, application and input set, even during execution. Because of this dynamic nature, adaptability is essential, making offline strategies much less effective. Despite their effectiveness, online strategies introduce additional execution overhead, stemming from run-time learning and from the cost of transitioning between configurations (cache warm-ups, thread and data reallocation). Thus, balancing learning time and solution quality becomes increasingly significant. In this scenario, this work proposes a framework that finds such optimal configurations with a single, online, and efficient approach. Our experimental evaluation shows that our framework improves EDP and performance compared to online state-of-the-art techniques for thread/page mapping (up to 69.3% and 43.4%) and DCT (up to 93.2% and 74.9%), while being fully adaptive and requiring minimal user intervention.
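As a rough illustration of the optimization target, the sketch below ranks sampled configurations by their Energy-Delay Product and returns the best one. The Config fields and the sampling mechanism are hypothetical placeholders, not the paper's actual framework.

```cpp
// Hedged sketch: choosing the configuration with the lowest EDP.
#include <vector>
#include <string>
#include <limits>
#include <cstddef>

struct Config {
    int threads;            // concurrency level chosen by DCT
    std::string thread_map; // e.g. "compact" or "scatter"
    std::string page_map;   // e.g. "first-touch" or "interleave"
    double time_s;          // measured execution time of a sampled region
    double energy_j;        // measured energy of the same region
};

// EDP = energy * delay; lower is better for the joint energy/performance goal.
static double edp(const Config& c) { return c.energy_j * c.time_s; }

std::size_t best_configuration(const std::vector<Config>& sampled) {
    std::size_t best = 0;
    double best_edp = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < sampled.size(); ++i) {
        if (edp(sampled[i]) < best_edp) { best_edp = edp(sampled[i]); best = i; }
    }
    return best; // index of the configuration the runtime would switch to
}
```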
Citations: 0
Complexity analysis and scalability of a matrix-free extrapolated geometric multigrid solver for curvilinear coordinates representations from fusion plasma applications
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-03 | DOI: 10.1016/j.jpdc.2025.105143
Philippe Leleux, Christina Schwarz, Martin J. Kühn, Carola Kruse, Ulrich Rüde
Tokamak fusion reactors are promising alternatives for future energy production. Gyrokinetic simulations are important tools to understand physical processes inside tokamaks and to improve the design of future plants. In gyrokinetic codes such as Gysela, these simulations involve at each time step the solution of a gyrokinetic Poisson equation defined on disk-like cross sections. The authors of [14], [15] proposed to discretize a simplified differential equation using symmetric finite differences derived from the resulting energy functional and to use an implicitly extrapolated geometric multigrid scheme tailored to problems in curvilinear coordinates. In this article, we extend the discretization to a more realistic partial differential equation and demonstrate the optimal linear complexity of the proposed solver, in terms of computation and memory. We provide a general framework to analyze floating point operations and memory usage of matrix-free approaches for stencil-based operators. Finally, we give an efficient matrix-free implementation for the considered solver exploiting a task-based multithreaded parallelism which takes advantage of the disk-shaped geometry of the problem. We demonstrate the parallel efficiency for the solution of problems of size up to 50 million unknowns.
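To make the matrix-free notion concrete, the following sketch applies a standard 5-point Poisson stencil without ever forming the matrix; this is the style of operator whose floating point operations and memory traffic such an analysis framework would count. It uses the textbook Poisson coefficients, not the paper's curvilinear-coordinate discretization.

```cpp
// Hedged sketch: matrix-free application of a 5-point stencil, y = A*x,
// on an n-by-n grid stored row-major (x and y have n*n entries).
#include <vector>
#include <cstddef>

void apply_stencil(const std::vector<double>& x, std::vector<double>& y, std::size_t n) {
    // Interior points only; boundary handling omitted for brevity.
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            std::size_t k = i * n + j;
            // 5 loads and 5 multiply-adds per grid point, with no stored matrix.
            y[k] = 4.0 * x[k] - x[k - 1] - x[k + 1] - x[k - n] - x[k + n];
        }
    }
}
```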
Citations: 0
Towards efficient program execution on edge-cloud computing platforms
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-02 | DOI: 10.1016/j.jpdc.2025.105135
Jean-François Dollinger, Vincent Vauchey
This paper investigates techniques dedicated to the performance of edge-cloud infrastructures and identifies the challenges to address to maximize their efficiency. Unlike traditional cloud-only processing, edge-cloud platforms meet the stringent requirements of real-time applications via additional computing resources close to the data source. Yet, due to numerous performance factors, performing efficient computations on such platforms is a complex task. Thus, we identify the main performance bottlenecks induced by traditional approaches and extensively discuss the performance characteristics of edge computing platforms. Based on these insights, we design an automated framework capable of achieving end-to-end efficacy for edge-cloud applications. We argue that achieving performance on edge-cloud infrastructures requires adaptive offloading of programs based on computational requirements. Thus, we comprehensively study three performance-critical aspects forming the performance workflow of applications: i) performance modelling, ii) program optimization, and iii) task scheduling. First, we explore performance modelling techniques, which form the foundation of most cost models and enable accurate prediction for robust code optimization and scheduling. We then cover the whole program optimization chain, from hotspot detection to code optimization, focusing on memory locality, code parallelization, and acceleration. Finally, we discuss task scheduling techniques for selecting the best computing resource and ensuring a balanced workload distribution. Overall, our study provides insights by covering the above performance workflow with reference to prominent state-of-the-art works, particularly focusing on those not yet applied in the context of edge-cloud computing. Additionally, we conducted experiments to further validate our findings. Finally, for each topic of interest, we identify the scientific obstacles addressed and outline the open research challenges yet to be overcome.
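A minimal, hypothetical illustration of the adaptive-offloading argument: given model-predicted times, a task is offloaded to the cloud only when the compute gain outweighs the transfer cost. The estimates and the decision rule below are illustrative assumptions, not the paper's framework.

```cpp
// Hedged sketch: cost-model-driven placement of a single task.
struct TaskEstimate {
    double edge_compute_s;   // predicted compute time on the edge device
    double cloud_compute_s;  // predicted compute time in the cloud
    double transfer_s;       // predicted time to move inputs/outputs to the cloud
};

enum class Placement { Edge, Cloud };

// Offload only when the cloud's compute advantage outweighs the transfer cost.
Placement choose_placement(const TaskEstimate& t) {
    return (t.cloud_compute_s + t.transfer_s < t.edge_compute_s)
               ? Placement::Cloud
               : Placement::Edge;
}
```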
Citations: 0
MM-AutoSolver: A multimodal machine learning method for the auto-selection of iterative solvers and preconditioners
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-01 | DOI: 10.1016/j.jpdc.2025.105144
Hantao Xiong, Wangdong Yang, Weiqing He, Shengle Lin, Keqin Li, Kenli Li
The solution of large-scale sparse linear systems of the form Ax=b is an important research problem in the field of High-Performance Computing (HPC). With the increasing scale of these systems and the development of both HPC software and hardware, iterative solvers along with appropriate preconditioners have become mainstream methods for efficiently solving the sparse linear systems that arise from real-world HPC applications. Among the abundant combinations of iterative solvers and preconditioners, automatically selecting the optimal one has become a vital problem for accelerating the solution of these sparse linear systems. Previous work has utilized machine learning or deep learning algorithms to tackle this problem, but fails to extract and exploit sufficient features from sparse linear systems and is thus unable to obtain satisfactory results. In this work, we propose to address the automatic selection of the optimal combination of iterative solvers and preconditioners through a powerful multimodal machine learning framework, in which features of different modalities can be fully extracted and utilized to improve the results. Based on this framework, we put forward a multimodal machine learning model called MM-AutoSolver for the auto-selection of the optimal combination for a given sparse linear system. The experimental results based on a new large-scale matrix collection show that the proposed MM-AutoSolver outperforms state-of-the-art methods in predictive performance and has the capability to significantly accelerate the solution of large-scale sparse linear systems in HPC applications.
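The decision MM-AutoSolver automates can be pictured as scoring candidate (solver, preconditioner) pairs and picking the highest-scoring one, as in the hedged sketch below. The Candidate fields and the scoring model stand in for the paper's multimodal network and are purely illustrative.

```cpp
// Hedged sketch: selecting the best-scored solver/preconditioner combination.
#include <vector>
#include <string>

struct Candidate {
    std::string solver;         // e.g. "GMRES", "BiCGSTAB"
    std::string preconditioner; // e.g. "ILU(0)", "Jacobi"
    double predicted_score;     // model output; higher means faster expected solve
};

// Assumes `candidates` is non-empty.
Candidate select_combination(const std::vector<Candidate>& candidates) {
    Candidate best = candidates.front();
    for (const auto& c : candidates)
        if (c.predicted_score > best.predicted_score) best = c;
    return best; // combination handed to the HPC solver library
}
```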
Citations: 0
Parallel watershed partitioning: GPU-based hierarchical image segmentation
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-27 | DOI: 10.1016/j.jpdc.2025.105140
Varduhi Yeghiazaryan, Yeva Gabrielyan, Irina Voiculescu
Many image processing applications rely on partitioning an image into disjoint regions whose pixels are ‘similar.’ The watershed and waterfall transforms are established mathematical morphology pixel clustering techniques. They are both relevant to modern applications where groups of pixels are to be decided upon in one go, or where adjacency information is relevant. We introduce three new parallel partitioning algorithms for GPUs. By repeatedly applying watershed algorithms, we produce waterfall results which form a hierarchy of partition regions over an input image. Our watershed algorithms attain competitive execution times in both 2D and 3D, processing an 800 megavoxel image in less than 1.4 sec. We also show how to use this fully deterministic image partitioning as a pre-processing step to machine-learning-based semantic segmentation. This replaces the role of superpixel algorithms, and results in comparable accuracy and faster training times. The code is publicly available at https://github.com/hamemm/PRUF-watershed.git.
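For readers unfamiliar with the transform, the sketch below shows the watershed idea in its simplest sequential form on a 1-D signal: every element drains to the local minimum it descends to, and elements sharing a minimum share a label. This toy CPU version only illustrates the clustering principle; the paper's contribution is parallel 2-D/3-D GPU algorithms.

```cpp
// Hedged sketch: steepest-descent watershed labelling on a 1-D signal.
#include <vector>
#include <cstddef>

std::vector<int> watershed_1d(const std::vector<int>& height) {
    std::size_t n = height.size();
    std::vector<std::size_t> parent(n);
    // Each element points to its strictly lower neighbour, if any;
    // local minima (and flat elements) point to themselves.
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t best = i;
        if (i > 0 && height[i - 1] < height[best]) best = i - 1;
        if (i + 1 < n && height[i + 1] < height[best]) best = i + 1;
        parent[i] = best;
    }
    // Follow drainage pointers down to a minimum; its index becomes the label.
    std::vector<int> label(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t p = i;
        while (parent[p] != p) p = parent[p];
        label[i] = static_cast<int>(p);
    }
    return label;
}
```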
Citations: 0
Topology-aware GPU job scheduling with deep reinforcement learning and heuristics
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-26 | DOI: 10.1016/j.jpdc.2025.105138
Hajer Ayadi, Aijun An, Yiming Shao, Hossein Pourmedheji, Junjie Deng, Jimmy X. Huang, Michael Feiman, Hao Zhou
Deep neural networks (DNNs) have gained popularity in many fields such as computer vision, and natural language processing. However, the increasing size of data and complexity of models have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among GPUs assigned to a DNN training job. High communication costs—arising from factors such as inter-rack or inter-machine data transfers—can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs facilitates more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often lack considerations for hardware topology. Moreover, most proposed GPU schedulers ignore the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce TopDRL, an innovative hybrid scheduler that integrates deep reinforcement learning (DRL) and heuristic methods to enhance GPU job scheduling. TopDRL uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and then a heuristic method selects available GPUs closest to each other from the cluster. Reinforcement learning (RL) is used to train the CNN model to select the job that maximizes throughput-based rewards. Extensive evaluation, conducted on datasets with real jobs, shows that TopDRL significantly outperforms six baseline schedulers that use heuristics or other DRL models for job picking and resource allocation.
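The heuristic half of such a scheduler can be sketched as choosing, among the free GPUs, the subset with the smallest total pairwise distance in the hardware topology. The brute-force subset search and the hop-count distance matrix below are illustrative assumptions, not TopDRL's actual allocation routine, and are only practical for small GPU counts.

```cpp
// Hedged sketch: topology-aware GPU allocation by minimizing pairwise distance.
#include <vector>
#include <algorithm>
#include <limits>
#include <cassert>
#include <cstddef>

using Dist = std::vector<std::vector<int>>; // dist[i][j]: hops between GPU i and GPU j

// Cost of placing one job on a subset of GPUs: sum of pairwise distances.
static int placement_cost(const std::vector<std::size_t>& gpus, const Dist& dist) {
    int cost = 0;
    for (std::size_t a = 0; a < gpus.size(); ++a)
        for (std::size_t b = a + 1; b < gpus.size(); ++b)
            cost += dist[gpus[a]][gpus[b]];
    return cost;
}

// Choose `need` GPUs out of the free ones, minimizing communication cost.
std::vector<std::size_t> allocate_gpus(const std::vector<std::size_t>& free_gpus,
                                       std::size_t need, const Dist& dist) {
    assert(need <= free_gpus.size());
    std::vector<bool> pick(free_gpus.size(), false);
    std::fill(pick.end() - static_cast<long>(need), pick.end(), true); // selection mask
    std::vector<std::size_t> best;
    int best_cost = std::numeric_limits<int>::max();
    do { // enumerate all combinations via permutations of the mask
        std::vector<std::size_t> chosen;
        for (std::size_t i = 0; i < free_gpus.size(); ++i)
            if (pick[i]) chosen.push_back(free_gpus[i]);
        int c = placement_cost(chosen, dist);
        if (c < best_cost) { best_cost = c; best = chosen; }
    } while (std::next_permutation(pick.begin(), pick.end()));
    return best;
}
```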
Citations: 0
Edge metric basis and its fault tolerance over certain interconnection networks
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-23 | DOI: 10.1016/j.jpdc.2025.105141
S. Prabhu, T. Jenifer Janany, M. Arulperumjothi, I.G. Yero
The surveillance of elements in an interconnection network is a classical problem in computer engineering. In addition, it is a problem closely related to uniquely identifying the elements of the network, which is indeed a classical distance-related problem in graph theory. This surveillance can be considered for different styles of elements in the network. The classical version centers the attention on the nodes, while some recent variations of it consider monitoring also the edges or both, vertices and edges at the same time. The first style gave rise to graph structures, called edge resolving set and edge metric basis, which is used to uniquely identify the edges of a given network by means of distance vectors. A vertex x in a graph G uniquely recognizes (resolves or identifies) two edges e and f in G if dG[e,x] ≠ dG[f,x], where dG[e,x] stands for the distance between a vertex x and an edge e of G. A set S with the smallest number of vertices, such that every couple of edges is uniquely recognized by a minimum of one vertex in S, is an edge metric basis, and the edge metric dimension refers to the cardinality of such S. Fault tolerance of a working system is the ability of such a system to keep functioning even if one of its parts stops working properly. The fault tolerance property of the edge metric basis is considered in this work. This results in a concept called fault-tolerant edge metric basis. That is, an edge metric basis S of a graph G is fault-tolerant if every pair of edges of G are resolved by a minimum of two vertices in S, and the minimum possible cardinality of such sets is coined as the fault-tolerant edge metric dimension of G. In this work, we present bounds for the edge metric dimension of graphs and its fault tolerance version. In addition, we investigate these parameters for butterfly, Beneš and fractal cubic networks, and find the exact value for their (fault-tolerant) edge metric dimensions.
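For reference, the definitions used above can be written out as follows, taking the vertex-to-edge distance as the usual minimum over the edge's endpoints (an assumption consistent with the standard literature, though not spelled out in the abstract).

```latex
% dG(u,x) denotes the ordinary shortest-path distance between vertices.
\[
  d_G[e,x] \;=\; \min\{\, d_G(u,x),\ d_G(v,x) \,\} \qquad \text{for an edge } e = uv,
\]
\[
  S \subseteq V(G) \text{ resolves all edges} \iff
  \forall\, e \neq f \in E(G)\ \ \exists\, x \in S:\ d_G[e,x] \neq d_G[f,x].
\]
% The edge metric dimension edim(G) is the minimum |S| over such sets; the
% fault-tolerant variant requires every pair e != f to be distinguished by at
% least two vertices of S.
```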
Citations: 0
Dispersion of mobile robots on directed anonymous graphs
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-20 | DOI: 10.1016/j.jpdc.2025.105139
Giuseppe F. Italiano, Debasish Pattanayak, Gokarna Sharma
Given any arbitrary initial configuration of k ≤ n robots positioned on the nodes of an n-node anonymous graph, the problem of dispersion is to autonomously reposition the robots such that each node contains at most one robot. This problem has gained significant interest due to its resemblance to several fundamental problems such as exploration, scattering, load balancing, relocation of electric cars to charging stations, etc. The objective is to solve dispersion while simultaneously minimizing (or providing a trade-off between) the time and the memory requirement at each robot. The literature has mainly dealt with dispersion on undirected anonymous graphs. In this paper, we initiate the study of dispersion on directed anonymous graphs. We first show that it may not always be possible to solve dispersion when the directed graph is not strongly connected. We then establish lower bounds on both the time and the memory requirement at each robot for solving dispersion on a strongly connected directed graph. Finally, we provide three deterministic algorithms solving dispersion on any strongly connected directed graph. Let D be the graph diameter, Δout be its maximum out-degree, and d be the deficiency (the minimum number of edges that need to be added to the graph to make it Eulerian). The first algorithm solves dispersion in O(d·k²) time with O(k·log(k+Δout)) bits at each robot. The second algorithm solves dispersion in O(k²·Δout) time with O(log(k+Δout)) bits at each robot. The third algorithm solves dispersion in O(k·D) time with O(k·log(k+Δout)) bits at each robot, provided that robots in the 1-hop neighborhood can communicate. All three algorithms extend to handle crash faults.
Citations: 0
The CAMINOS interconnection networks simulator
IF 3.4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-06-18 | DOI: 10.1016/j.jpdc.2025.105136
Cristóbal Camarero, Daniel Postigo, Pablo Fuentes
This work presents CAMINOS, a new interconnection network simulator focusing on router microarchitecture. It was developed in Rust, a novel programming language with a syntax similar to C/C++ and strong memory protection.
The architecture of CAMINOS emphasizes the composition of components. This allows new designs to be defined in a configuration file without modifying source code, greatly reducing effort and time.
In addition to simulation functionality, CAMINOS assists in managing a collection of simulations as an experiment. This includes integration with SLURM to support executing batches of simulations and generating PDFs with results and diagnostics.
We show that CAMINOS makes good use of computing resources. Its memory usage is dominated by in-flight messages, indicating low memory overhead. We attest that CAMINOS uses CPU time effectively, as scenarios with little contention execute faster.
Citations: 0