Massively parallel CMA-ES with increasing population
David Redon (CRIStAL, BONUS), Pierre Fortin (CRIStAL, BONUS), Bilel Derbel (CRIStAL, BONUS), Miwako Tsuji (RIKEN CCS), Mitsuhisa Sato (RIKEN CCS)
arXiv:2409.11765, 18 September 2024
The Increasing Population Covariance Matrix Adaptation Evolution Strategy (IPOP-CMA-ES) algorithm is a reference stochastic optimizer for blackbox optimization, where no prior knowledge about the underlying problem structure is available. This paper aims to accelerate IPOP-CMA-ES through high-performance computing and parallelism when solving large optimization problems. We first show how BLAS and LAPACK routines can be introduced in the linear algebra operations, and we then propose two strategies for deploying IPOP-CMA-ES efficiently on large-scale parallel architectures with thousands of CPU cores. The first parallel strategy processes the multiple searches in the same order as the sequential IPOP-CMA-ES, while the second processes these searches concurrently. Both strategies are implemented in MPI+OpenMP and compared on 6144 cores of the Fugaku supercomputer. We obtain substantial speedups (up to several thousand), some of them super-linear, and we provide an in-depth analysis of our results to explain precisely the superior performance of the second strategy.
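The per-generation linear algebra in CMA-ES, the eigendecomposition of the covariance matrix C and the sampling x_k = m + sigma * B D z_k, is exactly the kind of work that can be handed to BLAS/LAPACK. Below is a minimal NumPy sketch of that sampling step (NumPy dispatches `eigh` and the matrix products to LAPACK/BLAS); the function and parameter names are illustrative and this is not the paper's implementation.

```python
import numpy as np

def sample_population(mean, sigma, C, lam, rng=None):
    """Sample lam candidate solutions from N(mean, sigma^2 * C).

    The eigendecomposition (LAPACK, via numpy.linalg.eigh) and the matrix
    products (BLAS) dominate the per-generation linear algebra cost, which
    is where optimized BLAS/LAPACK routines pay off as dimension grows.
    """
    rng = rng or np.random.default_rng()
    eigvals, B = np.linalg.eigh(C)            # C = B diag(eigvals) B^T
    D = np.sqrt(np.maximum(eigvals, 0.0))     # guard against tiny negative eigenvalues
    Z = rng.standard_normal((lam, len(mean)))
    Y = (Z * D) @ B.T                         # rows y_k ~ N(0, C)
    return mean + sigma * Y                   # rows x_k ~ N(mean, sigma^2 C)

# Toy usage: a population of 16 samples in a 10-dimensional search space
X = sample_population(mean=np.zeros(10), sigma=0.3, C=np.eye(10), lam=16)
print(X.shape)  # (16, 10)
```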
{"title":"Massively parallel CMA-ES with increasing population","authors":"David RedonCRIStAL, BONUS, Pierre FortinCRIStAL, BONUS, Bilel DerbelCRIStAL, BONUS, Miwako TsujiRIKEN CCS, Mitsuhisa SatoRIKEN CCS","doi":"arxiv-2409.11765","DOIUrl":"https://doi.org/arxiv-2409.11765","url":null,"abstract":"The Increasing Population Covariance Matrix Adaptation Evolution Strategy\u0000(IPOP-CMA-ES) algorithm is a reference stochastic optimizer dedicated to\u0000blackbox optimization, where no prior knowledge about the underlying problem\u0000structure is available. This paper aims at accelerating IPOP-CMA-ES thanks to\u0000high performance computing and parallelism when solving large optimization\u0000problems. We first show how BLAS and LAPACK routines can be introduced in\u0000linear algebra operations, and we then propose two strategies for deploying\u0000IPOP-CMA-ES efficiently on large-scale parallel architectures with thousands of\u0000CPU cores. The first parallel strategy processes the multiple searches in the\u0000same ordering as the sequential IPOP-CMA-ES, while the second one processes\u0000concurrently these multiple searches. These strategies are implemented in\u0000MPI+OpenMP and compared on 6144 cores of the supercomputer Fugaku. We manage to\u0000obtain substantial speedups (up to several thousand) and even super-linear\u0000ones, and we provide an in-depth analysis of our results to understand\u0000precisely the superior performance of our second strategy.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"189 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations
Hussam Al Daas (STFC, Scientific Computing Department, Rutherford Appleton Laboratory, Didcot, UK), Grey Ballard (Wake Forest University, Computer Science Department, Winston-Salem, NC, USA), Laura Grigori (EPFL, Institute of Mathematics, Lausanne, Switzerland and PSI, Center for Scientific Computing, Theory and Data, Villigen, Switzerland), Suraj Kumar (Institut national de recherche en sciences et technologies du numérique, Lyon, France), Kathryn Rouse (Inmar Intelligence, Winston-Salem, NC, USA), Mathieu Verite (EPFL, Institute of Mathematics, Lausanne, Switzerland)
arXiv:2409.11304, 17 September 2024
In this article, we focus on the communication costs of three symmetric matrix computations: i) multiplying a matrix by its transpose, known as a symmetric rank-k update (SYRK); ii) adding the product of a matrix with the transpose of another matrix to the transpose of that product, known as a symmetric rank-2k update (SYR2K); and iii) matrix multiplication with a symmetric input matrix (SYMM). All three computations appear in the Level 3 Basic Linear Algebra Subprograms (BLAS) and are widely used in applications involving symmetric matrices. We establish communication lower bounds for these kernels in both sequential and distributed-memory parallel computational models, and we show that our bounds are tight by presenting communication-optimal algorithms for each setting. Our lower bound proofs rely on applying a geometric inequality for symmetric computations and analytically solving constrained nonlinear optimization problems. In the optimal algorithms, the symmetric matrix and its corresponding computations are accessed and performed according to a triangular block partitioning scheme.
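For reference, the three kernels compute the following. This is a hedged NumPy sketch using plain matrix products only to pin down the definitions; an actual BLAS implementation (e.g., dsyrk, dsyr2k, dsymm) exploits symmetry and only computes and communicates one triangle of the output.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
A = rng.standard_normal((n, k))
B = rng.standard_normal((n, k))
S = rng.standard_normal((n, n)); S = S + S.T   # symmetric input for SYMM
X = rng.standard_normal((n, n))

syrk  = A @ A.T              # symmetric rank-k update:  C = A * A^T
syr2k = A @ B.T + B @ A.T    # symmetric rank-2k update: C = A*B^T + B*A^T
symm  = S @ X                # multiplication by a symmetric matrix

# Both update kernels produce symmetric results, so a BLAS routine only
# needs to compute (and communicate) one triangle of C.
assert np.allclose(syrk, syrk.T) and np.allclose(syr2k, syr2k.T)
```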
{"title":"Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations","authors":"Hussam Al DaasSTFC, Scientific Computing Department, Rutherford Appleton Laboratory, Didcot, UK, Grey BallardWake Forest University, Computer Science Department, Winston-Salem, NC, USA, Laura GrigoriEPFL, Institute of Mathematics, Lausanne, Switzerland and PSI, Center for Scientific Computing, Theory and Data, Villigen, Switzerland, Suraj KumarInstitut national de recherche en sciences et technologies du numérique, Lyon, France, Kathryn RouseInmar Intelligence, Winston-Salem, NC, USA, Mathieu VeriteEPFL, Institute of Mathematics, Lausanne, Switzerland","doi":"arxiv-2409.11304","DOIUrl":"https://doi.org/arxiv-2409.11304","url":null,"abstract":"In this article, we focus on the communication costs of three symmetric\u0000matrix computations: i) multiplying a matrix with its transpose, known as a\u0000symmetric rank-k update (SYRK) ii) adding the result of the multiplication of a\u0000matrix with the transpose of another matrix and the transpose of that result,\u0000known as a symmetric rank-2k update (SYR2K) iii) performing matrix\u0000multiplication with a symmetric input matrix (SYMM). All three computations\u0000appear in the Level 3 Basic Linear Algebra Subroutines (BLAS) and have wide use\u0000in applications involving symmetric matrices. We establish communication lower\u0000bounds for these kernels using sequential and distributed-memory parallel\u0000computational models, and we show that our bounds are tight by presenting\u0000communication-optimal algorithms for each setting. Our lower bound proofs rely\u0000on applying a geometric inequality for symmetric computations and analytically\u0000solving constrained nonlinear optimization problems. The symmetric matrix and\u0000its corresponding computations are accessed and performed according to a\u0000triangular block partitioning scheme in the optimal algorithms.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"7 5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ladon: High-Performance Multi-BFT Consensus via Dynamic Global Ordering (Extended Version)
Hanzheng Lyu, Shaokang Xie, Jianyu Niu, Chen Feng, Yinqian Zhang, Ivan Beschastnikh
arXiv:2409.10954, 17 September 2024
Multi-BFT consensus runs multiple leader-based consensus instances in parallel, circumventing the leader bottleneck of a single instance. However, it contains an Achilles' heel: the need to globally order output blocks across instances. Deriving this global ordering is challenging because it must cope with the different rates at which instances produce blocks. Prior Multi-BFT designs assign each block a global index before creation, leading to poor performance. We propose Ladon, a high-performance Multi-BFT protocol that allows varying instance block rates. Our key idea is to order blocks across instances dynamically, which eliminates blocking on slow instances. We achieve dynamic global ordering by assigning monotonic ranks to blocks. We pipeline rank coordination with the consensus process to reduce protocol overhead and combine aggregate signatures with rank information to reduce message complexity. Ladon's dynamic ordering lets blocks be globally ordered according to their generation order, which respects inter-block causality. We implemented and evaluated Ladon by integrating it with both the PBFT and HotStuff protocols. Our evaluation shows that Ladon-PBFT (resp., Ladon-HotStuff) improves the peak throughput of the prior art by $\approx$8x (resp., 2x) and reduces latency by $\approx$62% (resp., 23%) when deployed with one straggling replica (out of 128 replicas) in a WAN setting.
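To make the ordering idea concrete, here is a toy Python sketch of ordering blocks from several instances by a monotonic rank instead of a pre-assigned global index. The rank-assignment rule and the field names are illustrative assumptions, not Ladon's actual rank-coordination protocol.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Block:
    rank: int                     # monotonic rank agreed during consensus (assumed)
    instance: int                 # id of the consensus instance that produced it
    payload: str = field(compare=False)

def global_order(per_instance_logs):
    """Merge per-instance block logs into one global ledger order.

    Blocks are ordered by (rank, instance), so a slow instance only delays
    its own blocks instead of stalling the whole global sequence, which is
    the intuition behind dynamic ordering."""
    return list(heapq.merge(*per_instance_logs))

fast = [Block(1, 0, "f0"), Block(2, 0, "f1"), Block(3, 0, "f2")]
slow = [Block(2, 1, "s0"), Block(5, 1, "s1")]
print([b.payload for b in global_order([fast, slow])])
# ['f0', 'f1', 's0', 'f2', 's1']
```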
{"title":"Ladon: High-Performance Multi-BFT Consensus via Dynamic Global Ordering (Extended Version)","authors":"Hanzheng Lyu, Shaokang Xie, Jianyu Niu, Chen Feng, Yinqian Zhang, Ivan Beschastnikh","doi":"arxiv-2409.10954","DOIUrl":"https://doi.org/arxiv-2409.10954","url":null,"abstract":"Multi-BFT consensus runs multiple leader-based consensus instances in\u0000parallel, circumventing the leader bottleneck of a single instance. However, it\u0000contains an Achilles' heel: the need to globally order output blocks across\u0000instances. Deriving this global ordering is challenging because it must cope\u0000with different rates at which blocks are produced by instances. Prior Multi-BFT\u0000designs assign each block a global index before creation, leading to poor\u0000performance. We propose Ladon, a high-performance Multi-BFT protocol that allows varying\u0000instance block rates. Our key idea is to order blocks across instances\u0000dynamically, which eliminates blocking on slow instances. We achieve dynamic\u0000global ordering by assigning monotonic ranks to blocks. We pipeline rank\u0000coordination with the consensus process to reduce protocol overhead and combine\u0000aggregate signatures with rank information to reduce message complexity.\u0000Ladon's dynamic ordering enables blocks to be globally ordered according to\u0000their generation, which respects inter-block causality. We implemented and\u0000evaluated Ladon by integrating it with both PBFT and HotStuff protocols. Our\u0000evaluation shows that Ladon-PBFT (resp., Ladon-HotStuff) improves the peak\u0000throughput of the prior art by $approx$8x (resp., 2x) and reduces latency by\u0000$approx$62% (resp., 23%), when deployed with one straggling replica (out of\u0000128 replicas) in a WAN setting.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CountChain: A Decentralized Oracle Network for Counting Systems
Behkish Nassirzadeh, Stefanos Leonardos, Albert Heinle, Anwar Hasan, Vijay Ganesh
arXiv:2409.11592, 17 September 2024
Blockchain integration in industries like online advertising is hindered by blockchain's limited connectivity to off-chain data. These industries rely heavily on precise counting systems for collecting and analyzing off-chain data, which requires mechanisms, often called oracles, to feed off-chain data into smart contracts. However, current oracle solutions are ill-suited for counting systems because the oracles do not know when to expect the data, posing a significant challenge. To address this, we present CountChain, a decentralized oracle network for counting systems. In CountChain, data is received by all oracle nodes, and any node can submit a proposition request. Each proposition contains enough data to evaluate the occurrence of an event. Only randomly selected nodes participate in a game to evaluate the truthfulness of each proposition by providing proof and some stake. Finally, the propositions whose outcome is True increment the counter in a smart contract. Thus, instead of a contract calling oracles for data, in CountChain the oracles call a smart contract when the data is available. Furthermore, we present a formal analysis and an experimental evaluation of the system's parameters on over half a million data points to obtain optimal system parameters. Under such conditions, our game-theoretical analysis demonstrates that a Nash equilibrium exists wherein all rational parties participate honestly.
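A highly simplified sketch of the counting flow follows: a randomly selected subset of oracle nodes votes on a proposition, and only propositions judged True increment an on-chain counter. All class and method names here are illustrative stand-ins, not CountChain's actual interfaces, and the stake/proof mechanics are omitted.

```python
import random

class CounterContract:
    """Stand-in for the on-chain smart contract holding the counter."""
    def __init__(self):
        self.count = 0
    def submit(self, votes_true, votes_false):
        if votes_true > votes_false:          # proposition judged True
            self.count += 1

def evaluate_proposition(nodes, proposition, committee_size=5, seed=None):
    """Randomly select nodes to vote on one proposition, then report the tally."""
    rng = random.Random(seed)
    committee = rng.sample(nodes, committee_size)
    votes = [node(proposition) for node in committee]   # each node returns True/False
    return sum(votes), committee_size - sum(votes)

contract = CounterContract()
nodes = [lambda p: p["event_seen"] for _ in range(20)]  # toy honest nodes
contract.submit(*evaluate_proposition(nodes, {"event_seen": True}, seed=1))
print(contract.count)  # 1
```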
{"title":"CountChain: A Decentralized Oracle Network for Counting Systems","authors":"Behkish Nassirzadeh, Stefanos Leonardos, Albert Heinle, Anwar Hasan, Vijay Ganesh","doi":"arxiv-2409.11592","DOIUrl":"https://doi.org/arxiv-2409.11592","url":null,"abstract":"Blockchain integration in industries like online advertising is hindered by\u0000its connectivity limitations to off-chain data. These industries heavily rely\u0000on precise counting systems for collecting and analyzing off-chain data. This\u0000requires mechanisms, often called oracles, to feed off-chain data into smart\u0000contracts. However, current oracle solutions are ill-suited for counting\u0000systems since the oracles do not know when to expect the data, posing a\u0000significant challenge. To address this, we present CountChain, a decentralized oracle network for\u0000counting systems. In CountChain, data is received by all oracle nodes, and any\u0000node can submit a proposition request. Each proposition contains enough data to\u0000evaluate the occurrence of an event. Only randomly selected nodes participate\u0000in a game to evaluate the truthfulness of each proposition by providing proof\u0000and some stake. Finally, the propositions with the outcome of True increment\u0000the counter in a smart contract. Thus, instead of a contract calling oracles\u0000for data, in CountChain, the oracles call a smart contract when the data is\u0000available. Furthermore, we present a formal analysis and experimental\u0000evaluation of the system's parameters on over half a million data points to\u0000obtain optimal system parameters. In such conditions, our game-theoretical\u0000analysis demonstrates that a Nash equilibrium exists wherein all rational\u0000parties participate with honesty.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy Efficiency Support for Software Defined Networks: a Serverless Computing Approach
Fatemeh Banaie, Karim Djemame, Abdulaziz Alhindi, Vasilios Kelefouras
arXiv:2409.11208, 17 September 2024
Automatic network management strategies have become paramount for meeting the needs of innovative real-time and data-intensive applications, such as in the Internet of Things. However, meeting the ever-growing and fluctuating demands for data and services in such applications requires, more than ever, an efficient and scalable approach to network resource management. Such an approach should enable the automated provisioning of services while incentivising energy-efficient resource usage across the edge-to-cloud continuum. This paper is the first to realise the concept of modular Software-Defined Networks based on serverless functions in an energy-aware environment. By adopting Function as a Service, the approach enables on-demand deployment of network functions, reducing cost through fine-grained resource provisioning. An analytical model is presented to approximate the service delivery time and power consumption, alongside an open-source prototype implementation supported by an extensive experimental evaluation. The experiments demonstrate not only the practical applicability of the proposed approach but also a significant improvement in energy efficiency.
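For intuition only, here is a toy sketch of the kind of delivery-time and power model such an approach might use: delivery time decomposed into network, execution, and an occasional cold-start term, and a linear utilization-based power model. The decomposition and every parameter name below are assumptions made for illustration; they are not the paper's analytical model.

```python
def service_delivery_time(t_net, t_exec, t_cold_start, p_cold):
    """Expected delivery time of a network function deployed as a serverless
    function: network transfer + execution + a cold start paid only with
    probability p_cold (all terms are illustrative assumptions)."""
    return t_net + t_exec + p_cold * t_cold_start

def power_consumption(p_idle, p_peak, utilization):
    """Simple linear power model: idle power plus a utilization-proportional
    dynamic part (again, purely illustrative)."""
    return p_idle + (p_peak - p_idle) * utilization

print(service_delivery_time(t_net=5e-3, t_exec=2e-3, t_cold_start=0.4, p_cold=0.1))
print(power_consumption(p_idle=60.0, p_peak=150.0, utilization=0.35))
```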
{"title":"Energy Efficiency Support for Software Defined Networks: a Serverless Computing Approach","authors":"Fatemeh Banaie, Karim Djemame, Abdulaziz Alhindi, Vasilios Kelefouras","doi":"arxiv-2409.11208","DOIUrl":"https://doi.org/arxiv-2409.11208","url":null,"abstract":"Automatic network management strategies have become paramount for meeting the\u0000needs of innovative real-time and data-intensive applications, such as in the\u0000Internet of Things. However, meeting the ever-growing and fluctuating demands\u0000for data and services in such applications requires more than ever an efficient\u0000and scalable network resource management approach. Such approach should enable\u0000the automated provisioning of services while incentivising energy-efficient\u0000resource usage that expands throughout the edge-to-cloud continuum. This paper\u0000is the first to realise the concept of modular Software-Defined Networks based\u0000on serverless functions in an energy-aware environment. By adopting Function as\u0000a Service, the approach enables on-demand deployment of network functions,\u0000resulting in cost reduction through fine resource provisioning granularity. An\u0000analytical model is presented to approximate the service delivery time and\u0000power consumption, as well as an open-source prototype implementation supported\u0000by an extensive experimental evaluation. The experiments demonstrate not only\u0000the practical applicability of the proposed approach but significant\u0000improvement in terms of energy efficiency.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"47 25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Delay Analysis of EIP-4844
Pourya Soltani, Farid Ashtiani
arXiv:2409.11043, 17 September 2024
Proto-Danksharding, proposed in Ethereum Improvement Proposal 4844 (EIP-4844), aims to incrementally improve the scalability of the Ethereum blockchain by introducing a new type of transaction known as blob-carrying transactions. These transactions incorporate binary large objects (blobs) of data that are stored off-chain but referenced and verified on-chain to ensure data availability. By decoupling data availability from transaction execution, Proto-Danksharding alleviates network congestion and reduces gas fees, laying the groundwork for future, more advanced sharding solutions. This letter provides an analytical model to derive the delay experienced by these new transactions. We model the system as an $\mathrm{M/D}^B/1$ queue, whose steady-state distribution we obtain through an embedded Markov chain and the supplementary variable method. We show that transactions carrying more blobs but arriving less frequently impose higher delays on the system than transactions with fewer blobs arriving more frequently.
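A small simulation can help build intuition for this queueing model: Poisson blob arrivals served in batches of up to B per deterministic service interval D. The slotted scheme below (service slots aligned to multiples of D) is a deliberate simplification and an illustrative toy, not the paper's analytical derivation.

```python
import random

def simulate_batch_queue(lam, D, B, horizon=50_000.0, seed=0):
    """Toy slotted simulation of a batch-service queue with Poisson arrivals
    (rate lam), deterministic service time D, and batches of up to B customers.
    Returns the mean sojourn time (waiting + service)."""
    rng = random.Random(seed)
    queue, delays = [], []
    next_arrival = rng.expovariate(lam)
    t = 0.0
    while t < horizon:
        while next_arrival < t:                    # arrivals before this slot start
            queue.append(next_arrival)
            next_arrival += rng.expovariate(lam)
        batch, queue = queue[:B], queue[B:]        # take up to B customers
        delays.extend(t + D - a for a in batch)    # they depart at the end of the slot
        t += D
    return sum(delays) / len(delays)

print(simulate_batch_queue(lam=0.8, D=1.0, B=2))   # small batch capacity: longer delays
print(simulate_batch_queue(lam=0.8, D=1.0, B=6))   # larger batch capacity: shorter delays
```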
{"title":"Delay Analysis of EIP-4844","authors":"Pourya Soltani, Farid Ashtiani","doi":"arxiv-2409.11043","DOIUrl":"https://doi.org/arxiv-2409.11043","url":null,"abstract":"Proto-Danksharding, proposed in Ethereum Improvement Proposal 4844\u0000(EIP-4844), aims to incrementally improve the scalability of the Ethereum\u0000blockchain by introducing a new type of transaction known as blob-carrying\u0000transactions. These transactions incorporate binary large objects (blobs) of\u0000data that are stored off-chain but referenced and verified on-chain to ensure\u0000data availability. By decoupling data availability from transaction execution,\u0000Proto-Danksharding alleviates network congestion and reduces gas fees, laying\u0000the groundwork for future, more advanced sharding solutions. This letter\u0000provides an analytical model to derive the delay for these new transactions. We\u0000model the system as an $mathrm{M/D}^B/1$ queue which we then find its steady\u0000state distribution through embedding a Markov chain and use of supplementary\u0000variable method. We show that transactions with more blobs but less frequent\u0000impose higher delays on the system compared to lower blobs but more frequent.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"118 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Green Multi-Attribute Client Selection for Over-The-Air Federated Learning: A Grey-Wolf-Optimizer Approach
Maryam Ben Driss, Essaid Sabir, Halima Elbiaze, Abdoulaye Baniré Diallo, Mohamed Sadik
arXiv:2409.11442, 16 September 2024
Federated Learning (FL) has gained attention across various industries for its capability to train machine learning models without centralizing sensitive data. While this approach offers significant benefits such as privacy preservation and decreased communication overhead, it presents several challenges, including deployment complexity and interoperability issues, particularly in heterogeneous scenarios or resource-constrained environments. Over-the-air (OTA) FL was introduced to tackle these challenges by disseminating model updates without necessitating direct device-to-device connections or centralized servers. However, OTA-FL brought forth limitations associated with heightened energy consumption and network latency. In this paper, we propose a multi-attribute client selection framework employing the grey wolf optimizer (GWO) to strategically control the number of participants in each round and optimize the OTA-FL process while considering accuracy, energy, delay, reliability, and fairness constraints of participating devices. We evaluate the performance of our multi-attribute client selection approach in terms of model loss minimization, convergence time reduction, and energy efficiency. In our experimental evaluation, we assessed and compared the performance of our approach against the existing state-of-the-art methods. Our results demonstrate that the proposed GWO-based client selection outperforms these baselines across various metrics. Specifically, our approach achieves a notable reduction in model loss, accelerates convergence time, and enhances energy efficiency while maintaining high fairness and reliability indicators.
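For readers unfamiliar with the optimizer, a minimal grey wolf optimizer (GWO) sketch follows. The sphere fitness function and the bounds are placeholders; in the paper's setting the fitness would encode the accuracy/energy/delay/reliability/fairness trade-off of a candidate client-selection decision, which is not reproduced here.

```python
import numpy as np

def gwo(fitness, dim, bounds, n_wolves=20, iters=200, seed=0):
    """Minimal grey wolf optimizer (minimization): wolves move toward the
    three best solutions found so far (alpha, beta, delta)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_wolves, dim))
    for it in range(iters):
        scores = np.array([fitness(x) for x in X])
        alpha, beta, delta = X[np.argsort(scores)[:3]]
        a = 2.0 * (1.0 - it / iters)                    # exploration factor: 2 -> 0
        pull = np.zeros_like(X)
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(X.shape), rng.random(X.shape)
            A, C = 2 * a * r1 - a, 2 * r2
            pull += leader - A * np.abs(C * leader - X)
        X = np.clip(pull / 3.0, lo, hi)                 # X(t+1) = (X1 + X2 + X3) / 3
    scores = np.array([fitness(x) for x in X])
    return X[np.argmin(scores)], float(scores.min())

# Toy usage on a sphere function (a stand-in for a multi-attribute selection fitness)
best, val = gwo(lambda x: float(np.sum(x * x)), dim=5, bounds=(-5.0, 5.0))
print(best, val)
```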
{"title":"A Green Multi-Attribute Client Selection for Over-The-Air Federated Learning: A Grey-Wolf-Optimizer Approach","authors":"Maryam Ben Driss, Essaid Sabir, Halima Elbiaze, Abdoulaye Baniré Diallo, Mohamed Sadik","doi":"arxiv-2409.11442","DOIUrl":"https://doi.org/arxiv-2409.11442","url":null,"abstract":"Federated Learning (FL) has gained attention across various industries for\u0000its capability to train machine learning models without centralizing sensitive\u0000data. While this approach offers significant benefits such as privacy\u0000preservation and decreased communication overhead, it presents several\u0000challenges, including deployment complexity and interoperability issues,\u0000particularly in heterogeneous scenarios or resource-constrained environments.\u0000Over-the-air (OTA) FL was introduced to tackle these challenges by\u0000disseminating model updates without necessitating direct device-to-device\u0000connections or centralized servers. However, OTA-FL brought forth limitations\u0000associated with heightened energy consumption and network latency. In this\u0000paper, we propose a multi-attribute client selection framework employing the\u0000grey wolf optimizer (GWO) to strategically control the number of participants\u0000in each round and optimize the OTA-FL process while considering accuracy,\u0000energy, delay, reliability, and fairness constraints of participating devices.\u0000We evaluate the performance of our multi-attribute client selection approach in\u0000terms of model loss minimization, convergence time reduction, and energy\u0000efficiency. In our experimental evaluation, we assessed and compared the\u0000performance of our approach against the existing state-of-the-art methods. Our\u0000results demonstrate that the proposed GWO-based client selection outperforms\u0000these baselines across various metrics. Specifically, our approach achieves a\u0000notable reduction in model loss, accelerates convergence time, and enhances\u0000energy efficiency while maintaining high fairness and reliability indicators.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"591 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture
Xinyao Yi
arXiv:2409.10661, 16 September 2024
Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing are: 1) applying multithreading technology on single-core or multi-core CPUs; 2) incorporating powerful parallel computing devices such as GPUs, FPGAs, and other accelerators; and 3) utilizing special parallel architectures like Single Instruction/Multiple Data (SIMD). Many researchers have made efforts with different parallel technologies, including developing applications, conducting performance analyses, identifying performance bottlenecks, and proposing feasible solutions. However, balancing and optimizing parallel programs remains challenging due to the complexity of parallel algorithms and hardware architectures. Issues such as data transfer between hosts and devices in heterogeneous systems continue to be bottlenecks that limit performance. This work summarizes a large body of information on various parallel programming techniques, aiming to present the current state and future development trends of parallel programming, performance issues, and solutions. It seeks to give readers an overall picture and provide the background knowledge needed to support subsequent research.
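As a small, self-contained illustration of the performance gap these techniques target, the snippet below compares a pure-Python dot product with NumPy's vectorized version, which runs in compiled, SIMD-capable kernels. It only illustrates the kind of speedup being surveyed; the measured ratio depends on the machine and is not a result from the paper.

```python
import time
import numpy as np

def dot_scalar(a, b):
    """Baseline: one multiply-add per Python-level iteration."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

n = 2_000_000
a = np.random.rand(n)
b = np.random.rand(n)

t0 = time.perf_counter(); s1 = dot_scalar(a, b); t1 = time.perf_counter()
t2 = time.perf_counter(); s2 = float(a @ b);     t3 = time.perf_counter()

print(f"python loop: {t1 - t0:.3f}s   numpy/SIMD kernel: {t3 - t2:.3f}s")
print(f"results agree: {abs(s1 - s2) < 1e-6 * abs(s2)}")
```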
{"title":"A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture","authors":"Xinyao Yi","doi":"arxiv-2409.10661","DOIUrl":"https://doi.org/arxiv-2409.10661","url":null,"abstract":"Parallel computing is a standard approach to achieving high-performance\u0000computing (HPC). Three commonly used methods to implement parallel computing\u0000include: 1) applying multithreading technology on single-core or multi-core\u0000CPUs; 2) incorporating powerful parallel computing devices such as GPUs, FPGAs,\u0000and other accelerators; and 3) utilizing special parallel architectures like\u0000Single Instruction/Multiple Data (SIMD). Many researchers have made efforts using different parallel technologies,\u0000including developing applications, conducting performance analyses, identifying\u0000performance bottlenecks, and proposing feasible solutions. However, balancing\u0000and optimizing parallel programs remain challenging due to the complexity of\u0000parallel algorithms and hardware architectures. Issues such as data transfer\u0000between hosts and devices in heterogeneous systems continue to be bottlenecks\u0000that limit performance. This work summarizes a vast amount of information on various parallel\u0000programming techniques, aiming to present the current state and future\u0000development trends of parallel programming, performance issues, and solutions.\u0000It seeks to give readers an overall picture and provide background knowledge to\u0000support subsequent research.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deterministic Bounds in Committee Selection: Enhancing Decentralization and Scalability in Distributed Ledgers
Grigorii Melnikov, Sebastian Müller, Nikita Polyanskii, Yury Yanovich
arXiv:2409.10727, 16 September 2024
Consensus plays a crucial role in distributed ledger systems, impacting both scalability and decentralization. Many blockchain systems use a weighted lottery based on a scarce resource such as stake, storage, memory, or computing power to select a committee whose members drive the consensus and are responsible for adding new information to the ledger. Therefore, ensuring a robust and fair committee selection process is essential for maintaining security, efficiency, and decentralization. There are two main approaches to randomized committee selection. In one approach, each validator candidate locally checks whether they are elected to the committee and reveals their proof during the consensus phase. In contrast, in the second approach, a sortition algorithm decides a fixed-sized committee that is globally verified. This paper focuses on the latter approach, with cryptographic sortition as a method for fair committee selection that guarantees a constant committee size. Our goal is to develop deterministic guarantees that strengthen decentralization. We introduce novel methods that provide deterministic bounds on the influence of adversaries within the committee, as evidenced by numerical experiments. This approach overcomes the limitations of existing protocols that only offer probabilistic guarantees, often producing committees too large to be practical for many quorum-based applications like atomic broadcast and randomness beacon protocols.
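Below is a toy sketch of the second approach: global, stake-weighted sortition into a fixed-size committee. The weighting rule and the shared seed are plain illustrations; the sketch contains none of the cryptographic machinery or the deterministic adversary bounds the paper develops.

```python
import random

def sortition(stakes, committee_size, seed):
    """Select a fixed-size committee with probability proportional to stake,
    without replacement.  `seed` plays the role of a shared, verifiable
    randomness source so every node derives the same committee."""
    rng = random.Random(seed)
    remaining = dict(stakes)
    committee = []
    for _ in range(committee_size):
        total = sum(remaining.values())
        r = rng.uniform(0, total)
        acc = 0.0
        for node, stake in remaining.items():
            acc += stake
            if r <= acc:
                committee.append(node)
                del remaining[node]
                break
    return committee

stakes = {f"v{i}": s for i, s in enumerate([50, 30, 10, 5, 3, 2])}
print(sortition(stakes, committee_size=3, seed="epoch-42"))
```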
{"title":"Deterministic Bounds in Committee Selection: Enhancing Decentralization and Scalability in Distributed Ledgers","authors":"Grigorii Melnikov, Sebastian Müller, Nikita Polyanskii, Yury Yanovich","doi":"arxiv-2409.10727","DOIUrl":"https://doi.org/arxiv-2409.10727","url":null,"abstract":"Consensus plays a crucial role in distributed ledger systems, impacting both\u0000scalability and decentralization. Many blockchain systems use a weighted\u0000lottery based on a scarce resource such as a stake, storage, memory, or\u0000computing power to select a committee whose members drive the consensus and are\u0000responsible for adding new information to the ledger. Therefore, ensuring a\u0000robust and fair committee selection process is essential for maintaining\u0000security, efficiency, and decentralization. There are two main approaches to randomized committee selection. In one\u0000approach, each validator candidate locally checks whether they are elected to\u0000the committee and reveals their proof during the consensus phase. In contrast,\u0000in the second approach, a sortition algorithm decides a fixed-sized committee\u0000that is globally verified. This paper focuses on the latter approach, with\u0000cryptographic sortition as a method for fair committee selection that\u0000guarantees a constant committee size. Our goal is to develop deterministic\u0000guarantees that strengthen decentralization. We introduce novel methods that\u0000provide deterministic bounds on the influence of adversaries within the\u0000committee, as evidenced by numerical experiments. This approach overcomes the\u0000limitations of existing protocols that only offer probabilistic guarantees,\u0000often providing large committees that are impractical for many quorum-based\u0000applications like atomic broadcast and randomness beacon protocols.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TPFL: Tsetlin-Personalized Federated Learning with Confidence-Based Clustering
Rasoul Jafari Gohari, Laya Aliahmadipour, Ezat Valipour
arXiv:2409.10392, 16 September 2024
The world of Machine Learning (ML) has witnessed rapid changes in terms of new models and ways to process users' data. The majority of work that has been done is focused on Deep Learning (DL) based approaches. However, with the emergence of new algorithms such as the Tsetlin Machine (TM) algorithm, there is growing interest in exploring alternative approaches that may offer unique advantages in certain domains or applications. One of these domains is Federated Learning (FL), in which users' privacy is of utmost importance. Due to its novelty, FL has seen a surge in the incorporation of personalization techniques to enhance model accuracy while maintaining user privacy under personalized conditions. In this work, we propose a novel approach dubbed TPFL: Tsetlin-Personalized Federated Learning, in which models are grouped into clusters based on their confidence towards a specific class. In this way, clustering can benefit from two key advantages. Firstly, clients share only what they are confident about, eliminating wrongful weight aggregation among clients whose data for a specific class may not have been sufficient during training. This phenomenon is prevalent when the data are non-Independent and Identically Distributed (non-IID). Secondly, by sharing only weights towards a specific class, communication cost is substantially reduced, making TPFL efficient in terms of both accuracy and communication cost. TPFL achieved the highest accuracy on three different datasets, namely MNIST, FashionMNIST, and FEMNIST.
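A compact sketch of the clustering idea as described above: each client reports the class it is most confident on, clients are grouped by that class, and updates are averaged only within each cluster. The data structures and the confidence measure are illustrative assumptions, not the exact TPFL procedure, which operates on Tsetlin machine models rather than the dense weight vectors used here.

```python
from collections import defaultdict
import numpy as np

def cluster_and_aggregate(client_updates):
    """client_updates: list of (per_class_confidence, weights) pairs.

    Clients are clustered by their most confident class and a FedAvg-style
    mean is computed per cluster, so a client never contributes to classes
    it has seen too little data for (the non-IID failure mode)."""
    clusters = defaultdict(list)
    for confidence, weights in client_updates:
        clusters[int(np.argmax(confidence))].append(weights)
    return {cls: np.mean(w, axis=0) for cls, w in clusters.items()}

# Toy usage: 6 clients, 10 classes, 8-dimensional "weights" per client
rng = np.random.default_rng(0)
updates = [(rng.dirichlet(np.ones(10)), rng.standard_normal(8)) for _ in range(6)]
per_cluster_models = cluster_and_aggregate(updates)
print({cls: w.shape for cls, w in per_cluster_models.items()})
```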
{"title":"TPFL: Tsetlin-Personalized Federated Learning with Confidence-Based Clustering","authors":"Rasoul Jafari Gohari, Laya Aliahmadipour, Ezat Valipour","doi":"arxiv-2409.10392","DOIUrl":"https://doi.org/arxiv-2409.10392","url":null,"abstract":"The world of Machine Learning (ML) has witnessed rapid changes in terms of\u0000new models and ways to process users data. The majority of work that has been\u0000done is focused on Deep Learning (DL) based approaches. However, with the\u0000emergence of new algorithms such as the Tsetlin Machine (TM) algorithm, there\u0000is growing interest in exploring alternative approaches that may offer unique\u0000advantages in certain domains or applications. One of these domains is\u0000Federated Learning (FL), in which users privacy is of utmost importance. Due to\u0000its novelty, FL has seen a surge in the incorporation of personalization\u0000techniques to enhance model accuracy while maintaining user privacy under\u0000personalized conditions. In this work, we propose a novel approach dubbed TPFL:\u0000Tsetlin-Personalized Federated Learning, in which models are grouped into\u0000clusters based on their confidence towards a specific class. In this way,\u0000clustering can benefit from two key advantages. Firstly, clients share only\u0000what they are confident about, resulting in the elimination of wrongful weight\u0000aggregation among clients whose data for a specific class may have not been\u0000enough during the training. This phenomenon is prevalent when the data are\u0000non-Independent and Identically Distributed (non-IID). Secondly, by sharing\u0000only weights towards a specific class, communication cost is substantially\u0000reduced, making TPLF efficient in terms of both accuracy and communication\u0000cost. The results of TPFL demonstrated the highest accuracy on three different\u0000datasets; namely MNIST, FashionMNIST and FEMNIST.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}