Reliable Password Hardening Service with Opt-Out
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00031
Chunfu Jia, Shaoqiang Wu, Ding Wang
As the most dominant authentication mechanism, password-based authentication suffers from catastrophic offline password guessing attacks once the authentication server is compromised and the password database is leaked. The password hardening (PH) service, an external/third-party cryptographic service, has recently been proposed to strengthen password storage and reduce the damage of authentication server compromise. However, all existing schemes are unreliable in that they overlook an important restorable property: PH service opt-out. In existing PH schemes, once the authentication server has subscribed to a PH service, it must adopt this service forever, even if it wants to stop the external/third-party PH service and restore its original password storage (or subscribe to another PH service). To fill the gap, we propose a new PH service called PW-Hero that equips the PH service with an option to terminate its use (i.e., opt-out). In PW-Hero, password authentication is strengthened against offline attacks by adding external secret spices to password records. With the opt-out property, authentication servers can proactively request to end the PH service after successful authentications; password records can then be securely migrated back to their traditional salted-hash state, ready for subscription to other PH services. In addition, PW-Hero achieves all existing desirable properties, such as comprehensive verifiability, rate limiting against online attacks, and user privacy. We define PW-Hero as a suite of protocols that meet these desirable properties and build a simple, secure, and efficient instance. Moreover, we develop a prototype implementation and evaluate its performance, establishing the practicality of our PW-Hero service.
{"title":"Reliable Password Hardening Service with Opt-Out","authors":"Chunfu Jia, Shaoqiang Wu, Ding Wang","doi":"10.1109/SRDS55811.2022.00031","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00031","url":null,"abstract":"As the most dominant authentication mechanism, password-based authentication suffers catastrophic offline password guessing attacks once the authentication server is compromised and the password database is leaked. Password hardening (PH) service, an external/third-party crypto service, has been recently proposed to strengthen password storage and reduce the damage of authentication server compromise. However, all existing schemes are unreliable in that they overlook the important restorable property: PH service opt-out. In existing PH schemes, once the authentication server has subscribed to a PH service, it must adopt this service forever, even if it wants to stop the external/third-party PH service and restore its original password storage (or subscribe to another PH service). To fill the gap, we propose a new PH service called PW-Hero that equips its PH service with an option to terminate its use (i.e., opt-out). In PW-Hero, password authentication is strengthened against offline attacks by adding external secret spices to password records. With the opt-out property, authentication servers can proactively request to end the PH service after successful authentications. Then password records can be securely migrated to their traditional salted hash state, ready for subscription to other PH services. Besides, PW-Hero achieves all existing desirable properties, such as comprehensive verifiability, rate limits against online attacks, and user privacy. We define PW-Hero as a suite of protocols that meet desirable properties and build a simple, secure, and efficient instance. Moreover, we develop a prototype implementation and evaluate its performance, establishing the practicality of our PW-Hero service.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125106442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Achieving Scalability and Load Balance across Blockchain Shards for State Sharding
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00034
Canlin Li, Huawei Huang, Yetong Zhao, Xiaowen Peng, Ruijie Yang, Zibin Zheng, Song Guo
The sharding technique is viewed as the most promising solution to improving blockchain scalability. However, to implement a sharded blockchain, developers have to address two major challenges. The first is that the ratio of cross-shard transactions (TXs) across blockchain shards is very high, which significantly degrades the throughput of a blockchain. The second is that the workloads across blockchain shards are largely imbalanced: some shards have to handle an overwhelming number of TXs and are very likely to become congested. Facing these two challenges, the dilemma is that it is difficult to simultaneously guarantee a low cross-shard TX ratio and maintain workload balance across all shards. We believe that a fine-grained account-allocation strategy can resolve this dilemma. To this end, we first formulate the tradeoff between these two metrics as a network-partition problem. We then solve this problem using a community-aware account partition algorithm. Furthermore, we propose a sharding protocol, named Transformers, to apply the proposed algorithm in a sharded blockchain system. Finally, trace-driven evaluation results demonstrate that the proposed protocol outperforms other baselines in terms of throughput, latency, cross-shard TX ratio, and the queue size of the transaction pool.
{"title":"Achieving Scalability and Load Balance across Blockchain Shards for State Sharding","authors":"Canlin Li, Huawei Huang, Yetong Zhao, Xiaowen Peng, Ruijie Yang, Zibin Zheng, Song Guo","doi":"10.1109/SRDS55811.2022.00034","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00034","url":null,"abstract":"Sharding technique is viewed as the most promising solution to improving blockchain scalability. However, to implement a sharded blockchain, developers have to address two major challenges. The first challenge is that the ratio of cross-shard transactions (TXs) across blockchain shards is very high. This issue significantly degrades the throughput of a blockchain. The second challenge is that the workloads across blockchain shards are largely imbalanced. If workloads are imbalanced, some shards have to handle an overwhelming number of TXs and become congested very possibly. Facing these two challenges, a dilemma is that it is difficult to guarantee a low cross-shard TX ratio and maintain the workload balance across all shards, simultaneously. We believe that a fine-grained account-allocation strategy can address this dilemma. To this end, we first formulate the tradeoff between such two metrics as a network-partition problem. We then solve this problem using a community-aware account partition algorithm. Furthermore, we also propose a sharding protocol, named Transformers, to apply the proposed algorithm into the sharded blockchain system. Finally, trace-driven evaluation results demonstrate that the proposed protocol outperforms other baselines in terms of throughput, latency, cross-shard TX ratio, and the queue size of transaction pool.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125157638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Performance Study of Epoch-based Commit Protocols in Distributed OLTP Databases
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00026
Jack Waudby, P. Ezhilchelvan, I. Mitrani, J. Webber
Distributed OLTP systems execute the high-overhead two-phase commit (2PC) protocol at the end of every distributed transaction. Epoch-based commit proposes that 2PC be executed only once for all transactions processed within a time interval called an epoch. Increasing the epoch duration allows more transactions to be processed before the common 2PC; it thus reduces the 2PC overhead per transaction and increases throughput, but it also increases average transaction latency. What is therefore required is the ability to choose an epoch size that offers the desired trade-off between throughput and latency. To this end, we develop two analytical models that estimate throughput and average latency as functions of the epoch size, taking load and failure conditions into account. Simulations confirm their accuracy and effectiveness. We then present epoch-based multi-commit which, unlike epoch-based commit, seeks to avoid aborting all transactions when failures occur, and performs identically when failures do not occur. Our performance study identifies workload factors that make it more effective in preventing transaction aborts and concludes that the analytical models are equally useful in predicting its performance.
{"title":"A Performance Study of Epoch-based Commit Protocols in Distributed OLTP Databases","authors":"Jack Waudby, P. Ezhilchelvan, I. Mitrani, J. Webber","doi":"10.1109/SRDS55811.2022.00026","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00026","url":null,"abstract":"Distributed OLTP systems execute the high-overhead, two-phase commit (2PC) protocol at the end of every distributed transaction. Epoch-based commit proposes that 2PC be executed only once for all transactions processed within a time interval called an epoch. Increasing epoch duration allows more transactions to be processed before the common 2PC. It thus reduces 2PC overhead per transaction, increases throughput but also increases average transaction latency. Therefore, required is the ability to choose the right epoch size that offers the desired trade-off between throughput and latency. To this end, we develop two analytical models to estimate throughput and average latency in terms of epoch size taking into account load and failure conditions. Simulations affirm their accuracy and effectiveness. We then present epoch-based multi-commit which, unlike epoch-based commit, seeks to avoid all transactions being aborted when failures occur, and also performs identically when failures do not occur. Our performance study identifies workload factors that make it more effective in preventing transaction aborts and concludes that the analytical models can be equally useful in predicting its performance as well.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130951075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Never Too Late: Tracing and Mitigating Backdoor Attacks in Federated Learning
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00017
Hui Zeng, Tongqing Zhou, Xinyi Wu, Zhiping Cai
The privacy-preserving nature of Federated Learning (FL) exposes this distributed learning paradigm to the planting of backdoors with locally corrupted data. We discover that FL backdoors, under a new on-off multi-shot attack form, are essentially stealthy against existing defenses built on model statistics and spectral analysis. First-hand observations of such attacks show that the backdoored models are indistinguishable from normal ones w.r.t. both low-level and high-level representations. We thus emphasize that a critical remedy for this stealthiness, if not the only one, is reactive tracing and posterior mitigation. We then propose a three-step remedy framework that exploits the temporal and inferential correlations of models on a sample trapped from an attack. In particular, we use shift ensemble detection and co-occurrence analysis for adversary identification, and repair the model by removing malicious ingredients under a theoretical error guarantee. Extensive experiments on various backdoor settings demonstrate that our framework achieves identification accuracy of ∼80% for attack rounds and ∼50% for attackers, which is ∼28.76% better than existing proactive defenses. Meanwhile, it successfully eliminates the influence of backdoors with only a 5%∼6% performance drop.
{"title":"Never Too Late: Tracing and Mitigating Backdoor Attacks in Federated Learning","authors":"Hui Zeng, Tongqing Zhou, Xinyi Wu, Zhiping Cai","doi":"10.1109/SRDS55811.2022.00017","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00017","url":null,"abstract":"The privacy-preserving nature of Federated Learning (FL) exposes such a distributed learning paradigm to the planting of backdoors with locally corrupted data. We discover that FL backdoors, under a new on-off multi-shot attack form, are essentially stealthy against existing defenses that are built on model statistics and spectral analysis. First-hand observations of such attacks show that the backdoored models are indistinguishable from normal ones w.r.t. both low-level and high-level representations. We thus emphasize that a critical redemption, if not the only, for the tricky stealthiness is reactive tracing and posterior mitigation. A three-step remedy framework is then proposed by exploring the temporal and inferential correlations of models on a trapped sample from an attack. In particular, we use shift ensemble detection and co-occurrence analysis for adversary identification, and repair the model via malicious ingredients removal under theoretical error guarantee. Extensive experiments on various backdoor settings demonstrate that our framework can achieve accuracy on attack round identification of ∼80% and on attackers of ∼50%, which are ∼28.76% better than existing proactive defenses. Meanwhile, it can successfully eliminate the influence of backdoors with only a 5%∼6% performance drop.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133364151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00032
Zhinan Cheng, Shujie Han, P. Lee, X. Li, Jiongzhou Liu, Zhan Li
Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures in production data centers. However, little is known about the correlation between DRAM errors and server failures from state-of-the-art field studies on DRAM error measurement. To fill this void, we present an in-depth data-driven correlative analysis between DRAM errors and server failures, with the primary goal of predicting server failures based on DRAM error characterization and hence enabling proactive reliability maintenance for production data centers. Our analysis is based on an eight-month dataset collected from over three million memory modules in the production data centers at Alibaba. We find that for most server failures, correctable DRAM errors manifest only within a short time before the failure happens, implying that server failure prediction should be conducted regularly at short time intervals for accurate prediction. We also study how various factors (including component failures in the memory subsystem, DRAM configurations, and types of correctable DRAM errors) impact server failures. Furthermore, we design a machine-learning-based server failure prediction workflow and demonstrate the feasibility of server failure prediction based on DRAM error characterization. In total, we report 14 findings from our measurement and prediction studies.
{"title":"An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers","authors":"Zhinan Cheng, Shujie Han, P. Lee, X. Li, Jiongzhou Liu, Zhan Li","doi":"10.1109/SRDS55811.2022.00032","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00032","url":null,"abstract":"Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures in production data centers. However, little is known about the correlation between DRAM errors and server failures in state-of-the-art field studies on DRAM error measurement. To fill this void, we present an in-depth data-driven correlative analysis between DRAM errors and server failures, with the primary goal of predicting server failures based on DRAM error characterization and hence enabling proactive reliability maintenance for production data centers. Our analysis is based on an eight-month dataset collected from over three million memory modules in the production data centers at Alibaba. We find that the correctable DRAM errors of most server failures only manifest within a short time before the failures happen, implying that server failure prediction should be conducted regularly at short time intervals for accurate prediction. We also study various impacting factors (including component failures in the memory subsystem, DRAM configurations, types of correctable DRAM errors) on server failures. Furthermore, we design a machine-learning-based server failure prediction workflow and demonstrate the feasibility of server failure prediction based on DRAM error characterization. To this end, we report 14 findings from our measurement and prediction studies.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115228099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
XHR-Code: An Efficient Wide Stripe Erasure Code to Reduce Cross-Rack Overhead in Cloud Storage Systems
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00033
Guofeng Yang, Huangzhen Xue, Yunfei Gu, Chentao Wu, J. Li, Minyi Guo, Shiyi Li, Xin Xie, Yuanyuan Dong, Yafei Zhao
Wide-stripe erasure codes (ECs) are becoming popular nowadays because they achieve low monetary cost and provide high reliability for cold data. Generally, wide-stripe erasure codes can be generated by extending traditional erasure codes to a large stripe size, or by designing new codes. However, although wide-stripe erasure codes can decrease the storage cost significantly, the reconstruction of lost data is extraordinarily slow, which stems primarily from high cross-rack overhead: a large number of racks participate in the reconstruction of the lost data, which results in heavy cross-rack traffic. To address these problems, we propose a novel erasure code called XOR-Hitchhiker-RS (XHR) code, which decreases the cross-rack overhead while still maintaining low storage cost. The key idea of XHR is to utilize a triple-dimensional framework that places more chunks within racks and reduces global repair triggers. To demonstrate the effectiveness of XHR-Code, we provide mathematical analysis and conduct comprehensive experiments. The results show that, compared to state-of-the-art solutions such as ECWide under various failure conditions, XHR can effectively reduce cross-rack repair traffic and shorten the repair time by up to 36.50%.
{"title":"XHR-Code: An Efficient Wide Stripe Erasure Code to Reduce Cross-Rack Overhead in Cloud Storage Systems","authors":"Guofeng Yang, Huangzhen Xue, Yunfei Gu, Chentao Wu, J. Li, Minyi Guo, Shiyi Li, Xin Xie, Yuanyuan Dong, Yafei Zhao","doi":"10.1109/SRDS55811.2022.00033","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00033","url":null,"abstract":"Nowadays wide stripe erasure codes (ECs) become popular as they can achieve low monetary cost and provide high reliability for cold data. Generally, wide stripe erasure codes can be generated by extending traditional erasure codes with a large stripe size, or designing new codes. However, although wide stripe erasure codes can decrease the storage cost significantly, the construction of lost data is extraordinary slow, which stems primarily from high cross-rack overhead. It is because a large number of racks participate in the construction of the lost data, which results in high cross-rack traffic. To address the above problems, we propose a novel erasure code called XOR-Hitchhiker-RS (XHR) code, to decrease the cross-rack overhead and still maintain low storage cost. The key idea of XHR is that it utilizes a triple dimensional framework to place more chunks within racks and reduce global repair triggers. To demonstrate the effectiveness of XHR-Code, we provide mathematical analysis and conduct comprehensive experiments. The results show that, compared to the state-of-the-art solutions such as ECWide under various failure conditions, XHR can effectively reduce cross-rack repair traffic and the repair time by up to 36.50%.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129873067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-Time Byzantine Resilience for Power Grid Substations
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00028
Sahiti Bommareddy, Daniel Qian, C. Bonebrake, P. Skare, Y. Amir
In a world of increasing cyber threats, a compromised protective relay can put power grid resilience at risk by irreparably damaging costly power assets or by causing significant disruptions. We present the first architecture and protocols for the substation that ensure correct protective relay operation in the face of successful relay intrusions and network attacks, while meeting the required latency constraint of a quarter power cycle (4.167 ms at 60 Hz). Our architecture supports other rigid requirements, including continuous availability over a long system lifetime and seamless substation integration. We evaluate our implementation under a range of fault-free and faulty operating conditions, and provide deployment tradeoffs.
{"title":"Real-Time Byzantine Resilience for Power Grid Substations","authors":"Sahiti Bommareddy, Daniel Qian, C. Bonebrake, P. Skare, Y. Amir","doi":"10.1109/SRDS55811.2022.00028","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00028","url":null,"abstract":"In the world of increasing cyber threats, a compromised protective relay can put power grid resilience at risk by irreparably damaging costly power assets or by causing significant disruptions. We present the first architecture and protocols for the substation that ensure correct protective relay operation in the face of successful relay intrusions and network attacks while meeting the required latency constraint of a quarter power cycle (4.167ms). Our architecture supports other rigid requirements, including continuous availability over a long system lifetime and seamless substation integration. We evaluate our implementation in a range of fault-free and faulty operation conditions, and provide deployment tradeoffs.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127796057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAG-based Task Orchestration for Edge Computing
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00013
Xiang Li, Mustafa Abdallah, Shikhar Suryavansh, M. Chiang, Kwang Taik Kim, S. Bagchi
Edge computing promises to exploit underlying computation resources closer to users to help run latency-sensitive applications such as augmented reality and video analytics. However, one key missing piece has been how to incorporate personally owned, unmanaged devices into a usable edge computing system. The primary challenges arise from the heterogeneity, lack of interference management, and unpredictable availability of such devices. In this paper, we propose IBDASH, an orchestration framework that schedules application tasks on an edge system comprising a mix of commercial and personal edge devices. IBDASH targets reducing both the end-to-end execution latency and the probability of failure for applications with dependencies among tasks, captured by directed acyclic graphs (DAGs). IBDASH takes the memory constraints of each edge device and the network bandwidth into consideration. To assess the effectiveness of IBDASH, we run real application tasks on real edge devices with widely varying capabilities, and feed these measurements into a simulator that runs IBDASH at scale. Compared to three state-of-the-art edge orchestration schemes and two intuitive baselines, IBDASH reduces the end-to-end latency and the probability of failure by 14% and 41% on average, respectively. The main takeaway from our work is that it is feasible to combine personal and commercial devices into a usable edge computing platform, one that delivers low and predictable latency and high availability.
{"title":"DAG-based Task Orchestration for Edge Computing","authors":"Xiang Li, Mustafa Abdallah, Shikhar Suryavansh, M. Chiang, Kwang Taik Kim, S. Bagchi","doi":"10.1109/SRDS55811.2022.00013","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00013","url":null,"abstract":"Edge computing promises to exploit underlying computation resources closer to users to help run latency-sensitive applications such as augmented reality and video analytics. However, one key missing piece has been how to incorporate personally owned, unmanaged devices into a usable edge computing system. The primary challenges arise due to the heterogeneity, lack of interference management, and unpredictable availability of such devices. In this paper we propose an orchestration framework IBDASH, which orchestrates application tasks on an edge system that comprises a mix of commercial and personal edge devices. IBDASH targets reducing both end-to-end latency of execution and probability of failure for applications that have dependency among tasks, captured by directed acyclic graphs (DAGs). IBDASH takes memory constraints of each edge device and network bandwidth into consideration. To assess the effectiveness of IBDASH, we run real application tasks on real edge devices with widely varying capabilities. We feed these measurements into a simulator that runs IBDASH at scale. Compared to three state-of-the-art edge orchestration schemes and two intuitive baselines, IBDASH reduces the end-to-end latency and probability of failure, by 14% and 41% on average respectively. The main takeaway from our work is that it is feasible to combine personal and commercial devices into a usable edge computing platform, one that delivers low and predictable latency and high availability.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133420767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FWC: Fitting Weight Compression Method for Reducing Communication Traffic for Federated Learning
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00025
Hao Jiang, Kedong Yan, Chanying Huang, Qianmu Li, Shan Xiao
Federated learning enables local nodes to jointly train a global model by uploading only training updates to the parameter server, without exchanging private data. However, as the complexity of the federated learning task increases, the communication volume of the training process becomes extremely large, and this huge communication traffic becomes a serious bottleneck in current federated learning applications. Existing methods reduce communication overhead along two dimensions: the number of communication rounds and the traffic per communication. However, these methods usually lead to higher consumption of computing resources or a decrease in model accuracy. To handle these problems, this paper proposes FWC, a data-fitting-based weight compression algorithm that consists of four sequential stages (sparsification, polynomial fitting, encoding, and reconstruction) and two mechanisms (warm-up and accumulation). In particular, the warm-up mechanism addresses the problem of slow convergence in the early training period. Experimental results on models of different scales show that FWC provides more than 600x traffic compression at the cost of only millisecond-level computation time and less than 1% accuracy loss.
{"title":"FWC: Fitting Weight Compression Method for Reducing Communication Traffic for Federated Learning","authors":"Hao Jiang, Kedong Yan, Chanying Huang, Qianmu Li, Shan Xiao","doi":"10.1109/SRDS55811.2022.00025","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00025","url":null,"abstract":"Federated learning enables local nodes to train a global model together by uploading only training updates to the parameter server without exchanging private data. However, as the complexity of the federated learning task increases, the communication volume of the training process becomes extremely large, hence the huge communication traffic becomes a serious bottleneck in current federated learning application. Existing methods reduce communication overhead from two aspects, the number of communications and the traffic per communication. But these methods usually lead to more consumption of computing resources or a decrease in model accuracy. To handle these problems, this paper proposes a data fitting based weight compression algorithm, FWC, which includes four sequential stages: sparsification, polynomial fitting, encoding, reconstruction and two mechanism: warm-up and accumulation. In particular, the warm-up mechanism can well address the problem of slow convergence in early training period. Experimental results on models with different scales show that FWC is able to provide more than 600x traffic compression at the cost of only millisecond-level computational time cost and less than 1% accuracy loss.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"15 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114022312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}