Reliable Password Hardening Service with Opt-Out
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00031
Chunfu Jia, Shaoqiang Wu, Ding Wang
As the most dominant authentication mechanism, password-based authentication suffers from catastrophic offline password guessing attacks once the authentication server is compromised and the password database is leaked. The password hardening (PH) service, an external/third-party cryptographic service, has recently been proposed to strengthen password storage and reduce the damage of authentication server compromise. However, all existing schemes are unreliable in that they overlook an important restorable property: PH service opt-out. In existing PH schemes, once the authentication server has subscribed to a PH service, it must adopt this service forever, even if it wants to stop the external/third-party PH service and restore its original password storage (or subscribe to another PH service). To fill the gap, we propose a new PH service called PW-Hero that equips the PH service with an option to terminate its use (i.e., opt-out). In PW-Hero, password authentication is strengthened against offline attacks by adding external secret spices to password records. With the opt-out property, authentication servers can proactively request to end the PH service after successful authentications; password records can then be securely migrated back to their traditional salted-hash state, ready for subscription to other PH services. In addition, PW-Hero achieves all existing desirable properties, such as comprehensive verifiability, rate limiting against online attacks, and user privacy. We define PW-Hero as a suite of protocols that meet these desirable properties and build a simple, secure, and efficient instance. Moreover, we develop a prototype implementation and evaluate its performance, establishing the practicality of our PW-Hero service.
{"title":"Reliable Password Hardening Service with Opt-Out","authors":"Chunfu Jia, Shaoqiang Wu, Ding Wang","doi":"10.1109/SRDS55811.2022.00031","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00031","url":null,"abstract":"As the most dominant authentication mechanism, password-based authentication suffers catastrophic offline password guessing attacks once the authentication server is compromised and the password database is leaked. Password hardening (PH) service, an external/third-party crypto service, has been recently proposed to strengthen password storage and reduce the damage of authentication server compromise. However, all existing schemes are unreliable in that they overlook the important restorable property: PH service opt-out. In existing PH schemes, once the authentication server has subscribed to a PH service, it must adopt this service forever, even if it wants to stop the external/third-party PH service and restore its original password storage (or subscribe to another PH service). To fill the gap, we propose a new PH service called PW-Hero that equips its PH service with an option to terminate its use (i.e., opt-out). In PW-Hero, password authentication is strengthened against offline attacks by adding external secret spices to password records. With the opt-out property, authentication servers can proactively request to end the PH service after successful authentications. Then password records can be securely migrated to their traditional salted hash state, ready for subscription to other PH services. Besides, PW-Hero achieves all existing desirable properties, such as comprehensive verifiability, rate limits against online attacks, and user privacy. We define PW-Hero as a suite of protocols that meet desirable properties and build a simple, secure, and efficient instance. Moreover, we develop a prototype implementation and evaluate its performance, establishing the practicality of our PW-Hero service.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125106442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Achieving Scalability and Load Balance across Blockchain Shards for State Sharding
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00034
Canlin Li, Huawei Huang, Yetong Zhao, Xiaowen Peng, Ruijie Yang, Zibin Zheng, Song Guo
The sharding technique is viewed as the most promising solution to improving blockchain scalability. However, to implement a sharded blockchain, developers have to address two major challenges. The first is that the ratio of cross-shard transactions (TXs) across blockchain shards is very high, which significantly degrades the throughput of a blockchain. The second is that the workloads across blockchain shards are largely imbalanced: some shards have to handle an overwhelming number of TXs and are very likely to become congested. Facing these two challenges, the dilemma is that it is difficult to simultaneously guarantee a low cross-shard TX ratio and maintain workload balance across all shards. We believe that a fine-grained account-allocation strategy can resolve this dilemma. To this end, we first formulate the tradeoff between these two metrics as a network-partition problem. We then solve this problem using a community-aware account partition algorithm. Furthermore, we propose a sharding protocol, named Transformers, to apply the proposed algorithm in a sharded blockchain system. Finally, trace-driven evaluation results demonstrate that the proposed protocol outperforms other baselines in terms of throughput, latency, cross-shard TX ratio, and the queue size of the transaction pool.
{"title":"Achieving Scalability and Load Balance across Blockchain Shards for State Sharding","authors":"Canlin Li, Huawei Huang, Yetong Zhao, Xiaowen Peng, Ruijie Yang, Zibin Zheng, Song Guo","doi":"10.1109/SRDS55811.2022.00034","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00034","url":null,"abstract":"Sharding technique is viewed as the most promising solution to improving blockchain scalability. However, to implement a sharded blockchain, developers have to address two major challenges. The first challenge is that the ratio of cross-shard transactions (TXs) across blockchain shards is very high. This issue significantly degrades the throughput of a blockchain. The second challenge is that the workloads across blockchain shards are largely imbalanced. If workloads are imbalanced, some shards have to handle an overwhelming number of TXs and become congested very possibly. Facing these two challenges, a dilemma is that it is difficult to guarantee a low cross-shard TX ratio and maintain the workload balance across all shards, simultaneously. We believe that a fine-grained account-allocation strategy can address this dilemma. To this end, we first formulate the tradeoff between such two metrics as a network-partition problem. We then solve this problem using a community-aware account partition algorithm. Furthermore, we also propose a sharding protocol, named Transformers, to apply the proposed algorithm into the sharded blockchain system. Finally, trace-driven evaluation results demonstrate that the proposed protocol outperforms other baselines in terms of throughput, latency, cross-shard TX ratio, and the queue size of transaction pool.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125157638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Performance Study of Epoch-based Commit Protocols in Distributed OLTP Databases
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00026
Jack Waudby, P. Ezhilchelvan, I. Mitrani, J. Webber
Distributed OLTP systems execute the high-overhead two-phase commit (2PC) protocol at the end of every distributed transaction. Epoch-based commit proposes that 2PC be executed only once for all transactions processed within a time interval called an epoch. Increasing the epoch duration allows more transactions to be processed before the common 2PC; it thus reduces the 2PC overhead per transaction and increases throughput, but it also increases average transaction latency. What is therefore required is the ability to choose an epoch size that offers the desired trade-off between throughput and latency. To this end, we develop two analytical models that estimate throughput and average latency as functions of the epoch size, taking load and failure conditions into account. Simulations confirm their accuracy and effectiveness. We then present epoch-based multi-commit which, unlike epoch-based commit, seeks to avoid aborting all transactions when failures occur, and performs identically when failures do not occur. Our performance study identifies workload factors that make it more effective in preventing transaction aborts and concludes that the analytical models are equally useful in predicting its performance.
{"title":"A Performance Study of Epoch-based Commit Protocols in Distributed OLTP Databases","authors":"Jack Waudby, P. Ezhilchelvan, I. Mitrani, J. Webber","doi":"10.1109/SRDS55811.2022.00026","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00026","url":null,"abstract":"Distributed OLTP systems execute the high-overhead, two-phase commit (2PC) protocol at the end of every distributed transaction. Epoch-based commit proposes that 2PC be executed only once for all transactions processed within a time interval called an epoch. Increasing epoch duration allows more transactions to be processed before the common 2PC. It thus reduces 2PC overhead per transaction, increases throughput but also increases average transaction latency. Therefore, required is the ability to choose the right epoch size that offers the desired trade-off between throughput and latency. To this end, we develop two analytical models to estimate throughput and average latency in terms of epoch size taking into account load and failure conditions. Simulations affirm their accuracy and effectiveness. We then present epoch-based multi-commit which, unlike epoch-based commit, seeks to avoid all transactions being aborted when failures occur, and also performs identically when failures do not occur. Our performance study identifies workload factors that make it more effective in preventing transaction aborts and concludes that the analytical models can be equally useful in predicting its performance as well.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130951075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Never Too Late: Tracing and Mitigating Backdoor Attacks in Federated Learning
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00017
Hui Zeng, Tongqing Zhou, Xinyi Wu, Zhiping Cai
The privacy-preserving nature of Federated Learning (FL) exposes this distributed learning paradigm to the planting of backdoors with locally corrupted data. We discover that FL backdoors, under a new on-off multi-shot attack form, are essentially stealthy against existing defenses built on model statistics and spectral analysis. First-hand observations of such attacks show that the backdoored models are indistinguishable from normal ones w.r.t. both low-level and high-level representations. We thus emphasize that a critical remedy for this stealthiness, if not the only one, is reactive tracing and posterior mitigation. We then propose a three-step remedy framework that exploits the temporal and inferential correlations of models on a sample trapped from an attack. In particular, we use shift ensemble detection and co-occurrence analysis for adversary identification, and repair the model by removing malicious ingredients under a theoretical error guarantee. Extensive experiments on various backdoor settings demonstrate that our framework achieves identification accuracy of ∼80% for attack rounds and ∼50% for attackers, which is ∼28.76% better than existing proactive defenses. Meanwhile, it successfully eliminates the influence of backdoors with only a 5%∼6% performance drop.
{"title":"Never Too Late: Tracing and Mitigating Backdoor Attacks in Federated Learning","authors":"Hui Zeng, Tongqing Zhou, Xinyi Wu, Zhiping Cai","doi":"10.1109/SRDS55811.2022.00017","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00017","url":null,"abstract":"The privacy-preserving nature of Federated Learning (FL) exposes such a distributed learning paradigm to the planting of backdoors with locally corrupted data. We discover that FL backdoors, under a new on-off multi-shot attack form, are essentially stealthy against existing defenses that are built on model statistics and spectral analysis. First-hand observations of such attacks show that the backdoored models are indistinguishable from normal ones w.r.t. both low-level and high-level representations. We thus emphasize that a critical redemption, if not the only, for the tricky stealthiness is reactive tracing and posterior mitigation. A three-step remedy framework is then proposed by exploring the temporal and inferential correlations of models on a trapped sample from an attack. In particular, we use shift ensemble detection and co-occurrence analysis for adversary identification, and repair the model via malicious ingredients removal under theoretical error guarantee. Extensive experiments on various backdoor settings demonstrate that our framework can achieve accuracy on attack round identification of ∼80% and on attackers of ∼50%, which are ∼28.76% better than existing proactive defenses. Meanwhile, it can successfully eliminate the influence of backdoors with only a 5%∼6% performance drop.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133364151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00032
Zhinan Cheng, Shujie Han, P. Lee, X. Li, Jiongzhou Liu, Zhan Li
Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures in production data centers. However, little is known about the correlation between DRAM errors and server failures from state-of-the-art field studies on DRAM error measurement. To fill this void, we present an in-depth data-driven correlative analysis between DRAM errors and server failures, with the primary goal of predicting server failures based on DRAM error characterization and hence enabling proactive reliability maintenance for production data centers. Our analysis is based on an eight-month dataset collected from over three million memory modules in the production data centers at Alibaba. We find that for most server failures, correctable DRAM errors manifest only within a short time before the failure happens, implying that server failure prediction should be conducted regularly at short time intervals for accurate prediction. We also study how various factors (including component failures in the memory subsystem, DRAM configurations, and types of correctable DRAM errors) impact server failures. Furthermore, we design a machine-learning-based server failure prediction workflow and demonstrate the feasibility of server failure prediction based on DRAM error characterization. In total, we report 14 findings from our measurement and prediction studies.
{"title":"An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers","authors":"Zhinan Cheng, Shujie Han, P. Lee, X. Li, Jiongzhou Liu, Zhan Li","doi":"10.1109/SRDS55811.2022.00032","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00032","url":null,"abstract":"Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures in production data centers. However, little is known about the correlation between DRAM errors and server failures in state-of-the-art field studies on DRAM error measurement. To fill this void, we present an in-depth data-driven correlative analysis between DRAM errors and server failures, with the primary goal of predicting server failures based on DRAM error characterization and hence enabling proactive reliability maintenance for production data centers. Our analysis is based on an eight-month dataset collected from over three million memory modules in the production data centers at Alibaba. We find that the correctable DRAM errors of most server failures only manifest within a short time before the failures happen, implying that server failure prediction should be conducted regularly at short time intervals for accurate prediction. We also study various impacting factors (including component failures in the memory subsystem, DRAM configurations, types of correctable DRAM errors) on server failures. Furthermore, we design a machine-learning-based server failure prediction workflow and demonstrate the feasibility of server failure prediction based on DRAM error characterization. To this end, we report 14 findings from our measurement and prediction studies.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115228099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
XHR-Code: An Efficient Wide Stripe Erasure Code to Reduce Cross-Rack Overhead in Cloud Storage Systems
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00033
Guofeng Yang, Huangzhen Xue, Yunfei Gu, Chentao Wu, J. Li, Minyi Guo, Shiyi Li, Xin Xie, Yuanyuan Dong, Yafei Zhao
Wide-stripe erasure codes (ECs) are becoming popular nowadays because they achieve low monetary cost and provide high reliability for cold data. Generally, wide-stripe erasure codes can be generated by extending traditional erasure codes to a large stripe size, or by designing new codes. However, although wide-stripe erasure codes can decrease the storage cost significantly, the reconstruction of lost data is extraordinarily slow, which stems primarily from high cross-rack overhead: a large number of racks participate in the reconstruction of the lost data, which results in heavy cross-rack traffic. To address these problems, we propose a novel erasure code called XOR-Hitchhiker-RS (XHR) code, which decreases the cross-rack overhead while still maintaining low storage cost. The key idea of XHR is to utilize a triple-dimensional framework that places more chunks within racks and reduces global repair triggers. To demonstrate the effectiveness of XHR-Code, we provide mathematical analysis and conduct comprehensive experiments. The results show that, compared to state-of-the-art solutions such as ECWide under various failure conditions, XHR can effectively reduce cross-rack repair traffic and shorten the repair time by up to 36.50%.
{"title":"XHR-Code: An Efficient Wide Stripe Erasure Code to Reduce Cross-Rack Overhead in Cloud Storage Systems","authors":"Guofeng Yang, Huangzhen Xue, Yunfei Gu, Chentao Wu, J. Li, Minyi Guo, Shiyi Li, Xin Xie, Yuanyuan Dong, Yafei Zhao","doi":"10.1109/SRDS55811.2022.00033","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00033","url":null,"abstract":"Nowadays wide stripe erasure codes (ECs) become popular as they can achieve low monetary cost and provide high reliability for cold data. Generally, wide stripe erasure codes can be generated by extending traditional erasure codes with a large stripe size, or designing new codes. However, although wide stripe erasure codes can decrease the storage cost significantly, the construction of lost data is extraordinary slow, which stems primarily from high cross-rack overhead. It is because a large number of racks participate in the construction of the lost data, which results in high cross-rack traffic. To address the above problems, we propose a novel erasure code called XOR-Hitchhiker-RS (XHR) code, to decrease the cross-rack overhead and still maintain low storage cost. The key idea of XHR is that it utilizes a triple dimensional framework to place more chunks within racks and reduce global repair triggers. To demonstrate the effectiveness of XHR-Code, we provide mathematical analysis and conduct comprehensive experiments. The results show that, compared to the state-of-the-art solutions such as ECWide under various failure conditions, XHR can effectively reduce cross-rack repair traffic and the repair time by up to 36.50%.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129873067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-Time Byzantine Resilience for Power Grid Substations
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00028
Sahiti Bommareddy, Daniel Qian, C. Bonebrake, P. Skare, Y. Amir
In a world of increasing cyber threats, a compromised protective relay can put power grid resilience at risk by irreparably damaging costly power assets or by causing significant disruptions. We present the first architecture and protocols for the substation that ensure correct protective relay operation in the face of successful relay intrusions and network attacks, while meeting the required latency constraint of a quarter power cycle (4.167 ms at 60 Hz). Our architecture supports other rigid requirements, including continuous availability over a long system lifetime and seamless substation integration. We evaluate our implementation under a range of fault-free and faulty operating conditions, and provide deployment tradeoffs.
{"title":"Real-Time Byzantine Resilience for Power Grid Substations","authors":"Sahiti Bommareddy, Daniel Qian, C. Bonebrake, P. Skare, Y. Amir","doi":"10.1109/SRDS55811.2022.00028","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00028","url":null,"abstract":"In the world of increasing cyber threats, a compromised protective relay can put power grid resilience at risk by irreparably damaging costly power assets or by causing significant disruptions. We present the first architecture and protocols for the substation that ensure correct protective relay operation in the face of successful relay intrusions and network attacks while meeting the required latency constraint of a quarter power cycle (4.167ms). Our architecture supports other rigid requirements, including continuous availability over a long system lifetime and seamless substation integration. We evaluate our implementation in a range of fault-free and faulty operation conditions, and provide deployment tradeoffs.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127796057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAG-based Task Orchestration for Edge Computing
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00013
Xiang Li, Mustafa Abdallah, Shikhar Suryavansh, M. Chiang, Kwang Taik Kim, S. Bagchi
Edge computing promises to exploit underlying computation resources closer to users to help run latency-sensitive applications such as augmented reality and video analytics. However, one key missing piece has been how to incorporate personally owned, unmanaged devices into a usable edge computing system. The primary challenges arise from the heterogeneity, lack of interference management, and unpredictable availability of such devices. In this paper, we propose IBDASH, an orchestration framework that schedules application tasks on an edge system comprising a mix of commercial and personal edge devices. IBDASH targets reducing both the end-to-end execution latency and the probability of failure for applications with dependencies among tasks, captured by directed acyclic graphs (DAGs). IBDASH takes the memory constraints of each edge device and the network bandwidth into consideration. To assess the effectiveness of IBDASH, we run real application tasks on real edge devices with widely varying capabilities, and feed these measurements into a simulator that runs IBDASH at scale. Compared to three state-of-the-art edge orchestration schemes and two intuitive baselines, IBDASH reduces the end-to-end latency and the probability of failure by 14% and 41% on average, respectively. The main takeaway from our work is that it is feasible to combine personal and commercial devices into a usable edge computing platform, one that delivers low and predictable latency and high availability.
{"title":"DAG-based Task Orchestration for Edge Computing","authors":"Xiang Li, Mustafa Abdallah, Shikhar Suryavansh, M. Chiang, Kwang Taik Kim, S. Bagchi","doi":"10.1109/SRDS55811.2022.00013","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00013","url":null,"abstract":"Edge computing promises to exploit underlying computation resources closer to users to help run latency-sensitive applications such as augmented reality and video analytics. However, one key missing piece has been how to incorporate personally owned, unmanaged devices into a usable edge computing system. The primary challenges arise due to the heterogeneity, lack of interference management, and unpredictable availability of such devices. In this paper we propose an orchestration framework IBDASH, which orchestrates application tasks on an edge system that comprises a mix of commercial and personal edge devices. IBDASH targets reducing both end-to-end latency of execution and probability of failure for applications that have dependency among tasks, captured by directed acyclic graphs (DAGs). IBDASH takes memory constraints of each edge device and network bandwidth into consideration. To assess the effectiveness of IBDASH, we run real application tasks on real edge devices with widely varying capabilities. We feed these measurements into a simulator that runs IBDASH at scale. Compared to three state-of-the-art edge orchestration schemes and two intuitive baselines, IBDASH reduces the end-to-end latency and probability of failure, by 14% and 41% on average respectively. The main takeaway from our work is that it is feasible to combine personal and commercial devices into a usable edge computing platform, one that delivers low and predictable latency and high availability.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133420767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FWC: Fitting Weight Compression Method for Reducing Communication Traffic for Federated Learning
Pub Date: 2022-09-01 | DOI: 10.1109/SRDS55811.2022.00025
Hao Jiang, Kedong Yan, Chanying Huang, Qianmu Li, Shan Xiao
Federated learning enables local nodes to jointly train a global model by uploading only training updates to the parameter server, without exchanging private data. However, as the complexity of the federated learning task increases, the communication volume of the training process becomes extremely large, and this huge communication traffic becomes a serious bottleneck in current federated learning applications. Existing methods reduce communication overhead along two dimensions: the number of communication rounds and the traffic per communication. However, these methods usually lead to higher consumption of computing resources or a decrease in model accuracy. To handle these problems, this paper proposes FWC, a data-fitting-based weight compression algorithm that consists of four sequential stages (sparsification, polynomial fitting, encoding, and reconstruction) and two mechanisms (warm-up and accumulation). In particular, the warm-up mechanism addresses the problem of slow convergence in the early training period. Experimental results on models of different scales show that FWC provides more than 600x traffic compression at the cost of only millisecond-level computation time and less than 1% accuracy loss.
{"title":"FWC: Fitting Weight Compression Method for Reducing Communication Traffic for Federated Learning","authors":"Hao Jiang, Kedong Yan, Chanying Huang, Qianmu Li, Shan Xiao","doi":"10.1109/SRDS55811.2022.00025","DOIUrl":"https://doi.org/10.1109/SRDS55811.2022.00025","url":null,"abstract":"Federated learning enables local nodes to train a global model together by uploading only training updates to the parameter server without exchanging private data. However, as the complexity of the federated learning task increases, the communication volume of the training process becomes extremely large, hence the huge communication traffic becomes a serious bottleneck in current federated learning application. Existing methods reduce communication overhead from two aspects, the number of communications and the traffic per communication. But these methods usually lead to more consumption of computing resources or a decrease in model accuracy. To handle these problems, this paper proposes a data fitting based weight compression algorithm, FWC, which includes four sequential stages: sparsification, polynomial fitting, encoding, reconstruction and two mechanism: warm-up and accumulation. In particular, the warm-up mechanism can well address the problem of slow convergence in early training period. Experimental results on models with different scales show that FWC is able to provide more than 600x traffic compression at the cost of only millisecond-level computational time cost and less than 1% accuracy loss.","PeriodicalId":143115,"journal":{"name":"2022 41st International Symposium on Reliable Distributed Systems (SRDS)","volume":"15 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114022312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}