An Inter-blockchain Escrow Approach for Fast Bitcoin Payment
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00148
Xinyu Lei, Tian Xie, Guan-Hua Tu, A. Liu
In recent years, Bitcoin (BTC) payments have become increasingly popular among retailers and service providers. A BTC transaction (tx) needs six confirmations (about one hour) to be validated, which makes it unsuitable for fast-pay scenarios. Theoretically, a shorter waiting period increases the success probability of a double-spending attack. To address this problem, we propose BTCFast, a scheme that supports fast BTC transactions. BTCFast is a novel, decentralized, escrow-based scheme built on top of programmable smart contract (PSC)-enabled blockchains (e.g., Ethereum, EOS). We develop a smart contract (PayJudger) that acts as a trusted payment judger and guarantees transaction fairness. In addition, we devise a proof-of-work (PoW)-based payment judgment mechanism for PayJudger to resolve BTC payment disputes. Our theoretical and experimental results show that BTCFast can reduce the waiting time to less than 1 second, with security comparable to the current approach (i.e., waiting for six confirmations) and no extra operation fee.
{"title":"An Inter-blockchain Escrow Approach for Fast Bitcoin Payment","authors":"Xinyu Lei, Tian Xie, Guan-Hua Tu, A. Liu","doi":"10.1109/ICDCS47774.2020.00148","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00148","url":null,"abstract":"In recent years, the Bitcoin (BTC) payment is increasingly popular in retailers and service providers. A BTC transaction (tx) needs six confirmations (one hour) to be validated, making it not suitable for fast-pay scenarios. Theoretically, a shorter waiting time period increases the success possibility of a double-spending attack. To address this problem, we propose BTCFast scheme to support fast BTC tx. BTCFast is a novel, decentralized, escrow-based scheme on top of the programmable smart contract (PSC)-enabled blockchains (e.g. Ethereum, EOS). We develop a smart contract (PayJudger) to work as a trusted payment judger, which guarantees the tx fairness. In addition, we devise a proof-of-work (PoW)-based payment judgment mechanism for PayJudger to resolve a BTC payment dispute. Our theoretical and experimental results show that BTCFast can reduce the waiting time to be less than 1 second with comparable security as the current approach (i.e., waiting for six confirmations) with no extra operation fee.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130948832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding WiFi Cross-Technology Interference Detection in the Real World
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00061
T. Pulkkinen, J. Nurminen, P. Nurmi
WiFi networks are increasingly subjected to cross-technology interference, with emerging IoT and even mobile communication solutions all crowding the 2.4 GHz ISM band where WiFi networks conventionally operate. Due to the diversity of interference sources, maintaining a high level of network performance is becoming increasingly difficult. Recently, deep learning-based interference detection has been proposed as a potentially powerful way to identify sources of interference and to provide feedback on how to mitigate their effects. The performance of such approaches has been shown to be impressive in controlled evaluations. However, little information exists on how they generalize to the complexity of everyday environments. In this paper, we contribute by conducting a comprehensive performance evaluation of deep learning-based interference detection. In our evaluation, we consider five orthogonal but complementary metrics: correctness, overfitting, robustness, efficiency, and interpretability. Our results show that, while deep learning indeed has excellent correctness (i.e., detection accuracy), it can be prone to noise in measurements (e.g., it struggles when transmission power is dynamically adjusted) and suffers from poor interpretability. Deep learning is also highly sensitive to the quality and quantity of training data, with performance decreasing rapidly when the training and testing measurements come from environments with different characteristics. To compensate for the weaknesses of deep learning, as our second contribution we propose a novel signal modeling approach for interference detection and compare it against deep learning. Our results demonstrate that, in terms of errors, there are some differences between the two approaches: signal modeling is better at identifying technologies that rely on frequency hopping or that have dynamic spectrum signatures, but suffers in other cases. Based on our results, we draw guidelines for improving interference detection performance.
{"title":"Understanding WiFi Cross-Technology Interference Detection in the Real World","authors":"T. Pulkkinen, J. Nurminen, P. Nurmi","doi":"10.1109/ICDCS47774.2020.00061","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00061","url":null,"abstract":"WiFi networks are increasingly subjected to cross-technology interference with emerging IoT and even mobile communication solutions all crowding the 2.4 GHz ISM band where WiFi networks conventionally operate. Due to the diversity of interference sources, maintaining high level of network performance is becoming increasing difficult. Recently, deep learning based interference detection has been proposed as a potentially powerful way to identify sources of interference and to provide feedback on how to mitigate their effects. The performance of such approaches has been shown to be impressive in controlled evaluations. However, little information exists on how they generalize to the complexity of everyday environments. In this paper, we contribute by conducting a comprehensive performance evaluation of deep learning based interference detection. In our evaluation, we consider five orthogonal but complementary metrics: correctness, overfitting, robustness, efficiency, and interpretability. Our results show that, while deep learning indeed has excellent correctness (i.e., detection accuracy), it can be prone to noise in measurements (e.g., struggle when transmission power is dynamically adjusted) and suffers from poor interpretability. Deep learning is also highly sensitive to the quality and quantity of training data, with performance decreasing rapidly when the training and testing measurements come from environments with different characteristics. To compensate for weaknesses of deep learning, as our second contribution we propose a novel signal modeling approach for interference detection and compare it against deep learning. Our results demonstrate that, in terms of errors, there are some differences across the two approaches, with signal modeling being better at identifying technologies that rely on frequency hopping or that have dynamic spectrum signatures but suffering in other cases. Based on our results, we draw guidelines for improving interference detection performance.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121802388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Framework for Automatic Failure Recovery in ICT Systems by Deep Reinforcement Learning
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00170
Hiroki Ikeuchi, Jiawen Ge, Yoichi Matsuo, Keishiro Watanabe
Because automatic recovery from failures is of great importance for the future operation of ICT systems, we propose a framework for learning a recovery policy using deep reinforcement learning. In our framework, while iteratively trying various recovery actions and observing system metrics in a target system, an agent autonomously learns the optimal recovery policy, which indicates what recovery action should be executed on the basis of observations. By using failure injection tools designed for Chaos Engineering, we can reproduce many types of failures in the target system, thereby making the agent learn a recovery policy applicable to various failures. Once the recovery policy is obtained, we can automate failure recovery by executing the recovery actions that the policy returns. Unlike most previous methods, our framework requires neither historical documentation of failure recovery nor modeling of system behavior. To verify the feasibility of the framework, we conducted an experiment in a container-based environment built on a Kubernetes cluster, demonstrating that training converges in a few days and that the obtained recovery policy can successfully recover from failures with a minimum number of recovery actions.
An Energy-Efficient Edge Offloading Scheme for UAV-Assisted Internet of Things
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00167
Minghui Dai, Zhou Su, Jiliang Li, Jian Zhou
With the ever-increasing capacities of the Internet of Things (IoT), unmanned aerial vehicle (UAV)-assisted IoT is becoming a promising paradigm for improving network connectivity, extending network coverage, and offloading computation. However, due to the limited battery lifetime and computing capacity of UAVs, designing an offloading scheme for UAVs presents a new challenge in IoT. Therefore, in this paper, an energy-efficient edge offloading scheme is proposed to improve the offloading efficiency of UAVs. First, based on the data transmission delay of UAVs and the computing delay of edge nodes, a matching scheme is designed to obtain the optimal matching between UAVs and edge nodes. Second, the energy-efficient offloading between UAVs and edge nodes is modeled as a bargaining game. Then, an incentive-based offloading strategy is developed to improve offloading efficiency. Finally, simulation results demonstrate that the proposed offloading scheme significantly improves offloading effectiveness compared with conventional schemes.
{"title":"An Energy-Efficient Edge Offloading Scheme for UAV-Assisted Internet of Things","authors":"Minghui Dai, Zhou Su, Jiliang Li, Jian Zhou","doi":"10.1109/ICDCS47774.2020.00167","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00167","url":null,"abstract":"As the ever-increasing capacities of internet of things (IoT), unmanned aerial vehicle (UAV)-assisted IoT becomes a promising paradigm for improving network connectivity, extending the coverage of network and computing offloading. However, due to the limitation of battery lifetime and computing capacities of UAVs, the offloading scheme for UAVs presents a new challenge in IoT. Therefore, in this paper, an energy-efficient edge offloading scheme is proposed to improve the offloading efficiency of UAVs. Firstly, based on the data transmission delay of UAVs and computing delay of edge nodes, the matching scheme is designed to obtain the optimal matching between UAVs and edge nodes. Secondly, the energy-efficient offloading scheme for UAVs and edge nodes is modeled as a bargaining game. Then, the offloading strategy based on incentive algorithm is developed to improve the offloading efficiency. Finally, the simulation results demonstrate that the proposed offloading scheme can significantly promote the effectiveness of offloading compared with the conventional schemes.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"7 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113932147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterizing Bottlenecks in Scheduling Microservices on Serverless Platforms
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00195
J. Gunasekaran, P. Thinakaran, N. Nachiappan, R. Kannan, M. Kandemir, C. Das
Datacenters are witnessing an increasing trend toward adopting microservice-based architectures for application design, in which an application is composed of a combination of different microservices. Typically, these applications are short-lived and are administered with strict Service Level Objective (SLO) requirements. Traditional virtual machine (VM) based provisioning for such applications not only suffers from long latency when provisioning resources (as VMs tend to take a few minutes to start up), but also places the additional overhead of server management and provisioning on users. This has led to the adoption of serverless functions, where applications are composed as functions and hosted in containers. However, state-of-the-art schedulers employed in serverless platforms tend to treat microservice-based applications like conventional monolithic black-box applications. To expose the resulting inefficiencies, we characterize the end-to-end life cycle of these microservice-based applications in this work. Our findings show that the applications suffer from poor scheduling of microservices due to reactive container provisioning during workload fluctuations, resulting in either SLO violations or colossal container over-provisioning, which in turn leads to poor resource utilization. We also find that there is an ample amount of slack available at each stage of application execution, which can potentially be leveraged to improve overall application performance.
{"title":"Characterizing Bottlenecks in Scheduling Microservices on Serverless Platforms","authors":"J. Gunasekaran, P. Thinakaran, N. Nachiappan, R. Kannan, M. Kandemir, C. Das","doi":"10.1109/ICDCS47774.2020.00195","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00195","url":null,"abstract":"Datacenters are witnessing an increasing trend in adopting microservice-based architecture for application design, which consists of a combination of different microservices. Typically these applications are short-lived and are administered with strict Service Level Objective (SLO) requirements. Traditional virtual machine (VM) based provisioning for such applications not only suffers from long latency when provisioning resources (as VMs tend to take a few minutes to start up), but also places an additional overhead of server management and provisioning on the users. This led to the adoption of serverless functions, where applications are composed as functions and hosted in containers. However, state-of-the-art schedulers employed in serverless platforms tend to look at microservice-based applications similar to conventional monolithic black-box applications. To detect all the inefficiencies, we characterize the end-to-end life cycle of these microservice-based applications in this work. Our findings show that the applications suffer from poor scheduling of microservices due to reactive container provisioning during workload fluctuations, thereby resulting in either in SLO violations or colossal container over-provisioning, in turn leading to poor resource utilization. We also find that there is an ample amount of slack available at each stage of application execution, which can potentially be leveraged to improve the overall application performance.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"175 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127673246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Objective Online Task Allocation in Spatial Crowdsourcing Systems
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00104
Ellen Mitsopoulou, Juliana Litou, V. Kalogeraki
In this work we aim to provide an efficient solution to the problem of online task allocation in spatial crowdsourcing systems. We focus on the objectives of platform utility maximization and worker utility maximization, yet the proposed scheme is generic enough to accommodate more objectives. The goal is to find an allocation of tasks to workers that maximizes the platform's profit and the reliability of the results, while simultaneously assigning tasks based on the users' interests to increase user engagement and hence the probability that users will complete the tasks on time. Our scheme works well in highly fluctuating environments where the tasks to be executed require that the workers meet certain criteria of expertise, availability, reliability, etc. Our detailed experimental evaluation illustrates the benefits and practicality of our approach and demonstrates that it outperforms its competitors.
{"title":"Multi-Objective Online Task Allocation in Spatial Crowdsourcing Systems","authors":"Ellen Mitsopoulou, Juliana Litou, V. Kalogeraki","doi":"10.1109/ICDCS47774.2020.00104","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00104","url":null,"abstract":"In this work we aim to provide an efficient solution to the problem of online task allocation in spatial crowdsourcing systems. We focus on the objectives of platform utility maximization and worker utility maximization, yet the proposed schema is generic enough to accommodate more objectives. The goal is to find an allocation of tasks to workers that maximizes the platform’s profit and reliability of the results, while simultaneously assigns tasks based on the users’ interests to increase user engagement and hence the probability that the users will complete the tasks on time. Our scheme works well in highly fluctuating environments where the tasks to be executed require that the workers meet certain criteria of expertise, availability, reliability, etc. Our detailed experimental evaluation illustrates the benefits and practicality of our approach and demonstrates that our approach outperforms its competitors.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127675962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-Aware Deep Model Compression for Edge Cloud Computing
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00101
Lingdong Wang, Liyao Xiang, Jiayu Xu, Jiaju Chen, Xing Zhao, Dixi Yao, Xinbing Wang, Baochun Li
While deep neural networks (DNNs) have led to a paradigm shift, their exorbitant computational requirements have always been a roadblock to deployment at the edge, for example on wearable devices and smartphones. Hybrid edge-cloud computational frameworks have therefore been proposed to transfer part of the computation to the cloud by naively partitioning DNN operations under the assumption of constant network conditions. However, real-world network state varies greatly depending on the context, and DNN partitioning alone has a limited strategy space. In this paper, we explore the structural flexibility of DNNs to fit the edge model to varying network contexts and different deployment platforms. Specifically, we design a reinforcement learning-based decision engine to search for model transformation strategies in response to a combined objective of model accuracy and computation latency. The engine generates a context-aware model tree so that the DNN can decide which model branch to switch to at runtime. Emulation and field experimental results show that our approach enjoys a 30%–50% latency reduction while retaining model accuracy.
Serverless Straggler Mitigation using Error-Correcting Codes
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00019
Vipul Gupta, Dominic Carrano, Yaoqing Yang, Vaishaal Shankar, T. Courtade, K. Ramchandran
Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase the end-to-end latency for distributed computation. We propose and implement simple yet principled approaches for straggler mitigation in serverless systems for matrix multiplication and evaluate them on several common applications from machine learning and high-performance computing. The proposed schemes are inspired by error-correcting codes and employ parallel encoding and decoding over the data stored in the cloud using serverless workers. This creates a fully distributed computing framework without using a master node to conduct encoding or decoding, which removes the computation, communication and storage bottleneck at the master. On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability. Through extensive experiments, we show that our scheme outperforms existing schemes such as speculative execution and other coding theoretic methods by at least 25%.
{"title":"Serverless Straggler Mitigation using Error-Correcting Codes","authors":"Vipul Gupta, Dominic Carrano, Yaoqing Yang, Vaishaal Shankar, T. Courtade, K. Ramchandran","doi":"10.1109/ICDCS47774.2020.00019","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00019","url":null,"abstract":"Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase the end-to-end latency for distributed computation. We propose and implement simple yet principled approaches for straggler mitigation in serverless systems for matrix multiplication and evaluate them on several common applications from machine learning and high-performance computing. The proposed schemes are inspired by error-correcting codes and employ parallel encoding and decoding over the data stored in the cloud using serverless workers. This creates a fully distributed computing framework without using a master node to conduct encoding or decoding, which removes the computation, communication and storage bottleneck at the master. On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability. Through extensive experiments, we show that our scheme outperforms existing schemes such as speculative execution and other coding theoretic methods by at least 25%.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"1961 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129363302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Communication-efficient k-Means for Edge-based Machine Learning
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00062
Hanlin Lu, T. He, Shiqiang Wang, Changchang Liu, M. Mahdavi, V. Narayanan, Kevin S. Chan, Stephen Pasteris
We consider the problem of computing the k-means centers for a large high-dimensional dataset in the context of edge-based machine learning, where data sources offload machine learning computation to nearby edge servers. k-Means computation is fundamental to many data analytics, and the capability of computing provably accurate k-means centers by leveraging the computation power of the edge servers, at a low communication and computation cost to the data sources, will greatly improve the performance of these analytics. We propose to let the data sources send small summaries, generated by joint dimensionality reduction (DR) and cardinality reduction (CR), to support approximate k-means computation at reduced complexity and communication cost. By analyzing the complexity, the communication cost, and the approximation error of k-means algorithms based on state-of-the-art DR/CR methods, we show that: (i) in the single-source case, it is possible to achieve a near-optimal approximation at a near-linear complexity and a constant communication cost, (ii) in the multiple-source case, it is possible to achieve similar performance at a logarithmic communication cost, and (iii) the order of applying DR and CR significantly affects the complexity and the communication cost. Our findings are validated through experiments based on real datasets.
{"title":"Communication-efficient k-Means for Edge-based Machine Learning","authors":"Hanlin Lu, T. He, Shiqiang Wang, Changchang Liu, M. Mahdavi, V. Narayanan, Kevin S. Chan, Stephen Pasteris","doi":"10.1109/ICDCS47774.2020.00062","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00062","url":null,"abstract":"We consider the problem of computing the k-means centers for a large high-dimensional dataset in the context of edge-based machine learning, where data sources offload machine learning computation to nearby edge servers. k-Means computation is fundamental to many data analytics, and the capability of computing provably accurate k-means centers by leveraging the computation power of the edge servers, at a low communication and computation cost to the data sources, will greatly improve the performance of these analytics. We propose to let the data sources send small summaries, generated by joint dimensionality reduction (DR) and cardinality reduction (CR), to support approximate k-means computation at reduced complexity and communication cost. By analyzing the complexity, the communication cost, and the approximation error of k-means algorithms based on state-of-the-art DR/CR methods, we show that: (i) in the single-source case, it is possible to achieve a near-optimal approximation at a near-linear complexity and a constant communication cost, (ii) in the multiple-source case, it is possible to achieve similar performance at a logarithmic communication cost, and (iii) the order of applying DR and CR significantly affects the complexity and the communication cost. Our findings are validated through experiments based on real datasets.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117350266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Refining Micro Services Placement over Multiple Kubernetes-orchestrated Clusters employing Resource Monitoring
Pub Date: 2020-11-01 | DOI: 10.1109/ICDCS47774.2020.00173
Seunghyun Lee, Seokho Son, Jungsu Han, JongWon Kim
In the cloud field, there is an increasing demand for globalized services and corresponding execution environments that overcome local limitations and selectively utilize optimal resources. Utilizing multi-cloud deployments and operations rather than a single cloud is an effective way to satisfy this increasing demand. In particular, we need to provide a cloud-native environment to organically support services based on a microservices architecture. In this paper, we propose a cloud-native workload profiling system with a Kubernetes-orchestrated multi-cluster configuration. The contributions of this paper are as follows. (i) We design operating software over multiple cloud-native clusters to select optimal resources through monitoring. (ii) To operate the multiple clusters through this design, we define and design specific general service workloads, and we implement these workloads as application software. (iii) To seek optimal resources, we deploy the general workloads and repeatedly monitor resource usage in detail. We calculate the resource variation by comparing initial resource usage with average resource usage after deploying the service workloads, and we analyze the resource monitoring results. We expect this methodology to find proper resources for each service workload type.
{"title":"Refining Micro Services Placement over Multiple Kubernetes-orchestrated Clusters employing Resource Monitoring","authors":"Seunghyun Lee, Seokho Son, Jungsu Han, JongWon Kim","doi":"10.1109/ICDCS47774.2020.00173","DOIUrl":"https://doi.org/10.1109/ICDCS47774.2020.00173","url":null,"abstract":"In the cloud field, there is an increasing demand for globalized services and corresponding execution environments that overcome local limitations and selectively utilize optimal resources. Utilizing multi-cloud deployments and operations rather than using a single cloud is an effective way to satisfy the increasing demand. In particular, we need to provide cloud-native environment to organically support services based on a microservices architecture. In this paper, we propose a cloud-native workload profiling system with Kubernetes-orchestrated multi-cluster configuration. The contributions of this paper are as follows. (i) We design the operating software over multiple cloud-native cluster to select optimal resources by monitoring. (ii) For operating the multiple clusters through the design, we define and design specific general service workloads. Also, we implement the workloads in application software (iii) To seek optimal resources, we deployed the general workloads and monitored resource usage repeatedly in detail. We calculate resource variation in comparison with initial resource usage and average resource usage after deploying the service workloads. Also, we analyze the resource monitoring result. We expect this methodology can find proper resources for service workload types.","PeriodicalId":158630,"journal":{"name":"2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115773452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}