Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00021
Dyconits: Scaling Minecraft-like Services through Dynamically Managed Inconsistency
Jesse Donkervliet, J. Cuijpers, A. Iosup
Gaming is one of the most popular and lucrative entertainment industries. Minecraft alone exceeds 130 million monthly active players and sells millions of licenses annually; it is also provided as a (paid) service. Minecraft, and thousands of games like it, each provide a Modifiable Virtual Environment (MVE). However, Minecraft-like games scale only through isolated instances that support at most a few hundred players in the same virtual world, preventing their large player base from actually gaming together. When operated as a service, even fewer players can game together. Existing techniques for managing data in distributed systems do not scale for such games: they either do not work for high-density areas (e.g., village centers or other places where the MVE is modified often), or can introduce an unbounded amount of inconsistency that lowers the quality of experience. In this work, we propose Dyconits, a middleware that allows games to scale by bounding inconsistency in MVEs, optimistically and dynamically. Dyconits lets game developers partition the game world and its objects offline into units, each with its own bounds. The Dyconits system controls the creation of dyconits and the management of their bounds dynamically, driven by policies. Importantly, the Dyconits system is thin and reuses the existing game codebase, in particular the network stack. To demonstrate and evaluate Dyconits in practice, we modify an existing, open-source, Minecraft-like game and evaluate its effectiveness through real-world experiments. Our approach supports up to 40% more concurrent players and reduces network bandwidth by up to 85%, with only minor modifications to the game and without increasing game latency.
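The abstract's core mechanism, bounded optimistic inconsistency per unit of the game world, can be illustrated with a minimal sketch. The class, bound types, and flush behavior below are our assumptions for illustration, not the paper's actual API; the grounded idea is only that updates are queued per dyconit and flushed through the game's existing network stack once a bound is reached.

```python
import time

class Dyconit:
    """One unit of the game world with its own inconsistency bounds (hypothetical sketch)."""

    def __init__(self, numerical_bound, staleness_bound_s):
        self.numerical_bound = numerical_bound      # max total "weight" of unsent updates
        self.staleness_bound_s = staleness_bound_s  # max age of the oldest unsent update
        self.queue = []                             # (enqueue_time, update, weight)

    def enqueue(self, update, weight, send):
        """Buffer an update optimistically; flush only when a bound is exceeded."""
        self.queue.append((time.monotonic(), update, weight))
        if self._bounds_exceeded():
            self.flush(send)

    def _bounds_exceeded(self):
        total_weight = sum(w for _, _, w in self.queue)
        oldest_age = time.monotonic() - self.queue[0][0]
        return (total_weight > self.numerical_bound
                or oldest_age > self.staleness_bound_s)

    def flush(self, send):
        """Drain the queue through the game's existing network stack."""
        for _, update, _ in self.queue:
            send(update)
        self.queue.clear()
```

A policy component, not shown, would tighten the bounds of dyconits near many players (e.g., village centers) and relax them elsewhere, which is where the bandwidth savings would come from.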
{"title":"Dyconits: Scaling Minecraft-like Services through Dynamically Managed Inconsistency","authors":"Jesse Donkervliet, J. Cuijpers, A. Iosup","doi":"10.1109/ICDCS51616.2021.00021","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00021","url":null,"abstract":"Gaming is one of the most popular and lucrative entertainment industries. Minecraft alone exceeds 130 million active monthly players and sells millions of licenses annually; it is also provided as a (paid) service. Minecraft, and thousands of others, provide each a Modifiable Virtual Environment (MVE). However, Minecraft-like games only scale using isolated instances that support at most a few hundred players in the same virtual world, thus preventing their large player-base from actually gaming together. When operating as a service, even fewer players can game together. Existing techniques for managing data in distributed systems do not scale for such games: they either do not work for high-density areas (e.g., village centers or other places where the MVE is often modified), or can introduce an unbounded amount of inconsistency that can lower the quality of experience. In this work, we propose Dyconits, a middleware that allows games to scale, by bounding inconsistency in MVEs, optimistically and dynamically. Dyconits allow game developers to partition offline the game-world and its objects into units, each with its own bounds. The Dyconits system controls, dynamically and policy-based, the creation of dyconits and the management of their bounds. Importantly, the Dyconits system is thin, and reuses the existing game codebase and in particular the network stack. To demonstrate and evaluate Dyconits in practice, we modify an existing, open-source, Minecraft-like game, and evaluate its effectiveness through real-world experiments. Our approach supports up to 40% more concurrent players and reduces network bandwidth by up to 85%, with only minor modifications to the game and without increasing game latency.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114551902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00041
A Two-Stage Heavy Hitter Detection System Based on CPU Spikes at Cloud-Scale Gateways
Jianyuan Lu, Tian Pan, Shan He, Mao Miao, Guangzhe Zhou, Yining Qi, Biao Lyu, Shunmin Zhu
The cloud network provides shared resources for tens of thousands of tenants to achieve economies of scale. However, heavy hitters caused by a single tenant can interfere with the processing at the cloud gateways, undermining the predictable performance expected by other cloud tenants. To prevent this, heavy-hitter detection becomes a key concern at the performance-critical cloud gateways, but it faces a dilemma between fine granularity and low overhead. In this work, we present CloudSentry, a scalable two-stage heavy-hitter detection system for multi-tenant cloud gateways that resolves this dilemma. CloudSentry runs a lightweight coarse-grained detection stage 24/7 to localize infrequent CPU spikes. It then invokes a fine-grained detection stage to precisely dump and analyze the potential heavy-hitter packets during those spikes. Afterwards, a more comprehensive analysis associates the heavy hitters with cloud service scenarios and invokes a corresponding backpressure procedure. CloudSentry significantly reduces memory, computation, and storage overhead compared with existing approaches. It has been deployed worldwide in Alibaba Cloud for over one year, yielding rich deployment experience. In a gateway cluster with an average traffic throughput of 251 Gbps, CloudSentry consumes only 2%-5% CPU utilization and 8 KB of runtime memory, and produces only 10 MB of heavy-hitter logs per month.
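The two-stage design can be pictured as follows. This sketch assumes psutil as the CPU-usage source and an abstract packet-sampling hook; the threshold, the flow_key/size packet attributes, and TOP_K are illustrative values, not CloudSentry's implementation.

```python
import collections

import psutil  # assumed available; any CPU-usage source would do

CPU_SPIKE_THRESHOLD = 80.0  # percent; illustrative, not CloudSentry's setting
TOP_K = 10                  # number of heavy-hitter candidates to report

def coarse_stage():
    """Stage 1: cheap check that runs 24/7 and only watches for CPU spikes."""
    return psutil.cpu_percent(interval=1.0) > CPU_SPIKE_THRESHOLD

def fine_stage(sample_packets):
    """Stage 2: dump a short packet sample during a spike and rank flows by bytes."""
    bytes_per_flow = collections.Counter()
    for pkt in sample_packets():            # e.g., a brief capture at the gateway
        bytes_per_flow[pkt.flow_key] += pkt.size
    return bytes_per_flow.most_common(TOP_K)

def run_once(sample_packets, backpressure):
    """Fine-grained work (and any backpressure) happens only during spikes."""
    if coarse_stage():
        for flow, nbytes in fine_stage(sample_packets):
            backpressure(flow, nbytes)
```

Because stage 2 runs only during the infrequent spikes, memory and log volume stay small, consistent with the 8 KB and 10 MB figures reported above.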
{"title":"A Two-Stage Heavy Hitter Detection System Based on CPU Spikes at Cloud-Scale Gateways","authors":"Jianyuan Lu, Tian Pan, Shan He, Mao Miao, Guangzhe Zhou, Yining Qi, Biao Lyu, Shunmin Zhu","doi":"10.1109/ICDCS51616.2021.00041","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00041","url":null,"abstract":"The cloud network provides sharing resources for tens of thousands of tenants to achieve economics of scale. However, heavy hitters caused by a single tenant will probably interfere with the processing of the cloud gateways, undermining the predictable performance expected by other cloud tenants. To prevent it, heavy hitter detection becomes a key concern at the performance-critical cloud gateways but faces the dilemma between fine granularity and low overhead. In this work, we present CloudSentry, a scalable two-stage heavy hitter detection system dedicated to multi-tenant cloud gateways against such a dilemma. CloudSentry contains a lightweight coarse-grained detection running 24/7 to localize infrequent CPU spikes. Then it invokes a fine-grained detection to precisely dump and analyze the potential heavy-hitter packets at the CPU spikes. After that, a more comprehensive analysis is conducted to associate heavy hitters with the cloud service scenarios and invoke a corresponding backpressure procedure. CloudSentry significantly reduces memory, computation and storage overhead compared with existing approaches. Additionally, it has been deployed world-wide in Alibaba Cloud for over one year, with rich deployment experiences. In a gateway cluster under an average traffic throughput of of 251Gbps, CloudSentry consumes only a fraction of 2%-5% CPU utilization with 8KB run-time memory, producing only 10MB heavy hitter logs during one month.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115832526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00093
GTCP: Hybrid Congestion Control for Cross-Datacenter Networks
Shaojun Zou, Jiawei Huang, Jingling Liu, Tao Zhang, Ning Jiang, Jianxin Wang
To improve the quality of experience for worldwide users, an increasing number of service providers deploy their services on geographically dispersed data centers connected by a wide area network (WAN). In cross-datacenter networks, however, the intra- and inter-datacenter parts have different characteristics, including switch buffer depth, round-trip time, and bandwidth. Moreover, most intra-DC flows belong to interactive services that require low delay, while inter-DC flows typically need to achieve high throughput. Unfortunately, existing sender-based and receiver-driven transport protocols do not consider the network heterogeneity between inter- and intra-DC networks, and thus fail to simultaneously achieve low latency for intra-DC flows and high throughput for inter-DC flows. This paper proposes a general hybrid congestion control mechanism called GTCP to address this problem. When an inter-DC flow detects congestion inside a data center, it switches to the receiver-driven mode to avoid impacting intra-DC flows. Otherwise, it switches back to the sender-based mode to proactively explore the available bandwidth. In addition, intra-DC flows leverage a pausing mechanism to eliminate queue build-up. Through a series of testbed experiments and large-scale NS2 simulations, we demonstrate that GTCP reduces flow completion time by up to 79.3% compared with existing protocols.
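The hybrid switching logic reads naturally as a small state machine. The congestion signal below (ECN marks echoed in ACKs by intra-DC switches) and all attribute names are our assumptions for illustration; GTCP's actual detection mechanism and wire format may differ.

```python
class GTCPFlow:
    """Hypothetical sketch of GTCP's mode switch for one inter-DC flow."""

    SENDER_BASED = "sender-based"        # proactively probes for bandwidth
    RECEIVER_DRIVEN = "receiver-driven"  # receiver paces credits, avoids queue build-up

    def __init__(self):
        self.mode = self.SENDER_BASED

    def on_ack(self, ack):
        if ack.ecn_marked_by_intra_dc_switch:
            # Congestion inside the destination data center: back off to the
            # receiver-driven mode so latency-sensitive intra-DC flows win.
            self.mode = self.RECEIVER_DRIVEN
        elif ack.rtts_without_congestion >= 2:
            # Congestion has cleared: return to sender-based mode to
            # proactively explore the available WAN bandwidth.
            self.mode = self.SENDER_BASED
```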
{"title":"GTCP: Hybrid Congestion Control for Cross-Datacenter Networks","authors":"Shaojun Zou, Jiawei Huang, Jingling Liu, Tao Zhang, Ning Jiang, Jianxin Wang","doi":"10.1109/ICDCS51616.2021.00093","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00093","url":null,"abstract":"To improve the quality of experience for worldwide users, an increasing number of service providers deploy their services on geographically dispersed data centers, which are connected by wide area network (WAN). In the cross-datacenter networks, however, the intra- and inter-datacenter parts have different characteristics, including switch buffer depth, round-trip time and bandwidth. Besides, most of intra-DC flows belong to interactive services that require low delay while inter-DC flows typically need to achieve high throughput. Unfortunately, existing sender-based and receiver-driven transport protocols do not consider the network heterogeneity between inter- and intra- DC networks so that they fail to simultaneously achieve low latency for intra-DC flows and high throughput for inter-DC flows. This paper proposes a general hybrid congestion control mechanism called GTCP to address this problem. When the inter-DC flow detects congestion inside data center, it switches to the receiver-driven mode to avoid the impact on intra-DC flows. Otherwise, it switches back to the sender-based mode to proactively explore the available bandwidth. Besides, the intra-DC flow leverages the pausing mechanism to eliminate the queue build-up. Through a series of testbed experiments and large-scale NS2 simulations, we demonstrate that GTCP reduces flow completion time by up to 79.3% compared with existing protocols.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116764313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00051
ProgrammabilityMedic: Predictable Path Programmability Recovery under Multiple Controller Failures in SD-WANs
Songshi Dou, Zehua Guo, Yuanqing Xia
Software-Defined Networking (SDN) promises good network performance in Wide Area Networks (WANs) through logically centralized control over physically distributed controllers. In Software-Defined WANs (SD-WANs), maintaining path programmability, which enables flexible path changes on flows, is crucial for maintaining network performance under traffic variation. However, when controllers fail, existing solutions are essentially coarse-grained switch-controller mapping solutions and recover the path programmability of only a limited number of offline flows, i.e., flows that traverse offline switches controlled by failed controllers. In this paper, we propose ProgrammabilityMedic (PM) to provide predictable path programmability recovery under controller failures in SD-WANs. The key idea of PM is to approximately realize flow-controller mappings using the hybrid SDN/legacy routing supported by high-end commercial SDN switches. Using hybrid routing, we can recover programmability by selecting, at fine granularity, a routing mode for each offline flow at each offline switch to fit the control resources available from active controllers. Thus, PM can effectively map offline switches to active controllers to improve recovery efficiency. Simulation results show that PM outperforms existing switch-level solutions, maintaining balanced programmability and increasing the total programmability of recovered offline flows by up to 315% under two controller failures and 340% under three controller failures.
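One way to make the fine-grained mode selection concrete is a greedy budgeted assignment over (flow, switch) pairs. The gain/cost fields and the greedy rule below are our illustration, not PM's published algorithm; the grounded idea is that each offline flow gets either the programmable SDN mode or the legacy mode at each offline switch, subject to the control resources of active controllers.

```python
def assign_routing_modes(pairs, controller_capacity):
    """Greedy sketch. pairs: iterable of (flow, switch) objects with
    .gain (programmability recovered), .cost (control resource consumed),
    and .controller (the active controller that would manage the switch).
    controller_capacity: dict mapping controller -> remaining capacity."""
    assignment = {}
    # Spend control resources on the pairs with the best gain per unit cost.
    for pair in sorted(pairs, key=lambda p: p.gain / p.cost, reverse=True):
        if controller_capacity[pair.controller] >= pair.cost:
            controller_capacity[pair.controller] -= pair.cost
            assignment[pair] = "sdn"     # path stays programmable
        else:
            assignment[pair] = "legacy"  # forwarding still works, no path changes
    return assignment
```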
{"title":"ProgrammabilityMedic: Predictable Path Programmability Recovery under Multiple Controller Failures in SD-WANs","authors":"Songshi Dou, Zehua Guo, Yuanqing Xia","doi":"10.1109/ICDCS51616.2021.00051","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00051","url":null,"abstract":"Software-Defined Networking (SDN) promises good network performance in Wide Area Networks (WANs) with the logically centralized control using physically distributed controllers. In Software-Defined WANs (SD-WANs), maintaining path programmability, which enables flexible path change on flows, is crucial for maintaining network performance under traffic variation. However, when controllers fail, existing solutions are essentially coarse-grained switch-controller mapping solutions and only recover the path programmability of a limited number of offline flows, which traverse offline switches controlled by failed controllers. In this paper, we propose ProgrammabilityMedic (PM) to provide predictable path programmability recovery under controller failures in SD-WANs. The key idea of PM is to approximately realize flow-controller mappings using hybrid SDN/legacy routing supported by high-end commercial SDN switches. Using the hybrid routing, we can recover programmability by fine-grainedly selecting a routing mode for each offline flow at each offline switch to fit the given control resource from active controllers. Thus, PM can effectively map offline switches to active controllers to improve recovery efficiency. Simulation results show that PM outperforms existing switch-level solutions by maintaining balanced programmability and increasing the total programmability of recovered offline flows up to 315% under two controller failures and 340% under three controller failures.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116935753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00059
Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks
S. Shi, Lin Zhang, Bo Li
Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training of deep models. However, SGD utilizes only the first-order gradient in model parameter updates, so training may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up training, among which Kronecker-Factored Approximate Curvature (K-FAC) emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed K-FAC (D-KFAC), each iteration incurs extensive computation and introduces extra communication. In this work, we propose smart-parallelism D-KFAC (SPD-KFAC), which overlaps computing and communication tasks to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factor computation and communication with dynamic tensor fusion, and 3) we develop a load-balanced placement for inverting multiple matrices on GPU clusters. We conduct real-world experiments on a 64-GPU cluster with a 100 Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme achieves a 10%-35% improvement over state-of-the-art algorithms.
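For step 3, a standard way to balance the matrix inversions is the longest-processing-time heuristic: sort factors by size and always assign the next one to the least-loaded GPU, with inversion cost modeled as cubic in the factor's dimension. This sketch shows that heuristic under our cost assumption; the paper's exact placement strategy may differ.

```python
import heapq

def place_inversions(factor_dims, n_gpus):
    """factor_dims: {factor_id: matrix_dimension}. Returns {factor_id: gpu_id}."""
    loads = [(0.0, gpu) for gpu in range(n_gpus)]  # (accumulated cost, gpu id)
    heapq.heapify(loads)
    placement = {}
    # Largest factors first, each to the currently least-loaded GPU.
    for fid, dim in sorted(factor_dims.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(loads)
        placement[fid] = gpu
        heapq.heappush(loads, (load + float(dim) ** 3, gpu))  # ~O(dim^3) inversion
    return placement
```

For example, place_inversions({"conv1": 64, "conv2": 256, "fc": 1024}, 2) keeps the expensive fc inversion alone on one GPU and batches the two cheaper ones on the other.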
{"title":"Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks","authors":"S. Shi, Lin Zhang, Bo Li","doi":"10.1109/ICDCS51616.2021.00059","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00059","url":null,"abstract":"Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD only utilizes the first-order gradient in model parameter updates, which may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, in which the Kronecker-Factored Approximate Curvature (KFAC) emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), it incurs extensive computation as well as introduces extra communications during each iteration. In this work, we propose D-KFAC (SPD-KFAC) with smart parallelism of computing and communication tasks to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factors computation and communication with dynamic tensor fusion, and 3) we develop a load balancing placement for inverting multiple matrices on GPU clusters. We conduct realworld experiments on a 64-GPU cluster with 100Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme can achieve 10%-35% improvement over state-of-the-art algorithms.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115210201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00120
Poster: Function Delivery Network: Extending Serverless to Heterogeneous Computing
Anshul Jindal, Mohak Chadha, M. Gerndt, Julian Frielinghaus, Vladimir Podolskiy, Pengfei Chen
Many of today's cloud applications are spread over heterogeneous connected computing resources and are highly dynamic in their structure and resource requirements. However, serverless computing and Function-as-a-Service (FaaS) platforms are limited to homogeneous clusters and homogeneous functions. We introduce the Function Delivery Network (FDN), an extension of FaaS to heterogeneous computing that supports heterogeneous functions through a network of distributed heterogeneous target platforms. A target platform is the combination of a cluster of homogeneous computing systems and a FaaS platform on top of it. FDN provides Function-Delivery-as-a-Service (FDaaS), delivering function invocations to the right target platform. Evaluating across five distributed target platforms, we showcase the opportunities FDN offers, such as collaborative execution between multiple target platforms and exploiting varied target-platform characteristics, in fulfilling two objectives when scheduling function invocations: meeting Service Level Objective (SLO) requirements and energy efficiency.
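A delivery policy balancing the two objectives could look like the sketch below; predicted_latency and energy_per_invocation are hypothetical per-platform models we introduce for illustration, not FDN's published interface.

```python
def deliver(invocation, platforms):
    """Sketch: route an invocation to the most energy-efficient target
    platform whose predicted latency still meets the invocation's SLO."""
    feasible = [p for p in platforms
                if p.predicted_latency(invocation) <= invocation.slo_latency]
    if not feasible:
        # No platform meets the SLO: fall back to the fastest one.
        return min(platforms, key=lambda p: p.predicted_latency(invocation))
    return min(feasible, key=lambda p: p.energy_per_invocation(invocation))
```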
{"title":"Poster: Function Delivery Network: Extending Serverless to Heterogeneous Computing","authors":"Anshul Jindal, Mohak Chadha, M. Gerndt, Julian Frielinghaus, Vladimir Podolskiy, Pengfei Chen","doi":"10.1109/ICDCS51616.2021.00120","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00120","url":null,"abstract":"Several of today's cloud applications are spread over heterogeneous connected computing resources and are highly dynamic in their structure and resource requirements. However, serverless computing and Function-as-a-Service (FaaS) platforms are limited to homogeneous clusters and homogeneous functions. We introduce an extension of FaaS to heterogeneous computing and to support heterogeneous functions through a network of distributed heterogeneous target platforms called Function Delivery Network (FDN). A target platform is a combination of a cluster of a homogeneous computing system and a FaaS platform on top of it. FDN provides Function-Delivery-as-a-Service (FDaaS), delivering the function invocations to the right target platform. We showcase the opportunities such as collaborative execution between multiple target platforms and varied target platform's characteristics that the FDN offers in fulfilling two objectives: Service Level Objective (SLO) requirements and energy efficiency when scheduling functions invocations by evaluating over five distributed target platforms.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127078884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00108
Demo: Application Monitoring as a Network Service
Mona Elsaadawy, Laetitia Fesselier, Bettina Kemme
The recent rise of cloud applications, which represent large, complex, modern distributed services, has made performance monitoring a major issue and a critical process for both cloud providers and cloud customers. Many different monitoring techniques are used, such as tracking resource consumption, performing application-specific measurements, or analyzing message exchanges. Typically, the collected data is logged at the host on which the application is deployed, then either analyzed locally or forwarded to a remote analysis host. In contrast, this demonstration paper presents a Monitoring-as-a-Service (MaaS) prototype that uses advances in Software-Defined Networking (SDN) to move some of the logging functionality into the network. The core of our MaaS is implemented as a virtual network function in which agents are co-located with software switches to extract performance metrics from the message flows between components in a non-intrusive manner and send the calculated measures to clients for visualization in near real-time. The MaaS is flexible in how it is deployed and does not require instrumenting software or platforms. In our demo, we show the tool in action, demonstrating how users can choose to monitor different service types and performance metrics in a user-friendly manner.
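As an illustration of non-intrusive metric extraction, an agent at a software switch could pair requests with responses on mirrored traffic to derive per-request response times. The packet attributes below (conn, is_request, ts) are hypothetical stand-ins for whatever the agent parses out of the flow.

```python
def response_times(packets):
    """Yield (connection, latency) pairs from a mirrored message flow,
    matching each request to the next response on the same connection."""
    pending = {}  # connection id -> timestamp of the outstanding request
    for pkt in packets:
        if pkt.is_request:
            pending[pkt.conn] = pkt.ts
        elif pkt.conn in pending:
            yield pkt.conn, pkt.ts - pending.pop(pkt.conn)
```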
{"title":"Demo: Application Monitoring as a Network Service","authors":"Mona Elsaadawy, Laetitia Fesselier, Bettina Kemme","doi":"10.1109/ICDCS51616.2021.00108","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00108","url":null,"abstract":"The recent rise of cloud applications, representing large complex modern distributed services, has made performance monitoring a major issue and a critical process for both cloud providers and cloud customers. Many different monitoring techniques are used such as tracking resource consumption, performing application-specific measures or analyzing message exchanges. Typically the collected data is logged at the host on which the application is deployed, then either analyzed locally or forwarded to a remote analysis host. In contrast, this demonstration paper presents a Monitoring as a Service prototype that uses the advances in Software Defined Networking (SDN) to move some of the logging functionality into the network. The core of our MaaS is implemented as a virtual network function where agents are co-located with software switches in order to extract performance metrics from the message flows between components in a non-intrusive manner and send the calculated measures to the clients for visualization in near real-time. The MaaS has a lot of flexibility in how it is deployed and does not require to instrument software or platforms. In our demo we show the tool in action demonstrating how users can choose to monitor different service types and performance metrics in a user-friendly manner.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123690065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00064
Blockumulus: A Scalable Framework for Smart Contracts on the Cloud
Nikolay Ivanov, Qiben Yan, Qingyang Wang
Public blockchains have spurred the growing popularity of decentralized transactions and smart contracts, especially in financial markets. However, public blockchains are limited in transaction throughput, storage availability, and compute capacity. To avoid transaction gridlock, public blockchains impose large fees and per-block resource limits, making it difficult to accommodate the ever-growing transaction demand. Previous research has endeavored to improve the scalability and performance of blockchains through various technologies, such as side-chaining, sharding, secured off-chain computation, communication network optimizations, and efficient consensus protocols. However, these approaches have not attained widespread adoption due to their inability to deliver cloud-like performance in terms of scalability of transaction throughput, storage, and compute capacity. In this work, we determine that the major obstacle to public blockchain scalability is the underlying unstructured P2P network. We further show that a centralized network can support the deployment of decentralized smart contracts. We propose a novel approach for achieving scalable decentralization: instead of trying to make the blockchain scalable, we deliver decentralization to the already scalable cloud by using an Ethereum smart contract. We introduce Blockumulus, a framework that can deploy decentralized cloud smart contract environments using a novel technique called overlay consensus. Through experiments, we demonstrate that Blockumulus is scalable in all three dimensions: computation, data storage, and transaction throughput. Besides eliminating the current code execution and storage restrictions, Blockumulus delivers a transaction latency between 2 and 5 seconds under normal load. Moreover, a stress test of our prototype reveals the ability to execute 20,000 simultaneous transactions in under 26 seconds, which is on par with the average throughput of worldwide credit card transactions.
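The abstract does not spell out overlay consensus, but the anchoring idea it builds on, executing transactions on an overlay of cloud nodes and committing only a compact digest to Ethereum, can be caricatured as follows. The digest scheme and the commit interface are entirely our assumptions.

```python
import hashlib
import json

def state_digest(state):
    """Deterministic digest of the off-chain (cloud-side) state."""
    canonical = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def checkpoint(state, epoch, anchor):
    """Commit one epoch's digest on-chain; `anchor` is a stand-in for
    whatever Ethereum-contract interface the system actually exposes."""
    anchor.commit(epoch, state_digest(state))  # one cheap on-chain write per epoch
```

The point of such a split is that throughput-heavy execution happens in the scalable cloud, while the public chain only notarizes checkpoints.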
{"title":"Blockumulus: A Scalable Framework for Smart Contracts on the Cloud","authors":"Nikolay Ivanov, Qiben Yan, Qingyang Wang","doi":"10.1109/ICDCS51616.2021.00064","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00064","url":null,"abstract":"Public blockchains have spurred the growing popularity of decentralized transactions and smart contracts, especially on the financial market. However, public blockchains exhibit their limitations on the transaction throughput, storage availability, and compute capacity. To avoid transaction gridlock, public blockchains impose large fees and per-block resource limits, making it difficult to accommodate the ever-growing high transaction demand. Previous research endeavors to improve the scalability and performance of blockchain through various technologies, such as side-chaining, sharding, secured off-chain computation, communication network optimizations, and efficient consensus protocols. However, these approaches have not attained a widespread adoption due to their inability in delivering a cloud-like performance, in terms of the scalability in transaction throughput, storage, and compute capacity. In this work, we determine that the major obstacle to public blockchain scalability is their underlying unstructured P2P networks. We further show that a centralized network can support the deployment of decentralized smart contracts. We propose a novel approach for achieving scalable decentralization: instead of trying to make blockchain scalable, we deliver decentralization to already scalable cloud by using an Ethereum smart contract. We introduce Blockumulus, a framework that can deploy decentralized cloud smart contract environments using a novel technique called overlay consensus. Through experiments, we demonstrate that Blockumulus is scalable in all three dimensions: computation, data storage, and transaction throughput. Besides eliminating the current code execution and storage restrictions, Blockumulus delivers a transaction latency between 2 and 5 seconds under normal load. Moreover, the stress test of our prototype reveals the ability to execute 20,000 simultaneous transactions under 26 seconds, which is on par with the average throughput of worldwide credit card transactions.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122965448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00122
Poster: Quadratic-Time Algorithms for Optimal Min-Max Barrier Coverage with Mobile Sensors on the Plane
P. Yao, Longkun Guo, Jiguo Yu
Emerging applications give rise to the min-max line barrier coverage (LBC) problem, which aims to minimize the maximum movement of the sensors so as to balance energy consumption. In this paper, we devise an algorithm for LBC that finds an optimal solution within a runtime of $O(n^{2})$, improving on the previous state-of-the-art runtime of $O(n^{2}\log n)$ due to [7]. The key idea for accelerating the computation of optimal solutions is to use approximate solutions obtained by our approximation algorithm. Numerical experiments demonstrate that our algorithms outperform all other baselines, including the previous state-of-the-art algorithm.
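To make the problem concrete, here is a simplified parametric-search sketch: binary-search the movement budget and greedily test whether the sensors, taken left to right (an order-preserving simplification that does not hold in full generality), can cover the barrier $[0, L]$ on the x-axis. It illustrates the problem and the role of approximate solutions, not the paper's exact $O(n^{2})$ algorithm.

```python
import math

def feasible(sensors, L, r, lam):
    """Can sensors (x, y) with sensing radius r cover [0, L] on the x-axis
    when no sensor moves more than lam? Greedy, order-preserving test."""
    covered = 0.0
    for x, y in sorted(sensors):
        if lam < abs(y):
            continue                          # cannot even reach the barrier line
        slack = math.sqrt(lam * lam - y * y)  # horizontal slack once on the line
        if x - slack - r > covered:
            continue                          # placing it here would leave a gap
        p = min(x + slack, covered + r)       # rightmost center keeping contiguity
        covered = max(covered, p + r)
        if covered >= L:
            return True
    return covered >= L

def min_max_movement(sensors, L, r, eps=1e-6):
    """Binary search on the budget; returns a value within eps of optimal
    (for this simplified model), or None if coverage is impossible."""
    lo = 0.0
    hi = max(math.hypot(x, y) for x, y in sensors) + L  # reaches any barrier point
    if not feasible(sensors, L, r, hi):
        return None                           # too few sensors: 2*r*n < L
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if feasible(sensors, L, r, mid):
            hi = mid
        else:
            lo = mid
    return hi
```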
{"title":"Poster: Quadratic-Time Algorithms for Optimal Min-Max Barrier Coverage with Mobile Sensors on the Plane","authors":"P. Yao, Longkun Guo, Jiguo Yu","doi":"10.1109/ICDCS51616.2021.00122","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00122","url":null,"abstract":"Emerging applications impose the min-max line barrier coverage (LBC) problem that aims to minimize the maximum movement of the sensors for the sake of balancing energy consumption. In the paper, we devise an algorithm for LBC that finds an optimal solution within a runtime $O(n^{2})$, improving the previous state-of-art runtime $o(n^{2}log n)$ due to [7]. The key idea to accelerating the computation of the optimum solutions is to use approximation solutions that are obtained by our devised approximation algorithm. Numerical experiments demonstrate our algorithms outperform all the other baselines including the previous state-of-art algorithm.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"2 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121014911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-07-01. DOI: 10.1109/ICDCS51616.2021.00077
Privacy-Preserving Neural Network Inference Framework via Homomorphic Encryption and SGX
Huizi Xiao, Qingyang Zhang, Qingqi Pei, Weisong Shi
Edge computing is a promising paradigm that pushes computing, storage, and energy to the network's edge. It utilizes data near the users to provide real-time, energy-efficient, and reliable services. Neural network inference in edge computing is a powerful tool for various applications. However, edge servers inevitably collect more sensitive personal information about users. Ensuring security and privacy while obtaining accurate inference results is the most basic requirement for users. Homomorphic encryption (HE) is a confidential-computing technology that performs mathematical operations directly on encrypted data, but it can carry out only limited addition and multiplication operations, with very low efficiency. Intel Software Guard Extensions (SGX) can provide a trusted, isolated space in the CPU to ensure the confidentiality and integrity of the code and data executed, but hardware design limitations make several of its defects hard to overcome when applying SGX to inference services. This paper proposes a hybrid framework that utilizes SGX to accelerate HE-based convolutional neural network (CNN) inference, eliminating the approximation operations in HE and thereby improving inference accuracy in theory. In addition, SGX serves as a built-in trusted third party to distribute keys, improving our framework's scalability and flexibility. We quantify the cost of various CNN operations in the respective cases of HE and SGX to provide guidance for practice. Taking connected and autonomous vehicles as an edge computing case study, we implement this hybrid framework on a CNN to verify its feasibility and advantages.
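The layer-wise split the abstract suggests can be sketched as follows: linear layers, which need only additions and multiplications, stay under HE on the untrusted host, while non-linear activations, which HE would otherwise approximate with polynomials, run exactly inside the enclave. The he and enclave objects are stand-ins for an HE library and an enclave interface, not the paper's actual code.

```python
def hybrid_infer(enc_x, layers, he, enclave):
    """Run one encrypted input through a CNN, splitting work between HE
    (linear layers) and SGX (exact non-linear layers). Sketch only."""
    for layer in layers:
        if layer.kind in ("conv", "fc"):
            # Additions/multiplications only: evaluate under HE, data stays
            # encrypted on the untrusted edge server.
            enc_x = he.linear(layer.weights, enc_x)
        else:
            # e.g., ReLU or max-pool: decrypt inside the enclave, apply the
            # exact function (no polynomial approximation), re-encrypt.
            x = enclave.decrypt(enc_x)
            enc_x = enclave.encrypt(layer.apply(x))
    return enc_x
```

Eliminating the polynomial approximations is what lets the hybrid keep the plaintext model's accuracy, at the price of an HE-to-SGX handoff per non-linear layer.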
{"title":"Privacy-Preserving Neural Network Inference Framework via Homomorphic Encryption and SGX","authors":"Huizi Xiao, Qingyang Zhang, Qingqi Pei, Weisong Shi","doi":"10.1109/ICDCS51616.2021.00077","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00077","url":null,"abstract":"Edge computing is a promising paradigm that pushes computing, storage, and energy to the networks' edge. It utilizes the data nearby the users to provide real-time, energy-efficient, and reliable services. Neural network inference in edge computing is a powerful tool for various applications. However, edge server will collect more personal sensitive information of users inevitably. It is the most basic requirement for users to ensure their security and privacy while obtaining accurate inference results. Homomorphic encryption (HE) technology is confidential computing that directly performs mathematical computing on encrypted data. But it only can carry out limited addition and multiplication operation with very low efficiency. Intel software guard extension (SGX) can provide a trusted isolation space in the CPU to ensure the confidentiality and integrity of code and data executed. But several defects are hard to overcome due to hardware design limitations when applying SGX in inference services. This paper proposes a hybrid framework utilizing SGX to accelerate the HE-based convolutional neural network (CNN) inference, eliminating the approximation operations in HE to improve inference accuracy in theory. Besides, SGX is also taken as a built-in trusted third party to distribute keys, thereby improving our framework's scalability and flexibility. We have quantified the various CNN operations in the respective cases of HE and SGX to provide the foresight practice. Taking the connected and autonomous vehicles as a case study in edge computing, we implemented this hybrid framework in CNN to verify its feasibility and advantage.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124982852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}