Poster: Approximate Caching for Mobile Image Recognition
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00125
James Mariani, Yongqi Han, Li Xiao
Many emerging mobile applications rely heavily upon image recognition of both static images and live video streams. Image recognition is commonly achieved using deep neural networks (DNNs) which can achieve high accuracy but also incur significant computation latency and energy consumption on resource-constrained smartphones. We introduce an in-memory caching paradigm that supports infrastructure-less collaborative computation reuse in smartphone image recognition. We propose using the inertial movement of smartphones, the locality inherent in video streams, as well as information from nearby, peer-to-peer devices to maximize the computation reuse opportunities in mobile image recognition. Experimental results show that our system lowers the average latency of standard mobile neural network image recognition applications by up to 94% with minimal loss of recognition accuracy.
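The paper does not publish an API, but the core reuse idea can be pictured with a minimal sketch: cache recognition results keyed by a compact image feature vector and return a cached label when a new frame's features fall within a similarity threshold, falling back to the DNN on a miss. The class name, `feature` representation, and threshold value below are illustrative assumptions, not the authors' implementation.

```python
# Minimal illustrative sketch of approximate result caching for image
# recognition (assumed interface; not the authors' implementation).
import numpy as np

class ApproximateCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold      # cosine-similarity reuse threshold (assumed value)
        self.entries = []               # list of (feature_vector, cached_label)

    def lookup(self, feat):
        """Return a cached label if a stored frame is similar enough, else None."""
        for stored, label in self.entries:
            sim = float(np.dot(stored, feat) /
                        (np.linalg.norm(stored) * np.linalg.norm(feat) + 1e-9))
            if sim >= self.threshold:
                return label            # computation reuse: skip the DNN
        return None

    def insert(self, feat, label):
        self.entries.append((feat, label))

def recognize(frame_feat, cache, run_dnn):
    """Consult the cache first; fall back to the full DNN on a miss."""
    label = cache.lookup(frame_feat)
    if label is None:
        label = run_dnn(frame_feat)     # expensive on-device inference
        cache.insert(frame_feat, label)
    return label
```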
{"title":"Poster: Approximate Caching for Mobile Image Recognition","authors":"James Mariani, Yongqi Han, Li Xiao","doi":"10.1109/ICDCS51616.2021.00125","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00125","url":null,"abstract":"Many emerging mobile applications rely heavily upon image recognition of both static images and live video streams. Image recognition is commonly achieved using deep neural networks (DNNs) which can achieve high accuracy but also incur significant computation latency and energy consumption on resource-constrained smartphones. We introduce an in-memory caching paradigm that supports infrastructure-less collaborative computation reuse in smartphone image recognition. We propose using the inertial movement of smartphones, the locality inherent in video streams, as well as information from nearby, peer-to-peer devices to maximize the computation reuse opportunities in mobile image recognition. Experimental results show that our system lowers the average latency of standard mobile neural network image recognition applications by up to 94% with minimal loss of recognition accuracy.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"5 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132365633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PupilMeter: Modeling User Preference with Time-Series Features of Pupillary Response
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00102
Hongbo Jiang, Xiangyu Shen, Daibo Liu
Modeling user preferences is a challenging problem in the wide application of recommendation services. Existing methods mainly exploit activities that are only loosely related to users' inner feelings to build preference models, which can raise model uncertainty and introduce prediction error. In this paper, we present PupilMeter, the first system to explore the correlation between user preference and the instant pupillary response. Specifically, we conduct extensive experiments to investigate the generic physiological process of the pupillary response while users view specific content on smart devices, and identify six key time-series features relevant to users' preference degree using a Random Forest. However, the diversity of pupillary responses caused by inherent individual differences poses significant challenges to the generality of the learned model. To solve this problem, we use a Multilayer Perceptron to automatically train and adjust the importance of the key features for each individual and then generate a personalized user preference model associated with the user's pupillary response. We have prototyped PupilMeter and conducted both controlled experiments and in-the-wild studies with 30 recruited volunteers to comprehensively evaluate its effectiveness. Experimental results demonstrate that PupilMeter can accurately identify users' preferences.
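As a rough sketch of the two-stage pipeline described above (feature ranking with a Random Forest, then a per-user Multilayer Perceptron), the following uses scikit-learn; the data shapes, random placeholder labels, and hyperparameters are assumptions for illustration only, not the paper's configuration.

```python
# Illustrative two-stage sketch (assumed data and hyperparameters):
# 1) rank candidate time-series pupil features with a Random Forest,
# 2) fit a per-user MLP on the top-ranked features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))      # 200 viewing sessions, 12 candidate features (placeholder)
y = rng.uniform(0, 1, size=200)     # preference degree in [0, 1] (placeholder labels)

# Stage 1: feature relevance via Random Forest importances.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top6 = np.argsort(rf.feature_importances_)[::-1][:6]   # six key features, as in the paper

# Stage 2: personalized model on the selected features for one user.
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
mlp.fit(X[:, top6], y)
print("selected feature indices:", top6)
```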
{"title":"PupilMeter: Modeling User Preference with Time-Series Features of Pupillary Response","authors":"Hongbo Jiang, Xiangyu Shen, Daibo Liu","doi":"10.1109/ICDCS51616.2021.00102","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00102","url":null,"abstract":"Modeling user preferences is a challenging problem in the wide application of recommendation services. Existing methods mainly exploit multiple activities irrelevant to user's inner feeling to build user preference model, which may raise model uncertainty and bring about prediction error. In this paper, we present PupilMeter - the first system that moves one step forward towards exploring the correlation between user preference and the instant pupillary response. Specifically, we conduct extensive experiments to dig into the generic physiological process of pupillary response while viewing specific content on smart devices, and further figure out six key time-series features relevant to users' preference degree by using Random Forest. However, the diversity of pupillary responses caused by inherent individual difference poses significant challenges to the generality of learned model. To solve this problem, we use Multilayer Perceptron to automatically train and adjust the importance of key features for each individual and then generate a personalized user preference model associated with user's pupillary response. We have prototyped PupilMeter and conducted both test experiments and in-the-wild studies to comprehensively evaluate the effectiveness of PupilMeter by recruiting 30 volunteers. Experimental results demonstrate that PupilMeter can accurately identify users' preference.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127861257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Demo: Software-defined Virtual Networking Across Multiple Edge and Cloud Providers with EdgeVPN.io
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00107
R. Figueiredo, Kensworth C. Subratie
This demonstration will showcase EdgeVPN.io, an open-source software-defined virtual private network (VPN) that enables the creation of scalable layer-2 virtual networks across multiple providers - including scenarios where devices are behind Network Address Translation (NAT) and firewall middleboxes. Its architecture combines a distributed software-defined networking (SDN) control plane and a scalable structured peer-to-peer overlay of Internet tunnels that form its datapath. EdgeVPN.io provides a foundation for the deployment of virtual networks that enable research and development in distributed computing. The demonstration will include a brief overview of the architecture, and will show step-by-step how a researcher can deploy EdgeVPN.io networks on devices including Raspberry Pis, Jetson Nanos, and VMs/Docker containers in the cloud. Attendees will be provided with trial resources to allow them to follow the demonstration hands-on if they so desire.
{"title":"Demo: Software-defined Virtual Networking Across Multiple Edge and Cloud Providers with EdgeVPN.io","authors":"R. Figueiredo, Kensworth C. Subratie","doi":"10.1109/ICDCS51616.2021.00107","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00107","url":null,"abstract":"This demonstration will showcase EdgeVPN.io, an open-source software-defined virtual private network (VPN) that enables the creation of scalable layer-2 virtual networks across multiple providers - including scenarios where devices are behind Network Address Translation (NAT) and firewall middleboxes. Its architecture combines a distributed software-defined networking (SDN) control plane and a scalable structured peer-to-peer overlay of Internet tunnels that form its datapath. EdgeVPN.io provides a foundation for the deployment of virtual networks that enable research and development in distributed computing. The demonstration will include a brief overview of the architecture, and will show step-by-step how a researcher can deploy EdgeVPN.io networks on devices including Raspberry Pis, Jetson Nanos, and VMs/Docker containers in the cloud. Attendees will be provided with trial resources to allow them to follow the demonstration hands-on if they so desire.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"10 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129195917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistical Tail-Latency Bounded QoS Provisioning for Parallel and Distributed Data Centers
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00078
Xi Zhang, Qixuan Zhu
Large-scale interactive services distribute clients' requests across a large number of physical machines in data center architectures to enhance quality-of-service (QoS) performance. In parallel and distributed data center architectures, even a temporary spike in the latency of any service component can significantly impact the end-to-end delay. Besides the average latency, the tail latency (i.e., worst-case latency) of a service has also attracted considerable research attention. Tail latency is a critical performance metric in data centers, where long tail latencies refer to the higher percentiles of latency (such as the 98th or 99th) in comparison to the average. While the statistical delay-bounded QoS provisioning theory has been shown to be a powerful technique and useful performance metric for supporting time-sensitive multimedia transmissions over mobile computing networks, how to efficiently extend and implement this technique and performance metric to statistically bound the tail latency in data center networks has neither been well understood nor thoroughly studied. In this paper, we model and characterize the tail-latency distribution in a three-layer parallel and distributed data center architecture, where clients request different types of services and then download the requested data packets from the data center through a first-come-first-served M/M/1 queueing system. We first define statistical tail-latency bounded QoS, and investigate the tail-latency problem through generalized extreme value (GEV) theory and generalized Pareto distribution (GPD) theory. Then, we propose a scheme to identify the dominant sources of latency variance in a semantic context, so that we can optimize the instructions of those sources to reduce the latency tail. Finally, using numerical analyses, we validate and evaluate the developed modeling techniques and schemes for characterizing tail-latency QoS provisioning in data center networks.
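One concrete way to ground the GPD-based tail characterization is the standard peaks-over-threshold fit: simulate M/M/1 waiting times with the Lindley recursion, fit a generalized Pareto distribution to exceedances over a high threshold, and read off an extreme percentile. The arrival rate, service rate, and threshold below are assumptions chosen for illustration, not the paper's parameters or its exact model.

```python
# Peaks-over-threshold sketch: GPD fit to the tail of simulated M/M/1 waits
# (illustrative rates and threshold; not the paper's configuration).
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
lam, mu, n = 0.8, 1.0, 200_000            # arrival rate, service rate, number of requests
inter = rng.exponential(1 / lam, n)
serv = rng.exponential(1 / mu, n)

# Lindley recursion for the waiting time in a first-come-first-served M/M/1 queue.
w = np.zeros(n)
for i in range(1, n):
    w[i] = max(0.0, w[i - 1] + serv[i - 1] - inter[i])

u = np.quantile(w, 0.95)                  # high threshold for exceedances
exc = w[w > u] - u
xi, _, sigma = genpareto.fit(exc, floc=0.0)

# Tail-latency estimate, e.g. the 99.9th percentile via the fitted GPD tail.
p, p_u = 0.999, np.mean(w > u)
q = u + genpareto.ppf(1 - (1 - p) / p_u, xi, loc=0.0, scale=sigma)
print(f"empirical p99.9 = {np.quantile(w, p):.2f}, GPD estimate = {q:.2f}")
```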
{"title":"Statistical Tail-Latency Bounded QoS Provisioning for Parallel and Distributed Data Centers","authors":"Xi Zhang, Qixuan Zhu","doi":"10.1109/ICDCS51616.2021.00078","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00078","url":null,"abstract":"The large-scale interactive services distribute clients' requests across a large number of physical machine in data center architectures to enhance the quality-of-service (QoS) performance. In parallel and distributed data center architecture, even a temporary spike in latency of any service component can significantly impact the end-to-end delay. Besides the average latency, tail-latency (i.e., worst case latency) of a service has also attracted a lot of research attentions. The tail-latency is a critical performance metric in data centers, where long tail latencies refer to the higher percentiles (such as 98th, 99th) of latency in comparison to the average latency time. While the statistical delay-bounded QoS provisioning theory has been shown to be a powerful technique and useful performance metric for supporting time-sensitive multimedia transmissions over mobile computing networks, how to efficiently extend and implement this technique/performance-metric for statistically bounding the tail-latency for data center networks has neither been well understood nor thoroughly studied. In this paper, we model and characterize the tail-latency distribution in a three-layer parallel and distributed data center architecture, where clients request different types of services and ten download their requested data packets from data center through a first-come-first-serve M/M/1 queueing system. We first define the statistical tail-latency bounded QoS, and investigate the tail-latency problem through generalized extreme value (GEV) theory and generalized Pareto distribution (GPD) theory. Then, we propose a scheme to identify the dominant sources of latency variance in a semantic context, so that we are able to optimize the instructions of those sources to reduce the latency tail. Finally, using numerical analyses we validate and evaluate our developed modeling techniques and schemes for characterizing the tail-latency QoS provisioning theories in supporting data center networks.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116918338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Poster: Multi-agent Combinatorial Bandits with Moving Arms
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00126
Zhiming Huang, Bingshan Hu, Jianping Pan
In this paper, we study a distributed stochastic multi-armed bandit problem that can address many real-world problems such as task assignment across multiple crowdsourcing platforms, traffic scheduling in wireless networks with multiple access points, and caching at the cellular network edge. We propose an efficient algorithm called multi-agent combinatorial upper confidence bound (MACUCB) with provable performance guarantees and low communication overhead. Furthermore, we perform extensive experiments to show the effectiveness of the proposed algorithm.
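The poster gives no pseudocode, so the following is only a generic combinatorial UCB sketch of the kind of index policy MACUCB builds on: in each round the agent plays the k arms with the largest upper confidence bounds and updates their empirical means from the observed rewards. All names, reward distributions, and parameters are illustrative, and the multi-agent communication aspect is omitted.

```python
# Generic combinatorial UCB sketch (illustrative; not the MACUCB algorithm itself).
import math
import random

def combinatorial_ucb(true_means, k, horizon, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        # UCB index: empirical mean + exploration bonus (unplayed arms first).
        ucb = [means[a] + math.sqrt(2 * math.log(t) / counts[a]) if counts[a] > 0
               else float("inf") for a in range(n_arms)]
        chosen = sorted(range(n_arms), key=lambda a: ucb[a], reverse=True)[:k]
        for a in chosen:                  # play the super-arm, observe Bernoulli rewards
            r = 1.0 if rng.random() < true_means[a] else 0.0
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
    return counts

print(combinatorial_ucb([0.2, 0.5, 0.8, 0.4], k=2, horizon=5000))
```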
{"title":"Poster: Multi-agent Combinatorial Bandits with Moving Arms","authors":"Zhiming Huang, Bingshan Hu, Jianping Pan","doi":"10.1109/ICDCS51616.2021.00126","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00126","url":null,"abstract":"In this paper, we study a distributed stochastic multi-armed bandit problem that can address many real-world problems such as task assignment for multiple crowdsourcing platforms, traffic scheduling in wireless networks with multiple access points and caching at cellular network edge. We propose an efficient algorithm called multi-agent combinatorial upper confidence bound (MACUCB) with provable performance guarantees and low communication overhead. Furthermore, we perform extensive experiments to show the effectiveness of the proposed algorithm.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127403598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FASTBLOCK: Accelerating Blockchains via Hardware Transactional Memory
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00032
Yue Li, Han Liu, Yuanliang Chen, Jianbo Gao, Zhenhao Wu, Zhi Guan, Zhong Chen
The efficiency of the block lifecycle determines the performance of a blockchain, and it is critically affected by the execution, mining, and validation steps of that lifecycle. To accelerate blockchains, most prior work focuses on optimizing the mining step while ignoring the other steps. In this paper, we propose a novel blockchain framework, FastBlock, to speed up the execution and validation steps by introducing efficient concurrency. To prevent potential concurrency violations, FastBlock uses symbolic execution to identify minimal atomic sections in each transaction and guarantees the atomicity of these sections during the execution step via an efficient concurrency control mechanism, hardware transactional memory (HTM). To enable a deterministic validation step, FastBlock concurrently re-executes transactions based on a happen-before graph without increasing the block size. Finally, we implement FastBlock and evaluate it in terms of conflicting-transaction rate, number of transactions per block, and varying thread counts. Our results indicate that FastBlock is efficient: with eight concurrent threads, the execution and validation steps achieve average speedups of 3.0x and 2.3x, respectively, over the original serial model.
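A minimal way to picture the deterministic re-execution step is to schedule transactions topologically over a happen-before graph, so a transaction only runs once all of its predecessors have finished; Python's graphlib expresses exactly that ordering. The graph and transaction names below are placeholders, and the real system would execute each ready batch concurrently under HTM rather than print it.

```python
# Sketch of deterministic replay along a happen-before graph (placeholder
# transactions; the real validation step executes ready batches concurrently).
from graphlib import TopologicalSorter

# happen_before[t] = set of transactions that must commit before t re-executes.
happen_before = {"tx1": set(), "tx2": {"tx1"}, "tx3": {"tx1"}, "tx4": {"tx2", "tx3"}}

ts = TopologicalSorter(happen_before)
ts.prepare()
while ts.is_active():
    ready = ts.get_ready()      # these transactions have no pending predecessors
    for tx in ready:            # they could safely be re-executed in parallel
        print("re-executing", tx)
        ts.done(tx)
```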
{"title":"FASTBLOCK: Accelerating Blockchains via Hardware Transactional Memory","authors":"Yue Li, Han Liu, Yuanliang Chen, Jianbo Gao, Zhenhao Wu, Zhi Guan, Zhong Chen","doi":"10.1109/ICDCS51616.2021.00032","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00032","url":null,"abstract":"The efficiency of block lifecycle determines the performance of blockchain, which is critically affected by the execution, mining and validation steps in blockchain lifecycle. To accelerate blockchains, many works focus on optimizing the mining step while ignoring other steps. In this paper, we propose a novel blockchain framework-FastBlock to speed up the execution and validation steps by introducing efficient concurrency. To efficiently prevent the potential concurrency violations, FastBlock utilizes symbolic execution to identify minimal atomic sections in each transaction and guarantees the atomicity of these sections in execution step via an efficient concurrency control mechanism-hardware transactional memory (HTM). To enable a deterministic validation step, FastBlock concurrently re-executes transactions based on a happen-before graph without increasing block size. Finally, we implement FastBlock and evaluate it in terms of conflicting transactions rate, number of transactions per block, and varying thread number. Our results indicate that FastBlock is efficient: the execution step and validation step speed up to 3.0x and 2.3x on average over the original serial model respectively with eight concurrent threads.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126139293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BiCord: Bidirectional Coordination among Coexisting Wireless Devices
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00037
Zihao Yu, Pengyu Li, C. Boano, Yuan He, Meng Jin, Xiuzhen Guo, Xiaolong Zheng
Cross-technology interference is a major threat to the dependability of low-power wireless communications. Due to power and bandwidth asymmetries, technologies such as Wi-Fi tend to dominate the RF channel and unintentionally destroy low-power wireless communications from resource-constrained technologies such as ZigBee, leading to severe coexistence issues. To address these issues, existing schemes make ZigBee nodes individually assess the RF channel's availability or let Wi-Fi appliances blindly reserve the medium for the transmissions of low-power devices. Without a two-way interaction between devices using different wireless technologies, these approaches apply only to limited scenarios or achieve inefficient network performance. This paper presents BiCord, a bidirectional coordination scheme in which resource-constrained wireless devices such as ZigBee nodes and powerful Wi-Fi appliances coordinate their activities to improve coexistence and enhance network performance. Specifically, in BiCord, ZigBee nodes directly request channel resources from Wi-Fi devices, which then reserve the channel for ZigBee transmissions on demand. This interaction continues until the transmission requirement of the ZigBee nodes is both fulfilled and understood by the Wi-Fi devices. In this way, BiCord avoids unnecessary channel allocations, maximizes the availability of the spectrum, and minimizes transmission delays. We evaluate BiCord on off-the-shelf Wi-Fi and ZigBee devices, demonstrating its effectiveness experimentally. Among other findings, our results show that BiCord increases channel utilization by up to 50.6% and reduces the average transmission delay of ZigBee nodes by 84.2% compared to state-of-the-art approaches.
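The abstract describes a request/reserve exchange rather than a concrete message format, so the toy model below only illustrates the two-way interaction: a ZigBee node asks for a specific amount of airtime and the Wi-Fi device grants a reservation sized exactly to that request. The message fields, units, and scheduling policy are assumptions, not BiCord's actual protocol.

```python
# Toy model of BiCord-style two-way coordination (assumed message fields).
from dataclasses import dataclass

@dataclass
class ChannelRequest:
    node_id: str
    airtime_ms: float        # how long the ZigBee node needs the channel

@dataclass
class ChannelGrant:
    node_id: str
    start_ms: float
    duration_ms: float

class WiFiCoordinator:
    """Grants back-to-back reservations sized exactly to each request."""
    def __init__(self):
        self.next_free_ms = 0.0

    def handle(self, req: ChannelRequest) -> ChannelGrant:
        grant = ChannelGrant(req.node_id, self.next_free_ms, req.airtime_ms)
        self.next_free_ms += req.airtime_ms     # reserve only what was asked for
        return grant

coord = WiFiCoordinator()
for req in [ChannelRequest("zigbee-1", 4.0), ChannelRequest("zigbee-2", 2.5)]:
    print(coord.handle(req))
```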
{"title":"BiCord: Bidirectional Coordination among Coexisting Wireless Devices","authors":"Zihao Yu, Pengyu Li, C. Boano, Yuan He, Meng Jin, Xiuzhen Guo, Xiaolong Zheng","doi":"10.1109/ICDCS51616.2021.00037","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00037","url":null,"abstract":"Cross-technology interference is a major threat to the dependability of low-power wireless communications. Due to power and bandwidth asymmetries, technologies such as Wi-Fi tend to dominate the RF channel and unintentionally destroy low-power wireless communications from resource-constrained technologies such as ZigBee, leading to severe coexistence issues. To address these issues, existing schemes make ZigBee nodes individually assess the RF channel's availability or let Wi-Fi appliances blindly reserve the medium for the transmissions of low-power devices. Without a two-way interaction between devices making use of different wireless technologies, these approaches have limited scenarios or achieve inefficient network performance. This paper presents BiCord, a bidirectional coordination scheme in which resource-constrained wireless devices such as ZigBee nodes and powerful Wi-Fi appliances coordinate their activities to increase coexistence and enhance network performance. Specifically, in BiCord, ZigBee nodes directly request channel resources from Wi-Fi devices, who then reserve the channel for ZigBee transmissions on-demand. This interaction continues until the transmission requirement of ZigBee nodes is both fulfilled and understood by Wi-Fi devices. This way, BiCord avoids unnecessary channel allocations, maximizes the availability of the spectrum, and minimizes transmission delays. We evaluate BiCord on off-the-shelf Wi-Fi and ZigBee devices, demonstrating its effectiveness experimentally. Among others, our results show that BiCord increases channel utilization by up to 50.6% and reduces the average transmission delay of ZigBee nodes by 84.2% compared to state-of-the-art approaches.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128045945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
INT-probe: Lightweight In-band Network-Wide Telemetry with Stationary Probes
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00090
Tian Pan, Xingchen Lin, Haoyu Song, Enge Song, Zizheng Bian, Hao Li, Jiao Zhang, Fuliang Li, Tao Huang, Chenhao Jia, Bin Liu
Visibility is essential for operating and troubleshooting intricate networks. In-band Network Telemetry (INT) has been embedded in the latest merchant silicon to offer high-precision device- and traffic-state visibility. INT itself is an underlying technique, and each INT instance covers only one monitoring path; network-wide measurement coverage therefore requires higher-level orchestration to provision multiple INT paths. Optimal path planning is expected to produce a minimum number of paths with a minimum number of overlapping links. Eulerian trails have been used to solve the general problem. However, in production networks, the vantage points where one can deploy probes to start and terminate INT paths are constrained. In this work, we propose an optimal path planning algorithm, INT-probe, which achieves network-wide telemetry coverage under the constraint of stationary probes. INT-probe formulates the constrained path planning as an extended multi-depot k-Chinese postman problem (MDCPP-set) and then reduces it to a solvable minimum-weight perfect matching problem. We analyze the algorithm's theoretical bound and complexity. We conduct extensive evaluations on both wide-area networks and data center networks of different scales and topologies, and show that INT-probe is efficient, high-performance, and practical for real-world deployment. For a large-scale data center network with 1125 switches, INT-probe generates 112 monitoring paths (a 50.4% reduction) while allowing only a 1.79% increase in total path length, and promptly resolves link failures within 744.71 ms.
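The reduction details are not given here, but the classical ingredient this line of work builds on, pairing odd-degree vertices through a minimum-weight matching over shortest-path distances so the graph admits a small set of covering trails, can be sketched with networkx. The topology below is a placeholder, and this is the textbook Chinese postman step rather than the paper's MDCPP-set algorithm.

```python
# Textbook Chinese-postman step: pair odd-degree vertices by a minimum-weight
# matching over shortest-path distances (placeholder topology; not INT-probe itself).
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 1), ("c", "d", 1),
                           ("d", "a", 1), ("a", "c", 2)])

odd = [v for v in G if G.degree(v) % 2 == 1]
dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))

# Complete graph on odd-degree vertices, weighted by shortest-path distance.
K = nx.Graph()
for i, u in enumerate(odd):
    for v in odd[i + 1:]:
        K.add_edge(u, v, weight=dist[u][v])

matching = nx.min_weight_matching(K)      # which odd-degree vertices to pair up
print("odd-degree vertices:", odd)
print("min-weight pairing:", matching)
```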
{"title":"INT-probe: Lightweight In-band Network-Wide Telemetry with Stationary Probes","authors":"Tian Pan, Xingchen Lin, Haoyu Song, Enge Song, Zizheng Bian, Hao Li, Jiao Zhang, Fuliang Li, Tao Huang, Chenhao Jia, Bin Liu","doi":"10.1109/ICDCS51616.2021.00090","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00090","url":null,"abstract":"Visibility is essential for operating and troubleshooting intricate networks. In-band Network Telemetry (INT) has been embedded in the latest merchant silicons to offer high-precision device and traffic state visibility. INT is actually an underlying technique and each INT instance covers only one monitoring path. The network-wide measurement coverage therefore requires a high-level orchestration to provision multiple INT paths. An optimal path planning is expected to produce a minimum number of paths with a minimum number of overlapping links. Eulerian trail has been used to solve the general problem. However, in production networks, the vantage points where one can deploy probes to start and terminate INT paths are constrained. In this work, we propose an optimal path planning algorithm, INT-probe, which achieves the network-wide telemetry coverage under the constraint of stationary probes. INT-probe formulates the constrained path planning into an extended multi-depot k-Chinese postman problem (MDCPP-set) and then reduces it to a solvable minimum weight perfect matching problem. We analyze algorithm's theoretical bound and the complexity. Extensive evaluation on both wide area networks and data center networks with different scales and topologies are conducted. We show INT-probe is efficient, high-performance, and practical for real-world deployment. For a large-scale data center networks with 1125 switches, INT-probe can generate 112 monitoring paths (reduced by 50.4 %) by allowing only 1.79% increase of the total path length, promptly resolving link failures within 744.71ms.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126518909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GRACE: A Compressed Communication Framework for Distributed Machine Learning
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00060
Hang Xu, Chen-Yu Ho, A. Abdelmoniem, Aritra Dutta, E. Bergou, Konstantinos Karatsenidis, M. Canini, Panos Kalnis
Powerful computer clusters are used nowadays to train complex deep neural networks (DNNs) on large datasets. Distributed training is increasingly communication-bound. For this reason, many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to reason about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (quantization, sparsification, hybrid, and low-rank). Next, we propose GRACE, a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate GRACE on TensorFlow and PyTorch and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets, and system configurations. We show that the DNN architecture affects the relative performance of the methods. Interestingly, depending on the underlying communication library and the computational cost of compression/decompression, we demonstrate that some methods may be impractical. GRACE and the entire benchmarking suite are available as open source.
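GRACE's real API lives in its open-source repository; purely to make the compress/decompress contract concrete, here is a minimal top-k sparsification compressor of the kind such a framework unifies. The class and method names are illustrative assumptions, not GRACE's actual interface.

```python
# Minimal top-k sparsification sketch (illustrative interface; see the GRACE
# repository for the framework's actual API).
import torch

class TopKCompressor:
    def __init__(self, ratio=0.01):
        self.ratio = ratio                      # fraction of gradient entries kept

    def compress(self, grad: torch.Tensor):
        flat = grad.flatten()
        k = max(1, int(flat.numel() * self.ratio))
        _, indices = torch.topk(flat.abs(), k)  # keep the k largest-magnitude entries
        return flat[indices], indices, grad.shape   # only values + indices are sent

    def decompress(self, values, indices, shape):
        flat = torch.zeros(shape, dtype=values.dtype).flatten()
        flat[indices] = values                  # scatter back into a dense tensor
        return flat.reshape(shape)

comp = TopKCompressor(ratio=0.1)
g = torch.randn(4, 4)
v, idx, shape = comp.compress(g)
print(comp.decompress(v, idx, shape))
```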
{"title":"GRACE: A Compressed Communication Framework for Distributed Machine Learning","authors":"Hang Xu, Chen-Yu Ho, A. Abdelmoniem, Aritra Dutta, E. Bergou, Konstantinos Karatsenidis, M. Canini, Panos Kalnis","doi":"10.1109/ICDCS51616.2021.00060","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00060","url":null,"abstract":"Powerful computer clusters are used nowadays to train complex deep neural networks (DNN) on large datasets. Distributed training increasingly becomes communication bound. For this reason, many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to argue about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid and low-rank). Next, we propose GRACE, a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate GRACE on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, depending on the underlying communication library and computational cost of compression / decompression, we demonstrate that some methods may be impractical. GRACE and the entire benchmarking suite are available as open-source.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123196796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Federated Model Search via Reinforcement Learning
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00084
Dixi Yao, Lingdong Wang, Jiayu Xu, Liyao Xiang, Shuo Shao, Yingqi Chen, Yanjun Tong
The Federated Learning (FL) framework enables training over distributed datasets while keeping the data local. However, it is difficult to customize a model that fits all unknown local data. A pre-determined model is likely to lead to slow convergence or low accuracy, especially when the distributed data is non-i.i.d. To resolve this issue, we propose a model search method for the federated learning scenario that automatically searches for a model structure fitting the unseen local data. We design a novel reinforcement learning-based framework that samples and distributes sub-models to the participants and updates its model selection policy by maximizing the reward. In practice, the model search algorithm takes a long time to converge, so we adaptively assign sub-models to participants according to their transmission conditions. We further propose delay-compensated synchronization to mitigate the loss caused by late updates and facilitate convergence. Extensive experiments show that our federated model search algorithm produces highly accurate models efficiently, particularly on non-i.i.d. data.
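The abstract leaves the controller unspecified beyond sampling sub-models and updating a selection policy from reward feedback; a bandit-style caricature of that loop, sampling a candidate structure from a softmax policy, observing a reward that stands in for participant-reported validation accuracy, and nudging the policy toward better structures, looks like the following. The candidate names, reward values, and REINFORCE-style update are placeholders, not the paper's controller.

```python
# Bandit-style caricature of RL-driven sub-model search (placeholder candidates
# and rewards; not the paper's controller).
import math
import random

candidates = ["small-cnn", "medium-cnn", "large-cnn"]   # assumed sub-model structures
prefs = [0.0] * len(candidates)                         # softmax preferences
lr, rng = 0.5, random.Random(0)

def sample_policy():
    z = [math.exp(p) for p in prefs]
    probs = [x / sum(z) for x in z]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs

for _ in range(200):
    i, probs = sample_policy()
    # Reward stands in for validation accuracy reported by the participants.
    reward = {"small-cnn": 0.6, "medium-cnn": 0.8, "large-cnn": 0.7}[candidates[i]]
    reward += rng.gauss(0, 0.05)
    # REINFORCE-style update on the sampled structure's preference.
    for j in range(len(prefs)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        prefs[j] += lr * reward * grad

print("learned preferences:", dict(zip(candidates, (round(p, 2) for p in prefs))))
```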
{"title":"Federated Model Search via Reinforcement Learning","authors":"Dixi Yao, Lingdong Wang, Jiayu Xu, Liyao Xiang, Shuo Shao, Yingqi Chen, Yanjun Tong","doi":"10.1109/ICDCS51616.2021.00084","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00084","url":null,"abstract":"Federated Learning (FL) framework enables training over distributed datasets while keeping the data local. However, it is difficult to customize a model fitting for all unknown local data. A pre-determined model is most likely to lead to slow convergence or low accuracy, especially when the distributed data is non-i.i.d.. To resolve the issue, we propose a model searching method in the federated learning scenario, and the method automatically searches a model structure fitting for the unseen local data. We novelly design a reinforcement learning-based framework that samples and distributes sub-models to the participants and updates its model selection policy by maximizing the reward. In practice, the model search algorithm takes a long time to converge, and hence we adaptively assign sub-models to participants according to the transmission condition. We further propose delay-compensated synchronization to mitigate loss over late updates to facilitate convergence. Extensive experiments show that our federated model search algorithm produces highly accurate models efficiently, particularly on non-i.i.d. data.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132421100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}