Gengar: An RDMA-based Distributed Hybrid Memory Pool
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00018
Zhuohui Duan, Haikun Liu, Haodi Lu, Xiaofei Liao, Hai Jin, Yu Zhang, Bingsheng He
Byte-addressable Non-volatile Memory (NVM) technologies promise higher density and lower cost than DRAM, and they have been increasingly employed for data center applications. Despite many previous studies on using NVM in a single machine, challenges remain in how best to utilize it in a distributed data center environment. This paper presents Gengar, an RDMA-enabled Distributed Shared Hybrid Memory (DSHM) pool with simple programming APIs that expose remote NVM and DRAM as a global memory space. We propose exploiting the semantics of RDMA primitives to identify frequently accessed data in the hybrid memory pool and cache it in distributed DRAM buffers. We redesign RDMA communication protocols to mitigate the bottleneck of RDMA write latency by leveraging a proxy mechanism. Gengar also supports memory sharing among multiple users with data consistency guarantees. We evaluate Gengar on a real testbed equipped with Intel Optane DC Persistent Memory DIMMs. Experimental results show that Gengar significantly improves the performance of public benchmarks such as MapReduce and YCSB, by up to 70% compared with state-of-the-art DSHM systems.
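As a rough illustration of the hot-data idea in this abstract, the sketch below counts accesses per remote region on the client side and promotes a region to a local DRAM cache once it crosses a threshold. All names here (HybridMemoryClient, HOT_THRESHOLD, the rdma_read stub) are hypothetical; Gengar's actual mechanism works inside the RDMA protocol layer and is not reproduced here.

```python
# Hypothetical sketch of hot-data identification for a DSHM client.
# Gengar derives hotness from RDMA primitive semantics; here we simply
# count accesses per remote NVM region and cache hot regions in DRAM.

HOT_THRESHOLD = 8  # assumed promotion threshold, not from the paper

class HybridMemoryClient:
    def __init__(self, rdma_read):
        self.rdma_read = rdma_read      # callable: region_id -> bytes (stub)
        self.access_count = {}          # region_id -> number of accesses
        self.dram_cache = {}            # region_id -> cached bytes

    def read(self, region_id):
        if region_id in self.dram_cache:        # fast path: local DRAM copy
            return self.dram_cache[region_id]
        data = self.rdma_read(region_id)        # slow path: remote NVM access
        n = self.access_count.get(region_id, 0) + 1
        self.access_count[region_id] = n
        if n >= HOT_THRESHOLD:                  # promote frequently read data
            self.dram_cache[region_id] = data
        return data
```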
{"title":"Gengar: An RDMA-based Distributed Hybrid Memory Pool","authors":"Zhuohui Duan, Haikun Liu, Haodi Lu, Xiaofei Liao, Hai Jin, Yu Zhang, Bingsheng He","doi":"10.1109/ICDCS51616.2021.00018","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00018","url":null,"abstract":"Byte-addressable Non-volatile Memory (NVM) technologies promise higher density and lower cost than DRAM. They have been increasingly employed for data center applications. Despite many previous studies on using NVM in a single machine, there remain challenges to best utilize it in a distributed data center environment. This paper presents Gengar, an RDMA-enabled Distributed Shared Hybrid Memory (DSHM) pool with simple programming APIs on viewing remote NVM and DRAM in a global memory space. We propose to exploit semantics of RDMA primitives to identify frequently-accessed data in the hybrid memory pool, and cache it in distributed DRAM buffers. We redesign RDMA communication protocols to reduce the bottleneck of RDMA write latency by leveraging a proxy mechanism. Gengar also supports memory sharing among multiple users with data consistency guarantee. We evaluate Gengar in a real testbed equipped with Intel Optane DC Persistent DIMMs. Experimental results show that Gengar significantly improves the performance of public benchmarks such as MapReduce and YCSB by up to 70 % compared with state-of-the-art DSHM systems.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116994118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When Delta Sync Meets Message-Locked Encryption: a Feature-based Delta Sync Scheme for Encrypted Cloud Storage
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00040
Suzhen Wu, Zhanhong Tu, Zuocheng Wang, Zhirong Shen, Bo Mao
As cloud storage becomes increasingly prevalent, more and more data are stored in the cloud, which brings two major challenges. First, modified files in the cloud should be quickly synchronized (synced) to ensure data consistency; e.g., delta sync achieves efficient cloud sync by synchronizing only the updated part of a file. Second, the huge volume of data in the cloud needs to be deduplicated and encrypted; e.g., message-locked encryption (MLE) enables deduplication of encrypted content across different users. However, when the two are combined, a few updates to the content can cause large sync traffic amplification for both keys and ciphertext in MLE-based cloud storage, which significantly degrades cloud sync efficiency. In this paper, we propose a feature-based encrypted sync scheme, FeatureSync, to improve the performance of synchronizing multiple encrypted files by merging several files before synchronizing. Performance evaluations on a lightweight prototype implementation of FeatureSync show that it reduces cloud sync time by 72.6% and cloud sync traffic by 78.5% on average, compared with state-of-the-art sync schemes.
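For readers unfamiliar with MLE, the following minimal sketch shows the core property that enables deduplication: the key is derived from the content itself, so identical plaintexts produce identical ciphertexts regardless of who encrypts them. This is plain convergent encryption (the classic instance of MLE) using the `cryptography` package, not FeatureSync's scheme; the deterministic nonce is precisely what makes the output deduplicable, and also why MLE only protects unpredictable data.

```python
# Minimal convergent-encryption sketch, illustrating why dedup works
# under MLE. This is not FeatureSync's scheme.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def mle_encrypt(content: bytes) -> bytes:
    key = hashlib.sha256(content).digest()      # key derived from the content
    nonce = hashlib.sha256(key).digest()[:12]   # deterministic nonce -> deterministic output
    return AESGCM(key).encrypt(nonce, content, None)

a = mle_encrypt(b"same chunk")
b = mle_encrypt(b"same chunk")
assert a == b   # identical plaintexts -> identical ciphertexts -> deduplicable
```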
{"title":"When Delta Sync Meets Message-Locked Encryption: a Feature-based Delta Sync Scheme for Encrypted Cloud Storage","authors":"Suzhen Wu, Zhanhong Tu, Zuocheng Wang, Zhirong Shen, Bo Mao","doi":"10.1109/ICDCS51616.2021.00040","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00040","url":null,"abstract":"As increasingly prevalent, more and more data are stored in the cloud storage, which brings us two major challenges. First, the modified files in the cloud should be quickly synchronized (sync) to ensure data consistency, e.g., delta sync achieves efficient cloud sync by synchronizing only the updated part of the file. Second, the huge data in the cloud needs to be deduplicated and encrypted, e.g., message-locked encryption (MLE) implements data deduplication by encrypting the content between different users. However, when both are combined, few updates in the content can cause large sync traffic amplification for both keys and ciphertext in the MLE-based cloud storage, which significantly degrading the cloud sync efficiency. In this paper, we propose an feature-based encryption sync scheme FeatureSync to improve the performance of synchronizing multiple encrypted files by merging several files before synchronizing. The performance evaluations on a lightweight prototype implementation of FeatureSync show that FeatureSync reduces the cloud sync time by 72.6% and the cloud sync traffic by 78.5% on average, compared with the state-of-the-art sync schemes.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126715762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harmony: A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00085
Woo-Yeon Lee, Yunseong Lee, Won Wook Song, Youngseok Yang, Jooyeon Kim, Byung-Gon Chun
We introduce Harmony, a new scheduling framework that executes multiple parameter-server ML training jobs together to improve cluster resource utilization. Harmony coordinates fine-grained execution of co-located jobs with complementary resource usage to avoid contention and to efficiently share resources between the jobs. To relieve the memory pressure caused by the increased number of simultaneous jobs, Harmony uses a data spill/reload mechanism optimized for multiple jobs with iterative execution patterns. Our evaluation shows that Harmony improves cluster resource utilization by up to 1.65×, reducing the mean ML training job time by about 53% and the makespan, the total time to process all given jobs, by about 38%, compared with traditional approaches that allocate dedicated resources to each job.
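A toy version of the complementary co-location idea: pair the most GPU-heavy pending job with the most CPU-heavy one, so the co-located pair contends less on any single resource. This greedy pairing and the job tuples are our illustration only; Harmony's real scheduler coordinates execution at a much finer granularity.

```python
# Toy greedy co-location: pair CPU-heavy jobs with GPU-heavy jobs so each
# co-located pair has complementary resource demands. Illustrative only.

def pair_complementary(jobs):
    """jobs: list of (name, cpu_demand, gpu_demand); returns co-location pairs."""
    by_bias = sorted(jobs, key=lambda j: j[1] - j[2])   # most GPU-biased first
    pairs = []
    while len(by_bias) >= 2:
        gpu_heavy = by_bias.pop(0)
        cpu_heavy = by_bias.pop(-1)
        pairs.append((gpu_heavy[0], cpu_heavy[0]))
    return pairs

print(pair_complementary([("a", 8, 1), ("b", 1, 8), ("c", 7, 2), ("d", 2, 6)]))
# -> [('b', 'a'), ('d', 'c')]
```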
{"title":"Harmony: A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs","authors":"Woo-Yeon Lee, Yunseong Lee, Won Wook Song, Youngseok Yang, Jooyeon Kim, Byung-Gon Chun","doi":"10.1109/ICDCS51616.2021.00085","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00085","url":null,"abstract":"We introduce Harmony, a new scheduling framework that executes multiple Parameter-Server ML training jobs together to improve cluster resource utilization. Harmony coordinates a fine-grained execution of co-located jobs with complementary resource usages to avoid contention and to efficiently share resources between the jobs. To resolve the memory pressure due to the increased number of simultaneous jobs, Harmony uses a data spill/reload mechanism optimized for multiple jobs with the iterative execution pattern. Our evaluation shows that Harmony improves cluster resource utilization by up to 1.65×, resulting in a reduction of the mean ML training job time by about 53%, and makespan, the total time to process all given jobs, by about 38%, compared to the traditional approaches that allocate dedicated resources to each job.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123229203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Demo: Automatically Retrainable Self Improving Model for the Automated Classification of Software Incidents into Multiple Classes
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00113
Badal Agrawal, Mohit Mishra
Developers across most organizations face the issue of manually classifying software bug reports. Bug reports often contain text and other useful information that is common to a particular type of bug. This information can be extracted using Natural Language Processing techniques and combined with the manual classification developers have performed so far, yielding a properly labelled data set for training a supervised learning model that automatically classifies bug reports into their respective categories. Previous studies have focused only on the binary classification of software incident reports as bug or non-bug. Our novel approach achieves an accuracy of 76.94% on a 10-class classification problem over the bug repository created by the Microsoft Dynamics 365 team. In addition, we propose a method for automatically retraining the model and updating it with developer feedback in case of misclassification, which significantly reduces maintenance cost and effort.
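A minimal sketch of the kind of pipeline the abstract describes: TF-IDF features over report text feeding a multi-class classifier, retrained once developer feedback accumulates. The toy data, the retraining trigger, and the helper names are assumptions; the paper's actual features and feedback loop are not reproduced here.

```python
# Minimal multi-class bug-report classifier sketch (scikit-learn assumed).
# Toy data; the paper's features and retraining policy may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = ["null pointer crash on save", "UI button misaligned",
           "timeout connecting to database", "crash when file is empty"]
labels = ["crash", "ui", "network", "crash"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reports, labels)

feedback = []  # (report_text, corrected_label) pairs from developers

def record_feedback(text, corrected_label, retrain_after=50):
    """Collect misclassification corrections; retrain once enough accumulate."""
    feedback.append((text, corrected_label))
    if len(feedback) >= retrain_after:
        texts, ys = zip(*feedback)
        model.fit(list(reports) + list(texts), list(labels) + list(ys))
        feedback.clear()

print(model.predict(["app crashes on startup"]))
```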
{"title":"Demo: Automatically Retrainable Self Improving Model for the Automated Classification of Software Incidents into Multiple Classes","authors":"Badal Agrawal, Mohit Mishra","doi":"10.1109/ICDCS51616.2021.00113","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00113","url":null,"abstract":"Developers across most of the organizations face the issue of manually dealing with the classification of the software bug reports. Software bug reports often contain text and other useful information that are common for a particular type of bug. This information can be extracted using the techniques of Natural Language Processing and combined with the manual classification done by the developers until now to create a properly labelled data set for training a supervised learning model for automatically classifying the bug reports into their respective categories. Previous studies have only focused on binary classification of software incident reports as bug and non-bug. Our novel approach achieves an accuracy of 76.94% for a 10-factor classification problem on the bug repository created by Microsoft Dynamics 365 team. In addition, we propose a novel method for automatically retraining the model and updating it with developer feedback in case of misclassification that will significantly reduce the maintenance cost and effort.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126487127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Demo: Discover, Provision, and Orchestration of Machine Learning Inference Services in Heterogeneous Edge
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00115
Roberto Morabito, M. Chiang
In recent years, the research community has started to extensively study how edge computing can enhance the provisioning of a seamless and performant Machine Learning (ML) experience. Boosting the performance of ML inference at the edge has become a driving factor, especially for enabling use cases in which proximity to the data sources, near-real-time requirements, and the need for reduced network latency are determining factors. The growing demand for edge-based ML services has also been boosted by an increasing market release of small-form-factor inference accelerator devices, which feature, however, heterogeneous and not fully interoperable software and hardware characteristics. A key aspect that has not yet been fully investigated is how to discover and efficiently optimize the provisioning of ML inference services in distributed edge systems featuring heterogeneous edge inference accelerators, not neglecting that the devices' limited computation capabilities may require orchestrating the inference execution among the different devices of the system. The main goal of this demo is to showcase how ML inference services can be agnostically discovered, provisioned, and orchestrated in a cluster of heterogeneous and distributed edge nodes.
{"title":"Demo: Discover, Provision, and Orchestration of Machine Learning Inference Services in Heterogeneous Edge","authors":"Roberto Morabito, M. Chiang","doi":"10.1109/ICDCS51616.2021.00115","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00115","url":null,"abstract":"In recent years, the research community started to extensively study how edge computing can enhance the provisioning of a seamless and performing Machine Learning (ML) experience. Boosting the performance of ML inference at the edge became a driving factor especially for enabling those use-cases in which proximity to the data sources, near real-time requirements, and need of a reduced network latency represent a determining factor. The growing demand of edge-based ML services has been also boosted by an increasing market release of small-form factor inference accelerators devices that feature, however, heterogeneous and not fully interoperable software and hardware characteristics. A key aspect that has not yet been fully investigated is how to discover and efficiently optimize the provision of ML inference services in distributed edge systems featuring heterogeneous edge inference accelerators - not neglecting also that the limited devices computation capabilities may imply the need of orchestrating the inference execution provisioning among the different system's devices. The main goal of this demo is to showcase how ML inference services can be agnostically discovered, provisioned, and orchestrated in a cluster of heterogeneous and distributed edge nodes.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130622756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evidence in Hand: Passive Vibration Response-based Continuous User Authentication
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00101
Hangcheng Cao, Hongbo Jiang, Daibo Liu, Jie Xiong
Continuous user authentication is of great importance for maintaining the security of a mobile system and protecting the user's privacy throughout a login session. In this paper, we propose HandPass, a continuous user authentication system that employs the vibration responses of concealed hand biometrics, which are passively activated by natural user-device interactions on the touchscreen. Hand vibration responses are instantly triggered and embodied in the mechanical vibration of the force-bearing body (i.e., the mobile device and the holding hand), so a built-in accelerometer can effectively capture their intrinsic features. The hand vibration response is determined by the trigger force and the complex hand structure, which is unique to each user and difficult (if not impossible) to counterfeit. HandPass is a passive, vibration response-based continuous user authentication system hosted on smartphones, with the advantages of non-intrusiveness, high efficiency, and user-friendliness. We prototyped HandPass on Android smartphones and comprehensively evaluated its performance with 43 recruited volunteers. Experimental results show that HandPass achieves 97.3% overall authentication accuracy with only a 1.8% false acceptance rate in diverse scenarios.
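To make the sensing pipeline concrete, here is a hedged sketch of the sort of features one could extract from a short accelerometer window around a touch event. The window length, sampling rate, and feature set are our assumptions for illustration, not HandPass's published design.

```python
# Sketch: simple features from an accelerometer window around a touch event.
# Window length, sampling rate, and feature choices are assumptions.
import numpy as np

def vibration_features(acc: np.ndarray, fs: float = 400.0) -> np.ndarray:
    """acc: (samples, 3) accelerometer window; returns a small feature vector."""
    mag = np.linalg.norm(acc, axis=1)            # vibration magnitude over time
    spectrum = np.abs(np.fft.rfft(mag - mag.mean()))
    freqs = np.fft.rfftfreq(len(mag), d=1.0 / fs)
    dominant = freqs[np.argmax(spectrum)]        # dominant response frequency
    return np.array([mag.mean(), mag.std(), mag.max(), dominant])

window = np.random.randn(200, 3)  # stand-in for a real 0.5 s capture at 400 Hz
print(vibration_features(window))
```

A per-user classifier trained on such vectors would then accept or reject each touch event, which is the usual shape of a continuous-authentication loop.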
{"title":"Evidence in Hand: Passive Vibration Response-based Continuous User Authentication","authors":"Hangcheng Cao, Hongbo Jiang, Daibo Liu, Jie Xiong","doi":"10.1109/ICDCS51616.2021.00101","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00101","url":null,"abstract":"Continuous user authentication is of great importance to maintain security for a mobile system and protect user's privacy throughout a login session. In this paper, we propose HandPass, a continuous user authentication system that employs the vibration responses of concealed hand biometrics, which are passively activated by the natural user-device interactions on the touchscreen. Hand vibration responses are instantly triggered and embodied in the mechanical vibration of the force-bearing body (i.e., the mobile device and the holding hand). Therefore, a built-in accelerometer can effectively capture the intrinsic features of hand vibration responses. The hand vibration response is determined by the trigger force and the complex hand structure, which is unique to each user and is difficult (if not impossible) to counterfeit. HandPass is a passive hand vibration response-based continuous user authentication system hosted on smartphones, with advantages of non-intrusiveness, high efficiency, and user-friendliness. We prototyped HandPass on Android smartphones and comprehensively evaluated its performance by recruiting 43 volunteers. Experiment results show that HandPass can achieve 97.3 % overall authentication accuracy and only 1.8 % false acceptance rate in diverse scenarios.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134633274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Behind Block Explorers: Public Blockchain Measurement and Security Implication
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00029
Hwanjo Heo, Seungwon Shin
Blockchain data has become a popular subject for studying various aspects of blockchains, including the security of their underlying mechanisms. However, main chain block data, usually available from block explorer services, is not a sufficient source for studying transaction and block dynamics, which are only visible through large-scale event measurement. In this paper, the transaction and block arrival events of two popular public blockchains, Bitcoin and Ethereum, are measured to investigate the hidden dynamics of blockchain networks. We share our key findings and their security implications, including a false universal assumption made in previous mining-related studies and an invalid-transaction propagation problem that can be exploited to launch a Denial-of-Service attack on a network.
{"title":"Behind Block Explorers: Public Blockchain Measurement and Security Implication","authors":"Hwanjo Heo, Seungwon Shin","doi":"10.1109/ICDCS51616.2021.00029","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00029","url":null,"abstract":"Blockchain data has become a popular subject in studying various aspects of blockchains including the security of underlying mechanisms. However, the main chain block data, usually available from block explorer services, does not serve as a sufficient source of transaction and block dynamics that are only visible from a large-scale event measurement. In this paper, the transaction and block arrival events of the two popular public blockchains, i.e., Bitcoin and Ethereum, are measured to investigate the hidden dynamics of blockchain networks. We share our key findings and security implications including a false universal assumption of previous mining related studies and an invalid transaction propagation problem that can be exploited to launch a Denial-of-Service attack on a network.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130696936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Infinite Balanced Allocation via Finite Capacities
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00096
P. Berenbrink, Tom Friedetzky, Christopher Hahn, L. Hintze, Dominik Kaaser, Peter Kling, Lars Nagel
We analyze the following infinite load balancing process, modeled as a classical balls-into-bins game: There are $n$ bins (servers) with a limited capacity (buffer) of size $c = c(n) \in \mathbb{N}$. Given a fixed arrival rate $\lambda = \lambda(n) \in (0,1)$, in every round $\lambda n$ new balls (requests) are generated. Together with possible leftovers from previous rounds, these balls compete to be allocated to the bins. To this end, every ball samples a bin independently and uniformly at random and tries to allocate itself to that bin. Each bin accepts as many balls as possible until its buffer is full, preferring balls of higher age. At the end of the round, every bin deletes the ball it allocated first. We study how the buffer size $c$ affects the performance of this process. For this, we analyze both the number of balls competing each round (including the leftovers from previous rounds) as well as the worst-case waiting time of individual balls. We show that (i) the number of competing balls is at any (even exponentially large) time bounded with high probability by $4 \cdot c^{-1} \cdot \ln(1/(1-\lambda)) \cdot n + \mathrm{O}(c \cdot n)$ and that (ii) the waiting time of a given ball is with high probability at most $(4 \cdot \ln(1/(1-\lambda)))/(c \cdot (1-1/e)) + \log \log n + \mathrm{O}(c)$. These results indicate a sweet spot for the choice of $c$ around $c = \Theta(\sqrt{\log(1/(1-\lambda))})$. Compared to a related process with infinite capacity [Berenbrink et al., PODC'16], for constant $\lambda$ the waiting time is reduced from $\mathrm{O}(\log n)$ to $\mathrm{O}(\log \log n)$. Even for large $\lambda \approx 1 - 1/n$ we reduce the waiting time from $\mathrm{O}(\log n)$ to $\mathrm{O}(\sqrt{\log n})$.
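The process is concrete enough to simulate directly. The sketch below implements one plausible reading of the rules (ties broken by age, FIFO deletion of the first-allocated ball per bin) and reports the peak number of competing balls; it is a sanity-check toy, not the paper's formal analysis, and the default parameters are arbitrary.

```python
# Simulation sketch of the finite-capacity balls-into-bins process.
# One plausible reading of the rules; not the paper's analysis.
import random
from collections import deque

def simulate(n=1000, c=4, lam=0.9, rounds=2000, seed=1):
    rng = random.Random(seed)
    bins = [deque() for _ in range(n)]    # allocation order per bin (FIFO)
    leftovers = []                        # ages of balls not yet allocated
    peak = 0
    for _ in range(rounds):
        competing = leftovers + [0] * int(lam * n)   # old balls plus new ones
        peak = max(peak, len(competing))
        arrivals = [[] for _ in range(n)]
        for age in competing:             # each ball samples a bin u.a.r.
            arrivals[rng.randrange(n)].append(age)
        leftovers = []
        for i, balls in enumerate(arrivals):
            balls.sort(reverse=True)      # bins prefer balls of higher age
            free = c - len(bins[i])
            bins[i].extend(balls[:free])
            leftovers += [age + 1 for age in balls[free:]]
        for b in bins:                    # each bin deletes the ball it allocated first
            if b:
                b.popleft()
    return peak

print(simulate())   # peak number of competing balls over the run
```

Sweeping `c` in this toy lets one eyeball the tradeoff the two bounds describe.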
{"title":"Infinite Balanced Allocation via Finite Capacities","authors":"P. Berenbrink, Tom Friedetzky, Christopher Hahn, L. Hintze, Dominik Kaaser, Peter Kling, Lars Nagel","doi":"10.1109/ICDCS51616.2021.00096","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00096","url":null,"abstract":"We analyze the following infinite load balancing process, modeled as a classical balls-into-bins game: There are $n$ bins (servers) with a limited capacity (buffer) of size $c=c(n)in mathbb{N}$. Given a fixed arrival rate $lambda=lambda(n)in(0,1)$, in every round $lambda n$ new balls (requests) are generated. Together with possible leftovers from previous rounds, these balls compete to be allocated to the bins. To this end, every ball samples a bin independently and uniformly at random and tries to allocate itself to that bin. Each bin accepts as many balls as possible until its buffer is full, preferring balls of higher age. At the end of the round, every bin deletes the ball it allocated first. We study how the buffer size $c$ affects the performance of this process. For this, we analyze both the number of balls competing each round (including the leftovers from previous rounds) as well as the worst-case waiting time of individual balls. We show that (i) the number of competing balls is at any (even exponentially large) time bounded with high probability by $4 cdot c^{-1} cdot ln (1/(1-lambda))cdot n + mathrm{O}(c cdot n)$ and that (ii) the waiting time of a given ball is with high probability at most $(4 cdot ln (1/(1-lambda)))/ (c cdot (1-1/e)) + log log n + mathrm{O}(c)$. These results indicate a sweet spot for the choice of $c$ around $c = Theta(sqrt{log (1/(1-lambda))})$. Compared to a related process with infinite capacity [Berenbrink et al., PODC'16], for constant $lambda$ the waiting time is reduced from $mathrm{O}(log n)$ to $mathrm{O}(log log n)$. Even for large $lambda approx 1 - 1/n$ we reduce the waiting time from $mathrm{O}(log n)$ to $mathrm{O}(sqrt{log n})$.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114839198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FreeLauncher: Lossless Failure Recovery of Parameter Servers with Ultralight Replication
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00052
Yangyang Zhang, Jianxin Li, Yiming Zhang, Lijie Wang, Ling Liu
Modern distributed machine learning (ML) systems leverage large-scale computing infrastructures to achieve fast model training. With many servers jointly training a model, failure recovery becomes an important challenge when a training task can be accomplished in minutes rather than days. The state-of-the-art checkpointing mechanism cannot meet the need for efficient recovery in large-scale ML, because its high cost prevents timely checkpointing, and a server failure will likely cause a substantial loss of intermediate results when checkpointing intervals are comparable to the entire training time. This paper proposes FreeLauncher (FLR), a lossless recovery mechanism for large-scale ML that performs ultralight replication (instead of checkpointing) to guarantee that all intermediate training results (parameters) are replicated in a timely manner. Our key insight is that in the parameter-server (PS) architecture there already exist multiple copies of each intermediate result, not only on the server but also on the workers, most of which are qualified for failure recovery. FLR addresses the challenges of parameter sparsity (e.g., when training LDA) and staleness (e.g., when adopting relaxed consistency) by selectively replicating the latest copies of sparse/stale parameters to ensure that at least $k$ up-to-date copies exist, which allows any $k-1$ failures to be handled by re-launching the failed servers with parameters recovered from workers. We implement FLR on TensorFlow. Evaluation results show that FLR achieves lossless failure recovery (requiring almost no recomputation) at little cost.
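The replication invariant is easy to state in code: for every parameter, find the newest version and, if fewer than $k$ holders have it, plan additional copies. The sketch below is a schematic of that bookkeeping with invented data structures; it is not FLR's implementation, which lives inside the PS runtime.

```python
# Schematic of the "at least k up-to-date copies" bookkeeping (invented
# data structures; FLR's actual mechanism is part of the PS runtime).

def replication_plan(copies, k):
    """copies: param_id -> list of (version, holder). Returns the extra
    replications needed so each parameter has >= k latest-version holders."""
    plan = {}
    for pid, held in copies.items():
        latest = max(version for version, _ in held)
        holders = [h for version, h in held if version == latest]
        if len(holders) < k:
            plan[pid] = {"version": latest, "from": holders,
                         "extra_copies": k - len(holders)}
    return plan

copies = {"w1": [(7, "server0"), (7, "worker3"), (6, "worker5")],
          "w2": [(4, "server1"), (4, "worker2"), (4, "worker4")]}
print(replication_plan(copies, k=3))    # only w1 needs one more copy
```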
{"title":"FreeLauncher: Lossless Failure Recovery of Parameter Servers with Ultralight Replication","authors":"Yangyang Zhang, Jianxin Li, Yiming Zhang, Lijie Wang, Ling Liu","doi":"10.1109/ICDCS51616.2021.00052","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00052","url":null,"abstract":"Modern distributed machine learning (ML) systems leverage large-scale computing infrastructures to achieve fast model training. For many servers jointly training a model, failure recovery becomes an important challenge when a training task could be accomplished in minutes rather than days. The state-of-the-art checkpointing mechanism cannot meet the need of efficient recovery for large-scale ML, because its high cost prevents timely checkpointing and a server failure will likely cause a substantial loss of intermediate results when the checkpointing intervals are comparable to the entire training times. This paper proposes FreeLauncher (FLR), a lossless recovery mechanism for large-scale ML which performs ultralight replication (instead of checkpointing) to guarantee all intermediate training results (parameters) to be timely replicated. Our key insight is that in the parameter-server (PS) architecture there already exist multiple copies for each intermediate result not only in the server but also in the workers, most of which are qualified for failure recovery. FLR addresses the challenges of parameter sparsity (e.g., when training LDA) and staleness (e.g., when adopting relaxed consistency) by selectively replicating the latest copies of the sparse/stale parameters to ensure at least k up-to-date copies to be existent, which can handle any k-1 failures by re-launching the failed servers with recovered parameters from workers. We implement FLR on Tensorflow. Evaluation results show that FLR achieves lossless failure recovery (almost requiring no recomputation) at little cost.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123243748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deterministic Contention Resolution without Collision Detection: Throughput vs Energy
Pub Date: 2021-07-01 | DOI: 10.1109/ICDCS51616.2021.00100
G. D. Marco, D. Kowalski, Grzegorz Stachowiak
This paper studies the contention resolution problem on a shared channel (also known as a multiple access channel). A set of $n$ stations is connected to a common device and communicates by transmitting and listening. Each station may have a message to broadcast. In any round, a transmission is successful if and only if exactly one station is transmitting in that round. Simultaneous transmissions interfere with one another and, as a result, the respective messages are lost. Contention resolution is the fundamental problem of scheduling the transmissions into rounds in such a way that every station successfully delivers its message on the channel. We consider a general dynamic distributed setting. We assume that stations can join (or be activated on) the channel at arbitrary times (the dynamic scenario). This contrasts with the simplified static scenario, in which all stations are assumed to be activated simultaneously. We also assume that stations are not able to detect whether a collision among simultaneous transmissions occurred (a model without collision detection). Finally, there is no global clock in the system: each station measures time using its own local clock, which starts when the station is activated and is possibly out of sync with respect to the other stations. We study non-adaptive deterministic distributed algorithms for the contention resolution problem and assess their efficiency both in terms of channel utilization (also called throughput) and energy consumption. While this topic has been examined quite extensively for randomized algorithms, this is, to the best of our knowledge, the first paper to discuss to what extent deterministic contention resolution algorithms can be efficient in terms of both channel utilization and energy consumption. Our results imply an exponential separation between the static and dynamic settings with respect to channel utilization. We also show that knowledge of the number of participating stations $k$ (or an upper bound on it) has a substantial impact on energy consumption.
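The channel model is simple to express in code: a round succeeds iff exactly one active station transmits, and each station follows its schedule relative to its own activation time. The toy below just evaluates a given non-adaptive schedule under that model; designing schedules with good throughput and low energy is of course the hard part the paper addresses. The example schedules are ours, not from the paper.

```python
# Toy evaluator for the shared-channel model: a round succeeds iff exactly
# one station transmits. Schedules are non-adaptive and read off each
# station's local clock, which starts at its activation round.

def run(schedules, activations, horizon):
    """schedules[i][t]: does station i transmit in its local round t?
    activations[i]: global round at which station i joins the channel."""
    delivered, successes = set(), []
    for t in range(horizon):
        txers = [i for i in range(len(schedules))
                 if i not in delivered
                 and 0 <= t - activations[i] < len(schedules[i])
                 and schedules[i][t - activations[i]]]
        if len(txers) == 1:               # exactly one transmitter: success
            delivered.add(txers[0])
            successes.append((t, txers[0]))
        # with >1 transmitters the messages collide silently:
        # no collision detection is available in this model
    return successes

# Two stations activated out of sync; station 1's schedule avoids station 0.
print(run([[1, 0, 1, 0], [0, 1, 0, 1]], activations=[0, 1], horizon=5))
# -> [(0, 0), (2, 1)]
```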
{"title":"Deterministic Contention Resolution without Collision Detection: Throughput vs Energy","authors":"G. D. Marco, D. Kowalski, Grzegorz Stachowiak","doi":"10.1109/ICDCS51616.2021.00100","DOIUrl":"https://doi.org/10.1109/ICDCS51616.2021.00100","url":null,"abstract":"This paper studies the Contention resolution problem on a shared channel (also known as a multiple access channel). A set of $n$ stations are connected to a common device and are able to communicate by transmitting and listening. Each station may have a message to broadcast. At any round, a transmission is successful if and only if exactly one station is transmitting at that round. Simultaneous transmissions interfere one another and, as a result, the respective messages are lost. The Contention resolution is the fundamental problem of scheduling the transmissions into rounds in such a way that any station delivers successfully its message on the channel. We consider a general dynamic distributed setting. We assume that the stations can join (or be activated on) the channel at arbitrary times (dynamic scenario). This has to be contrasted with the simplified static scenario, in which all stations are assumed to be activated simultaneously. We also assume that the stations are not able to detect whether a collision among simultaneous transmissions occurred (model without collision detection). Finally, there is no global clock in the system: each station measures the time using its own local clock which starts when the station is activated and is possibly out of sync with respect to the other stations. We study non-adaptive deterministic distributed algorithms for the contention resolution problem and assess their efficiency both in terms of channel utilization (also called throughput) and energy consumption. While this topic has been quite extensively examined for randomized algorithms, this is, to the best of our knowledge, the first paper to discuss to which extent deterministic contention resolution algorithms can be efficient in terms of both channel utilization and energy consumption. Our results imply an exponential separation gap between static and dynamic setting with respect to channel utilization. We also show that the knowledge of the number of participating stations k (or an upper bound on it) has a substantial impact on the energy consumption.","PeriodicalId":222376,"journal":{"name":"2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132462564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}