Proceedings of the 27th ACM Symposium on Operating Systems Principles: Latest Articles

Notary: a device for secure transaction approval

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359661

Anish Athalye, A. Belay, M. Kaashoek, R. Morris, N. Zeldovich

Notary is a new hardware and software architecture for running isolated approval agents in the form factor of a USB stick with a small display and buttons. Approval agents allow factoring out critical security decisions, such as getting the user's approval to sign a Bitcoin transaction or to delete a backup, to a secure environment. The key challenge addressed by Notary is to securely switch between agents on the same device. Prior systems either avoid the problem by building single-function devices like a USB U2F key, or they provide weak isolation that is susceptible to kernel bugs, side channels, or Rowhammer-like attacks. Notary achieves strong isolation using reset-based switching, along with the use of physically separate systems-on-a-chip for agent code and for the kernel, and a machine-checked proof of both the hardware's register-transfer-level design and software, showing that reset-based switching leaks no state. Notary also provides a trustworthy I/O path between the agent code and the user, which prevents an adversary from tampering with the user's screen or buttons. We built a hardware/software prototype of Notary, using a combination of ARM and RISC-V processors. The prototype demonstrates that it is feasible to verify Notary's reset-based switching, and that Notary can support diverse agents, including cryptocurrencies and a transaction approval agent for traditional client-server applications such as websites. Measurements of reset-based switching show that it is fast enough for interactive use. We analyze security bugs in existing cryptocurrency hardware wallets, which aim to provide a form factor and feature set similar to Notary's, and show that Notary's design avoids many bugs that affect them.

Citations: 20

A generic communication scheduler for distributed DNN training acceleration

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359642

Yanghua Peng, Yibo Zhu, Yangrui Chen, Y. Bao, Bairen Yi, Chang Lan, Chuan Wu, Chuanxiong Guo

We present ByteScheduler, a generic communication scheduler for distributed DNN training acceleration. ByteScheduler is based on our principled analysis that partitioning and rearranging tensor transmissions can yield optimal results in theory and good performance in practice, even with scheduling overhead. To make ByteScheduler work across various DNN training frameworks, we introduce a unified abstraction and a Dependency Proxy mechanism to enable communication scheduling without breaking the original dependencies in framework engines. We further introduce a Bayesian Optimization approach to auto-tune tensor partition size and other parameters for different training models under various networking conditions. ByteScheduler now supports TensorFlow, PyTorch, and MXNet without modifying their source code, and works well with both Parameter Server (PS) and all-reduce architectures for gradient synchronization, using either TCP or RDMA. Our experiments show that ByteScheduler accelerates training in all system configurations and DNN models we tested, by up to 196% (i.e., 2.96x the original speed).
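
The idea of partitioning and rearranging tensor transmissions can be illustrated with a toy sketch. It is not ByteScheduler's code: the function names, the fixed `partition_size`, and the priority rule (tensors from layers closer to the input are sent first) are assumptions made for illustration, and all tensors are assumed to be ready at once, whereas a real scheduler works online.

```python
import heapq
from typing import List, Tuple


def partition(size: int, partition_size: int) -> List[int]:
    """Split a tensor of `size` bytes into chunks of at most `partition_size` bytes."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    full, rest = divmod(size, partition_size)
    return [partition_size] * full + ([rest] if rest else [])


def schedule(tensors: List[Tuple[int, int]], partition_size: int) -> List[Tuple[int, int, int]]:
    """Return a transmission order as (layer, chunk_index, chunk_bytes) tuples.

    `tensors` holds (layer_index, size_bytes) pairs in the order gradients
    become available. Assumed priority rule: a lower layer index is sent first.
    """
    heap: List[Tuple[int, int, int]] = []
    for layer, size in tensors:
        for idx, chunk in enumerate(partition(size, partition_size)):
            # Tuples compare by layer first, then by chunk index.
            heapq.heappush(heap, (layer, idx, chunk))
    return [heapq.heappop(heap) for _ in range(len(heap))]


if __name__ == "__main__":
    # Gradients arrive from the last layer (2) to the first (0).
    for item in schedule([(2, 5), (1, 4), (0, 3)], partition_size=2):
        print(item)
```

Small chunks let a high-priority tensor overtake a large low-priority one that would otherwise occupy the link; choosing the chunk size is the parameter the abstract says is auto-tuned.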

Citations: 246

AutoMine

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359633

Daniel Mawhirter, Bo Wu

Graph mining algorithms that aim at identifying structural patterns of graphs are typically more complex than graph computation algorithms such as breadth-first search. Researchers have implemented several systems with high-level and flexible interfaces customized for tackling graph mining problems. However, we find that for triangle counting, one of the simplest graph mining problems, such systems can be several times slower than a single-threaded implementation of a straightforward algorithm. In this paper, we reveal the root causes of the severe inefficiencies of state-of-the-art graph mining systems and the challenges in addressing these performance problems. We build AutoMine, a single-machine system to provide both high-level interfaces and high performance for large-scale graph mining applications. The novelty of AutoMine comes from 1) a new representation of subgraph patterns and 2) compilation techniques that automatically generate efficient mining code with minimized memory consumption from a high-level abstraction. We have extensively evaluated AutoMine against 3 graph mining systems on 8 real-world graphs of different scales. Our experimental results show that AutoMine often produces several orders of magnitude better performance and can process very large graphs that existing systems cannot handle.
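
For reference, a single-threaded triangle counter of the kind the abstract uses as a baseline can be written in a few lines. This is a generic sketch, not code from AutoMine or from the systems it compares against; the function name and the edge-list input format are assumptions.

```python
from typing import Dict, Iterable, Set, Tuple


def count_triangles(edges: Iterable[Tuple[int, int]]) -> int:
    """Count triangles in an undirected graph given as an edge list."""
    # For each vertex keep only its higher-numbered neighbors, so that every
    # triangle a < b < c is discovered exactly once, from the pair (a, b).
    higher: Dict[int, Set[int]] = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        higher.setdefault(a, set()).add(b)

    total = 0
    for a, nbrs in higher.items():
        for b in nbrs:
            # Common higher neighbors of a and b close a triangle.
            total += len(nbrs & higher.get(b, set()))
    return total


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(k4))
```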

Citations: 65

Honeycrisp: large-scale differentially private aggregation without a trusted core

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359660

Edo Roth, D. Noble, B. Falk, Andreas Haeberlen

Recently, a number of systems have been deployed that gather sensitive statistics from user devices while giving differential privacy guarantees. One prominent example is the component in Apple's macOS and iOS devices that collects information about emoji usage and new words. However, these systems have been criticized for making unrealistic assumptions, e.g., by creating a very high "privacy budget" for answering queries, and by replenishing this budget every day, which results in a high worst-case privacy loss. It is not obvious, though, whether such assumptions can be avoided if one requires a strong threat model and wishes to collect data periodically, instead of just once. In this paper, we show that, essentially, it is possible to have one's cake and eat it too. We describe a system called Honeycrisp whose privacy cost depends on how often the data changes, and not on how often a query is asked. Thus, if the data is relatively stable (as is likely the case, e.g., with emoji and word usage), Honeycrisp can answer periodic queries for many years, as long as the underlying data does not change too often. Honeycrisp accomplishes this by using a) the sparse-vector technique, and b) a combination of cryptographic techniques to enable global differential privacy without a trusted party. Using a prototype implementation, we show that Honeycrisp is efficient and can scale to large deployments.
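
The sparse-vector technique mentioned in the abstract can be sketched in its textbook centralized form, often called AboveThreshold. The sketch below is not Honeycrisp's implementation, which is distributed and cryptographic; the function names and the noise scales (2·sensitivity/ε for the threshold, 4·sensitivity/ε per query) follow the standard formulation and are assumptions with respect to this paper.

```python
import random
from typing import Iterable, Optional


def laplace(rng: random.Random, scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def above_threshold(
    answers: Iterable[float],
    threshold: float,
    epsilon: float,
    rng: random.Random,
    sensitivity: float = 1.0,
) -> Optional[int]:
    """Return the index of the first query whose noisy answer reaches the
    noisy threshold, or None if no query does.

    Privacy budget is spent once for the whole run, not once per query that
    stays below the threshold.
    """
    noisy_threshold = threshold + laplace(rng, 2.0 * sensitivity / epsilon)
    for i, answer in enumerate(answers):
        if answer + laplace(rng, 4.0 * sensitivity / epsilon) >= noisy_threshold:
            return i  # halt at the first above-threshold query
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "how much has the data changed?" scores for periodic queries.
    print(above_threshold([0.0, 1.0, 10.0, 12.0], threshold=5.0, epsilon=1.0, rng=rng))
```

This matches the abstract's framing: many below-threshold checks are cheap, and cost is incurred mainly when the data has changed enough to cross the threshold.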

Citations: 52

Taiji

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359655

David Chou, T. Xu, K. Veeraraghavan, Andrew J. Newell, Sonia Margulis, Lin Xiao, Pol Mauri Ruiz, Justin Meza, Kiryong Ha, Shruti Padmanabha, Kevin Cole, D. Perelman

We present Taiji, a new system for managing user traffic for large-scale Internet services that accomplishes two goals: 1) balancing the utilization of data centers and 2) minimizing network latency of user requests. Taiji models edge-to-datacenter traffic routing as an assignment problem: assigning traffic objects at the edge to the data centers to satisfy service-level objectives. Taiji uses a constraint optimization solver to generate an optimal routing table that specifies the fractions of traffic each edge node will distribute to different data centers. Taiji continuously adjusts the routing table to accommodate the dynamics of user traffic and failure events that reduce capacity. Taiji leverages connections among users to selectively route traffic of highly connected users to the same data centers based on fractions in the routing table. This routing strategy, which we term connection-aware routing, allows us to reduce query load on our backend storage by 17%. Taiji has been used in production at Facebook for more than four years and routes global traffic in a user-aware manner for several large-scale product services across dozens of edge nodes and data centers.

Citations: 15

Efficient scalable thread-safety-violation detection: finding thousands of concurrency bugs during testing

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359638

Guangpu Li, Shan Lu, M. Musuvathi, Suman Nath, Rohan Padhye

Concurrency bugs are hard to find, reproduce, and debug. They often escape rigorous in-house testing, but result in large-scale outages in production. Existing concurrency-bug detection techniques unfortunately cannot be part of industry's integrated build and test environment due to some open challenges: how to handle code developed by thousands of engineering teams that uses a wide variety of synchronization mechanisms, how to report few or no false positives, and how to avoid excessive testing resource consumption. This paper presents TSVD, a thread-safety violation detector that addresses these challenges through a new design point in the domain of active testing. Unlike previous techniques that inject delays randomly or employ expensive synchronization analysis, TSVD uses lightweight monitoring of the calling behaviors of thread-unsafe methods, not any synchronization operations, to dynamically identify bug suspects. It then injects corresponding delays to drive the program towards thread-unsafe behaviors, actively learns from its ability or inability to do so, and persists its learning from one test run to the next. TSVD is deployed and regularly used at Microsoft, and it has already found over 1000 thread-safety violations across thousands of projects. It detects more bugs than state-of-the-art techniques, mostly with just one test run.

Citations: 47

PipeDream: generalized pipeline parallelism for DNN training

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359646

D. Narayanan, A. Harlap, Amar Phanishayee, V. Seshadri, Nikhil R. Devanur, G. Ganger, Phillip B. Gibbons, M. Zaharia

DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible. Unlike traditional pipelining, DNN training is bi-directional, where a forward pass through the computation graph is followed by a backward pass that uses state and intermediate data computed during the forward pass. Naïve pipelining can thus result in mismatches in state versions used in the forward and backward passes, or excessive pipeline flushes and lower hardware efficiency. To address these challenges, PipeDream versions model parameters for numerically correct gradient computations, and schedules forward and backward passes of different minibatches concurrently on different workers with minimal pipeline stalls. PipeDream also automatically partitions DNN layers among workers to balance work and minimize communication. Extensive experimentation with a range of DNN tasks, models, and hardware configurations shows that PipeDream trains models to high accuracy up to 5.3X faster than commonly used intra-batch parallelism techniques.
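
The mismatch between forward and backward passes, and one way parameter versioning can avoid it, can be shown with a toy single-stage model. This is a sketch under stated assumptions, not PipeDream's code: the class `VersionedStage`, the scalar "layer" `out = weight * x`, and the plain SGD update are all hypothetical simplifications.

```python
from typing import Dict, Tuple


class VersionedStage:
    """A one-parameter pipeline stage computing out = weight * x."""

    def __init__(self, weight: float, lr: float) -> None:
        self.weight = weight
        self.lr = lr
        # minibatch id -> (weight used in the forward pass, input seen)
        self._stash: Dict[int, Tuple[float, float]] = {}

    def forward(self, mb_id: int, x: float) -> float:
        # Remember which weight version this minibatch saw.
        self._stash[mb_id] = (self.weight, x)
        return self.weight * x

    def backward(self, mb_id: int, grad_out: float) -> float:
        # Use the stashed version, even if self.weight has been updated
        # by other minibatches since the forward pass.
        w_used, x = self._stash.pop(mb_id)
        grad_w = grad_out * x        # d(out)/d(weight) = x
        grad_in = grad_out * w_used  # d(out)/d(x) = weight at forward time
        self.weight -= self.lr * grad_w
        return grad_in


if __name__ == "__main__":
    stage = VersionedStage(weight=2.0, lr=0.1)
    stage.forward(0, 1.0)
    stage.forward(1, 3.0)            # second minibatch enters before the first finishes
    print(stage.backward(0, 1.0))    # updates the live weight
    print(stage.backward(1, 1.0))    # still differentiates against the weight it saw
```

Without the stash, the second backward pass would combine activations computed with one weight value and gradients computed with another, which is the version mismatch the abstract describes.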

Citations: 520

KnightKing

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359634

Ke Yang, Mingxing Zhang, Kang Chen, Xiaosong Ma, Yang Bai, Yong Jiang

Random walk on graphs has recently gained immense popularity as a tool for graph data analytics and machine learning. Currently, random walk algorithms are developed as individual implementations and suffer from significant performance and scalability problems, especially with the dynamic nature of sophisticated walk strategies. We present KnightKing, the first general-purpose, distributed graph random walk engine. To address the unique interaction between a static graph and many dynamic walkers, it adopts an intuitive walker-centric computation model. The corresponding programming model allows users to easily specify existing or new random walk algorithms, facilitated by a new unified edge transition probability definition that applies across popular known algorithms. With KnightKing, these diverse algorithms benefit from its common distributed random walk execution engine, centered around an innovative rejection-based sampling mechanism that dramatically reduces the cost of higher-order random walk algorithms. Our evaluation confirms that KnightKing brings up to 4 orders of magnitude improvement in executing algorithms that are currently affordable on large graphs only with approximate solutions.
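
Rejection-based sampling of the next edge can be sketched generically as follows. This is not KnightKing's engine: the function name, the uniform proposal over outgoing edges, and the caller-supplied `upper_bound` on the walker-dependent weight are assumptions made for illustration.

```python
import random
from typing import Callable, Sequence, TypeVar

E = TypeVar("E")


def sample_edge(
    edges: Sequence[E],
    dynamic_weight: Callable[[E], float],
    upper_bound: float,
    rng: random.Random,
) -> E:
    """Pick an edge with probability proportional to dynamic_weight(edge).

    Requires 0 <= dynamic_weight(e) <= upper_bound for every edge and a
    positive weight for at least one edge (otherwise the loop never ends).
    """
    if not edges:
        raise ValueError("no outgoing edges")
    while True:
        # Propose a candidate uniformly, then accept it with probability
        # dynamic_weight(candidate) / upper_bound.
        candidate = edges[rng.randrange(len(edges))]
        if rng.random() * upper_bound < dynamic_weight(candidate):
            return candidate


if __name__ == "__main__":
    rng = random.Random(42)
    weights = {"a": 1.0, "b": 0.0, "c": 0.5}
    picks = [sample_edge(list(weights), weights.__getitem__, 1.0, rng) for _ in range(10)]
    print(picks)
```

The point of the technique is that the weight function is evaluated only for proposed candidates, rather than for every outgoing edge at every step, which matters when the weight depends on the walker's history.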

Citations: 53

Aegean: replication beyond the client-server model

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359663

Remzi Can Aksoy, Manos Kapritsos

This paper presents Aegean, a new approach that allows fault-tolerant replication to be implemented beyond the confines of the client-server model. In today's computing, where services are rarely standalone, traditional replication protocols such as Primary-Backup, Paxos, and PBFT are not directly applicable, as they were designed for the client-server model. When services interact, these protocols run into a number of problems, affecting both correctness and performance. In this paper, we rethink the design of replication protocols in the presence of interactions between services and introduce new techniques that accommodate such interactions safely and efficiently. Our evaluation shows that a prototype implementation of Aegean not only ensures correctness in the presence of service interactions, but can further improve throughput by an order of magnitude.

Citations: 12

TASO

Proceedings of the 27th ACM Symposium on Operating Systems Principles

Pub Date : 2019-10-27 DOI: 10.1145/3341301.3359630

Zhihao Jia, O. Padon, James P. Thomas, Todd Warszawski, M. Zaharia, A. Aiken

Existing deep neural network (DNN) frameworks optimize the computation graph of a DNN by applying graph transformations manually designed by human experts. This approach misses possible graph optimizations and is difficult to scale, as new DNN operators are introduced on a regular basis. We propose TASO, the first DNN computation graph optimizer that automatically generates graph substitutions. TASO takes as input a list of operator specifications and generates candidate substitutions using the given operators as basic building blocks. All generated substitutions are formally verified against the operator specifications using an automated theorem prover. To optimize a given DNN computation graph, TASO performs a cost-based backtracking search, applying the substitutions to find an optimized graph, which can be directly used by existing DNN frameworks. Our evaluation on five real-world DNN architectures shows that TASO outperforms existing DNN frameworks by up to 2.8X, while requiring significantly less human effort. For example, TensorFlow currently contains approximately 53,000 lines of manual optimization rules, while the operator specifications needed by TASO are only 1,400 lines of code.
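A cost-based backtracking search of the kind the abstract describes might look roughly like the sketch below. It is a toy under strong assumptions: a computation graph is flattened to a tuple of operator names, a substitution is a (pattern, replacement) pair of tuples, cost is a sum of made-up per-operator costs, and the relaxation factor `alpha` and the `budget` limit are parameters of this sketch rather than anything taken from the abstract. Formal verification of substitutions is omitted entirely. What it shows is why backtracking matters: a rewrite that temporarily raises the cost can enable a later rewrite that lowers it.

```python
import heapq


def rewrite_all(graph, pattern, replacement):
    """Yield every graph obtained by replacing one occurrence of `pattern`."""
    n = len(pattern)
    for i in range(len(graph) - n + 1):
        if graph[i:i + n] == pattern:
            yield graph[:i] + replacement + graph[i + n:]


def cost(graph, op_cost):
    """Toy cost model: sum of per-operator costs."""
    return sum(op_cost[op] for op in graph)


def backtracking_search(graph, rules, op_cost, alpha=1.05, budget=1000):
    """Explore rewritten graphs cheapest-first and return the best one found."""
    best, best_cost = graph, cost(graph, op_cost)
    seen = {graph}
    queue = [(best_cost, graph)]
    while queue and budget > 0:
        budget -= 1
        _, current = heapq.heappop(queue)
        for pattern, replacement in rules:
            for cand in rewrite_all(current, pattern, replacement):
                if cand in seen:
                    continue
                seen.add(cand)
                c = cost(cand, op_cost)
                # Keep candidates that are at most alpha times the best cost so far,
                # so slightly worse intermediate graphs can still be expanded.
                if c < alpha * best_cost:
                    heapq.heappush(queue, (c, cand))
                if c < best_cost:
                    best, best_cost = cand, c
    return best, best_cost


# Hypothetical operators and costs, chosen so the first rewrite makes things worse.
OP_COST = {"a": 2.0, "b": 2.0, "c": 3.0, "d": 1.5, "e": 1.0}
RULES = [(("a", "b"), ("c", "d")), (("c", "d"), ("e",))]
```

With `alpha=1.0` the search is greedy and stays at the original graph `("a", "b")`, because the only applicable rewrite raises the cost from 4.0 to 4.5. With `alpha=1.2` the intermediate graph `("c", "d")` is kept in the queue, and the second rule then reduces it to `("e",)` at cost 1.0.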

Citations: 195