
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI): Latest Publications

Karma: Resource Allocation for Dynamic Demands
Midhul Vuppalapati, Giannis Fikioris, R. Agarwal, Asaf Cidon, Anurag Khandelwal, É. Tardos
We consider the problem of fair resource allocation in a system where user demands are dynamic, that is, where user demands vary over time. Our key observation is that the classical max-min fairness algorithm for resource allocation provides many desirable properties (e.g., Pareto efficiency, strategy-proofness, and fairness), but only under the strong assumption of user demands being static over time. For the realistic case of dynamic user demands, the max-min fairness algorithm loses one or more of these properties. We present Karma, a new resource allocation mechanism for dynamic user demands. The key technical contribution in Karma is a credit-based resource allocation algorithm: in each quantum, users donate their unused resources and are assigned credits when other users borrow these resources; Karma carefully orchestrates the exchange of credits across users (based on their instantaneous demands, donated resources and borrowed resources), and performs prioritized resource allocation based on users' credits. We theoretically establish Karma guarantees related to Pareto efficiency, strategy-proofness, and fairness for dynamic user demands. Empirical evaluations over production workloads show that these properties translate well into practice: Karma is able to reduce disparity in performance across users to a bare minimum while maintaining Pareto-optimal system-wide performance.
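The donate/borrow bookkeeping described in the abstract can be sketched in a few lines. The toy allocator below (all names and simplifications are mine, not the paper's implementation) runs one quantum: users under their fair share donate the surplus and earn credits, and users over their share borrow from the donated pool in order of credit balance.

```python
# Toy sketch of a Karma-style credit-based allocator for one quantum.
# Illustrative only; the real algorithm and its guarantees are in the paper.

def allocate_quantum(demands, credits, capacity):
    """Allocate `capacity` units among users for one quantum.

    demands: {user: demand}; credits: {user: balance} (mutated in place).
    Each user's base fair share is capacity / n. Users below their share
    donate the surplus and earn credits; users above it borrow from the
    donated pool, richest-credits-first, spending one credit per unit.
    """
    n = len(demands)
    share = capacity / n
    alloc = {u: min(d, share) for u, d in demands.items()}

    # Donors: unused share goes into a common pool, tracked as credits.
    pool = 0.0
    for u in demands:
        surplus = share - alloc[u]
        if surplus > 0:
            pool += surplus
            credits[u] += surplus

    # Borrowers: served in decreasing credit order, bounded by the pool
    # and by each borrower's credit balance.
    borrowers = sorted(
        (u for u, d in demands.items() if d > share),
        key=lambda u: credits[u], reverse=True)
    for u in borrowers:
        want = demands[u] - share
        got = min(want, pool, credits[u])
        alloc[u] += got
        credits[u] -= got
        pool -= got
    return alloc
```

For example, with capacity 8 and demands {"a": 2, "b": 8}, user a donates 2 units (earning 2 credits) and b borrows them if it has credits to spend.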
DOI: 10.48550/arXiv.2305.17222 · Published 2023-05-26 · pp. 645-662
Citations: 3
NCC: Natural Concurrency Control for Strictly Serializable Datastores by Avoiding the Timestamp-Inversion Pitfall
Haonan Lu, Shuai Mu, S. Sen, Wyatt Lloyd
Strictly serializable datastores greatly simplify the development of correct applications by providing strong consistency guarantees. However, existing techniques pay unnecessary costs for naturally consistent transactions, which arrive at servers in an order that is already strictly serializable. We find these transactions are prevalent in datacenter workloads. We exploit this natural arrival order by executing transaction requests with minimal costs while optimistically assuming they are naturally consistent, and then leverage a timestamp-based technique to efficiently verify if the execution is indeed consistent. In the process of designing such a timestamp-based technique, we identify a fundamental pitfall in relying on timestamps to provide strict serializability, and name it the timestamp-inversion pitfall. We find timestamp-inversion has affected several existing works. We present Natural Concurrency Control (NCC), a new concurrency control technique that guarantees strict serializability and ensures minimal costs -- i.e., one-round latency, lock-free, and non-blocking execution -- in the best (and common) case by leveraging natural consistency. NCC is enabled by three key components: non-blocking execution, decoupled response control, and timestamp-based consistency check. NCC avoids timestamp-inversion with a new technique: response timing control, and proposes two optimization techniques, asynchrony-aware timestamps and smart retry, to reduce false aborts. Moreover, NCC designs a specialized protocol for read-only transactions, which is the first to achieve the optimal best-case performance while ensuring strict serializability, without relying on synchronized clocks. Our evaluation shows that NCC outperforms state-of-the-art solutions by an order of magnitude on many workloads.
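The optimistic pattern the abstract describes — execute in arrival order at minimal cost, then use timestamps to verify the execution was in fact consistent — can be illustrated with a toy store. This is a sketch of the general pattern only, with names of my own choosing; it is not NCC's protocol.

```python
# Minimal illustration of optimistic execution plus timestamp-based
# consistency checking. Not the paper's protocol; illustrative only.

class OptimisticStore:
    def __init__(self):
        self.data = {}   # key -> (value, write_timestamp)
        self.log = []    # (txn_id, ts, observed_versions, writes)

    def execute(self, txn_id, ts, writes, reads):
        """Apply a transaction optimistically, in arrival order."""
        observed = {k: self.data.get(k, (None, 0)) for k in reads}
        for k, v in writes.items():
            self.data[k] = (v, ts)
        self.log.append((txn_id, ts, observed, dict(writes)))
        return {k: v for k, (v, _) in observed.items()}

    def verify(self):
        """Check the natural arrival order against the timestamps:
        no transaction may have read a version newer than its own
        timestamp, and writes to a key must carry increasing timestamps."""
        last_write = {}
        for txn_id, ts, observed, writes in self.log:
            for k, (_, wts) in observed.items():
                if wts > ts:
                    return False   # read from the "future": inconsistent
            for k in writes:
                if last_write.get(k, 0) >= ts:
                    return False   # write order disagrees with timestamps
                last_write[k] = ts
        return True
```

When the check fails, a real system would abort and retry the offending transaction rather than merely report the inconsistency.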
DOI: 10.48550/arXiv.2305.14270 · Published 2023-05-23 · pp. 305-323
Citations: 0
Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning
Chengfei Lv, Chaoyue Niu, Renjie Gu, Xiaotang Jiang, Zhaode Wang, B. Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, Jinde Song, Bingzhang Zou, Peng Lan, Guohuan Xu, Fei Wu, Shaojie Tang, Fan Wu, Guihai Chen
To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment, while facilitating daily task iteration. Specifically, the compute container is based on Mobile Neural Network (MNN), a tensor compute engine along with the data processing and model execution libraries, which are exposed through a refined Python thread-level virtual machine (VM) to support diverse ML tasks and concurrent task execution. The core of MNN is the novel mechanisms of operator decomposition and semi-auto search, sharply reducing the workload in manually optimizing hundreds of operators for tens of hardware backends and further quickly identifying the best backend with runtime optimization for a computation graph. The data pipeline introduces an on-device stream processing framework to enable processing user behavior data at source. The deployment platform releases ML tasks with an efficient push-then-pull method and supports multi-granularity deployment policies. We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability. Extensive micro-benchmarks also highlight the superior performance of MNN and the Python thread-level VM. Walle has been in large-scale production use in Alibaba, while MNN has been open source with a broad impact in the community.
DOI: 10.48550/arXiv.2205.14833 · Published 2022-05-30 · pp. 249-265
Citations: 17
Blockaid: Data Access Policy Enforcement for Web Applications
Wen Zhang, Eric Sheng, M. Chang, Aurojit Panda, Shmuel Sagiv, S. Shenker
Modern web applications serve large amounts of sensitive user data, access to which is typically governed by data-access policies. Enforcing such policies is crucial to preventing improper data access, and prior work has proposed many enforcement mechanisms. However, these prior methods either alter application semantics or require adopting a new programming model; the former can result in unexpected application behavior, while the latter cannot be used with existing web frameworks. Blockaid is an access-policy enforcement system that preserves application semantics and is compatible with existing web frameworks. It intercepts database queries from the application, attempts to verify that each query is policy-compliant, and blocks queries that are not. It verifies policy compliance using SMT solvers and generalizes and caches previous compliance decisions for better performance. We show that Blockaid supports existing web applications while requiring minimal code changes and adding only modest overheads.
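The intercept-verify-cache flow can be sketched as follows. The real system decides compliance with SMT solvers; in this hypothetical sketch a simple table-level rule check stands in for the solver so that the generalize-and-cache pattern is visible. The policy, helper names, and regex-based query handling are all mine.

```python
# Hypothetical sketch of a Blockaid-style enforce-and-cache flow.
# A toy rule check replaces the SMT solver used by the real system.
import re

POLICY = {  # role -> tables the role may read (toy policy, mine)
    "user": {"profiles", "posts"},
    "admin": {"profiles", "posts", "audit_log"},
}

_decision_cache = {}  # (role, query template) -> bool

def _template(query):
    """Generalize a query by stripping literals, so one cached
    decision covers every query of the same shape."""
    return re.sub(r"('[^']*'|\b\d+\b)", "?", query.lower())

def check_query(role, query):
    """Return True if the query is policy-compliant (block it otherwise)."""
    key = (role, _template(query))
    if key in _decision_cache:            # reuse a generalized decision
        return _decision_cache[key]
    m = re.search(r"from\s+(\w+)", query, re.IGNORECASE)
    allowed = bool(m) and m.group(1).lower() in POLICY.get(role, set())
    _decision_cache[key] = allowed
    return allowed
```

Because decisions are keyed on the literal-stripped template, `... WHERE id = 7` and `... WHERE id = 8` resolve with one verification, which is where the performance win of caching comes from.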
DOI: 10.48550/arXiv.2205.06911 · Published 2022-05-13 · pp. 701-718
Citations: 5
Zeph: Cryptographic Enforcement of End-to-End Data Privacy
Lukas Burkhalter, Nicolas Küchler, Alexander Viand, Hossein Shafagh, Anwar Hithnawi
As increasingly more sensitive data is being collected to gain valuable insights, the need to natively integrate privacy controls in data analytics frameworks is growing in importance. Today, privacy controls are enforced by data curators with full access to data in the clear. However, a plethora of recent data breaches show that even widely trusted service providers can be compromised. Additionally, there is no assurance that data processing and handling comply with the claimed privacy policies. This motivates the need for a new approach to data privacy that can provide strong assurance and control to users. This paper presents Zeph, a system that enables users to set privacy preferences on how their data can be shared and processed. Zeph enforces privacy policies cryptographically and ensures that data available to third-party applications complies with users' privacy policies. Zeph executes privacy-adhering data transformations in real-time and scales to thousands of data sources, allowing it to support large-scale low-latency data stream analytics. We introduce a hybrid cryptographic protocol for privacy-adhering transformations of encrypted data. We develop a prototype of Zeph on Apache Kafka to demonstrate that Zeph can perform large-scale privacy transformations with low overhead.
DOI: 10.3929/ETHZ-B-000494008 · Published 2021-07-08 · pp. 387-404
Citations: 19
FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs
Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Chan-Seop Lee, Mohammad Alian, Myoungjun Chun, M. Kandemir, N. Kim, Jihong Kim, Myoungsoo Jung
A modern datacenter server aims to achieve high energy efficiency by co-running multiple applications. Some such applications (e.g., web search) are latency-sensitive. Therefore, they require low-latency I/O services to respond quickly to requests from clients. However, we observe that simply replacing the storage devices of servers with Ultra-Low-Latency (ULL) SSDs does not notably reduce the latency of I/O services, especially when co-running multiple applications. In this paper, we propose FLASHSHARE to assist ULL SSDs to satisfy different levels of I/O service latency requirements for different co-running applications. Specifically, FLASHSHARE is a holistic cross-stack approach, which can significantly reduce I/O interferences among co-running applications at a server without any change in applications. At the kernel level, we extend the data structures of the storage stack to pass attributes of (co-running) applications through all the layers of the underlying storage stack, spanning from the OS kernel to the SSD firmware. For given attributes, the block layer and NVMe driver of FLASHSHARE differently manage the I/O scheduler and interrupt handler of NVMe. We also enhance the NVMe controller and cache layer at the SSD firmware level, by dynamically partitioning DRAM in the ULL SSD and adjusting its caching strategies to meet diverse user requirements. The evaluation results demonstrate that FLASHSHARE can shorten the average and 99th-percentile turnaround response times of co-running applications by 22% and 31%, respectively.
DOI: 10.5555/3291168.3291203 · Published 2018-10-08 · pp. 477-492
Citations: 61
An Analysis of Network-Partitioning Failures in Cloud Systems
Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, S. Al-Kiswany
We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: They require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms. We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures.
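The kind of fault injection the abstract describes can be illustrated on a simulated cluster. This sketch (names and logic mine, not NEAT's API) models partitions as blocked message pairs and shows the single-node isolation case the study highlights as sufficient to trigger many failures.

```python
# Toy illustration of network-partitioning fault injection on a
# simulated cluster, in the spirit of the faults NEAT can inject.

class SimCluster:
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.blocked = set()   # directed (src, dst) pairs that drop messages

    def partition(self, side_a, side_b):
        """Complete partition: no traffic crosses between the two sides."""
        for a in side_a:
            for b in side_b:
                self.blocked |= {(a, b), (b, a)}

    def isolate(self, node):
        """Isolate a single node from the rest of the cluster."""
        self.partition({node}, self.nodes - {node})

    def heal(self):
        self.blocked.clear()

    def can_send(self, src, dst):
        return (src, dst) not in self.blocked

    def reachable_from(self, node):
        return {n for n in self.nodes if n == node or self.can_send(node, n)}
```

A test harness would isolate a node, drive client requests against both sides of the partition, heal it, and then check system invariants (no lost writes, no resurrected deletes, no stuck locks).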
DOI: 10.5555/3291168.3291173 · Published 2018-10-08 · pp. 51-68
Citations: 52
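The NEAT abstract above notes that many failures can be triggered by isolating a single node and are deterministic. The fault-injection idea can be sketched as a tiny simulated network that drops messages crossing a partition; all class and method names here are hypothetical illustrations, not NEAT's actual API:

```python
class SimulatedNetwork:
    """Delivers messages between nodes unless a partition blocks them."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.groups = [set(nodes)]   # no partition: everyone in one group
        self.delivered = []
        self.dropped = []

    def partition(self, *groups):
        """Inject a network-partitioning fault: split nodes into groups."""
        self.groups = [set(g) for g in groups]

    def heal(self):
        """Remove the fault: all nodes reachable again."""
        self.groups = [set(self.nodes)]

    def send(self, src, dst, msg):
        # A message is delivered only if src and dst share a group.
        same_side = any(src in g and dst in g for g in self.groups)
        (self.delivered if same_side else self.dropped).append((src, dst, msg))
        return same_side


net = SimulatedNetwork(["n1", "n2", "n3"])
print(net.send("n1", "n2", "append(x)"))   # True: no partition yet
net.partition({"n1"}, {"n2", "n3"})        # isolate a single node
print(net.send("n1", "n3", "append(y)"))   # False: crosses the partition
```

A test harness built on this shape can enumerate single-node isolations deterministically, which is exactly the simplification the study identifies.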
Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems
R. Alagappan, Aishwarya Ganesan, Jing Liu, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
We introduce situation-aware updates and crash recovery (SAUCR), a new approach to performing replicated data updates in a distributed system. SAUCR adapts the update protocol to the current situation: with many nodes up, SAUCR buffers updates in memory; when failures arise, SAUCR flushes updates to disk. This situation-awareness enables SAUCR to achieve high performance while offering strong durability and availability guarantees. We implement a prototype of SAUCR in ZooKeeper. Through rigorous crash testing, we demonstrate that SAUCR significantly improves durability and availability compared to systems that always write only to memory. We also show that SAUCR's reliability improvements come at little or no cost: SAUCR's overheads are within 0%-9% of a purely memory-based system.
{"title":"Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems","authors":"R. Alagappan, Aishwarya Ganesan, Jing Liu, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau","doi":"10.5555/3291168.3291197","DOIUrl":"https://doi.org/10.5555/3291168.3291197","url":null,"abstract":"We introduce situation-aware updates and crash recovery (SAUCR), a new approach to performing replicated data updates in a distributed system. SAUCR adapts the update protocol to the current situation: with many nodes up, SAUCR buffers updates in memory; when failures arise, SAUCR flushes updates to disk. This situation-awareness enables SAUCR to achieve high performance while offering strong durability and availability guarantees. We implement a prototype of SAUCR in ZooKeeper. Through rigorous crash testing, we demonstrate that SAUCR significantly improves durability and availability compared to systems that always write only to memory. We also show that SAUCR's reliability improvements come at little or no cost: SAUCR's overheads are within 0%-9% of a purely memory-based system.","PeriodicalId":90294,"journal":{"name":"Proceedings of the -- USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Symposium on Operating Systems Design and Implementation","volume":"1 1","pages":"390-408"},"PeriodicalIF":0.0,"publicationDate":"2018-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90896193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
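SAUCR's mode switch — buffer in memory while enough nodes are up, flush to disk when failures arise — can be sketched as follows. The threshold used here (more than a bare majority alive) is an illustrative assumption; SAUCR's actual policy is more involved:

```python
class SaucrNode:
    """Toy situation-aware node: fast (memory) mode vs. slow (disk) mode."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.alive = cluster_size
        self.mem_buffer = []   # fast mode: updates buffered here
        self.disk = []         # slow mode: updates persisted here

    def fast_mode(self):
        # Illustrative rule: stay memory-backed only while strictly more
        # than a bare majority of nodes is alive.
        return self.alive > self.cluster_size // 2 + 1

    def update(self, entry):
        if self.fast_mode():
            self.mem_buffer.append(entry)
        else:
            self.flush()
            self.disk.append(entry)

    def on_failure(self):
        self.alive -= 1
        if not self.fast_mode():
            self.flush()   # failures arose: persist buffered updates

    def flush(self):
        self.disk.extend(self.mem_buffer)
        self.mem_buffer.clear()


node = SaucrNode(cluster_size=5)
node.update("a"); node.update("b")   # 5/5 up: buffered in memory
node.on_failure()                    # 4/5 up: still in fast mode
node.on_failure()                    # 3/5 up: flush to disk
print(node.disk)   # ['a', 'b']
```

The point of the design is that the common case (no failures) pays only memory-write cost, while durability is restored as soon as the situation degrades.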
Unobservable Communication over Fully Untrusted Infrastructure
Sebastian Angel, Srinath T. V. Setty
Keeping communication private has become increasingly important in an era of mass surveillance and state-sponsored attacks. While hiding the contents of a conversation has well-known solutions, hiding the associated metadata (participants, duration, etc.) remains a challenge, especially if one cannot trust ISPs or proxy servers. This paper describes a communication system called Pung that provably hides all content and metadata while withstanding global adversaries. Pung is a key-value store where clients deposit and retrieve messages without anyone--including Pung's servers--learning of the existence of a conversation. Pung is based on private information retrieval, which we make more practical for our setting with new techniques. These include a private multiretrieval scheme, an application of the power of two choices, and batch codes. These extensions allow Pung to handle 10³× more users than prior systems with a similar threat model.
{"title":"Unobservable Communication over Fully Untrusted Infrastructure","authors":"Sebastian Angel, Srinath T. V. Setty","doi":"10.15781/T20R9MP4D","DOIUrl":"https://doi.org/10.15781/T20R9MP4D","url":null,"abstract":"Keeping communication private has become increasingly important in an era of mass surveillance and state-sponsored attacks. While hiding the contents of a conversation has well-known solutions, hiding the associated metadata (participants, duration, etc.) remains a challenge, especially if one cannot trust ISPs or proxy servers. This paper describes a communication system called Pung that provably hides all content and metadata while withstanding global adversaries. Pung is a key-value store where clients deposit and retrieve messages without anyone-- including Pung's servers--learning of the existence of a conversation. Pung is based on private information retrieval, which we make more practical for our setting with new techniques. These include a private multiretrieval scheme, an application of the power of two choices, and batch codes. These extensions allow Pung to handle 103× more users than prior systems with a similar threat model.","PeriodicalId":90294,"journal":{"name":"Proceedings of the -- USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Symposium on Operating Systems Design and Implementation","volume":"15 1","pages":"551-569"},"PeriodicalIF":0.0,"publicationDate":"2016-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81551039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 141
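One building block the Pung abstract names is the power of two choices: hash each item to two candidate buckets and place it in the less loaded one, which keeps the maximum bucket load close to the average. A generic sketch of that idea (not Pung's actual batching code) is:

```python
import hashlib

def two_choice_place(keys, num_buckets):
    """Place each key into the less loaded of its two hash candidates."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        h = hashlib.sha256(key.encode()).digest()
        c1 = h[0] % num_buckets          # first candidate bucket
        c2 = h[1] % num_buckets          # second candidate bucket
        target = c1 if len(buckets[c1]) <= len(buckets[c2]) else c2
        buckets[target].append(key)
    return buckets

buckets = two_choice_place([f"msg{i}" for i in range(64)], 16)
print(sum(len(b) for b in buckets), len(buckets))   # 64 16
```

Compared with a single hash choice, this sharply reduces the worst-case bucket load, which is what makes batched private retrieval over the buckets cheaper.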
JetStream: Cluster-Scale Parallelization of Information Flow Queries
Andrew Quinn, David Devecsery, Peter M. Chen, J. Flinn
Dynamic information flow tracking (DIFT) is an important tool in many domains, such as security, debugging, forensics, provenance, configuration troubleshooting, and privacy tracking. However, the usability of DIFT is currently limited by its high overhead; complex information flow queries can take up to two orders of magnitude longer to execute than the original execution of the program. This precludes interactive uses in which users iteratively refine queries to narrow down bugs, leaks of private data, or performance anomalies. JetStream applies cluster computing to parallelize and accelerate information flow queries over past executions. It uses deterministic record and replay to time slice executions into distinct contiguous chunks of execution called epochs, and it tracks information flow for each epoch on a separate core in the cluster. It structures the aggregation of information flow data from each epoch as a streaming computation. Epochs are arranged in a sequential chain from the beginning to the end of program execution; relationships to program inputs (sources) are streamed forward along the chain, and relationships to program outputs (sinks) are streamed backward. JetStream is the first system to parallelize DIFT across a cluster. Our results show that JetStream queries scale to at least 128 cores over a wide range of applications. JetStream accelerates DIFT queries to run 12-48 times faster than sequential queries; in most cases, queries run faster than the original execution of the program.
{"title":"JetStream: Cluster-Scale Parallelization of Information Flow Queries","authors":"Andrew Quinn, David Devecsery, Peter M. Chen, J. Flinn","doi":"10.5555/3026877.3026912","DOIUrl":"https://doi.org/10.5555/3026877.3026912","url":null,"abstract":"Dynamic information flow tracking (DIFT) is an important tool in many domains, such as security, debugging, forensics, provenance, configuration troubleshooting, and privacy tracking. However, the usability of DIFT is currently limited by its high overhead; complex information flow queries can take up to two orders of magnitude longer to execute than the original execution of the program. This precludes interactive uses in which users iteratively refine queries to narrow down bugs, leaks of private data, or performance anomalies.JetStream applies cluster computing to parallelize and accelerate information flow queries over past executions. It uses deterministic record and replay to time slice executions into distinct contiguous chunks of execution called epochs, and it tracks information flow for each epoch on a separate core in the cluster. It structures the aggregation of information flow data from each epoch as a streaming computation. Epochs are arranged in a sequential chain from the beginning to the end of program execution; relationships to program inputs (sources) are streamed forward along the chain, and relationships to program outputs (sinks) are streamed backward. Jet-Stream is the first system to parallelize DIFT across a cluster. Our results show that JetStream queries scale to at least 128 cores over a wide range of applications. JetStream accelerates DIFT queries to run 12-48 times faster than sequential queries; in most cases, queries run faster than the original execution of the program.","PeriodicalId":90294,"journal":{"name":"Proceedings of the -- USENIX Symposium on Operating Systems Design and Implementation (OSDI). 
USENIX Symposium on Operating Systems Design and Implementation","volume":"1 1","pages":"451-466"},"PeriodicalIF":0.0,"publicationDate":"2016-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83159182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
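JetStream's forward streaming — slice the execution into epochs, summarize each epoch's flows independently, then chain source taint forward along the epoch sequence — can be sketched in miniature. The epoch representation below (a map from each written variable to the variables it derives from) is a deliberate simplification, not JetStream's actual data structure:

```python
def epoch_summary(ops):
    """Summarize one epoch: for each write (dst, srcs), record which
    epoch-entry values dst ultimately derives from."""
    flows = {}
    for dst, srcs in ops:
        derived = set()
        for s in srcs:
            # A source defined earlier in this epoch contributes its own
            # origins; otherwise it is an epoch input and stands for itself.
            derived |= flows.get(s, {s})
        flows[dst] = derived
    return flows

def stream_forward(epochs, tainted):
    """Stream the tainted-source set forward along the epoch chain."""
    for summary in epochs:
        new = {v for v, srcs in summary.items() if srcs & tainted}
        tainted = tainted | new
    return tainted

# Two epochs, each of which could be summarized on a separate core.
e1 = epoch_summary([("b", ["input"]), ("c", ["b"])])
e2 = epoch_summary([("d", ["c", "x"]), ("e", ["x"])])
print(sorted(stream_forward([e1, e2], {"input"})))   # ['b', 'c', 'd', 'input']
```

Sink relationships would be streamed in the opposite direction by the symmetric backward pass; the key property is that the expensive per-epoch summaries are independent, so they parallelize across the cluster.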