Operating Systems Review (ACM): Latest Articles


Using Local Cache Coherence for Disaggregated Memory Systems

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2023-06-26 DOI: 10.1145/3606557.3606561

I. Calciu, M. Imran, Ivan Puddu, Sanidhya Kashyap, H. Maruf, O. Mutlu, Aasheesh Kolli

Disaggregated memory provides many cost savings and resource provisioning benefits for current datacenters, but software systems enabling disaggregated memory access result in high performance penalties. These systems require intrusive code changes to port applications for disaggregated memory or employ slow virtual memory mechanisms to avoid code changes. Such mechanisms result in high overhead page faults to access remote data and high dirty data amplification when tracking changes to cached data at page-granularity. In this paper, we propose a fundamentally new approach for disaggregated memory systems, based on the observation that we can use local cache coherence to track applications' memory accesses transparently, without code changes, at cache-line granularity. This simple idea (1) eliminates page faults from the application critical path when accessing remote data, and (2) decouples the application memory access tracking from the virtual memory page size, enabling cache-line granularity dirty data tracking and eviction. Using this observation, we implemented a new software runtime for disaggregated memory that improves average memory access time and reduces dirty data amplification.
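To make the dirty data amplification argument concrete, the following is a minimal arithmetic sketch and not the authors' runtime. It assumes 4 KiB pages and 64-byte cache lines (common values, not stated in the abstract) and a hypothetical workload; the function names are illustrative only. Amplification is taken here to mean bytes written back divided by bytes the application actually modified.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (not from the abstract)
LINE_SIZE = 64    # assumed cache-line size in bytes (not from the abstract)


def bytes_written_back(dirty_addresses, granularity):
    """Bytes evicted when dirty data is tracked in blocks of `granularity` bytes."""
    dirty_blocks = {addr // granularity for addr in dirty_addresses}
    return len(dirty_blocks) * granularity


def amplification(dirty_addresses, granularity):
    """Ratio of bytes written back to bytes the application actually modified."""
    modified = len(set(dirty_addresses))  # each address stands for one modified byte
    return bytes_written_back(dirty_addresses, granularity) / modified


# Hypothetical workload: one 8-byte store at the start of each of 10 pages.
writes = [page * PAGE_SIZE + offset for page in range(10) for offset in range(8)]

page_amp = amplification(writes, PAGE_SIZE)  # whole pages are written back
line_amp = amplification(writes, LINE_SIZE)  # only the touched lines are written back
print(page_amp, line_amp)
```

Under these assumptions the same 80 modified bytes cost 40,960 bytes of write-back at page granularity and 640 bytes at cache-line granularity, which is the gap the abstract attributes to decoupling tracking from the page size.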


Volume 57, Issue 1, pages 21-28

Citations: 0

Make It Real: An End-to-End Implementation of A Physically Disaggregated Data Center

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2023-06-26 DOI: 10.1145/3606557.3606559

Yiying Zhang

Resource disaggregation is an approach to separate different hardware resources into independent pools in a data center, so that these pools can be easily managed and their resources can be allocated in a tight but unbounded way. The past decade has seen research and practices in realizing the resource-disaggregation idea on regular servers. We advocate for a physically disaggregated data center, where disaggregated resource pools consist of hardware devices, not servers. Physical disaggregation could unlock another level of benefits in resource disaggregation, including further improved cost saving, easier maintenance and scaling, and more customization. This paper presents our efforts in building an end-to-end physically disaggregated data center, including the design and implementation of disaggregated hardware devices, networking systems for connecting these devices, operating systems for orchestrating them, and porting of traditional and cloud-computing applications to this physically disaggregated platform.


Citations: 0

Memory disaggregation: why now and what are the challenges

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2023-06-26 DOI: 10.1145/3606557.3606563

M. Aguilera, Emmanuel Amaro, Nadav Amit, Erika Hunhoff, Anil Yelam, Gerd Zellweger

Hardware disaggregation has emerged as one of the most fundamental shifts in how we build computer systems over the past decades. While disaggregation has been successful for several types of resources (storage, power, and others), memory disaggregation has yet to happen. We make the case that the time for memory disaggregation has arrived. We look at past successful disaggregation stories and learn that their success depended on two requirements: addressing a burning issue and being technically feasible. We examine memory disaggregation through this lens and find that both requirements are finally met. Once available, memory disaggregation will require software support to be used effectively. We discuss some of the challenges of designing an operating system that can utilize disaggregated memory for itself and its applications.


Citations: 1

Navigating Performance-Efficiency Tradeoffs in Serverless Computing: Deduplication to the Rescue!

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2023-06-26 DOI: 10.1145/3606557.3606564

Divyanshu Saxena, T. Ji, Arjun Singhvi, Junaid Khalid, Aditya Akella

Navigating the performance and efficiency trade-offs is critical for serverless platforms, where the providers ideally want to give the illusion of warm function startups while maintaining low resource costs. Limited controls, provided via toggling sandboxes between warm and cold states and keepalives, force operators to sacrifice significant resources to achieve good performance.


Citations: 0

Disaggregated GPU Acceleration for Serverless Applications

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2023-06-26 DOI: 10.1145/3606557.3606560

Henrique Fingler, Zhiting Zhu, Esther Yoon, Zhipeng Jia, E. Witchel, C. Rossbach

Serverless platforms have been attracting applications from traditional platforms because infrastructure management responsibilities are shifted from users to providers. Many applications well-suited to serverless environments could leverage GPU acceleration to enhance their performance. Unfortunately, current serverless platforms do not expose GPUs to serverless applications.


Citations: 0

Memory Disaggregation: Advances and Open Challenges

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2023-05-06 DOI: 10.1145/3606557.3606562

H. Maruf, Mosharaf Chowdhury

Compute and memory are tightly coupled within each server in traditional datacenters. Large-scale datacenter operators have identified this coupling as a root cause behind fleetwide resource underutilization and increasing Total Cost of Ownership (TCO). With the advent of ultra-fast networks and cache-coherent interfaces, memory disaggregation has emerged as a potential solution, whereby applications can leverage available memory even outside server boundaries. This paper summarizes the growing research landscape of memory disaggregation from a software perspective and introduces the challenges toward making it practical under current and future hardware trends. We also reflect on our seven-year journey in the SymbioticLab to build a comprehensive disaggregated memory system over ultra-fast networks. We conclude with some open challenges toward building next-generation memory disaggregation systems leveraging emerging cache-coherent interconnects.


Citations: 0

Positional Paper

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2022-06-14 DOI: 10.1145/3544497.3544500

Y. Shkuro, B. Renard, Ashutosh Kumar Singh

Application telemetry refers to measurements taken from software systems to assess their performance, availability, correctness, efficiency, and other aspects useful to operators, as well as to troubleshoot them when they behave abnormally. Many modern observability platforms support dimensional models of telemetry signals where the measurements are accompanied by additional dimensions used to identify either the resources described by the telemetry or the business-specific attributes of the activities (e.g., a customer identifier). However, most of these platforms lack any semantic understanding of the data, by not capturing any metadata about telemetry, from simple aspects such as units of measure or data types (treating all dimensions as strings) to more complex concepts such as purpose policies. This limits the ability of the platforms to provide a rich user experience, especially when dealing with different telemetry assets, for example, linking an anomaly in a time series with the corresponding subset of logs or traces, which requires semantic understanding of the dimensions in the respective data sets.
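As an illustration of what capturing metadata about dimensions could enable, here is a minimal sketch and not the authors' design. The `DimensionSchema` type, the `semantic` identifiers, and the `correlate` helper are all hypothetical names introduced for this example. It links an anomaly in a time series to log records by matching dimensions that share a declared meaning, even when the key names and raw value types differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionSchema:
    name: str        # key as it appears in the raw signal
    data_type: type  # declared type, instead of treating every value as a string
    semantic: str    # hypothetical semantic identifier, e.g. "customer.id"


def correlate(anomaly, logs, metric_schema, log_schema):
    """Return log records whose semantically shared dimensions match the anomaly."""
    metric_dims = {d.semantic: d for d in metric_schema}
    log_dims = {d.semantic: d for d in log_schema}
    common = metric_dims.keys() & log_dims.keys()
    if not common:
        return []  # nothing links the two signals, so no correlation is claimed
    matches = []
    for record in logs:
        if all(
            metric_dims[s].data_type(anomaly[metric_dims[s].name])
            == log_dims[s].data_type(record[log_dims[s].name])
            for s in common
        ):
            matches.append(record)
    return matches


# Hypothetical signals: the metric stores the customer as a string under
# "customer_id", while the logs store it as an integer under "cust".
metric_schema = [DimensionSchema("customer_id", int, "customer.id")]
log_schema = [DimensionSchema("cust", int, "customer.id")]
anomaly = {"customer_id": "42"}
logs = [{"cust": 42, "msg": "timeout"}, {"cust": 7, "msg": "ok"}]

linked = correlate(anomaly, logs, metric_schema, log_schema)
print(linked)
```

Without the declared type and semantic identifier, a plain string comparison of `"42"` under `customer_id` against `42` under `cust` would find no link between the two data sets.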


Citations: 0

Data-Aware Compression for HPC using Machine Learning

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2022-06-14 DOI: 10.1145/3544497.3544508

Julius Plehn, A. Fuchs, Michael Kuhn, Jakob Lüttgau, T. Ludwig

While compression can provide significant storage and cost savings, its use within HPC applications is often only of secondary concern. This is in part due to the inflexibility of existing approaches where a single compression algorithm has to be used throughout the whole application but also because insights into the behaviour of the algorithms within the context of individual applications are missing.


Citations: 0

Analysis and Workload Characterization of the CERN EOS Storage System

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2022-06-14 DOI: 10.1145/3544497.3544507

Devashish R. Purandare, Daniel Bittman, E. L. Miller

Modern, large-scale scientific computing runs on complex exascale storage systems that support even more complex data workloads. Understanding the data access and movement patterns is vital for informing the design of future iterations of existing systems and next-generation systems. Yet we are lacking in publicly available traces and tools to help us understand even one system in depth, let alone correlate long-term cross-system trends.


Citations: 0

An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2022-06-14 DOI: 10.1145/3544497.3544499

Yichen Li, Xu Zhang, Shilin He, Zhuangbin Chen, Yu Kang, Jinyang Liu, Liqun Li, Yingnong Dang, Feng Gao, Zhangwei Xu, S. Rajmohan, Qingwei Lin, Dongmei Zhang, Michael R. Lyu

Cloud incidents (service interruptions or performance degradation) dramatically degrade the reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss. With years of efforts, cloud providers are able to solve most incidents automatically and rapidly. The secret of this ability is intelligent incident detection. Only when incidents are detected timely, accurately, and comprehensively, can they be diagnosed and mitigated at a satisfiable speed. To overcome the limitations of traditional rule-based detection, we carried out years of incident detection research. We developed a comprehensive AIOps (Artificial Intelligence for IT Operations) framework for incident detection containing a set of data-driven methods. This paper shares our recent experience of developing and deploying such an intelligent incident detection system at Microsoft. We first discuss the real-world challenges of incident detection that constitute the pain points of engineers. Then, we summarize our intelligent solutions proposed in recent years to tackle these challenges. Finally, we show the deployment of the incident detection AIOps framework and demonstrate its practical benefits conveyed to Microsoft cloud services with real cases.


Volume 56, Issue 1, pages 1-7

Citations: 5
