Persistent memory allocation is a fundamental building block for developing high-performance and in-memory applications. Existing persistent memory allocators suffer from many performance issues. First, they may introduce repeated cache line flushes and small random accesses in persistent memory for their poor heap metadata management. Second, they use static slab segregation resulting in a dramatic increase in memory consumption when allocation request size is changed. Third, they are not aware of NUMA effect, leading to remote persistent memory accesses in memory allocation and deallocation processes. In this paper, we design a novel allocator, named PMAlloc, to solve the above issues simultaneously. (1) PMAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. (2) It writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. (3) Instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. (4) It uses a local-first allocation policy to avoid allocating remote memory blocks. And it supports a two-phase deallocation mechanism including recording and synchronization to minimize the number of remote memory access in the deallocation. PMAlloc is complementary to the existing consistency models. Results on 6 benchmarks demonstrate that PMAlloc improves the performance of state-of-the-art persistent memory allocators by up to 6.4x and 57x for small and large allocations, respectively. PMAlloc with NUMA optimizations brings a 2.9x speedup in multi-socket evaluation and is up to 36x faster than other persistent memory allocators. Using PMAlloc reduces memory usage by up to 57.8%. Besides, we integrate PMAlloc in a persistent FPTree. Compared to the state-of-the-art allocators, PMAlloc improves the performance of this application by up to 3.1x.
{"title":"PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation","authors":"Zheng Dang, Shuibing He, Xuechen Zhang, Peiyi Hong, Zhenxin Li, Xinyu Chen, Haozhe Song, Xian-He Sun, Gang Chen","doi":"10.1145/3643886","DOIUrl":"https://doi.org/10.1145/3643886","url":null,"abstract":"<p>Persistent memory allocation is a fundamental building block for developing high-performance and in-memory applications. Existing persistent memory allocators suffer from many performance issues. First, they may introduce repeated cache line flushes and small random accesses in persistent memory for their poor heap metadata management. Second, they use static slab segregation resulting in a dramatic increase in memory consumption when allocation request size is changed. Third, they are not aware of NUMA effect, leading to remote persistent memory accesses in memory allocation and deallocation processes. In this paper, we design a novel allocator, named PMAlloc, to solve the above issues simultaneously. (1) PMAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. (2) It writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. (3) Instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. (4) It uses a local-first allocation policy to avoid allocating remote memory blocks. And it supports a two-phase deallocation mechanism including recording and synchronization to minimize the number of remote memory access in the deallocation. PMAlloc is complementary to the existing consistency models. Results on 6 benchmarks demonstrate that PMAlloc improves the performance of state-of-the-art persistent memory allocators by up to 6.4x and 57x for small and large allocations, respectively. PMAlloc with NUMA optimizations brings a 2.9x speedup in multi-socket evaluation and is up to 36x faster than other persistent memory allocators. Using PMAlloc reduces memory usage by up to 57.8%. Besides, we integrate PMAlloc in a persistent FPTree. Compared to the state-of-the-art allocators, PMAlloc improves the performance of this application by up to 3.1x.</p>","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139678084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile emulation, which creates full-fledged software mobile devices on a physical PC/server, is pivotal to the mobile ecosystem. Unfortunately, existing mobile emulators perform poorly on graphics-intensive apps in terms of efficiency and compatibility. To address this, we introduce graphics projection, a novel graphics virtualization mechanism that adds a small-size projection space inside the guest memory, which processes graphics operations involving control contexts and resource handles without host interactions. While enhancing performance, the decoupled and asynchronous guest/host control flows introduced by graphics projection can significantly complicate emulators’ reliability issue diagnosis when faced with a variety of uncommon or non-standard app behaviors in the wild, hindering practical deployment in production. To overcome this drawback, we develop an automatic reliability issue analysis pipeline that distills the critical code paths across the guest and host control flows by runtime quarantine and state introspection. The resulting new Android emulator, dubbed Trinity, exhibits an average of 97% native hardware performance and 99.3% reliable app support, in some cases outperforming other emulators by more than an order of magnitude. It has been deployed in Huawei DevEco Studio, a major Android IDE with millions of developers.
移动模拟可在物理 PC/服务器上创建完整的软件移动设备,对移动生态系统至关重要。遗憾的是,现有的移动模拟器在图形密集型应用程序方面的效率和兼容性都很差。为了解决这个问题,我们引入了图形投影,这是一种新颖的图形虚拟化机制,它在客户内存中添加了一个小尺寸的投影空间,无需主机交互即可处理涉及控制上下文和资源句柄的图形操作。在提高性能的同时,图形投影引入的客体/主机解耦和异步控制流在面对各种不常见或非标准的应用程序行为时,会使仿真器的可靠性问题诊断变得非常复杂,从而阻碍了生产中的实际部署。为了克服这一弊端,我们开发了一个自动可靠性问题分析管道,通过运行时隔离和状态自省,提炼出客机和主机控制流中的关键代码路径。由此产生的新安卓模拟器被称为 Trinity,平均具有 97% 的原生硬件性能和 99.3% 的可靠应用程序支持,在某些情况下比其他模拟器的性能高出一个数量级以上。它已部署在华为 DevEco Studio 中,这是一个拥有数百万开发人员的主要安卓集成开发环境。
{"title":"Trinity: High-Performance and Reliable Mobile Emulation through Graphics Projection","authors":"Hao Lin, Zhenhua Li, Di Gao, Yunhao Liu, Feng Qian, Tianyin Xu","doi":"10.1145/3643029","DOIUrl":"https://doi.org/10.1145/3643029","url":null,"abstract":"<p>Mobile emulation, which creates full-fledged software mobile devices on a physical PC/server, is pivotal to the mobile ecosystem. Unfortunately, existing mobile emulators perform poorly on graphics-intensive apps in terms of efficiency and compatibility. To address this, we introduce <i>graphics projection</i>, a novel graphics virtualization mechanism that adds a small-size <i>projection space</i> inside the guest memory, which processes graphics operations involving control contexts and resource handles without host interactions. While enhancing performance, the decoupled and asynchronous guest/host control flows introduced by graphics projection can significantly complicate emulators’ reliability issue diagnosis when faced with a variety of uncommon or non-standard app behaviors in the wild, hindering practical deployment in production. To overcome this drawback, we develop an automatic reliability issue analysis pipeline that distills the critical code paths across the guest and host control flows by runtime quarantine and state introspection. The resulting new Android emulator, dubbed Trinity, exhibits an average of 97% native hardware performance and 99.3% reliable app support, in some cases outperforming other emulators by more than an order of magnitude. It has been deployed in Huawei DevEco Studio, a major Android IDE with millions of developers.</p>","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"7 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139553863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sai Sha, Chuandong Li, Xiaolin Wang, Zhenlin Wang, Yingwei Luo
The tiered-memory system can effectively expand the memory capacity for virtual machines (VMs). However, virtualization introduces new challenges specifically in enforcing performance isolation, minimizing context switching, and providing resource overcommit. None of the state-of-the-art designs consider virtualization and address these challenges; we observe that a VM with tiered memory incurs up to a 2 × slowdown compared to a DRAM-only VM.
We propose vTMM, a hardware-software collaborative tiered-memory management framework for virtualization. A key insight in vTMM is to leverage the unique system features in virtualization to meet the above challenges. vTMM automatically determines page hotness and migrates pages between fast and slow memory to achieve better performance. Specially, vTMM optimizes page tracking and migration based on page-modification logging (PML), a hardware-assisted virtualization mechanism, and adaptively distinguishes hot/cold pages through the page “temperature” sorting. vTMM also dynamically adjusts fast memory among multi-VMs on demand by using a memory pool. Further, vTMM tracks huge pages at regular-page granularity in hardware and splits/merges pages in software, realizing hybrid-grained page management and optimization. We implement and evaluate vTMM with single-grained page management on an Intel processor, and the hybrid-grained page management on a Sunway processor with hardware mode supporting hardware/software co-designs. Experiments show that vTMM outperforms existing tiered-memory management designs in virtualization.
{"title":"Hardware-software Collaborative Tiered-memory Management Framework for Virtualization","authors":"Sai Sha, Chuandong Li, Xiaolin Wang, Zhenlin Wang, Yingwei Luo","doi":"10.1145/3639564","DOIUrl":"https://doi.org/10.1145/3639564","url":null,"abstract":"<p>The tiered-memory system can effectively expand the memory capacity for virtual machines (VMs). However, virtualization introduces new challenges specifically in enforcing performance isolation, minimizing context switching, and providing resource overcommit. None of the state-of-the-art designs consider virtualization and address these challenges; we observe that a VM with tiered memory incurs up to a 2 × slowdown compared to a DRAM-only VM. </p><p>We propose <i>vTMM</i>, a hardware-software collaborative tiered-memory management framework for virtualization. A key insight in <i>vTMM</i> is to leverage the unique system features in virtualization to meet the above challenges. <i>vTMM</i> automatically determines page hotness and migrates pages between fast and slow memory to achieve better performance. Specially, <i>vTMM</i> optimizes page tracking and migration based on page-modification logging (PML), a hardware-assisted virtualization mechanism, and adaptively distinguishes hot/cold pages through the page “temperature” sorting. <i>vTMM</i> also dynamically adjusts fast memory among multi-VMs on demand by using a memory pool. Further, <i>vTMM</i> tracks huge pages at regular-page granularity in hardware and splits/merges pages in software, realizing hybrid-grained page management and optimization. We implement and evaluate <i>vTMM</i> with single-grained page management on an Intel processor, and the hybrid-grained page management on a Sunway processor with hardware mode supporting hardware/software co-designs. Experiments show that <i>vTMM</i> outperforms existing tiered-memory management designs in virtualization.</p>","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"1229 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139469156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Containers are a mainstream virtualization technique for running stateful workloads over persistent storage. In highly-utilized multitenant hosts, resource contention at the system kernel leads to inefficient container I/O handling. Although there are interesting techniques to address this issue, they incur high implementation complexity and execution overhead. As a cost-effective alternative, we introduce the Diciclo architecture with our assumptions, goals and principles. For each tenant, Diciclo isolates the control and data I/O path at user level and runs dedicated storage systems. Diciclo includes the libservice unified user-level abstraction of system services and the node structure design pattern for the application and server side. We prototyped a toolkit of user-level components that comprise the library to invoke the standard I/O calls, the I/O communication mechanism, and the I/O services. Based on Diciclo, we built Danaus, a filesystem client that integrates a union filesystem with a Ceph distributed filesystem client and configurable shared cache. Across different host configurations, workloads and systems, Danaus achieves improved performance stability because it handles I/O with reserved per-tenant resources and avoids intensive kernel locking. Based on having built and evaluated Danaus, we share valuable lessons about resource contention, file management, service separation and performance stability in multitenant systems.
{"title":"Diciclo: Flexible User-level Services for Efficient Multitenant Isolation","authors":"Giorgos Kappes, Stergios V. Anastasiadis","doi":"10.1145/3639404","DOIUrl":"https://doi.org/10.1145/3639404","url":null,"abstract":"<p>Containers are a mainstream virtualization technique for running stateful workloads over persistent storage. In highly-utilized multitenant hosts, resource contention at the system kernel leads to inefficient container I/O handling. Although there are interesting techniques to address this issue, they incur high implementation complexity and execution overhead. As a cost-effective alternative, we introduce the Diciclo architecture with our assumptions, goals and principles. For each tenant, Diciclo isolates the control and data I/O path at user level and runs dedicated storage systems. Diciclo includes the libservice unified user-level abstraction of system services and the node structure design pattern for the application and server side. We prototyped a toolkit of user-level components that comprise the library to invoke the standard I/O calls, the I/O communication mechanism, and the I/O services. Based on Diciclo, we built Danaus, a filesystem client that integrates a union filesystem with a Ceph distributed filesystem client and configurable shared cache. Across different host configurations, workloads and systems, Danaus achieves improved performance stability because it handles I/O with reserved per-tenant resources and avoids intensive kernel locking. Based on having built and evaluated Danaus, we share valuable lessons about resource contention, file management, service separation and performance stability in multitenant systems.</p>","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"9 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139062405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Zhao, Jinchen Xu, Peng Di, Wang Nie, Jiahui Hu, Yanzhi Yi, Sijia Yang, Zhen Geng, Renwei Zhang, Bojie Li, Zhiliang Gan, Xuefeng Jin
Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or execute ad-hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers.
In this paper, we present a so-called basteln (an abbreviation for backward slicing of tiled loop nests) strategy in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests by preserving their parallelism/tilability and next performs rectangular/parallelogram tiling to the output groups that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upwards exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm for enhancing data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computations and the fusion of independent groups, exhibiting a general applicability.
We integrate the basteln strategy into two optimizing compilers, with one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8 × over cuBLAS/cuDNN and 1.1 × over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.
{"title":"Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations","authors":"Jie Zhao, Jinchen Xu, Peng Di, Wang Nie, Jiahui Hu, Yanzhi Yi, Sijia Yang, Zhen Geng, Renwei Zhang, Bojie Li, Zhiliang Gan, Xuefeng Jin","doi":"10.1145/3635305","DOIUrl":"https://doi.org/10.1145/3635305","url":null,"abstract":"<p>Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or execute ad-hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers. </p><p>In this paper, we present a so-called <i>basteln</i> (an abbreviation for backward slicing of tiled loop nests) <i>strategy</i> in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests by preserving their parallelism/tilability and next performs rectangular/parallelogram tiling to the output groups that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upwards exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm for enhancing data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computations and the fusion of independent groups, exhibiting a general applicability. </p><p>We integrate the basteln strategy into two optimizing compilers, with one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8 × over cuBLAS/cuDNN and 1.1 × over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.</p>","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"5 3","pages":""},"PeriodicalIF":1.5,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138503908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laiping Zhao, Yushuai Cui, Yanan Yang, Xiaobo Zhou, Tie Qiu, Keqiu Li, Yungang Bao
Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as ”second-class citizens” when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput.
We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.
{"title":"Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing","authors":"Laiping Zhao, Yushuai Cui, Yanan Yang, Xiaobo Zhou, Tie Qiu, Keqiu Li, Yungang Bao","doi":"10.1145/3630006","DOIUrl":"https://doi.org/10.1145/3630006","url":null,"abstract":"<p>Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as ”second-class citizens” when allocating resources to them. Neglecting the inconsistent interference tolerance abilities of LC components and the inconsistent preemption loss of BE workloads can result in missed co-location opportunities for higher throughput. </p><p>We present <monospace>Rhythm</monospace>, a co-location controller that deploys workloads and reclaims resources rhythmically for maximizing the system throughput while guaranteeing LC service’s tail latency requirement. The key idea is to differentiate the BE throughput launched with each LC component, that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses into a multi-level reclamation queue. We implement and evaluate <monospace>Rhythm</monospace> using workloads in the form of containerized processes and microservices. Experimental results show that it can improve the system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.</p>","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"5 4","pages":""},"PeriodicalIF":1.5,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138503907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A common approach to improving resource utilization in data centers is to adaptively provision resources based on the actual workload. One fundamental challenge of doing this in microservice management frameworks, however, is that different components of a service can exhibit significant differences in their impact on end-to-end performance. To make resource management more challenging, a single microservice can be shared by multiple online services that have diverse workload patterns and SLA requirements. We present an efficient resource management system, namely Erms, for guaranteeing SLAs with high probability in shared microservice environments. Erms profiles microservice latency as a piece-wise linear function of the workload, resource usage, and interference. Based on this profiling, Erms builds resource scaling models to optimally determine latency targets for microservices with complex dependencies. Erms also designs new scheduling policies at shared microservices to further enhance resource efficiency. Experiments across microservice benchmarks as well as trace-driven simulations demonstrate that Erms can reduce SLA violation probability by 5 × and more importantly, lead to a reduction in resource usage by 1.6 ×, compared to state-of-the-art approaches.
{"title":"Optimizing Resource Management for Shared Microservices: A Scalable System Design","authors":"Shutian Luo, Chenyu Lin, Kejiang Ye, Guoyao Xu, Liping Zhang, Guodong Yang, Huanle Xu, Chengzhong Xu","doi":"10.1145/3631607","DOIUrl":"https://doi.org/10.1145/3631607","url":null,"abstract":"A common approach to improving resource utilization in data centers is to adaptively provision resources based on the actual workload. One fundamental challenge of doing this in microservice management frameworks, however, is that different components of a service can exhibit significant differences in their impact on end-to-end performance. To make resource management more challenging, a single microservice can be shared by multiple online services that have diverse workload patterns and SLA requirements. We present an efficient resource management system, namely Erms, for guaranteeing SLAs with high probability in shared microservice environments. Erms profiles microservice latency as a piece-wise linear function of the workload, resource usage, and interference. Based on this profiling, Erms builds resource scaling models to optimally determine latency targets for microservices with complex dependencies. Erms also designs new scheduling policies at shared microservices to further enhance resource efficiency. Experiments across microservice benchmarks as well as trace-driven simulations demonstrate that Erms can reduce SLA violation probability by 5 × and more importantly, lead to a reduction in resource usage by 1.6 ×, compared to state-of-the-art approaches.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"16 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135634146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Pellauer, Jason Clemons, Vignesh Balaji, Neal Crago, Aamer Jaleel, Donghyuk Lee, Mike O’Connor, Anghsuman Parashar, Sean Treichler, Po-An Tsai, Stephen W. Keckler, Joel S. Emer
Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result cannot be applied to a wide range of algorithms or applications that contain a mix of sparse and dense algorithms. This paper proposes Symphony, a hybrid programmable/specialized architecture which focuses on the orchestration of data throughout the memory hierarchy to simultaneously reduce the movement of unnecessary data and data movement distances. Key elements of the Symphony architecture include (1) specialized reconfigurable units aimed not only at roofline floating-point computations, but at supporting data orchestration features such as address generation, data filtering, and sparse metadata processing; and (2) distribution of computation resources (both programmable and specialized) throughout the on-chip memory hierarchy. We demonstrate that Symphony can match non-programmable ASIC performance on sparse tensor algebra, and provide 31 × improved runtime and 44 × improved energy over a comparably provisioned GPU for these applications.
{"title":"Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing","authors":"Michael Pellauer, Jason Clemons, Vignesh Balaji, Neal Crago, Aamer Jaleel, Donghyuk Lee, Mike O’Connor, Anghsuman Parashar, Sean Treichler, Po-An Tsai, Stephen W. Keckler, Joel S. Emer","doi":"10.1145/3630007","DOIUrl":"https://doi.org/10.1145/3630007","url":null,"abstract":"Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result cannot be applied to a wide range of algorithms or applications that contain a mix of sparse and dense algorithms. This paper proposes Symphony, a hybrid programmable/specialized architecture which focuses on the orchestration of data throughout the memory hierarchy to simultaneously reduce the movement of unnecessary data and data movement distances. Key elements of the Symphony architecture include (1) specialized reconfigurable units aimed not only at roofline floating-point computations, but at supporting data orchestration features such as address generation, data filtering, and sparse metadata processing; and (2) distribution of computation resources (both programmable and specialized) throughout the on-chip memory hierarchy. We demonstrate that Symphony can match non-programmable ASIC performance on sparse tensor algebra, and provide 31 × improved runtime and 44 × improved energy over a comparably provisioned GPU for these applications.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"16 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136234863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Filesystem fragmentation has been one of the primary reasons for computer systems to get slower over time. However, there have been rapid changes in modern storage systems over the past decades, and modern storage devices such as solid state drives have different mechanisms to access data, compared with traditional rotational ones. In this paper, we revisit filesystem fragmentation on modern computer systems from both performance and fairness perspectives. According to our extensive experiments, filesystem fragmentation not only degrades I/O performance of modern storage devices, but also incurs various problems related to I/O fairness, such as performance interference. Unfortunately, conventional defragmentation tools are designed primarily for hard disk drives and thus generate an unnecessarily large amount of I/Os for data migration. To mitigate such problems, this paper present FragPicker, a new defragmentation tool for modern storage devices. FragPicker analyzes the I/O behaviors of each target application and defragments only necessary pieces of data whose migration can contribute to performance improvement, thereby effectively minimizing the I/O amount for defragmentation. Our evaluation with YCSB workload-C shows FragPicker reduces the total amount of I/O for defragmentation by around 66% and the elapsed time by around 84%, while showing a similar level of defragmentation effect.
{"title":"Filesystem Fragmentation on Modern Storage Systems","authors":"Jonggyu Park, Y. Eom","doi":"10.1145/3611386","DOIUrl":"https://doi.org/10.1145/3611386","url":null,"abstract":"Filesystem fragmentation has been one of the primary reasons for computer systems to get slower over time. However, there have been rapid changes in modern storage systems over the past decades, and modern storage devices such as solid state drives have different mechanisms to access data, compared with traditional rotational ones. In this paper, we revisit filesystem fragmentation on modern computer systems from both performance and fairness perspectives. According to our extensive experiments, filesystem fragmentation not only degrades I/O performance of modern storage devices, but also incurs various problems related to I/O fairness, such as performance interference. Unfortunately, conventional defragmentation tools are designed primarily for hard disk drives and thus generate an unnecessarily large amount of I/Os for data migration. To mitigate such problems, this paper present FragPicker, a new defragmentation tool for modern storage devices. FragPicker analyzes the I/O behaviors of each target application and defragments only necessary pieces of data whose migration can contribute to performance improvement, thereby effectively minimizing the I/O amount for defragmentation. Our evaluation with YCSB workload-C shows FragPicker reduces the total amount of I/O for defragmentation by around 66% and the elapsed time by around 84%, while showing a similar level of defragmentation effect.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2023-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45128013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Isaac C. Sheff, Xinwen Wang, Kushal Babel, Haobin Ni, R. Van Renesse, A. Myers
Cross-domain applications are rapidly adopting blockchain techniques for immutability, availability, integrity, and interoperability. However, for most applications, global consensus is unnecessary and may not even provide sufficient guarantees. We propose a new distributed data structure: Attested Data Structures (ADS), which generalize not only blockchains, but also many other structures used by distributed applications. As in blockchains, data in ADSs is immutable and self-authenticating. ADSs go further by supporting application-defined proofs (attestations). Attestations enable applications to plug in their own mechanisms to ensure availability and integrity. We present Charlotte, a framework for composable ADSs. Charlotte deconstructs conventional blockchains into more primitive mechanisms. Charlotte can be used to construct blockchains, but does not impose the usual global-ordering overhead. Charlotte offers a flexible foundation for interacting applications that define their own policies for availability and integrity. Unlike traditional distributed systems, Charlotte supports heterogeneous trust: different observers have their own beliefs about who might fail, and how. Nevertheless, each observer has a consistent, available view of data. Charlotte’s data structures are interoperable and composable: applications and data structures can operate fully independently, or can share data when desired. Charlotte defines a language-independent format for data blocks and a network API for servers. To demonstrate Charlotte’s flexibility, we implement several integrity mechanisms, including consensus and proof of work. We explore the power of disentangling availability and integrity mechanisms in prototype applications. The results suggest that Charlotte can be used to build flexible, fast, composable applications with strong guarantees.
{"title":"Charlotte: Reformulating Blockchains into a Web of Composable Attested Data Structures for Cross-Domain Applications","authors":"Isaac C. Sheff, Xinwen Wang, Kushal Babel, Haobin Ni, R. Van Renesse, A. Myers","doi":"10.1145/3607534","DOIUrl":"https://doi.org/10.1145/3607534","url":null,"abstract":"Cross-domain applications are rapidly adopting blockchain techniques for immutability, availability, integrity, and interoperability. However, for most applications, global consensus is unnecessary and may not even provide sufficient guarantees. We propose a new distributed data structure: Attested Data Structures (ADS), which generalize not only blockchains, but also many other structures used by distributed applications. As in blockchains, data in ADSs is immutable and self-authenticating. ADSs go further by supporting application-defined proofs (attestations). Attestations enable applications to plug in their own mechanisms to ensure availability and integrity. We present Charlotte, a framework for composable ADSs. Charlotte deconstructs conventional blockchains into more primitive mechanisms. Charlotte can be used to construct blockchains, but does not impose the usual global-ordering overhead. Charlotte offers a flexible foundation for interacting applications that define their own policies for availability and integrity. Unlike traditional distributed systems, Charlotte supports heterogeneous trust: different observers have their own beliefs about who might fail, and how. Nevertheless, each observer has a consistent, available view of data. Charlotte’s data structures are interoperable and composable: applications and data structures can operate fully independently, or can share data when desired. Charlotte defines a language-independent format for data blocks and a network API for servers. To demonstrate Charlotte’s flexibility, we implement several integrity mechanisms, including consensus and proof of work. We explore the power of disentangling availability and integrity mechanisms in prototype applications. The results suggest that Charlotte can be used to build flexible, fast, composable applications with strong guarantees.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"1 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2023-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44802457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}