ACM Transactions on Computer Systems: Latest Articles

PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation

IF 1.5 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2024-02-03 DOI: 10.1145/3643886

Zheng Dang, Shuibing He, Xuechen Zhang, Peiyi Hong, Zhenxin Li, Xinyu Chen, Haozhe Song, Xian-He Sun, Gang Chen

Persistent memory allocation is a fundamental building block for developing high-performance and in-memory applications. Existing persistent memory allocators suffer from several performance issues. First, poor heap metadata management may cause repeated cache line flushes and small random accesses in persistent memory. Second, they use static slab segregation, which dramatically increases memory consumption when allocation request sizes change. Third, they are not NUMA-aware, which leads to remote persistent memory accesses during allocation and deallocation. In this paper, we design a novel allocator, named PMAlloc, to solve these issues simultaneously. (1) PMAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. (2) It writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. (3) Instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. (4) It uses a local-first allocation policy to avoid allocating remote memory blocks, and it supports a two-phase deallocation mechanism, consisting of recording and synchronization, to minimize the number of remote memory accesses during deallocation. PMAlloc is complementary to existing consistency models. Results on 6 benchmarks demonstrate that PMAlloc outperforms state-of-the-art persistent memory allocators by up to 6.4x and 57x for small and large allocations, respectively. PMAlloc with NUMA optimizations brings a 2.9x speedup in multi-socket evaluation and is up to 36x faster than other persistent memory allocators. Using PMAlloc reduces memory usage by up to 57.8%. In addition, we integrate PMAlloc into a persistent FPTree. Compared to state-of-the-art allocators, PMAlloc improves the performance of this application by up to 3.1x.
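
The interleaving idea in point (1) can be pictured with a small sketch. This is not PMAlloc's actual layout: it assumes 64-byte cache lines, 8-byte metadata entries, and a slab whose block count is a non-zero multiple of the entries per line, and the function names are hypothetical. It only shows why placing the metadata of neighbouring blocks in different cache lines avoids flushing the same line repeatedly during a burst of consecutive allocations.

```python
CACHE_LINE = 64                      # bytes per cache line (assumed)
ENTRY_SIZE = 8                       # bytes per metadata entry (assumed)
ENTRIES_PER_LINE = CACHE_LINE // ENTRY_SIZE


def sequential_entry(block: int, num_blocks: int) -> int:
    """Naive layout: block i uses metadata entry i."""
    return block


def interleaved_entry(block: int, num_blocks: int) -> int:
    """Hypothetical interleaved layout: consecutive blocks map to different lines.

    Assumes num_blocks is a non-zero multiple of ENTRIES_PER_LINE.
    """
    num_lines = num_blocks // ENTRIES_PER_LINE
    line = block % num_lines         # neighbouring blocks rotate through the lines
    slot = block // num_lines        # position inside the chosen line
    return line * ENTRIES_PER_LINE + slot


def reflushes(mapping, num_blocks: int, burst: int) -> int:
    """Count flushes that hit a line already flushed earlier in the same burst."""
    repeated = 0
    for start in range(0, num_blocks, burst):
        seen = set()
        for block in range(start, min(start + burst, num_blocks)):
            line = mapping(block, num_blocks) // ENTRIES_PER_LINE
            if line in seen:
                repeated += 1
            seen.add(line)
    return repeated


if __name__ == "__main__":
    print("sequential :", reflushes(sequential_entry, 64, 8))
    print("interleaved:", reflushes(interleaved_entry, 64, 8))
```

With 64 blocks allocated in bursts of 8, the sequential layout touches the same metadata line for every block in a burst, while the interleaved layout touches 8 distinct lines per burst.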



Citations: 0

Trinity: High-Performance and Reliable Mobile Emulation through Graphics Projection

IF 1.5 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2024-01-24 DOI: 10.1145/3643029

Hao Lin, Zhenhua Li, Di Gao, Yunhao Liu, Feng Qian, Tianyin Xu

Mobile emulation, which creates full-fledged software mobile devices on a physical PC/server, is pivotal to the mobile ecosystem. Unfortunately, existing mobile emulators perform poorly on graphics-intensive apps in terms of efficiency and compatibility. To address this, we introduce graphics projection, a novel graphics virtualization mechanism that adds a small projection space inside the guest memory, which processes graphics operations involving control contexts and resource handles without host interactions. While it enhances performance, the decoupled and asynchronous guest/host control flows introduced by graphics projection can significantly complicate the diagnosis of emulator reliability issues when faced with a variety of uncommon or non-standard app behaviors in the wild, hindering practical deployment in production. To overcome this drawback, we develop an automatic reliability issue analysis pipeline that distills the critical code paths across the guest and host control flows by runtime quarantine and state introspection. The resulting new Android emulator, dubbed Trinity, achieves on average 97% of native hardware performance and 99.3% reliable app support, in some cases outperforming other emulators by more than an order of magnitude. It has been deployed in Huawei DevEco Studio, a major Android IDE with millions of developers.



Citations: 0

Hardware-software Collaborative Tiered-memory Management Framework for Virtualization

IF 1.5 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2024-01-15 DOI: 10.1145/3639564

Sai Sha, Chuandong Li, Xiaolin Wang, Zhenlin Wang, Yingwei Luo

A tiered-memory system can effectively expand the memory capacity of virtual machines (VMs). However, virtualization introduces new challenges, specifically in enforcing performance isolation, minimizing context switching, and providing resource overcommit. None of the state-of-the-art designs consider virtualization or address these challenges; we observe that a VM with tiered memory incurs up to a 2× slowdown compared to a DRAM-only VM.

We propose vTMM, a hardware-software collaborative tiered-memory management framework for virtualization. A key insight in vTMM is to leverage the unique system features in virtualization to meet the above challenges. vTMM automatically determines page hotness and migrates pages between fast and slow memory to achieve better performance. Specifically, vTMM optimizes page tracking and migration based on page-modification logging (PML), a hardware-assisted virtualization mechanism, and adaptively distinguishes hot/cold pages by sorting pages by “temperature”. vTMM also dynamically adjusts fast memory among multiple VMs on demand by using a memory pool. Further, vTMM tracks huge pages at regular-page granularity in hardware and splits/merges pages in software, realizing hybrid-grained page management and optimization. We implement and evaluate vTMM with single-grained page management on an Intel processor, and the hybrid-grained page management on a Sunway processor with hardware mode supporting hardware/software co-designs. Experiments show that vTMM outperforms existing tiered-memory management designs in virtualization.
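
The hot/cold classification by “temperature” sorting can be illustrated with a minimal sketch. It is not vTMM's implementation: the access log is simply a list of page numbers standing in for entries drained from a PML-style log, temperature is taken to be a plain access count, and the names `classify_pages` and `plan_migrations` are hypothetical.

```python
from collections import Counter


def classify_pages(access_log, fast_capacity):
    """Split pages into hot (fast memory) and cold (slow memory) sets.

    access_log    -- iterable of page numbers, one per logged access
                     (a stand-in for entries drained from a PML-style log)
    fast_capacity -- number of pages that fit in fast memory
    """
    temperature = Counter(access_log)
    # Sort by descending access count; break ties by page number for determinism.
    ranked = sorted(temperature, key=lambda page: (-temperature[page], page))
    hot = set(ranked[:fast_capacity])
    cold = set(ranked[fast_capacity:])
    return hot, cold


def plan_migrations(hot, in_fast):
    """Return (promote, demote): pages to move into and out of fast memory."""
    promote = hot - in_fast
    demote = in_fast - hot
    return promote, demote


if __name__ == "__main__":
    log = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]
    hot, cold = classify_pages(log, fast_capacity=2)
    print(hot, cold)
    print(plan_migrations(hot, in_fast={2, 4}))
```

Sorting the whole page set keeps the sketch short; a real system would likely bound this cost, for example by sampling or by bucketing counts.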



Citations: 0

Diciclo: Flexible User-level Services for Efficient Multitenant Isolation

IF 1.5 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2023-12-30 DOI: 10.1145/3639404

Giorgos Kappes, Stergios V. Anastasiadis

Containers are a mainstream virtualization technique for running stateful workloads over persistent storage. In highly utilized multitenant hosts, resource contention at the system kernel leads to inefficient container I/O handling. Although there are interesting techniques to address this issue, they incur high implementation complexity and execution overhead. As a cost-effective alternative, we introduce the Diciclo architecture with our assumptions, goals and principles. For each tenant, Diciclo isolates the control and data I/O path at user level and runs dedicated storage systems. Diciclo includes the libservice unified user-level abstraction of system services and the node structure design pattern for the application and server side. We prototyped a toolkit of user-level components that comprise the library to invoke the standard I/O calls, the I/O communication mechanism, and the I/O services. Based on Diciclo, we built Danaus, a filesystem client that integrates a union filesystem with a Ceph distributed filesystem client and a configurable shared cache. Across different host configurations, workloads and systems, Danaus achieves improved performance stability because it handles I/O with reserved per-tenant resources and avoids intensive kernel locking. Having built and evaluated Danaus, we share valuable lessons about resource contention, file management, service separation and performance stability in multitenant systems.



Citations: 0

Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations

IF 1.5 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2023-12-01 DOI: 10.1145/3635305

Jie Zhao, Jinchen Xu, Peng Di, Wang Nie, Jiahui Hu, Yanzhi Yi, Sijia Yang, Zhen Geng, Renwei Zhang, Bojie Li, Zhiliang Gan, Xuefeng Jin

Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or execute ad-hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers.

In this paper, we present a so-called basteln (an abbreviation for backward slicing of tiled loop nests) strategy in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests while preserving their parallelism/tilability and then applies rectangular/parallelogram tiling to the output groups that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upward-exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm to enhance data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computation and the fusion of independent groups, which gives it general applicability.

We integrate the basteln strategy into two optimizing compilers, one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8× over cuBLAS/cuDNN and 1.1× over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.
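
The order of operations described above (tile the output group first, compute each tile's memory footprint, then derive the producer tile from it) can be shown on a hand-written one-dimensional example. This is only an illustration under simplifying assumptions, not the polyhedral algorithm from the paper: the producer and the three-point consumer are made up, and the tile bounds are computed by hand rather than from affine relations.

```python
def unfused(a):
    """Reference: run the producer loop to completion, then the consumer loop."""
    n = len(a)
    tmp = [2 * a[i] for i in range(n)]
    return [tmp[i] + tmp[i + 1] + tmp[i + 2] for i in range(n - 2)]


def tiled_and_fused(a, tile=4):
    """Tile the consumer (output) loop, then derive each producer tile from it."""
    n = len(a)
    out = [0] * max(n - 2, 0)
    recomputed = 0
    for lo in range(0, n - 2, tile):
        hi = min(lo + tile, n - 2)
        # Footprint of the consumer tile [lo, hi): it reads tmp[lo] .. tmp[hi + 1].
        p_lo, p_hi = lo, hi + 2
        # Producer tile, computed into a small tile-local buffer.
        local = [2 * a[i] for i in range(p_lo, p_hi)]
        if lo > 0:
            # The first two elements were also produced by the previous tile.
            recomputed += 2
        for i in range(lo, hi):
            j = i - p_lo
            out[i] = local[j] + local[j + 1] + local[j + 2]
    return out, recomputed


if __name__ == "__main__":
    data = list(range(10))
    print(unfused(data))
    print(tiled_and_fused(data, tile=4))
```

The producer tiles overlap by two elements, so the fused version trades a small amount of redundant computation for keeping the intermediate values in a tile-sized buffer; this is the kind of trade-off the abstract refers to when it mentions redundant computations.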



Citations: 0

Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing

IF 1.5 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2023-11-18 DOI: 10.1145/3630006

Laiping Zhao, Yushuai Cui, Yanan Yang, Xiaobo Zhou, Tie Qiu, Keqiu Li, Yungang Bao

Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LCs as monolithic applications and treat BEs as “second-class citizens” when allocating resources to them. Neglecting the differing interference tolerance of LC components and the differing preemption losses of BE workloads can result in missed co-location opportunities for higher throughput.

We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically to maximize system throughput while guaranteeing the tail latency requirements of LC services. The key idea is to differentiate the BE throughput launched with each LC component; that is, components with higher interference tolerance can be deployed together with more BE jobs. It also assigns different reclamation priority values to BEs by evaluating their preemption losses and placing them into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it can improve system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail latency requirement.
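
A rough sketch of the two mechanisms follows. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: BE jobs are split across LC components in proportion to an integer tolerance score, preemption loss is a single number per job, and the function names are hypothetical.

```python
def be_quota(tolerance, total_be_jobs):
    """Split a BE job budget across LC components in proportion to tolerance.

    tolerance -- dict: component name -> integer interference tolerance score
                 (hypothetical metric, >= 0)
    """
    total = sum(tolerance.values())
    if total == 0:
        return {name: 0 for name in tolerance}
    # Floor division keeps the sketch simple; leftover jobs stay unscheduled.
    return {name: total_be_jobs * score // total for name, score in tolerance.items()}


def reclamation_levels(preemption_loss, thresholds):
    """Bucket BE jobs into levels; level 0 is reclaimed first (cheapest to preempt).

    preemption_loss -- dict: job name -> estimated loss if preempted (hypothetical)
    thresholds      -- ascending upper bounds for every level except the last
    """
    levels = [[] for _ in range(len(thresholds) + 1)]
    for job, loss in sorted(preemption_loss.items()):
        level = sum(loss > t for t in thresholds)
        levels[level].append(job)
    return levels


def reclaim(levels, count):
    """Pick `count` jobs to preempt, draining the lowest level first."""
    order = [job for level in levels for job in level]
    return order[:count]


if __name__ == "__main__":
    print(be_quota({"frontend": 1, "cache": 3, "db": 0}, total_be_jobs=8))
    levels = reclamation_levels({"a": 5, "b": 50, "c": 500, "d": 7}, [10, 100])
    print(levels, reclaim(levels, 3))
```

In this sketch a component with zero tolerance receives no BE jobs, and jobs whose preemption would waste the most work are the last to be reclaimed.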



Citations: 0

Optimizing Resource Management for Shared Microservices: A Scalable System Design

Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2023-11-06 DOI: 10.1145/3631607

Shutian Luo, Chenyu Lin, Kejiang Ye, Guoyao Xu, Liping Zhang, Guodong Yang, Huanle Xu, Chengzhong Xu

A common approach to improving resource utilization in data centers is to adaptively provision resources based on the actual workload. One fundamental challenge of doing this in microservice management frameworks, however, is that different components of a service can differ significantly in their impact on end-to-end performance. To make resource management even more challenging, a single microservice can be shared by multiple online services that have diverse workload patterns and SLA requirements. We present an efficient resource management system, namely Erms, for guaranteeing SLAs with high probability in shared microservice environments. Erms profiles microservice latency as a piece-wise linear function of the workload, resource usage, and interference. Based on this profiling, Erms builds resource scaling models to optimally determine latency targets for microservices with complex dependencies. Erms also designs new scheduling policies at shared microservices to further enhance resource efficiency. Experiments across microservice benchmarks as well as trace-driven simulations demonstrate that Erms can reduce SLA violation probability by 5× and, more importantly, reduce resource usage by 1.6×, compared to state-of-the-art approaches.
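
To make the piece-wise linear latency model concrete, the sketch below uses a two-segment function of per-container load only and inverts it to size a deployment for a latency target. This is a simplification, not Erms's model: the abstract says latency also depends on resource usage and interference, which are left out here, and the parameter names (`knee`, `slope_low`, `slope_high`, `base`) are hypothetical.

```python
import math


def latency_ms(load, knee, slope_low, slope_high, base):
    """Piece-wise linear latency: gentle slope below the knee, steep slope above it."""
    if load <= knee:
        return base + slope_low * load
    return base + slope_low * knee + slope_high * (load - knee)


def max_load_for_target(target_ms, knee, slope_low, slope_high, base):
    """Invert the model: largest per-container load whose latency stays within target."""
    if target_ms < base:
        return 0.0
    at_knee = base + slope_low * knee
    if target_ms <= at_knee:
        return (target_ms - base) / slope_low
    return knee + (target_ms - at_knee) / slope_high


def containers_needed(total_load, target_ms, **model):
    """Containers required so that each one stays at or below the target latency."""
    per_container = max_load_for_target(target_ms, **model)
    if per_container <= 0:
        raise ValueError("latency target is not achievable under this model")
    return math.ceil(total_load / per_container)


if __name__ == "__main__":
    model = dict(knee=100, slope_low=0.05, slope_high=0.5, base=2.0)
    print(latency_ms(120, **model))
    print(containers_needed(1000, target_ms=12.0, **model))
```

Because latency rises steeply past the knee, loosening a microservice's latency target slightly beyond the knee buys little extra load per container, which is one reason the choice of per-microservice latency targets matters for total resource usage.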



Citations: 0

Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing

Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2023-10-27 DOI: 10.1145/3630007

Michael Pellauer, Jason Clemons, Vignesh Balaji, Neal Crago, Aamer Jaleel, Donghyuk Lee, Mike O’Connor, Angshuman Parashar, Sean Treichler, Po-An Tsai, Stephen W. Keckler, Joel S. Emer

Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result cannot be applied to a wide range of algorithms or to applications that contain a mix of sparse and dense algorithms. This paper proposes Symphony, a hybrid programmable/specialized architecture that focuses on the orchestration of data throughout the memory hierarchy to simultaneously reduce the movement of unnecessary data and data movement distances. Key elements of the Symphony architecture include (1) specialized reconfigurable units aimed not only at roofline floating-point computations, but also at supporting data orchestration features such as address generation, data filtering, and sparse metadata processing; and (2) distribution of computation resources (both programmable and specialized) throughout the on-chip memory hierarchy. We demonstrate that Symphony can match non-programmable ASIC performance on sparse tensor algebra, and provide 31× improved runtime and 44× improved energy over a comparably provisioned GPU for these applications.


Citations: 0

Filesystem Fragmentation on Modern Storage Systems

IF 1.5 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2023-08-02 DOI: 10.1145/3611386

Jonggyu Park, Y. Eom

Filesystem fragmentation has been one of the primary reasons that computer systems get slower over time. However, modern storage systems have changed rapidly over the past decades, and modern storage devices such as solid state drives access data through different mechanisms than traditional rotational ones. In this paper, we revisit filesystem fragmentation on modern computer systems from both performance and fairness perspectives. According to our extensive experiments, filesystem fragmentation not only degrades the I/O performance of modern storage devices, but also incurs various problems related to I/O fairness, such as performance interference. Unfortunately, conventional defragmentation tools are designed primarily for hard disk drives and thus generate an unnecessarily large amount of I/O for data migration. To mitigate such problems, this paper presents FragPicker, a new defragmentation tool for modern storage devices. FragPicker analyzes the I/O behaviors of each target application and defragments only the necessary pieces of data whose migration can contribute to performance improvement, thereby effectively minimizing the amount of I/O for defragmentation. Our evaluation with YCSB workload-C shows that FragPicker reduces the total amount of I/O for defragmentation by around 66% and the elapsed time by around 84%, while achieving a similar level of defragmentation effect.
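The selective approach described above, migrating only data whose fragmentation affects observed I/O, can be sketched in Python under stated assumptions. This is not FragPicker's algorithm: the `Extent` record, the helper names, and the rule "a read that touches more than one physically discontiguous piece is worth migrating" are all hypothetical simplifications for illustration.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """One mapped piece of a file (hypothetical record layout)."""
    logical: int   # starting logical offset in the file (bytes)
    physical: int  # starting physical offset on the device (bytes)
    length: int    # extent length (bytes)


def fragments_touched(extents, offset, length):
    """Count physically discontiguous pieces that a read of [offset, offset+length) touches."""
    end = offset + length
    # Extents overlapping the read, ordered by logical offset.
    hit = sorted(e for e in extents if e.logical < end and e.logical + e.length > offset)
    pieces = 0
    prev = None
    for e in hit:
        contiguous = (
            prev is not None
            and prev.logical + prev.length == e.logical
            and prev.physical + prev.length == e.physical
        )
        if not contiguous:
            pieces += 1
        prev = e
    return pieces


def ranges_to_migrate(extents, reads):
    """Keep only the observed reads that are split across more than one piece."""
    return [(off, ln) for off, ln in reads if fragments_touched(extents, off, ln) > 1]


# The first two extents are physically apart; the last two are adjacent on the device.
extents = [Extent(0, 1000, 4096), Extent(4096, 9000, 4096), Extent(8192, 13096, 4096)]
reads = [(0, 4096), (2048, 4096), (4096, 8192)]
print(ranges_to_migrate(extents, reads))
```

Only the read that straddles the physical gap is selected; reads served from a single extent or from extents that already sit next to each other on the device are left alone, which is where the reduction in migration I/O would come from.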


Citations: 0

Charlotte: Reformulating Blockchains into a Web of Composable Attested Data Structures for Cross-Domain Applications

IF 1.5 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems

Pub Date : 2023-07-22 DOI: 10.1145/3607534

Isaac C. Sheff, Xinwen Wang, Kushal Babel, Haobin Ni, R. Van Renesse, A. Myers

Cross-domain applications are rapidly adopting blockchain techniques for immutability, availability, integrity, and interoperability. However, for most applications, global consensus is unnecessary and may not even provide sufficient guarantees. We propose a new distributed data structure: Attested Data Structures (ADS), which generalize not only blockchains, but also many other structures used by distributed applications. As in blockchains, data in ADSs is immutable and self-authenticating. ADSs go further by supporting application-defined proofs (attestations). Attestations enable applications to plug in their own mechanisms to ensure availability and integrity. We present Charlotte, a framework for composable ADSs. Charlotte deconstructs conventional blockchains into more primitive mechanisms. Charlotte can be used to construct blockchains, but does not impose the usual global-ordering overhead. Charlotte offers a flexible foundation for interacting applications that define their own policies for availability and integrity. Unlike traditional distributed systems, Charlotte supports heterogeneous trust: different observers have their own beliefs about who might fail, and how. Nevertheless, each observer has a consistent, available view of data. Charlotte’s data structures are interoperable and composable: applications and data structures can operate fully independently, or can share data when desired. Charlotte defines a language-independent format for data blocks and a network API for servers. To demonstrate Charlotte’s flexibility, we implement several integrity mechanisms, including consensus and proof of work. We explore the power of disentangling availability and integrity mechanisms in prototype applications. The results suggest that Charlotte can be used to build flexible, fast, composable applications with strong guarantees.
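The properties named above (immutable, self-authenticating blocks plus application-defined attestations) can be illustrated with a small Python sketch. It does not reproduce Charlotte's block format or network API, which the abstract does not specify: the JSON encoding, the function names, and the use of an HMAC tag as a stand-in for an attestation are assumptions made for illustration (a real deployment would more likely use public-key signatures or other proof types).

```python
import hashlib
import hmac
import json


def block_id(block: dict) -> str:
    """Content address: SHA-256 over a canonical JSON encoding of the block."""
    encoded = json.dumps(block, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def make_block(payload: str, parents: list) -> dict:
    """A block refers to its parents by hash, so any change to a parent changes its id."""
    return {"payload": payload, "parents": sorted(parents)}


def attest(server_key: bytes, ref: str) -> dict:
    """Hypothetical attestation: a keyed tag over the id of the block being attested."""
    tag = hmac.new(server_key, ref.encode(), hashlib.sha256).hexdigest()
    return {"attests": ref, "tag": tag}


def verify(server_key: bytes, block: dict, attestation: dict) -> bool:
    """Check that the attestation names this exact block and carries a valid tag."""
    ref = block_id(block)
    expected = hmac.new(server_key, ref.encode(), hashlib.sha256).hexdigest()
    return attestation["attests"] == ref and hmac.compare_digest(expected, attestation["tag"])


genesis = make_block("genesis", [])
child = make_block("transfer 5 units", [block_id(genesis)])
proof = attest(b"demo-key", block_id(child))
print(verify(b"demo-key", child, proof))
print(verify(b"demo-key", dict(child, payload="transfer 500 units"), proof))
```

Because the attestation refers to the block only by its hash, attestations live alongside blocks as separate data; different applications could attach different kinds of proofs to the same block without a global ordering step.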


Citations: 0