Callback: Efficient synchronization without invalidation with a directory just for spin-waiting
Alberto Ros, S. Kaxiras
ISCA 2015, pp. 427-438. DOI: 10.1145/2749469.2750405

Cache coherence protocols based on self-invalidation allow a simpler design compared to traditional invalidation-based protocols, by relying on data-race-free (DRF) semantics and applying self-invalidation at racy synchronization points exposed to the hardware. Their simplicity lies in the absence of invalidation traffic, which eliminates the need to track readers in a directory and reduces the number of transient protocol states. With the addition of self-downgrade, these protocols can become effectively directory-free. While this works well for race-free data, the lack of explicit invalidations unfortunately compromises the effectiveness of any synchronization that relies on races. This includes any form of spin waiting, which is employed for signaling, locking, and barrier primitives. In this work we propose a new solution for spin waiting in these protocols, the callback mechanism, which is simpler and more efficient than explicit invalidation. Callbacks are set by reads involved in spin waiting and are satisfied by writes (which can even precede these reads). To implement callbacks we use a small (just a few entries) directory-cache structure intended to service only these “spin-waiting” races. This directory structure is self-contained and is not backed up in any way. Entries are created on demand and can be evicted without the need to preserve their information. Our evaluation shows a significant improvement both over explicit invalidation and over exponential back-off, the state-of-the-art mechanism for self-invalidation protocols to avoid spinning in the shared cache.
Unified address translation for memory-mapped SSDs with FlashMap
Jian Huang, Anirudh Badam, Moinuddin K. Qureshi, K. Schwan
ISCA 2015, pp. 580-591. DOI: 10.1145/2749469.2750420

Applications can map data on SSDs into virtual memory to transparently scale beyond DRAM capacity, permitting them to leverage high SSD capacities with few code changes. Obtaining good performance for memory-mapped SSD content, however, is hard because the virtual memory layer, the file system, and the flash translation layer (FTL) perform address translations, sanity checks, and permission checks independently of each other. We introduce FlashMap, an SSD interface that is optimized for memory-mapped SSD files. FlashMap combines all the address translations into page tables that are used to index files and also to store the FTL-level mappings, without altering the guarantees of the file system or the FTL. It uses the state in the OS memory manager and the page tables to perform sanity and permission checks, respectively. By combining these layers, FlashMap reduces critical-path latency and improves DRAM caching efficiency. We find that this increases application performance by up to 3.32x compared to state-of-the-art SSD file-mapping mechanisms. Additionally, the latency of SSD accesses is reduced by up to 53.2%.
Accelerating asynchronous programs through Event Sneak Peek
Gaurav Chadha, S. Mahlke, S. Narayanasamy
ISCA 2015, pp. 642-654. DOI: 10.1145/2749469.2750373

Asynchronous or event-driven programming is now being used to develop a wide range of systems, including mobile and Web 2.0 applications, Internet-of-Things devices, and even distributed servers. We observe that these programs perform poorly on conventional processor architectures that are heavily optimized for the characteristics of synchronous programs. The execution characteristics of asynchronous programs differ significantly from those of synchronous programs, as they interleave short events from varied tasks in a fine-grained manner. This paper proposes the Event Sneak Peek (ESP) architecture to mitigate microarchitectural bottlenecks in asynchronous programs. ESP exploits the fact that events are posted to an event queue before they get executed. By exposing this event queue to the processor, ESP gains knowledge of future events. Instead of stalling on long-latency cache misses, ESP jumps ahead to pre-execute future events and gathers useful information that later helps initiate accurate instruction and data prefetches and correct branch mispredictions. We demonstrate that ESP improves the performance of popular asynchronous Web 2.0 applications, including Amazon, Google Maps, and Facebook, by an average of 16%.
ArMOR: Defending against memory consistency model mismatches in heterogeneous architectures
Daniel Lustig, Caroline Trippel, Michael Pellauer, M. Martonosi
ISCA 2015, pp. 388-400. DOI: 10.1145/2749469.2750378

Architectural heterogeneity is increasing: numerous products and studies have proven the benefits of combining cores and accelerators with varying ISAs into a single system. However, an underappreciated barrier to unlocking the full potential of heterogeneity is the need to specify and to reconcile differences in memory consistency models across layers of the hardware-software stack and among on-chip components. This paper presents ArMOR, a framework for specifying, comparing, and translating between memory consistency models. ArMOR defines MOSTs, an architecture-independent and precise format for specifying the semantics of memory ordering requirements such as preserved program order or explicit fences. MOSTs allow any two consistency models to be directly and algorithmically compared, and they help avoid many of the pitfalls of traditional consistency model analysis. As a case study, we use ArMOR to automatically generate translation modules called shims that dynamically translate code compiled for one memory model to execute on hardware implementing a different model.
Page overlays: An enhanced virtual memory framework to enable fine-grained memory management
V. Seshadri, Gennady Pekhimenko, Olatunji Ruwase, O. Mutlu, Phillip B. Gibbons, M. Kozuch, T. Mowry, Trishul M. Chilimbi
ISCA 2015, pp. 79-91. DOI: 10.1145/2749469.2750379

Many recent works propose mechanisms demonstrating the potential advantages of managing memory at a fine granularity (e.g., individual cache lines), such as fine-grained de-duplication and fine-grained memory protection. Unfortunately, existing virtual memory systems track memory at a larger granularity (e.g., 4 KB pages), inhibiting efficient implementation of such techniques. Simply reducing the page size results in an unacceptable increase in page table overhead and TLB pressure. We propose a new virtual memory framework that enables efficient implementation of a variety of fine-grained memory management techniques. In our framework, each virtual page can be mapped to a structure called a page overlay, in addition to a regular physical page. An overlay contains a subset of the cache lines from the virtual page. Cache lines that are present in the overlay are accessed from there, and all other cache lines are accessed from the regular physical page. Our page-overlay framework enables cache-line-granularity memory management without significantly altering the existing virtual memory framework or introducing high overheads. We show that our framework can enable simple and efficient implementations of seven memory management techniques, each of which has a wide variety of applications. We quantitatively evaluate the potential benefits of two of these techniques: overlay-on-write and sparse-data-structure computation. Our evaluations show that overlay-on-write, when applied to fork, can improve performance by 15% and reduce memory capacity requirements by 53% on average compared to traditional copy-on-write. For sparse data computation, our framework can outperform a state-of-the-art software-based sparse representation on a number of real-world sparse matrices. Our framework is general, powerful, and effective in enabling fine-grained memory management at low cost.
VIP: Virtualizing IP chains on handheld platforms
N. Nachiappan, Haibo Zhang, Jihyun Ryoo, N. Soundararajan, A. Sivasubramaniam, M. Kandemir, Ravishankar R. Iyer, C. Das
ISCA 2015, pp. 655-667. DOI: 10.1145/2749469.2750382

Energy-efficient, user-interactive, and display-oriented applications on handhelds rely heavily on multiple accelerators (termed IP cores) to meet their periodic frame-processing needs. Further, these platforms are starting to host multiple applications concurrently on their multiple CPU cores. Unfortunately, today's hardware exposes an interface that forces the host software (Android drivers) to treat each IP core as an isolated device. Consequently, the host CPU has to get involved in (i) the processing of each frame, (ii) scheduling frames to ensure timely progress through the IP cores and meet their QoS needs, and (iii) explicitly moving data from one IP core to the next, with main memory serving as the common staging area. We show in this paper, through measurements on a Nexus 7 platform, that the frequent invocation of the CPU for processing these frames and the involvement of main memory as a data-flow conduit are serious limitations. Instead, we propose a novel IP virtualization framework (VIP) involving three key ideas that allow several IPs to be chained together and made to appear to the software as a single device. First, chaining of IPs avoids data transfer through the memory system, enhancing the throughput of flows through the IPs. Second, by using a burst mode, the CPU can initiate the processing of several frames through the virtual IP chain without getting involved (and interrupted) for each frame, thereby allowing better energy-saving and utilization opportunities. Removing the CPU from this loop requires alternate orchestration of frame flows to ensure QoS guarantees for each frame of each application. Our third enhancement in VIP creates several virtual paths, one for each flow, through these IP chains, with the hardware scheduling the frames to enforce QoS guarantees despite any contention for resources along the way. Our experimental evaluations demonstrate the effectiveness of VIP on energy consumption and QoS for multiple applications.
PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture
Junwhan Ahn, S. Yoo, O. Mutlu, Kiyoung Choi
ISCA 2015, pp. 336-348. DOI: 10.1145/2749469.2750385

Processing-in-memory (PIM) is rapidly rising as a viable solution to the memory wall crisis, rebounding from its unsuccessful attempts in the 1990s, which failed due to practicality concerns that are now alleviated by recent advances in 3D stacking technologies. However, it is still challenging to integrate PIM architectures with existing systems in a seamless manner due to two common characteristics: unconventional programming models for in-memory computation units and the lack of ability to utilize large on-chip caches. In this paper, we propose a new PIM architecture that (1) does not change the existing sequential programming models and (2) automatically decides whether to execute PIM operations in memory or on the host processor depending on the locality of data. The key idea is to implement simple in-memory computation using compute-capable memory commands and to use specialized instructions, which we call PIM-enabled instructions, to invoke in-memory computation. This allows PIM operations to be interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms with no modification. In addition, we introduce a simple hardware structure that monitors the locality of data accessed by a PIM-enabled instruction at runtime to adaptively execute the instruction at the host processor (instead of in memory) when the instruction can benefit from large on-chip caches. Consequently, our architecture provides the illusion that PIM operations are executed as if they were host-processor instructions. We provide a case study of how ten emerging data-intensive workloads can benefit from our new PIM abstraction and its hardware implementation. Evaluations show that our architecture significantly improves system performance and, more importantly, combines the best parts of conventional and PIM architectures by adapting to the data locality of applications.
Fusion: Design tradeoffs in coherent cache hierarchies for accelerators
Snehasish Kumar, Arrvindh Shriraman, Naveen Vedula
ISCA 2015, pp. 733-745. DOI: 10.1145/2749469.2750421

Chip designers have shown increasing interest in integrating specialized fixed-function coprocessors into multicore designs to improve energy efficiency. Recent work in academia [11, 37] and industry [16] has sought to enable more fine-grained offloading at the granularity of functions and loops. The sequential program now needs to migrate across the chip, utilizing the appropriate accelerator for each program region. As the execution migrates, it has become increasingly challenging to retain the temporal and spatial locality of the original program as well as to manage the data sharing. We show that with the increasing energy cost of wires and caches relative to compute operations, it is imperative to optimize data movement to retain the energy benefits of accelerators. We develop FUSION, a lightweight coherent cache hierarchy for accelerators, and study the tradeoffs compared to a scratchpad-based architecture. We find that coherence, both between the accelerators and with the CPU, can help minimize data movement and save energy. FUSION leverages temporal coherence [32] to optimize data movement within the accelerator tile. The accelerator tile includes small per-accelerator L0 caches to minimize hit energy and a per-tile shared cache to improve localized sharing between accelerators and minimize data exchanges with the host LLC. We find that overall FUSION improves performance by 4.3× compared to an oracle DMA that pushes data into the scratchpad. In workloads with inter-accelerator sharing we save up to 10× the dynamic energy of the cache hierarchy by minimizing host-accelerator data ping-ponging.
CAWA: Coordinated warp scheduling and Cache Prioritization for critical warp acceleration of GPGPU workloads
Shin-Ying Lee, A. Arunkumar, Carole-Jean Wu
ISCA 2015, pp. 515-527. DOI: 10.1145/2749469.2750418

The ubiquity of graphics processing unit (GPU) architectures has made them efficient alternatives to chip multiprocessors for parallel workloads. GPUs achieve superior performance by making use of massive multi-threading and fast context-switching to hide pipeline stalls and memory access latency. However, recent characterization results have shown that general-purpose GPU (GPGPU) applications commonly encounter long stall latencies that cannot be easily hidden even with the large number of concurrent threads/warps. This results in execution time disparity between parallel warps, hurting the overall performance of GPUs: the warp criticality problem. To tackle the warp criticality problem, we propose a coordinated solution, criticality-aware warp acceleration (CAWA), that efficiently manages compute and memory resources to accelerate the critical warp execution. Specifically, we design (1) an instruction-based and stall-based criticality predictor to identify the critical warp in a thread block, (2) a criticality-aware warp scheduler that preferentially allocates more time resources to the critical warp, and (3) a criticality-aware cache reuse predictor that assists critical warp acceleration by retaining latency-critical and useful cache blocks in the L1 data cache. CAWA aims to remove the significant execution time disparity in order to improve resource utilization for GPGPU workloads. Our evaluation results show that, under the proposed coordinated scheduler and cache prioritization management scheme, the performance of GPGPU workloads can be improved by 23%, while other state-of-the-art schedulers, the GTO and 2-level schedulers, improve performance by 16% and -2%, respectively.
Quantitative comparison of Hardware Transactional Memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8
T. Nakaike, Rei Odaira, Matthew Gaudet, Maged M. Michael, Hisanobu Tomari
ISCA 2015, pp. 144-157. DOI: 10.1145/2749469.2750403

Transactional Memory (TM) is a new programming paradigm for both simple concurrent programming and high concurrent performance. Hardware Transactional Memory (HTM) is hardware support for TM-based programming; it has lower overhead than software transactional memory (STM), which is a software-based implementation of TM. There are now four commercial systems offering HTM: IBM Blue Gene/Q, IBM zEnterprise EC12, Intel Core, and IBM POWER8. Our work is the first to compare the performance of these four HTM systems. We measured the STAMP benchmarks, the most widely used TM benchmarks, and we also evaluated the specific features of each HTM system. Our experimental results show that: (1) no single HTM system is more scalable than the others across all of the benchmarks, (2) there are measurable performance differences among the HTM systems on some benchmarks, and (3) each HTM system has its own implementation characteristics that limit its scalability.