Near Memory Data Structure Rearrangement

M. Gokhale, G. S. Lloyd, C. Hajas
{"title":"近内存数据结构重排","authors":"M. Gokhale, G. S. Lloyd, C. Hajas","doi":"10.1145/2818950.2818986","DOIUrl":null,"url":null,"abstract":"As CPU core counts continue to increase, the gap between compute power and available memory bandwidth has widened. A larger and deeper cache hierarchy benefits locality-friendly computation, but offers limited improvement to irregular, data intensive applications. In this work we explore a novel approach to accelerating these applications through in-memory data restructuring. Unlike other proposed processing-in-memory architectures, the rearrangement hardware performs data reduction, not compute offload. Using a custom FPGA emulator, we quantitatively evaluate performance and energy benefits of near-memory hardware structures that dynamically restructure in-memory data to cache-friendly layout, minimizing wasted memory bandwidth. Our results on representative irregular benchmarks using the Micron Hybrid Memory Cube memory model show speedup, bandwidth savings, and energy reduction. We present an API for the near-memory accelerator and describe the interaction between the CPU and the rearrangement hardware with application examples. The merits of an SRAM vs. a DRAM scratchpad buffer for rearranged data are explored.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"40","resultStr":"{\"title\":\"Near memory data structure rearrangement\",\"authors\":\"M. Gokhale, G. S. Lloyd, C. Hajas\",\"doi\":\"10.1145/2818950.2818986\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As CPU core counts continue to increase, the gap between compute power and available memory bandwidth has widened. A larger and deeper cache hierarchy benefits locality-friendly computation, but offers limited improvement to irregular, data intensive applications. In this work we explore a novel approach to accelerating these applications through in-memory data restructuring. Unlike other proposed processing-in-memory architectures, the rearrangement hardware performs data reduction, not compute offload. Using a custom FPGA emulator, we quantitatively evaluate performance and energy benefits of near-memory hardware structures that dynamically restructure in-memory data to cache-friendly layout, minimizing wasted memory bandwidth. Our results on representative irregular benchmarks using the Micron Hybrid Memory Cube memory model show speedup, bandwidth savings, and energy reduction. We present an API for the near-memory accelerator and describe the interaction between the CPU and the rearrangement hardware with application examples. The merits of an SRAM vs. 
a DRAM scratchpad buffer for rearranged data are explored.\",\"PeriodicalId\":389462,\"journal\":{\"name\":\"Proceedings of the 2015 International Symposium on Memory Systems\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-10-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"40\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2015 International Symposium on Memory Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2818950.2818986\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 International Symposium on Memory Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2818950.2818986","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 40

Abstract

As CPU core counts continue to increase, the gap between compute power and available memory bandwidth has widened. A larger and deeper cache hierarchy benefits locality-friendly computation, but offers limited improvement to irregular, data intensive applications. In this work we explore a novel approach to accelerating these applications through in-memory data restructuring. Unlike other proposed processing-in-memory architectures, the rearrangement hardware performs data reduction, not compute offload. Using a custom FPGA emulator, we quantitatively evaluate performance and energy benefits of near-memory hardware structures that dynamically restructure in-memory data to cache-friendly layout, minimizing wasted memory bandwidth. Our results on representative irregular benchmarks using the Micron Hybrid Memory Cube memory model show speedup, bandwidth savings, and energy reduction. We present an API for the near-memory accelerator and describe the interaction between the CPU and the rearrangement hardware with application examples. The merits of an SRAM vs. a DRAM scratchpad buffer for rearranged data are explored.
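The abstract describes an API through which the CPU hands an irregular access pattern to the rearrangement hardware, which packs the referenced elements into a scratchpad so the CPU can then stream over a dense, cache-friendly copy. The sketch below emulates that idea purely in software to illustrate the access-pattern transformation; the function names (gather_to_scratchpad, sum_subset) and the gather-then-stream structure are illustrative assumptions, not the paper's actual API, and in the real design the gather runs in hardware near memory so that only useful bytes cross the memory bus.

/*
 * Software-only sketch of near-memory data rearrangement (assumed names,
 * not the paper's API): scattered elements are first packed into a dense
 * buffer -- the job the paper assigns to hardware beside the memory --
 * and the CPU then traverses the packed copy with unit-stride accesses.
 */
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>

/* Emulates the rearrangement engine: gather data[idx[i]] into a packed buffer. */
static double *gather_to_scratchpad(const double *data,
                                    const uint32_t *idx, size_t n)
{
    double *packed = malloc(n * sizeof *packed);
    for (size_t i = 0; i < n; i++)
        packed[i] = data[idx[i]];      /* performed near memory in the real design */
    return packed;
}

/* Irregular kernel rewritten to consume the rearranged, dense view. */
static double sum_subset(const double *data, const uint32_t *idx, size_t n)
{
    double *packed = gather_to_scratchpad(data, idx, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)     /* unit-stride, cache-friendly traversal */
        acc += packed[i];
    free(packed);
    return acc;
}

int main(void)
{
    enum { N = 1 << 20, M = 1024 };
    double *data = malloc(N * sizeof *data);
    uint32_t *idx = malloc(M * sizeof *idx);
    for (size_t i = 0; i < N; i++) data[i] = (double)i;
    for (size_t i = 0; i < M; i++) idx[i] = (uint32_t)((i * 7919u) % N);  /* scattered indices */

    printf("sum = %f\n", sum_subset(data, idx, M));
    free(idx);
    free(data);
    return 0;
}

In the hardware version the packed buffer would live in the accelerator's SRAM or DRAM scratchpad rather than in host heap memory, which is the trade-off the abstract says the paper evaluates.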