Multi-execution: multicore caching for data-similar executions

Proceedings. International Symposium on Computer Architecture. Pub Date: 2009-06-15. DOI: 10.1145/1555754.1555777

Susmit Biswas, D. Franklin, Alan Savage, Ryan Dixon, T. Sherwood, F. Chong

Pages: 164-173

Citations: 48

Abstract




As microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it "multi-execution." Often, each such instance uses very similar data. In conventional cache hierarchies, each instance caches its own data independently. We propose the Mergeable cache architecture, which detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This reduces off-chip memory accesses and overall power usage and increases application performance. We present cycle-accurate simulation results for 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and yields significant speedups by reducing main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache achieves an average speedup of 2.5x (ranging from 0.93x to 6.92x), while adding an overhead of only 4.28% in cache area and 5.21% in power when it is in use.
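The storage saving from merging similar blocks can be illustrated with a toy model. The Python sketch below is not the Mergeable cache design from the paper; it only counts how many distinct blocks a shared cache would need to hold with and without merging. The merge rule (same address and byte-identical content), the 64-byte block size, and the helper names `conventional_blocks`, `merged_blocks`, and `make_instance` are all assumptions made for illustration.

```python
from collections import defaultdict

BLOCK_SIZE = 64  # bytes per cache block; an assumed value for illustration


def conventional_blocks(instances):
    """Count blocks when every instance caches its own copy independently."""
    return sum(len(memory) for memory in instances)


def merged_blocks(instances):
    """Count blocks when identical (address, content) pairs share one copy.

    Hypothetical merge rule: two blocks merge only if they sit at the same
    address in different instances and hold byte-identical data.
    """
    sharers = defaultdict(set)  # (address, content) -> ids of sharing instances
    for instance_id, memory in enumerate(instances):
        for address, content in memory.items():
            sharers[(address, content)].add(instance_id)
    return len(sharers)


def make_instance(seed_value, num_blocks=8, private_blocks=2):
    """Build a toy address -> block mapping for one program instance.

    Most blocks are identical across instances (shared input data); the last
    `private_blocks` blocks differ per instance (instance-specific state).
    """
    memory = {}
    for i in range(num_blocks):
        address = i * BLOCK_SIZE
        if i < num_blocks - private_blocks:
            memory[address] = bytes([i]) * BLOCK_SIZE
        else:
            memory[address] = bytes([seed_value]) * BLOCK_SIZE
    return memory


# Eight instances of the "same program", differing only in two private blocks.
instances = [make_instance(seed_value=100 + k) for k in range(8)]
print(conventional_blocks(instances))  # 64
print(merged_blocks(instances))        # 22
```

In this toy setup, eight instances with six shared and two private blocks each need 64 block slots when cached independently but only 22 when identical blocks are merged. The model ignores tags, replacement policy, and the handling of writes to merged blocks, all of which a real design has to address.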
