Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads

A. Laer, William Wang, C. D. Emmons
{"title":"Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads","authors":"A. Laer, William Wang, C. D. Emmons","doi":"10.1145/2818950.2818980","DOIUrl":null,"url":null,"abstract":"With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors gets deeper, and the cache size gets bigger. However, the cacheline size remained relatively constant over the last decade in mobile application processors. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh, by looking at inefficiencies in the cache hierarchy which tend to be exacerbated when increasing the cacheline size: false sharing and cacheline utilization. Firstly, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when non-shared data structures, mapped onto the same cacheline, are being accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been found in various places such as scientific workloads and real applications. We find that whilst increasing the cacheline size does increase false sharing, it still is negligible when compared to known cases of false sharing in scientific workloads, due to the limited level of thread-level parallelism in mobile workloads. Secondly, we look at cacheline utilization which measures the number of bytes in a cacheline actually used by the processor. This effect has been investigated under various names for a multitude of server and desktop applications. As a low cacheline utilization implies that very little of the fetched cachelines was used by the processor, this causes waste in bandwidth and energy in moving data across the memory hierarchy. The energy cost associated with data movements is much higher compared to logic operations, increasing the need for cache efficiency, especially in the case of an energy-constrained platform like a mobile device. We find that the cacheline utilization of mobile workloads is low in general, decreasing when increasing the cacheline size. When increasing the cacheline size from 64 bytes to 128 bytes, the number of misses will be reduced by 10%--30%, depending on the workload. However, because of the low cacheline utilization, this more than doubles the amount of unused traffic to the L1 caches. Using the cacheline utilization as a metric in this way, illustrates an important point. If a change in cacheline size would only be assessed on its local effects, we find that this change in cacheline size will only have advantages as the miss rate decreases. However, at system level, this change will increase the stress on the bus and increase the amount of wasted energy due to unused traffic. 
Using cacheline utilization as a metric underscores the need for system-level research when changing characteristics of the cache hierarchy.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 International Symposium on Memory Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2818950.2818980","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors has grown deeper and the caches larger. The cacheline size, however, has remained relatively constant over the last decade. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh by examining two inefficiencies in the cache hierarchy that tend to be exacerbated at larger cacheline sizes: false sharing and low cacheline utilization.

First, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when logically unshared data structures mapped onto the same cacheline are accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. It has been observed in settings ranging from scientific workloads to real applications. We find that while increasing the cacheline size does increase false sharing, the effect remains negligible compared to known cases in scientific workloads, owing to the limited thread-level parallelism of mobile workloads.

Second, we examine cacheline utilization, which measures the number of bytes in a cacheline actually used by the processor; this effect has been investigated under various names for a multitude of server and desktop applications. Low utilization means that little of each fetched cacheline is used, wasting bandwidth and energy as data moves across the memory hierarchy. Because data movement costs far more energy than logic operations, cache efficiency matters especially on an energy-constrained platform such as a mobile device. We find that the cacheline utilization of mobile workloads is low in general and decreases as the cacheline size increases. Increasing the cacheline size from 64 bytes to 128 bytes reduces the number of misses by 10%-30%, depending on the workload, but because of the low cacheline utilization it more than doubles the amount of unused traffic to the L1 caches.

Used as a metric in this way, cacheline utilization illustrates an important point. Assessed only on its local effects, a larger cacheline appears purely beneficial, since the miss rate decreases. At the system level, however, the change increases the stress on the bus and the energy wasted on unused traffic. Using cacheline utilization as a metric thus underscores the need for system-level research when changing characteristics of the cache hierarchy.
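The false sharing described in the abstract is easy to reproduce. The following is a minimal C++ sketch, not taken from the paper: two threads increment logically independent counters, first laid out on the same cacheline and then padded apart with alignas(64). On most multicore machines the padded layout runs noticeably faster, because the cores stop invalidating each other's copy of the line on every write.

```cpp
// Minimal false-sharing demonstration (illustrative sketch, not from the paper).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr long kIters = 50'000'000;

struct SharedLine {                // both counters likely share one 64-byte line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedLine {                // each counter gets its own cacheline
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Layout>
double run() {
    Layout c;
    auto t0 = std::chrono::steady_clock::now();
    // Each thread updates its own counter; the data is logically unshared,
    // yet in SharedLine the writes contend for the same cacheline.
    std::thread t1([&] { for (long i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("same cacheline: %.2f s\n", run<SharedLine>());
    std::printf("padded:         %.2f s\n", run<PaddedLine>());
}
```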
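The paper does not publish its measurement code; as an illustration of the cacheline-utilization metric it studies, the sketch below replays a toy byte-address trace and computes utilization as distinct bytes touched divided by bytes fetched, assuming each distinct line is fetched exactly once and ignoring evictions. The sparse strided trace in main is an assumed access pattern, chosen only to show utilization falling as the line grows.

```cpp
// Toy model of cacheline utilization (assumed definition: bytes actually
// touched in a fetched line / line size). Real studies instrument a simulator.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

double utilization(const std::vector<uint64_t>& byte_addrs, unsigned line_size) {
    // One bit per byte of each distinct line touched by the trace.
    std::unordered_map<uint64_t, std::vector<bool>> touched;
    for (uint64_t addr : byte_addrs) {
        auto& mask = touched[addr / line_size];
        if (mask.empty()) mask.resize(line_size, false);
        mask[addr % line_size] = true;
    }
    uint64_t used = 0;
    for (const auto& kv : touched)
        for (bool b : kv.second) used += b;
    return static_cast<double>(used) / (touched.size() * static_cast<double>(line_size));
}

int main() {
    // Sparse trace: 8 consecutive bytes touched out of every 128-byte block.
    std::vector<uint64_t> trace;
    for (uint64_t base = 0; base < 8192; base += 128)
        for (uint64_t off = 0; off < 8; ++off) trace.push_back(base + off);
    // Utilization halves each time the line doubles past the touched run:
    // 0.250 at 32 B, 0.125 at 64 B, 0.0625 at 128 B.
    for (unsigned ls : {32u, 64u, 128u})
        std::printf("line size %3u B -> utilization %.4f\n", ls, utilization(trace, ls));
}
```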
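The tension between fewer misses and more unused traffic can be made concrete with back-of-the-envelope arithmetic. The utilization figures below are illustrative assumptions, not the paper's measurements; only the 30% miss reduction comes from the abstract.

```latex
% M    = number of misses with 64 B lines (each miss fetches one line)
% T_s  = L1 fill traffic at line size s = misses * s
% U_s  = unused traffic = T_s * (1 - u_s), where u_s is utilization
% Assumptions: misses drop 30% at 128 B (the abstract's best case);
% u_64 = 0.50 and u_128 = 0.30 are illustrative, not measured values.
\begin{align*}
  \frac{T_{128}}{T_{64}} &= \frac{(0.7M)(128)}{M \cdot 64} = 1.4 \\
  \frac{U_{128}}{U_{64}} &= 1.4 \cdot \frac{1 - 0.30}{1 - 0.50} \approx 1.96
\end{align*}
```

Even at the best-case end of the reported 10%-30% miss reduction, total fill traffic grows by 40% and the unused portion roughly doubles under these assumed utilizations; a smaller miss reduction or lower utilization pushes it further, consistent with the abstract's system-level concern.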