Making STMs Cache Friendly with Compiler Transformations

2011 International Conference on Parallel Architectures and Compilation Techniques. Pub Date: 2011-10-10. DOI: 10.1109/PACT.2011.55

Sandya Mannarswamy, R. Govindarajan


Citations: 10

Abstract

Software transactional memory (STM) is a promising programming paradigm for shared-memory multithreaded programs. For STMs to be adopted widely in performance-critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date of the cache behavior of STM applications, and we quantify the impact of different STM factors on the cache misses these applications experience. We find that STMs are not cache friendly: data cache stall cycles account for more than 50% of execution cycles in a majority of the benchmarks. On average, misses occurring inside the STM account for 62% of the total data cache miss latency cycles experienced by the applications, and cache performance is adversely affected by certain inherent characteristics of the STM itself. These observations motivate a set of compiler transformations targeted at making STMs cache friendly. We find that the STM's fine-grained, application-unaware locking is a major contributor to its poor cache behavior; hence we propose selective Lock Data Co-location (LDC) and Redundant Lock Access Removal (RLAR) to address lock access misses. We also find that even transactions that are completely disjoint-access parallel suffer from costly coherence misses caused by updates to the centralized global time stamp; hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations improve the cache behavior of STM applications, reducing data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the eight STAMP applications.
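The intuition behind lock-data co-location and per-partition time stamps can be illustrated with a toy model. The Python sketch below is not the paper's implementation and says nothing about its measured results. It assumes 64-byte cache lines, a separate address-hashed lock table, and per-partition counters that each sit in their own cache line. All names and constants (`LINE`, `LOCK_TABLE_BASE`, `lines_separate`, `timestamp_transfers`, and so on) are hypothetical and chosen only for illustration.

```python
LINE = 64                    # assumed cache-line size in bytes
LOCK_TABLE_BASE = 1 << 20    # assumed base address of a separate lock table
LOCK_SIZE = 8                # assumed size of one lock word in bytes
NUM_LOCKS = 1024             # assumed number of lock-table entries


def line_of(addr):
    """Return the index of the cache line holding a byte address."""
    return addr // LINE


def lock_addr(data_addr):
    """Map a data address to its lock-table entry (word-granularity hash assumed)."""
    slot = (data_addr // 8) % NUM_LOCKS
    return LOCK_TABLE_BASE + slot * LOCK_SIZE


def lines_separate(data_addrs):
    """Distinct lines touched when every access also reads a lock from the table."""
    lines = set()
    for addr in data_addrs:
        lines.add(line_of(addr))
        lines.add(line_of(lock_addr(addr)))
    return len(lines)


def lines_colocated(data_addrs):
    """Distinct lines touched when the lock word shares the data's cache line."""
    return len({line_of(addr) for addr in data_addrs})


def timestamp_transfers(commits, per_partition):
    """Count ownership transfers of time-stamp cache lines.

    commits is a list of (thread_id, partition_id) pairs in commit order.
    A transfer is counted whenever a counter was last written by a different
    thread; the first write to a counter is not counted.
    """
    last_writer = {}
    transfers = 0
    for thread, partition in commits:
        key = partition if per_partition else "global"
        prev = last_writer.get(key)
        if prev is not None and prev != thread:
            transfers += 1
        last_writer[key] = thread
    return transfers


if __name__ == "__main__":
    addrs = [0, 64, 128, 192]          # four accesses, one per cache line
    print(lines_separate(addrs), lines_colocated(addrs))

    # Four threads commit round-robin, each working only in its own partition.
    commits = [(t, t) for _ in range(100) for t in range(4)]
    print(timestamp_transfers(commits, per_partition=False),
          timestamp_transfers(commits, per_partition=True))
```

In this model, the separate lock table doubles the number of lines touched for the four accesses, and the single global counter changes owner on almost every commit even though the threads never share data. Real STMs and real coherence protocols are far more involved, so the numbers only show the direction of the effect.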


