
2011 International Conference on Parallel Architectures and Compilation Techniques — Latest Publications

Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs
Meng-Ju Wu, D. Yeung
Reuse distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, though, is that multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, preventing them from being used to analyze different processor configurations. For loop-based parallel programs, CRD profiles shift coherently to larger CRD values as core count scales because the interleaved threads are symmetric. Simple techniques can predict such shifting, making it feasible to analyze numerous multicore configurations from a small set of measured CRD profiles. Given the ubiquity and scalability of loop-level parallelism, such techniques will be extremely valuable for studying future large multicore designs. This paper investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide an in-depth analysis of how CRD profiles change with core count scaling. Second, we develop techniques to predict CRD profile scaling, in particular employing reference groups to predict coherent shift, and evaluate prediction accuracy. Third, we show that core count scaling only degrades performance for last-level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 64 -- 128MB if problem size scales by 64x. Finally, we apply CRD profiles to analyze multicore cache performance. When combined with existing problem-scaling prediction, our techniques can predict LLC MPKI to within 11.1% of simulation across 1,728 configurations using only 36 measured CRD profiles.
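The coherent shift the abstract describes can be seen in a toy trace-based model. The sketch below (hypothetical two-element traces, not the authors' tooling) measures per-thread reuse distance and then the concurrent reuse distance of a round-robin interleaving of two symmetric threads: a per-thread distance d becomes roughly 2d+1 under two-way interleaving.

```python
def reuse_distances(trace):
    """Reuse distance per access: the number of distinct addresses touched
    since the previous access to the same address (-1 for a cold access)."""
    stack = []   # LRU stack, most-recently-used address last
    dists = []
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            dists.append(len(stack) - 1 - i)  # distinct addrs above it
            stack.pop(i)
        else:
            dists.append(-1)
        stack.append(addr)
    return dists

def interleave(*traces):
    """Round-robin interleaving of equal-length per-thread traces."""
    out = []
    for group in zip(*traces):
        out.extend(group)
    return out

# Two symmetric threads touching disjoint data: a per-thread reuse
# distance of 1 becomes a CRD of 3 under interleaving -- the coherent
# shift to larger values as core count grows.
t0, t1 = [0, 1, 0], [10, 11, 10]
print(reuse_distances(t0))                  # [-1, -1, 1]
print(reuse_distances(interleave(t0, t1)))  # [-1, -1, -1, -1, 3, 3]
```

With symmetric threads the shift is predictable from the single-thread profile alone, which is what makes analyzing many core counts from few measured profiles feasible.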
DOI: 10.1145/2427631.2427632
Citations: 62
Making STMs Cache Friendly with Compiler Transformations
Sandya Mannarswamy, R. Govindarajan
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. For STMs to be adopted widely for performance-critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with data cache stall cycles contributing more than 50% of the execution cycles in a majority of the benchmarks. We find that, on average, misses occurring inside the STM account for 62% of the total data cache miss latency cycles experienced by the applications, and that cache performance is impacted adversely by certain inherent characteristics of the STM itself. These observations motivate us to propose a set of specific compiler transformations targeted at making STMs cache friendly. We find that STM's fine-grained and application-unaware locking is a major contributor to its poor cache behavior, and hence propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint-access parallel suffer costly coherence misses caused by the centralized global time stamp updates, and hence propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications, reducing data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.
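The coherence pressure of a centralized global time stamp, and the relief from partitioning it, can be illustrated with a toy counting model (a sketch with invented names, not the paper's STM): under a single global clock every commit bumps the same shared counter, while per-partition timestamps spread those writes across independent counters (and hence independent cache lines).

```python
from collections import Counter

def commit_clock_writes(commits, partitions=1):
    """Count writes per timestamp counter. partitions=1 models a single
    global version clock; larger values model per-partition timestamps,
    where a commit only bumps the clock of the partition it touched."""
    writes = Counter()
    for addr in commits:
        writes[addr % partitions] += 1  # each commit bumps one counter
    return writes

commits = list(range(64))  # 64 committing transactions, spread over data

# Global clock: all 64 commit-time writes hit one shared counter,
# each one potentially invalidating the line in every other core's cache.
print(commit_clock_writes(commits, partitions=1))

# Eight per-partition timestamps: 8 writes per counter, so far fewer
# coherence misses on any single line.
print(commit_clock_writes(commits, partitions=8))
```

This is only the counting argument behind SPTS; the actual transformation must also decide which transactions can safely use a partition-local timestamp.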
DOI: 10.1109/PACT.2011.55
Citations: 10
Speculative Parallelization in Decoupled Look-ahead
Alok Garg, Raj Parihar, Michael C. Huang
While a canonical out-of-order engine can effectively exploit implicit parallelism in sequential programs, its effectiveness is often hindered by instruction and data supply imperfections, manifested as branch mispredictions and cache misses. Accurate, deep look-ahead guided by a slice of the executed program is a simple yet effective approach to mitigate the performance impact of branch mispredictions and cache misses. Unfortunately, slice-guided look-ahead is often limited by the speed of the look-ahead code slice, especially for irregular programs. In this paper, we attempt to speed up the look-ahead agent using speculative parallelization, which is especially suited to the task. First, slicing for look-ahead tends to remove important data dependences that would prohibit successful speculative parallelization. Second, the look-ahead task is not correctness critical and thus naturally tolerates dependence violations. This enables an implementation to forgo violation detection altogether, simplifying architectural support tremendously. In a straightforward implementation, incorporating speculative parallelization into the look-ahead agent further improves system performance by up to 1.39x, with an average of 1.13x.
DOI: 10.1109/PACT.2011.72
Citations: 13
rPRAM: Exploring Redundancy Techniques to Improve Lifetime of PCM-based Main Memory
Jie Chen, Zachary Winter, Guru Venkataramani, H. H. Huang
Future main memory systems will confront the scaling challenges posed by DRAM technology and should adapt to emerging memory technologies such as Phase Change Memory (PCM, or PRAM). PCM offers advantages such as storage density, non-volatility, and lower energy consumption, but it is constrained by limited write endurance and reduced performance. In this paper, we propose a novel PCM-based main memory system, rPRAM, that explores advanced redundancy techniques to resuscitate faulty PCM pages and reuse these pages to store data. Our preliminary experiments show that rPRAM has the potential to extend the lifetime of PCM-based memory commensurate with existing schemes like ECP, while incurring only a negligible fraction of ECP's hardware cost.
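One general way redundancy can resuscitate faulty pages (a sketch of the idea, not necessarily rPRAM's exact mechanism) is to pair two faulty pages whose fault maps do not overlap: every byte offset is then intact in at least one page of the pair, so the pair can back one logical page.

```python
def compatible(faults_a, faults_b):
    """Two faulty pages can back one logical page if no byte offset is
    broken in both (fault maps are sets of failed byte offsets)."""
    return not (faults_a & faults_b)

def pair_pages(fault_maps):
    """Greedily pair faulty pages with non-overlapping fault maps.
    Pages left unpaired stay retired."""
    pairs, free = [], list(range(len(fault_maps)))
    while free:
        a = free.pop(0)
        for b in free:
            if compatible(fault_maps[a], fault_maps[b]):
                free.remove(b)
                pairs.append((a, b))
                break
    return pairs

# Pages 0 and 1 fail at different offsets, so they can be paired;
# page 2 overlaps page 0 at offset 3 and page 1 at offset 5.
maps = [{3, 9}, {5, 12}, {3, 5}]
print(pair_pages(maps))  # [(0, 1)]
```

A greedy pairing is only a baseline; better matchings recover more pages, at the cost of more bookkeeping.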
DOI: 10.1109/PACT.2011.40
Citations: 7
CriticalFault: Amplifying Soft Error Effect Using Vulnerability-Driven Injection
Xin Xu, Man-Lap Li
As future microprocessors will be prone to various types of errors, researchers have looked into cross-layer hardware-software reliability solutions to reduce overheads. These mechanisms are shown to be effective when evaluated with statistical fault injection (SFI). However, under SFI, a large number of injected faults can be derated, making the evaluation less rigorous. To handle this problem, we propose a biased fault injection framework called CriticalFault that leverages vulnerability analysis to identify faults that are more likely to stress-test the underlying reliability solution. Our experimental results show that the injection space is reduced by 30% and that a large portion of injected faults cause software aborts and silent data corruptions. Overall, CriticalFault allows us to amplify soft error effects on the reliability mechanism under test, which can help improve current techniques or inspire new fault-tolerant mechanisms.
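Biasing injection toward vulnerable state can be sketched as a filter over candidate fault sites. The names and scores below are hypothetical, and the paper's actual vulnerability analysis is far more involved; the sketch only shows how a score threshold shrinks the injection space toward faults unlikely to be derated.

```python
def prune_injection_space(sites, vulnerability, threshold=0.5):
    """Keep only fault sites whose vulnerability score suggests the
    injected fault is unlikely to be masked (derated)."""
    return [s for s in sites if vulnerability[s] >= threshold]

# Hypothetical sites: register bits with per-bit vulnerability scores.
sites = ["r1.b0", "r1.b1", "r2.b0", "r2.b1"]
vuln = {"r1.b0": 0.9, "r1.b1": 0.1, "r2.b0": 0.7, "r2.b1": 0.2}

kept = prune_injection_space(sites, vuln)
print(kept)  # ['r1.b0', 'r2.b0'] -- a smaller, higher-yield campaign
```

Every injection then lands on state that is likely live, so a larger fraction of runs exercises the reliability mechanism instead of being silently masked.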
DOI: 10.1109/PACT.2011.25
Citations: 2
StVEC: A Vector Instruction Extension for High Performance Stencil Computation
N. Sedaghati, Renji Thomas, L. Pouchet, R. Teodorescu, P. Sadayappan
Stencil computations comprise the compute-intensive core of many scientific applications. The data access pattern of stencil computations often requires several adjacent data elements of arrays to be accessed in innermost parallel loops. Although such loops are vectorized by current compilers like GCC and ICC that target short-vector SIMD instruction sets, a number of redundant loads or additional intra-register data shuffle operations are required, reducing the achievable performance. Thus, even when all arrays are cache resident, the peak performance achieved with stencil computations is considerably lower than machine peak. In this paper, we present a hardware-based solution to this problem: an extension to the standard addressing mode of vector floating-point instructions in ISAs such as SSE, AVX, and VMX. We propose an extended mode of paired-register addressing and its hardware implementation to overcome this performance limitation of current short-vector SIMD ISAs for stencil computations. Further, we present a code generation approach that can be used by a vectorizing compiler for processors with such an instruction set. Using an optimistic as well as a pessimistic emulation of the proposed instruction extension, we demonstrate the effectiveness of the proposed approach on top of SSE- and AVX-capable processors. We also synthesize parts of the proposed design using a 45nm CMOS library and show minimal impact on processor cycle time.
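The access pattern that motivates paired-register addressing already shows up in a scalar 1D stencil: each iteration touches elements that neighboring iterations just loaded. A minimal sketch (a generic 3-point stencil, not StVEC itself) makes the overlap concrete.

```python
def stencil_1d(a):
    """3-point stencil: out[i] = a[i-1] + a[i] + a[i+1].
    Adjacent iterations share two of their three loads; when this loop
    is mapped onto short-vector SIMD, that overlap becomes redundant
    loads or intra-register shuffles."""
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]

print(stencil_1d([1, 2, 3, 4, 5]))  # [6, 9, 12]
```

Under SSE-style vectorization, computing four outputs needs the vectors a[0:4], a[1:5], and a[2:6]; the latter two are unaligned, overlapping reloads of data already in registers, which is exactly what the proposed paired-register addressing avoids.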
DOI: 10.1109/PACT.2011.59
Citations: 15
A Compiler-assisted Runtime-prefetching Scheme for Heterogenous Platforms 面向异构平台的编译器辅助运行时预取方案
Pub Date: 2011-10-10 · DOI: 10.1007/978-3-642-30961-8_9
Li Chen, B. Shou, Xionghui Hou, Lei Huang
Citations: 5
Optimizing Data Layouts for Parallel Computation on Multicores
Yuanrui Zhang, W. Ding, Jun Liu, M. Kandemir
The emergence of multicore platforms offers several opportunities for boosting application performance. These opportunities, which include parallelism and data locality benefits, require strong support from compilers as well as operating systems. Current compiler research targeting multicores mostly focuses on code restructuring and mapping. In this work, we explore automatic data layout transformation targeting multithreaded applications running on multicores. Our transformation considers both the data access patterns exhibited by different threads of a multithreaded application and the on-chip cache topology of the target multicore architecture. It automatically determines a customized memory layout for each target array to minimize potential cache conflicts across threads. Our experiments show that our optimization brings significant benefits over state-of-the-art data locality optimization strategies when tested using 30 benchmark programs on an Intel multicore machine. The results also indicate that the strategy scales to larger core counts and performs better as data set sizes increase.
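The kind of cross-thread conflict such a layout transformation avoids can be modeled with set-index arithmetic. This is a toy direct-mapped cache with illustrative sizes, not the paper's model: two threads stream arrays whose bases are exactly one cache size apart, so every pair of corresponding lines maps to the same set until the layout is padded.

```python
LINE, SETS = 64, 512  # 64-byte lines, 512 sets: a 32KB direct-mapped cache

def cache_set(addr):
    """Set index of a byte address in the modeled cache."""
    return (addr // LINE) % SETS

def conflicts(base0, base1):
    """Count corresponding lines of two 32KB arrays mapping to the same set."""
    return sum(cache_set(base0 + off) == cache_set(base1 + off)
               for off in range(0, SETS * LINE, LINE))

# Bases exactly one cache size apart: every line of thread 1's array
# evicts the line thread 0 just brought in.
print(conflicts(0, SETS * LINE))         # 512

# Padding the second array by a single line shifts its set mapping
# and removes every conflict.
print(conflicts(0, SETS * LINE + LINE))  # 0
```

Choosing such paddings (and interleavings) per array, guided by thread access patterns and the cache topology, is the essence of a conflict-avoiding layout.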
DOI: 10.1109/PACT.2011.20
Citations: 25
Exploiting Rank Idle Time for Scheduling Last-Level Cache Writeback
Zhe Wang, Daniel A. Jiménez
We propose a predictor-guided last-level cache (LLC) writeback technique. The technique uses a predictor to anticipate when a rank will have significant idle time, and scheduled dirty cache blocks are written back during these idle rank periods. Write-induced interference is significantly reduced by our technique.
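The scheduling idea can be sketched as draining a dirty-block queue only during idle slots. In this toy model the "predictor" is simply an oracle list of idle slots and the drain rate is an assumed constant; the real technique must predict idleness and respect DRAM timing.

```python
def schedule_writebacks(dirty_queue, idle_slots, per_slot=2):
    """Drain up to per_slot dirty blocks during each predicted-idle slot,
    so writebacks never contend with demand reads. Blocks that do not
    fit remain queued for later idle periods."""
    plan = {}
    q = list(dirty_queue)
    for slot in idle_slots:
        burst, q = q[:per_slot], q[per_slot:]
        if burst:
            plan[slot] = burst
    return plan

dirty = ["A", "B", "C", "D", "E"]
print(schedule_writebacks(dirty, idle_slots=[3, 7, 8]))
# {3: ['A', 'B'], 7: ['C', 'D'], 8: ['E']}
```

Because every write lands in a slot where the rank would otherwise sit idle, write-to-read turnaround penalties on the demand path are avoided.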
DOI: 10.1109/PACT.2011.43
Citations: 1
Sampling Temporal Touch Hint (STTH) Inclusive Cache Management Policy
Yingying Tian, Daniel A. Jiménez
DOI: 10.1109/PACT.2011.42
Citations: 1