
2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Collaborative Caching for Unknown Cache Sizes
Xiaoming Gu
A number of hardware systems have been built or proposed to provide an interface through which software can influence cache management. The combined software-hardware solution is called collaborative caching. Our previous work showed that, in theory, collaborative caching with LRU and MRU may enable a program to manage the cache optimally. In this work we first present a prioritized LRU model: for each memory access, a program specifies a priority, the target cache position for the accessed datum, for all cache sizes. We prove that prioritized LRU satisfies the inclusion property. In addition, we describe a dynamic cache control scheme based on the associated priority. This removes the limitation of our earlier LRU-MRU collaborative caching work, which required the cache size to be known.
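To make the priority interface concrete, here is a minimal, hypothetical Python sketch, not the paper's model or proof: each access carries a priority naming the stack depth at which the accessed datum should be placed, so priority 0 reproduces classic LRU insertion and a bottom-of-stack priority acts as an MRU-style evict-me hint. The inclusion property is proved in the paper for their model; the assert below only spot-checks it on one toy trace.

```python
def access(stack, datum, priority, cache_size):
    """Simulate one access to a priority-hinted LRU stack; return True on a hit."""
    hit = datum in stack
    if hit:
        stack.remove(datum)
    stack.insert(min(priority, len(stack)), datum)  # place at the hinted depth
    if len(stack) > cache_size:
        stack.pop()                                 # evict the bottom entry
    return hit

# One priority per access, shared by all cache sizes: the smaller cache's
# contents stay a subset of the larger cache's contents on this trace.
trace = [("a", 0), ("b", 0), ("c", 2), ("a", 0)]    # (datum, priority)
small, large = [], []
for d, p in trace:
    access(small, d, p, cache_size=2)
    access(large, d, p, cache_size=3)
    assert set(small) <= set(large)
```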
{"title":"Collaborative Caching for Unknown Cache Sizes","authors":"Xiaoming Gu","doi":"10.1109/PACT.2011.50","DOIUrl":"https://doi.org/10.1109/PACT.2011.50","url":null,"abstract":"A number of hardware systems have been built or proposed to provide an interface for software to influence cache management. The combined software-hardware solution is called collaborative caching. Our previous work showed that in theory collaborative caching with LRU and MRU may enable a program to manage cache optimally. In this work we first present a prioritized LRU model. For each memory access, a program specifies a priority, the target cache position for the accessed datum, for all cache sizes. We have proved that the prioritized LRU holds inclusion property. Alternatively, we describe a dynamic cache control scheme based on the associated priority. The limitation of knowing cache size in our LRU-MRU collaborative caching work is removed.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121205192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems
Gokcen Kestor, R. Gioiosa, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero
Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than on maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage, especially at scale. In this paper we present STM2, a novel parallel STM designed for high-performance, aggressive multithreading systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping, and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread than in a traditional multiprocessor system. Our results, obtained on an IBM POWER7 machine, a state-of-the-art, aggressively multithreaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems on average, with peaks up to 12.8x.
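The offloading idea can be sketched in a few lines of Python; this is a hypothetical illustration of the division of labor, not IBM's STM2 implementation, and the per-location version table and sentinel protocol below are invented for the example. The application thread records each transactional read and hands it to a paired auxiliary thread, which validates version numbers concurrently instead of leaving all validation for commit time.

```python
import threading, queue

versions = {"x": 1, "y": 1}            # hypothetical per-location version locks
read_set = queue.Queue()
valid = threading.Event()
valid.set()

def auxiliary():
    """Paired helper: validates read-set entries while the app thread runs."""
    while True:
        item = read_set.get()
        if item is None:               # shutdown sentinel
            return
        loc, seen_version = item
        if versions[loc] != seen_version:
            valid.clear()              # conflict detected: abort the transaction

aux = threading.Thread(target=auxiliary)
aux.start()

# Application thread: each transactional read offloads its validation.
for loc in ("x", "y"):
    seen = versions[loc]
    read_set.put((loc, seen))

read_set.put(None)
aux.join()
print("commit" if valid.is_set() else "abort")
```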
{"title":"STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems","authors":"Gokcen Kestor, R. Gioiosa, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero","doi":"10.1109/PACT.2011.54","DOIUrl":"https://doi.org/10.1109/PACT.2011.54","url":null,"abstract":"Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage especially at scale. In this paper we present STM2, a novel parallel STM designed for high performance, aggressive multithreading systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread (when compared with a traditional multi-processor systems). Our results, performed on an IBM POWER7 machine, a state-of-the-art, aggressive multi-threaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116922373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
An Evaluation of Vectorizing Compilers
Saeed Maleki, Yaoqing Gao, M. Garzarán, Tommy Wong, D. Padua
Most of today's processors include vector units designed to speed up single-threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high-level languages is a time-consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers. This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from the Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0), and XLC (version 11.01). Our results show that, despite all the work done on vectorization in the last 40 years, the compilers we evaluated vectorize only 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications.
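As a concrete illustration of what separates the loops such compilers can and cannot vectorize, here is a small Python/NumPy stand-in for the kind of C loops the benchmark contains; the arrays and sizes are invented for the example.

```python
import numpy as np

n = 1024
a = np.zeros(n)
b = np.random.rand(n)
c = np.random.rand(n)

# Vectorizable: iterations are independent, so the loop maps directly
# to one SIMD expression (what a vectorizing compiler emits).
for i in range(n):
    a[i] = b[i] + c[i]
assert np.allclose(a, b + c)          # the "vectorized" equivalent

# Hard to vectorize: a true loop-carried dependence, since each
# iteration consumes the value produced by the previous one.
for i in range(1, n):
    a[i] = a[i - 1] + b[i]
```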
{"title":"An Evaluation of Vectorizing Compilers","authors":"Saeed Maleki, Yaoqing Gao, M. Garzarán, Tommy Wong, D. Padua","doi":"10.1109/PACT.2011.68","DOIUrl":"https://doi.org/10.1109/PACT.2011.68","url":null,"abstract":"Most of today's processors include vector units that have been designed to speedup single threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high level languages is a time consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers. This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two application from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0) and XLC (version 11.01). Our results show that despite all the work done in vectorization in the last 40 years 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124112634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 216
Scalable Proximity-Aware Cache Replication in Chip Multiprocessors
Chongmin Li, Haixia Wang, Y. Xue, Dongsheng Wang, Jian Li
We propose Proximity-Aware cache Replication (PAR), an LLC replication technique that elegantly integrates an intelligent cache-replication placement mechanism and a hierarchical directory-based coherence protocol into one cost-effective and scalable design. Simulation results on a 64-core CMP show that PAR achieves a 12% speedup over the baseline shared-cache design with SPLASH2 and PARSEC workloads. It also provides around a 5% speedup over a couple of contemporary approaches while requiring much simpler and more scalable support.
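The core placement intuition can be sketched as a toy distance model; everything here, the mesh shape, the hop threshold, and the reuse predicate, is a hypothetical stand-in rather than PAR's actual mechanism or coherence protocol.

```python
MESH = 8  # assume an 8x8 mesh of cores/banks, matching a 64-core CMP

def hops(core, bank):
    """Manhattan distance between a core and an LLC bank on the mesh."""
    return abs(core // MESH - bank // MESH) + abs(core % MESH - bank % MESH)

def place(core, home_bank, reuse_predicted, threshold=3):
    """Replicate into the local bank only when the home bank is far
    and the block is predicted to be reused; otherwise keep one copy."""
    if reuse_predicted and hops(core, home_bank) > threshold:
        return core          # local bank holds a replica
    return home_bank         # home bank serves the request

print(place(core=0, home_bank=63, reuse_predicted=True))   # 0: replicate
print(place(core=0, home_bank=1, reuse_predicted=True))    # 1: stay home
```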
{"title":"Scalable Proximity-Aware Cache Replication in Chip Multiprocessors","authors":"Chongmin Li, Haixia Wang, Y. Xue, Dongsheng Wang, Jian Li","doi":"10.1109/PACT.2011.35","DOIUrl":"https://doi.org/10.1109/PACT.2011.35","url":null,"abstract":"We propose Proximity-Aware cache Replication (PAR), an LLC replication technique that elegantly integrates an intelligent cache replication placement mechanism and a hierarchical directory-based coherence protocol into one cost-effective and scalable design. Simulation results on a 64-core CMP show that PAR can achieve 12% speedup over the baseline shared cache design with SPLASH2 and PARSEC workloads. It also provides around 5% speedup over a couple contemporary approaches with much simpler and scalable support.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126326633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Optimizing Regular Expression Matching with SR-NFA on Multi-Core Systems
Y. Yang, V. Prasanna
Conventionally, regular expression matching (REM) has been performed by sequentially comparing the regular expression (regex) to the input stream, which can be slow due to excessive backtracking (smith:acsac06). Alternatively, the regex can be converted to a deterministic finite automaton (DFA) for efficient matching, which however may require an extremely large state transition table (STT) due to exponential state explosion (meyer:swat71, yu:ancs06). We propose the segmented regex-NFA (SR-NFA) architecture, where the regex is first compiled into modular nondeterministic finite automata (NFA), then partitioned, optimized, and matched efficiently on modern multi-core processors. SR-NFA offers attack-resilient, multi-gigabit-per-second matching throughput, suffers from neither backtracking nor state explosion, and can be rapidly constructed. For regex sets that construct a DFA with moderate state explosion, i.e., on average 200k states in the STT, the proposed SR-NFA is 367k times faster to construct and update and uses 23k times less memory than the DFA approach. Running on an 8-core 2.6 GHz Opteron platform, our prototype achieves 2.2 Gbps average matching throughput for regex sets with up to 4,000 SR-NFA states per regex set.
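The backtracking-free foundation that SR-NFA builds on is plain NFA set-simulation, shown below as a minimal Python sketch; the transition table for the regex a(b|c)*d is hand-built for illustration, and the paper's segmentation and multi-core partitioning are not modeled.

```python
def nfa_match(transitions, start, accepting, text):
    """Thompson-style NFA simulation: track the set of live states, so the
    running time is linear in the input with no backtracking, and memory
    is bounded by the NFA size rather than an exploded DFA table."""
    current = {start}
    for ch in text:
        current = set().union(*[transitions.get((s, ch), set()) for s in current])
        if not current:
            return False
    return bool(current & accepting)

# Hand-built NFA for the regex a(b|c)*d: state 0 -a-> 1, 1 -b/c-> 1, 1 -d-> 2.
t = {
    (0, "a"): {1},
    (1, "b"): {1}, (1, "c"): {1},
    (1, "d"): {2},
}
print(nfa_match(t, 0, {2}, "abcbd"))  # True
print(nfa_match(t, 0, {2}, "abx"))    # False
```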
{"title":"Optimizing Regular Expression Matching with SR-NFA on Multi-Core Systems","authors":"Y. Yang, V. Prasanna","doi":"10.1109/PACT.2011.73","DOIUrl":"https://doi.org/10.1109/PACT.2011.73","url":null,"abstract":"Conventionally, regular expression matching (REM) has been performed by sequentially comparing the regular expression (regex) to the input stream, which can be slow due to excessive backtracking (smith:acsac06). Alternatively, the regex can be converted to a deterministic finite automaton (DFA) for efficient matching, which however may require an extremely large state transition table (STT) due to exponential state explosion (meyer:swat71, yu:ancs06). We propose the segmented regex-NFA (SR-NFA) architecture, where the regex is first compiled into modular nondeterministic finite automata (NFA), then partitioned, optimized, and matched efficiently on modern multi-core processors. SR-NFA offers attack-resilient multi-gigabit per second matching throughput, does not suffer from either backtracking or state explosion, and can be rapidly constructed. For regex sets that construct a DFA with moderate state explosion, i.e., on average 200k states in the STT, the proposed SR-NFA is 367k times faster to construct and update and use 23k times less memory than the DFA approach. Running on an 8-core 2.6 GHz Opteron platform, our prototype achieves 2.2 Gbps average matching throughput for regex sets with up to 4,000 SR-NFA states per regex set.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131634169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Scalable and Efficient Bounds Checking for Large-Scale CMP Environments
Baik Song An, K. H. Yum, Eun Jung Kim
We provide architectural support for fast and efficient bounds checking of multithreaded workloads in chip-multiprocessor (CMP) environments. Bounds-information sharing and smart tagging help perform bounds checking more effectively by exploiting the characteristics of a pointer. In addition, the BCache architecture allows fast access to the bounds information. Simulation results show that the proposed scheme increases the IPC of memory operations by 29% on average compared to the previous hardware scheme.
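For reference, the software-visible contract that such hardware accelerates looks like the following hypothetical Python sketch: every pointer carries base/bound metadata, and each dereference is checked against it. The BoundedPtr class and its fields are invented for illustration; the paper's BCache and tagging hardware are not modeled.

```python
class BoundedPtr:
    """A pointer paired with bounds metadata, checked on every access."""
    def __init__(self, buf, base=0):
        self.buf, self.base, self.bound = buf, base, len(buf)

    def load(self, offset):
        addr = self.base + offset
        if not (0 <= addr < self.bound):      # the bounds check itself
            raise IndexError(f"out-of-bounds access at {addr}")
        return self.buf[addr]

p = BoundedPtr([10, 20, 30])
print(p.load(2))          # 30: in bounds
try:
    p.load(3)             # one past the end
except IndexError as e:
    print(e)              # the overflow is caught instead of corrupting memory
```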
{"title":"Scalable and Efficient Bounds Checking for Large-Scale CMP Environments","authors":"Baik Song An, K. H. Yum, Eun Jung Kim","doi":"10.1109/PACT.2011.36","DOIUrl":"https://doi.org/10.1109/PACT.2011.36","url":null,"abstract":"We attempt to provide an architectural support for fast and efficient bounds checking for multithread work-loads in chip-multiprocessor (CMP) environments. Bounds information sharing and smart tagging help to perform bounds checking more effectively utilizing the characteristics of a pointer. Also, the BCache architecture allows fast access to the bounds information. Simulation results show that the proposed scheme increases ¥ìPC of memory operations by 29% on average compared to the previous hardware scheme.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122553906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Modeling and Performance Evaluation of TSO-Preserving Binary Optimization
Cheng Wang, Youfeng Wu
Program optimization on multi-core systems must preserve the program's memory consistency. This paper studies TSO-preserving binary optimization. We introduce a novel approach to formally modeling TSO-preserving binary optimization based on the formal TSO memory model. The major contribution of the modeling is a sound and complete algorithm that verifies TSO-preserving binary optimization with O(N²) complexity. We also developed a dynamic binary optimization system to evaluate the performance impact of TSO-preserving optimization. Our experiments show that dynamic binary optimization without memory optimizations improves performance by 8.1%. TSO-preserving optimizations improve performance by a further 4.8%, for a total of 12.9%. Without the TSO-preserving restriction, dynamic binary optimization improves overall performance by 20.4%.
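To convey the flavor of an O(N²) pairwise check, here is a hypothetical Python sketch, not the paper's proved sound-and-complete algorithm: it walks every pair of memory operations whose relative order an optimizer changed and rejects any reordering TSO forbids, using the standard rule that TSO lets a later load pass an earlier store to a different address and nothing else.

```python
def tso_allows_swap(first, second):
    """TSO permits reordering only an earlier store with a later load
    to a different address (the store-buffer relaxation)."""
    (op1, addr1), (op2, addr2) = first, second
    return op1 == "store" and op2 == "load" and addr1 != addr2

def preserves_tso(original, optimized):
    """O(N^2) scan: every inverted pair must be a TSO-legal swap."""
    pos = {op: i for i, op in enumerate(optimized)}
    n = len(original)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = original[i], original[j]
            if pos[a] > pos[b] and not tso_allows_swap(a, b):
                return False
    return True

orig = [("store", "x"), ("load", "y")]
print(preserves_tso(orig, list(reversed(orig))))   # True: store->load may swap
orig = [("load", "y"), ("store", "x")]
print(preserves_tso(orig, list(reversed(orig))))   # False: load->store may not
```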
{"title":"Modeling and Performance Evaluation of TSO-Preserving Binary Optimization","authors":"Cheng Wang, Youfeng Wu","doi":"10.1109/PACT.2011.69","DOIUrl":"https://doi.org/10.1109/PACT.2011.69","url":null,"abstract":"Program optimization on multi-core systems must preserve the program memory consistency. This paper studies TSO-preserving binary optimization. We introduce a novel approach to formally model TSO-preserving binary optimization based on the formal TSO memory model. The major contribution of the modeling is a sound and complete algorithm to verify TSO-preserving binary optimization with O(N2) complexity. We also developed a dynamic binary optimization system to evaluate the performance impact of TSO-preserving optimization. We show in our experiments that, dynamic binary optimization without memory optimizations can improve performance by 8.1%. TSO-preserving optimizations can further improve the performance by 4.8% to a total 12.9%. Without considering the restriction for TSO-preserving optimizations, the dynamic binary optimization can improve the overall performance to 20.4%.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"395 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113998939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Decoupled Cache Segmentation: Mutable Policy with Automated Bypass
S. Khan, Daniel A. Jiménez
The least-recently-used (LRU) replacement policy performs poorly in the last-level cache (LLC) because the temporal locality of memory accesses is filtered by the first- and second-level caches. We propose a cache segmentation technique that adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. The technique is independent of the LRU policy, so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. In a 2MB LLC in a single-core processor running a memory-intensive subset of the SPEC CPU 2006 benchmarks, it outperforms LRU replacement by 5.2% on average with not-recently-used (NRU) replacement and by 2.2% on average with random replacement.
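A toy Python sketch of the segmentation idea follows; the quota value, the promotion rule, and the eviction choice here are invented stand-ins for the paper's predictor, and random replacement is used within a segment because the paper's point is that segmentation works even with cheap policies.

```python
import random

def insert(cache, block, new_quota, capacity):
    """cache maps block -> 'new' (not yet re-referenced) or 'reused'."""
    if block in cache:
        cache[block] = "reused"        # first reuse promotes the block
        return
    if len(cache) >= capacity:
        new_blocks = [b for b, seg in cache.items() if seg == "new"]
        # Evict from the 'new' segment when it exceeds its quota,
        # otherwise fall back to random replacement over the whole cache.
        victims = new_blocks if len(new_blocks) > new_quota else list(cache)
        del cache[random.choice(victims)]
    cache[block] = "new"

cache = {}
for b in [1, 2, 3, 1, 4, 5, 1, 2]:
    insert(cache, b, new_quota=2, capacity=4)
print(cache)    # reused blocks (like 1) tend to survive the 'new' churn
```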
{"title":"Decoupled Cache Segmentation: Mutable Policy with Automated Bypass","authors":"S. Khan, Daniel A. Jiménez","doi":"10.1109/PACT.2011.45","DOIUrl":"https://doi.org/10.1109/PACT.2011.45","url":null,"abstract":"The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because temporal locality of memory accesses is filtered by first and second level caches. We propose a cache segmentation technique that adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. The technique is independent from the LRU policy so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. It outperforms LRU replacement on average by 5.2% with not-recently-used (NRU) replacement and on average by 2.2% with random replacement in a 2MB LLC in a single-core processor with a memory intensive subset of SPEC CPU 2006 benchmarks.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132524041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Understanding the Behavior of Pthread Applications on Non-Uniform Cache Architectures
Gagandeep S. Sachdev, K. Sudan, Mary W. Hall, R. Balasubramonian
Future scalable multi-core chips are expected to implement a shared last-level cache (LLC) with banks distributed across the chip, forcing a core to incur non-uniform access latencies to the different banks. Consequently, high performance and energy efficiency depend on whether a thread's data is placed in local or nearby banks. Using compiler and programmer support, we aim to find an alternative to existing high-overhead designs. In this paper, we take existing parallel programs written in Pthreads and show the performance gap between current static mapping schemes, costly migration schemes, and idealized static and dynamic best-case scenarios.
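The cost of poor placement can be illustrated with a toy latency model; the cycle counts, mesh size, and fixed bank assignment below are hypothetical numbers chosen for the example, not the paper's measured parameters.

```python
HOP_CYCLES = 2      # assumed per-hop network latency
BANK_CYCLES = 10    # assumed bank access latency
MESH = 4            # assume a 4x4 mesh: 16 cores, one LLC bank each

def latency(core, bank):
    """Cycles to reach a bank: bank access plus Manhattan-distance hops."""
    hops = abs(core // MESH - bank // MESH) + abs(core % MESH - bank % MESH)
    return BANK_CYCLES + HOP_CYCLES * hops

accesses = 100                      # core 0 repeatedly touches one block

remote_bank = 13                    # address interleaving lands it far away
interleaved = accesses * latency(0, remote_bank)
local = accesses * latency(0, 0)    # placement-aware mapping keeps it local
print(interleaved, local)           # 1800 vs 1000 cycles in this toy model
```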
{"title":"Understanding the Behavior of Pthread Applications on Non-Uniform Cache Architectures","authors":"Gagandeep S. Sachdev, K. Sudan, Mary W. Hall, R. Balasubramonian","doi":"10.1109/PACT.2011.26","DOIUrl":"https://doi.org/10.1109/PACT.2011.26","url":null,"abstract":"Future scalable multi-core chips are expected to implement a shared last-level cache (LLC) with banks distributed on chip, forcing a core to incur non-uniform access latencies to each bank. Consequently, high performance and energy efficiency depend on whether a thread's data is placed in local or nearby banks. Using compiler and programmer support, we aim to find an alternative solution to existing high-overhead designs. In this paper, we take existing parallel programs written in Pthreads, and show the performance gap between current static mapping schemes, costly migration schemes and idealized static and dynamic best-case scenarios.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129967907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs
K. Pusukuri, Rajiv Gupta, L. Bhuyan
Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, the contention-management policies provided by modern operating systems increase context switches and degrade the performance of multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention-management policies and OS scheduling policies. Time Share (TS) is the default scheduling policy in a modern OS such as OpenSolaris, and under TS the priorities of threads change very frequently to balance load and provide fairness in scheduling. Due to this frequent ping-ponging of priorities, threads of an application are often preempted by threads of the same application. This increases the frequency of involuntary context switches as well as lock-holder thread preemptions, and leads to poor performance. The problem becomes very serious under high loads. To alleviate it, we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06 and evaluated it using 22 programs, including the TATP database application, SPECjbb2005, programs from PARSEC and SPEC OMP, and some microbenchmarks. The experimental results show that the FF policy achieves high performance for both lightly and heavily loaded systems. Moreover, it requires no changes to the application source code or the OS kernel.
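The following hypothetical Python sketch contrasts the two preemption rules described above; it illustrates the stated intuition, not the OpenSolaris implementation, and the priority numbers are invented.

```python
class Thread:
    def __init__(self, app, priority):
        self.app, self.priority = app, priority

def ts_preempts(running, candidate):
    """Time Share behavior: priority alone decides, so a boosted sibling
    can preempt a lock holder from its own application."""
    return candidate.priority > running.priority

def ff_preempts(running, candidate):
    """FF-style rule: threads of the same application never preempt each
    other, which avoids lock-holder preemption within an application."""
    if candidate.app == running.app:
        return False
    return candidate.priority > running.priority

holder = Thread(app="db", priority=40)    # a lock-holding thread
sibling = Thread(app="db", priority=59)   # its priority-boosted sibling
print(ts_preempts(holder, sibling))       # True: TS backstabs the holder
print(ff_preempts(holder, sibling))       # False: FF lets it finish
```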
{"title":"No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs","authors":"K. Pusukuri, Rajiv Gupta, L. Bhuyan","doi":"10.1109/PACT.2011.8","DOIUrl":"https://doi.org/10.1109/PACT.2011.8","url":null,"abstract":"Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, contention management policies provided by modern operating systems increase context-switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention management policies and OS scheduling polices. Time Share (TS) is the default scheduling policy in a modern OS such as Open Solaris and with TS policy, priorities of threads change very frequently for balancing load and providing fairness in scheduling. Due to the frequent ping-ponging of priorities, threads of an application are often preempted by the threads of the same application. This increases the frequency of involuntary context-switches as wells as lock-holder thread preemptions and leads to poor performance. This problem becomes very serious under high loads. To alleviate this problem, in this paper, we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context-switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell Power Edge R905 server running OpenSolaris.2009.06 and evaluated it using 22 programs including the TATP database application, SPECjbb2005, programs from PARSEC, SPEC OMP, and some micro benchmarks. The experimental results show that FF policy achieves high performance for both lightly and heavily loaded systems. Moreover it does not require any changes to the application source code or the OS kernel.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130350185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18