A number of hardware systems have been built or proposed to provide an interface for software to influence cache management. The combined software-hardware solution is called collaborative caching. Our previous work showed that, in theory, collaborative caching with LRU and MRU may enable a program to manage cache optimally. In this work we first present a prioritized LRU model. For each memory access, a program specifies a priority, the target cache position for the accessed datum, for all cache sizes. We prove that prioritized LRU satisfies the inclusion property. We also describe a dynamic cache control scheme based on the associated priority. This removes the limitation of our earlier LRU-MRU collaborative caching work, which required knowing the cache size.
{"title":"Collaborative Caching for Unknown Cache Sizes","authors":"Xiaoming Gu","doi":"10.1109/PACT.2011.50","DOIUrl":"https://doi.org/10.1109/PACT.2011.50","url":null,"abstract":"A number of hardware systems have been built or proposed to provide an interface for software to influence cache management. The combined software-hardware solution is called collaborative caching. Our previous work showed that in theory collaborative caching with LRU and MRU may enable a program to manage cache optimally. In this work we first present a prioritized LRU model. For each memory access, a program specifies a priority, the target cache position for the accessed datum, for all cache sizes. We have proved that the prioritized LRU holds inclusion property. Alternatively, we describe a dynamic cache control scheme based on the associated priority. The limitation of knowing cache size in our LRU-MRU collaborative caching work is removed.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121205192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gokcen Kestor, R. Gioiosa, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero
Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than on maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage, especially at scale. In this paper we present STM2, a novel parallel STM designed for high-performance, aggressively multithreaded systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping, and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread than in traditional multi-processor systems. Our results, obtained on an IBM POWER7 machine, a state-of-the-art, aggressively multi-threaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.
{"title":"STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems","authors":"Gokcen Kestor, R. Gioiosa, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero","doi":"10.1109/PACT.2011.54","DOIUrl":"https://doi.org/10.1109/PACT.2011.54","url":null,"abstract":"Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transactional memory (TM) is a promising programming model that allows programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Software-only implementations (STMs) are especially compelling because they run on commodity hardware, therefore providing high portability. Unfortunately, STM systems usually suffer from high overheads, which may limit their usage especially at scale. In this paper we present STM2, a novel parallel STM designed for high performance, aggressive multithreading systems. STM2 significantly lowers runtime overhead by offloading read-set validation, bookkeeping and conflict detection to auxiliary threads running on sibling hardware threads. Auxiliary threads perform STM operations in parallel with their paired application threads and absorb STM overhead, significantly improving performance. We exploit the fact that, on modern multi-core processors, sets of cores can share L1 or L2 caches. This lets us achieve closer coupling between the application thread and the auxiliary thread (when compared with a traditional multi-processor systems). Our results, performed on an IBM POWER7 machine, a state-of-the-art, aggressive multi-threaded system, show that our approach outperforms several well-known STM implementations. In particular, STM2 shows speedups between 1.8x and 5.2x over the tested STM systems, on average, with peaks up to 12.8x.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116922373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saeed Maleki, Yaoqing Gao, M. Garzarán, Tommy Wong, D. Padua
Most of today's processors include vector units that have been designed to speed up single-threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high-level languages is a time-consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers. This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from the Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0), and XLC (version 11.01). Our results show that, despite all the work done in vectorization in the last 40 years, 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated.
{"title":"An Evaluation of Vectorizing Compilers","authors":"Saeed Maleki, Yaoqing Gao, M. Garzarán, Tommy Wong, D. Padua","doi":"10.1109/PACT.2011.68","DOIUrl":"https://doi.org/10.1109/PACT.2011.68","url":null,"abstract":"Most of today's processors include vector units that have been designed to speedup single threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high level languages is a time consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers. This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two application from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0) and XLC (version 11.01). Our results show that despite all the work done in vectorization in the last 40 years 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124112634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chongmin Li, Haixia Wang, Y. Xue, Dongsheng Wang, Jian Li
We propose Proximity-Aware cache Replication (PAR), an LLC replication technique that elegantly integrates an intelligent cache replication placement mechanism and a hierarchical directory-based coherence protocol into one cost-effective and scalable design. Simulation results on a 64-core CMP show that PAR can achieve a 12% speedup over the baseline shared cache design with SPLASH2 and PARSEC workloads. It also provides around a 5% speedup over a couple of contemporary approaches, with much simpler and more scalable support.
{"title":"Scalable Proximity-Aware Cache Replication in Chip Multiprocessors","authors":"Chongmin Li, Haixia Wang, Y. Xue, Dongsheng Wang, Jian Li","doi":"10.1109/PACT.2011.35","DOIUrl":"https://doi.org/10.1109/PACT.2011.35","url":null,"abstract":"We propose Proximity-Aware cache Replication (PAR), an LLC replication technique that elegantly integrates an intelligent cache replication placement mechanism and a hierarchical directory-based coherence protocol into one cost-effective and scalable design. Simulation results on a 64-core CMP show that PAR can achieve 12% speedup over the baseline shared cache design with SPLASH2 and PARSEC workloads. It also provides around 5% speedup over a couple contemporary approaches with much simpler and scalable support.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126326633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventionally, regular expression matching (REM) has been performed by sequentially comparing the regular expression (regex) to the input stream, which can be slow due to excessive backtracking (smith:acsac06). Alternatively, the regex can be converted to a deterministic finite automaton (DFA) for efficient matching, which however may require an extremely large state transition table (STT) due to exponential state explosion (meyer:swat71, yu:ancs06). We propose the segmented regex-NFA (SR-NFA) architecture, where the regex is first compiled into modular nondeterministic finite automata (NFA), then partitioned, optimized, and matched efficiently on modern multi-core processors. SR-NFA offers attack-resilient, multi-gigabit-per-second matching throughput, does not suffer from either backtracking or state explosion, and can be rapidly constructed. For regex sets that construct a DFA with moderate state explosion, i.e., on average 200k states in the STT, the proposed SR-NFA is 367k times faster to construct and update and uses 23k times less memory than the DFA approach. Running on an 8-core 2.6 GHz Opteron platform, our prototype achieves 2.2 Gbps average matching throughput for regex sets with up to 4,000 SR-NFA states per regex set.
{"title":"Optimizing Regular Expression Matching with SR-NFA on Multi-Core Systems","authors":"Y. Yang, V. Prasanna","doi":"10.1109/PACT.2011.73","DOIUrl":"https://doi.org/10.1109/PACT.2011.73","url":null,"abstract":"Conventionally, regular expression matching (REM) has been performed by sequentially comparing the regular expression (regex) to the input stream, which can be slow due to excessive backtracking (smith:acsac06). Alternatively, the regex can be converted to a deterministic finite automaton (DFA) for efficient matching, which however may require an extremely large state transition table (STT) due to exponential state explosion (meyer:swat71, yu:ancs06). We propose the segmented regex-NFA (SR-NFA) architecture, where the regex is first compiled into modular nondeterministic finite automata (NFA), then partitioned, optimized, and matched efficiently on modern multi-core processors. SR-NFA offers attack-resilient multi-gigabit per second matching throughput, does not suffer from either backtracking or state explosion, and can be rapidly constructed. For regex sets that construct a DFA with moderate state explosion, i.e., on average 200k states in the STT, the proposed SR-NFA is 367k times faster to construct and update and use 23k times less memory than the DFA approach. Running on an 8-core 2.6 GHz Opteron platform, our prototype achieves 2.2 Gbps average matching throughput for regex sets with up to 4,000 SR-NFA states per regex set.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131634169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose architectural support for fast and efficient bounds checking for multithreaded workloads in chip-multiprocessor (CMP) environments. Bounds-information sharing and smart tagging help to perform bounds checking more effectively by utilizing the characteristics of a pointer. In addition, the BCache architecture allows fast access to the bounds information. Simulation results show that the proposed scheme increases the IPC of memory operations by 29% on average compared to the previous hardware scheme.
{"title":"Scalable and Efficient Bounds Checking for Large-Scale CMP Environments","authors":"Baik Song An, K. H. Yum, Eun Jung Kim","doi":"10.1109/PACT.2011.36","DOIUrl":"https://doi.org/10.1109/PACT.2011.36","url":null,"abstract":"We attempt to provide an architectural support for fast and efficient bounds checking for multithread work-loads in chip-multiprocessor (CMP) environments. Bounds information sharing and smart tagging help to perform bounds checking more effectively utilizing the characteristics of a pointer. Also, the BCache architecture allows fast access to the bounds information. Simulation results show that the proposed scheme increases ¥ìPC of memory operations by 29% on average compared to the previous hardware scheme.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122553906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Program optimization on multi-core systems must preserve the program memory consistency. This paper studies TSO-preserving binary optimization. We introduce a novel approach to formally model TSO-preserving binary optimization based on the formal TSO memory model. The major contribution of the modeling is a sound and complete algorithm to verify TSO-preserving binary optimization with O(N²) complexity. We also developed a dynamic binary optimization system to evaluate the performance impact of TSO-preserving optimization. We show in our experiments that dynamic binary optimization without memory optimizations can improve performance by 8.1%. TSO-preserving optimizations can further improve the performance by 4.8%, for a total of 12.9%. Without the restriction to TSO-preserving optimizations, dynamic binary optimization can improve the overall performance by 20.4%.
{"title":"Modeling and Performance Evaluation of TSO-Preserving Binary Optimization","authors":"Cheng Wang, Youfeng Wu","doi":"10.1109/PACT.2011.69","DOIUrl":"https://doi.org/10.1109/PACT.2011.69","url":null,"abstract":"Program optimization on multi-core systems must preserve the program memory consistency. This paper studies TSO-preserving binary optimization. We introduce a novel approach to formally model TSO-preserving binary optimization based on the formal TSO memory model. The major contribution of the modeling is a sound and complete algorithm to verify TSO-preserving binary optimization with O(N2) complexity. We also developed a dynamic binary optimization system to evaluate the performance impact of TSO-preserving optimization. We show in our experiments that, dynamic binary optimization without memory optimizations can improve performance by 8.1%. TSO-preserving optimizations can further improve the performance by 4.8% to a total 12.9%. Without considering the restriction for TSO-preserving optimizations, the dynamic binary optimization can improve the overall performance to 20.4%.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"395 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113998939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because the temporal locality of memory accesses is filtered by the first- and second-level caches. We propose a cache segmentation technique that adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. The technique is independent of the LRU policy, so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. It outperforms LRU replacement on average by 5.2% with not-recently-used (NRU) replacement and by 2.2% with random replacement in a 2MB LLC in a single-core processor running a memory-intensive subset of the SPEC CPU 2006 benchmarks.
{"title":"Decoupled Cache Segmentation: Mutable Policy with Automated Bypass","authors":"S. Khan, Daniel A. Jiménez","doi":"10.1109/PACT.2011.45","DOIUrl":"https://doi.org/10.1109/PACT.2011.45","url":null,"abstract":"The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because temporal locality of memory accesses is filtered by first and second level caches. We propose a cache segmentation technique that adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. The technique is independent from the LRU policy so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. It outperforms LRU replacement on average by 5.2% with not-recently-used (NRU) replacement and on average by 2.2% with random replacement in a 2MB LLC in a single-core processor with a memory intensive subset of SPEC CPU 2006 benchmarks.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132524041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gagandeep S. Sachdev, K. Sudan, Mary W. Hall, R. Balasubramonian
Future scalable multi-core chips are expected to implement a shared last-level cache (LLC) with banks distributed across the chip, forcing a core to incur non-uniform access latencies to each bank. Consequently, high performance and energy efficiency depend on whether a thread's data is placed in local or nearby banks. Using compiler and programmer support, we aim to find an alternative solution to existing high-overhead designs. In this paper, we take existing parallel programs written in Pthreads and show the performance gap between current static mapping schemes, costly migration schemes, and idealized static and dynamic best-case scenarios.
{"title":"Understanding the Behavior of Pthread Applications on Non-Uniform Cache Architectures","authors":"Gagandeep S. Sachdev, K. Sudan, Mary W. Hall, R. Balasubramonian","doi":"10.1109/PACT.2011.26","DOIUrl":"https://doi.org/10.1109/PACT.2011.26","url":null,"abstract":"Future scalable multi-core chips are expected to implement a shared last-level cache (LLC) with banks distributed on chip, forcing a core to incur non-uniform access latencies to each bank. Consequently, high performance and energy efficiency depend on whether a thread's data is placed in local or nearby banks. Using compiler and programmer support, we aim to find an alternative solution to existing high-overhead designs. In this paper, we take existing parallel programs written in Pthreads, and show the performance gap between current static mapping schemes, costly migration schemes and idealized static and dynamic best-case scenarios.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129967907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, the contention management policies provided by modern operating systems increase context switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention management policies and OS scheduling policies. Time Share (TS) is the default scheduling policy in a modern OS such as OpenSolaris, and with the TS policy the priorities of threads change very frequently to balance load and provide fairness in scheduling. Due to this frequent ping-ponging of priorities, threads of an application are often preempted by threads of the same application. This increases the frequency of involuntary context switches as well as lock-holder thread preemptions and leads to poor performance. The problem becomes very serious under high loads. To alleviate this problem, in this paper we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06 and evaluated it using 22 programs, including the TATP database application, SPECjbb2005, programs from PARSEC and SPEC OMP, and some microbenchmarks. The experimental results show that the FF policy achieves high performance for both lightly and heavily loaded systems. Moreover, it does not require any changes to the application source code or the OS kernel.
{"title":"No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs","authors":"K. Pusukuri, Rajiv Gupta, L. Bhuyan","doi":"10.1109/PACT.2011.8","DOIUrl":"https://doi.org/10.1109/PACT.2011.8","url":null,"abstract":"Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, contention management policies provided by modern operating systems increase context-switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention management policies and OS scheduling polices. Time Share (TS) is the default scheduling policy in a modern OS such as Open Solaris and with TS policy, priorities of threads change very frequently for balancing load and providing fairness in scheduling. Due to the frequent ping-ponging of priorities, threads of an application are often preempted by the threads of the same application. This increases the frequency of involuntary context-switches as wells as lock-holder thread preemptions and leads to poor performance. This problem becomes very serious under high loads. To alleviate this problem, in this paper, we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context-switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell Power Edge R905 server running OpenSolaris.2009.06 and evaluated it using 22 programs including the TATP database application, SPECjbb2005, programs from PARSEC, SPEC OMP, and some micro benchmarks. The experimental results show that FF policy achieves high performance for both lightly and heavily loaded systems. Moreover it does not require any changes to the application source code or the OS kernel.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130350185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}