Jaejin Lee, Sangmin Seo, C. Kim, Junghyun Kim, Posung Chun, Zehra Sura, Jungwon Kim, Sang-Yong Han
The Cell BE processor is a heterogeneous multicore that contains one PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Each SPE has a small software-managed local store. Applications must explicitly control all DMA transfers of code and data between the SPE local stores and the main memory, and they must perform any coherence actions required for data transferred. The need for explicit memory management, together with the limited size of the SPE local stores, makes it challenging to program the Cell BE and achieve high performance. In this paper, we present the design and implementation of our COMIC runtime system and its programming model. It provides the program with an illusion of a globally shared memory, in which the PPE and each of the SPEs can access any shared data item, without the programmer having to worry about where the data is, or how to obtain it. COMIC is implemented entirely in software with the aid of user-level libraries provided by the Cell SDK. For each read or write operation in SPE code, a COMIC runtime function is inserted to check whether the data is available in its local store, and to automatically fetch it if it is not. We propose a memory consistency model and a programming model for COMIC, in which the management of synchronization and coherence is centralized in the PPE. To characterize the effectiveness of the COMIC runtime system, we evaluate it with twelve OpenMP benchmark applications on a Cell BE system and an SMP-like homogeneous multicore (Xeon).
{"title":"COMIC: A coherent shared memory interface for cell BE","authors":"Jaejin Lee, Sangmin Seo, C. Kim, Junghyun Kim, Posung Chun, Zehra Sura, Jungwon Kim, Sang-Yong Han","doi":"10.1145/1454115.1454157","DOIUrl":"https://doi.org/10.1145/1454115.1454157","url":null,"abstract":"The Cell BE processor is a heterogeneous multicore that contains one PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Each SPE has a small software-managed local store. Applications must explicitly control all DMA transfers of code and data between the SPE local stores and the main memory, and they must perform any coherence actions required for data transferred. The need for explicit memory management, together with the limited size of the SPE local stores, makes it challenging to program the Cell BE and achieve high performance. In this paper, we present the design and implementation of our COMIC runtime system and its programming model. It provides the program with an illusion of a globally shared memory, in which the PPE and each of the SPEs can access any shared data item, without the programmer having to worry about where the data is, or how to obtain it. COMIC is implemented entirely in software with the aid of user-level libraries provided by the Cell SDK. For each read or write operation in SPE code, a COMIC runtime function is inserted to check whether the data is available in its local store, and to automatically fetch it if it is not. We propose a memory consistency model and a programming model for COMIC, in which the management of synchronization and coherence is centralized in the PPE. To characterize the effectiveness of the COMIC runtime system, we evaluate it with twelve OpenMP benchmark applications on a Cell BE system and an SMP-like homogeneous multicore (Xeon).","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"13 s1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114248095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seth H. Pugsley, M. Awasthi, Niti Madan, Naveen Muralimanohar, R. Balasubramonian
In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and co-ordination is required before allowing commits to proceed in parallel. In this paper, we propose novel algorithms to implement commit that are more scalable in terms of delay and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept and leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state-of-the-art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages for commit. These translate into overall performance improvements of up to 66% (for synthetic workloads with average transaction length of 200 cycles), 35% (for average transaction length of 1000 cycles), and 8% (for average transaction length of 4000 cycles). For a small group of multi-threaded programs with frequent transaction commits, improvements of up to 8% were observed for a 32-node simulation.
{"title":"Scalable and reliable communication for hardware transactional memory","authors":"Seth H. Pugsley, M. Awasthi, Niti Madan, Naveen Muralimanohar, R. Balasubramonian","doi":"10.1145/1454115.1454137","DOIUrl":"https://doi.org/10.1145/1454115.1454137","url":null,"abstract":"In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and co-ordination is required before allowing commits to proceed in parallel. In this paper, we propose novel algorithms to implement commit that are more scalable in terms of delay and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept and leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state-of-the-art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages for commit. These translate into overall performance improvements of up to 66% (for synthetic workloads with average transaction length of 200 cycles), 35% (for average transaction length of 1000 cycles), and 8% (for average transaction length of 4000 cycles). For a small group of multi-threaded programs with frequent transaction commits, improvements of up to 8% were observed for a 32-node simulation.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133950000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graham D. Price, John Giacomoni, Manish Vachharajani
This paper presents ParaMeter, an interactive program analysis and visualization system for large traces. Using ParaMeter, a software developer can locate and analyze regions of code that may yield to parallelization efforts and to possibly extract performance from multicore hardware. The key contributions in the paper are (1) a method to use interactive visualization of traces to find and exploit parallelism, (2) interactive-speed visualization of large-scale trace dependencies, (3) interactive-speed visualization of code interactions, and (4) a BDD variable ordering for BDD-compressed traces that results in fast visualization, fast analysis, and good compression. ParaMeter's effectiveness is demonstrated by finding and exploiting parallelism in 175.vpr. Measurements of ParaMeter's visualization algorithms show that they are up to seventy-five thousand times faster than prior approaches.
{"title":"Visualizing potential parallelism in sequential programs","authors":"Graham D. Price, John Giacomoni, Manish Vachharajani","doi":"10.1145/1454115.1454129","DOIUrl":"https://doi.org/10.1145/1454115.1454129","url":null,"abstract":"This paper presents ParaMeter, an interactive program analysis and visualization system for large traces. Using ParaMeter, a software developer can locate and analyze regions of code that may yield to parallelization efforts and to possibly extract performance from multicore hardware. The key contributions in the paper are (1) a method to use interactive visualization of traces to find and exploit parallelism, (2) interactive-speed visualization of large-scale trace dependencies, (3) interactive-speed visualization of code interactions, and (4) a BDD variable ordering for BDD-compressed traces that results in fast visualization, fast analysis, and good compression. ParaMeter's effectiveness is demonstrated by finding and exploiting parallelism in 175.vpr. Measurements of ParaMeter's visualization algorithms show that they are up to seventy-five thousand times faster than prior approaches.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122213063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an energy management policy for reconfigurable clusters running a multi-tier application, exploiting DVS together with multiple sleep states. We develop a theoretical analysis of the corresponding power optimization problem and design an algorithm around the solution. Moreover, we rigorously investigate selection of the optimal number of spare servers for each power state, a problem that has only been approached in an ad-hoc manner in current policies. To validate our results and policies, we implement them on an actual multi-tier server cluster where nodes support all power management techniques considered. Experimental results using realistic dynamic workloads based on the TPC-W benchmark show that exploiting multiple sleep states results in significant additional cluster-wide energy savings up to 23% with little or no performance degradation.
{"title":"Multi-mode energy management for multi-tier server clusters","authors":"T. Horvath, K. Skadron","doi":"10.1145/1454115.1454153","DOIUrl":"https://doi.org/10.1145/1454115.1454153","url":null,"abstract":"This paper presents an energy management policy for reconfigurable clusters running a multi-tier application, exploiting DVS together with multiple sleep states. We develop a theoretical analysis of the corresponding power optimization problem and design an algorithm around the solution. Moreover, we rigorously investigate selection of the optimal number of spare servers for each power state, a problem that has only been approached in an ad-hoc manner in current policies. To validate our results and policies, we implement them on an actual multi-tier server cluster where nodes support all power management techniques considered. Experimental results using realistic dynamic workloads based on the TPC-W benchmark show that exploiting multiple sleep states results in significant additional cluster-wide energy savings up to 23% with little or no performance degradation.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115808515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Katherine E. Coons, Behnam Robatmili, Matthew E. Taylor, Bertrand A. Maher, D. Burger, K. McKinley
Communication overheads are one of the fundamental challenges in a multiprocessor system. As the number of processors on a chip increases, communication overheads and the distribution of computation and data become increasingly important performance factors. Explicit Dataflow Graph Execution (EDGE) processors, in which instructions communicate with one another directly on a distributed substrate, give the compiler control over communication overheads at a fine granularity. Prior work shows that compilers can effectively reduce fine-grained communication overheads in EDGE architectures using a spatial instruction placement algorithm with a heuristic-based cost function. While this algorithm is effective, the cost function must be painstakingly tuned. Heuristics tuned to perform well across a variety of applications leave users with little ability to tune performance-critical applications, yet we find that the best placement heuristics vary significantly with the application. First, we suggest a systematic feature selection method that reduces the feature set size based on the extent to which features affect performance. To automatically discover placement heuristics, we then use these features as input to a reinforcement learning technique, called Neuro-Evolution of Augmenting Topologies (NEAT), that uses a genetic algorithm to evolve neural networks. We show that NEAT outperforms simulated annealing, the most commonly used optimization technique for instruction placement. We use NEAT to learn general heuristics that are as effective as hand-tuned heuristics, but we find that improving over highly hand-tuned general heuristics is difficult. We then suggest a hierarchical approach to machine learning that classifies segments of code with similar characteristics and learns heuristics for these classes. This approach performs closer to the specialized heuristics. Together, these results suggest that learning compiler heuristics may benefit from both improved feature selection and classification.
{"title":"Feature selection and policy optimization for distributed instruction placement using reinforcement learning","authors":"Katherine E. Coons, Behnam Robatmili, Matthew E. Taylor, Bertrand A. Maher, D. Burger, K. McKinley","doi":"10.1145/1454115.1454122","DOIUrl":"https://doi.org/10.1145/1454115.1454122","url":null,"abstract":"Communication overheads are one of the fundamental challenges in a multiprocessor system. As the number of processors on a chip increases, communication overheads and the distribution of computation and data become increasingly important performance factors. Explicit Dataflow Graph Execution (EDGE) processors, in which instructions communicate with one another directly on a distributed substrate, give the compiler control over communication overheads at a fine granularity. Prior work shows that compilers can effectively reduce fine-grained communication overheads in EDGE architectures using a spatial instruction placement algorithm with a heuristic-based cost function. While this algorithm is effective, the cost function must be painstakingly tuned. Heuristics tuned to perform well across a variety of applications leave users with little ability to tune performance-critical applications, yet we find that the best placement heuristics vary significantly with the application. First, we suggest a systematic feature selection method that reduces the feature set size based on the extent to which features affect performance. To automatically discover placement heuristics, we then use these features as input to a reinforcement learning technique, called Neuro-Evolution of Augmenting Topologies (NEAT), that uses a genetic algorithm to evolve neural networks. We show that NEAT outperforms simulated annealing, the most commonly used optimization technique for instruction placement. We use NEAT to learn general heuristics that are as effective as hand-tuned heuristics, but we find that improving over highly hand-tuned general heuristics is difficult. We then suggest a hierarchical approach to machine learning that classifies segments of code with similar characteristics and learns heuristics for these classes. This approach performs closer to the specialized heuristics. Together, these results suggest that learning compiler heuristics may benefit from both improved feature selection and classification.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124309502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuejun Yang, Y. Zhang, Jingling Xue, Ian Rogers, Gen Li, Guibin Wang
The memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers reuse is only attempted for simple stream references, those whose start and end are known. Compiler analysis from outside of stream processors does not directly enable the consideration of other complex stream references. In this paper we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: algorithms to recognize the reuse supplied by stream references, a new abstract expression called the Stream Reuse Graph (SRG) to depict the reuse and the optimization of the SRG for the transformation. Both the reuse between whole sequences accessed by stream references and that between partial sequences are exploited in the paper. In particular, the problem of exploiting partial stream reuse does not have its parallel in the traditional data reuse exploitation setting (for scalars and arrays). Finally, we have implemented our techniques using the StreamC/KernelC compiler for Imagine. Experimental results show a resultant speedup of 1.14 to 2.54 times using a range of typical stream processing application kernels.
{"title":"Exploiting loop-dependent Stream Reuse for stream processors","authors":"Xuejun Yang, Y. Zhang, Jingling Xue, Ian Rogers, Gen Li, Guibin Wang","doi":"10.1145/1454115.1454121","DOIUrl":"https://doi.org/10.1145/1454115.1454121","url":null,"abstract":"The memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers reuse is only attempted for simple stream references, those whose start and end are known. Compiler analysis from outside of stream processors does not directly enable the consideration of other complex stream references. In this paper we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: algorithms to recognize the reuse supplied by stream references, a new abstract expression called the Stream Reuse Graph (SRG) to depict the reuse and the optimization of the SRG for the transformation. Both the reuse between whole sequences accessed by stream references and that between partial sequences are exploited in the paper. In particular, the problem of exploiting partial stream reuse does not have its parallel in the traditional data reuse exploitation setting (for scalars and arrays). Finally, we have implemented our techniques using the StreamC/KernelC compiler for Imagine. Experimental results show a resultant speedup of 1.14 to 2.54 times using a range of typical stream processing application kernels.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"20 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132090961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fang Liu, Fei Guo, Yan Solihin, Seongbeom Kim, A. Eker
One of the essential features in modern computer systems is context switching, which allows multiple threads of execution to time-share a limited number of processors. While very useful, context switching can introduce high performance overheads, with one of the primary reasons being the cache perturbation effect. Between the time a thread is switched out and when it resumes execution, parts of its working set in the cache may be perturbed by other interfering threads, leading to (context switch) cache misses to recover from the perturbation. The goal of this paper is to understand how cache parameters and application behavior influence the number of context switch misses the application suffers from. We characterize a previously-unreported type of context switch misses that occur as the artifact of the interaction of cache replacement policy and an application's temporal reuse behavior. We characterize the behavior of these “reordered misses” for various applications, cache sizes, and the amount of cache perturbation. As a second contribution, we develop an analytical model that reveals the mathematical relationship between cache design parameters, an application's temporal reuse pattern, and the number of context switch misses the application suffers from. We validate the model against simulation studies and find that it is accurate in predicting the trends of context switch misses. The mathematical relationship provided by the model allows us to derive insights into precisely why some applications are more vulnerable to context switch misses than others. Through a case study, we also find that prefetching tends to aggravate the number of context switch misses.
{"title":"Characterizing and modeling the behavior of context switch misses!","authors":"Fang Liu, Fei Guo, Yan Solihin, Seongbeom Kim, A. Eker","doi":"10.1145/1454115.1454130","DOIUrl":"https://doi.org/10.1145/1454115.1454130","url":null,"abstract":"One of the essential features in modern computer systems is context switching, which allows multiple threads of execution to time-share a limited number of processors. While very useful, context switching can introduce high performance overheads, with one of the primary reasons being the cache perturbation effect. Between the time a thread is switched out and when it resumes execution, parts of its working set in the cache may be perturbed by other interfering threads, leading to (context switch) cache misses to recover from the perturbation. The goal of this paper is to understand how cache parameters and application behavior influence the number of context switch misses the application suffers from. We characterize a previously-unreported type of context switch misses that occur as the artifact of the interaction of cache replacement policy and an application's temporal reuse behavior. We characterize the behavior of these “reordered misses” for various applications, cache sizes, and the amount of cache perturbation. As a second contribution, we develop an analytical model that reveals the mathematical relationship between cache design parameters, an application's temporal reuse pattern, and the number of context switch misses the application suffers from. We validate the model against simulation studies and find that it is accurate in predicting the trends of context switch misses. The mathematical relationship provided by the model allows us to derive insights into precisely why some applications are more vulnerable to context switch misses than others. Through a case study, we also find that prefetching tends to aggravate the number of context switch misses.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"512 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116559692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous Multithreading (SMT) increases processor throughput by allowing the parallel execution of several threads. However, fully sharing processor resources may cause resource monopolization by a single thread or other misallocations, resulting in overall performance degradation. Static resource partitioning techniques have been suggested, but are not as effective as dynamically controlling the resource usage of each thread since program behavior does change during its execution. In this paper, we propose an Adaptive Resource Partitioning Algorithm (ARPA) that dynamically assigns resources to threads according to thread behavior changes. ARPA analyzes the resource usage efficiency of each thread in a time period and assigns more resources to threads which can use them in a more efficient way. The purpose of ARPA is to improve the efficiency of resource utilization, thereby improving overall instruction throughput. Our simulation results on a set of 42 multiprogramming workloads show that ARPA outperforms the traditional fetch policy ICOUNT by 55.8% with regard to overall instruction throughput and achieves a 33.8% improvement over Static Partitioning. It also outperforms the current best dynamic resource allocation technique, Hill-climbing, by 5.7%. Considering fairness accorded to each thread, ARPA attains 43.6%, 18.5% and 9.2% improvements over ICOUNT, Static Partitioning and Hill-climbing, respectively, using a common fairness metric.
{"title":"An Adaptive Resource Partitioning Algorithm for SMT processors","authors":"Huaping Wang, I. Koren, C. M. Krishna","doi":"10.1145/1454115.1454148","DOIUrl":"https://doi.org/10.1145/1454115.1454148","url":null,"abstract":"Simultaneous Multithreading (SMT) increases processor throughput by allowing the parallel execution of several threads. However, fully sharing processor resources may cause resource monopolization by a single thread or other misallocations, resulting in overall performance degradation. Static resource partitioning techniques have been suggested, but are not as effective as dynamically controlling the resource usage of each thread since program behavior does change during its execution. In this paper, we propose an Adaptive Resource Partitioning Algorithm (ARPA) that dynamically assigns resources to threads according to thread behavior changes. ARPA analyzes the resource usage efficiency of each thread in a time period and assigns more resources to threads which can use them in a more efficient way. The purpose of ARPA is to improve the efficiency of resource utilization, thereby improving overall instruction throughput. Our simulation results on a set of 42 multiprogramming workloads show that ARPA outperforms the traditional fetch policy ICOUNT by 55.8% with regard to overall instruction throughput and achieves a 33.8% improvement over Static Partitioning. It also outperforms the current best dynamic resource allocation technique, Hill-climbing, by 5.7%. Considering fairness accorded to each thread, ARPA attains 43.6%, 18.5% and 9.2% improvements over ICOUNT, Static Partitioning and Hill-climbing, respectively, using a common fairness metric.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134056352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunlian Jiang, Xipeng Shen, Jie Chen, Rahul Tripathi
Cache sharing among processors is important for Chip Multiprocessors to reduce inter-thread latency, but also brings cache contention, degrading program performance considerably. Recent studies have shown that job co-scheduling can effectively alleviate the contention, but it remains an open question how to efficiently find optimal co-schedules. Solving the question is critical for determining the potential of a co-scheduling system. This paper presents a theoretical analysis of the complexity of co-scheduling, proving its NP-completeness. Furthermore, for a special case when there are two sharers per chip, we propose an algorithm that finds the optimal co-schedules in polynomial time. For more complex cases, we design and evaluate a sequence of approximation algorithms, among which, the hierarchical matching algorithm produces near-optimal schedules and shows good scalability. This study facilitates the evaluation of co-scheduling systems, as well as offers some techniques directly usable in proactive job co-scheduling.
{"title":"Analysis and approximation of optimal co-scheduling on Chip Multiprocessors","authors":"Yunlian Jiang, Xipeng Shen, Jie Chen, Rahul Tripathi","doi":"10.1145/1454115.1454146","DOIUrl":"https://doi.org/10.1145/1454115.1454146","url":null,"abstract":"Cache sharing among processors is important for Chip Multiprocessors to reduce inter-thread latency, but also brings cache contention, degrading program performance considerably. Recent studies have shown that job co-scheduling can effectively alleviate the contention, but it remains an open question how to efficiently find optimal co-schedules. Solving the question is critical for determining the potential of a co-scheduling system. This paper presents a theoretical analysis of the complexity of co-scheduling, proving its NP-completeness. Furthermore, for a special case when there are two sharers per chip, we propose an algorithm that finds the optimal co-schedules in polynomial time. For more complex cases, we design and evaluate a sequence of approximation algorithms, among which, the hierarchical matching algorithm produces near-optimal schedules and shows good scalability. This study facilitates the evaluation of co-scheduling systems, as well as offers some techniques directly usable in proactive job co-scheduling.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"45 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132267487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.
{"title":"The PARSEC benchmark suite: Characterization and architectural implications","authors":"Christian Bienia, Sanjeev Kumar, J. Singh, Kai Li","doi":"10.1145/1454115.1454128","DOIUrl":"https://doi.org/10.1145/1454115.1454128","url":null,"abstract":"This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121954776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}