Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer
Nikolas Ioannou, M. Kauschke, M. Gries, Marcelo H. Cintra
To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on the fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme aims to minimize energy consumption within a performance window, taking into consideration not only local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach improves the Energy Delay Product (EDP) by as much as 27.2%, and by 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.
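To make the controller's objective concrete, here is a minimal sketch of picking a frequency that minimizes EDP inside a performance window. The frequency/voltage table, the linear slowdown model, and the 10% window are illustrative assumptions, not the SCC's actual values or the paper's algorithm.

```python
# Hypothetical controller step: pick the lowest-EDP frequency whose
# predicted slowdown stays inside the performance window. The levels,
# voltages, and slowdown model are made up for illustration.
F_LEVELS_MHZ = [800, 533, 400, 320]
V_AT_F = {800: 1.1, 533: 0.9, 400: 0.8, 320: 0.7}   # assumed V/f pairs

def predicted_delay(base_s, f_mhz, f_max=800, cpu_frac=0.1):
    # Only the CPU-bound fraction of the phase stretches at lower
    # frequency; time spent waiting on MPI/memory does not (cpu_frac=0.1
    # models a communication-heavy phase).
    return base_s * ((1 - cpu_frac) + cpu_frac * f_max / f_mhz)

def edp(base_s, f_mhz):
    power = V_AT_F[f_mhz] ** 2 * f_mhz       # P ~ V^2 * f, constant folded
    delay = predicted_delay(base_s, f_mhz)
    return power * delay * delay             # energy * delay

def pick_frequency(base_s, max_slowdown=1.10):
    # Performance window: allow at most 10% predicted slowdown.
    ok = [f for f in F_LEVELS_MHZ
          if predicted_delay(base_s, f) <= max_slowdown * base_s]
    return min(ok, key=lambda f: edp(base_s, f))

print(pick_frequency(1.0))   # -> 400 for this memory-bound phase model
```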
{"title":"Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer","authors":"Nikolas Ioannou, M. Kauschke, M. Gries, Marcelo H. Cintra","doi":"10.1109/PACT.2011.19","DOIUrl":"https://doi.org/10.1109/PACT.2011.19","url":null,"abstract":"To improve energy efficiency processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach provides significant improvement of the Energy Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130775310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs
N. Gulur, R. Manikantan, R. Govindarajan, M. Mehendale
In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a reorganization of the existing single large row buffer in a DRAM bank into multiple smaller row buffers. The proposed configuration helps improve row hit rates and also brings down the energy required for row activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing, widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5%, and 21.6% in quad-, eight-, and sixteen-core workloads, along with a 42%, 28%, and 31% reduction in DRAM energy, respectively. Additionally, we introduce a Need Based Allocation scheme for buffer management that shows additional performance improvement.
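The effect of splitting one row buffer into several can be sketched with a toy hit-rate model. The LRU sub-buffer management and the interleaved two-stream trace below are assumptions for illustration, not the paper's design.

```python
# Toy hit-rate model: a bank whose row buffer is split into `n` smaller
# sub-buffers managed with LRU. The trace interleaves two streaming
# access patterns that thrash a single row buffer.
from collections import OrderedDict

class Bank:
    def __init__(self, n_subbuffers):
        self.bufs = OrderedDict()            # open sub-rows, LRU ordered
        self.capacity = n_subbuffers
        self.hits = self.accesses = 0

    def access(self, row):
        self.accesses += 1
        if row in self.bufs:
            self.hits += 1
            self.bufs.move_to_end(row)       # refresh LRU position
        else:
            if len(self.bufs) == self.capacity:
                self.bufs.popitem(last=False)  # close the LRU sub-row
            self.bufs[row] = True              # activate the new row

# Each stream touches every row 4 times (spatial locality within a row).
stream_a = [r for r in range(10) for _ in range(4)]
stream_b = [100 + r for r in range(10) for _ in range(4)]
trace = [x for pair in zip(stream_a, stream_b) for x in pair]

for n in (1, 4):
    bank = Bank(n)
    for row in trace:
        bank.access(row)
    print(n, "buffer(s): hit rate = %.2f" % (bank.hits / bank.accesses))
# -> 1 buffer(s): hit rate = 0.00 ; 4 buffer(s): hit rate = 0.75
```

With one buffer the two streams evict each other on every access; with four sub-buffers each stream keeps its rows open.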
{"title":"Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs","authors":"N. Gulur, R. Manikantan, R. Govindarajan, M. Mehendale","doi":"10.1109/PACT.2011.34","DOIUrl":"https://doi.org/10.1109/PACT.2011.34","url":null,"abstract":"In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a re-organization of the existing single large row buffer in a DRAM bank into multiple smaller row-buffers. The proposed configuration helps improve the row hit rates and also brings down the energy required for row-activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5% and 21.6% in quad, eight and sixteen core workloads along with a 42%, 28% and 31% reduction in DRAM energy. Additionally, we introduce a Need Based Allocation scheme for buffer management that shows additional performance improvement.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"293 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131822126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we present OUTRIDERHP, a novel implementation of a decoupled architecture that approaches the performance of contemporary out-of-order processors on parallel benchmarks while maintaining low hardware complexity. OUTRIDERHP leverages the compiler to separate a single thread of execution into memory-accessing and memory-consuming streams, which we call strands, that can be executed concurrently. We identify loss-of-decoupling events that cripple performance on traditional decoupled architectures, and design OUTRIDERHP to enable extraction of multiple strands and control speculation, which provide superior memory and functional-unit latency tolerance. OUTRIDERHP outperforms a baseline in-order architecture by 26-220% and Decoupled Access/Execute by 7-172% when executing parallel benchmarks on an 8-core CMP configuration. OUTRIDERHP performs within 15% of higher-complexity out-of-order cores despite not utilizing large physical register files, dynamic scheduling, or register renaming hardware.
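The strand idea has a familiar software analogy: a producer thread issuing "memory accesses" feeds a bounded queue consumed by a compute thread, so access latency is hidden by run-ahead. The sketch below is only that analogy, not OUTRIDERHP's hardware or compiler flow.

```python
# Producer/consumer analogy of access/execute strands: the access strand
# tolerates "memory latency" by running ahead and queueing operands.
import threading, queue, time

DONE = object()                      # end-of-stream marker

def access_strand(data, q):
    for x in data:
        time.sleep(0.001)            # stand-in for a long memory access
        q.put(x)                     # forward the loaded value
    q.put(DONE)

def execute_strand(q, out):
    while True:
        x = q.get()
        if x is DONE:
            break
        out.append(x * x)            # stand-in for the compute portion

data, out = list(range(100)), []
q = queue.Queue(maxsize=16)          # bounded FIFO between the strands
t = threading.Thread(target=access_strand, args=(data, q))
t.start()
execute_strand(q, out)
t.join()
print(len(out), sum(out))
```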
{"title":"Decoupled Architectures as a Low-Complexity Alternative to Out-of-order Execution","authors":"N. Crago, Sanjay J. Patel","doi":"10.1109/PACT.2011.28","DOIUrl":"https://doi.org/10.1109/PACT.2011.28","url":null,"abstract":"In this paper we present OUTRIDERHP, a novel implementation of a decoupled architecture that approaches the performance of contemporary out-of-order processors on parallel benchmarks while maintaining low hardware complexity. OUTRIDERHP leverages the compiler to separate a single thread of execution into memory-accessing and memory-consuming streams that can be executed concurrently, which we call strands. We identify loss-of-decoupling events which cripple performance on traditional decoupled architectures, and design OUTRIDERHP to enable extraction of multiple strands and control speculation which provide superior memory and functional unit latency tolerance. OUTRIDERHP outperforms a baseline in-order architecture by 26-220% and Decoupled Access/Execute by 7-172% when executing parallel benchmarks on an 8-core CMP configuration. OUTRIDERHP performs within 15% of higher-complexity out-of-order cores despite not utilizing large physical register files, dynamic scheduling, and register renaming hardware.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130479663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current main memory system design is severely limited by the decades-old synchronous DRAM architecture, which requires the memory controller to track the internal status of memory devices (chips) and schedule the timing of all device operations. This rigidity has become an obstacle to integrating emerging memory technologies such as PCM into existing memory systems, because their timing requirements are vastly different. Furthermore, with the trend of embedding memory controllers into processors, it is crucial to have interoperability between general-purpose processors and diverse memory modules. To address this issue, we propose a new memory architecture framework called universal memory architecture (UniMA). It enables this interoperability by decoupling the scheduling of device operations from the memory controller, using a bridge chip at each memory module to perform local scheduling. The new architecture may also help improve memory scalability, power efficiency, and bandwidth, as do previously proposed decoupled memory organizations. A major focus of this study is to evaluate the performance impact of local scheduling of device operations. We present a prototype implementation of UniMA on top of a DDRx memory bus and evaluate its efficiency with different workloads. The simulation results show that UniMA actually improves memory system efficiency for memory-intensive workloads, due to increased parallelism among memory modules. The overall performance improvement over the conventional DDRx memory architecture is 3.1% on average. The performance of other workloads is reduced slightly, by 1.0% on average, due to a small increase of memory idle latency. In short, the prototype and evaluation demonstrate that it is possible to integrate diverse memory technologies into a single memory architecture with virtually no loss of overall performance.
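A toy model of the bridge-chip idea (the interface and timing numbers are hypothetical, not UniMA's specification): the controller submits technology-agnostic requests, and each module's bridge applies its own device timing locally.

```python
# Each module's bridge applies its own device timing; the controller only
# submits (time, row) requests and reads back completion times. The
# Bridge API and the nanosecond figures are invented for this sketch.
class Bridge:
    def __init__(self, name, activate_ns, read_ns):
        self.name = name
        self.activate_ns = activate_ns   # row activation latency
        self.read_ns = read_ns           # column read latency
        self.open_row = None
        self.ready_at = 0                # when the device is next free

    def submit(self, now_ns, row):
        start = max(now_ns, self.ready_at)
        if row == self.open_row:
            latency = self.read_ns                      # row-buffer hit
        else:
            latency = self.activate_ns + self.read_ns   # activate + read
        self.open_row = row
        self.ready_at = start + latency
        return self.ready_at             # completion time for this request

dram = Bridge("DDRx", activate_ns=14, read_ns=14)
pcm = Bridge("PCM", activate_ns=55, read_ns=28)   # very different timing
for t_ns, (mod, row) in enumerate([(dram, 1), (pcm, 1), (dram, 1), (pcm, 2)]):
    print(mod.name, "done at", mod.submit(t_ns, row), "ns")
```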
{"title":"Memory Architecture for Integrating Emerging Memory Technologies","authors":"Kun Fang, Long Chen, Zhao Zhang, Zhichun Zhu","doi":"10.1109/PACT.2011.71","DOIUrl":"https://doi.org/10.1109/PACT.2011.71","url":null,"abstract":"Current main memory system design is severely limited by the decades-old synchronous DRAM architecture, which requires the memory controller to track the internal status of memory devices (chips) and schedule the timing of all device operations. This rigidity has become an obstacle of integrating emerging memory technologies such as PCM into existing memory systems, because their timing requirements are vastly different. Furthermore, with the trend of embedding memory controllers into processors, it is crucial to have interoperability among general-purpose processors and diverse memory modules. To address this issue, we propose a new memory architecture framework called universal memory architecture (UniMA). It enables the interoperability by decoupling the scheduling of device operations from memory controller, using a bridge chip at each memory module to perform local scheduling. The new architecture may also help improve memory scalability, power efficiency, and bandwidth as previously proposed decoupled memory organizations. A major focus of this study is to evaluate the performance impact of local scheduling of device operations. We present a prototype implementation of UniMA on top of DDRx memory bus, and then evaluate its efficiency with different workloads. The simulation results show that UniMA actually improves memory system efficiency for memory-intensive workloads due to increased parallelism among memory modules. The overall performance improvement over the conventional DDRx memory architecture is 3.1% on average. The performance of other workloads is reduced slightly, by 1.0% on average, due to a small increase of memory idle latency. In short, the prototype and evaluation demonstrate that it is possible to integrate diverse memory technologies into a single memory architecture with virtually no loss of overall performance.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130687013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
W. Ding, Jithendra Srinivas, M. Kandemir, Mustafa Karaköy
This paper presents and evaluates a cache hierarchy-aware code parallelization/mapping and scheduling strategy for multicore architectures. Our proposed parallelization/mapping strategy determines a loop iteration-to-core mapping by taking into account the data access pattern of an application and the on-chip cache hierarchy of a target architecture. The goal of this step is to maximize data locality at each level of cache while minimizing data dependences across the cores. Our scheduling strategy, on the other hand, determines a schedule for the iterations assigned to each core in the target architecture, with the goal of satisfying all the data dependences in the code (both intra-core and inter-core) and reducing data reuse distances across the cores that share data. We formulate both the parallelization/mapping problem and the scheduling problem in a linear algebraic framework and solve them using Farkas' Lemma and integer Fourier-Motzkin elimination. To measure the effectiveness of our schemes, we implemented them in a compiler and tested them using eight multithreaded application programs on a multicore machine. Our results show that the proposed mapping scheme reduces cache miss rates at all levels of the cache hierarchy and significantly improves the execution time of applications compared to alternate approaches, and that when supported by scheduling, the improvements in cache miss rates and execution time become much larger.
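The mapping goal can be illustrated without the polyhedral machinery: keep iterations that touch the same cache line on the same core. The block size and access pattern below are assumptions for illustration; the paper derives such mappings formally.

```python
# Count how many cache lines end up touched by more than one core under
# a locality-aware block mapping vs. a cyclic mapping, for a loop where
# iteration i accesses a[i]. BLOCK assumes 64 B lines of 8 B elements.
BLOCK = 8

def block_map(i, n_cores):
    return (i // BLOCK) % n_cores    # whole cache lines stay on one core

def cyclic_map(i, n_cores):
    return i % n_cores               # neighbors land on different cores

def lines_shared(mapping, n_iters, n_cores):
    owners = {}
    for i in range(n_iters):
        owners.setdefault(i // BLOCK, set()).add(mapping(i, n_cores))
    return sum(1 for cores in owners.values() if len(cores) > 1)

n_iters, n_cores = 64, 4
print("lines shared by >1 core: block =",
      lines_shared(block_map, n_iters, n_cores),
      ", cyclic =", lines_shared(cyclic_map, n_iters, n_cores))
# -> block = 0 , cyclic = 8 (every line is pulled into all four caches)
```

Block mapping leaves no cache line shared between cores, while cyclic mapping drags every line into all four private caches.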
{"title":"Compiler Directed Data Locality Optimization for Multicore Architectures","authors":"W. Ding, Jithendra Srinivas, M. Kandemir, Mustafa Karaköy","doi":"10.1109/PACT.2011.24","DOIUrl":"https://doi.org/10.1109/PACT.2011.24","url":null,"abstract":"This paper presents and evaluates a cache hierarchy-aware code parallelization/mapping and scheduling strategy for multicore architectures. Our proposed parallelization/mapping strategy determines a loop iteration-to-core mapping by taking into account the data access pattern of an application and the on-chip cache hierarchy of a target architecture. The goal of this step is to maximize data locality at each level of caches while minimizing the data dependences across the cores. Our scheduling strategy on the other hand determines a schedule for the iterations assigned to each core in the target architecture, with the goal of satisfying all the data dependences in the code (both intra-core and inter-core) and reducing data reuse distances across the cores that share data. We formulate both parallelization/mapping problem and scheduling problem in a linear algebraic framework and solve them using the Farkas Lemma and the Integer Fourier-Motzkin Elimination. To measure the effectiveness of our schemes, we implemented them in a compiler and tested them using eight multithreaded application programs on a multicore machine. Our results show that the proposed mapping scheme reduces cache miss rates at all levels of the cache hierarchy and improves execution time of applications significantly, compared to alternate approaches, and when supported by scheduling, the improvements in cache miss rates and execution time become much larger.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130690034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Auto-tuning has emerged as an important practical method for creating highly optimized code. However, the growing complexity of architectures and applications has resulted in a prohibitively large search space that precludes empirical auto-tuning. Here, we focus on the challenge posed to auto-tuning by applications that require tuning not just of a small number of distinct kernels, but of a large number of kernels that exhibit similar computation and memory access characteristics and require optimization over similar problem spaces. We propose an auto-tuning method for tensor contraction functions on GPUs, based on parameterized micro-benchmarks. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2x over a version that used default optimizations without auto-tuning.
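The mechanism can be sketched in a few lines: time one parameterized stand-in kernel across the shared parameter grid and reuse the winning configuration for the whole kernel family. The kernel, grid, and tile parameter below are illustrative, not the paper's tensor contraction benchmarks.

```python
# Search a shared parameter space once, by timing a parameterized
# stand-in kernel, and reuse the winner for every kernel in the family.
import time

def micro_benchmark(tile, n=4096):
    # Stand-in for a tiled kernel; only the loop structure depends on
    # the tunable `tile` parameter.
    a = list(range(n))
    t0 = time.perf_counter()
    s = 0
    for i in range(0, n, tile):
        for j in range(i, min(i + tile, n)):
            s += a[j] * a[j]
    return time.perf_counter() - t0

grid = [8, 16, 32, 64, 128]          # candidate tile sizes
best = min(grid, key=lambda t: min(micro_benchmark(t) for _ in range(5)))
print("tile size selected for the whole kernel family:", best)
```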
{"title":"Parameterized Micro-benchmarking: An Auto-tuning Approach for Complex Applications","authors":"Wenjing Ma, S. Krishnamoorthy, G. Agrawal","doi":"10.1145/2212908.2212938","DOIUrl":"https://doi.org/10.1145/2212908.2212938","url":null,"abstract":"Auto-tuning has emerged as an important practical method for creating highly optimized code. However, the growing complexity of architectures and applications has resulted in a prohibitively large search space that preclude empirical auto-tuning. Here, we focus on the challenge to auto-tuning presented by applications that require auto-tuning of not just a small number of distinct kernels, but a large number of kernels that exhibit similar computation and memory access characteristics and require optimization over similar problem spaces. We propose an auto-tuning method for tensor contraction functions on GPUs, based on parameterized micro-benchmarks. Using our parameterized micro-benchmarking approach, we obtain a speedup of up to 2 over the version that used default optimizations without auto-tuning.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133580702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programmer productivity considerations are increasing the popularity of interpreted languages like Python. At the same time, for applications where performance is important, these languages clearly fall short, even on uniprocessors. In addition, the use of dynamic data structures in a language like Python makes it very hard to use emerging libraries for enabling execution on multi-core and many-core architectures. This paper presents a framework for compiling Python to use multi-core and many-core libraries. The key component of our framework is a suite of algorithms for replacing dynamic and/or nested data structures by arrays while minimizing unnecessary data copying costs. This involves a novel use of an existing partial redundancy elimination algorithm, the development of a new demand-driven interprocedural partial redundancy algorithm, a data flow formulation for determining that the contents of the data structure are of the same type, and a linearization algorithm. We have evaluated our framework using data mining and two linear algebra applications written in pure Python. The key observations were: 1) the code generated by our framework is only 10% to 20% slower than hand-written C code that invokes the same libraries, 2) our optimizations turn out to be significant for improving performance in most cases, and 3) we outperform interpreted Python and the C++ code generated by an existing tool by one to two orders of magnitude.
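Two of the steps the framework automates, checking that a structure's contents share one type and linearizing the nested structure into an array, can be shown behaviorally at runtime. The real system does this with compile-time analysis, so the sketch below is only an illustration of the transformation's effect.

```python
# Runtime illustration of two steps the compiler automates: verifying
# that a nested structure is type-homogeneous and rectangular, then
# linearizing it into one contiguous buffer a native library could use.
from array import array

def linearize(nested):
    rows, cols = len(nested), len(nested[0])
    flat = array("d")                      # contiguous C doubles
    for row in nested:
        if len(row) != cols:
            raise ValueError("ragged structure cannot be linearized")
        for x in row:
            if not isinstance(x, float):
                raise TypeError("heterogeneous element type")
            flat.append(x)
    return flat, (rows, cols)

buf, (rows, cols) = linearize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(rows, cols, buf[2 * cols + 1])       # element (2, 1) -> 6.0
```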
{"title":"Compiling Dynamic Data Structures in Python to Enable the Use of Multi-core and Many-core Libraries","authors":"Bin Ren, G. Agrawal","doi":"10.1109/PACT.2011.13","DOIUrl":"https://doi.org/10.1109/PACT.2011.13","url":null,"abstract":"Programmer productivity considerations are increasing the popularity of interpreted languages like Python. At the same time, for applications where performance is important, these languages clearly lack even on uniprocessors. In addition, the use of dynamic data structures in a language like Python makes it very hard to use emerging libraries for enabling the execution on multi-core and many-core architectures. This paper presents a framework for compiling Python to use multi-core and many-core libraries. The key component of our framework involves a suite of algorithms for replacing dynamic and/or nested data structures by arrays, while minimizing unnecessary data copying costs. This involves a novel use of an existing partial redundancy elimination algorithm, development of a new demand-driven interprocedural partial redundancy algorithm, a data flow formulation for determining that the contents of the data structure are of the same type, and a linearization algorithm. We have evaluated our framework using data mining and two linear algebra applications written in pure Python. The key observations were: 1) the code generated by our framework is only 10% to 20% slower compared to the hand-written C code that invokes the same libraries, 2) our optimizations turn out to be significant for improving the performance in most cases, and 3) we outperform interpreted Python and the C++ code generated by an existing tool by one to two orders of magnitude.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115518174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory controllers in graphics processing units (GPUs) often employ out-of-order scheduling to maximize row access locality. However, this requires complex logic compared with in-order scheduling. To provide low-cost, low-complexity memory scheduling, we propose an alternative in which scheduling is performed not at the destination (i.e., the memory controller) but at the source (i.e., the cores). We propose two complementary techniques in source-based memory scheduling: network congestion-aware source throttling, and super packets, where multiple request packets are grouped together to create a single super packet. By combining these techniques, performance across a wide range of applications is within 95% of that of the complex FR-FCFS scheduler on average, at significantly lower cost and complexity.
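Both source-side mechanisms are easy to sketch. The grouping rule, the row function, and the congestion threshold below are assumptions, not the paper's parameters.

```python
# Source-side scheduling sketch: group consecutive same-row requests
# into super packets, and gate injection on measured congestion.
def form_super_packets(addrs, row_of, max_group=4):
    groups, current = [], [addrs[0]]
    for a in addrs[1:]:
        if row_of(a) == row_of(current[0]) and len(current) < max_group:
            current.append(a)        # same DRAM row: extend super packet
        else:
            groups.append(current)
            current = [a]
    groups.append(current)
    return groups

def should_inject(congestion, threshold=0.7):
    return congestion < threshold    # throttle the source when congested

row_of = lambda addr: addr >> 10     # assume 1 KiB rows
addrs = [0x000, 0x040, 0x080, 0x400, 0x440, 0x000]
print([len(p) for p in form_super_packets(addrs, row_of)],
      "inject now?", should_inject(0.55))
# -> [3, 2, 1] inject now? True
```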
{"title":"An Alternative Memory Access Scheduling in Manycore Accelerators","authors":"Yonggon Kim, Hyunseok Lee, John Kim","doi":"10.1109/PACT.2011.37","DOIUrl":"https://doi.org/10.1109/PACT.2011.37","url":null,"abstract":"Memory controllers in graphics processing units (GPU) often employ out-of-order scheduling to maximize row access locality. However, this requires complex logic to enable out-of-order scheduling compared with in-order scheduling. To provide a low-cost and low-complexity memory scheduling, we propose an alternative memory scheduling where the memory scheduling is performed not at the destination (i.e., memory controller) but is done at the source (i.e., the cores). We propose two complementary techniques in source-based memory scheduling -- network congestion-aware source throttling and super packets, where multiple request packets are grouped together to create a single super packet. By combing these techniques, the performance across a wide range of application is within 95% of the complex FR-FCFS on average and at significantly lower cost and complexity.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114477105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. M. Hassan, Dhruv Choudhary, M. Rasquinha, S. Yalamanchili
The presence of multiple memory controllers (MCs) and their integration into the on-chip network fabric creates a highly concurrent system that can support significant levels of memory-level parallelism (MLP) across cores. This work examines the trade-off among DRAM parameters, bank-level parallelism (BLP), and row-buffer hit rate, quantifying how much effective BLP is necessary to approach a 100% hit rate. We further study how this trade-off can be controlled, and propose a class of global (system-level) and local (within an MC) address mappings that can be tuned to optimize performance across a set of multiprogrammed benchmarks.
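A sketch of what a "tunable" address mapping means here, with field widths and bit positions as assumptions: placing the MC/bank select bits low in the address spreads a sequential stream across controllers and banks (more BLP), while placing them high keeps the stream inside one row buffer (more locality).

```python
# Decode an address into (MC, bank, row) with a tunable split point.
# With the select bits low (shift=6), a 64 B-stride stream fans out
# across MCs and banks (high BLP); with them high (shift=12), the same
# stream stays in one bank's row buffer (high locality).
def make_decoder(mc_bits=2, bank_bits=3, shift=6):
    def decode(addr):
        mc = (addr >> shift) & ((1 << mc_bits) - 1)              # global
        bank = (addr >> (shift + mc_bits)) & ((1 << bank_bits) - 1)
        row = addr >> (shift + mc_bits + bank_bits)              # local
        return mc, bank, row
    return decode

stream = [i * 64 for i in range(8)]          # sequential 64 B accesses
for shift in (6, 12):
    decode = make_decoder(shift=shift)
    print("shift", shift, [decode(a) for a in stream])
```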
{"title":"Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments","authors":"S. M. Hassan, Dhruv Choudhary, M. Rasquinha, S. Yalamanchili","doi":"10.1109/PACT.2011.33","DOIUrl":"https://doi.org/10.1109/PACT.2011.33","url":null,"abstract":"The presence of multiple MCs and their integration into the on-chip network fabric creates a highly concurrent system that can support significant levels of memory level parallelism (MLP) across cores. This work exposes the trade-off between DRAM parameters, bank level parallelism (BLP), and row buffer hit rate that exposes the amount of effective BLP that is necessary to approximate a 100% hit rate. We further study how this trade-off can be controlled and propose a class of global (system) and local (within an MC) address mappings that can be tuned to optimize the performance across a set of multiprogrammed benchmarks.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"14 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120904947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junghoon Lee, Minjeong Shin, H. Kim, John Kim, Jaehyuk Huh
The unique characteristics of prefetch traffic have not been considered in on-chip network design for multicore architectures, and most prefetchers are oblivious to network congestion when generating prefetch requests. In this work, we investigate the interaction between prefetchers and on-chip networks and exploit the synergy of these two components in multi-core architectures. We explore prefetch-aware on-chip networks that differentiate between prefetch and demand traffic by prioritizing demand traffic. In addition, we propose a prefetch control mechanism based on network congestion. Our evaluations show that the combination of the proposed prefetch-aware router architecture and congestion-sensitive prefetch control improves the performance of benchmarks by 11-13% on average, and by up to 30% on some workloads.
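A minimal sketch of both mechanisms, with the priority rule and the congestion thresholds as assumptions rather than the paper's exact policies:

```python
# Demand packets drain before prefetches at each router port, and the
# prefetcher lowers its degree as congestion rises.
import heapq

DEMAND, PREFETCH = 0, 1        # smaller value wins arbitration

class RouterPort:
    def __init__(self):
        self.q, self.seq = [], 0
    def enqueue(self, kind, pkt):
        heapq.heappush(self.q, (kind, self.seq, pkt))  # FIFO within class
        self.seq += 1
    def forward(self):
        return heapq.heappop(self.q)[2] if self.q else None

def prefetch_degree(congestion, max_degree=4):
    if congestion > 0.8:
        return 0                       # stop prefetching entirely
    if congestion > 0.5:
        return max_degree // 2         # back off
    return max_degree

port = RouterPort()
port.enqueue(PREFETCH, "pf A"); port.enqueue(DEMAND, "ld X")
port.enqueue(PREFETCH, "pf B")
print(port.forward(), port.forward(), port.forward())  # ld X, pf A, pf B
print("prefetch degree at 0.6 congestion:", prefetch_degree(0.6))
```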
{"title":"Exploiting Mutual Awareness between Prefetchers and On-chip Networks in Multi-cores","authors":"Junghoon Lee, Minjeong Shin, H. Kim, John Kim, Jaehyuk Huh","doi":"10.1109/PACT.2011.27","DOIUrl":"https://doi.org/10.1109/PACT.2011.27","url":null,"abstract":"The unique characteristics of prefetch traffic have not been considered in on-chip network design for multicore architectures. Most prefetchers are often oblivious to the network congestion when generating prefetech requests. In this work, we investigate the interaction between prefetchers and on-chip networks and exploit the synergy of these two components in multi-core architectures. We explore prefetchaware on-chip networks that differentiates between prefetch and demand traffic by prioritizing demand traffic. In addition, we propose prefetch control mechanism based on network congestion. Our evaluations show that the combination of the proposed prefetch-aware router architecture and congestion sensitive prefetch control improves the performance of benchmarks by 11-13% on average, up to 30% on some of the workloads.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122200295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}