Memory coherence activity prediction in commercial workloads
Stephen Somogyi, T. Wenisch, N. Hardavellas, Jangwoo Kim, A. Ailamaki, B. Falsafi
Recent research indicates that prediction-based coherence optimizations offer substantial performance improvements for scientific applications running on distributed shared-memory multiprocessors. Important commercial applications also show sensitivity to coherence latency, which will become more acute as technology scales. It is therefore important to investigate the prediction of memory coherence activity in the context of commercial workloads. This paper studies a trace-based Downgrade Predictor (DGP) for predicting last stores to shared cache blocks, and a pattern-based Consumer Set Predictor (CSP) for predicting subsequent readers. We evaluate this class of predictors for the first time on commercial applications and demonstrate that our DGP correctly predicts 47% to 76% of last stores. Memory sharing patterns in commercial workloads are inherently non-repetitive, so CSP cannot attain high coverage. We perform an opportunity study of a DGP enhanced with competitive underlying predictors and demonstrate, for both commercial and scientific applications, the potential to increase coverage by up to 14%.
{"title":"Memory coherence activity prediction in commercial workloads","authors":"Stephen Somogyi, T. Wenisch, N. Hardavellas, Jangwoo Kim, A. Ailamaki, B. Falsafi","doi":"10.1145/1054943.1054949","DOIUrl":"https://doi.org/10.1145/1054943.1054949","url":null,"abstract":"Recent research indicates that prediction-based coherence optimizations offer substantial performance improvements for scientific applications in distributed shared memory multiprocessors. Important commercial applications also show sensitivity to coherence latency, which will become more acute in the future as technology scales. Therefore it is important to investigate prediction of memory coherence activity in the context of commercial workloads.This paper studies a trace-based Downgrade Predictor (DGP) for predicting last stores to shared cache blocks, and a pattern-based Consumer Set Predictor (CSP) for predicting subsequent readers. We evaluate this class of predictors for the first time on commercial applications and demonstrate that our DGP correctly predicts 47%-76% of last stores. Memory sharing patterns in commercial workloads are inherently non-repetitive; hence CSP cannot attain high coverage. We perform an opportunity study of a DGP enhanced through competitive underlying predictors, and in commercial and scientific applications, demonstrate potential to increase coverage up to 14%.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132061515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Micro-architecture techniques in the Intel® E8870 scalable memory controller
F. Briggs, S. Chittor, Kai Cheng
This paper describes several selected micro-architectural tradeoffs and optimizations for the scalable memory controller of the Intel E8870 chipset architecture. The E8870 chipset supports scalable coherent multiprocessor systems of 2 to 16 processors using a point-to-point Scalability Port (SP) protocol. Its scalable memory controller applies a number of micro-architectural techniques to reduce local and remote idle and loaded latencies. These performance optimizations were achieved within the constraints of maintaining functional correctness while reducing implementation complexity and cost. High-bandwidth point-to-point interconnects and distributed memory are expected to become more common in future platforms supporting powerful multi-core processors, and the techniques discussed in this paper will be applicable to the scalable memory controllers those platforms need. The techniques have been proven in production systems for Itanium® 2 processor platforms.
{"title":"Micro-architecture techniques in the intel® E8870 scalable memory controller","authors":"F. Briggs, S. Chittor, Kai Cheng","doi":"10.1145/1054943.1054948","DOIUrl":"https://doi.org/10.1145/1054943.1054948","url":null,"abstract":"This paper describes several selected micro-architectural tradeoffs and optimizations for the scalable memory controller of the Intel E8870 chipset architecture. The Intel E8870 chipset architecture supports scalable coherent multiprocessor systems using 2 to 16 processors, and a point-to-point Scalability Port (SP) Protocol. The scalable memory controller micro-architecture applies a number of micro-architecture techniques to reduce the local & remote idle and loaded latencies. The performance optimizations were achieved within the constraints of maintaining functional correctness, while reducing implementation complexity and cost. High bandwidth point-to-point interconnects and distributed memory are expected to be more common in future platforms to support powerful multi-core processors. The selected techniques discussed in this paper will be applicable to scalable memory controllers needed in those platforms. These techniques have been proven for production systems for the Itanium® II Processor platforms.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132865243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the effectiveness of prefetching and reuse in reducing L1 data cache traffic: a case study of Snort
G. Surendra, Subhasish Banerjee, S. Nandy
Reducing the number of data cache accesses improves performance, port efficiency, and bandwidth, and motivates the use of single-ported caches instead of complex and expensive multi-ported ones. In this paper we take an intrusion detection system as the target application and study the effectiveness of two techniques in reducing data cache traffic: (i) prefetching data from the cache into local buffers in the processor core, and (ii) load Instruction Reuse (IR). The analysis is carried out using a microarchitecture and instruction set representative of a programmable processor, with the aim of determining whether these techniques are viable for the programmable pattern-matching engines found in many network processors. We find that IR is the most general and efficient technique, reducing cache traffic by up to 60%. However, a combination of prefetching and IR with application-specific tuning performs as well as, and sometimes better than, IR alone.
{"title":"On the effectiveness of prefetching and reuse in reducing L1 data cache traffic: a case study of Snort","authors":"G. Surendra, Subhasish Banerjee, S. Nandy","doi":"10.1145/1054943.1054955","DOIUrl":"https://doi.org/10.1145/1054943.1054955","url":null,"abstract":"Reducing the number of data cache accesses improves performance, port efficiency, bandwidth and motivates the use of single ported caches instead of complex and expensive multi-ported ones. In this paper we consider an intrusion detection system as a target application and study the effectiveness of two techniques - (i) prefetching data from the cache into local buffers in the processor core and (ii) load Instruction Reuse (IR) - in reducing data cache traffic. The analysis is carried out using a microarchitecture and instruction set representative of a programmable processor with the aim of determining if the above techniques are viable for a programmable pattern matching engine found in many network processors. We find that IR is the most generic and efficient technique which reduces cache traffic by up to 60%. However, a combination of prefetching and IR with application specific tuning performs as well as and sometimes better than IR alone.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"201202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116486698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compiler-optimized usage of partitioned memories
L. Wehmeyer, Urs Helmig, P. Marwedel
To meet both the performance and the energy-consumption requirements of embedded systems, new memory architectures are being introduced. Besides the well-known use of caches in the memory hierarchy, processor cores today also include small on-chip memories called scratchpad memories, whose usage is controlled not by hardware but by the programmer or the compiler. Techniques for utilizing these scratchpads have been known for some time. Some newer processors provide more than one scratchpad, making it necessary to enhance the workflow so that this more complex memory architecture can be exploited efficiently. In this work, we present an energy model and an ILP formulation that optimally assigns memory objects to the different scratchpad partitions at compile time, achieving energy savings of up to 22% compared to previous approaches.
{"title":"Compiler-optimized usage of partitioned memories","authors":"L. Wehmeyer, Urs Helmig, P. Marwedel","doi":"10.1145/1054943.1054959","DOIUrl":"https://doi.org/10.1145/1054943.1054959","url":null,"abstract":"In order to meet the requirements concerning both performance and energy consumption in embedded systems, new memory architectures are being introduced. Beside the well-known use of caches in the memory hierarchy, processor cores today also include small onchip memories called scratchpad memories whose usage is not controlled by hardware, but rather by the programmer or the compiler. Techniques for utilization of these scratchpads have been known for some time. Some new processors provide more than one scratchpad, making it necessary to enhance the workflow such that this complex memory architecture can be efficiently utilized. In this work, we present an energy model and an ILP formulation to optimally assign memory objects to different partitions of scratchpad memories at compile time, achieving energy savings of up to 22% compared to previous approaches.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114282443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the effects of wrong-path memory references on processor performance
O. Mutlu, Hyesoon Kim, D. N. Armstrong, Y. Patt
High-performance out-of-order processors spend a significant portion of their execution time on the incorrect program path even though they employ aggressive branch prediction algorithms. Although memory references generated on the wrong path do not change the architectural state of the processor, they can affect the arrangement of data in the memory hierarchy. This paper examines the effects of wrong-path memory references on processor performance. We show that these references significantly affect the IPC (instructions per cycle) of a processor: not modeling them can lead to errors of up to 10% in IPC estimates for the SPEC2000 integer benchmarks, and 7 out of 12 benchmarks show an IPC error greater than 2%. In general, the error in IPC increases with memory latency and instruction window size. We find that wrong-path references are usually beneficial for performance because they prefetch data that will be used by later correct-path references. L2 cache pollution is the most significant negative effect of wrong-path references. Code examples provide insight into how wrong-path references affect performance.
{"title":"Understanding the effects of wrong-path memory references on processor performance","authors":"O. Mutlu, Hyesoon Kim, D. N. Armstrong, Y. Patt","doi":"10.1145/1054943.1054951","DOIUrl":"https://doi.org/10.1145/1054943.1054951","url":null,"abstract":"High-performance out-of-order processors spend a significant portion of their execution time on the incorrect program path even though they employ aggressive branch prediction algorithms. Although memory references generated on the wrong path do not change the architectural state of the processor, they can affect the arrangement of data in the memory hierarchy. This paper examines the effects of wrong-path memory references on processor performance. It is shown that these references significantly affect the IPC (Instructions Per Cycle) performance of a processor. Not modeling them can lead to errors of up to 10% in IPC estimates for the SPEC2000 integer benchmarks; 7 out of 12 benchmarks experience an error of greater than 2% in IPC estimates. In general, the error in the IPC increases with increasing memory latency and instruction window size.We find that wrong-path references are usually beneficial for performance, because they prefetch data that will be used by later correct-path references. L2 cache pollution is found to be the most significant negative effect of wrong-path references. Code examples are shown to provide insights into how wrong-path references affect performance.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131748325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A case for multi-level main memory
M. Ekman, P. Stenström
Current trends suggest that the number of memory chips per processor chip will increase by at least a factor of ten within seven years. This will make the cost of DRAM, and the space and power it consumes, a serious problem. The central question in this research is how cost, size, and power consumption can be reduced by transforming traditional flat main-memory systems into a multi-level hierarchy. We make the case for a multi-level main-memory hierarchy by proposing and evaluating an implementation that enables aggressive use of memory compression, sharing of memory resources among computers, and dynamic power management of unused regions of memory. This paper presents the key design strategies that make this possible. We evaluate our implementation using complete runs of applications from the SPEC2000 suite, SPECjbb, and SAP, which are typical desktop and server applications. We show that only 30% of the memory resources typically needed must be accessed at DRAM speed, whereas the rest can be accessed at a speed an order of magnitude slower. The resulting performance overhead is only 1.2% on average.
{"title":"A case for multi-level main memory","authors":"M. Ekman, P. Stenström","doi":"10.1145/1054943.1054944","DOIUrl":"https://doi.org/10.1145/1054943.1054944","url":null,"abstract":"Current trends suggest that the number of memory chips per processor chip will increase at least a factor of ten in seven years. This will make DRAM cost, the space and the power it consumes a serious problem. The main question raised in this research is how cost, size, and power consumption can be reduced by transforming traditional flat main-memory systems into a multi-level hierarchy. We make the case for a multi-level main memory hierarchy by proposing and evaluating the performance of an implementation that enables aggressive use of memory compression, sharing of memory resources among computers, and dynamic power management of unused regions of memory. This paper presents the key design strategies to make this happen. We evaluate our implementation using complete runs of applications from the Spec 2K suite, SpecJBB, and SAP --- typical desktop and server applications. We show that only 30% of the entire memory resources typically needed must be accessed at DRAM speed whereas the rest can be accessed at a speed that is a magnitude slower. The resulting performance overhead is shown to be only 1.2% on average.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115531761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating kilo-instruction multiprocessors
M. Galluzzi, R. Beivide, Valentin Puente, J. Gregorio, A. Cristal, M. Valero
The ever-increasing gap between processor and memory speeds has a very negative impact on performance. One possible solution is the Kilo-instruction processor, a recently proposed architecture that hides large memory latencies by keeping thousands of instructions in flight. Multiprocessor systems must deal with this increasing memory latency while also facing another source of latency: communication among processors. In this paper, we propose using Kilo-instruction processors as the compute nodes of small-scale CC-NUMA multiprocessors, and we evaluate what we appropriately call Kilo-instruction Multiprocessors. Such systems achieve very good performance and exhibit two interesting behaviours. First, the large number of in-flight instructions lets the system hide not only memory-access latencies but also the communication latencies inherent in remote memory accesses. Second, the pressure imposed by so many in-flight instructions translates into very high contention for the interconnection network, which indicates that more effort must be devoted to designing routers capable of handling high traffic levels.
{"title":"Evaluating kilo-instruction multiprocessors","authors":"M. Galluzzi, R. Beivide, Valentin Puente, J. Gregorio, A. Cristal, M. Valero","doi":"10.1145/1054943.1054953","DOIUrl":"https://doi.org/10.1145/1054943.1054953","url":null,"abstract":"The ever increasing gap in processor and memory speeds has a very negative impact on performance. One possible solution to overcome this problem is the Kilo-instruction processor. It is a recent proposed architecture able to hide large memory latencies by having thousands of in-flight instructions. Current multiprocessor systems also have to deal with this increasing memory latency while facing other sources of latencies: those coming from communication among processors. What we propose, in this paper, is the use of Kilo-instruction processors as computing nodes for small-scale CCNUMA multiprocessors. We evaluate what we appropriately call Kilo-instruction Multiprocessors. This kind of systems appears to achieve very good performance while showing two interesting behaviours. First, the great amount of in-flight instructions makes the system not just to hide the latencies coming from the memory accesses but also the inherent communication latencies involved in remote memory accesses. Second, the significant pressure imposed by many in-flight instructions translates into a very high contention for the interconnection network, what indicates us that more efforts need to be employed in designing routers capable of managing high traffic levels.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122634974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A low-power memory hierarchy for a fully programmable baseband processor
W. Raab, Hans-Martin Blüthgen, U. Ramacher
Future terminals for wireless communication must not only support multiple standards but also execute several of them concurrently. To meet these requirements, flexibility and ease of programming are increasingly important criteria for integrated circuits for digital baseband processing, while their power consumption and area remain as critical as in the past. This paper presents the architecture of a fully programmable system-on-chip for digital signal processing in the baseband of contemporary and upcoming wireless communication standards. Particular focus is given to the memory hierarchy of the multi-processor system and to the measures taken to minimize the power it dissipates. These measures are estimated to reduce the power consumption of the entire chip by 28% compared to a straightforward approach.
{"title":"A low-power memory hierarchy for a fully programmable baseband processor","authors":"W. Raab, Hans-Martin Blüthgen, U. Ramacher","doi":"10.1145/1054943.1054957","DOIUrl":"https://doi.org/10.1145/1054943.1054957","url":null,"abstract":"Future terminals for wireless communication not only must support multiple standards but execute several of them concurrently. To meet these requirements, flexibility and ease of programming of integrated circuits for digital baseband processing are increasingly important criteria for the deployment of such devices, while power consumption and area of the devices remain as critical as in the past.The paper presents the architecture of a fully programmable system-on-chip for digital signal processing in the baseband of contemporary and up-coming standards for wireless communication. Particular focus is given to the memory hierarchy of the multi-processor system and the measures to minimize the power it dissipates. The reduction of the power consumption of the entire chip is estimated to amount to 28% compared to a straightforward approach.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128683421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}