Design trade-offs for load/store buffers in embedded processing environments
Y. Kang, J. Draper
2007 50th Midwest Symposium on Circuits and Systems, published 2007-08-01
DOI: https://doi.org/10.1109/MWSCAS.2007.4488819
Citations: 0
Abstract
Memory latency is a critical issue for conventional high-speed computing platforms, and it is becoming a common problem in embedded and CMP (chip multiprocessing) systems as well. Conventional processors typically adopt caches and a load/store queue (LSQ) to address the processor-to-memory bottleneck. However, the conventional LSQ design, which has a large number of entries, is not appropriate for embedded systems because of its area- and power-hungry out-of-order speculation. Such systems need a compact, low-power load/store buffer that still provides a significant performance improvement. In this paper, we propose an area-efficient wideword load/store buffer (WLSB) that supports both WideWord (256-bit) and scalar (32-bit) load/store instructions for a recently fabricated PIM (processing-in-memory) device. Despite its small size, the 4-entry WLSB yields a 57.33% load hit rate on SPEC2K benchmarks, 5.72% better than a less area-efficient 32-entry fully associative scalar load/store buffer (SLSB). The WLSB was synthesized in IBM 90 nm technology; the resulting implementation occupies less than a seventh of a square millimeter and is projected to run at a 1.6 ns cycle time with 15.72 mW of dynamic power dissipation. This paper demonstrates how such a small buffer affects the load hit rate and quantifies the design trade-offs between wide buffers with few entries and narrow buffers with many entries with respect to size, power, load hit ratio, and clock speed. Although the WLSB was designed specifically for a PIM architecture, it is expected to be useful for other embedded processing platforms and CMPs, where area and power constraints are emphasized.
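The wide-versus-narrow trade-off described in the abstract can be illustrated with a toy model. The following is a minimal sketch, not the authors' design: it assumes fully associative buffers with LRU replacement and allocate-on-miss (the abstract does not state these policies for the WLSB), and it uses a hypothetical synthetic trace of sequential 32-bit loads rather than SPEC2K. The class and function names are illustrative only. Note that a 4-entry buffer of 256-bit entries and a 32-entry buffer of 32-bit entries hold the same 128 bytes of data, so any difference in hit rate here comes from entry width alone.

```python
from collections import OrderedDict


class LoadStoreBuffer:
    """Toy fully associative buffer of fixed-width entries.

    LRU replacement and allocate-on-miss are assumptions made for this
    illustration; they are not taken from the paper.
    """

    def __init__(self, num_entries, entry_bytes):
        self.num_entries = num_entries
        self.entry_bytes = entry_bytes
        self.entries = OrderedDict()  # tag -> True, oldest first
        self.loads = 0
        self.load_hits = 0

    def _tag(self, addr):
        # Every address inside one aligned entry-sized block shares a tag.
        return addr // self.entry_bytes

    def _touch(self, tag):
        """Mark tag most recently used; allocate (evicting LRU) on a miss."""
        if tag in self.entries:
            self.entries.move_to_end(tag)
            return True
        if len(self.entries) >= self.num_entries:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[tag] = True
        return False

    def load(self, addr):
        self.loads += 1
        hit = self._touch(self._tag(addr))
        self.load_hits += int(hit)
        return hit

    def store(self, addr):
        # Stores allocate an entry but are not counted in the load hit rate.
        self._touch(self._tag(addr))

    def data_bytes(self):
        return self.num_entries * self.entry_bytes

    def load_hit_rate(self):
        return self.load_hits / self.loads if self.loads else 0.0


def sequential_scalar_loads(count, base=0, stride=4):
    """Hypothetical trace: consecutive 32-bit (4-byte) load addresses."""
    return [base + i * stride for i in range(count)]


wide = LoadStoreBuffer(num_entries=4, entry_bytes=32)    # 4 x 256-bit
narrow = LoadStoreBuffer(num_entries=32, entry_bytes=4)  # 32 x 32-bit

for addr in sequential_scalar_loads(64):
    wide.load(addr)
    narrow.load(addr)

print(wide.data_bytes(), narrow.data_bytes())
print(wide.load_hit_rate(), narrow.load_hit_rate())
```

On this trace the wide buffer misses once per 256-bit block and hits on the seven scalar loads that follow, while the narrow buffer never sees a repeated address. A purely sequential trace is the best case for wide entries; a trace with heavy reuse of scattered addresses would favor the buffer with more entries, which is why measured rates on real benchmarks, such as those reported in the abstract, differ far less.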