Motivated by the data value locality and quality tolerance present in multimedia applications, we propose a new micro-architecture, the region-level approximate computation buffer (RACB), to reduce power consumption in such applications. The proposed RACB relaxes exact tag matching into partial and approximate tag matching and applies it to regions of code in a program, thereby allowing aggressive reduction in computation and execution, in addition to reductions in memory accesses and pipeline activity. Our experiments demonstrate that a 64-entry RACB can eliminate up to 70% of region-level execution without noticeable quality degradation in MPEG-2 video decoding, corresponding to system power savings of 55.9% with respect to the regions.
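The idea of approximate tag matching over code regions can be illustrated with a small software model. This is only a sketch under stated assumptions, not the RACB hardware described in the paper: the class name ApproxReuseBuffer, the drop_bits parameter (how many low-order input bits are ignored when matching), and the LRU replacement policy are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class ApproxReuseBuffer:
    """Toy model of a region-level reuse buffer with approximate tag matching."""

    def __init__(self, entries=64, drop_bits=2):
        self.entries = entries      # capacity (64 entries, as in the abstract)
        self.drop_bits = drop_bits  # hypothetical: low-order bits ignored in the tag
        self.table = OrderedDict()  # tag -> cached region result, kept in LRU order
        self.hits = 0
        self.lookups = 0

    def _tag(self, region_id, inputs):
        # Approximate tag: region identifier plus inputs with low-order bits dropped.
        return (region_id, tuple(x >> self.drop_bits for x in inputs))

    def run(self, region_id, inputs, compute):
        """Return a cached result on an approximate match, else execute the region."""
        self.lookups += 1
        tag = self._tag(region_id, inputs)
        if tag in self.table:
            self.hits += 1
            self.table.move_to_end(tag)     # refresh LRU position
            return self.table[tag]          # region execution is skipped
        result = compute(*inputs)
        self.table[tag] = result
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # evict the least recently used entry
        return result


def average(a, b):
    """Stand-in for a code region, e.g. a pixel-averaging kernel."""
    return (a + b) // 2


buf = ApproxReuseBuffer(entries=64, drop_bits=2)
first = buf.run("avg", (100, 200), average)   # miss: the region executes
second = buf.run("avg", (101, 202), average)  # approximate hit: cached value reused
print(first, second, buf.hits, buf.lookups)
```

In this model the second call reuses the first result even though the exact answer differs slightly, which is the kind of small error a quality-tolerant multimedia workload may absorb. Raising drop_bits increases the hit rate at the cost of larger output error.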
{"title":"Region-level approximate computation reuse for power reduction in multimedia applications","authors":"Xueqi Cheng, M. Hsiao","doi":"10.1145/1077603.1077635","DOIUrl":"https://doi.org/10.1145/1077603.1077635","url":null,"abstract":"Motivated by data value locality and quality tolerance present in multimedia applications, we propose a new micro-architecture, region-level approximate computation buffer (RACB), to reduce power consumption in such applications. The proposed RACB relaxes the exact matching into partial and approximate tag matching and applies it to regions of code in a program, thereby allowing for aggressive computation/execution reduction, in addition to reductions in memory accesses and pipeline activities. Our experiments demonstrate that a 64-entry RACB can yield up to 70% of region-level execution reduction without noticeable quality degradation in MPEG-2 video decoding, corresponding to 55.9% of system power savings with respect to the regions.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129105234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a multi-story power delivery scheme that significantly reduces supply noise and power consumption compared to a conventional power delivery scheme. To maximize the effectiveness of the proposed scheme, a digital voltage regulator is designed to balance the current dissipation of circuits in different voltage domains. Data transfer circuits based on capacitive coupling are developed for efficient inter-story data communication. Simulation results show 66% and 67% reductions in IR noise and L di/dt noise, respectively, while total power consumption is reduced by 5% compared to a conventional power delivery scheme.
{"title":"Multi-story power delivery for supply noise reduction and low voltage operation","authors":"Jie Gu, C. Kim","doi":"10.1145/1077603.1077651","DOIUrl":"https://doi.org/10.1145/1077603.1077651","url":null,"abstract":"This paper presents a multi-story power delivery scheme which shows significant reduction of supply noise and power consumption compared to conventional power delivery scheme. To maximize the effectiveness of the proposed scheme, a digital voltage regulator is designed to balance the current dissipation of circuits in different voltage domains. Data transfer circuits based on capacitive coupling are developed for efficient inter-story data communication. Simulation results show 66% and 67% reduction of IR noise and Ldi/dt noise, respectively, while the total power consumption was reduced by 5% compared to a conventional power delivery scheme.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133370406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingmin Li, Mark Hempstead, Patrick Mauro, D. Brooks, Zhigang Hu, K. Skadron
This paper studies the impact on energy efficiency and thermal behavior of design style and clock-gating style in queue and array structures. These structures are major sources of power dissipation, and both design styles and various clock gating schemes can be found in modern, high-performance processors. Although some work in the circuits domain has explored these issues from a power perspective, thermal treatments are less common, and we are not aware of any work in the architecture domain. We study both SRAM and latch and multiplexer ("latch-mux") designs and their associated clock-gating options. Using circuit-level simulations of both design styles, we derive power-dissipation ratios which are then used in cycle-level power/performance/thermal simulations. We find that even though the "unconstrained" power of SRAM designs is always better than latch-mux designs, latch-mux designs dissipate less power in practice when a structure's average occupancy is low but access rate is high, especially when "stall gating" is used to minimize switching power. We also find that latch-mux designs with stall gating are especially promising from a thermal perspective, because they exhibit lower power density than SRAM designs. Overall, when combined with implementation and verification challenges for SRAMs, latch-mux designs with stall gating appear especially promising for designs with thermal constraints. This paper also shows the importance of considering the interaction between architectural and circuit design choices when performing early-stage design exploration.
{"title":"Power and thermal effects of SRAM vs. latch-mux design styles and clock gating choices","authors":"Yingmin Li, Mark Hempstead, Patrick Mauro, D. Brooks, Zhigang Hu, K. Skadron","doi":"10.1145/1077603.1077647","DOIUrl":"https://doi.org/10.1145/1077603.1077647","url":null,"abstract":"This paper studies the impact on energy efficiency and thermal behavior of design style and clock-gating style in queue and array structures. These structures are major sources of power dissipation, and both design styles and various clock gating schemes can be found in modern, high-performance processors. Although some work in the circuits domain has explored these issues from a power perspective, thermal treatments are less common, and we are not aware of any work in the architecture domain. We study both SRAM and latch and multiplexer (\"latch-mux\") designs and their associated clock-gating options. Using circuit-level simulations of both design styles, we derive power-dissipation ratios which are then used in cycle-level power/performance/thermal simulations. We find that even though the \"unconstrained\" power of SRAM designs is always better than latch-mux designs, latch-mux designs dissipate less power in practice when a structure's average occupancy is low but access rate is high, especially when \"stall gating\" is used to minimize switching power. We also find that latch-mux designs with stall gating are especially promising from a thermal perspective, because they exhibit lower power density than SRAM designs. Overall, when combined with implementation and verification challenges for SRAMs, latch-mux designs with stall gating appear especially promising for designs with thermal constraints. This paper also shows the importance of considering the interaction between architectural and circuit design choices when performing early-stage design exploration.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130890023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a hierarchical power management (PM) architecture that aims to facilitate power-awareness in an energy-managed computer (EMC) system with multiple components. The proposed architecture divides the PM function into two layers: system-level and component-level. The system-level hierarchical PM is formulated as a concurrent service request flow regulation and application scheduling problem. Experimental results show that a 25% reduction in total system energy can be achieved compared to the optimal component-level DPM policy.
{"title":"Hierarchical power management with application to scheduling","authors":"Peng Rong, Massoud Pedram","doi":"10.1145/1077603.1077667","DOIUrl":"https://doi.org/10.1145/1077603.1077667","url":null,"abstract":"This paper presented a hierarchical power management architecture which aims to facilitate power-awareness in an energy-managed computer (EMC) system with multiple components. The proposed architecture divides PM function into two layers: system-level and component-level. The system-level hierarchical PM was formulated as a concurrent service request flow regulation and application scheduling problem. Experimental results showed that a 25% reduction in the total system energy can be achieved compared to the optimal component-level DPM policy.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121019054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Zhang, John M. Wilson, R. Bashirullah, L. Luo, Jian Xu, P. Franzon
By using current-sensing differential buses with driver pre-emphasis techniques, power dissipation is reduced by 26.0%-51.2% and peak current is reduced by 63.8%, compared to conventional repeater insertion techniques, for 10 mm long buses in TSMC 0.25 µm technology. The proposed architecture lowers the worst-case ratio of coupling capacitance to total capacitance to 14.4%. It requires only 7.9% more bus routing area than single-ended designs for a 16-bit bus, and eliminates all repeater placement blockages. To further verify that the driver pre-emphasis techniques can also be applied to voltage-mode single-ended buses, a test chip in TSMC 0.18 µm technology was fabricated and measured.
{"title":"Driver pre-emphasis techniques for on-chip global buses","authors":"L. Zhang, John M. Wilson, R. Bashirullah, L. Luo, Jian Xu, P. Franzon","doi":"10.1145/1077603.1077650","DOIUrl":"https://doi.org/10.1145/1077603.1077650","url":null,"abstract":"By using current-sensing differential buses with driver pre-emphasis techniques, power dissipation is reduced by 26.0%-51.2% and peak current is reduced by 63.8%, compared to conventional repeater insertion techniques, for 10mm long buses in TSMC 0.25 µm technology. This proposed architecture lowers the worst coupling capacitance to total capacitance ratio to 14.4%. It only requires 7.9% more bus routing area than single-ended designs for a 16-bit bus, and saves all of the repeater placement blockages. To further verify that the driver pre-emphasis techniques can also be applied to voltage-mode single-ended buses, a test chip in TSMC 0.18 µm technology was fabricated and measured.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132055244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper studies the translation look-aside buffer (TLB) of an embedded processor. Based on an analysis of the design-related factors (power, area, critical path, and performance) of the research model, Godson-I, a low-power TLB design is proposed that sacrifices neither performance nor timing. Using this method, TLB-RAM power is reduced by 92.7% and TLB-RAM area by 50%. Compared with other methods, the hit rate of this design is much higher and access conflicts to the RAM between the ITLB and DTLB are much reduced. Although our work targets Godson-I, the proposed methodology should be applicable to other designs.
{"title":"An energy efficient TLB design methodology","authors":"Dongrui Fan, Zhimin Tang, H. Huang, G. Gao","doi":"10.1145/1077603.1077688","DOIUrl":"https://doi.org/10.1145/1077603.1077688","url":null,"abstract":"This paper researches translation look-aside buffer (TLB) of embedded processor. Based on an analysis of design-related factors: power, area, critical path and performance of the research model - Godson-I, a low-power TLB design is proposed without sacrifice of performance and timing. Using this method, the following results are achieved: power of TLB-RAM reduces 92.7% and area of TLB-RAM reduces 50%. Compared with other methods, the hit rate of this design is much higher and the accessing conflict to RAM between ITLB and DTLB is much reduced. Although our work targets to Godson-I, the proposed methodology should be applicable to other designs.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131114681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network-on-chip (NoC) has been proposed as a solution for the global communication challenges of system-on-chip (SoC) design in nanoscale technologies. NoC design with mesh-based topologies requires mapping of cores to router ports, and routing of traffic traces such that the bandwidth and latency constraints are satisfied. The authors present a novel automated design technique that solves the mesh-based NoC design problem with the objective of minimizing communication energy. In contrast to existing research that takes only bandwidth constraints as inputs, the technique solves the NoC design problem in the presence of both bandwidth and latency constraints. The technique is compared with a recent work called NMAP and an optimal MILP-based formulation. The complexity of the technique is proven to be lower than that of NMAP. For the latency-constrained case, while NMAP fails on most test cases, the technique is able to generate high-quality results. In comparison to the MILP formulation, the results produced by the technique are within 14% of the optimal.
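To make the mapping objective concrete, the following sketch evaluates candidate core-to-router mappings on a mesh. It is not the authors' algorithm: it assumes the common simplification that communication energy is proportional to bandwidth times hop count, that routes are minimal (so hop count equals Manhattan distance), and that latency constraints are expressed as hop limits. The core names, traffic values, and function names are hypothetical.

```python
def manhattan_hops(a, b):
    """Hop count between two mesh coordinates under minimal routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(mapping, traffic):
    """Sum of bandwidth * hops over all flows (a proxy for communication energy)."""
    return sum(bw * manhattan_hops(mapping[src], mapping[dst])
               for (src, dst), bw in traffic.items())


def meets_latency(mapping, hop_limits):
    """True if every latency-constrained flow stays within its hop limit."""
    return all(manhattan_hops(mapping[src], mapping[dst]) <= limit
               for (src, dst), limit in hop_limits.items())


# Hypothetical traffic (bandwidth per flow) and one latency constraint (in hops).
traffic = {("cpu", "mem"): 400, ("cpu", "dsp"): 100, ("dsp", "io"): 50}
hop_limits = {("cpu", "mem"): 1}

# Two candidate placements on a 2x2 mesh, given as (row, column) per core.
mapping_a = {"cpu": (0, 0), "mem": (1, 1), "dsp": (0, 1), "io": (1, 0)}
mapping_b = {"cpu": (0, 0), "mem": (0, 1), "dsp": (1, 0), "io": (1, 1)}

cost_a = communication_cost(mapping_a, traffic)
cost_b = communication_cost(mapping_b, traffic)
print(cost_a, meets_latency(mapping_a, hop_limits))
print(cost_b, meets_latency(mapping_b, hop_limits))
```

Placing the heavily communicating cores next to each other lowers the cost and also satisfies the hop limit. A mapper that considered bandwidth alone would have no way to reject a placement that violates the latency constraint.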
{"title":"A technique for low energy mapping and routing in network-on-chip architectures","authors":"K. Srinivasan, Karam S. Chatha","doi":"10.1145/1077603.1077695","DOIUrl":"https://doi.org/10.1145/1077603.1077695","url":null,"abstract":"Network-on-chip (NoC) has been proposed as a solution for the global communication challenges of system-on-chip (SoC) design in the nanoscale technologies. NoC design with mesh based topologies requires mapping of cores to router ports, and routing of traffic traces such that the bandwidth and latency constraints are satisfied. The authors presented a novel automated design technique that solves the mesh based NoC design problem with an objective of minimizing the communication energy. In contrast to existing research that only take bandwidth constraints as inputs, the technique solves the NoC design problem in the presence of bandwidth as well as latency constraints. The technique was compared with a recent work called NMAP and an optimal MILP based formulation. It is proven that the complexity of the technique is lower than that of NMAP. For the latency constrained case, while NMAP fails on most test cases, the technique is able to generate high quality results. In comparison to the MILP formulation, the results produced by our technique are within 14 % of the optimal.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131168559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emphasis in microprocessor design has shifted from high performance to a combination of high performance and low power. Until recently, this trend was mostly true for uniprocessors. In this work, the authors focus on a new energy consumption issue unique to multiprocessor systems: synchronization of accesses to shared memory. The authors investigate and compare different means of providing atomic access to shared memory, including locks and lock-free synchronization (i.e., transactional memory), with respect to energy as well as performance. It is shown that transactional memory has an advantage over locks in terms of energy consumption, but that this advantage depends largely on the system architecture, the contention level, and the conflict resolution policy.
{"title":"Energy reduction in multiprocessor systems using transactional memory","authors":"T. Moreshet, R. I. Bahar, M. Herlihy","doi":"10.1145/1077603.1077683","DOIUrl":"https://doi.org/10.1145/1077603.1077683","url":null,"abstract":"The emphasis in microprocessor design has shifted from high performance, to a combination of high performance and low power. Until recently, this trend was mostly true for uniprocessors. In this work the authors focused on new energy consumption issues unique to multiprocessor systems: synchronization of accesses to shared memory. The authors investigated and compared different means of providing atomic access to shared memory, including locks and lock-free synchronization (i.e., transactional memory), with respect to energy as well as performance. It is shown that transactional memory has an advantage in terms of energy consumption over locks, but that this advantage largely depends on the system architecture, the contention level, and the policy of conflict resolution.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123884501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A variety of portable and wearable device form factors such as the VisionPad, WatchPad, WearableData, MetaPad, Personal Mobile Hub, and SoulPad have been prototyped at IBM Research over the last several years. Each of these form factors has some unique advantages and addresses different user needs. This talk will present experiences in building these prototypes and describe how this landscape of devices is expected to change. The continuous availability of wearable and mobile devices enables the deployment of many novel applications and scenarios but places heavy demands on energy, infrastructure support, and usability. Recent work has investigated how some of the inherent limitations of mobile devices can be alleviated by symbiosis with more powerful stationary devices. Effective device symbiosis requires middleware that addresses device heterogeneity, improves usability, provides privacy, and ensures security. The availability of a variety of new form factors, coupled with device symbiosis, will catalyze changes in the way we conduct business, entertain ourselves, and take care of ourselves.
{"title":"Wearable computing - a catalyst for business and entertainment","authors":"C. Narayanaswami","doi":"10.1145/1077603.1077605","DOIUrl":"https://doi.org/10.1145/1077603.1077605","url":null,"abstract":"A variety of portable and wearable device form factors such as the VisionPad, WatchPad, WearableData, MetaPad, Personal Mobile Hub, and SoulPad have been prototyped at IBM Research over the last several years. Each of different form factors has some unique advantages and addresses different user needs. This talk will present experiences in building these prototypes and describe how this landscape of devices is expected to change. The continuous availability of wearable and mobile devices enables the deployment of many novel applications and scenarios but places heavy demands on energy, infrastructure support, and usability. Recently how some of the inherent limitations of mobile devices can be alleviated by symbiosis with more powerful stationary devices has been investigated. Effective device symbiosis requires middleware that addresses device heterogeneity, improves usability, provides privacy and ensures security. The availability of a variety of new form factors coupled with device symbiosis will catalyze the way business was conducted, entertain and take care of ourselves.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123946773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern processors can issue and execute multiple instructions per cycle, often performing multiple memory operations simultaneously. To reduce stalls due to resource conflicts, most processors employ multi-ported L1 caches and TLBs to enable concurrent memory accesses. In this paper, the authors observe that data TLB lookups within a cycle and across consecutive cycles are often synonymous, that is, they go to the same page. To exploit this finding, two new mechanisms are proposed, intra-cycle compaction and inter-cycle compaction of address translation requests, in order to save energy in the data TLB. The results show average energy savings of 27% using intra-cycle compaction and 42% using inter-cycle compaction in a conventional d-TLB, and 56% using inter-cycle compaction in semantic-aware d-TLBs. When the two compaction techniques are combined and applied to both the i-TLB and semantic-aware d-TLBs, average energy savings of 76% (up to 87%) are obtained.
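The effect of the two compaction mechanisms can be sketched by counting TLB lookups over a short address trace. This is an illustrative model only, not the hardware proposed in the paper: it assumes 4 KiB pages, treats each cycle as a list of virtual addresses, and assumes that translations used in the immediately preceding cycle remain available for reuse. The function name and the trace are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def count_tlb_lookups(cycles, intra=False, inter=False):
    """Count TLB lookups for a trace given as one list of addresses per cycle.

    intra: merge same-page requests issued within a single cycle.
    inter: skip requests whose page was already translated in the previous cycle.
    """
    lookups = 0
    prev_pages = set()
    for addrs in cycles:
        pages = [addr >> PAGE_SHIFT for addr in addrs]
        # Intra-cycle compaction keeps one request per distinct page.
        candidates = list(dict.fromkeys(pages)) if intra else pages
        if inter:
            # Inter-cycle compaction reuses the previous cycle's translations.
            candidates = [p for p in candidates if p not in prev_pages]
        lookups += len(candidates)
        prev_pages = set(pages)
    return lookups


# Hypothetical trace: three cycles of accesses touching two pages.
trace = [[0x1000, 0x1008], [0x1010, 0x2000], [0x2004]]

baseline = count_tlb_lookups(trace)
intra_only = count_tlb_lookups(trace, intra=True)
inter_only = count_tlb_lookups(trace, inter=True)
combined = count_tlb_lookups(trace, intra=True, inter=True)
print(baseline, intra_only, inter_only, combined)
```

Fewer lookups is only a rough proxy for energy, since the comparison logic added by compaction also consumes power. The sketch does show why the two mechanisms are complementary: one removes redundancy within a cycle, the other across cycles.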
{"title":"Synonymous address compaction for energy reduction in data TLB","authors":"C. Ballapuram, H. Lee, Milos Prvulović","doi":"10.1145/1077603.1077689","DOIUrl":"https://doi.org/10.1145/1077603.1077689","url":null,"abstract":"Modern processors can issue and execute multiple instructions per cycle, often performing multiple memory operations simultaneously. To reduce stalls due to resource conflicts, most processors employ multi-ported L1 caches and TLBs to enable concurrent memory accesses. In this paper, the authors observed that data TLB lookups within a cycle and across consecutive cycles are often synonymous - they go to the same page. To exploit this finding, two new mechanisms were proposed - intra-cycle compaction and inter-cycle compaction of address translation requests in order to save energy in the data TLB. The results showed that average energy savings of 27% using intra-cycle, 42% using inter-cycle in a conventional d-TLB, and 56% using inter-cycle compaction in semantic-aware d-TLBs can be achieved. When these 2 compaction techniques are combined together and applied to both the i-TLB and semantic-aware d-TLBs, an average energy savings of 76% (up to 87%) is obtained.","PeriodicalId":256018,"journal":{"name":"ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.","volume":"467 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124378272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}