We introduce an energy consumption analysis of complex digital systems through a case study of ARM7TDMI RISC processor by using a new energy measurement technique. We developed a cycle-accurate energy consumption measurement system based on charge transfer which is robust to spiky noise and is capable of collecting a range of power consumption profiles in real time. The relative energy variation of the RISC core is measured by changing the op-code, the instruction fetch address, the register number, the register value, the data fetch address, and the immediate operand value in each pipeline stage, respectively. We demonstrated energy characterization of a pipelined RISC processor for high-level power reduction.
{"title":"Cycle-accurate energy consumption measurement and analysis: Case study of ARM7TDMI","authors":"N. Chang, Kwanho Kim, H. Lee","doi":"10.1145/344166.344576","DOIUrl":"https://doi.org/10.1145/344166.344576","url":null,"abstract":"We introduce an energy consumption analysis of complex digital systems through a case study of ARM7TDMI RISC processor by using a new energy measurement technique. We developed a cycle-accurate energy consumption measurement system based on charge transfer which is robust to spiky noise and is capable of collecting a range of power consumption profiles in real time. The relative energy variation of the RISC core is measured by changing the op-code, the instruction fetch address, the register number, the register value, the data fetch address, and the immediate operand value in each pipeline stage, respectively. We demonstrated energy characterization of a pipelined RISC processor for high-level power reduction.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124684894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy is one of the limited resources for modern systems, especially battery-operated devices and personal digital assistants. The backlog in new technologies for a more powerful battery is changing the traditional system design philosophies. For example, due to the limitation on battery life, it is more realistic to design for the optimal benefit from limited resource rather than design to meet all the applications' requirement. We consider the following problem: a system achieves a certain amount of utility from a set of applications by providing certain levels of quality of service (QoS). We want to allocate the limited system resources to get the maximal system utility. We formulate this utility maximization problem, which is NP-hard in general, and propose heuristic algorithms that are capable of finding solutions provably arbitrarily close to the optimal. We have also derived explicit formulae to guide the allocation of resources to actually achieve such solutions. Simulation shows that our approach can use 99.9% of the given resource to achieve 25.6% and 32.17% more system utilities over two other heuristics, while providing QoS guarantees to the application program.
{"title":"Achieving utility arbitrarily close to the optimal with limited energy","authors":"G. Qu, M. Potkonjak","doi":"10.1145/344166.344545","DOIUrl":"https://doi.org/10.1145/344166.344545","url":null,"abstract":"Energy is one of the limited resources for modern systems, especially battery-operated devices and personal digital assistants. The backlog in new technologies for a more powerful battery is changing the traditional system design philosophies. For example, due to the limitation on battery life, it is more realistic to design for the optimal benefit from limited resource rather than design to meet all the applications' requirement. We consider the following problem: a system achieves a certain amount of utility from a set of applications by providing certain levels of quality of service (QoS). We want to allocate the limited system resources to get the maximal system utility. We formulate this utility maximization problem, which is NP-hard in general, and propose heuristic algorithms that are capable of finding solutions provably arbitrarily close to the optimal. We have also derived explicit formulae to guide the allocation of resources to actually achieve such solutions. Simulation shows that our approach can use 99.9% of the given resource to achieve 25.6% and 32.17% more system utilities over two other heuristics, while providing QoS guarantees to the application program.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126121561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advances in technology have allowed portable electronic devices to become smaller and more complex, placing stringent power and performance requirements on the device's components. The M.CORE M3 architecture was developed specifically for these embedded applications. To address the growing need for longer battery life and higher performance, an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with programmable features was added to the M3 core. These features allow the architecture to be optimized based on the application's requirements. In this paper we focus on the features of the M340 cache sub-system and illustrate the effect on power and performance through benchmark analysis and actual silicon measurements.
{"title":"A low power unified cache architecture providing power and performance flexibility","authors":"Afzal Malik, B. Moyer, D. Čermák","doi":"10.1145/344166.344610","DOIUrl":"https://doi.org/10.1145/344166.344610","url":null,"abstract":"Advances in technology have allowed portable electronic devices to become smaller and more complex, placing stringent power and performance requirements on the device's components. The M.CORE M3 architecture was developed specifically for these embedded applications. To address the growing need for longer battery life and higher performance, an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with programmable features was added to the M3 core. These features allow the architecture to be optimized based on the application's requirements. In this paper we focus on the features of the M340 cache sub-system and illustrate the effect on power and performance through benchmark analysis and actual silicon measurements.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128063233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gate capacitance has complex voltage dependency on terminal voltages but the impact of this voltage dependency of gate capacitance on power and delay has not been fully investigated, especially, in low-voltage, low-power designs. Introducing an effective gate capacitance, C/sub G,eff/ it is shown that the power and delay of CMOS digital circuit can be estimated accurately. C/sub G,eff/ is a strong function of V/sub TH//V/sub DD/ and V/sub TH//V/sub DD/ tends to increase in low-voltage region. Hence, the effective capacitance relative to oxide capacitance, C/sub OX/, is decreasing in low-voltage, low-power designs. Therefore, considering C/sub G,eff/ in accurate power and delay estimation becomes more important in the future.
{"title":"Voltage dependent gate capacitance and its impact in estimating power and delay of CMOS digital circuits with low supply voltage","authors":"K. Nose, S. Chae, T. Sakurai","doi":"10.1145/344166.344601","DOIUrl":"https://doi.org/10.1145/344166.344601","url":null,"abstract":"Gate capacitance has complex voltage dependency on terminal voltages but the impact of this voltage dependency of gate capacitance on power and delay has not been fully investigated, especially, in low-voltage, low-power designs. Introducing an effective gate capacitance, C/sub G,eff/ it is shown that the power and delay of CMOS digital circuit can be estimated accurately. C/sub G,eff/ is a strong function of V/sub TH//V/sub DD/ and V/sub TH//V/sub DD/ tends to increase in low-voltage region. Hence, the effective capacitance relative to oxide capacitance, C/sub OX/, is decreasing in low-voltage, low-power designs. Therefore, considering C/sub G,eff/ in accurate power and delay estimation becomes more important in the future.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117119650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a new approach for power reduction by taking a global, software-centric view. It analyzes the sources of power consumption: tasks that require services from hardware components. When a component is not used by any task, it can enter a sleeping state to save power. Operating systems have detailed information about tasks; therefore, OS is the best place for power reduction. Our technique is effective in identifying hardware idleness and shutting down unused components. We implement this technique in Linux and show that it can save more than 50% power compared to traditional hardware-centric shutdown techniques.
{"title":"Operating-system directed power reduction","authors":"Yung-Hsiang Lu, L. Benini, G. Micheli","doi":"10.1145/344166.344189","DOIUrl":"https://doi.org/10.1145/344166.344189","url":null,"abstract":"This paper presents a new approach for power reduction by taking a global, software-centric view. It analyzes the sources of power consumption: tasks that require services from hardware components. When a component is not used by any task, it can enter a sleeping state to save power. Operating systems have detailed information about tasks; therefore, OS is the best place for power reduction. Our technique is effective in identifying hardware idleness and shutting down unused components. We implement this technique in Linux and show that it can save more than 50% power compared to traditional hardware-centric shutdown techniques.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"216 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114322858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We extend earlier work on high-level average power estimation to include the power due to interconnect loading. The resulting technique is a combination of an RTL-level gate count prediction method and average interconnect estimation based on Rent's rule. The method can be adapted to be used with different place and route engines and standard cell libraries. For a number of benchmark circuits, the method is verified by extracting wire lengths from a layout of each circuit and then comparing the predicted (at RTL) power against that measured using SPICE. An average error of 14.4% is obtained for the average interconnect length, and an average error of 25.8% is obtained for average power estimation including interconnect effects.
{"title":"High-level power estimation with interconnect effects","authors":"Kavel M. Büyüksahin, F. Najm","doi":"10.1145/344166.344581","DOIUrl":"https://doi.org/10.1145/344166.344581","url":null,"abstract":"We extend earlier work on high-level average power estimation to include the power due to interconnect loading. The resulting technique is a combination of an RTL-level gate count prediction method and average interconnect estimation based on Rent's rule. The method can be adapted to be used with different place and route engines and standard cell libraries. For a number of benchmark circuits, the method is verified by extracting wire lengths from a layout of each circuit and then comparing the predicted (at RTL) power against that measured using SPICE. An average error of 14.4% is obtained for the average interconnect length, and an average error of 25.8% is obtained for average power estimation including interconnect effects.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127420348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processors in portable electronic devices generally have a computational load which has time-varying performance requirements. Dynamic Voltage Scaling is a method to vary the processor's supply voltage so that it consumes the minimal amount of energy by operating at the minimum performance level required by the active software processes. A dynamically varying supply voltage has implications on the processor circuit design and design flow, but with some minimal constraints it is straightforward to design a processor with this capability.
{"title":"Design issues for Dynamic Voltage Scaling","authors":"T. Burd, R. Brodersen","doi":"10.1145/344166.344181","DOIUrl":"https://doi.org/10.1145/344166.344181","url":null,"abstract":"Processors in portable electronic devices generally have a computational load which has time-varying performance requirements. Dynamic Voltage Scaling is a method to vary the processor's supply voltage so that it consumes the minimal amount of energy by operating at the minimum performance level required by the active software processes. A dynamically varying supply voltage has implications on the processor circuit design and design flow, but with some minimal constraints it is straightforward to design a processor with this capability.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121144454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory-processor integration offers new opportunities for reducing the energy of a system. In the case of embedded systems, one solution consists of mapping the most frequently accessed addresses onto the on-chip SRAM to guarantee power and performance efficiency. This option is especially effective when memory access patterns can be profiled and studied at design time (as in typical real-time embedded systems). In this work, we propose an algorithm for the automatic partitioning of on-chip SRAM in multiple banks that can be independently accessed. Starting from the dynamic execution profile of an embedded application running on a given processor core, we synthesize a multi-banked SRAM architecture optimally fitted to the execution profile. The algorithm provides a globally optimum solution to the problem under realistic assumptions on the power cost metrics, and with constraints on the number of memory banks. Results, collected on a set of embedded applications for the ARM processor, have shown average energy savings around 42%.
{"title":"A recursive algorithm for low-power memory partitioning","authors":"L. Benini, A. Macii, M. Poncino","doi":"10.1145/344166.344518","DOIUrl":"https://doi.org/10.1145/344166.344518","url":null,"abstract":"Memory-processor integration offers new opportunities for reducing the energy of a system. In the case of embedded systems, one solution consists of mapping the most frequently accessed addresses onto the on-chip SRAM to guarantee power and performance efficiency. This option is especially effective when memory access patterns can be profiled and studied at design time (as in typical real-time embedded systems). In this work, we propose an algorithm for the automatic partitioning of on-chip SRAM in multiple banks that can be independently accessed. Starting from the dynamic execution profile of an embedded application running on a given processor core, we synthesize a multi-banked SRAM architecture optimally fitted to the execution profile. The algorithm provides a globally optimum solution to the problem under realistic assumptions on the power cost metrics, and with constraints on the number of memory banks. Results, collected on a set of embedded applications for the ARM processor, have shown average energy savings around 42%.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134614436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes an efficient asynchronous hardwired matrix-vector multiplier for the two-dimensional discrete cosine transform and inverse discrete cosine transform (DCT/IDCT). The design achieves low power and high performance by taking advantage of the typically large fraction of zero and small-valued data in DCT and IDCT applications. In particular, it skips multiplication by zero and dynamically activates/deactivates required bit-slices of fine-grain bit-partitioned adders using simplified, static-logic-based speculative completion sensing. The results extracted by both bit-level analysis and HSPICE simulations indicate significant improvements compared to traditional designs.
{"title":"An asynchronous matrix-vector multiplier for discrete cosine transform","authors":"Kyeounsoo Kim, P. Beerel, Youpyo Hong","doi":"10.1145/344166.344621","DOIUrl":"https://doi.org/10.1145/344166.344621","url":null,"abstract":"This paper proposes an efficient asynchronous hardwired matrix-vector multiplier for the two-dimensional discrete cosine transform and inverse discrete cosine transform (DCT/IDCT). The design achieves low power and high performance by taking advantage of the typically large fraction of zero and small-valued data in DCT and IDCT applications. In particular, it skips multiplication by zero and dynamically activates/deactivates required bit-slices of fine-grain bit-partitioned adders using simplified, static-logic-based speculative completion sensing. The results extracted by both bit-level analysis and HSPICE simulations indicate significant improvements compared to traditional designs.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128767109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new high-speed domino circuit, called HS-Domino is developed. HS-Domino resolves the trade-off between performance and noise margins in conventional CD-Domino logic while dissipating low dynamic power with minimal area overhead. A dual-threshold (MTCMOS) implementation of HS-Domino and DDCVS logic is also devised. This implementation achieves low leakage values during standby, while maintaining high performance and low dynamic power during the active mode.
{"title":"High-speed dynamic logic styles for scaled-down CMOS and MTCMOS technologies","authors":"M. Allam, M. Anis, M. Elmasry","doi":"10.1145/344166.344562","DOIUrl":"https://doi.org/10.1145/344166.344562","url":null,"abstract":"A new high-speed domino circuit, called HS-Domino is developed. HS-Domino resolves the trade-off between performance and noise margins in conventional CD-Domino logic while dissipating low dynamic power with minimal area overhead. A dual-threshold (MTCMOS) implementation of HS-Domino and DDCVS logic is also devised. This implementation achieves low leakage values during standby, while maintaining high performance and low dynamic power during the active mode.","PeriodicalId":188020,"journal":{"name":"ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117212540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}