T. K. Tan, A. Raghunathan, G. Lakshminarayana, and N. Jha, "High-level software energy macro-modeling," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379033
This paper presents an efficient and accurate high-level software energy estimation methodology based on characterization-based macro-modeling. In this approach, a function or sub-routine is characterized using an accurate lower-level energy model of the target processor to construct a macro-model that relates the energy consumed in the function to parameters that can easily be observed or computed from a high-level programming-language description. The constructed macro-models eliminate the need for the significantly slower instruction-level interpretation or hardware simulation required by conventional approaches to software energy estimation. We present two approaches to macro-modeling for embedded software that offer distinct efficiency-accuracy characteristics: (i) complexity-based macro-modeling, where the variables that determine the algorithmic complexity of the function under consideration are used as macro-modeling parameters, and (ii) profiling-based macro-modeling, where internal profiling statistics for the function are used as parameters in the energy macro-models. We have experimentally validated our software energy macro-modeling techniques on a wide range of embedded software routines and two different target processor architectures. Our experiments demonstrate that high-level macro-models constructed using the proposed techniques estimate energy consumption with 95% accuracy on average, while offering speedups of one to five orders of magnitude over current instruction-level and architectural energy estimation techniques.
S. Tan and C. Shi, "Fast power/ground network optimization based on equivalent circuit modeling," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379021
This paper presents an efficient algorithm for optimizing the area of power or ground networks in integrated circuits subject to reliability constraints. Instead of directly solving the original power/ground networks extracted from circuit layouts, as previous methods do, the new method first builds equivalent models for the many series resistors in the original networks and then applies a sequence-of-linear-programming method to the simplified networks. The solutions for the original networks are then back-solved from the optimized, simplified networks. The algorithm exploits the regularities present in power/ground networks. Experimental results show that the simplified networks are typically much smaller than the original circuits, which makes the algorithm extremely fast: power/ground networks with more than one million branches can be sized in a few minutes on modern SUN workstations.
J. Whittemore, Joonyoung Kim, and K. Sakallah, "SATIRE: A new incremental satisfiability engine," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379019
We introduce SATIRE, a new satisfiability solver that is particularly suited to verification and optimization problems in electronic design automation. SATIRE builds on the most recent advances in satisfiability research, and includes two new features to achieve even higher performance: a facility for incrementally solving sets of related problems, and the ability to handle non-CNF constraints. We provide experimental evidence showing the effectiveness of these additions to classical satisfiability solvers.
D. Lyonnard, S. Yoo, A. Baghdadi, and A. Jerraya, "Automatic generation of application-specific architectures for heterogeneous multiprocessor system-on-chip," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379015
We present a design flow for the generation of application-specific multiprocessor architectures. In the flow, architectural parameters are first extracted from a high-level system specification. Parameters are used to instantiate architectural components, such as processors, memory modules and communication networks. The flow includes the automatic generation of a communication coprocessor that adapts the processor to the communication network in an application-specific way. Experiments with two system examples show the effectiveness of the presented design flow.
L. Benini, L. Macchiarulo, A. Macii, E. Macii, and M. Poncino, "From architecture to layout: partitioned memory synthesis for embedded systems-on-chip," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379066
We propose an integrated front-end/back-end flow for the automatic generation of a multi-bank memory architecture for embedded systems. The flow is based on an algorithm for the automatic partitioning of on-chip SRAM. Starting from the dynamic execution profile of an embedded application running on a given processor core, we synthesize a multi-banked SRAM architecture optimally fitted to the execution profile. The partitioning algorithm is integrated with the physical design phase into a complete flow that allows the back-annotation of layout information to drive the partitioning process. Results, collected on a set of embedded applications for the ARM processor, have shown average energy savings around 34%.
Gang Quan and X. Hu, "Energy efficient fixed-priority scheduling for real-time systems on variable voltage processors," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379074
Energy consumption has become an increasingly important consideration in the design of many real-time embedded systems. Variable-voltage processors, if used properly, can dramatically reduce system energy consumption. In this paper, we present a technique to determine voltage settings for a variable-voltage processor that uses a fixed-priority assignment to schedule jobs. Our approach also produces the minimum constant voltage needed to feasibly schedule the entire job set. Our algorithms lead to significant energy savings compared with previously presented approaches.
S. E. Rich, Matthew J. Parker, and Jim Schwartz, "Reducing the frequency gap between ASIC and custom designs: a custom perspective," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.378548
This paper proposes that the ability to control the difference between the simulated and actual frequencies of a design is a key strategy for achieving high frequency in both ASIC and custom designs. We examine this principle and the methodologies that can be deployed to manage this gap.
Peter Petrov and A. Orailoglu, "Speeding up control-dominated applications through microarchitectural customizations in embedded processors," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379014
We present a methodology for microarchitectural customization of embedded processors that exploits application information, thus attaining the twin benefits of processor standardization and application-specific customization. Such techniques enable more of an application's functionality to be placed on the processor with no sacrifice in system requirements, reducing the custom hardware and the concomitant area requirements in SOCs. We illustrate these ideas through the branch resolution problem, which is known to impose severe performance degradation on control-dominated embedded applications. We describe a low-cost, late-customizable hardware mechanism that uses application information to fold out a set of frequently executed branches. Experimental results show that, for a representative set of control-dominated applications, a reduction of 7%-22% in processor cycles can be achieved, extending the scope of low-cost embedded processors in complex co-designs for control-intensive systems.
J. Ramanujam, Jinpyo Hong, M. Kandemir, and A. Narayan, "Reducing memory requirements of nested loops for embedded systems," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.378523
Most embedded systems have a limited amount of memory. In contrast, the memory requirements of the code (in particular, loops) running on embedded systems are significant. This paper addresses the problem of estimating the amount of memory needed for data transfers in embedded systems. We analyze the problem of estimating the region associated with a statement, that is, the set of elements referenced by the statement during execution of the entire loop nest. A quantitative analysis of the number of elements referenced is presented; exact expressions are derived for uniformly generated references, and close upper and lower bounds for non-uniformly generated references. In addition to presenting an algorithm that computes the total memory required, we discuss the effect of transformations on the lifetimes of array variables, i.e., the time between the first and last accesses to a given array location. We analyze in detail the effect of unimodular transformations on data locality, including the calculation of the maximum window size; the term maximum window size is introduced, and quantitative expressions are derived to compute it. The smaller the maximum window size, the greater the amount of data locality in the loop.
V. Lapinskii, M. Jacome, and G. Veciana, "High-quality operation binding for clustered VLIW datapaths," Proceedings of the 38th Design Automation Conference (DAC), 2001. DOI: 10.1145/378239.379051
Clustering is an effective method for increasing the available parallelism in VLIW datapaths without incurring the severe penalties associated with large numbers of register-file ports. Efficient utilization of a clustered datapath requires careful binding of operations to clusters. The paper proposes a binding algorithm that effectively explores trade-offs between in-cluster operation serialization and the delays associated with data transfers between clusters. Extensive experimental evidence is provided showing that the algorithm generates high-quality solutions for basic blocks, with up to 29% improvement over a state-of-the-art binding algorithm.