A well-known challenge in processor design is obtaining the best possible results for a target application domain, which is typically described by a set of benchmarks. Doing so becomes a complex tradeoff between the generality of the processor and its physical characteristics. A custom instruction for a particular task can significantly improve one application, but generally at the expense of some overhead for all other applications. In recent years, Application-Specific Instruction-Set Processors (ASIPs) have gained popularity in production chips as well as in the research community. In this paper, we present a unique architecture and methodology for designing ASIPs in the embedded controller domain by customizing an existing processor instruction set and architecture rather than creating an entirely new ASIP tuned to a benchmark.
K. Kucukcakar. "An ASIP design methodology for embedded systems." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1109/HSC.1999.777384 (https://doi.org/10.1109/HSC.1999.777384)
We propose a system synthesis method which bridges the gap between a highly abstract functional model and an efficient hardware or software implementation. The functional model is based on a formal semantics and the synchrony hypothesis. The use of skeletons in conjunction with a proper computational model structures the system description into three layers: the system layer, the skeleton layer, and the elementary layer. The synthesis process takes advantage of this structure and uses a different technique for each layer: (a) connection of components and processes at the system layer; (b) template-based generation of compound entities, possibly containing state information, memory, and complex control, at the skeleton layer, which also determines the communication and timing behaviour; (c) direct translation into combinatorial functions at the elementary layer. Thus, without compromising the formal properties of the abstract system model, we provide an efficient synthesis method.
I. Sander, A. Jantsch. "System synthesis utilizing a layered functional model." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1145/301177.301510 (https://doi.org/10.1145/301177.301510)
The power consumption due to HW/SW communication on system-level buses is one of the major contributions to the overall power budget. A model to estimate the switching activity of the on-chip and off-chip buses at the system level has been defined to evaluate the power dissipation and to compare the effectiveness of power optimization techniques. The paper aims at providing a framework for architectural exploration of a system design, focusing on estimating the power consumption of memory communication. Experimental results, obtained from bus streams generated by a real microprocessor and by a stream generator, show how varying the cache parameters and introducing bus encoding at different levels of the memory hierarchy can affect the system power dissipation. The proposed model can therefore be used to configure the memory hierarchy and the system bus architecture appropriately from the power standpoint.
W. Fornaciari, D. Sciuto, C. Silvano. "Power estimation for architectural exploration of HW/SW communication on system-level buses." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1109/HSC.1999.777411 (https://doi.org/10.1109/HSC.1999.777411)
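The switching activity that the bus power abstract above refers to can be illustrated with a minimal sketch. It counts bit toggles between consecutive bus words, first unencoded and then with bus-invert coding. Bus-invert is used here only as one familiar example of a bus encoding; the abstract does not say which encodings the authors evaluated. The function names and the all-zeros initial bus state are assumptions made for illustration.

```python
def transitions(words, width):
    """Count bit toggles between consecutive words on a bus of `width` lines."""
    mask = (1 << width) - 1
    total = 0
    prev = 0  # assumption: all bus lines start at 0
    for w in words:
        w &= mask
        total += bin(prev ^ w).count("1")
        prev = w
    return total


def bus_invert_transitions(words, width):
    """Count toggles under bus-invert coding, including the extra invert line."""
    mask = (1 << width) - 1
    total = 0
    prev_lines = 0  # value physically driven on the data lines
    prev_inv = 0    # state of the extra invert line
    for w in words:
        w &= mask
        distance = bin(prev_lines ^ w).count("1")
        if distance > width / 2:
            # Sending the complement toggles fewer data lines.
            lines, inv = (~w) & mask, 1
        else:
            lines, inv = w, 0
        total += bin(prev_lines ^ lines).count("1") + (prev_inv ^ inv)
        prev_lines, prev_inv = lines, inv
    return total
```

For the stream `[0x00, 0xFF, 0x00, 0xFF]` on an 8-bit bus, `transitions` counts 24 toggles and `bus_invert_transitions` counts 3, because only the invert line changes. Dynamic power on the bus grows with the toggle count, so a count like this is the kind of input a system-level power estimate needs.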
An important problem in the area of processor design for embedded systems is determining the proper instruction set architecture. Trade-offs have to be made between programmability and reusability of dedicated hardware for special functionality on the one hand, and a high-performance dedicated instruction set on the other. This paper addresses the question of how to find specialized ISA extensions for a set of applications. We describe the application of a new pattern matching technique to the problem of identifying recurring patterns of operations. By implementing frequently occurring operation patterns in hardware, and using this hardware as special function units, a fine-grained hardware/software partitioning can be found. The fine granularity, and the fact that patterns are taken from a number of different target applications rather than a single one, increase the opportunities for reuse of the special-purpose hardware. We illustrate our technique with experiments on a number of benchmarks from the DSP domain.
M. Arnold, H. Corporaal. "Automatic detection of recurring operation patterns." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1145/301177.301192 (https://doi.org/10.1145/301177.301192)
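The idea of collecting operation patterns across several applications, as described in the abstract above, can be sketched in a much simplified form. The code below only counts producer/consumer opcode pairs in small dataflow graphs and keeps the pairs seen in more than one application. It is not the pattern matching technique of the paper, which the abstract does not detail. The graph representation and the function names are assumptions.

```python
from collections import Counter


def count_pairs(dfg):
    """Count (producer opcode, consumer opcode) pairs in one dataflow graph.

    `dfg` maps a node id to (opcode, list of producer node ids).
    Producers that are not nodes of the graph (e.g. live-in values) are skipped.
    """
    counts = Counter()
    for opcode, producers in dfg.values():
        for p in producers:
            if p in dfg:
                counts[(dfg[p][0], opcode)] += 1
    return counts


def recurring_patterns(dfgs, min_apps=2):
    """Return pairs that occur in at least `min_apps` graphs, with total counts."""
    total = Counter()
    apps = Counter()
    for dfg in dfgs:
        pairs = count_pairs(dfg)
        total.update(pairs)        # add up occurrences across applications
        apps.update(pairs.keys())  # count each pair once per application
    return {pair: n for pair, n in total.items() if apps[pair] >= min_apps}


# Two toy graphs standing in for DSP kernels (hypothetical examples).
fir = {
    "a": ("load", []),
    "b": ("load", []),
    "m": ("mul", ["a", "b"]),
    "s": ("add", ["m", "acc"]),  # "acc" is a live-in value, not a node
}
square_sum = {
    "x": ("load", []),
    "m": ("mul", ["x", "x"]),
    "s": ("add", ["m"]),
    "t": ("shift", ["s"]),
}
```

Calling `recurring_patterns([fir, square_sum])` keeps the `("load", "mul")` and `("mul", "add")` pairs and drops `("add", "shift")`, which appears in only one graph. Pairs that recur across applications are the natural candidates for a shared special function unit such as a multiply-accumulate.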
Trace-driven cache simulation is a time-consuming yet valuable procedure for evaluating the performance of embedded memory systems. In this paper we present a novel technique, called iterative cache simulation, to produce a variety of performance metrics for several different cache configurations. Compared with previous work in this field, our approach has the following features. First, it supports a wide range of performance metrics, including miss ratio, write-back counts, and bus traffic. Second, unlike estimation-based methods, the results produced by our simulator are accurate. Third, our approach is flexible: it can simulate both uniprocessor and multiprocessor caches, with options for higher-level caches, sub-block replacement, and prefetching. Last, it is fast: our simulation results show that its runtime is similar to that of the fastest one-pass cache simulator.
Z. Wu, W. Wolf. "Iterative cache simulation of embedded CPUs with trace stripping." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1145/301177.301496 (https://doi.org/10.1145/301177.301496)
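As background for the cache simulation abstract above, the sketch below shows a basic trace-driven LRU cache simulator and a classic form of trace stripping, in which the trace is filtered through a direct-mapped cache and only the missing references are kept. This is a generic illustration, not the iterative algorithm of the paper, which the abstract does not spell out. The function names are assumptions. The sketch also relies on the simulated caches using LRU replacement, the same line size as the filter, and a power-of-two number of sets no smaller than the filter's.

```python
from collections import OrderedDict


def simulate(trace, num_sets, assoc, line_size):
    """Return the number of misses of an LRU set-associative cache on `trace`."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses


def strip(trace, num_sets, line_size):
    """Keep only the references that miss in a direct-mapped filter cache."""
    resident = [None] * num_sets
    kept = []
    for addr in trace:
        block = addr // line_size
        index = block % num_sets
        if resident[index] != block:
            resident[index] = block
            kept.append(addr)
    return kept
```

A reference that hits in the direct-mapped filter touches the most recently used block of its set in any of the larger caches as well, so dropping it changes neither their miss counts nor their LRU state. The stripped trace can therefore be replayed on several configurations at lower cost. Miss ratios still have to be computed against the length of the original trace.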
A common design methodology for embedded DSP systems is the integration of one or more digital signal processors (DSPs), program memory, and ASIC circuitry onto a single IC. Because program memory size is consequently limited, the criterion for optimality is that the embedded software be very dense. We describe the development of an optimizing compiler, based on a retargetable compiler infrastructure, for the Fujitsu Elixir, a fixed-point DSP that is primarily used in cellular telephones. For small DSP benchmark programs (25-90 lines of C code), the average ratio of the size of compiler-generated code to the size of hand-written assembly code is 1.18. For a much larger program (more than 800 lines of C code), the ratio of the size of compiled code to the size of hand-written code is similar (1.14).
S. Rajan, M. Fujita, A. Sudarsanam, S. Malik. "Development of an optimizing compiler for a Fujitsu fixed-point digital signal processor." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1145/301177.301184 (https://doi.org/10.1145/301177.301184)
This paper presents a novel compiler for Esterel, a concurrent synchronous imperative language. It generates fast, small object code by compiling away concurrency, producing a single C function requiring no operating system support for threads. It translates an Esterel program into an acyclic concurrent control-flow graph from which code is synthesized that runs instructions in an order respecting inter-thread communication. Exceptions and preemption constructs become conditional branches. Variables save control state; conditional branches restore it. Although designed for Esterel, this approach could be applied to compiling other synchronous concurrent languages.
S. Edwards. "Compiling Esterel into sequential code." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1109/HSC.1999.777410 (https://doi.org/10.1109/HSC.1999.777410)
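To make the idea of compiling away concurrency in the Esterel abstract above more concrete, here is a hand-written sketch of what sequential code for two communicating threads might look like. The compiler described in the paper emits C; Python is used here only for illustration. The two-thread program, the signal names `I`, `S` and `O`, and the helper `make_program` are all hypothetical. Thread A pauses in its first instant and afterwards emits a local signal `S` whenever input `I` is present. Thread B counts occurrences of `S` and emits `O` on every second one.

```python
def make_program():
    """Build a single sequential `tick` function for two concurrent threads."""
    state = {"a": 0, "count": 0}  # "a" saves the control state of thread A

    def tick(i_present):
        """Run one synchronous instant; return True if output O is emitted."""
        s = False  # local signal S is absent unless thread A emits it

        # Thread A: a conditional branch on the saved state restores control.
        if state["a"] == 0:
            state["a"] = 1  # first instant: only pause
        elif i_present:
            s = True        # later instants: emit S when I is present

        # Thread B runs after thread A because it reads S in the same instant.
        o = False
        if s:
            state["count"] += 1
            if state["count"] == 2:
                o = True
                state["count"] = 0
        return o

    return tick
```

Each call to `tick` runs one instant of both threads in a fixed order and needs no thread library or scheduler. The order places the emitter of `S` before its reader, which is the kind of ordering constraint the abstract refers to as respecting inter-thread communication.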
We are integrating language-based software and hardware behaviors in C/pthreads and Verilog for unrestricted peer execution of the domains, including bounded (finite) and unbounded notions of computer system modeling. Since we do not restrict the modeling currently available in each domain, our co-specification is inclusive of both reactive and data-intensive systems. By viewing all mixed system state as shared memory accessible by threads in each domain, we differentiate domains by system resource inferences. We introduce a unified multithreading model for execution and motivate the need to expand the specification capabilities currently available in each domain for mixed systems, using widely accepted languages as a basis. We discuss specific aspects of our cosimulator, provide examples and results, and indicate future directions of our work.
D. E. Thomas, J. M. Paul, S. Peffers, S. J. Weber. "Peer-based multithreaded executable co-specification." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1109/HSC.1999.777402 (https://doi.org/10.1109/HSC.1999.777402)
D. Desmet, M. Esvelt, P. Avasare, D. Verkest, H. Man
In this paper we propose a C++-based cosimulation and codesign environment that allows the timing behavior of the components of a complex hardware-software system to be specified independently of the functional refinement. The hardware models are at a high functional abstraction level, which results in a high simulation speed, yet the timing behavior can be specified with sufficient granularity to give relevant feedback concerning the timing of the software tasks. We demonstrate this method on the design of the digital part of an ADSL modem.
D. Desmet, M. Esvelt, P. Avasare, D. Verkest, H. Man. "Timed executable system specification of an ADSL modem using a C++ based design environment: A case study." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1145/301177.301198 (https://doi.org/10.1145/301177.301198)
M. Lajolo, M. Lazarescu, A. Sangiovanni-Vincentelli
High-level cost and performance estimation, coupled with a fast hardware/software co-simulation framework, is a key enabler of a fast embedded system design cycle. Unfortunately, deriving such estimates without a detailed implementation available is very difficult. In this paper we focus on embedded software performance estimation. Current approaches use either behavioral simulation with (often manual) timing annotations, or a clock cycle-accurate model of instruction execution (e.g., an instruction set simulator). The former provides greater flexibility (no need to perform a detailed design) and high simulation speed, but cannot easily consider effects such as compiler optimization and processor architecture. The latter provides high accuracy, but requires a more detailed implementation model and is much slower in general. We hence developed a hybrid approach that incorporates some aspects of both. It provides a flexible and fast simulation platform that also takes compilation issues and processor features into account. The key idea is to use the GNU C compiler (GCC) to generate "assembler-level" C code. This code can be annotated with timing information and used as a very precise, yet fast, software simulation model. We report some experimental results that show the effectiveness of our approach, and we propose some future improvements.
M. Lajolo, M. Lazarescu, A. Sangiovanni-Vincentelli. "A compilation-based software estimation scheme for hardware/software co-simulation." Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450), 1999-03-01. DOI: 10.1145/301177.301493 (https://doi.org/10.1145/301177.301493)
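The timing annotation described in the software estimation abstract above can be pictured with a small sketch: each basic block of the simulated code adds a fixed cycle cost to a counter as it executes, so the functional result and a cycle estimate are produced together. The paper annotates GCC-generated "assembler-level" C code; Python is used here only for illustration. The block names and cycle costs are hypothetical and stand in for values that would come from a processor timing model.

```python
# Hypothetical per-basic-block cycle costs (assumed values, not measured data).
COST = {"entry": 3, "loop_body": 5, "exit": 2}


def dot_product(a, b):
    """Return (result, estimated_cycles) for a dot product of two sequences."""
    cycles = COST["entry"]           # annotation for the entry block
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
        cycles += COST["loop_body"]  # annotation for one loop iteration
    cycles += COST["exit"]           # annotation for the exit block
    return acc, cycles
```

With these assumed costs, `dot_product([1, 2, 3], [4, 5, 6])` returns `(32, 20)`: three loop iterations at 5 cycles each, plus 3 cycles for entry and 2 for exit. Because the annotations follow the control flow actually taken, the estimate tracks data-dependent execution paths without the cost of running an instruction set simulator.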