Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106823
D. Barretta, W. Fornaciari, M. Sami, D. Pau
We propose a retargetable architecture based on a multicluster VLIW processor that can exploit either instruction-level parallelism (ILP) alone or ILP and data-level parallelism (DLP) jointly in a SIMD fashion. Simulation results show that performance may increase significantly when the application is compiled for the proposed architecture.
Title : SIMD extension to VLIW multicluster processors for embedded applications
Published in : Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106783
U. Kapasi, W. Dally, S. Rixner, John Douglas Owens, Brucek Khailany
The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on single-precision data and 32 GOPS on 16-bit fixed-point data. The scalability of Imagine's programming model and architecture enables it to achieve such high arithmetic rates. Imagine executes applications that have been mapped to the stream programming model. The stream model decomposes applications into a set of computation kernels that operate on data streams. This mapping exposes the inherent locality and parallelism in the application, and Imagine exploits both to provide a scalable architecture that supports 48 ALUs on a single chip. This paper presents the Imagine architecture and programming model in the first half and explores the scalability of the Imagine architecture in the second half.
Title : The Imagine Stream Processor
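As a rough illustration of the stream programming model described in the abstract above (an application decomposed into kernels that operate on data streams), the following Python sketch chains two kernels. The kernel names `scale` and `pairwise_sum` and the driver `run` are hypothetical and chosen only for illustration; they are not taken from the paper or from Imagine's tools.

```python
from typing import Iterable, Iterator, List


def scale(stream: Iterable[float], gain: float) -> Iterator[float]:
    """Kernel (hypothetical): multiply every stream element by a constant gain."""
    for x in stream:
        yield x * gain


def pairwise_sum(stream: Iterable[float]) -> Iterator[float]:
    """Kernel (hypothetical): consume two elements at a time and emit their sum."""
    it = iter(stream)
    for a in it:
        b = next(it, 0.0)  # pad an odd-length stream with zero
        yield a + b


def run(source: Iterable[float]) -> List[float]:
    """Chain the kernels; each intermediate element is consumed as soon as it is produced."""
    return list(pairwise_sum(scale(source, 2.0)))
```

Calling `run([1.0, 2.0, 3.0, 4.0])` returns `[6.0, 14.0]`. Because the kernels are generators, the intermediate stream is never materialized in full, which loosely mirrors the producer-consumer locality that the stream model is said to expose.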
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106810
A. Gordon-Ross, F. Vahid
Dynamically-loaded tagless loop caching reduces instruction fetch power for embedded software with small loops, but only supports simple loops without taken branches. Preloaded tagless loop caching supports complex loops with branches and thus can reduce power further, but has a limit on the total number of instructions cached. We show that each does well on particular benchmarks, but neither is best across all of those benchmarks. We present a new hybrid loop cache that only preloads the complex loops, while dynamically loading other loops, thus achieving the strengths of each approach. We demonstrate better power savings than either previous approach alone.
Title : Dynamic loop caching meets preloaded loop caching-a hybrid approach
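The hybrid policy described in the abstract above (preload only the complex loops, load the other loops dynamically) can be sketched as a simple partitioning step. This is a minimal sketch under stated assumptions: the `Loop` record, the `partition_loops` function, the greedy first-fit order, and the capacity measured in instructions are all hypothetical and are not the authors' algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Loop:
    name: str
    size: int               # loop body size in instructions (assumed unit)
    has_taken_branch: bool  # True for "complex" loops with internal taken branches


def partition_loops(
    loops: List[Loop], preload_capacity: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split loops into (preloaded, dynamically loaded, uncached) name lists."""
    preloaded: List[str] = []
    dynamic: List[str] = []
    uncached: List[str] = []
    remaining = preload_capacity
    for loop in loops:
        if not loop.has_taken_branch:
            # Simple loops can be handled by the dynamically loaded part.
            dynamic.append(loop.name)
        elif loop.size <= remaining:
            # Complex loops are preloaded while the limited capacity lasts.
            preloaded.append(loop.name)
            remaining -= loop.size
        else:
            # Complex loop that does not fit: fetched from regular memory.
            uncached.append(loop.name)
    return preloaded, dynamic, uncached
```

A real tool would presumably rank complex loops by execution frequency before filling the preloaded region; the first-fit order here only keeps the sketch short.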
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106781
J. Savir, Zhen Guo
This paper investigates the detectability of parameter faults in linear, time-invariant, analog circuits. We show that there are inherent limitations with regard to analog fault detectability.
Title : On the detectability of parametric faults in analog circuits
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106738
Atsushi Mizuno, K. Kohno, Ryuichiro Ohyama, T. Tokuyoshi, H. Uetani, H. Eichel, T. Miyamori, Nobu Matsumoto, M. Matsui
A new integrated system to design and generate a configurable embedded processor for multimedia applications has been developed. The system, "Media embedded Processor Integrator", has the distinctive feature of generating development tools, such as compilers and simulators, not only for the configurable embedded processor but also for its template-based extensible VLIW co-processor. This paper describes the architecture and function of the "Media embedded Processor Integrator", focusing especially on how the system treats the VLIW co-processor extension. As an example, to determine an ISA for a 3-way VLIW co-processor for image recognition, several candidate ISAs were evaluated and compared for best performance using the corresponding compilers and simulators generated by the system. The system contributed greatly to shortening the entire ISA definition process.
Title : Design methodology and system for a configurable media embedded processor extensible to VLIW architecture
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106804
T. Henriksson, U. Nordqvist, Dake Liu
Computer network equipment presents a bottleneck for further increasing network capacity. Terminals have problems keeping up with network speed when using general-purpose processors for protocol processing. We present a novel processor architecture that works in-line with the data flow and does not use a traditional von Neumann architecture. The program is contained in three lookup tables within the processor core, which allows for one-cycle if-then-else and switch-case-case... execution. The processor is estimated to be able to handle a 10 Gb/s Ethernet connection when implemented in 0.18 micron technology.
Title : Embedded protocol processor for fast and efficient packet reception
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106775
Jessica Feng, F. Koushanfar, M. Potkonjak
Our goal is to identify the key architectural and design issues related to Sensor Networks (SNs), evaluate the proposed solutions, and outline the most challenging research directions. The evaluation has three scopes: individual components on SN nodes (processor, communication, storage, sensors, actuators, and power supply), node level, and networked system level. Special emphasis is placed on architecture and system software, and on new challenges related to the usage of new types of components in networked systems. The evaluation is guided by anticipated technology trends and both current and future applications. The main conclusion of the analysis is that the architectural and synthesis emphasis will shift from computation and, to some extent, communication components to sensors and actuators.
Title : System-architectures for sensor networks issues, alternatives, and directions
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106813
A. Baniasadi, Andreas Moshovos
We introduce branch predictor prediction (BPP) as a power-aware branch prediction technique for high-performance processors. Our predictor reduces branch prediction power dissipation by selectively turning on and off two of the three tables used in the combined branch predictor. BPP relies on a small buffer that stores the addresses and the sub-predictors used by the most recent branches executed. Later we refer to this buffer to decide if any of the sub-predictors and the selector could be gated without harming performance. In this paper we study power and performance trade-offs for a subset of SPEC 2k benchmarks. We show that on average, for an 8-way processor, BPP can reduce branch prediction power dissipation by 28% and 14% compared to non-banked and banked 32k predictors, respectively. This comes with a negligible impact on performance (1% max). We show that BPP always reduces power even for smaller predictors and that it offers better overall power and performance compared to simpler predictors.
Title : Branch predictor prediction: a power-aware branch predictor for high-performance processors
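The gating idea in the abstract above can be illustrated with a small model of the buffer that remembers which sub-predictor recent branches used. This is only one plausible reading, sketched under assumptions: the table names `bimodal`, `gshare`, and `selector`, the class `BPPBuffer`, the indexing by branch address, and the FIFO replacement are all hypothetical and may differ from the authors' design.

```python
from collections import OrderedDict
from typing import FrozenSet, Optional

# Assumed names for the three tables of a combined predictor.
ALL_TABLES: FrozenSet[str] = frozenset({"bimodal", "gshare", "selector"})


class BPPBuffer:
    """Small buffer mapping a branch address to the sub-predictor it last used."""

    def __init__(self, entries: int = 4) -> None:
        self.entries = entries
        self.hints: "OrderedDict[int, str]" = OrderedDict()

    def tables_to_enable(self, pc: int) -> FrozenSet[str]:
        """On a hit, enable only the remembered sub-predictor; otherwise enable all."""
        used: Optional[str] = self.hints.get(pc)
        if used is None:
            return ALL_TABLES
        return frozenset({used})  # the other sub-predictor and the selector are gated

    def record(self, pc: int, used: str) -> None:
        """Remember which sub-predictor supplied the prediction for this branch."""
        self.hints.pop(pc, None)
        self.hints[pc] = used
        if len(self.hints) > self.entries:
            self.hints.popitem(last=False)  # evict the oldest entry (FIFO)
```

On a buffer hit this model accesses one table instead of three, which corresponds to gating two of the three tables as the abstract describes; on a miss all three tables stay on, so prediction behaviour is unchanged.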
Pub Date : 2002-09-16 DOI: 10.1109/ICCD.2002.1106800
Lin Zhong, Jiong Luo, Yunsi Fei, N. Jha
A circuit or circuit component that does not contain any spurious switching activity, i.e., activity that is not required by its specified functionality, is called perfectly power managed (PPM). We present a general sufficient condition for register binding to ensure that a given set of functional units is PPM. This condition applies not only to data-flow intensive (DFI) behaviors but also to control-flow intensive (CFI) behaviors. It leads to a straightforward power-managed (PM) register binding algorithm. The proposed algorithm is independent of the functional unit binding and scheduling algorithms. Hence, it can be easily incorporated into existing high-level synthesis systems. For the benchmarks we experimented with, our method achieved an average power reduction of 45.9% at the cost of an average area overhead of 7.7%, compared to power-optimized register-transfer level (RTL) circuits that did not use PM register binding.
Title : Register binding based power management for high-level synthesis of control-flow intensive behaviors
Pub Date : 2002-09-01 DOI: 10.1109/ICCD.2002.1106772
C. Galke, M. Pflanz, H. Vierhaus
This paper introduces a new concept for the self test of systems on a chip (SoCs) with embedded processors. We propose a hardware- and software-based test strategy. A minimum-sized test processor was designed to perform on-chip test functions. Its architecture contains specially adapted registers that realize LFSR or MISR functions for pattern de-compaction and pattern filtering. High-performance interfaces allow parallel and serial pattern input and output, and a fast test vector comparison. The architecture is scalable and is based on a standard RISC architecture in order to facilitate the use of standard compilers.
Title : A test processor concept for systems-on-a-chip
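For readers unfamiliar with the LFSR and MISR functions mentioned in the abstract above, the following Python sketch shows the textbook versions: a Fibonacci LFSR that generates pseudo-random test patterns and a MISR that compacts a sequence of response words into a signature. The function names, the 4-bit width, and the tap positions are assumptions chosen for illustration; the sketch does not model the paper's registers or its de-compaction and filtering schemes.

```python
from typing import Iterable, List, Sequence


def _feedback(state: int, taps: Sequence[int]) -> int:
    """XOR of the state bits at the given tap positions (bit 0 is the LSB)."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return fb


def lfsr_patterns(seed: int, taps: Sequence[int], width: int, count: int) -> List[int]:
    """Return `count` successive states of a left-shifting Fibonacci LFSR."""
    mask = (1 << width) - 1
    state = seed & mask
    patterns = []
    for _ in range(count):
        patterns.append(state)
        state = ((state << 1) | _feedback(state, taps)) & mask
    return patterns


def misr_signature(
    responses: Iterable[int], taps: Sequence[int], width: int, seed: int = 0
) -> int:
    """Compact response words into one signature: LFSR step, then XOR in the word."""
    mask = (1 << width) - 1
    state = seed & mask
    for word in responses:
        state = (((state << 1) | _feedback(state, taps)) ^ word) & mask
    return state
```

With `seed=1`, `taps=(3, 2)`, and `width=4`, `lfsr_patterns` cycles through all 15 non-zero 4-bit states before repeating, starting `1, 2, 4, 9`. Comparing the `misr_signature` of observed responses against a precomputed reference signature is the usual way such a register flags a faulty response sequence.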