A VLSI architecture for image geometrical transformations using an embedded core based processor
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606815
C. Miro, N. Darbel, R. Pacalet, Valerie Paquet
This paper presents a circuit dedicated to real-time geometrical transforms of pictures. The supported transforms are third-degree polynomials of two variables. The post-processing is performed by a bilinear filter. An embedded DSP core is in charge of high-level, low-rate control tasks, while a set of hard-wired units is in charge of computing-intensive low-level tasks.
{"title":"A VLSI architecture for image geometrical transformations using an embedded core based processor","authors":"C. Miro, N. Darbel, R. Pacalet, Valerie Paquet","doi":"10.1109/ASAP.1997.606815","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606815","url":null,"abstract":"This paper presents a circuit dedicated to real time geometrical transforms of pictures. The supported transforms are third degree polynomials of two variables. The post-processing is performed by a bilinear filter. An embedded DSP core is in charge of high level, low rate, control tasks while a set of hard wired units is in charge of computing intensive low level tasks.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123981804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A flexible VLSI architecture for variable block size segment matching with luminance correction
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606853
P. Kuhn, A. Weisgerber, Robert Poppenwimmer, W. Stechele
This paper describes a flexible 25.6 giga-operations-per-second exhaustive-search segment matching VLSI architecture that supports evolving motion estimation algorithms as well as the block matching algorithms of established video coding standards. The architecture is based on a 16×16 processor element (PE) array and a 12 kbyte on-chip search area RAM, and allows concurrent calculation of motion vectors for 32×32, 16×16, 8×8 and 4×4 blocks and partial quadtrees (called segments) for a ±32 pel search range with 100% PE utilization. The architecture supports object-based algorithms by excluding pixels outside of video objects from the segment matching process, as well as advanced algorithms like variable block-size segment matching with luminance correction. A preprocessing unit is included to support half-pel interpolation and pixel decimation. The VLSI has been designed using VHDL synthesis and a 0.5 µm CMOS technology. The chip will have a clock rate of 100 MHz (min.), allowing real-time variable block-size segment matching of 4CIF video (704×576 pel) at 15 fps, or luminance-corrected variable block-size segment matching above CIF (352×288) resolution at 15 fps.
{"title":"A flexible VLSI architecture for variable block size segment matching with luminance correction","authors":"P. Kuhn, A. Weisgerber, Robert Poppenwimmer, W. Stechele","doi":"10.1109/ASAP.1997.606853","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606853","url":null,"abstract":"This paper describes a flexible 25.6 Giga operations per second exhaustive search segment matching VLSI architecture to support evolving motion estimation algorithms as well as block matching algorithms of established video coding standards. The architecture is based on a 16/spl times/16 processor element (PE) array and a 12 kbyte on-chip search area RAM and allows concurrent calculation of motion vectors for 32/spl times/32, 16/spl times/16, 8/spl times/8 and 4/spl times/4 blocks and partial quadtrees (called segments)for a +/-32 pel search range with 100% PE utilization. This architecture supports object based algorithms by excluding pixels outside of video objects from the segment matching process as well as advanced algorithms like variable blocksize segment matching with luminance correction. A preprocessing unit is included to support halfpel interpolation and pixel decimation. The VLSI has been designed using VHDL synthesis and a 0.5 /spl mu/m CMOS technology. The chip will have a clock rate of 100 MHz (min.) allowing realtime variable blocksize segment matching of 4CIF video (704/spl times/576 pel) at 15 fps or luminance corrected variable blocksize segment matching at above CIF (352/spl times/288), 15 fps resolution.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130253403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New arithmetic coder/decoder architectures based on pipelining
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606817
R. Osorio, J. Bruguera
In this paper we present new VLSI architectures for the arithmetic encoding and decoding of multilevel images. In these algorithms, speed is limited by their recursive nature and by the arithmetic and memory access operations, which become especially critical in the case of decoding. In order to reduce the cycle length, we propose interleaving two executions of the algorithm, which alternate in the use of the pipelined hardware with a minimal increase in its cost.
{"title":"New arithmetic coder/decoder architectures based on pipelining","authors":"R. Osorio, J. Bruguera","doi":"10.1109/ASAP.1997.606817","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606817","url":null,"abstract":"In this paper we present new VLSI architectures for the arithmetic encoding and decoding of multilevel images. In these algorithms the speed is limited by their recursive natures and the arithmetic and memory access operations. They become specially critical in the case of decoding. In order to reduce the cycle length we propose working with two executions of the algorithm which alternate in the use of the pipelined hardware with a minimum increase in its cost.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128711331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An optimized coefficient update processor for high-throughput adaptive equalizers
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606857
C. Lutkemeyer, T. Noll
A processor for the adaptation of the coefficients in high-throughput adaptive equalizers is presented. The accumulation operation, the fundamental basis of the adaptation process, is split into two steps: a fine-grain carry-save accumulation with time-sharing factor 2 collects the products of estimated error and input symbols over a block length of 16 input symbols and operates at twice the symbol rate; a master accumulator with time-sharing factor 32 collects the block sums from 16 fine-grain accumulators, multiplies them by the adaptation constant and carries out the final vector-merging operation, saturation, tap leakage and radix-4 Booth recoding. Three steps to reduce the power consumption of the fine-grain accumulators are presented and evaluated for a 14-bit-wide accumulator: the suppression of one of the redundant codes for the value "1" in the carry-save digit alphabet, i.e. (0,1) or (1,0), reduces the power consumption by 5.5%; the redundancy-reduced digit alphabet can be exploited to reduce the transistor count of the following full adder by one third, resulting in a significant input capacitance reduction which increases the maximum clock frequency by nearly 15% and achieves a further power consumption reduction of 2.7%. Finally, an optimized sign-extension logic reduces the capacitive load of the input sign bits by 70%, eliminates six of the full adders in the sign-extension slices and increases the power reduction to 19.2%. The maximum clock frequency of the accumulator could be increased by 18% due to the reduced internal loads.
{"title":"An optimized coefficient update processor for high-throughput adaptive equalizers","authors":"C. Lutkemeyer, T. Noll","doi":"10.1109/ASAP.1997.606857","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606857","url":null,"abstract":"A processor for the adaptation of the coefficients in high throughput adaptive equalizers is presented. The accumulation operation-fundamental basis of the adaptation process-is split into two steps: A fine-grain carry-save accumulation with time sharing factor 2 collects the products of estimated error and input symbols over a block length of 16 input symbols and operates at twice the symbol rate, a master accumulator with time-sharing factor 32 collects the block-sums from 16 fine-grain accumulators, multiplies them with the adaptation constant and carries out the final vector merging operation, saturation, tap leakage and radix-4 Booth recording. Three steps to reduce the power consumption of the fine-grain accumulators is presented and evaluated for a 14-bit-wide accumulator: The suppression of one state of the redundant codes for the value \"1\" in the carry save digit alphabet i.e. (0, 1) or (1,0), reduces the power consumption by 5.5%; The redundancy-reduced digit alphabet can be exploited to reduce the transistor count of the following full adder by one third, resulting in a significant input capacity reduction which increases the maximum clock frequency by nearly 15% and achieves further reduction of power consumption of 2.7%. Finally an optimized sign extension logic reduces the capacitive load of the input sign bits by 70%, eliminates six of the full adders in the sign extension slices and increases the power reduction to 19.2%. The maximum clock frequency of the accumulator could be increased by 18% due to the reduced internal lends.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"592 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116309183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Algorithm and architecture-level design space exploration using hierarchical data flows
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606833
H. P. Peixoto, M. Jacome
Incorporating algorithm- and architecture-level design space exploration in the early phases of the design process can have a dramatic impact on the area, speed, and power consumption of the resulting systems. This paper proposes a framework for supporting system-level design space exploration and discusses the three fundamental issues involved in supporting such early exploration effectively: the definition of an adequate level of abstraction; the definition of system-level metrics with good fidelity; and the definition of mechanisms for automating the exploration process. The first issue, the definition of an adequate level of abstraction, is then addressed in detail. Specifically, an algorithm-level model, an architecture-level model, and a set of operations on these models are proposed, aiming at efficiently supporting an early, aggressive system-level design space exploration. A discussion of work in progress on the other two topics, metrics and automation, concludes the paper.
{"title":"Algorithm and architecture-level design space exploration using hierarchical data flows","authors":"H. P. Peixoto, M. Jacome","doi":"10.1109/ASAP.1997.606833","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606833","url":null,"abstract":"Incorporating algorithm and architecture level design space exploration in the early phases of the design process can have a dramatic impact on the area, speed, and power consumption of the resulting systems. This paper proposes a framework for supporting system-level design space exploration and discusses the three fundamental issues involved in effectively supporting such an early design space exploration: definition of an adequate level of abstraction; definition of good fidelity system-level metrics; and definition of mechanisms for automating the exploration process. The first issue, the definition of an adequate level of abstraction is then addressed in detail. Specifically, an algorithm-level model, an architecture-level model, and a set of operations on these models, are proposed, aiming at efficiently supporting an early, aggressive system-level design space exploration. A discussion on work in progress in the other two topics, metrics and automation, concludes the paper.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116917199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tiling with limited resources
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606829
P. Calland, J. Dongarra, Y. Robert
In the framework of perfect loop nests with uniform dependences, tiling has been extensively studied as a source-to-source program transformation. Little work has been devoted to the mapping and scheduling of the tiles onto physical processors. We present several new results in the context of limited computational resources, assuming that communication and computation can overlap. In particular, under some reasonable assumptions, we derive the optimal mapping and scheduling of tiles to physical processors.
{"title":"Tiling with limited resources","authors":"P. Calland, J. Dongarra, Y. Robert","doi":"10.1109/ASAP.1997.606829","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606829","url":null,"abstract":"In the framework of perfect loop nests with uniform dependences, tiling has been extensively studied as a source-to-source program transformation. Little work has been devoted to the mapping and scheduling of the tiles on to physical processors. We present several new results in the context of limited computational resources, and assuming communication-computation overlap. In particular, under some reasonable assumptions, we derive the optimal mapping and scheduling of tiles to physical processors.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132371516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast arithmetic and fault tolerance in the FERMI system
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606842
L. Breveglieri, L. Dadda, V. Piuri
FERMI is a data acquisition system for calorimetry experiments in high-energy physics at the LHC, CERN. The system contains a large number of acquisition channels, with a precision of 16 bits and a sampling rate of 40 MHz. A large part of the information delivered by the channels is processed locally to reduce the amount of data. This requires clustering several channels by summing them. The paper presents the design of a fast, low-cost adder chip based on column compression techniques for integer addition. Since the system operates in a harsh radiation environment, fault tolerance (namely fault detection) is implemented by means of arithmetic codes.
{"title":"Fast arithmetic and fault tolerance in the FERMI system","authors":"L. Breveglieri, L. Dadda, V. Piuri","doi":"10.1109/ASAP.1997.606842","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606842","url":null,"abstract":"The FERMI is a data acquisition system for calorimetry experiments in high energy physics at the LHC, CERN. The system contains a large number of acquisition channels, with a precision of 16 bits and a sampling rate of 40 MHz. A large part of the information driven by the channels is processed locally, to reduce the amount of data. This requires to cluster several channels by adding them. The paper presents the design of a fast, low cost adder chip, based on the implementation of column compression techniques for the computation of integer addition. Since the system is operating in a radiation-hard environment, fault tolerance (namely fault detection) is implemented by means of arithmetic codes.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132581085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Array placement for storage size reduction in embedded multimedia systems
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606813
E. D. Greef, F. Catthoor, H. Man
In this paper we present the second stage of a two-phase strategy for reducing the required background memory sizes for a large class of data-intensive multimedia applications. This strategy is particularly useful in an embedded application context, where memory size and the corresponding power consumption are the main cost factors together with data transfers. Our strategy optimizes the storage order of arrays in memory by trying to improve the reuse of memory locations, both for elements of the same array and for elements of different arrays. Although size reduction is the main objective, an added benefit is a reduced power consumption due to the decreased capacitive load of the memories. The memory size reduction task is part of an overall memory size and power reduction methodology called ATOMIUM, in which other tasks can increase its effectiveness (e.g. loop transformations), but it can also be used on a stand-alone basis. The effectiveness of our approach is demonstrated by experimental results for some real-life multimedia applications, for which a considerable memory size reduction was obtained.
{"title":"Array placement for storage size reduction in embedded multimedia systems","authors":"E. D. Greef, F. Catthoor, H. Man","doi":"10.1109/ASAP.1997.606813","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606813","url":null,"abstract":"In this paper we present the second stage of a two-phase strategy for reducing the required background memory sizes for a large class of data-intensive multimedia applications. This strategy is particularly useful in an embedded application context, where memory size and the corresponding power consumption are the main cost factors together with data transfers. Our strategy optimizes the storage order of arrays in memory by trying to improve the reuse of memory locations, as well for elements of the same array as for elements of different arrays. Although size reduction is the main objective, an added benefit is a reduced power consumption due to the decreased capacitive load of the memories. The memory size reduction task is part of an overall memory size and power reduction methodology called ATOMIUM in which other tasks can increase its effectiveness (e.g. loop, transformations), but it can also be used on a stand-alone base. The effectiveness of our approach is demonstrated by experimental results for some real-life multimedia applications, for which a considerable memory size reduction was obtained.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131716476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A datapath generator for full-custom macros of iterative logic arrays
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606849
M. Gansen, F. Richter, O. Weiss, T. Noll
A new flexible datapath generator is presented which allows the automated design of full-custom macros covering dedicated filter structures as well as programmable DSP cores. The underlying concept combines the advantages of full-custom design in power dissipation, silicon area, and throughput rate with a moderate design effort. In addition, the datapath generator can easily be included in existing semi-custom design flows. This enables highly efficient VLSI implementations of optimized full-custom macros (datapaths) embedded into synthesized standard cell designs covering structures that are non-critical in terms of area, power, and throughput (e.g. control paths), using common design flows. In order to demonstrate the generator-assisted design flow, the implementation of a time-shared correlator is presented as an example.
{"title":"A datapath generator for full-custom macros of iterative logic arrays","authors":"M. Gansen, F. Richter, O. Weiss, T. Noll","doi":"10.1109/ASAP.1997.606849","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606849","url":null,"abstract":"A new flexible datapath generator which allows the automated design of full-custom macros covering dedicated filter structures as well as programmable DSP cores is presented. The underlying concept combines the advantages of full-custom designs concerning power dissipation, silicon area, and throughput rate with a moderate design effort. In addition, the datapath generator can be easily included in existing semi-custom design flows. This enables highly efficient VLSI implementations of optimized full-custom macros (datapaths) embedded into synthesized standard cell designs covering uncritical structures in terms of area, power, and throughput (e.g. control paths) using common design flows. In order to demonstrate the datapath generator assisted design flow, the implementation of a time-shared correlator is presented as an example.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131543959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processor elements for the standard cell implementation of residue number systems
Pub Date: 1997-07-14. DOI: 10.1109/ASAP.1997.606818
A. Drolshagen, H. Henkelmann, W. Anheier
In this article, processor elements for the effective implementation of standard cell circuits based on residue number systems (RNS) are presented. Two new processors are proposed that help to reduce the hardware requirements of the implementations. Following a new implementation strategy, a comparison with other circuits discussed in the past shows that the new method and cells lead to faster and smaller circuits.
{"title":"Processor elements for the standard cell implementation of residue number systems","authors":"A. Drolshagen, H. Henkelmann, W. Anheier","doi":"10.1109/ASAP.1997.606818","DOIUrl":"https://doi.org/10.1109/ASAP.1997.606818","url":null,"abstract":"In this article processor elements for the effective implementation of standard cell circuits based on residue number systems (RNS) are presented. Two new processors are proposed helping to reduce the hardware requirements of the implementations. Following a new strategy for implementation a comparison between other circuits discussed in past prove the new method and cells to lead to faster and smaller circuits.","PeriodicalId":368315,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125510578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}