Pub Date: 1996-10-30. DOI: 10.1109/VLSISP.1996.558335
G. Hekstra, E. Deprettere
Rendering artificial scenes is an appealing example of a class of problems that lead to complex, data-dependent algorithms, for which efficient software/hardware mapping techniques have to be devised. We present one of the ASICs in our rendering system to illustrate our design methodology in more detail. The first step in the algorithm-architecture design is to reformulate an existing naive algorithm so that, as far as possible, only significant operations are performed. The resulting algorithm has a nested loop structure with non-manifest, data-dependent loop bounds, which renders classical parallelisation techniques useless. The second step is to reduce the overall computation time of the algorithm by reducing the computational complexity of the innermost loop operation. The third and last step is to map this algorithm onto a pipelined architecture in which the pipeline stages (functional units within an ASIC) implement different loop levels. Because of the data-dependent nature of the algorithm, the functional units that implement the parts of the loops vary over time in both their execution time and the amount of data they produce for the following pipeline stages. Because the execution times of the pipeline stages change, the location of the bottleneck also shifts over time. Hence the goal is not to keep all pipeline stages continually busy, but to keep the throughput of the most critical innermost loop operation as high as possible.
Title: A chip set for a ray-casting engine. Published in: VLSI Signal Processing, IX.
Pub Date: 1996-10-30. DOI: 10.1109/VLSISP.1996.558370
Jamshed N. Patel, Ashfaq A. Khokhar, Leah H. Jamieson
We present analytical and experimental results for the scalability of 2-D discrete wavelet transform (DWT) algorithms on coarse-grained parallel architectures. The principal operation in the 2-D DWT is the filtering operation used to implement the filter banks of the 2-D subband decomposition. We derive analytical results comparing time-domain and frequency-domain parallel algorithms for realizing the filter banks. Experiments on the Intel Paragon validate the analytical results. We demonstrate that there exist combinations of machine size, image size, and wavelet size for which the time-domain algorithms outperform the frequency-domain algorithms, and vice versa.
Title: Scalability of 2-D wavelet transform algorithms: analytical and experimental results on coarse-grained parallel computers
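The time-domain versus frequency-domain choice described in the wavelet abstract above can be illustrated on a single processor. The following is a minimal sketch, not the authors' parallel algorithm: it assumes circular (periodic) boundary handling, Haar filters, and NumPy, and the function names `circular_filter_time`, `circular_filter_freq`, and `dwt2_level` are hypothetical. It shows that one level of 2-D subband decomposition gives the same result whether the filter bank is realized by direct convolution or by FFT-based multiplication; only the operation count differs.

```python
import numpy as np

# Haar analysis filters (an assumption; the abstract does not name a wavelet).
LO = np.array([1.0, 1.0]) / np.sqrt(2.0)
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)


def circular_filter_time(x, h):
    """Time-domain circular convolution: y[i] = sum_k h[k] * x[(i - k) mod n]."""
    y = np.zeros(len(x))
    for k, tap in enumerate(h):
        # np.roll(x, k)[i] equals x[(i - k) mod n]
        y += tap * np.roll(x, k)
    return y


def circular_filter_freq(x, h):
    """Frequency-domain circular convolution via the FFT (h is zero-padded to len(x))."""
    n = len(x)
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, n)))


def dwt2_level(image, lo, hi, filt):
    """One level of separable 2-D subband decomposition using the 1-D filter routine `filt`."""
    # Filter along rows (axis 1), then keep every second column.
    low = np.apply_along_axis(filt, 1, image, lo)[:, ::2]
    high = np.apply_along_axis(filt, 1, image, hi)[:, ::2]
    # Filter along columns (axis 0), then keep every second row.
    ll = np.apply_along_axis(filt, 0, low, lo)[::2, :]
    lh = np.apply_along_axis(filt, 0, low, hi)[::2, :]
    hl = np.apply_along_axis(filt, 0, high, lo)[::2, :]
    hh = np.apply_along_axis(filt, 0, high, hi)[::2, :]
    return ll, lh, hl, hh
```

Passing either filter routine to `dwt2_level` yields the same four subbands up to floating-point error. Roughly speaking, the direct form costs work proportional to the filter length per sample while the FFT form costs work proportional to the logarithm of the row or column length, which is one reason the better choice can depend on image size and wavelet size.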
Pub Date: 1996-10-30. DOI: 10.1109/VLSISP.1996.558375
Seongsoo Lee, Jeong-Min Kim, S. Chae
We propose a new algorithm for real-time motion estimation in MPEG2 video encoding. It reduces the computational cost by using low bit-resolution quantization and a new matching criterion. To maintain performance, we employ a low-resolution search followed by a full-resolution search. Simulation results show that the proposed algorithm requires 1/17.4 of the computational cost of the full search algorithm while keeping the performance degradation below 0.37 dB, for a search range of -32.0 to +31.5 in the CCIR601 image. The architecture of the real-time MPEG2 motion estimator using this algorithm is also explained. It searches two prediction modes concurrently over the -32.0 to +31.5 search range. Its hardware complexity is estimated at about 100,000 gates of random logic and 90 Kbits of SRAM. A VLSI design of the proposed architecture is in progress using a 0.5 μm triple-metal CMOS standard-cell technology.
Title: New motion estimation using low-resolution quantization for MPEG2 video encoding
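The two-stage idea in the motion-estimation abstract above (a cheap search on coarsely quantized pixels, followed by a full-resolution search) can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the abstract does not give the new matching criterion, so the sum of absolute differences (SAD) is used as a stand-in, the search is integer-pel only, and the function name `two_stage_search` and the parameters `bits` and `keep` are hypothetical.

```python
import numpy as np


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())


def two_stage_search(cur, ref, top, left, block=8, search=4, bits=2, keep=4):
    """Estimate the motion vector (dy, dx) of the block of `cur` at (top, left).

    Stage 1 scans the whole search window using only the `bits` most
    significant bits of each 8-bit pixel and keeps the `keep` best candidates.
    Stage 2 re-evaluates those candidates at full pixel resolution.
    """
    shift = 8 - bits  # assumes 8-bit pixels
    cur_blk = cur[top:top + block, left:left + block]
    cur_q = cur_blk >> shift
    ref_q = ref >> shift

    coarse = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if 0 <= y <= ref.shape[0] - block and 0 <= x <= ref.shape[1] - block:
                cost = sad(cur_q, ref_q[y:y + block, x:x + block])
                coarse.append((cost, dy, dx))
    coarse.sort()

    # Full-resolution refinement over the surviving candidates only.
    best = min(
        (sad(cur_blk, ref[top + dy:top + dy + block, left + dx:left + dx + block]), dy, dx)
        for _, dy, dx in coarse[:keep]
    )
    return best[1], best[2]
```

The saving comes from stage 1 operating on narrow operands for every candidate, while the wide-operand comparison in stage 2 runs on only a handful of positions. The quality loss depends on how often the true best match is discarded in stage 1.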
Pub Date: 1996-10-30. DOI: 10.1109/VLSISP.1996.558332
Edwin de Angel, Earl E. Swartzlander
This paper presents and compares sign extension techniques used to decrease the switching activity and improve the performance of parallel multipliers. A detailed review of different sign extension schemes is presented and an improved scheme for reducing the power dissipation is proposed. Four parallel CMOS multipliers designed in 0.6 μm technology are used to implement and compare the sign extension schemes.
Title: Low power parallel multipliers
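To make the role of sign extension in the multiplier abstract above concrete, the sketch below contrasts two ways of summing two's-complement rows. It shows the widely used constant-correction identity, not the improved scheme the paper proposes, which the abstract does not describe. For simplicity all rows are assumed to be aligned (a real partial-product array shifts each row), and the function names are hypothetical. An n-bit row with sign bit s and lower bits m has value m - s·2^(n-1), which equals m + (1 - s)·2^(n-1) - 2^(n-1), so the replicated sign bits can be replaced by one inverted sign bit per row plus a single precomputed constant.

```python
def sign_extend(value, n, width):
    """Replicate the sign bit of an n-bit two's-complement pattern up to `width` bits."""
    value &= (1 << n) - 1
    if value >> (n - 1):
        # Set every bit from position n up to width - 1.
        value |= ((1 << width) - 1) & ~((1 << n) - 1)
    return value


def sum_with_sign_extension(rows, n, width):
    """Conventional approach: every row carries width - n copies of its sign bit."""
    mask = (1 << width) - 1
    return sum(sign_extend(r, n, width) for r in rows) & mask


def sum_with_constant(rows, n, width):
    """Constant-correction approach: invert each sign bit, zero-extend, add one constant."""
    mask = (1 << width) - 1
    sign = 1 << (n - 1)
    # Flipping the sign bit adds 2^(n-1) to the signed value of each row.
    total = sum((r & ((1 << n) - 1)) ^ sign for r in rows)
    # Remove the accumulated offset once, modulo 2^width.
    correction = (-len(rows) * sign) & mask
    return (total + correction) & mask
```

Both functions return the same `width`-bit result. In hardware terms, the second form avoids driving the upper bit positions of every row with copies of the sign bit, and those copies are the signals whose toggling sign extension schemes aim to reduce.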
Pub Date: 1996-10-30. DOI: 10.1109/VLSISP.1996.558362
M. Vaupel, T. Grotker, H. Meyr
We describe the automated generation of components for high-throughput, data-flow-dominated VLSI systems in digital communications. By means of a hierarchically organized library, both behavioural models with high simulation efficiency and the corresponding hardware generators, which produce sophisticated VHDL descriptions, are made easily accessible to the system designer. The structured approach allows the trade-offs between alternatives to be evaluated at each design step and guarantees a fast and reliable design flow towards hardware. The design environment ComBox enhances reusability and enables rapid implementation of complex systems starting from a system-level description.
Title: ComBox: library-based generation of VHDL modules