Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558326
K. Parhi, Janardhan H. Satyanarayana
We show theoretically that the average energy consumption of a ripple-carry adder is O(W), and that the upper bound on the average energy consumption is O(W log₂ W), where W is the word-length of the operands. Our theoretical analysis is based on a simple state transition diagram (STD) model of a full adder cell and on the observations that the average length of a carry propagation chain is v = 2 and that the average length of the maximum carry chain is v ≤ log₂ W. To verify our theoretical conclusions, we use the HEAT CAD tool to estimate the average power consumed by the ripple-carry adder for word-lengths 4 ≤ W ≤ 64. The experimental results show that, for W ≥ 16, the error in our theoretical estimations is around 15%.
Title: Estimation of average energy consumption of ripple-carry adder based on average length carry chains
Journal: VLSI Signal Processing, IX
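The bound on the longest carry chain quoted in the abstract can be explored numerically. The following is a minimal sketch, not the authors' STD model or the HEAT tool: it assumes that a chain starts at a bit position where both operand bits are 1 (carry generated) and extends through the following positions where exactly one operand bit is 1 (carry propagated). The function names and this chain definition are illustrative assumptions and may differ from the definitions used in the paper.

```python
import math
import random


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Length of the longest carry chain when adding a and b.

    Assumed definition: a chain starts at a bit where both inputs are 1
    (generate) and continues through the following bits where exactly
    one input is 1 (propagate). The generating bit counts as length 1.
    """
    longest = 0
    current = 0
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        if abit & bbit:
            current = 1  # generate: a new chain starts here
        elif abit ^ bbit:
            current = current + 1 if current else 0  # propagate an active chain
        else:
            current = 0  # kill: any active chain ends
        longest = max(longest, current)
    return longest


def mean_longest_chain(width: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean longest chain for uniform random operands."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_carry_chain(a, b, width)
    return total / trials


if __name__ == "__main__":
    for w in (4, 8, 16, 32, 64):
        print(w, round(mean_longest_chain(w), 2), "log2(W) =", math.log2(w))
```

Under this definition, the estimated mean for uniformly random operands can be compared directly against log₂ W for each word-length in the range studied in the abstract.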
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558310
L. Nachtergaele, Francky Catthoor, Bhanu Kapoor, S. Janssens, D. Moolenaar
We describe a power exploration methodology for data-dominated applications using an H.263 video decoding demonstrator application. The starting point for our exploration is a C specification of the video decoder, available in the public domain from Telenor Research. We have transformed the data transfer scheme in the specification and have optimized the distributed memory organization. This results in a memory architecture with significantly reduced power consumption. For the worst-case mode using predicted and bi-directional (PB) frames, memory power consumption is reduced by a factor of 9. To achieve these results, we make use of our formalized high-level memory management methodology, partly supported in our ATOMIUM environment.
Title: Low power storage exploration for H.263 video decoder
Journal: VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558313
L. Guerra, M. Potkonjak, J. Rabaey
The paper proposes a divide-and-conquer approach for global throughput optimization designed to coordinate existing techniques and enable their more effective use. The "divide" approach consists of logical partitioning of the computation into subparts. The techniques for partitioning the computation and the corresponding scheme for classifying the subparts are presented. The subparts are optimized, or "conquered", through coordinated application of existing optimization techniques. Optimization techniques that are effective for each class have been characterized in terms of their expected effect on throughput. The approach is not limited to a specific class of computations and, on all examples, gives an improvement greater than or at least equal to that of previously reported techniques.
Title: Divide-and-conquer techniques for global throughput optimization
Journal: VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558320
L. Lucke, L. Nelson, H. Oie
We present serial and parallel architectures for an LMS adaptive filter implementation of the minimum-mean-square-error adaptive CDMA receiver. These architectures use fixed-point numbers to represent the variables and a 2-bit representation of the input signal to reduce the complexity of the arithmetic operations. We simulate the bit error rate of these architectures to study their performance in near-far and multipath environments. The simulations are used to determine the optimal wordlengths. Simulation results show that, when a sufficient number of bits is used, the performance of this reduced-complexity digital adaptive filter is comparable to that of the analog one and much better than that of the conventional matched filter.
Title: Adaptive CDMA receiver implementation for multipath and multiuser environments
Journal: VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558361
Kyosun Kim, R. Karri, M. Potkonjak
As witnessed by their rapid market growth, application specific programmable processors (ASPP) provide an attractive alternative to both fully programmable and fully custom hardware platforms. ASPP are data paths which provide efficient implementation for any of k functional specifications, assuming that only one will be executed at any given time. We combine the flexibility provided by multiple functionalities with judicious operation-to-application allocation to maximize the permanent fault-tolerance of such ASPP designs. The approach and the synthesis algorithms are demonstrated on a number of signal processing applications.
Title: Maximizing the fault-tolerance of application specific programmable signal processors
Journal: VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558309
E. Holmann, A. Yamada, T. Yoshida, S. Uramoto
A single chip system for real-time MPEG-2 decoding can be created by integrating a dual-issue RISC processor with a small amount of dedicated hardware for the variable length decoding (VLD) and block loading processes, a 32 KB instruction RAM, and a 16 KB data RAM. The VLD hardware performs the Huffman decoding on the input data. The block loader performs the half-sample prediction for motion compensation and acts as a direct memory access controller for the RISC processor. The dual-issue RISC processor, running at 250 MHz, is enhanced with a set of key sub-word and multimedia instructions for a sustained peak performance of 1000 MOPS. With this setup for MPEG-2 decoding applications, bi-directionally predicted non-intra blocks are decoded in less than 800 cycles, leading to a single chip, real-time MPEG-2 decoding system.
Title: Real-time MPEG-2 software decoding with a dual-issue RISC processor
Journal: VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558279
Y. Hwang, C.-L. Su
This paper presents a distributed arithmetic (DA) based design scheme for recursive DSP systems requiring high speed computing. The proposed scheme features a bit-serial word-parallel approach and is found to be more efficient than the conventional bit-parallel word-serial scheme. We apply this scheme to design an ARMA filter and obtain an initiation interval as small as the delay of processing only one output bit. We also incorporate the look-ahead transform and block processing techniques in the proposed DA scheme for further speed improvement. Finally, we propose a signed digit DA scheme to solve the performance degradation problem caused by data word length truncation in fixed point number computing systems.
Title: Parallel and pipelined architecture designs for distributed arithmetic-based recursive digital filters
Journal: VLSI Signal Processing, IX
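The basic distributed arithmetic idea behind the abstract above can be illustrated in software. This is a minimal sketch of textbook DA for a fixed-coefficient inner product, not the authors' recursive-filter architecture: one bit from every input word addresses a precomputed lookup table per step (bit-serial, word-parallel), and the table outputs are shift-accumulated. The function names and the two's-complement input convention are assumptions made for illustration.

```python
def build_lut(coeffs):
    """Table entry `addr` holds the sum of the coefficients whose bit is set in addr."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, xs, bits):
    """Compute sum(c * x) using one table lookup per input bit position.

    Each x is assumed to fit in `bits`-bit two's complement, so the
    most significant bit carries a negative weight.
    """
    lut = build_lut(coeffs)
    acc = 0
    for j in range(bits):
        # Gather bit j of every input word into one table address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> j) & 1) << k
        term = lut[addr] << j
        # The sign bit of two's complement has weight -2^(bits-1).
        acc += -term if j == bits - 1 else term
    return acc


if __name__ == "__main__":
    coeffs = [3, -2, 5]
    xs = [7, -8, -3]
    print(da_inner_product(coeffs, xs, bits=4))
    print(sum(c * x for c, x in zip(coeffs, xs)))
```

The table replaces all multipliers, and the number of steps equals the input word length rather than the number of taps, which is why the technique suits bit-serial datapaths.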
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558379
A. Abnous, J. Rabaey
Programmability is an important requirement for portable computing and communication devices that must be flexible enough to accommodate a variety of multimedia services and communication capabilities. However, compared to dedicated, application-specific solutions, programmable devices often incur significant performance and power penalties. We present a hybrid architecture template that can be used to implement ultra-low-power programmable processors for signal processing applications.
Title: Ultra-low-power domain-specific multimedia processors
Journal: VLSI Signal Processing, IX
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558351
A. Kwentus, O. Lee, A. Willson
The implementation of a 250 Msample/sec programmable six-stage cascaded integrator-comb (CIC) decimation filter is described. The prototype IC is implemented using 0.8-μm CMOS and contains 39,890 transistors in a core area of 8.5 mm². It accommodates programmable power-of-two decimation factors from 2 to 1024 with 16-bit input and output data.
Title: A 250 Msample/sec programmable cascaded integrator-comb decimation filter
Journal: VLSI Signal Processing, IX
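For readers unfamiliar with the CIC structure named in the abstract above, a behavioural model is short. This is a minimal sketch, not a description of the prototype IC: it assumes a differential delay of 1 in the comb sections (the abstract does not state it), and it uses Python's unbounded integers, whereas hardware relies on wraparound arithmetic in sufficiently wide registers. The function name and parameters are illustrative.

```python
def cic_decimate(samples, rate, stages=6):
    """Behavioural CIC decimator: integrators, downsample by `rate`, then combs.

    Assumes a comb differential delay of 1 (an illustrative choice).
    """
    integ = [0] * stages      # integrator states, running at the input rate
    comb_prev = [0] * stages  # previous comb inputs, running at the output rate
    out = []
    for n, x in enumerate(samples):
        acc = x
        for i in range(stages):
            integ[i] += acc
            acc = integ[i]
        if (n + 1) % rate == 0:  # keep every `rate`-th integrator output
            y = acc
            for i in range(stages):
                # Comb: output is the current input minus the previous input.
                y, comb_prev[i] = y - comb_prev[i], y
            out.append(y)
    return out


if __name__ == "__main__":
    # A constant input settles to the DC gain rate**stages.
    print(cic_decimate([1] * 64, rate=4)[-1])
```

With a constant input of 1, the output settles to rate**stages, the DC gain of a cascade of moving-sum filters of that length, which is why practical designs scale or truncate the result to fit the output word.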
Pub Date : 1996-10-30 DOI: 10.1109/VLSISP.1996.558302
M.C. Mekhallalati, A. Ashur
Two novel uni-directional systolic structures for serial multiplication over the finite field GF(2^m) are presented. The architecture of the new structures possesses the features of regularity, modularity, and uni-directional data flow. One of the new structures is a serial-parallel structure, whereas the other is fully serial. Both structures consist of (m/2) novel cells. Owing to the novel cell architectures of the new structures, the initial delay (i.e. the number of cycles required to obtain the first output) and the latency (i.e. the number of cycles required to complete the multiplication process) are decreased by 25% and 17% respectively. Also, the number of latches in the new structures is reduced by more than 20% when compared to existing uni-directional serial-parallel structures.
Title: Novel structures for serial multiplication over the finite field GF(2^m)
Journal: VLSI Signal Processing, IX
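The operation that such serial multipliers compute can be modelled in a few lines. This is a minimal software sketch of LSB-first shift-and-add multiplication in GF(2^m) with polynomial-basis reduction; it is not the systolic architecture described in the abstract, and the function name and argument layout are assumptions. The irreducible polynomial is passed with its x^m term included, for example 0x11B for the field GF(2^8) used by AES.

```python
def gf2m_multiply(a: int, b: int, m: int, poly: int) -> int:
    """Multiply a and b in GF(2^m), one bit of b per step.

    `poly` is the irreducible polynomial including its x^m term.
    Both operands are assumed to be already reduced (less than 2^m).
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a  # addition in GF(2^m) is bitwise XOR
        b >>= 1
        a <<= 1          # multiply a by x
        if a >> m:       # degree reached m: reduce modulo poly
            a ^= poly
    return result


if __name__ == "__main__":
    # Worked example from the AES specification: {57} * {83} = {c1}.
    print(hex(gf2m_multiply(0x57, 0x83, 8, 0x11B)))
```

Each loop iteration corresponds to one serial step, so a straightforward bit-serial realisation needs on the order of m cycles per product.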