Dynamic Channel Flow Control of Networks-on-Chip Systems for High Buffer Efficiency
Sung-Tze Wu, Chih-Hao Chao, I-Chyn Wey, A. Wu
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387597 | Pages: 493-498
System-on-Chip (SoC) designs are becoming increasingly complex, and communication between processing elements is challenged by the wiring problem. Networks-on-Chip (NoC) provide a practical solution. The major components of a NoC are its routers, whose cost is dominated by buffer size, and previous flow-control mechanisms require large buffers to achieve high performance. This paper proposes a dynamic channel flow control mechanism that shares channel resources globally, increasing both throughput and channel utilization. An 8 × 8 mesh on-chip network is implemented on a cycle-accurate simulator. Experimental results show that the proposed mechanism reduces buffer size by 30% compared with virtual channel flow control at the same throughput, and improves throughput by 20% compared with wormhole flow control.
Hardware Efficient QR Decomposition for GDFE
Kyung-Ju Cho, Yinan Xu, Jin-Gyun Chung
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387583 | Pages: 412-417
This paper presents a QR decomposition core based on Givens rotations for the generalized decision feedback equalizer (GDFE). A Givens rotation consists of phase extraction, sine/cosine generation, and angle rotation. By combining a fixed-width modified-Booth multiplier with a two-stage (coarse and fine) method, we design an efficient QR decomposition core. Simulations show that the proposed core is a feasible solution for the GDFE.
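The abstract summarizes the hardware architecture only; as background, a minimal floating-point reference of QR decomposition via Givens rotations is sketched below in NumPy (the core's fixed-width modified-Booth multiplier and coarse/fine angle-generation stages are not modeled):

```python
import numpy as np

def givens_qr(A):
    """QR decomposition by zeroing sub-diagonal entries with Givens
    rotations (floating-point reference, not the fixed-point core)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q, R = np.eye(m), A.copy()
    for j in range(n):                      # eliminate column by column
        for i in range(m - 1, j, -1):       # zero R[i, j] using row i-1
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r             # cos/sin of the rotation angle
            G = np.array([[c, s], [-s, c]])
            R[i - 1:i + 1, :] = G @ R[i - 1:i + 1, :]
            Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T
    return Q, R

A = np.random.randn(4, 4)
Q, R = givens_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(np.tril(R, -1), 0)
```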
A Partial Self-Reconfigurable Adaptive FIR Filter System
Chang-Seok Choi, Hanho Lee
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387545 | Pages: 204-209
This paper presents a self-reconfigurable adaptive FIR filter system that uses dynamic partial reconfiguration, offering flexibility, power efficiency, and short configuration time by dynamically inserting or removing adaptive FIR filter modules. The self-reconfigurable adaptive FIR filter provides realization and autonomous adaptation of FIR filters, executing low-pass, band-pass, and high-pass filter algorithms at various frequencies for noise removal. The proposed stand-alone self-reconfigurable system, built on a Xilinx Virtex-4 FPGA with CompactFlash memory, improves configuration time and flexibility through dynamic partial reconfiguration.
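The abstract gives no implementation details beyond the FPGA platform; the sketch below is only a hypothetical software model of the module-swapping idea, where exchanging coefficient sets stands in for loading a partial bitstream (the filter length and cutoff frequencies are made up for illustration):

```python
import numpy as np
from scipy.signal import firwin

TAPS = 31  # filter length chosen arbitrarily for the example
# Coefficient sets standing in for the reconfigurable filter modules.
modules = {
    "lowpass":  firwin(TAPS, 0.2),
    "bandpass": firwin(TAPS, [0.25, 0.45], pass_zero=False),
    "highpass": firwin(TAPS, 0.5, pass_zero=False),
}

def run_filter(samples, module_name):
    """Software stand-in for 'reconfiguring' the filter: select one
    module's taps and stream the samples through it."""
    taps = modules[module_name]
    return np.convolve(samples, taps, mode="same")

x = np.random.randn(1024)                 # noisy input signal
y_lp = run_filter(x, "lowpass")           # swap in the low-pass module
y_bp = run_filter(x, "bandpass")          # then the band-pass module
```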
Coefficient Conversion for Transform Domain VC-1 to H.264 Transcoding
Maria Pantoja, N. Ling, Weijia Shang
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387573 | Pages: 363-367
This paper discusses the problem of transcoding between the VC-1 and H.264 video standards. VC-1 uses an adaptive block size integer transform, which differs from the 4×4 integer transform used by H.264. We propose an algorithm to convert transform coefficients from VC-1 to H.264, a fundamental step for transform domain transcoding. The paper also presents a fast-computation version of the algorithm. The implementation of the proposed algorithm shows that video quality remains roughly the same while complexity is greatly reduced compared with the reference full cascade pixel domain transcoder.
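The general conversion idea can be expressed with a single kernel matrix: if a residual block X has VC-1 coefficients Y = T1 X T1^T, then its H.264 coefficients equal K Y K^T with K = T2 inv(T1). The sketch below illustrates this for the 4×4 case using the commonly published basis matrices; the standards' scaling and quantization details are ignored, so treat it as an assumption rather than the paper's exact procedure:

```python
import numpy as np

# Commonly published 4x4 integer transform bases (normalization and
# quantization scaling omitted); the exact matrices are an assumption
# here, not quoted from the paper.
T_VC1 = np.array([[17,  17,  17,  17],
                  [22,  10, -10, -22],
                  [17, -17, -17,  17],
                  [10, -22,  22, -10]], dtype=float)
T_H264 = np.array([[1,  1,  1,  1],
                   [2,  1, -1, -2],
                   [1, -1, -1,  1],
                   [1, -2,  2, -1]], dtype=float)

# Single conversion kernel: if Y = T_VC1 @ X @ T_VC1.T, then
# T_H264 @ X @ T_H264.T equals K @ Y @ K.T with K = T_H264 @ inv(T_VC1).
K = T_H264 @ np.linalg.inv(T_VC1)

def convert_block(Y_vc1):
    """Map a 4x4 block of VC-1 coefficients to H.264 coefficients
    entirely in the transform domain."""
    return K @ Y_vc1 @ K.T

# Consistency check on a random residual block
X = np.random.randn(4, 4)
assert np.allclose(convert_block(T_VC1 @ X @ T_VC1.T),
                   T_H264 @ X @ T_H264.T)
```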
On the Complexity of Joint Demodulation and Convolutional Decoding
Dimitris Gkrimpas, Vassilis Paliouras
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387629 | Pages: 669-674
This paper investigates the combined computational complexity of demodulation and decoding of QAM signals. Four combinations of demodulation and decoding techniques are compared in terms of bit error rate (BER) versus signal-to-noise ratio (SNR) behavior, finite word length effects, and hardware complexity. Joint demodulation and decoding over a high-radix trellis is found to be more efficient for higher modulation orders, while a demodulation strategy that produces soft values, followed by a Viterbi decoder, is more efficient for lower modulation orders. Complexity formulas that account for word lengths and modulation order are introduced.
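As a concrete illustration of the second strategy (soft-output demodulation feeding a Viterbi decoder), a minimal max-log LLR demapper for square QAM is sketched below; the constellation labeling and LLR sign convention are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from itertools import product

def square_qam(bits_per_symbol):
    """Square QAM constellation with a simple per-axis bit labeling
    (the labeling is an illustrative choice, not the paper's mapping)."""
    m = bits_per_symbol // 2
    levels = np.arange(-(2**m - 1), 2**m, 2)        # e.g. [-3,-1,1,3]
    points, labels = [], []
    for i, q in product(range(2**m), repeat=2):
        points.append(levels[i] + 1j * levels[q])
        labels.append([(i >> k) & 1 for k in reversed(range(m))] +
                      [(q >> k) & 1 for k in reversed(range(m))])
    return np.array(points), np.array(labels)

def max_log_llrs(r, points, labels, noise_var):
    """Max-log LLRs (log P[b=0]/P[b=1]) for one received sample -- the
    soft values handed to a soft-input Viterbi decoder."""
    d2 = np.abs(r - points) ** 2
    return np.array([(d2[labels[:, b] == 1].min() -
                      d2[labels[:, b] == 0].min()) / noise_var
                     for b in range(labels.shape[1])])

pts, labs = square_qam(4)                           # 16-QAM
print(max_log_llrs(0.9 + 2.8j, pts, labs, noise_var=0.5))
```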
Novel Complete Passive Equivalent Circuit Model of the Practical 4-OTA-Based Floating Inductor
R. Banchuin, B. Chipipop, B. Sirinaovakul
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387589 | Pages: 447-451
In this research, the practical 4-OTA-based floating inductor in a widely cited monolithic CMOS technology is studied, and a complete passive equivalent circuit model is proposed that accounts for both parasitic elements and finite open-loop bandwidth. An accuracy evaluation shows that the model is highly accurate, with a very small average error. A further study that includes mismatches among the OTAs, in order to obtain the most accurate results, is also proposed. Owing to its small average error, and because the monolithic CMOS technology allows OTA mismatches to be neglected, the proposed passive equivalent circuit model is a convenient tool for designing signal processing circuits that require CMOS-OTA-based floating inductors.
Iterative Joint Source Channel Decoding of Error Correction Arithmetic Codes
Junqing Liu, Tianhao Li
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387570 | Pages: 346-350
Binary arithmetic codes with forbidden symbols (termed error correction arithmetic codes, ECAC) can be modeled as finite state machines and treated as variable-length trellis codes. In this paper, a novel iterative joint source-channel decoding algorithm is proposed for decoding trellis-based error correction arithmetic codes. Unlike conventional iterative decoding algorithms, it does not require additional check codes such as CRC during encoding; instead, the proposed algorithm uses Monte Carlo methods to detect erroneous bits directly. Furthermore, the outer error detector not only detects erroneous bits but also provides the probability of the error location to the inner error corrector, accelerating the decoding process. Experimental results show that the proposed algorithm achieves significant improvements in symbol error rate over conventional decoding algorithms, at an acceptable increase in computational complexity.
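The error-detection capability of the forbidden symbol can be illustrated with a toy floating-point arithmetic coder: the encoder never uses a reserved fraction of each coding interval, so a decoder that lands there knows a channel error has occurred. This is only a sketch of the general ECAC idea with assumed probabilities; the paper's trellis model, Monte Carlo error detection, and finite-precision renormalization are not reproduced:

```python
EPS = 0.1   # forbidden-symbol probability (pure coding-rate penalty)
P0 = 0.6    # assumed probability of source bit 0

def ecac_encode(bits):
    """Toy binary arithmetic encoder that leaves a fraction EPS of every
    interval unused (the forbidden region)."""
    low, high = 0.0, 1.0
    for b in bits:
        span = (high - low) * (1.0 - EPS)       # usable part of interval
        split = low + span * P0
        low, high = (low, split) if b == 0 else (split, low + span)
    return (low + high) / 2.0                   # any point in [low, high)

def ecac_decode(value, nbits):
    """Decoder that flags an error when the value falls in the
    forbidden region of the current interval."""
    low, high, out = 0.0, 1.0, []
    for _ in range(nbits):
        span = (high - low) * (1.0 - EPS)
        split = low + span * P0
        if value < split:
            out.append(0); high = split
        elif value < low + span:
            out.append(1); low, high = split, low + span
        else:
            raise ValueError("forbidden region hit: channel error detected")
    return out

bits = [0, 1, 1, 0, 0, 1, 0]
assert ecac_decode(ecac_encode(bits), len(bits)) == bits
```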
Montgomery Modular Multiplication Algorithm on Multi-Core Systems
Junfeng Fan, K. Sakiyama, I. Verbauwhede
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387555 | Pages: 261-266
In this paper, we investigate efficient software implementations of the Montgomery modular multiplication algorithm on a multi-core system. A HW/SW co-design technique is used to find an efficient system architecture and instruction scheduling method. We first implement Montgomery modular multiplication on a multi-core system with general-purpose cores, and then speed it up by adopting a Multiply-Accumulate (MAC) operation in each core. As a result, performance improves by factors of 1.53 and 2.15 for 256-bit and 1024-bit Montgomery modular multiplication, respectively.
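For reference, the arithmetic being accelerated is standard Montgomery reduction; a minimal single-threaded Python sketch is shown below (the paper's word-level scheduling across cores and the MAC datapath are not modeled, and the test values are arbitrary):

```python
def montgomery_mul(a_bar, b_bar, n, r_bits, n_prime):
    """Return a_bar * b_bar * R^{-1} mod n for operands already in
    Montgomery form, with R = 2**r_bits and n_prime = -n^{-1} mod R."""
    r_mask = (1 << r_bits) - 1
    t = a_bar * b_bar
    m = (t * n_prime) & r_mask          # m = t * (-n^{-1}) mod R
    u = (t + m * n) >> r_bits           # exact division by R
    return u - n if u >= n else u

# 256-bit example with arbitrary test values (requires Python >= 3.8
# for pow(x, -1, mod)).
r_bits = 256
R = 1 << r_bits
n = (1 << 255) + 95                     # some odd modulus
n_prime = pow(-n, -1, R)
a, b = 0x1234567890ABCDEF % n, 0xFEDCBA0987654321 % n
a_bar, b_bar = (a * R) % n, (b * R) % n # convert to Montgomery form
c_bar = montgomery_mul(a_bar, b_bar, n, r_bits, n_prime)
assert (c_bar * pow(R, -1, n)) % n == (a * b) % n
```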
Multi-Dimensional Parallel Rank Order Filtering
M. V. D. Horst, R. H. Mak
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387622 | Pages: 627-632
We present a method to design multi-dimensional rank order filters. Our designs are more efficient than existing ones from the literature, e.g. reducing the number of operations required by a 2-dimensional 7 × 7 median filter by 66%. This efficiency is maintained regardless of the amount of parallelism, so the throughput of our designs scales linearly with the amount of hardware. To accomplish this we introduce a framework in the form of a generator graph, which allows us to formalize our methods and formulate an algorithm that produces efficient designs by reusing common sub-expressions. Like other rank order filters, our designs are based on sorting networks composed of Batcher's merging networks. However, we introduce an additional optimization that increases the savings obtained by pruning sorting networks. Our design method is independent of the implementation method, and the resulting designs can be implemented both as a VLSI circuit and as a program for an SIMD processor.
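As a behavioral baseline (not the paper's pruned sorting-network architecture), a plain sorted-selection 2-D rank order filter can be written as follows; rank k*k // 2 selects the median of a k × k window:

```python
import numpy as np

def rank_order_filter(img, k, rank):
    """Reference 2-D rank order filter: for every k x k window, output
    the element of the given rank (rank = k*k // 2 gives the median)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            window = padded[y:y + k, x:x + k].ravel()
            out[y, x] = np.sort(window)[rank]
    return out

img = np.random.randint(0, 256, size=(32, 32))
median7 = rank_order_filter(img, k=7, rank=7 * 7 // 2)   # 7x7 median
```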
Adaptive Techniques for a Fast Frequency Domain Motion Estimation
Y. Ismail, M. Elgamel, M. Bayoumi
Pub Date: 2007-11-21 | DOI: 10.1109/SIPS.2007.4387567 | Pages: 331-336
Dynamic Block Size Motion Estimation (DBS-ME) and a smart Dynamic Early Search Termination (DEST) technique are proposed and implemented in this paper. The two techniques are combined and applied to the conventional phase correlation technique. The performance, visual quality, and complexity of the proposed techniques are compared with those of the original phase correlation motion estimation (PC-ME) and Full Search Block Matching (FSBM) techniques. The proposed techniques increase encoding quality while decreasing the computational complexity of the ME process. Results show that approximately 100% of the blocks identified as stationary by the FSBM algorithm are detected correctly, which reduces computation compared with the original FS and PC techniques. The DBS-ME technique also greatly decreases the computation required for ME by reducing the required padding to one or two pixels for both the current and the reference blocks. In addition, the motion field of the proposed algorithm has much lower entropy than that of PC-ME, which further reduces the transmitted bit rate.
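For context, the core phase correlation step that both proposed techniques build on can be sketched as follows (an integer-displacement reference only; the dynamic block sizing and early-termination logic of DBS-ME/DEST are not modeled):

```python
import numpy as np

def phase_correlation_shift(ref_block, cur_block):
    """Estimate the integer displacement (dy, dx) such that cur_block
    is approximately ref_block shifted by (dy, dx)."""
    F_ref = np.fft.fft2(ref_block)
    F_cur = np.fft.fft2(cur_block)
    cross = F_cur * np.conj(F_ref)
    cross /= np.abs(cross) + 1e-12           # keep phase information only
    surface = np.fft.ifft2(cross).real       # phase-correlation surface
    peak = np.unravel_index(np.argmax(surface), surface.shape)
    h, w = surface.shape
    dy = peak[0] if peak[0] <= h // 2 else peak[0] - h
    dx = peak[1] if peak[1] <= w // 2 else peak[1] - w
    return dy, dx

ref = np.random.rand(16, 16)
cur = np.roll(ref, shift=(3, -2), axis=(0, 1))   # synthetic motion
print(phase_correlation_shift(ref, cur))         # expected (3, -2)
```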