Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110023
Georgios Georgakarakos, Sudeep Kanur, J. Lilius, K. Desnos
Dataflow models of computation were acknowledged early on as an attractive methodology for describing parallel algorithms, and they have therefore become highly relevant to programming in the current multicore processor era. While several frameworks provide tools to create dataflow descriptions of algorithms, generating parallel code for programmable processors remains sub-optimal due to scheduling overheads and the semantic gap that arises when parallelism is expressed with conventional thread-based programming languages. In this paper we propose an optimization of the parallel code generation process that combines the dataflow and task programming models. We develop a task-based code generator for PREESM, a dataflow-based prototyping framework, to deploy algorithms described as synchronous dataflow graphs on multicore platforms. An experimental performance comparison of our task-generated code against typical thread-based code shows that our approach removes significant scheduling and synchronization overheads while maintaining, and occasionally improving, application throughput.
Title: Task-based execution of synchronous dataflow graphs for scalable multicore computing
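The task-based execution of an SDF graph described above can be sketched in a few lines: actors fire as tasks whenever their input FIFOs hold enough tokens, instead of each actor owning a thread. The graph, token rates, and scheduler below are illustrative assumptions, not PREESM's generated code.

```python
from collections import deque

class Channel:
    """FIFO channel carrying tokens between SDF actors."""
    def __init__(self):
        self.fifo = deque()

class Actor:
    def __init__(self, name, inputs, outputs, consume, produce, fn):
        self.name, self.inputs, self.outputs = name, inputs, outputs
        self.consume, self.produce = consume, produce  # SDF token rates
        self.fn = fn

    def ready(self):
        # An actor is fireable when every input FIFO holds enough tokens.
        return all(len(ch.fifo) >= self.consume for ch in self.inputs)

    def fire(self):
        args = [[ch.fifo.popleft() for _ in range(self.consume)]
                for ch in self.inputs]
        out = self.fn(args)
        for ch in self.outputs:
            for tok in out[:self.produce]:
                ch.fifo.append(tok)

def run(actors, steps):
    # Self-timed schedule: repeatedly fire any ready actor (task-pool model),
    # so no per-actor thread or explicit synchronization is needed.
    for _ in range(steps):
        for a in actors:
            if a.ready():
                a.fire()

# Toy graph: src -> dbl -> snk, all rates 1.
c1, c2 = Channel(), Channel()
results = []
src = Actor("src", [], [c1], 0, 1, lambda _: [1])
dbl = Actor("dbl", [c1], [c2], 1, 1, lambda xs: [xs[0][0] * 2])
snk = Actor("snk", [c2], [], 1, 0, lambda xs: results.append(xs[0][0]) or [])
run([src, dbl, snk], 4)
print(results)
```

Each `fire` is an independent task whose data dependences are made explicit by the channels, which is what lets a task runtime schedule firings without thread-level locking.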
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109980
Benjamin Barrois, O. Sentieys
This paper presents a comparison between custom fixed-point (FxP) and floating-point (FlP) arithmetic, applied to the bidimensional K-means clustering algorithm. After a discussion of the K-means clustering algorithm and arithmetic characteristics, hardware implementations of FxP and FlP arithmetic operators are compared in terms of area, delay and energy for different bitwidths, using the ApxPerf2.0 framework. Both are then compared in the context of K-means clustering. The direct comparison shows a large difference between 8-to-16-bit FxP and FlP operators, with FlP adders consuming 5–12× more energy than FxP adders, and multipliers 2–10× more. However, when applied to the K-means clustering algorithm, the gap between FxP and FlP tightens. Indeed, the accuracy improvements brought by FlP lead to an accuracy equivalent to FxP with fewer iterations of the algorithm, proportionally reducing the overall energy spent. The 8-bit version of the algorithm becomes more profitable with FlP, which is 80% more accurate with only 1.6× more energy. The paper finally discusses the case for custom FlP in low-energy general-purpose computation, thanks to its ease of use and an energy overhead lower than might have been expected.
Title: Customizing fixed-point and floating-point arithmetic — A case study in K-means clustering
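The trade-off studied above can be illustrated in software: run 2-D K-means with values rounded to a fixed-point grid and compare the resulting centroids against a full-precision run. The Q-format (4 fractional bits), data set, and initialization are arbitrary choices for illustration, not the paper's ApxPerf2.0 setup.

```python
import random

def to_fixed(x, frac_bits=4):
    """Round x to the nearest value representable with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def kmeans(points, init, iters=10, quant=lambda x: x):
    """Plain 2-D K-means; `quant` models the arithmetic used for centroid updates."""
    cents = list(init)
    for _ in range(iters):
        clusters = [[] for _ in cents]
        for p in points:
            j = min(range(len(cents)),
                    key=lambda i: (p[0] - cents[i][0]) ** 2 + (p[1] - cents[i][1]) ** 2)
            clusters[j].append(p)
        cents = [tuple(quant(sum(coord) / len(cl)) for coord in zip(*cl)) if cl else cents[j]
                 for j, cl in enumerate(clusters)]
    return cents

random.seed(0)
pts = ([(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(30)]
       + [(random.gauss(4, 0.5), random.gauss(4, 0.5)) for _ in range(30)])
flp = kmeans(pts, [pts[0], pts[-1]])                 # double-precision reference
fxp = kmeans(pts, [pts[0], pts[-1]], quant=to_fixed) # fixed-point centroid updates
err = max(abs(a - b) for cf, cq in zip(flp, fxp) for a, b in zip(cf, cq))
print(f"max centroid deviation with 4 fractional bits: {err:.4f}")
```

On well-separated clusters the quantized run tracks the reference closely, which mirrors the paper's observation that the application-level gap between FxP and FlP is much smaller than the operator-level gap.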
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110008
Mikko Parviainen, Pasi Pertilä
This article presents a method to obtain personalized Head-Related Transfer Functions (HRTFs) for creating virtual soundscapes based on a small number of measurements. The best-matching set of HRTFs is selected among the entries of publicly available databases. The proposed method is evaluated in a listening test where subjects assess audio samples created using the best-matching set of HRTFs against a randomly chosen set of HRTFs for the same location. The listening test indicates that subjects prefer the proposed method over a random set of HRTFs.
Title: Obtaining an optimal set of head-related transfer functions with a small amount of measurements
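A hypothetical sketch of the selection idea: given a handful of measured magnitude responses for a subject, pick the database entry whose responses are closest in a log-spectral-distance sense. The database contents and the choice of distance metric here are illustrative assumptions, not the paper's method.

```python
import math

def log_spectral_distance(h1, h2):
    """RMS distance between two magnitude responses on a dB scale."""
    return math.sqrt(sum((20 * math.log10(a) - 20 * math.log10(b)) ** 2
                         for a, b in zip(h1, h2)) / len(h1))

def best_match(measured, database):
    """measured: list of magnitude responses at a few directions;
    database: subject name -> list of responses at the same directions."""
    def score(name):
        return sum(log_spectral_distance(m, h)
                   for m, h in zip(measured, database[name]))
    return min(database, key=score)

# Toy data: two measured responses, two candidate database subjects.
measured = [[1.0, 0.8, 0.5], [0.9, 0.7, 0.4]]
database = {
    "subject_A": [[1.0, 0.81, 0.52], [0.88, 0.7, 0.41]],  # close match
    "subject_B": [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],      # poor match
}
print(best_match(measured, database))
```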
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110011
Yu Gong, Tingting Xu, Bo Liu, Wei-qi Ge, Jinjiang Yang, Jun Yang, Longxing Shi
With the rapidly increasing adoption of deep learning, LSTM-RNNs are widely used. Meanwhile, complex data dependences and intensive computation limit the performance of accelerators. In this paper, we first propose a hybrid network expansion model to exploit fine-grained data parallelism. Based on the model, we implement a Reconfigurable Processing Unit (RPU) using Processing In Memory (PIM) units. Our work shows that the gates and cells in an LSTM can be partitioned into fundamental operations and then recombined and mapped onto heterogeneous computing components. The experimental results show that, implemented in a 45 nm CMOS process, the proposed RPU, with a size of 1.51 mm² and power of 413 mW, achieves 309 GOPS/W in power efficiency, 1.7× better than a state-of-the-art reconfigurable architecture.
Title: Processing LSTM in memory using hybrid network expansion model
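The partitioning idea above can be made concrete: one LSTM step decomposes into a small set of fundamental operations (matrix-vector product, elementwise add/mul, sigmoid, tanh), each of which could be mapped onto a different compute unit. The sizes and weights below are toy values, not the RPU configuration.

```python
import math

# Fundamental operations an accelerator would map to separate units.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(a, b): return [x + y for x, y in zip(a, b)]
def vmul(a, b): return [x * y for x, y in zip(a, b)]
def sigmoid(v): return [1.0 / (1.0 + math.exp(-x)) for x in v]
def vtanh(v):   return [math.tanh(x) for x in v]

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM cell update built only from the primitives above."""
    gates = {}
    for name in ("i", "f", "o", "g"):  # input, forget, output gates; candidate
        pre = vadd(vadd(matvec(Wx[name], x), matvec(Wh[name], h)), b[name])
        gates[name] = vtanh(pre) if name == "g" else sigmoid(pre)
    c_new = vadd(vmul(gates["f"], c), vmul(gates["i"], gates["g"]))
    h_new = vmul(gates["o"], vtanh(c_new))
    return h_new, c_new

n = 2
W = {k: [[0.1] * n for _ in range(n)] for k in ("i", "f", "o", "g")}
b = {k: [0.0] * n for k in ("i", "f", "o", "g")}
h, c = lstm_step([1.0, 0.5], [0.0, 0.0], [0.0, 0.0], W, W, b)
print(h, c)
```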
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109996
Junhee Cho, Wonyong Sung
Block turbo codes (BTCs) can provide very powerful forward error correction (FEC) for several applications, such as optical networks and NAND flash memory devices. These applications require soft-decision FEC codes that guarantee a bit error rate (BER) under 10⁻¹², which is, however, very difficult to verify with a CPU-based simulator. In this paper, we present high-throughput graphics processing unit (GPU) based turbo decoding software to aid the development of very-low-error-rate BTCs. For effective utilization of the GPUs, the software processes multiple BTC frames simultaneously and minimizes global memory access latency. In particular, the Chase-Pyndiah algorithm is efficiently parallelized to decode every row and column of a BTC word. The GPU-based simulator achieved throughputs of about 80 and 150 Mb/s when decoding BTCs composed of Hamming and BCH codes, respectively, up to 124 times higher than the CPU-based results.
Title: High-throughput decoding of block turbo codes on graphics processing units
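The core of the Chase decoding step that the paper parallelizes on the GPU can be sketched as follows: generate candidate hard-decision words by flipping every combination of the p least-reliable bit positions, then keep the valid candidate closest to the received soft values. Here the "code" is a toy even-parity check standing in for the BCH/Hamming component codes.

```python
from itertools import product

def chase_candidates(soft, p=2):
    """Yield hard-decision candidates obtained by flipping the p least-reliable bits."""
    hard = [1 if s < 0 else 0 for s in soft]   # BPSK hard decision: negative -> 1
    least = sorted(range(len(soft)), key=lambda i: abs(soft[i]))[:p]
    for flips in product((0, 1), repeat=p):
        cand = hard[:]
        for pos, f in zip(least, flips):
            cand[pos] ^= f
        yield cand

def chase_decode(soft, p=2):
    def metric(c):
        # Squared Euclidean distance between the soft word and the
        # BPSK-modulated candidate (bit b -> symbol 1 - 2b).
        return sum((s - (1 - 2 * b)) ** 2 for s, b in zip(soft, c))
    valid = [c for c in chase_candidates(soft, p) if sum(c) % 2 == 0]
    return min(valid, key=metric)

# Noisy even-parity word: true codeword 0 1 1 0, with bit 1 barely received.
soft = [0.9, 0.1, -0.8, 0.7]
print(chase_decode(soft))
```

Because each candidate's metric is independent, the candidates of every row and column of a BTC word can be evaluated in parallel GPU threads, which is the structure the paper exploits.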
With the aid of a storage-release mechanism named key-keysmith, an implementation approach based on chemical reaction networks (CRNs) for synchronous sequential logic is proposed. This design approach, which stores logic information in the keysmith and releases it through the key, focuses primarily on the underlying state transitions behind the required logic rather than on an electronic circuit representation. It can therefore be employed uniformly and easily to implement any synchronous sequential logic with molecular reactions. Theoretical analysis and numerical simulations demonstrate the robustness and universality of the proposed approach.
Title: CRN-based design methodology for synchronous sequential logic
Authors: Zhiwei Zhong, Lulu Ge, Ziyuan Shen, X. You, Chuan Zhang
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109979
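A minimal sketch of the mass-action substrate such CRN designs build on: Euler-integrating the reaction A + B → C shows that, with inputs encoded as high/low concentrations, C ends high only when both A and B start high, i.e. the reaction computes an AND-like step. The rate constant and encoding are assumptions; the key-keysmith mechanism itself is not reproduced here.

```python
def simulate_and_gate(a0, b0, k=5.0, dt=0.01, steps=2000):
    """Forward-Euler simulation of mass-action kinetics for A + B -> C."""
    a, b, c = a0, b0, 0.0
    for _ in range(steps):
        rate = k * a * b      # mass-action rate of the bimolecular reaction
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
    return c

# Truth-table-like sweep: C goes high only for the (1, 1) input pair.
for a0, b0 in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
    print(a0, b0, round(simulate_and_gate(a0, b0), 2))
```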
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110020
A. Jallouli, Fatma Belghith, M. A. B. Ayed, W. Hamidouche, J. Nezan, N. Masmoudi
Post-HEVC is the emerging video coding standard beyond the High Efficiency Video Coding (HEVC) standard. Its transformation and prediction steps are more complex, but it enables the coding and compression of 3D and 360° video. This paper presents several statistical analyses of Post-HEVC encoded videos, in particular of the 1D and 2D transformation types and of the intra and inter prediction types, for test videos of different classes and resolutions. The analyses are carried out at the decoder, where the coding decisions have already been taken by the encoder. Results show that the choice of transformation (type and size) and of prediction type (intra or inter) depends on the nature of the video: its motion and texture. This work can be considered a milestone toward intelligent algorithms that use video characteristics to perform fast decisions in the Post-HEVC encoding process.
Title: Statistical analysis of Post-HEVC encoded videos
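The kind of decoder-side tally such an analysis relies on is straightforward: walk the coded blocks of a decoded sequence and count which transform type/size and prediction mode the encoder chose. The block records are mocked below; a real study would extract them from the reference decoder.

```python
from collections import Counter

# Hypothetical per-block records as a decoder could emit them.
blocks = [
    {"pred": "intra", "transform": "DST-VII",  "size": 4},
    {"pred": "inter", "transform": "DCT-II",   "size": 8},
    {"pred": "inter", "transform": "DCT-II",   "size": 8},
    {"pred": "intra", "transform": "DCT-VIII", "size": 4},
]

pred_hist = Counter(b["pred"] for b in blocks)
tr_hist = Counter((b["transform"], b["size"]) for b in blocks)

total = len(blocks)
for (tr, sz), count in tr_hist.most_common():
    print(f"{tr:8s} {sz}x{sz}: {100 * count / total:.0f}%")
print(pred_hist)
```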
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110000
Swati Bhardwaj, Shashank Raghuraman, A. Acharyya
This paper proposes a low-complexity algorithmic modification of hardware accelerators for n-dimensional (nD) FastICA based on the COordinate Rotation DIgital Computer (CORDIC) to attain high computation speed. The complex and time-consuming update stage and convergence check required to compute the nth weight vector are eliminated in the proposed methodology. Using the Gram-Schmidt orthogonalization and normalization stages to calculate the nth weight vector in an entirely sequential CORDIC-based FastICA procedure yields a significant gain in computation time. The proposed methodology has been functionally verified and validated by applying it to the separation of 6D speech signals. It has been implemented in Verilog HDL and synthesized using UMC 180 nm technology. The average improvement in computation time obtained with the proposed methodology for 4D to 6D FastICA with 1024 samples, assuming the minimum case of two iterations for the nth stage, was found to be 98.79%.
Title: Low complexity hardware accelerator for nD FastICA based on coordinate rotation
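The CORDIC primitive underlying such an accelerator rotates a 2-D vector using only shifts and adds, iterating over arctan(2⁻ⁱ) micro-rotations. This is the textbook rotation-mode CORDIC, shown here in floating point for clarity, not the paper's FastICA datapath.

```python
import math

def cordic_rotate(x, y, beta, iterations=24):
    """Rotate (x, y) by angle beta (|beta| < ~1.74 rad) via CORDIC micro-rotations."""
    # Precomputed micro-rotation angles and the constant gain-compensation factor K.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = beta
    for i, ang in enumerate(angles):
        d = 1.0 if z >= 0 else -1.0   # steer the residual angle z toward 0
        # In hardware, multiplying by 2**-i is just an arithmetic right shift.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ang
    return x * k, y * k

x, y = cordic_rotate(1.0, 0.0, math.pi / 6)
print(x, y)  # close to (cos 30°, sin 30°) = (0.8660, 0.5000)
```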
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109993
Yuri Nishizumi, Go Matsukawa, K. Kajihara, T. Kodama, S. Izumi, H. Kawaguchi, C. Nakanishi, Toshio Goto, Takeo Kato, M. Yoshimoto
This paper describes an FPGA implementation of an object recognition processor for HDTV-resolution 30 fps video using the Sparse FIND feature. Two-stage feature extraction with HOG and Sparse FIND, highly parallel classification with a support vector machine (SVM), and block-parallel processing for RAM access cycle reduction are proposed to achieve real-time object recognition despite the enormous computational complexity. Implementation of the proposed architecture on the FPGA confirmed that detection using the Sparse FIND feature runs on HDTV images at 47.63 fps, on average, at 90 MHz. The recognition accuracy degradation relative to the original Sparse FIND-based object detection algorithm implemented in software was 0.5%, which shows that the FPGA system provides sufficient accuracy for practical use.
Title: FPGA implementation of object recognition processor for HDTV resolution video using sparse FIND feature
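The first stage of the two-stage pipeline above is a standard HOG step: bin gradient magnitudes by orientation within a cell. The sketch below uses common textbook choices (unsigned orientations, 9 bins), which are assumptions rather than the paper's parameters.

```python
import math

def hog_cell_histogram(cell, bins=9):
    """cell: 2-D list of grayscale values; returns a gradient-orientation histogram."""
    h = [0.0] * bins
    rows, cols = len(cell), len(cell[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # central differences
            gy = cell[r + 1][c] - cell[r - 1][c]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned orientation
            h[int(ang / (180.0 / bins)) % bins] += mag
    return h

# Vertical edge: left half dark, right half bright -> purely horizontal
# gradient, so all the energy lands in the 0-degree bin.
cell = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
hist = hog_cell_histogram(cell)
print([round(v, 1) for v in hist])
```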
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110009
M. Masera, M. Martina, G. Masera
In this paper, we show a class of relationships linking Discrete Cosine Transforms (DCT) and Discrete Sine Transforms (DST) of types V, VI, VII and VIII, which have recently been considered for inclusion in future video coding technology. In particular, the proposed relationships allow the DCT-V and the DCT-VIII to be computed as functions of the DCT-VI and the DST-VII, respectively, plus simple reordering and sign inversion. Moreover, this paper exploits the proposed relationships and the Winograd factorization of the Discrete Fourier Transform to construct low-complexity factorizations for computing the DCT-V and the DCT-VIII of lengths 4 and 8. Finally, the proposed signal flow graphs have been implemented on FPGA technology, showing reduced hardware utilization with respect to a direct implementation of the matrix-vector multiplication algorithm.
Title: Odd type DCT/DST for video coding: Relationships and low-complexity implementations
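One relationship of the kind derived above can be checked numerically: with the usual trigonometric kernel definitions (the normalizations below are common conventions, stated as assumptions), each DCT-VIII basis row equals the input-reversed DST-VII row up to a sign alternation, so a DCT-VIII can be computed by a DST-VII plus reordering and sign inversion.

```python
import numpy as np

N = 8
n = np.arange(N)
k = n[:, None]

# DST-VII and DCT-VIII kernels (rows = frequency index k, cols = sample index n).
S7 = np.sqrt(4.0 / (2 * N + 1)) * np.sin(np.pi * (2 * k + 1) * (n + 1) / (2 * N + 1))
C8 = np.sqrt(4.0 / (2 * N + 1)) * np.cos(np.pi * (2 * k + 1) * (2 * n + 1) / (4 * N + 2))

# Claim: C8[k, n] == (-1)^k * S7[k, N-1-n]
# (reverse the input order, then alternate the sign of every other output).
signs = np.where(k % 2 == 0, 1.0, -1.0)
print(np.allclose(C8, signs * S7[:, ::-1]))
```

This identity follows from sin(π(2k+1)/2 − θ) = (−1)ᵏ cos(θ) applied to the DST-VII argument; it means any fast DST-VII datapath serves double duty for the DCT-VIII at no extra arithmetic cost.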