Prediction of quad-tree partitioning for budgeted energy HEVC encoding
Alexandre Mercat, F. Arrestier, M. Pelcat, W. Hamidouche, D. Ménard
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110025

High Efficiency Video Coding (HEVC), the newest video coding standard, provides up to 50% bitrate savings compared to the state-of-the-art H.264/AVC standard at the same perceptual video quality. In the last few years, the Internet of Things (IoT) has become a reality, and forthcoming applications are likely to boost mobile video demand to an unprecedented level. A large number of systems are likely to integrate the HEVC codec in the long run and will need to be energy-aware. In this context, constraining the energy consumption of the HEVC encoder becomes a challenging task for embedded applications based on a software encoder. The most frequent approach to this issue consists in optimizing the coding-tree structure to balance compression efficiency against energy consumption. With the aim of budgeting the energy consumption of a real-time HEVC encoder, this paper proposes a variance-aware quad-tree prediction that limits the recursive rate-distortion optimization (RDO) process. Experimental results show that the proposed scheme achieves on average a 60% energy reduction for a slight bit-rate increase of 3.4%.
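The paper's variance-aware quad-tree prediction is only summarized in the abstract; as a rough illustration of the idea, the sketch below prunes the recursive CU split search when a block's pixel variance falls below a threshold. The function names and the threshold value are ours for illustration, not taken from the paper.

```python
import numpy as np

def predict_split(block, var_threshold=50.0):
    """Heuristically predict whether a CU benefits from being split.

    A nearly homogeneous block (low pixel variance) is unlikely to gain
    from finer partitioning, so the recursive RDO search can be pruned.
    var_threshold is a hypothetical tuning parameter.
    """
    return float(np.var(block)) > var_threshold

def partition_ctu(frame, x, y, size, min_size=8):
    """Return the (x, y, size) leaf CUs of one CTU's predicted quad-tree."""
    if size <= min_size or not predict_split(frame[y:y + size, x:x + size]):
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += partition_ctu(frame, x + dx, y + dy, half, min_size)
    return leaves
```

A flat region collapses to a single large CU, while a textured region is recursively split down to the minimum CU size, so the costly RDO is only run on the predicted leaves.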
A discriminative spectral-temporal feature set for motor imagery classification
W. Abbas, N. Khan
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109970

This paper presents a novel technique for motor imagery event classification. Extraction of discriminative features is key to accurate classification. To realize this objective, we explore the use of nonnegative matrix factorization (NNMF) for a sparse representation of the input signal and for determining discriminative basis vectors. We extract both spectral and temporal features from this representation to construct our feature set. Band power has been shown to be a powerful discriminative spectral-domain feature for motor imagery classes. The Time Domain Parameter (TDP), taken as a temporal feature, measures the power of the EEG signal using its first few derivatives. Our approach is novel in proposing a fusion of both of these features. We use Hierarchical Alternating Least Squares (HALS) to minimize the NNMF error function, as it converges more rapidly than other methods. The proposed feature set is tested with LDA and SVM classifiers on 4-class motor imagery signals. Compared with other approaches in the literature on Dataset 2a of BCI Competition IV, ours achieves the highest reported mean kappa value of 0.62 with the SVM classifier.
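As background for the HALS solver mentioned above, the sketch below shows the standard hierarchical alternating least squares updates for NMF, which refine one factor column (or row) at a time in closed form. This is textbook HALS under our own naming, not the authors' exact implementation.

```python
import numpy as np

def hals_nmf(V, rank, n_iter=300, eps=1e-9, seed=0):
    """Factor V ~ W @ H with nonnegative W, H using HALS updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        # Update each column of W with the others held fixed.
        HHt = H @ H.T
        VHt = V @ H.T
        for k in range(rank):
            grad = (VHt[:, k] - W @ HHt[:, k]) / (HHt[k, k] + eps)
            W[:, k] = np.maximum(eps, W[:, k] + grad)
        # Update each row of H symmetrically.
        WtW = W.T @ W
        WtV = W.T @ V
        for k in range(rank):
            grad = (WtV[k, :] - WtW[k, :] @ H) / (WtW[k, k] + eps)
            H[k, :] = np.maximum(eps, H[k, :] + grad)
    return W, H
```

Because each rank-one subproblem has a closed-form nonnegative solution, HALS typically converges in far fewer sweeps than multiplicative-update NMF, which is the property the paper exploits.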
Improved polar decoder based on deep learning
Weihong Xu, Zhizhen Wu, Yeong-Luh Ueng, X. You, Chuan Zhang
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109997

Deep learning has recently shown strong potential for improving polar code decoding. However, owing to prohibitive training and computation complexity, conventional deep neural networks (DNNs) are feasible only for very short code lengths. In this paper, the main problems of deep-learning-based decoding are addressed. We first present a multiple scaled belief propagation (BP) algorithm, aimed at faster convergence and better performance. Based on this, a deep neural network decoder (NND) with low complexity and latency is proposed for any code length. Training requires only a small set of zero codewords, and the computation complexity is close to that of the original BP. Experimental results show that the proposed (64, 32) NND with 5 iterations achieves an even lower bit error rate (BER) than 30-iteration conventional BP, and the (512, 256) NND also outperforms the conventional BP decoder with the same number of iterations. The hardware architecture of the basic computation block is given, and a folding technique is also considered, saving about 50% of the hardware cost.
Low-power heterogeneous computing via adaptive execution of dataflow actors
J. Boutellier, S. Bhattacharyya
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110002

Dataflow models of computation have been shown to provide an excellent basis for describing signal processing applications and mapping them to heterogeneous computing platforms consisting of multicore CPUs and graphics processing units (GPUs). Recently, several efficient dataflow-based programming frameworks have been introduced for such needs. Most contemporary signal processing applications can be described using static dataflow models of computation (e.g., synchronous dataflow), which have desirable features such as compile-time analyzability. Unfortunately, static dataflow models turn out to be restrictive when applications need to adapt their behavior to varying run-time conditions, for example for power saving through adaptive processing. This paper analyzes three dataflow approaches for implementing adaptive application behavior in terms of expressiveness and efficiency. The focus is on heterogeneous computing platforms and, in particular, on adapting application processing to achieve power savings. Experiments are conducted with deep neural network and dynamic predistortion applications on two platforms: a mobile multicore SoC and a GPU-equipped workstation.
An efficient conjugate residual detector for massive MIMO systems
Yufeng Yang, Ye Xue, X. You, Chuan Zhang
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109975

In today's wireless communication systems, the massive multiple-input multiple-output (MIMO) technique brings better energy efficiency and coverage than small-scale MIMO, but at higher computational complexity. For linear detection such as minimum mean square error (MMSE), the prohibitive complexity lies in solving large-scale linear equations. For a better tradeoff between BER performance and computational complexity, iterative linear methods such as conjugate gradient (CG) have been applied to massive MIMO detection. By leaving out one matrix-vector product of CG, the conjugate residual (CR) method achieves even lower computational complexity with similar BER performance. Since BER performance can be improved by preconditioning with incomplete Cholesky (IC) factorization, a pre-conditioned conjugate residual (PCR) method is proposed. Simulation results indicate that PCR achieves better performance than both CR and CG, with a 1 dB improvement over CG at BER = 5 χ. Analysis shows that CR achieves a 20% computational complexity reduction compared with CG for a 128 × 60 antenna configuration. With the same configuration, PCR reduces complexity by 66% while achieving similar BER performance to a detector based on Cholesky decomposition. Finally, the corresponding VLSI architecture is presented in detail.
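To make the CR iteration concrete, here is a minimal NumPy sketch of conjugate-residual MMSE detection. The loop is the standard CR recurrence applied to the regularized Gram system; it illustrates the method generically and is not taken from the paper's VLSI design.

```python
import numpy as np

def cr_detect(H, y, sigma2, n_iter=5):
    """MMSE detection via conjugate residual (CR) iterations.

    Iteratively solves (H^H H + sigma2 * I) x = H^H y, avoiding the
    explicit Gram-matrix inversion that dominates exact MMSE cost.
    """
    A = H.conj().T @ H + sigma2 * np.eye(H.shape[1])
    b = H.conj().T @ y
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    Ap = A @ p
    Ar = A @ r
    rAr = np.vdot(r, Ar)
    for _ in range(n_iter):
        alpha = rAr / np.vdot(Ap, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        Ar = A @ r
        rAr_new = np.vdot(r, Ar)
        beta = rAr_new / rAr
        p = r + beta * p
        Ap = Ar + beta * Ap  # reuses A @ r, saving one matrix-vector product
        rAr = rAr_new
    return x
```

Note that `Ap` is updated from the already-computed `Ar` instead of forming `A @ p` from scratch, which is the saved matrix-vector product the abstract refers to.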
A stochastic number representation for fully homomorphic cryptography
P. Martins, L. Sousa
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109973

Privacy of data has become an increasing concern over the past years. With Fully Homomorphic Encryption (FHE), one can offload the processing of data to a third party while keeping it private. A technique called batching has been proposed to accelerate FHE, allowing several bits to be encrypted in the same ciphertext and processed in parallel. Herein, we argue that for a certain class of applications, a stochastic representation of numbers takes optimal advantage of this technique. Operations on stochastic numbers have direct homomorphic counterparts, leading to low-degree arithmetic circuits for the evaluation of additions and multiplications. Moreover, an efficient technique for the homomorphic evaluation of nonlinear functions is proposed. The applicability of the proposed methods is assessed with efficient and accurate proof-of-concept implementations of homomorphic image processing, as well as the homomorphic evaluation of radial basis functions for Support Vector Machines (SVMs).
Kvazaar 4K HEVC intra encoder on FPGA accelerated airframe server
Panu Sjovall, Vili Viitamäki, Arto Oinonen, Jarno Vanne, T. Hämäläinen, A. Kulmala
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109999

This paper presents a real-time Kvazaar HEVC intra encoder for 4K Ultra HD video streaming. The encoder is implemented on a Nokia AirFrame Cloud Server featuring dual 2.4 GHz 14-core Intel Xeon processors and an Arria 10 PCI Express FPGA accelerator card. In our HW/SW partitioning scheme, the data-intensive Kvazaar coding tools, including intra prediction, DCT, inverse DCT, quantization, and inverse quantization, are offloaded to the Arria 10, whereas CABAC coding and other control-intensive coding tools are executed on the Xeon processors. The Arria 10 has enough capacity for up to two instances of our intra coding accelerator. The results show that the proposed system encodes 4K video at 30 fps with a single intra coding accelerator and at 40 fps with two accelerators, for speed-up factors of 1.6 and 2.1, respectively, over the pure Xeon implementation. To the best of our knowledge, this is the first work dealing with an HEVC intra encoder partitioned between a CPU and an FPGA. It achieves the same coding speed as HEVC intra encoders on ASICs and is at least 4 times faster than existing HEVC intra encoders on FPGAs.
A modified gradient descent bit flipping decoding scheme for LDPC codes
Mao-Ruei Li, Li-Min Jhuang, Yeong-Luh Ueng
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109969

The gradient descent bit flipping (GDBF) algorithm is known to be an effective hard-decision decoding algorithm for low-density parity-check (LDPC) codes. However, trapping in local maxima limits its error-rate performance. This paper presents a modified GDBF scheme that mitigates the trapping problem and hence improves the error-rate performance. Compared to the conventional GDBF algorithm, the proposed method improves the decoding performance by 0.3 dB for an (18582, 16626) code. The corresponding LDPC decoder integrates 636k logic gates and achieves a throughput of 12.4 Gb/s at a clock frequency of 200 MHz in a 90 nm process.
Monaural speaker separation using source-contrastive estimation
Cory Stephenson, P. Callier, Abhinav Ganesh, Karl S. Ni
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8110005

We propose an algorithm for separating simultaneously speaking persons, the "cocktail party problem", using a single microphone. Our approach uses a deep recurrent neural network to regress to a vector space that is descriptive of independent speakers. Such a vector space can embed empirically determined speaker characteristics and is optimized by discriminating between speaker masks. We call this technique source-contrastive estimation. The methodology is inspired by negative sampling, which has seen success in natural language processing, where an embedding is learned by correlating and decorrelating a given input vector with output weights. Although the matrix determined by the output weights depends on a set of known speakers, only the input vectors are used during inference, which ensures that source separation is explicitly speaker-independent. Our approach is similar to recent work on deep neural network clustering and permutation-invariant training: we use weighted spectral features and masks to augment individual speaker frequencies while filtering out other speakers. However, our technique avoids the severe computational burden of those approaches. Furthermore, by training a vector space rather than combinations of different speakers or differences thereof, we avoid the so-called permutation problem during training. Our algorithm offers an intuitive, computationally efficient response to the cocktail party problem and, most importantly, boasts better empirical performance than other current techniques.
High-throughput decoding of block turbo codes on graphics processing units
Junhee Cho, Wonyong Sung
Pub Date: 2017-10-01 | DOI: 10.1109/SiPS.2017.8109996

Block turbo codes (BTCs) provide very powerful forward error correction (FEC) for applications such as optical networks and NAND flash memory devices. These applications require soft-decision FEC codes that guarantee a bit error rate (BER) below 10⁻¹², which is very difficult to verify with a CPU-based simulator. In this paper, we present high-throughput graphics processing unit (GPU) based turbo decoding software to aid the development of very-low-error-rate BTCs. For effective utilization of the GPUs, the software processes multiple BTC frames simultaneously and minimizes global memory access latency. In particular, the Chase-Pyndiah algorithm is efficiently parallelized to decode every row and column of a BTC word. The GPU-based simulator achieves throughputs of about 80 and 150 Mb/s for decoding BTCs composed of Hamming and BCH codes, respectively, up to 124 times higher than the CPU-based results.