
Latest publications from the 2017 IEEE International Workshop on Signal Processing Systems (SiPS)

Prediction of quad-tree partitioning for budgeted energy HEVC encoding
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110025
Alexandre Mercat, F. Arrestier, M. Pelcat, W. Hamidouche, D. Ménard
High Efficiency Video Coding (HEVC), the newest video encoding standard, provides up to 50% bitrate savings compared to the state-of-the-art H.264/AVC standard for the same perceptual video quality. In the last few years, the Internet of Things (IoT) has become a reality, and forthcoming applications are likely to boost mobile video demand to an unprecedented level. A large number of systems are likely to integrate an HEVC codec in the long run and will need to be energy aware. In this context, constraining the energy consumption of the HEVC encoder becomes a challenging task for embedded applications based on a software encoder. The most frequent approach to this issue consists in optimising the coding tree structure to balance compression efficiency and energy consumption. For the purpose of budgeting the energy consumption of a real-time HEVC encoder, we propose in this paper a variance-aware quad-tree prediction that limits the recursive RDO process. The experimental results show that the proposed energy reduction scheme achieves on average a 60% energy reduction for a slight bit-rate increase of 3.4%.
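As an illustration of the idea described above (and not the authors' implementation), the sketch below shows how a per-depth variance test could prune the recursive RDO search over the quad-tree; the threshold values and the `rdo_cost` callback are assumed placeholders.

```python
import numpy as np

def keep_splitting(block, depth, thresholds):
    """Assumed rule: a low-variance block is coded at the current CU size,
    so the recursive RDO search below it is skipped."""
    return np.var(block.astype(np.float64)) > thresholds[depth]

def quad_tree_rdo(block, depth, max_depth, thresholds, rdo_cost):
    """Pruned quad-tree RDO: returns the best rate-distortion cost for `block`."""
    cost_here = rdo_cost(block, depth)          # cost of coding the block as one CU
    if depth == max_depth or not keep_splitting(block, depth, thresholds):
        return cost_here
    h, w = block.shape
    quarters = [block[:h//2, :w//2], block[:h//2, w//2:],
                block[h//2:, :w//2], block[h//2:, w//2:]]
    cost_split = sum(quad_tree_rdo(q, depth + 1, max_depth, thresholds, rdo_cost)
                     for q in quarters)
    return min(cost_here, cost_split)
```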
Citations: 9
A discriminative spectral-temporal feature set for motor imagery classification
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109970
W. Abbas, N. Khan
This paper presents a novel technique for motor imagery event classification. Extraction of discriminative features is key to accurate classification. To realize this objective, we explore the use of non-negative matrix factorization (NNMF) to obtain a sparse representation of the input signal and to determine discriminative basis vectors. We extract both spectral and temporal features from this representation to construct our feature set. Band power has been shown to be a powerful discriminative feature in the spectral domain for motor imagery classes. The Time Domain Parameter (TDP), taken as a temporal feature, measures the power of the EEG signal using its first few derivatives. Our approach is novel in proposing a fusion of both these features. We use Hierarchical Alternating Least Squares (HALS) to minimize the NNMF error function, as it converges more rapidly than other methods. The proposed feature set is tested with LDA and SVM classifiers for classification of 4-class motor imagery signals. We compare our approach with others presented in the literature on Dataset 2a of BCI Competition IV and show that it achieves the highest reported mean kappa value of 0.62 with the SVM classifier.
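The spectral and temporal parts of the feature set can be pictured with a short sketch; the frequency bands, derivative order, and function names below are assumptions for illustration, and the NNMF basis-selection step is omitted.

```python
import numpy as np
from scipy.signal import welch

def band_power(x, fs, band):
    """Average PSD of one EEG channel inside `band` (Hz), e.g. the 8-12 Hz mu band."""
    f, pxx = welch(x, fs=fs, nperseg=fs)   # 1-second windows
    lo, hi = band
    return pxx[(f >= lo) & (f <= hi)].mean()

def time_domain_parameters(x, order=3):
    """TDP features: log power of the signal and of its first few derivatives."""
    feats = []
    for _ in range(order):
        feats.append(np.log(np.var(x) + 1e-12))
        x = np.diff(x)
    return feats

def trial_features(trial, fs, bands=((8, 12), (13, 30))):
    """Concatenate spectral (band power) and temporal (TDP) features per channel."""
    feats = []
    for ch in trial:                        # trial shape: (n_channels, n_samples)
        feats += [band_power(ch, fs, b) for b in bands]
        feats += time_domain_parameters(ch)
    return np.array(feats)
```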
Citations: 9
Improved polar decoder based on deep learning
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109997
Weihong Xu, Zhizhen Wu, Yeong-Luh Ueng, X. You, Chuan Zhang
Deep learning has recently shown strong potential for improving polar code decoding. However, due to prohibitive training and computation complexity, a conventional deep neural network (DNN) decoder is only feasible for very short code lengths. In this paper, the main problems of applying deep learning to decoding are addressed. We first present the multiple scaled belief propagation (BP) algorithm, aiming at faster convergence and better performance. Based on this, a deep neural network decoder (NND) with low complexity and latency is proposed for any code length. The training only requires a small set of zero codewords. Moreover, its computational complexity is close to that of the original BP. Experimental results show that the proposed (64, 32) NND with 5 iterations achieves an even lower bit error rate (BER) than 30-iteration conventional BP, and the (512, 256) NND also outperforms the conventional BP decoder with the same number of iterations. The hardware architecture of the basic computation block is given, and a folding technique is also considered, saving about 50% of the hardware cost.
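The zero-codeword training set mentioned above is easy to picture: since the all-zero word belongs to every linear code, noisy observations of it suffice as training input. The sketch below generates such channel LLRs under BPSK/AWGN assumptions; the batch size and SNR range are illustrative choices, not the paper's settings.

```python
import numpy as np

def zero_codeword_batch(n, batch, snr_db_range=(0.0, 4.0)):
    """Training LLRs derived from the all-zero codeword of length n."""
    snr_db = np.random.uniform(*snr_db_range, size=(batch, 1))
    sigma = np.sqrt(1.0 / (2.0 * 10.0 ** (snr_db / 10.0)))
    tx = np.ones((batch, n))                     # BPSK mapping of the all-zero word
    rx = tx + sigma * np.random.randn(batch, n)  # AWGN channel
    llr = 2.0 * rx / sigma ** 2                  # channel LLRs fed to the NND
    labels = np.zeros((batch, n))                # target bits are all zero
    return llr, labels
```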
Citations: 111
Low-power heterogeneous computing via adaptive execution of dataflow actors
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110002
J. Boutellier, S. Bhattacharyya
Dataflow models of computation have been shown to provide an excellent basis for describing signal processing applications and mapping them to heterogeneous computing platforms that consist of multicore CPUs and graphics processing units (GPUs). Recently, several efficient dataflow-based programming frameworks have been introduced for such needs. Most contemporary signal processing applications can be described using static dataflow models of computation (e.g. synchronous dataflow) that have desirable features such as compile-time analyzability. Unfortunately, static dataflow models of computation turn out to be restrictive when applications need to adapt their behavior to varying conditions at run time, such as saving power through adaptive processing. This paper analyzes three dataflow approaches for implementing adaptive application behavior in terms of expressiveness and efficiency. The focus of the paper is on heterogeneous computing platforms, and particularly on adapting application processing to achieve power savings. Experiments are conducted with deep neural network and dynamic predistortion applications on two platforms: a mobile multicore SoC and a GPU-equipped workstation.
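As a toy illustration of adaptive actor execution (not one of the frameworks analyzed in the paper), the sketch below shows a dataflow actor whose firing rule is fixed but whose kernel is chosen at run time from a power flag; all names are hypothetical.

```python
from collections import deque

class AdaptiveActor:
    """Consumes `rate_in` tokens per firing and picks a kernel at run time."""
    def __init__(self, rate_in, full_kernel, reduced_kernel):
        self.rate_in = rate_in
        self.full_kernel = full_kernel
        self.reduced_kernel = reduced_kernel
        self.inputs, self.outputs = deque(), deque()

    def can_fire(self):
        return len(self.inputs) >= self.rate_in

    def fire(self, low_power=False):
        tokens = [self.inputs.popleft() for _ in range(self.rate_in)]
        kernel = self.reduced_kernel if low_power else self.full_kernel
        self.outputs.append(kernel(tokens))

# Full mode averages four samples; low-power mode simply decimates.
actor = AdaptiveActor(4, lambda t: sum(t) / len(t), lambda t: t[0])
actor.inputs.extend([1.0, 2.0, 3.0, 4.0])
if actor.can_fire():
    actor.fire(low_power=True)
```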
Citations: 1
An efficient conjugate residual detector for massive MIMO systems
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109975
Yufeng Yang, Ye Xue, X. You, Chuan Zhang
In today's wireless communication systems, the massive multiple-input multiple-output (MIMO) technique brings better energy efficiency and coverage, but higher computational complexity, than small-scale MIMO. For linear detection such as minimum mean square error (MMSE), the prohibitive complexity lies in solving large-scale linear systems. For a better tradeoff between BER performance and computational complexity, iterative linear methods like conjugate gradient (CG) have been applied to massive MIMO detection. By leaving out a matrix-vector product of CG, the conjugate residual (CR) method further achieves lower computational complexity with similar BER performance compared to CG. Since BER performance can be improved by pre-conditioning with an incomplete Cholesky (IC) factorization, a pre-conditioned conjugate residual (PCR) method is proposed. Simulation results indicate that the PCR method achieves better performance than both the CR and CG methods, with a 1 dB improvement over CG at a BER of 5 ×. Analysis shows that CR achieves a 20% computational complexity reduction compared with CG when the antenna configuration is 128 × 60. With the same configuration, PCR reduces complexity by 66% while achieving BER performance similar to a detector based on Cholesky decomposition. Finally, the corresponding VLSI architecture is presented in detail.
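For reference, the textbook conjugate residual iteration applied to the MMSE system (H^H H + sigma^2 I) x = H^H y looks as follows; this is an unpreconditioned sketch, not the PCR variant or the paper's optimized schedule, and the iteration count is an assumption.

```python
import numpy as np

def cr_mmse_detect(H, y, noise_var, iters=3):
    """Approximate x = (H^H H + sigma^2 I)^{-1} H^H y with a few CR iterations."""
    A = H.conj().T @ H + noise_var * np.eye(H.shape[1])
    b = H.conj().T @ y
    x = np.zeros_like(b)
    r = b - A @ x                       # initial residual
    p = r.copy()
    Ar = A @ r
    Ap = Ar.copy()
    rAr = np.vdot(r, Ar)
    for _ in range(iters):
        alpha = rAr / np.vdot(Ap, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        Ar = A @ r
        rAr_new = np.vdot(r, Ar)
        beta = rAr_new / rAr
        p = r + beta * p
        Ap = Ar + beta * Ap             # keeps Ap consistent without an extra matvec
        rAr = rAr_new
    return x
```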
Citations: 6
A stochastic number representation for fully homomorphic cryptography
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109973
P. Martins, L. Sousa
Privacy of data has become an increasing concern over the past few years. With Fully Homomorphic Encryption (FHE), one can offload the processing of data to a third party while keeping it private. A technique called batching has been proposed to accelerate FHE, allowing several bits to be encrypted in the same ciphertext and processed in parallel. Herein, we argue that for a certain class of applications, a stochastic representation of numbers takes optimal advantage of this technique. Operations on stochastic numbers have direct homomorphic counterparts, leading to low-degree arithmetic circuits for the evaluation of additions and multiplications. Moreover, an efficient technique for the homomorphic evaluation of nonlinear functions is proposed in this paper. The applicability of the proposed methods is assessed with efficient and accurate proof-of-concept implementations of homomorphic image processing, as well as the homomorphic evaluation of radial basis functions for Support Vector Machines (SVMs).
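The stochastic representation itself (before any encryption) can be summarized in a few lines: a number in [0, 1] becomes a random bitstream, multiplication becomes a bitwise AND, and a scaled addition becomes a multiplexer. In the paper these bit operations would be evaluated homomorphically on encrypted bits; the plaintext sketch below only illustrates the arithmetic, with stream length and values chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_stochastic(value, length=4096):
    """Encode a number in [0, 1] as a bitstream whose mean equals `value`."""
    return (rng.random(length) < value).astype(np.uint8)

def from_stochastic(bits):
    return bits.mean()

a, b = to_stochastic(0.6), to_stochastic(0.25)
prod = a & b                              # independent streams: AND ~ 0.6 * 0.25
sel = to_stochastic(0.5)
scaled_sum = np.where(sel == 1, a, b)     # multiplexer ~ (0.6 + 0.25) / 2
print(from_stochastic(prod), from_stochastic(scaled_sum))
```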
Citations: 3
Kvazaar 4K HEVC intra encoder on FPGA accelerated airframe server
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109999
Panu Sjovall, Vili Viitamäki, Arto Oinonen, Jarno Vanne, T. Hämäläinen, A. Kulmala
This paper presents a real-time Kvazaar HEVC intra encoder for 4K Ultra HD video streaming. The encoder is implemented on a Nokia AirFrame Cloud Server featuring a 2.4 GHz dual 14-core Intel Xeon processor and an Arria 10 PCI Express FPGA accelerator card. In our HW/SW partitioning scheme, the data-intensive Kvazaar coding tools, including intra prediction, DCT, inverse DCT, quantization, and inverse quantization, are offloaded to the Arria 10, whereas CABAC coding and other control-intensive coding tools are executed on the Xeon processors. The Arria 10 has enough capacity for up to two instances of our intra coding accelerator. The results show that the proposed system is able to encode 4K video at 30 fps with a single intra coding accelerator and at 40 fps with two accelerators. The respective speed-up factors are 1.6 and 2.1 over the pure Xeon implementation. To the best of our knowledge, this is the first work dealing with an HEVC intra encoder partitioned between a CPU and an FPGA. It achieves the same coding speed as HEVC intra encoders on ASICs and is at least 4 times faster than existing HEVC intra encoders on FPGAs.
Citations: 6
A modified gradient descent bit flipping decoding scheme for LDPC codes
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109969
Mao-Ruei Li, Li-Min Jhuang, Yeong-Luh Ueng
The gradient descent bit flipping (GDBF) algorithm is known to be an effective hard-decision decoding algorithm for low-density parity-check (LDPC) codes. However, becoming trapped in a local maximum limits its error-rate performance. This paper presents a modified GDBF scheme that mitigates the trapping problem and hence improves the error-rate performance. Compared to the conventional GDBF algorithm, the proposed method improves the decoding performance by 0.3 dB for an (18582, 16626) code. The (18582, 16626) LDPC decoder integrates 636k logic gates and achieves a throughput of 12.4 Gbps at a clock frequency of 200 MHz in a 90 nm process.
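For context, the plain single-flip GDBF rule that the paper modifies can be sketched as follows; the escape mechanism for local maxima proposed in the paper is deliberately not reproduced here, and the iteration limit is an arbitrary choice.

```python
import numpy as np

def gdbf_decode(H, y, max_iters=100):
    """Single-flip GDBF: H is the binary parity-check matrix, y the bipolar channel output."""
    x = np.sign(np.asarray(y, dtype=float))
    x[x == 0] = 1.0                               # hard decisions in {-1, +1}
    for _ in range(max_iters):
        # Bipolar syndrome of each check: product of the bits it involves.
        s = np.array([np.prod(x[row != 0]) for row in H])
        if np.all(s > 0):                          # all parity checks satisfied
            break
        # Inversion function: channel correlation plus the adjacent syndromes.
        delta = x * y + (H.T != 0) @ s
        x[np.argmin(delta)] *= -1                  # flip the least reliable bit
    return (x < 0).astype(int)                     # map +1 -> 0, -1 -> 1
```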
Citations: 2
Monaural speaker separation using source-contrastive estimation
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110005
Cory Stephenson, P. Callier, Abhinav Ganesh, Karl S. Ni
We propose an algorithm to separate simultaneously speaking persons from each other (the "cocktail party problem") using a single microphone. Our approach involves a deep recurrent neural network regression to a vector space that is descriptive of independent speakers. Such a vector space can embed empirically determined speaker characteristics and is optimized by distinguishing between speaker masks. We call this technique source-contrastive estimation. The methodology is inspired by negative sampling, which has seen success in natural language processing, where an embedding is learned by correlating and decorrelating a given input vector with output weights. Although the matrix determined by the output weights is dependent on a set of known speakers, we only use the input vectors during inference. Doing so ensures that source separation is explicitly speaker-independent. Our approach is similar to recent deep neural network clustering and permutation-invariant training research; we use weighted spectral features and masks to augment individual speaker frequencies while filtering out other speakers. Our technique, however, avoids the severe computational burden of other approaches. Furthermore, by training a vector space rather than combinations of different speakers or differences thereof, we avoid the so-called permutation problem during training. Our algorithm offers an intuitive, computationally efficient response to the cocktail party problem, and most importantly boasts better empirical performance than other current techniques.
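The contrastive objective can be pictured as correlating each time-frequency embedding with its own speaker's weight vector and decorrelating it from a sampled negative speaker. The sketch below is a simplified negative-sampling-style loss in that spirit; the paper's exact weighting and sampling are not reproduced.

```python
import numpy as np

def sce_loss(v, w, speaker_ids):
    """v: (n_bins, d) embeddings, w: (n_speakers, d) output vectors,
    speaker_ids: dominant speaker index for each time-frequency bin."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    pos = np.sum(v * w[speaker_ids], axis=1)                 # correlate with own speaker
    offsets = 1 + np.random.randint(len(w) - 1, size=len(speaker_ids))
    neg = np.sum(v * w[(speaker_ids + offsets) % len(w)], axis=1)  # decorrelate from another
    return -np.mean(np.log(sigmoid(pos) + 1e-12) + np.log(sigmoid(-neg) + 1e-12))
```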
Citations: 3
High-throughput decoding of block turbo codes on graphics processing units
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109996
Junhee Cho, Wonyong Sung
Block turbo codes (BTCs) can provide very powerful forward error correction (FEC) for several applications, such as optical networks and NAND flash memory devices. These applications require soft-decision FEC codes to guarantee a bit error rate (BER) of under 10⁻¹², which is, however, very difficult to verify with a CPU simulator. In this paper, we present high-throughput graphics processing unit (GPU) based turbo decoding software to aid the development of very low error rate BTCs. For effective utilization of the GPUs, the software processes multiple BTC frames simultaneously and minimizes the global memory access latency. In particular, the Chase-Pyndiah algorithm is efficiently parallelized to decode every row and column of a BTC word. The GPU-based simulator achieves throughputs of about 80 and 150 Mb/s for decoding BTCs composed of Hamming and BCH codes, respectively. The throughputs are up to 124 times higher than those of the CPU-based implementation.
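The per-row (or per-column) work that maps so well to the GPU is the Chase test-pattern enumeration: the candidates are independent, so many of them can be decoded in parallel. The sketch below generates the 2^p Chase-II candidates for one component word; p = 4 and the function names are illustrative assumptions.

```python
import numpy as np
from itertools import product

def chase_test_patterns(llr, p=4):
    """Candidate hard words for one row/column of a block turbo code."""
    hard = (llr < 0).astype(np.uint8)          # hard decision per bit
    weak = np.argsort(np.abs(llr))[:p]         # p least reliable positions
    patterns = []
    for flips in product((0, 1), repeat=p):
        cand = hard.copy()
        cand[weak] ^= np.array(flips, dtype=np.uint8)
        patterns.append(cand)                  # each would go to the algebraic decoder
    return patterns
```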
Citations: 2