
Latest Publications: 2017 IEEE International Workshop on Signal Processing Systems (SiPS)

Task-based execution of synchronous dataflow graphs for scalable multicore computing
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110023
Georgios Georgakarakos, Sudeep Kanur, J. Lilius, K. Desnos
Dataflow models of computation were acknowledged early on as an attractive methodology for describing parallel algorithms, and they have become highly relevant for programming in the current multicore processor era. While several frameworks provide tools to create dataflow descriptions of algorithms, generating parallel code for programmable processors remains sub-optimal due to scheduling overheads and the semantic gap that arises when parallelism is expressed with conventional thread-based programming languages. In this paper we propose an optimization of the parallel code generation process that combines dataflow and task programming models. We develop a task-based code generator for PREESM, a dataflow-based prototyping framework, in order to deploy algorithms described as synchronous dataflow graphs on multicore platforms. An experimental performance comparison of our task-generated code against typical thread-based code shows that our approach removes significant scheduling and synchronization overheads while maintaining similar (and occasionally improved) application throughput.
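The task-based idea can be illustrated with a toy sketch (not PREESM's generated code): each actor firing of a small synchronous dataflow graph becomes one task, and the data dependencies on the graph edges are carried by futures instead of per-actor threads with explicit synchronization. The actor names and the graph are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy SDF graph: src -> {actor_a, actor_b} -> sink. Each actor firing is one
# task; edges (data dependencies) are expressed via futures, not buffers + locks.
def src():
    return 3

def actor_a(v):
    return v + 1

def actor_b(v):
    return v * 2

def sink(x, y):
    return x + y

with ThreadPoolExecutor() as pool:
    f_src = pool.submit(src)
    f_a = pool.submit(lambda: actor_a(f_src.result()))  # fires once src's token is ready
    f_b = pool.submit(lambda: actor_b(f_src.result()))
    result = sink(f_a.result(), f_b.result())           # joins the two branches

print(result)  # (3 + 1) + (3 * 2) = 10
```

The scheduler of the task runtime decides where each firing executes, which is the overhead-reduction the paper targets relative to statically pinned threads.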
Citations: 4
Customizing fixed-point and floating-point arithmetic — A case study in K-means clustering
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109980
Benjamin Barrois, O. Sentieys
This paper presents a comparison between custom fixed-point (FxP) and floating-point (FlP) arithmetic, applied to the bidimensional K-means clustering algorithm. After a discussion of the K-means clustering algorithm and arithmetic characteristics, hardware implementations of FxP and FlP arithmetic operators are compared in terms of area, delay and energy, for different bitwidths, using the ApxPerf2.0 framework. Finally, both are compared in the context of K-means clustering. The direct comparison shows a large difference between 8-to-16-bit FxP and FlP operators: FlP adders consume 5–12× more energy than FxP adders, and multipliers 2–10× more. However, when applied to the K-means clustering algorithm, the gap between FxP and FlP tightens. Indeed, the accuracy improvements brought by FlP make the computation more accurate and lead to an accuracy equivalent to FxP with fewer iterations of the algorithm, proportionally reducing the overall energy spent. The 8-bit version of the algorithm becomes more profitable using FlP, which is 80% more accurate with only 1.6× more energy. The paper finally discusses the case for custom FlP in low-energy general-purpose computation, thanks to its ease of use and an energy overhead lower than might have been expected.
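The accuracy-versus-precision trade-off the paper measures can be mimicked in software with a toy Lloyd's-iteration K-means whose centroids are rounded to a fixed-point grid. This is only an illustration of the effect, not the ApxPerf2.0 hardware flow; the `frac_bits` parameter is an assumption standing in for the operator bitwidth.

```python
import numpy as np

def quantize(x, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return np.round(x * scale) / scale

def kmeans(points, init, iters=20, frac_bits=None):
    """Lloyd's algorithm; optionally store centroids in emulated fixed point."""
    centroids = np.array(init, dtype=float)
    k = len(centroids)
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
        if frac_bits is not None:
            centroids = quantize(centroids, frac_bits)
    return centroids, labels

# Two well-separated 2D blobs: even coarse fixed point keeps them apart
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(1.0, 0.1, (50, 2))])
init = [[0.2, 0.2], [0.8, 0.8]]
_, lab_flp = kmeans(pts, init)
_, lab_fxp = kmeans(pts, init, frac_bits=4)
print((lab_flp == lab_fxp).mean())
```

On easy data the quantized run matches the floating-point assignments; shrinking `frac_bits` is where the accuracy gap the paper quantifies in hardware would start to appear.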
Citations: 15
Obtaining an optimal set of head-related transfer functions with a small amount of measurements
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110008
Mikko Parviainen, Pasi Pertilä
This article presents a method to obtain personalized Head-Related Transfer Functions (HRTFs) for creating virtual soundscapes based on a small number of measurements. The best matching set of HRTFs is selected among the entries of publicly available databases. The proposed method is evaluated using a listening test in which subjects assess audio samples created using the best matching set of HRTFs against a randomly chosen set of HRTFs from the same location. The listening test indicates that subjects prefer the proposed method over a random set of HRTFs.
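The abstract does not state the matching criterion; as one plausible sketch (an assumption, not the paper's method), a database entry can be scored by the log-spectral distance between its HRTF magnitude responses and the few measured ones, and the entry with the smallest distance selected.

```python
import numpy as np

def log_spectral_distance(h_meas, h_db, eps=1e-12):
    """RMS log-spectral distance (dB) between two sets of magnitude responses."""
    d = 20 * np.log10((np.abs(h_meas) + eps) / (np.abs(h_db) + eps))
    return np.sqrt((d ** 2).mean())

def best_match(measured, database):
    """Index of the database entry closest to the measured responses."""
    scores = [log_spectral_distance(measured, entry) for entry in database]
    return int(np.argmin(scores))

# Toy data: 3 database subjects, 4 measured directions, 64 frequency bins
rng = np.random.default_rng(0)
database = [np.abs(rng.standard_normal((4, 64))) + 0.5 for _ in range(3)]
measured = database[1] * (1 + 0.01 * rng.standard_normal((4, 64)))  # close to subject 1
print(best_match(measured, database))  # expected: 1
```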
Citations: 1
Processing LSTM in memory using hybrid network expansion model
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110011
Yu Gong, Tingting Xu, Bo Liu, Wei-qi Ge, Jinjiang Yang, Jun Yang, Longxing Shi
With the rapidly increasing adoption of deep learning, LSTM-RNNs are widely used. Meanwhile, their complex data dependences and intensive computation limit the performance of accelerators. In this paper, we first propose a hybrid network expansion model to exploit fine-grained data parallelism. Based on the model, we implement a Reconfigurable Processing Unit (RPU) using Processing-In-Memory (PIM) units. Our work shows that the gates and cells in an LSTM can be partitioned into fundamental operations and then recombined and mapped onto heterogeneous computing components. The experimental results show that, implemented in a 45nm CMOS process, the proposed RPU, with a size of 1.51 mm² and power of 413 mW, achieves 309 GOPS/W in power efficiency and is 1.7× better than a state-of-the-art reconfigurable architecture.
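The claim that LSTM gates and cells decompose into fundamental operations can be seen directly from the standard cell equations: one fused matrix-vector product, elementwise sigmoid/tanh, and Hadamard products. A numpy sketch of one textbook LSTM step (not the RPU mapping itself):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step as fundamental ops: matvec, add, sigmoid/tanh, Hadamard."""
    z = W @ x + U @ h + b               # one fused matvec for all four gates
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_new = f * c + i * g               # cell update: Hadamard products + add
    h_new = o * np.tanh(c_new)          # hidden output
    return h_new, c_new

d_in, d_h = 3, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * d_h, d_in))
U = rng.standard_normal((4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```

Each of these primitive operations (matvec, nonlinearity, elementwise multiply) is exactly the kind of kernel that can be assigned to a separate in-memory or reconfigurable compute unit.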
Citations: 1
High-throughput decoding of block turbo codes on graphics processing units
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109996
Junhee Cho, Wonyong Sung
Block turbo codes (BTCs) can provide very powerful forward error correction (FEC) for several applications, such as optical networks and NAND flash memory devices. These applications require soft-decision FEC codes that guarantee a bit error rate (BER) under 10⁻¹², which is, however, very difficult to verify with a CPU simulator. In this paper, we present high-throughput graphics processing unit (GPU) based turbo decoding software to aid the development of very low error rate BTCs. For effective utilization of the GPUs, the software processes multiple BTC frames simultaneously and minimizes the global memory access latency. In particular, the Chase-Pyndiah algorithm is efficiently parallelized to decode every row and column of a BTC word. The GPU-based simulator achieved throughputs of about 80 and 150 Mb/s for decoding BTCs composed of Hamming and BCH codes, respectively. These throughputs are up to 124 times higher than the CPU-based results.
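The property that makes the GPU mapping work is that in a product code every row (and every column) is a codeword of the component code, so row and column decoders are mutually independent and can run in parallel. A toy hard-decision illustration with single-parity-check components — far simpler than the soft-decision Chase-Pyndiah decoder used in the paper, but showing the same row/column independence:

```python
import numpy as np

rng = np.random.default_rng(1)
msg = rng.integers(0, 2, (4, 4))
# Product code: append even parity to each row, then to each column
rows = np.hstack([msg, msg.sum(axis=1, keepdims=True) % 2])
code = np.vstack([rows, rows.sum(axis=0, keepdims=True) % 2])

recv = code.copy()
recv[2, 1] ^= 1                              # inject a single bit error
# Each row/column syndrome is independent -> one GPU thread per row/column
bad_rows = np.flatnonzero(recv.sum(axis=1) % 2)
bad_cols = np.flatnonzero(recv.sum(axis=0) % 2)
recv[bad_rows[0], bad_cols[0]] ^= 1          # the error sits at the intersection
print((recv == code).all())  # True
```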
Citations: 2
CRN-based design methodology for synchronous sequential logic
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109979
Zhiwei Zhong, Lulu Ge, Ziyuan Shen, X. You, Chuan Zhang
With the aid of a storage-release mechanism named key-keysmith, an implementation approach based on chemical reaction networks (CRNs) for synchronous sequential logic is proposed. This design approach, which stores logic information in the keysmith and releases it through the key, focuses primarily on the underlying state transitions behind the required logic rather than on an electronic circuit representation. Therefore, it can be uniformly and easily employed to implement any synchronous sequential logic with molecular reactions. Theoretical analysis and numerical simulations demonstrate the robustness and universality of the proposed approach.
Citations: 3
Statistical analysis of Post-HEVC encoded videos
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110020
A. Jallouli, Fatma Belghith, M. A. B. Ayed, W. Hamidouche, J. Nezan, N. Masmoudi
Post-HEVC refers to the emerging video coding standard beyond the High Efficiency Video Coding (HEVC) standard. It is more complex in its transformation and prediction steps, but it offers the opportunity to code and compress 3D and 360° videos. This paper presents several statistical analyses of Post-HEVC encoded videos, in particular analyses of 1D and 2D transformation types and of intra and inter prediction types, for test videos of different classes and resolutions. The analyses are carried out at the decoder, where the coding decisions have already been taken by the encoder. Results show that the choice of transformation (type and size) and of prediction type (intra or inter) depends on the nature of the video: motion and texture. This work can be considered a milestone toward intelligent algorithms based on video characteristics for fast decisions in the Post-HEVC encoding process.
Citations: 5
Low complexity hardware accelerator for nD FastICA based on coordinate rotation
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110000
Swati Bhardwaj, Shashank Raghuraman, A. Acharyya
This paper proposes a low-complexity algorithmic modification of the n-dimensional (nD) FastICA methodology, based on the Coordinate Rotation Digital Computer (CORDIC), to attain high computation speed. The most complex and time-consuming update stage and the convergence check required for computing the nth weight vector are eliminated in the proposed methodology. Using the Gram-Schmidt orthogonalization stage and the normalization stage to calculate the nth weight vector in an entirely sequential CORDIC-based FastICA procedure yields a significant gain in computation time. The proposed methodology has been functionally verified and validated by applying it to separating 6D speech signals. It has been implemented in hardware using Verilog HDL and synthesized using UMC 180nm technology. The average improvement in computation time obtained by using the proposed methodology for 4D to 6D FastICA with 1024 samples, considering the minimum case of two iterations for the nth stage, was found to be 98.79%.
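The reason the nth weight vector needs no iterative update is that, once n-1 orthonormal weight vectors are known, Gram-Schmidt deflation plus normalization determines the last one directly. A numpy sketch of that deflation arithmetic (the CORDIC realization is the paper's contribution; this only shows the operations involved):

```python
import numpy as np

def deflate(w, W_found):
    """Gram-Schmidt deflation: make w orthogonal to each already-found
    weight vector (rows of W_found), then normalize to unit length."""
    for wj in W_found:
        w = w - np.dot(w, wj) * wj      # remove the projection onto wj
    return w / np.linalg.norm(w)

# Two orthonormal vectors already extracted; the third follows without iteration
rng = np.random.default_rng(0)
W_found = np.linalg.qr(rng.standard_normal((3, 3)))[0][:2]  # two orthonormal rows
w_n = deflate(rng.standard_normal(3), W_found)
print(np.abs(W_found @ w_n).max() < 1e-10, np.isclose(np.linalg.norm(w_n), 1.0))
```

Each dot product, subtraction, and the final normalization map naturally onto vectoring/rotation modes of a CORDIC datapath, which is what the accelerator exploits.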
Citations: 4
FPGA implementation of object recognition processor for HDTV resolution video using sparse FIND feature
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8109993
Yuri Nishizumi, Go Matsukawa, K. Kajihara, T. Kodama, S. Izumi, H. Kawaguchi, C. Nakanishi, Toshio Goto, Takeo Kato, M. Yoshimoto
This paper describes an FPGA implementation of an object recognition processor for HDTV-resolution 30 fps video using the Sparse FIND feature. A two-stage feature extraction process using HOG and Sparse FIND, a highly parallel classification stage in the support vector machine (SVM), and block-parallel processing for RAM access cycle reduction are proposed to perform real-time object recognition with enormous computational complexity. From an implementation of the proposed architecture on the FPGA, it was confirmed that detection using the Sparse FIND feature was performed on HDTV images at 47.63 fps on average, at 90 MHz. The recognition accuracy degradation from the original Sparse FIND-based object detection algorithm implemented in software was 0.5%, which shows that the FPGA system provides sufficient accuracy for practical use.
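The first stage of the pipeline, a HOG cell, is a gradient-orientation histogram that can be sketched in a few lines of software (an illustration of the feature itself; the paper's contribution is the hardware architecture, and Sparse FIND builds on such cells):

```python
import numpy as np

def hog_cell(patch, bins=9):
    """Unsigned gradient-orientation histogram for one cell, L2-normalized."""
    gx = np.diff(patch, axis=1, prepend=patch[:, :1]).astype(float)
    gy = np.diff(patch, axis=0, prepend=patch[:1, :]).astype(float)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)                 # fold to [0, pi)
    idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
    return hist / (np.linalg.norm(hist) + 1e-9)

# A vertical edge puts all gradient energy in the 0-radian (horizontal-gradient) bin
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
h = hog_cell(cell)
print(h.argmax())  # bin 0
```

A linear SVM then scores the concatenated cell histograms with a single dot product per detection window, which is what makes the classification stage so amenable to parallel hardware.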
Citations: 5
Odd type DCT/DST for video coding: Relationships and low-complexity implementations
Pub Date : 2017-10-01 DOI: 10.1109/SiPS.2017.8110009
M. Masera, M. Martina, G. Masera
In this paper, we show a class of relationships linking Discrete Cosine Transforms (DCT) and Discrete Sine Transforms (DST) of types V, VI, VII and VIII, which have recently been considered for inclusion in future video coding technology. In particular, the proposed relationships allow the DCT-V and the DCT-VIII to be computed as functions of the DCT-VI and the DST-VII respectively, plus simple reordering and sign inversion. Moreover, this paper exploits the proposed relationships and the Winograd factorization of the Discrete Fourier Transform to construct low-complexity factorizations for computing the DCT-V and the DCT-VIII of lengths 4 and 8. Finally, the proposed signal-flow graphs have been implemented on an FPGA, showing reduced hardware utilization with respect to a direct implementation of the matrix-vector multiplication algorithm.
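One instance of such a relationship can be checked numerically: with the usual basis definitions, the DCT-VIII of a signal equals the DST-VII of the sign-alternated signal with its output order reversed — exactly "reordering plus sign inversion". The shared scale factor below is the common convention and is an assumption; the identity holds for any common scale.

```python
import numpy as np

N = 8
n = np.arange(N)
k = n[:, None]
s = np.sqrt(4.0 / (2 * N + 1))                                          # common scale
S7 = s * np.sin(np.pi * (k + 1) * (2 * n + 1) / (2 * N + 1))            # DST-VII rows
C8 = s * np.cos(np.pi * (2 * k + 1) * (2 * n + 1) / (2 * (2 * N + 1)))  # DCT-VIII rows

x = np.random.default_rng(0).standard_normal(N)
lhs = C8 @ x                            # DCT-VIII of x
rhs = (S7 @ (x * (-1.0) ** n))[::-1]    # DST-VII of sign-alternated x, reversed
print(np.allclose(lhs, rhs))  # True
```

The check follows from sin(π(2n+1)/2 − B) = (−1)ⁿ cos B, so the k-th DCT-VIII basis row is the (N−1−k)-th DST-VII row with alternating signs.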
Citations: 11