Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP): Latest Articles

Parallelization of an ultrasound reconstruction algorithm for non destructive testing on multicore CPU and GPU

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136904

Antoine Pedron, L. Lacassagne, F. Bimbard, S. Berre

The CIVA software platform developed by CEA-LIST offers various simulation and data processing modules dedicated to non-destructive testing (NDT). In particular, it provides ultrasonic imaging and reconstruction tools for localizing echoes and for identifying and sizing the detected defects. Because of the complexity of the data processed, computation time is now a limitation for the optimal use of available information. In this article, we present performance results on the parallelization of one computationally heavy algorithm on general purpose processors (GPP) and graphics processing units (GPU). The GPU implementation makes intensive use of atomic intrinsics. Compared to the initial GPP implementation, the optimized GPP implementation runs up to ×116 faster and the GPU implementation up to ×631 faster. This shows that, even with irregular workloads, GPUs deliver high performance when software optimization is combined with hardware improvements.

Citations: 2

Embedded operating systems energy overhead

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136853

B. Ouni, C. Belleudy, S. Bilavarn, E. Senn

In this paper, a flow for characterizing the energy consumption of an embedded operating system is presented. The objective is to determine the energy overhead of the services of the embedded OS; we are particularly interested in the context switch service. The modeling is based on measurements on the OMAP35x EVM hardware platform, running Linux omap. Based on the analysis results, a relationship between energy overhead and a set of hardware and software parameters is established.

Citations: 5

Optimization methodologies for complex FPGA-based signal processing systems with CAL

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136878

A. Rahman, Hossam Amer, A. Prihozhy, Christophe Lucarz, M. Mattavelli

Signal processing designs are becoming increasingly complex with demands for more advanced algorithms. Designers are now seeking high-level tools and methodologies to help manage complexity and increase productivity. Recently, the CAL dataflow language has been specified; it allows a dataflow description to be synthesized into RTL code for hardware implementation and, based on several case studies, has shown promising results. However, no work has been done on global network analysis, which could increase the optimization space. In this paper, we introduce methodologies to analyze and optimize CAL programs by determining which actions should be parallelized, pipelined, or refactored for the highest throughput gain, and then providing tools and techniques to achieve this using minimum resources. In a case study on the RVC MPEG-4 SP Intra decoder implemented on a Virtex-5 FPGA, experimental results confirmed our analysis, with a throughput gain of up to 3.5x using relatively few additional slices compared to the reference design.

Citations: 3

High speed VLSI architecture for 2-D lifting Discrete Wavelet Transform

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136866

A. Darji, R. Bansal, S. Merchant, A. Chandorkar

The lifting scheme reduces the computational complexity of computing the Discrete Wavelet Transform (DWT) compared to convolution. We propose a high-performance and memory-efficient architecture with a parallel scanning method for the 2-D DWT using the 5/3 lifting wavelet. This 2-D architecture is composed of two 1-D DWT units and a Transpose Unit (TU). The proposed parallel scanning reduces the on-chip line buffer requirement compared to other line-based scanning methods. The proposed 2-D DWT architecture uses a buffer of size only 2N for an NxN image, which is low compared to the usual 3.5N requirement for implementing the 5/3 lifting wavelet. This is achieved by performing the column and row transforms simultaneously. The designed 1-D DWT module can process two inputs at a time and produce two outputs per clock, which reduces latency significantly compared to other 2-D dual-scan-based DWT architectures. The designed TU operates at half the clock rate, which reduces power, and its design is independent of the size of the input image. Instead of a shifter, we propose a Hardwired Scaling Unit (HSU) for coefficient multiplication. Unlike a shift register unit, this design saves clock cycles and helps reduce power considerably. The architecture is synthesized using Xilinx ISE 10.1 and implemented on a Virtex-IIPRO XC2VP30 FPGA. Very low FPGA resource utilization is found.
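
As a software illustration of the lifting steps this abstract refers to, the following is a minimal sketch of one level of the reversible integer 5/3 lifting transform, assuming the common predict/update formulation with symmetric boundary extension. The function names (lift53_forward, lift53_inverse, dwt53_2d) are hypothetical, and the 2-D routine applies the row and column passes one after the other with a software transpose; it does not model the paper's parallel scanning, 2N line buffer, or Hardwired Scaling Unit.

```python
def lift53_forward(x):
    """One level of the integer 5/3 lifting transform on an even-length list."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict: detail = odd sample minus the floored mean of its even neighbours.
    # At the right edge the last even sample is mirrored (symmetric extension).
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    # Update: smooth = even sample plus a rounded quarter of the neighbouring details.
    # At the left edge the first detail coefficient is mirrored.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward exactly by reversing the update and predict steps."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


def dwt53_2d(image):
    """Separable 2-D transform: row pass, transpose, row pass, transpose back."""
    def rows_pass(m):
        # Each output row holds the smooth half followed by the detail half.
        return [s + d for s, d in (lift53_forward(list(r)) for r in m)]

    def transpose(m):
        return [list(col) for col in zip(*m)]

    return transpose(rows_pass(transpose(rows_pass(image))))
```

Because each lifting step only adds or subtracts a value computed from the other half of the samples, the inverse recovers the input exactly, which is what makes the integer 5/3 transform usable for lossless coding.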

Citations: 6

Efficient maximal convex custom instruction enumeration for extensible processors

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136868

Chenglong Xiao, E. Casseau

In recent years, the use of extensible processors has increased. Extensible processors extend the base instruction set of a general-purpose processor with a set of custom instructions. Custom instructions that can be implemented in special hardware units make it possible to improve performance and decrease power consumption in extensible processors. The key issue is to automatically generate and select the custom instructions from high-level application code. However, enumerating all possible custom instructions of a given dataflow graph is a computationally difficult problem. In this paper, we propose an efficient algorithm for the exact enumeration of maximal convex custom instructions. State-of-the-art algorithms solve the problem in either a bottom-up or a top-down manner. The proposed algorithm enumerates all maximal convex custom instructions in a sandwich manner that combines the advantages of the bottom-up and top-down manners. Compared to the latest algorithm, our algorithm can achieve orders of magnitude speedup.
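
To make the convexity constraint concrete, here is a minimal sketch of a convexity test on a dataflow graph, assuming the usual definition: a set of nodes is convex if no path between two of its nodes passes through a node outside the set. The function name is_convex and the adjacency-dictionary representation are illustrative assumptions; this checks a single candidate only and is not the enumeration algorithm proposed in the paper.

```python
def is_convex(successors, subset):
    """Return True if `subset` is a convex node set of the DAG `successors`.

    `successors` maps each node to an iterable of its direct successors.
    """
    subset = set(subset)
    # Start from every outside node that is fed directly by the subset.
    stack = [v for u in subset for v in successors.get(u, ()) if v not in subset]
    seen = set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for v in successors.get(u, ()):
            if v in subset:
                # A path leaves the subset and re-enters it: not convex.
                return False
            stack.append(v)
    return True
```

For example, in a graph with edges a→b, b→c and a→c, the set {a, c} is not convex because the path a→b→c leaves and re-enters it, whereas {a, b} is convex. A non-convex candidate cannot be scheduled as a single atomic instruction, since an outside operation would have to run in the middle of it.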

Citations: 5

DFG implementation on multi GPU cluster with computation-communication overlap

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136859

Sylvain Huet, Vincent Boulos, V. Fristot, L. Salvo

Nowadays, it is possible to build a multi-GPU supercomputer, well suited for the implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications, task synchronization … In this paper, we propose a design flow allowing an efficient implementation of a Digital Signal Processing (DSP) application specified as a Data Flow Graph (DFG) on a multi-GPU computer cluster. We focus particularly on the effective implementation of communications by automating the computation-communication overlap, which can lead to significant speedups, as shown in the presented benchmark. The approach is validated on a 3D granulometry application developed for research on materials.

Citations: 4

An efficient parallel motion estimation algorithm and X264 parallelization in CUDA

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136860

Youngsub Ko, Youngmin Yi, S. Ha

H.264/AVC video encoders have been widely used for their high coding efficiency. Since the computational demand, which is proportional to the frame resolution, is constantly increasing, accelerating H.264/AVC by parallel processing has been of great interest. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general purpose applications by exploiting fine-grain data parallelism. Despite extensive research effort to use GPUs to accelerate the H.264/AVC algorithm, no speed-up has been achieved over x264, which is known as the fastest CPU implementation, because of the significant communication overhead between the host CPU and the GPU and the intra-frame dependency in the algorithm. In this paper, we propose a novel motion estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU. The proposed H.264 encoder achieves more than 20% speed-up compared with x264.

Citations: 12

Middleware approaches for adaptivity of Kahn Process Networks on Networks-on-Chip

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136862

E. Cannella, O. Derin, T. Stefanov

We investigate and propose a number of different middleware approaches, namely virtual connector, virtual connector with variable rate, and request-driven, which implement the semantics of Kahn Process Networks on Network-on-Chip architectures. All of the presented solutions allow for run-time system adaptivity. We implement the approaches on a Network-on-Chip multiprocessor platform prototyped on an FPGA. We compare them in terms of the overhead they introduce on two case studies with different communication characteristics. We found that the virtual connector mechanism outperforms the other approaches in the communication-intensive application. In the other case study, which has a higher computation/communication ratio, the middleware approaches show similar performance.
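
For readers unfamiliar with Kahn Process Network semantics, the following is a minimal sketch, assuming the standard model in which processes communicate only through unbounded FIFO channels with blocking reads and non-blocking writes. Threads and queue.Queue stand in for processing elements and channels; the names END, producer, scale and run_network are hypothetical. It does not implement any of the middleware approaches (virtual connector, variable rate, request-driven) evaluated in the paper.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream token, not part of the KPN model itself


def producer(out, n):
    """Source process: emits the tokens 0..n-1, then the end marker."""
    for i in range(n):
        out.put(i)  # writes never block because the queue is unbounded
    out.put(END)


def scale(inp, out, factor):
    """Transform process: multiplies every incoming token by `factor`."""
    while True:
        token = inp.get()  # blocking read: waits until a token is available
        if token is END:
            out.put(END)
            return
        out.put(token * factor)


def run_network(n=5, factor=3):
    """Wire producer -> scale -> collector and return the collected tokens."""
    a, b = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        token = b.get()
        if token is END:
            break
        result.append(token)
    for t in threads:
        t.join()
    return result
```

Because each process only blocks on reads from a single channel and never tests a channel for emptiness, the output sequence does not depend on how the threads are scheduled, which is the determinism property that makes the model attractive for mapping onto multiprocessor platforms.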

Citations: 7

A SystemC TLM framework for distributed simulation of complex systems with unpredictable communication

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136847

J. Peeters, N. Ventroux, Tanguy Sassolas, L. Lacassagne

Increasingly complex systems need parallelized simulation engines. In the context of SystemC simulation, existing proposals require predicting communication in the simulated system. However, this communication is often unpredictable. In order to deal with unpredictable systems, this paper presents a parallelization approach using asynchronous communication without modification of the SystemC simulation engine. The simulated system model is partitioned and distributed across separate simulation engines, each part being evaluated in parallel with the others. Functional consistency is preserved thanks to the simulated system's write-exclusive memory access policy, while temporal consistency is guaranteed using explicit synchronization. Experimental results show a speed-up of up to 13x on 16 processors.

Citations: 14

SystemC modelization for fast validation of imager architectures

Proceedings of the 2011 Conference on Design & Architectures for Signal & Image Processing (DASIP)

Pub Date : 2011-11-01 DOI: 10.1109/DASIP.2011.6136902

Y. Blanchard, A. Dupret, A. Peizerat

Development of smart CMOS imagers is a complex design task in which the verification of an architecture composed of a matrix of pixels intermixed with analog and digital electronics plays an important part. New generations of imagers using 3D integration will allow even more processing to be done in situ. Verification has to be done locally for the pixel and globally for the architecture. The design exploration and validation problem has shifted from mostly the analog domain to the validation of a complex SoC with millions of parallel processors, the pixels. In this paper we present a methodology using the SystemC language for the creation of fast models for validation and for a first-level evaluation of the performance of large CMOS imager architectures.

Citations: 3
