
2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig): Latest Publications

Area-driven partial reconfiguration for SEU mitigation on SRAM-based FPGAs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857154
M. Vavouras, C. Bouganis
This paper presents an area-driven Field-Programmable Gate Array (FPGA) scrubbing technique based on partial reconfiguration for Single Event Upset (SEU) mitigation. The proposed method is compared with existing techniques, such as blind and on-demand scrubbing, on a novel SEU mitigation framework implemented on the Zynq platform that supports various SEU and scrubbing rates. A design space exploration of availability versus data transfers from a Double Data Rate Type 3 (DDR3) memory shows that our approach outperforms blind scrubbing over a range of availability values when a second-order polynomial IP is targeted. A comparison to an existing on-demand scrubbing technique based on Dual Modular Redundancy (DMR) shows that our approach saves up to 46% area for the same case study.
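As a purely illustrative aside (not the authors' model), the following Python sketch contrasts the memory traffic of blind scrubbing, which periodically rewrites every frame of a reconfigurable region, with an area-driven scheme that rewrites only the frames the protected IP occupies. The frame size, frame counts and scrub rates are invented placeholders.

```python
# Hypothetical numbers only: DDR3 bytes/s moved by blind scrubbing of a whole
# reconfigurable region vs. an area-driven scrubber that rewrites only the
# frames actually occupied by the protected IP.

FRAME_BYTES = 404            # assumed size of one configuration frame

def scrub_traffic(frames_per_scrub, scrub_rate_hz):
    """DDR3 bytes per second moved by the scrubber."""
    return frames_per_scrub * FRAME_BYTES * scrub_rate_hz

blind_frames = 1200          # assumed: every frame of the region
area_driven_frames = 300     # assumed: only the frames used by the polynomial IP

for rate_hz in (1, 10, 100): # scrub repetitions per second
    blind = scrub_traffic(blind_frames, rate_hz)
    area = scrub_traffic(area_driven_frames, rate_hz)
    print(f"{rate_hz:3d} Hz  blind: {blind / 1e6:6.2f} MB/s  area-driven: {area / 1e6:6.2f} MB/s")
```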
Citations: 2
Hybrid energy-aware reconfiguration management on Xilinx Zynq SoCs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857177
Andreas Becher, Jutta Pirkl, Achim Herrmann, J. Teich, S. Wildermann
Partial Reconfiguration is a common technique on FPGA platforms to load hardware accelerators at runtime without interrupting the rest of the system. One crucial factor is the time needed for reconfiguration, as it affects usability, performance and energy consumption. Furthermore, many systems have to share partial areas between multiple applications and users. In this paper, we introduce a novel open-source reconfiguration manager for Xilinx Zynq SoCs which a) allows partial area sharing and b) includes a hybrid reconfiguration approach utilizing both the Processor Configuration Access Port (PCAP) and the Internal Configuration Access Port (ICAP) in order to minimize reconfiguration time and system energy consumption. We evaluate our design and identify the sweet spots between energy consumption and latency of accelerator availability with an example use case. With the hybrid approach, the full configuration after powering on the FPGA can be sped up by up to 64% compared to using the PCAP interface alone.
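To illustrate the kind of decision such a hybrid manager has to make, here is a minimal Python sketch that chooses between PCAP and ICAP per partial bitstream from estimated reconfiguration time and energy. The throughput and power figures, and the cost model itself, are assumptions for illustration rather than measurements from the paper.

```python
# Per-bitstream port selection sketch; all figures are invented placeholders.
PORTS = {
    "PCAP": {"mbytes_per_s": 128, "watts": 0.12},   # assumed throughput / power
    "ICAP": {"mbytes_per_s": 380, "watts": 0.45},
}

def pick_port(bitstream_bytes, weight_time, weight_energy):
    """Return the port with the lowest weighted time/energy cost."""
    best, best_cost = None, float("inf")
    for name, p in PORTS.items():
        seconds = bitstream_bytes / (p["mbytes_per_s"] * 1e6)
        joules = seconds * p["watts"]
        cost = weight_time * seconds + weight_energy * joules
        if cost < best_cost:
            best, best_cost = name, cost
    return best

print(pick_port(900_000, weight_time=1.0, weight_energy=0.0))  # fastest port
print(pick_port(900_000, weight_time=0.0, weight_energy=1.0))  # most frugal port
```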
Citations: 7
Reconfigurable computing for network function virtualization: A protocol independent switch
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857183
Qianqiao Chen, Vaibhawa Mishra, G. Zervas
Network function virtualization (NFV) aims to decouple software network applications from their hardware in order to reduce development and deployment costs for new services. To enable the deployment of diverse network services, a reconfigurable, high-performance hardware platform can bring considerable benefits to NFV. In this paper, an FPGA-based platform is proposed that acts as a protocol-reconfigurable NFV switch. The logic circuits of virtual network functions can be reconfigured at run time on the proposed platform. A reconfiguration process is also proposed to enable packet-loss-free switch-over between virtual network functions, delivering undisrupted service. The platform can be reconfigured between a Layer 1 circuit switch and a Layer 2 Ethernet packet switch. Once running as a packet switch, the platform can switch over from a Layer 2 Ethernet switch to a Layer 3 IP parser and even a Layer 4 UDP parser. The implemented 2×2 switch, running at 10 Gbps per port, delivers a minimum latency of 300 nanoseconds (circuit switch) and a maximum latency of 1 microsecond. Reconfiguration between the IP and UDP parsers without loss of data is also demonstrated.
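The following toy Python model is one assumption about how packet-loss-free switch-over could work, not the paper's hardware design: packets arriving during reconfiguration are buffered rather than dropped, then drained through the newly loaded function before live traffic resumes.

```python
# Minimal sketch of loss-free switch-over between two virtual network functions.
from collections import deque

def ethernet_switch(pkt): return ("L2", pkt)
def ip_parser(pkt):       return ("L3", pkt)

class ReconfigurableSwitch:
    def __init__(self, function):
        self.function = function
        self.buffer = deque()
        self.reconfiguring = False

    def start_reconfiguration(self):
        self.reconfiguring = True           # old function stops accepting traffic

    def finish_reconfiguration(self, new_function):
        self.function = new_function
        drained = [self.function(p) for p in self.buffer]   # no packet was lost
        self.buffer.clear()
        self.reconfiguring = False
        return drained

    def receive(self, pkt):
        if self.reconfiguring:
            self.buffer.append(pkt)         # hold, do not drop
            return None
        return self.function(pkt)

sw = ReconfigurableSwitch(ethernet_switch)
print(sw.receive("pkt0"))
sw.start_reconfiguration()
sw.receive("pkt1"); sw.receive("pkt2")      # arrive mid-reconfiguration
print(sw.finish_reconfiguration(ip_parser))
print(sw.receive("pkt3"))
```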
Citations: 10
Dataflow optimization for programmable embedded image preprocessing accelerators
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857161
T. Lieske, M. Reichenbach, Burkhard Ringlein, D. Fey
Image processing is an omnipresent topic in current embedded industrial and consumer applications. Therefore, it is important to investigate processing architectures to extract design guidelines for developing efficient image processors. While SIMD (single instruction, multiple data) processor arrays have often been proposed to accelerate image processing tasks, the internal architecture of the processor elements (PEs) has not been optimized. Nevertheless, it is necessary to evaluate the optimal complexity of PEs to trade off performance against the architectural overhead caused by complex processor architectures. Hence, the goal of this paper is to present a thorough evaluation of finding the right architectural complexity of PEs in a processor array to meet given performance and logic area constraints. In order to determine the optimal complexity, the ADL (architecture description language) based FAUPU framework for image preprocessing architectures is utilized and, after evaluation, extended with pipelining support. The newly introduced pipelining features enable resource-efficient performance optimizations and are a significant improvement to the FAUPU ADL. Due to the fine-grained configurability of the FAUPU architecture, several design variants can be easily generated, and it is possible to evaluate the effects of instruction set architecture (ISA) complexity and pipelining on design properties, and how these features are best combined. Consequently, the FAUPU framework can be used to address the question of whether it is better to use many lightweight cores, or whether fewer but more complex cores yield a greater performance-to-area ratio. The results show that lightweight cores are best suited to achieve a targeted frame rate with the least resources. More complex cores, on the other hand, yield better performance-to-area ratios.
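A toy design-space exploration loop in Python conveys the kind of question the FAUPU framework answers. The area and throughput figures of the three hypothetical PE variants are invented, chosen only so that the qualitative outcome mirrors the conclusion above.

```python
# Invented PE variants: (area in LUTs, throughput in megapixels/s per core).
import math

variants = {
    "lightweight": (300, 25.0),
    "pipelined":   (800, 70.0),
    "complex_isa": (2000, 200.0),
}

target_mpixels = 1920 * 1080 * 60 / 1e6      # full-HD at 60 fps

for name, (area, perf) in variants.items():
    cores = math.ceil(target_mpixels / perf)  # cores needed to hit the target
    total_area = cores * area                 # resources spent to reach the frame rate
    ratio = perf / area                       # performance-to-area ratio per core
    print(f"{name:12s} cores={cores} total LUTs={total_area:5d} perf/area={ratio:.4f}")
```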
Citations: 2
RePaBit: Automated generation of relocatable partial bitstreams for Xilinx Zynq FPGAs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857186
J. Rettkowski, Konstantin Friesen, D. Göhringer
Partial reconfiguration in FPGAs increases the flexibility of a system by allowing hardware modules to be replaced dynamically. However, more memory is needed to store all partial bitstreams, and generating partial bitstreams for all possible regions on the FPGA is very time-consuming. To overcome these issues, bitstream relocation can be used. In this paper, a novel approach that facilitates bitstream relocation with the Xilinx Vivado tool flow is presented. In addition, the approach is automated by TCL scripts that extend Vivado to RePaBit. RePaBit is successfully evaluated on the Xilinx Zynq FPGA using 1D and 2D relocation of complex modules such as MicroBlaze processors. The results show negligible area and frequency overhead, while partial bitstream relocation enables more flexibility and a faster design time.
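Conceptually, relocating a partial bitstream means rewriting its frame addresses so that it targets a different, resource-compatible region while leaving the configuration payload untouched. The Python sketch below uses a simplified frame-address representation purely for illustration; it is not the real Xilinx frame address (FAR) encoding or the RePaBit flow.

```python
# Schematic illustration of bitstream relocation with a simplified address model.
def relocate(frames, row_offset, col_offset):
    """Shift every frame of a partial bitstream by the given region offset."""
    relocated = []
    for frame in frames:
        relocated.append({
            "row":  frame["row"] + row_offset,
            "col":  frame["col"] + col_offset,
            "data": frame["data"],            # configuration payload is unchanged
        })
    return relocated

partial = [{"row": 0, "col": 10, "data": b"\x00" * 8},
           {"row": 0, "col": 11, "data": b"\xff" * 8}]

# 2D relocation: one clock-region row down, five columns to the right.
for f in relocate(partial, row_offset=1, col_offset=5):
    print(f["row"], f["col"])
```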
Citations: 17
Automated synthesis of FPGA-based packet filters for 100 Gbps network monitoring applications
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857156
J. F. Zazo, S. López-Buedo, G. Sutter, J. Aracil
Monitoring 100 Gbps network links is a challenging task. Packet filtering allows monitoring applications to focus on the relevant data, discarding packets that do not provide any valuable information. However, such a high line rate calls for custom hardware solutions. This work presents a tool for automatically synthesizing packet filters from a custom grammar, which defines filters in a human-readable format. Thanks to parser generators (Bison) and lexical analyzers (Flex), Verilog code is automatically generated from the filter specification. Rules can be applied to a protocol, a protocol field, the packet payload, or a combination of them. The generated filters use standard AXI4-Stream interfaces, which integrate seamlessly into the packet filtering framework that we have developed for the integrated 100G Ethernet block available in Xilinx UltraScale devices. We present the results for two proof-of-concept packet filtering designs. Furthermore, the filters are fully pipelined, so the full 100 Gbps rate is guaranteed. As the framework uses a cut-through approach, latency is kept to a minimum. Finally, the proposed framework allows for the integration of more complex payload-level filters, written in C with the Vivado HLS tool.
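As a software analogue of the described flow (the real tool emits Verilog for an AXI4-Stream pipeline rather than Python), the sketch below compiles a human-readable rule from an invented mini-grammar into a matching predicate.

```python
# Invented mini-grammar:  <field> == <value> [and <field> == <value> ...]
import re

def compile_filter(rule):
    """Turn e.g. 'ip.proto == 17 and udp.dport == 53' into a predicate."""
    clauses = []
    for clause in rule.split(" and "):
        field, value = re.fullmatch(r"\s*([\w.]+)\s*==\s*(\d+)\s*", clause).groups()
        clauses.append((field, int(value)))
    def predicate(packet):
        return all(packet.get(f) == v for f, v in clauses)
    return predicate

keep_dns = compile_filter("ip.proto == 17 and udp.dport == 53")
print(keep_dns({"ip.proto": 17, "udp.dport": 53}))   # True  -> forward to monitor
print(keep_dns({"ip.proto": 6,  "udp.dport": 53}))   # False -> discard
```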
Citations: 8
Packing a modern Xilinx FPGA using RapidSmith
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857180
Travis Haroldsen, B. Nelson, B. Hutchings
Academic packing algorithms have typically been limited to theoretical architectures. In this paper, we describe RSVPack, a packing algorithm built on top of RapidSmith to target the Xilinx Virtex 6 architecture. We integrate our packer into the Xilinx ISE CAD flow and demonstrate our packer tool by packing a set of benchmark circuits and performing routing and timing analysis inside ISE.
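For readers unfamiliar with packing, the toy Python routine below shows the kind of decision a packer makes: cells are greedily grouped into fixed-capacity sites, preferring sites that already hold connected cells. Real slice legality rules (control sets, carry chains, LUT sharing) and the RapidSmith/RSVPack specifics are deliberately omitted.

```python
# Greedy, connectivity-driven packing sketch with an assumed site capacity.
SITE_CAPACITY = 4

def pack(cells, nets):
    """cells: list of cell names; nets: dict net -> set of cells on that net."""
    sites = []
    for cell in cells:
        best, best_score = None, 0
        for site in sites:
            if len(site) >= SITE_CAPACITY:
                continue
            shared = sum(1 for members in nets.values()
                         if cell in members and members & set(site))
            if shared > best_score:
                best, best_score = site, shared
        if best is not None:
            best.append(cell)        # join the most connected, non-full site
        else:
            sites.append([cell])     # otherwise open a new site
    return sites

nets = {"n1": {"a", "b"}, "n2": {"b", "c"}, "n3": {"d", "e"}}
print(pack(["a", "b", "c", "d", "e"], nets))   # -> [['a', 'b', 'c'], ['d', 'e']]
```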
Citations: 6
Thread shadowing: On the effectiveness of error detection at the hardware thread level
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857193
S. Meisner, M. Platzner
Dynamic thread duplication is a known redundancy technique for multi-cores. Recent research applied this concept to hybrid multi-cores for error detection and introduced thread shadowing, which runs hardware threads in the reconfigurable cores and compares their outputs for deviations at configurable signature levels. Previously published work evaluated this concept in terms of performance, error detection latency and resource consumption. In this paper we report on the error detection capabilities of thread shadowing by presenting an extensive fault injection campaign. We employ the Xilinx Soft Error Mitigation Controller for fault injection and the Xilinx Essential Bits facility to limit the fault injections to relevant bits in the configuration bitstream. Our findings from fault injection experiments with a sorting benchmark are threefold: First, up to 98% of all errors are detected by the operating system of the hybrid multi-core supported by thread shadowing. Second, thread shadowing's signature levels provide a useful trade-off between detected errors and effort needed, with around 5% of all errors detected in calls to operating system functions and around 52% of errors detected in memory accesses of the hardware thread. Third, essential-bit testing is effective and cuts down the number of bits to be tested by a factor of 14.48 compared to the total number of bits available in the configuration address space.
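A conceptual Python sketch of thread shadowing follows (not the actual hybrid multi-core implementation): the same work is executed twice, compact signatures of the two runs are compared, and the configurable signature level decides how much behaviour enters the signature, which is exactly the detection-versus-effort trade-off reported above.

```python
# Toy model: compare signatures of a primary thread and its shadow.
import hashlib

def signature(events, level):
    # level "os_calls": only coarse OS interactions; "mem_access": every access too
    selected = [e for e in events if level == "mem_access" or e[0] == "os_call"]
    return hashlib.sha256(repr(selected).encode()).hexdigest()

def run_thread(data, inject_fault=False):
    events = [("os_call", "get_data")]
    out = sorted(data)
    if inject_fault:
        out[0] ^= 1                          # simulate an SEU flipping one bit
    events += [("mem_write", i, v) for i, v in enumerate(out)]
    events.append(("os_call", "put_data"))
    return out, events

_, primary = run_thread([3, 1, 2])
_, shadow = run_thread([3, 1, 2], inject_fault=True)

for level in ("os_calls", "mem_access"):
    detected = signature(primary, level) != signature(shadow, level)
    print(f"{level:10s} error detected: {detected}")
```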
Citations: 1
A scalable latency-insensitive architecture for FPGA-accelerated semi-global matching in stereo vision applications
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857147
Jaco A. Hofmann, Jens Korinth, A. Koch
Semi-Global Matching (SGM) is a high-performance method for computing high-quality disparity maps from stereo camera images in machine vision applications. It is also suitable for direct hardware execution, e.g., in ASICs or reconfigurable logic devices. We present a highly parametrized FPGA implementation, scalable from simple low-resolution, low-power use cases up to complex real-time full-HD multi-camera scenarios. By using a latency-insensitive design style, high-level synthesis from the Bluespec SystemVerilog next-generation hardware description language, and an automated design-space exploration flow, many implementation alternatives could be examined with high productivity. The use of the Threadpool Composer system-on-chip assembly tool allows the portable mapping of the SGM accelerator to different hardware platforms. The accelerator performance exceeds that of prior fixed-architecture approaches.
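For reference, the core of SGM is the per-path cost aggregation recurrence; the Python sketch below evaluates it along a single left-to-right path with the usual P1/P2 smoothness penalties. It illustrates the algorithm only and says nothing about the paper's latency-insensitive hardware architecture.

```python
# Single-path SGM cost aggregation; a full implementation sums several paths
# and adds the per-pixel matching-cost computation, which is omitted here.
import numpy as np

def aggregate_left_to_right(cost, P1=10, P2=120):
    """cost: (width, disparities) matching costs for one scanline."""
    width, ndisp = cost.shape
    L = np.zeros_like(cost)
    L[0] = cost[0]
    for x in range(1, width):
        prev = L[x - 1]
        prev_min = prev.min()
        for d in range(ndisp):
            candidates = [prev[d],
                          prev[d - 1] + P1 if d > 0 else np.inf,
                          prev[d + 1] + P1 if d < ndisp - 1 else np.inf,
                          prev_min + P2]
            L[x, d] = cost[x, d] + min(candidates) - prev_min
    return L

costs = np.random.randint(0, 50, size=(8, 16)).astype(float)
disparity = aggregate_left_to_right(costs).argmin(axis=1)
print(disparity)   # winner-takes-all disparity per pixel along this path
```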
Citations: 4
Overloaded CDMA interconnect for Network-on-Chip (OCNoC)
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857179
K. E. Ahmed, M. Rizk, Mohammed M. Farag
Networks on Chip (NoCs) have replaced on-chip buses as the paramount communication strategy in large-scale Systems-on-Chip (SoCs). Code Division Multiple Access (CDMA) has been proposed as an interconnect fabric that can achieve high throughput and fixed transfer latency thanks to the concurrency of CDMA transmission. Overloaded CDMA Interconnect (OCI) is an architectural evolution of conventional CDMA interconnects that can double their bandwidth at marginal cost. Employing OCI in CDMA-based NoCs has the potential to provide higher bandwidth at low power and area overheads compared to other NoC architectures. Furthermore, the fixed latency and predictable performance achieved by the inherent CDMA concurrency can reduce the effort and overhead required to implement QoS. In this work, we advance the Overloaded CDMA Interconnect for Network-on-Chip (OCNoC) dynamic central router. The OCNoC router leverages the overloaded CDMA concept to reduce the overall packet transfer latency and improve the network throughput at a negligible area overhead. Dynamic code assignment is adopted to reduce the decoding complexity and transfer latency and to maximize the crossbar utilization. Two OCNoC solutions are advanced: serial and parallel CDMA encoding schemes. The OCNoC central routers are implemented and validated on a Virtex-7 VC709 FPGA kit. Evaluation results show a throughput enhancement of up to 142% with a 1.7% variation in packet latencies. Synthesized using a 65 nm ASIC standard cell library, the presented ASIC OCNoC router requires 61% less area per processing element while saving 81.5% in energy dissipation compared to conventional CDMA-based NoCs.
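The numerical sketch below illustrates the plain (non-overloaded) CDMA principle such an interconnect builds on: each sender spreads its bit with an orthogonal Walsh code, the shared medium sums the chips, and every receiver recovers its bit by correlating against the corresponding code. The overloading with additional non-orthogonal codes that gives OCI its bandwidth advantage is not modelled here.

```python
# Walsh-code spreading and despreading over a shared summing medium.
import numpy as np

def walsh(n):
    """n x n Hadamard/Walsh matrix, n a power of two."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

codes = walsh(4)                        # 4 orthogonal spreading codes
bits = np.array([1, 0, 1, 1])           # one bit per sender this cycle

# Each sender transmits bit * code; the interconnect simply sums the chips.
chips = ((2 * bits - 1)[:, None] * codes).sum(axis=0)

# Each receiver correlates the summed chips with its sender's code.
recovered = (codes @ chips > 0).astype(int)
print(recovered)                        # -> [1 0 1 1]
```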
Citations: 1