Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857154
M. Vavouras, C. Bouganis
This paper presents an area-driven Field-Programmable Gate Array (FPGA) scrubbing technique based on partial reconfiguration for Single Event Upset (SEU) mitigation. The proposed method is compared with existing techniques such as blind and on-demand scrubbing on a novel SEU mitigation framework implemented on the ZYNQ platform, supporting various SEU and scrubbing rates. A design space exploration on the availability versus data transfers from a Double Data Rate Type 3 (DDR3) memory, shows that our approach outperforms blind scrubbing for a range of availability values when a second order polynomial IP is targeted. A comparison to an existing on-demand scrubbing technique based on Dual Modular Redundancy (DMR) shows that our approach saves up to 46% area for the same case study.
{"title":"Area-driven partial reconfiguration for SEU mitigation on SRAM-based FPGAs","authors":"M. Vavouras, C. Bouganis","doi":"10.1109/ReConFig.2016.7857154","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857154","url":null,"abstract":"This paper presents an area-driven Field-Programmable Gate Array (FPGA) scrubbing technique based on partial reconfiguration for Single Event Upset (SEU) mitigation. The proposed method is compared with existing techniques such as blind and on-demand scrubbing on a novel SEU mitigation framework implemented on the ZYNQ platform, supporting various SEU and scrubbing rates. A design space exploration on the availability versus data transfers from a Double Data Rate Type 3 (DDR3) memory, shows that our approach outperforms blind scrubbing for a range of availability values when a second order polynomial IP is targeted. A comparison to an existing on-demand scrubbing technique based on Dual Modular Redundancy (DMR) shows that our approach saves up to 46% area for the same case study.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125796530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857177
Andreas Becher, Jutta Pirkl, Achim Herrmann, J. Teich, S. Wildermann
Partial Reconfiguration is a common technique on FPGA platforms to load hardware accelerators at runtime without interrupting the remaining system. One crucial element is the time needed for reconfiguration as it affects usability, performance and energy consumption. Furthermore, many systems have to share partial areas between multiple applications and users. In this paper, we introduce a novel open-source reconfiguration manager for Xilinx Zynq SoCs which a) allows partial area sharing and b) includes a hybrid reconfiguration approach utilizing both the Processor Configuration Access Port (PCAP) and the Internal Configuration Access Port (ICAP) in order to minimize reconfiguration time and system energy consumption. We evaluate our design and identify the sweet spots between energy consumption and latency of accelerator availability with an example use case. By means of the hybrid approach, a speedup for the full configuration after powering on the FPGA of up to 64 % in comparison to solely using the PCAP interface can be achieved.
{"title":"Hybrid energy-aware reconfiguration management on Xilinx Zynq SoCs","authors":"Andreas Becher, Jutta Pirkl, Achim Herrmann, J. Teich, S. Wildermann","doi":"10.1109/ReConFig.2016.7857177","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857177","url":null,"abstract":"Partial Reconfiguration is a common technique on FPGA platforms to load hardware accelerators at runtime without interrupting the remaining system. One crucial element is the time needed for reconfiguration as it affects usability, performance and energy consumption. Furthermore, many systems have to share partial areas between multiple applications and users. In this paper, we introduce a novel open-source reconfiguration manager for Xilinx Zynq SoCs which a) allows partial area sharing and b) includes a hybrid reconfiguration approach utilizing both the Processor Configuration Access Port (PCAP) and the Internal Configuration Access Port (ICAP) in order to minimize reconfiguration time and system energy consumption. We evaluate our design and identify the sweet spots between energy consumption and latency of accelerator availability with an example use case. By means of the hybrid approach, a speedup for the full configuration after powering on the FPGA of up to 64 % in comparison to solely using the PCAP interface can be achieved.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131159987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857183
Qianqiao Chen, Vaibhawa Mishra, G. Zervas
Network function virtualization (NFV) aims to decouple software network applications from their hardware in order to reduce development and deployment costs for new services. To enable the deployment of diverse network services, a reconfigurable and high performance hardware platform can bring considerable benefits to NFV. In this paper, an FPGA-based platform is proposed to perform as a protocol reconfigurable NFV switch. Logic circuit of virtual network functions can be reconfigured at run time on the proposed platform. A reconfiguration process is also proposed to enable packet loss free switch-over between virtual network functions that delivers undisrupted service. The platform can be reconfigured between Layer 1 circuit switch and Layer 2 Ethernet packet switch. Once running as a packet switch, the platform can switch over from Layer 2 Ethernet switch to Layer 3 IP parser and even Layer 4 UDP parser. Performance of the implemented 2×2 switch at 10Gbps per port delivers a minimum latency of 300 nanoseconds (circuit switch) and maximum latency of 1 microsecond. Reconfiguration between IP and UDP parser without loss of data is also demonstrated.
{"title":"Reconfigurable computing for network function virtualization: A protocol independent switch","authors":"Qianqiao Chen, Vaibhawa Mishra, G. Zervas","doi":"10.1109/ReConFig.2016.7857183","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857183","url":null,"abstract":"Network function virtualization (NFV) aims to decouple software network applications from their hardware in order to reduce development and deployment costs for new services. To enable the deployment of diverse network services, a reconfigurable and high performance hardware platform can bring considerable benefits to NFV. In this paper, an FPGA-based platform is proposed to perform as a protocol reconfigurable NFV switch. Logic circuit of virtual network functions can be reconfigured at run time on the proposed platform. A reconfiguration process is also proposed to enable packet loss free switch-over between virtual network functions that delivers undisrupted service. The platform can be reconfigured between Layer 1 circuit switch and Layer 2 Ethernet packet switch. Once running as a packet switch, the platform can switch over from Layer 2 Ethernet switch to Layer 3 IP parser and even Layer 4 UDP parser. Performance of the implemented 2×2 switch at 10Gbps per port delivers a minimum latency of 300 nanoseconds (circuit switch) and maximum latency of 1 microsecond. Reconfiguration between IP and UDP parser without loss of data is also demonstrated.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128989023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857161
T. Lieske, M. Reichenbach, Burkhard Ringlein, D. Fey
Image processing is an omnipresent topic in current embedded industrial and consumer applications. Therefore, it is important to investigate processing architectures to extract design guidelines for developing efficient image processors. While SIMD (single instruction, multiple data) processor arrays were often proposed to accelerate image processing tasks, the internal architecture of processor elements (PEs) has not been optimized. Nevertheless, it is necessary to evaluate the optimal complexity of PEs to trade off performance and architectural overhead caused by complex processor architectures. Hence, the goal of this paper is to present a deep evaluation of finding the right architectural complexity of PEs in a processor field to meet given performance and logic area constraints. In order to determine the optimal complexity, the ADL (architecture description language) based FAUPU framework for image preprocessing architectures is utilized and after evaluation extended with pipelining support. The newly introduced pipelining features enable resource-efficient performance optimizations and are a significant improvement to the FAUPU ADL. Due to the fine-grained configurability of the FAUPU architecture, several design variants can be easily generated and it is possible to evaluate the effects of instruction set architecture (ISA) complexity and pipelining on design properties and how these features are best combined. Consequently, the FAUPU framework can be used to address the question, whether it is better to use many lightweight cores or do less but more complex cores yield a greater performance to area ratio? The results show that lightweight cores are best suited to achieve a targeted frame rate with the least resources. However, more complex cores on the other hand yield better performance to area ratios.
{"title":"Dataflow optimization for programmable embedded image preprocessing accelerators","authors":"T. Lieske, M. Reichenbach, Burkhard Ringlein, D. Fey","doi":"10.1109/ReConFig.2016.7857161","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857161","url":null,"abstract":"Image processing is an omnipresent topic in current embedded industrial and consumer applications. Therefore, it is important to investigate processing architectures to extract design guidelines for developing efficient image processors. While SIMD (single instruction, multiple data) processor arrays were often proposed to accelerate image processing tasks, the internal architecture of processor elements (PEs) has not been optimized. Nevertheless, it is necessary to evaluate the optimal complexity of PEs to trade off performance and architectural overhead caused by complex processor architectures. Hence, the goal of this paper is to present a deep evaluation of finding the right architectural complexity of PEs in a processor field to meet given performance and logic area constraints. In order to determine the optimal complexity, the ADL (architecture description language) based FAUPU framework for image preprocessing architectures is utilized and after evaluation extended with pipelining support. The newly introduced pipelining features enable resource-efficient performance optimizations and are a significant improvement to the FAUPU ADL. Due to the fine-grained configurability of the FAUPU architecture, several design variants can be easily generated and it is possible to evaluate the effects of instruction set architecture (ISA) complexity and pipelining on design properties and how these features are best combined. Consequently, the FAUPU framework can be used to address the question, whether it is better to use many lightweight cores or do less but more complex cores yield a greater performance to area ratio? The results show that lightweight cores are best suited to achieve a targeted frame rate with the least resources. However, more complex cores on the other hand yield better performance to area ratios.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129159098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857186
J. Rettkowski, Konstantin Friesen, D. Göhringer
Partial reconfiguration in FPGAs increases the flexibility of a system due to dynamic replacement of hardware modules. However, more memory is needed to store all partial bitstreams and the generation of all partial bitstreams for all possible regions on the FPGA is very time-consuming. In order to overcome these issues, bitstream relocation can be used. In this paper, a novel approach that facilitates bitstream relocation with the Xilinx Vivado tool flow is presented. In addition, the approach is automated by TCL scripts that extend Vivado to RePaBit. RePaBit is successfully evaluated on the Xilinx Zynq FPGA using 1D and 2D relocation of complex modules such as MicroBlaze processors. The results show a negligible overhead in terms of area and frequency while enabling more flexibility by partial bitstream relocation as well as a faster design time.
{"title":"RePaBit: Automated generation of relocatable partial bitstreams for Xilinx Zynq FPGAs","authors":"J. Rettkowski, Konstantin Friesen, D. Göhringer","doi":"10.1109/ReConFig.2016.7857186","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857186","url":null,"abstract":"Partial reconfiguration in FPGAs increases the flexibility of a system due to dynamic replacement of hardware modules. However, more memory is needed to store all partial bitstreams and the generation of all partial bitstreams for all possible regions on the FPGA is very time-consuming. In order to overcome these issues, bitstream relocation can be used. In this paper, a novel approach that facilitates bitstream relocation with the Xilinx Vivado tool flow is presented. In addition, the approach is automated by TCL scripts that extend Vivado to RePaBit. RePaBit is successfully evaluated on the Xilinx Zynq FPGA using 1D and 2D relocation of complex modules such as MicroBlaze processors. The results show a negligible overhead in terms of area and frequency while enabling more flexibility by partial bitstream relocation as well as a faster design time.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130271876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857156
J. F. Zazo, S. López-Buedo, G. Sutter, J. Aracil
Monitoring 100 Gbps network links is a challenging task. Packet filtering allows monitoring applications to focus on the relevant data, discarding packets that do not provide any valuable information. However, such a large line rate calls for custom hardware solutions. This work presents a tool for automatically synthesizing packets filters from a custom grammar, which defines filters in a human-readable format. Thanks to parser generators (Bison) and lexical analyzers (Flex), Verilog code is automatically generated from the filter specification. Rules can be applied over a protocol, a protocol field, the packet payload, or a combination of them. The generated filters use standard AXI4-Stream interfaces, which seamlessly integrate in the packet filtering framework that we have developed for the integrated block for 100G Ethernet available in Xilinx Ultrascale devices. We present the results for two proof-of-concept packet filtering designs. Furthermore, filters are fully pipelined, so the full 100 Gb/s rate is guaranteed. As the framework uses a cut-through approach, latency is kept to a minimum. Finally, the proposed framework allows for the integration of more complex payload-level filters, written in C language with the Vivado-HLS tool.
{"title":"Automated synthesis of FPGA-based packet filters for 100 Gbps network monitoring applications","authors":"J. F. Zazo, S. López-Buedo, G. Sutter, J. Aracil","doi":"10.1109/ReConFig.2016.7857156","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857156","url":null,"abstract":"Monitoring 100 Gbps network links is a challenging task. Packet filtering allows monitoring applications to focus on the relevant data, discarding packets that do not provide any valuable information. However, such a large line rate calls for custom hardware solutions. This work presents a tool for automatically synthesizing packets filters from a custom grammar, which defines filters in a human-readable format. Thanks to parser generators (Bison) and lexical analyzers (Flex), Verilog code is automatically generated from the filter specification. Rules can be applied over a protocol, a protocol field, the packet payload, or a combination of them. The generated filters use standard AXI4-Stream interfaces, which seamlessly integrate in the packet filtering framework that we have developed for the integrated block for 100G Ethernet available in Xilinx Ultrascale devices. We present the results for two proof-of-concept packet filtering designs. Furthermore, filters are fully pipelined, so the full 100 Gb/s rate is guaranteed. As the framework uses a cut-through approach, latency is kept to a minimum. Finally, the proposed framework allows for the integration of more complex payload-level filters, written in C language with the Vivado-HLS tool.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130282312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857180
Travis Haroldsen, B. Nelson, B. Hutchings
Academic packing algorithms have typically been limited to theoretical architectures. In this paper, we describe RSVPack, a packing algorithm built on top of RapidSmith to target the Xilinx Virtex 6 architecture. We integrate our packer into the Xilinx ISE CAD flow and demonstrate our packer tool by packing a set of benchmark circuits and performing routing and timing analysis inside ISE.
学术上的打包算法通常局限于理论架构。在本文中,我们描述了RSVPack,这是一种基于RapidSmith构建的针对Xilinx Virtex 6架构的打包算法。我们将封隔器集成到Xilinx ISE CAD流程中,并通过在ISE中封装一组基准电路并执行路由和时序分析来演示我们的封隔器工具。
{"title":"Packing a modern Xilinx FPGA using RapidSmith","authors":"Travis Haroldsen, B. Nelson, B. Hutchings","doi":"10.1109/ReConFig.2016.7857180","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857180","url":null,"abstract":"Academic packing algorithms have typically been limited to theoretical architectures. In this paper, we describe RSVPack, a packing algorithm built on top of RapidSmith to target the Xilinx Virtex 6 architecture. We integrate our packer into the Xilinx ISE CAD flow and demonstrate our packer tool by packing a set of benchmark circuits and performing routing and timing analysis inside ISE.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116358327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857193
S. Meisner, M. Platzner
Dynamic thread duplication is a known redundancy technique for multi-cores. Recent research applied this concept to hybrid multi-cores for error detection and introduced thread shadowing that runs hardware threads in the reconfigurable cores and compares their outputs for deviation at configurable signature levels. Previously published work evaluated this concept in terms of performance, error detection latency and resource consumption. In this paper we report on the error detection capabilities of thread shadowing by presenting an extensive fault injection campaign. We employ the Xilinx Soft Error Mitigation Controller for fault injection and the Xilinx Essential Bit facility to limit the fault injections to relevant bits in the configuration bitstream. Our findings from fault injection experiments with a sorting benchmark are threefold: First, up to 98% of all errors are detected by the operating system of the hybrid multi-core supported by thread shadowing. Second, thread shadowing's signature levels provide a useful trade-off between detected errors and effort needed, with around 5% of all errors detected in calls to operating system functions and around 52% of errors detected in memory accesses of the hardware thread. Third, essential bit testing is effective and cuts down the amount of bits to be tested by a factor of 14.48 compared to the total amount of bits available in the configuration address space.
{"title":"Thread shadowing: On the effectiveness of error detection at the hardware thread level","authors":"S. Meisner, M. Platzner","doi":"10.1109/ReConFig.2016.7857193","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857193","url":null,"abstract":"Dynamic thread duplication is a known redundancy technique for multi-cores. Recent research applied this concept to hybrid multi-cores for error detection and introduced thread shadowing that runs hardware threads in the reconfigurable cores and compares their outputs for deviation at configurable signature levels. Previously published work evaluated this concept in terms of performance, error detection latency and resource consumption. In this paper we report on the error detection capabilities of thread shadowing by presenting an extensive fault injection campaign. We employ the Xilinx Soft Error Mitigation Controller for fault injection and the Xilinx Essential Bit facility to limit the fault injections to relevant bits in the configuration bitstream. Our findings from fault injection experiments with a sorting benchmark are threefold: First, up to 98% of all errors are detected by the operating system of the hybrid multi-core supported by thread shadowing. Second, thread shadowing's signature levels provide a useful trade-off between detected errors and effort needed, with around 5% of all errors detected in calls to operating system functions and around 52% of errors detected in memory accesses of the hardware thread. Third, essential bit testing is effective and cuts down the amount of bits to be tested by a factor of 14.48 compared to the total amount of bits available in the configuration address space.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115398969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857147
Jaco A. Hofmann, Jens Korinth, A. Koch
Semi-Global Matching (SGM) is a high-performance method for computing high-quality disparity maps from stereo camera images in machine vision applications. It is also suitable for direct hardware execution, e.g., in ASICs or reconfigurable logic devices. We present a highly parametrized FPGA implementation, scalable from simple low-resolution low-power use-cases, up to complex real-time full-HD multi-camera scenarios. By using a latency-insensitive design style, high-level synthesis from the Bluespec SystemVerilog next-generation hardware description language, and an automated design-space exploration flow, many implementation alternatives could be examined with high productivity. The use of the Threadpool Composer system-on-chip assembly tool allows the portable mapping of the SGM accelerator to different hardware platforms. The accelerator performance exceeds that of prior fixed-architecture approaches.
{"title":"A scalable latency-insensitive architecture for FPGA-accelerated semi-global matching in stereo vision applications","authors":"Jaco A. Hofmann, Jens Korinth, A. Koch","doi":"10.1109/ReConFig.2016.7857147","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857147","url":null,"abstract":"Semi-Global Matching (SGM) is a high-performance method for computing high-quality disparity maps from stereo camera images in machine vision applications. It is also suitable for direct hardware execution, e.g., in ASICs or reconfigurable logic devices. We present a highly parametrized FPGA implementation, scalable from simple low-resolution low-power use-cases, up to complex real-time full-HD multi-camera scenarios. By using a latency-insensitive design style, high-level synthesis from the Bluespec SystemVerilog next-generation hardware description language, and an automated design-space exploration flow, many implementation alternatives could be examined with high productivity. The use of the Threadpool Composer system-on-chip assembly tool allows the portable mapping of the SGM accelerator to different hardware platforms. The accelerator performance exceeds that of prior fixed-architecture approaches.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125003678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-11-01DOI: 10.1109/ReConFig.2016.7857179
K. E. Ahmed, M. Rizk, Mohammed M. Farag
Networks on Chip (NoCs) have replaced on-chip buses as the paramount communication strategy in large scale Systems-on-Chips (SoCs). Code Division Multiple Access (CDMA) has been proposed as an interconnect fabric that can achieve high throughput and fixed transfer latency due to the CDMA transmission concurrency. Overloaded CDMA Interconnect (OCI) is an architectural evolution of the conventional CDMA interconnects that can double their bandwidth at marginal cost. Employing OCI in CDMA-based NoCs has the potential of providing higher bandwidth at low-power and -area overheads compared to other NoC architectures. Furthermore, fixed latency and predictable performance achieved by the inherent CDMA concurrency can reduce the effort and overhead required to implement QoS. In this work, we advance the Overloaded CDMA interconnect for Network on Chip (OCNoC) dynamic central router. The OCNoC router leverages the overloaded CDMA concept to reduce the overall packet transfer latency and improve the network throughput at a negligible area overhead. Dynamic code assignment is adopted to reduce the decoding complexity and transfer latency and maximize the crossbar utilization. Two OCNoC solutions are advanced, serial and parallel CDMA encoding schemes. The OCNoC central routers are implemented and validated on a Virtex-7 VC709 FPGA kit. Evaluation results show a throughput enhancement up to 142% with a 1.7% variation in packet latencies. Synthesized using a 65 nm ASIC standard cell library, the presented ASIC OCNoC router requires 61% less area per processing element at 81.5% saving in energy dissipation compared to conventional CDMA-based NoCs.
{"title":"Overloaded CDMA interconnect for Network-on-Chip (OCNoC)","authors":"K. E. Ahmed, M. Rizk, Mohammed M. Farag","doi":"10.1109/ReConFig.2016.7857179","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857179","url":null,"abstract":"Networks on Chip (NoCs) have replaced on-chip buses as the paramount communication strategy in large scale Systems-on-Chips (SoCs). Code Division Multiple Access (CDMA) has been proposed as an interconnect fabric that can achieve high throughput and fixed transfer latency due to the CDMA transmission concurrency. Overloaded CDMA Interconnect (OCI) is an architectural evolution of the conventional CDMA interconnects that can double their bandwidth at marginal cost. Employing OCI in CDMA-based NoCs has the potential of providing higher bandwidth at low-power and -area overheads compared to other NoC architectures. Furthermore, fixed latency and predictable performance achieved by the inherent CDMA concurrency can reduce the effort and overhead required to implement QoS. In this work, we advance the Overloaded CDMA interconnect for Network on Chip (OCNoC) dynamic central router. The OCNoC router leverages the overloaded CDMA concept to reduce the overall packet transfer latency and improve the network throughput at a negligible area overhead. Dynamic code assignment is adopted to reduce the decoding complexity and transfer latency and maximize the crossbar utilization. Two OCNoC solutions are advanced, serial and parallel CDMA encoding schemes. The OCNoC central routers are implemented and validated on a Virtex-7 VC709 FPGA kit. Evaluation results show a throughput enhancement up to 142% with a 1.7% variation in packet latencies. Synthesized using a 65 nm ASIC standard cell library, the presented ASIC OCNoC router requires 61% less area per processing element at 81.5% saving in energy dissipation compared to conventional CDMA-based NoCs.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129369088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}