Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857159
Sen Ma, D. Andrews, Shanyuan Gao, Jaime Cummins
In this paper, we introduce a new design flow and architecture that lets programmers replace synthesis with compilation when creating custom accelerators in data center and warehouse-scale computers that include reconfigurable many-core architectures. In our approach, we virtualize FPGAs into pre-defined partially reconfigurable tiles. We then define a run-time interpreter that assembles bitstream versions of programming patterns into the tiles. The bitstreams and the software executables are maintained in libraries accessed by the application programmers. Synthesis occurs hand in hand with the initial coding of the software programming patterns, when a Domain Specific Language is first created for the application programmers. Initial results show that the approach allows hardware accelerators to be compiled 100x faster than the same functionality can be synthesized. Initial performance results further show that a compilation/interpretation approach can achieve approximately equivalent performance for matrix operations and filtering compared to synthesizing a custom accelerator.
{"title":"Breeze computing: A just in time (JIT) approach for virtualizing FPGAs in the cloud","authors":"Sen Ma, D. Andrews, Shanyuan Gao, Jaime Cummins","doi":"10.1109/ReConFig.2016.7857159","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857159","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"116431535"}
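The run-time assembly step described in the Breeze computing abstract can be pictured with a small sketch. It assumes a library keyed by pattern name and a fixed list of tiles; the names `BITSTREAM_LIBRARY`, `Tile`, and `assemble`, and the placeholder byte strings, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical library: pattern name -> pre-synthesized partial bitstream.
# Real entries would be bitstreams produced once, when the DSL is created.
BITSTREAM_LIBRARY: Dict[str, bytes] = {
    "map_mul": b"<bitstream:map_mul>",
    "reduce_add": b"<bitstream:reduce_add>",
    "fir_filter": b"<bitstream:fir_filter>",
}


@dataclass
class Tile:
    """One pre-defined partially reconfigurable region of the FPGA."""
    index: int
    pattern: Optional[str] = None  # pattern currently configured, if any


def assemble(program: List[str], tiles: List[Tile]) -> List[Tile]:
    """Place each pattern of `program` into the next free tile, in order.

    No synthesis happens here: each step is a library lookup followed by
    a (simulated) partial reconfiguration of one tile.
    """
    missing = [name for name in program if name not in BITSTREAM_LIBRARY]
    if missing:
        raise KeyError(f"no pre-built bitstream for: {missing}")
    free = [tile for tile in tiles if tile.pattern is None]
    if len(program) > len(free):
        raise RuntimeError("not enough free tiles for this program")
    for name, tile in zip(program, free):
        _bitstream = BITSTREAM_LIBRARY[name]  # stand-in for loading the tile
        tile.pattern = name
    return tiles


tiles = assemble(["map_mul", "reduce_add"], [Tile(i) for i in range(4)])
print([tile.pattern for tile in tiles])
```

The point of the sketch is only that the expensive work (synthesis) is done ahead of time, so the per-application step reduces to lookups and tile loads.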
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857176
Bernhard Jungk, Marc Stöttinger
In this paper, we revisit lightweight FPGA implementations of SHA-3 and improve upon the state of the art by applying a new optimization technique, based on a shallow pipeline, to the slice-oriented architecture. As a result, the implementation area shrinks by almost one quarter (23%) compared to the smallest implementation reported so far for Virtex-5 FPGAs. The proposed design also improves the throughput-area ratio by 59%. For Virtex-6 FPGAs, the improvements are even larger, with a throughput-area ratio more than 150% higher than previously reported results for this FPGA. Furthermore, we evaluate several additional implementation trade-offs. First, we provide the maximum number of pipeline stages for lightweight architectures that process several slices in parallel and for variants of SHA-3 with only 800 and 400 bits of internal state. Second, we evaluate several hardware interfaces. This evaluation shows that the hardware interface may have a significant impact on area consumption and throughput.
{"title":"Hobbit — Smaller but faster than a dwarf: Revisiting lightweight SHA-3 FPGA implementations","authors":"Bernhard Jungk, Marc Stöttinger","doi":"10.1109/ReConFig.2016.7857176","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857176","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"114581675"}
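The two Virtex-5 figures in the SHA-3 abstract can be combined into a rough estimate of the implied throughput change. The sketch assumes that both percentages are measured against the same earlier design and that the throughput-area ratio is simply throughput divided by area; neither assumption is confirmed by the abstract, and the function name is illustrative.

```python
def implied_throughput_factor(area_reduction: float, ratio_gain: float) -> float:
    """Throughput factor implied by an area reduction and a throughput/area gain.

    ratio_new / ratio_old = (T_new / A_new) / (T_old / A_old)
    =>  T_new / T_old = (ratio_new / ratio_old) * (A_new / A_old)
    """
    area_factor = 1.0 - area_reduction   # A_new / A_old
    ratio_factor = 1.0 + ratio_gain      # ratio_new / ratio_old
    return ratio_factor * area_factor


# Virtex-5 figures from the abstract: 23% less area, 59% better ratio.
factor = implied_throughput_factor(0.23, 0.59)
print(f"{factor:.3f}")  # prints 1.224
```

Under these assumptions the design would be roughly 22% faster as well as smaller, which is consistent with the "smaller but faster" framing of the title.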
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857181
Ernst Houtgast, V. Sima, G. Marchiori, K. Bertels, Z. Al-Ars
Next Generation Sequencing techniques have dramatically reduced the cost of sequencing genetic material, resulting in huge amounts of sequenced data. Processing this data poses major challenges in terms of both performance and power efficiency. Heterogeneous computing can help on both fronts by enabling faster and more power-efficient solutions. In this paper, the power efficiency of the BWA-MEM algorithm, a popular tool for genomic data mapping, is studied on two heterogeneous architectures. The performance and power efficiency of an FPGA-based implementation using a single Xilinx Virtex-7 FPGA on an Alpha Data add-in card are compared to a GPU-based implementation using an NVIDIA GeForce GTX 970 and to the software-only baseline system. By offloading the Seed Extension phase onto an accelerator, both implementations achieve a two-fold speedup in overall application-level performance over the software-only implementation. Moreover, the highly customizable nature of the FPGA results in much higher power efficiency, as the FPGA power consumption is less than one fourth of that of the GPU. To facilitate platform- and tool-agnostic comparisons, the unit base pairs per Joule is introduced as a measure of power efficiency. The FPGA design is able to map up to 44 thousand base pairs per Joule, a 2.1x gain in power efficiency compared to the software-only baseline.
{"title":"Power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms","authors":"Ernst Houtgast, V. Sima, G. Marchiori, K. Bertels, Z. Al-Ars","doi":"10.1109/ReConFig.2016.7857181","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857181","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"131671640"}
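The base pairs per Joule metric from the BWA-MEM abstract is throughput divided by power, or equivalently mapped base pairs divided by consumed energy. A minimal sketch follows; the run parameters in the usage example are invented for illustration and are not measurements from the paper.

```python
def base_pairs_per_joule(base_pairs: float, seconds: float, watts: float) -> float:
    """Power efficiency as mapped base pairs per Joule of energy consumed."""
    energy_joules = watts * seconds
    return base_pairs / energy_joules


# Hypothetical run: 1.1e9 base pairs mapped in 500 s at an average of 50 W.
print(base_pairs_per_joule(1.1e9, 500.0, 50.0))  # prints 44000.0

# The abstract reports 44 thousand bp/J as a 2.1x gain over the
# software-only baseline, which implies a baseline of about 21 thousand bp/J.
implied_baseline = 44_000 / 2.1
print(round(implied_baseline))  # prints 20952
```

Because the metric only needs base pairs, time, and power, it can be computed for any platform without reference to the tool that produced the mapping.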
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857155
N. Nila-Olmedo, F. Mendoza-Mondragón, A. Espinosa-Calderón, Moreno
This paper proposes a digital platform based on the Xilinx Zynq®-7000 family to process the functions performed by a smart-solid-state-transformer (3ST) in smart grid (SG) applications. These functions include: link to information and communication technologies (ICT), voltage transformation, integration of distributed renewable energy resources (DRER) and distributed energy storage devices (DESD). The Zynq platform embeds a dual-core ARM® Cortex™-A9 processor and Field Programmable Gate Array (FPGA) technology, all within a programmable system on a chip (SoC). The main advantages of this technology are modularity, scalability, quick and easy maintenance, and low cost. Experimental results are included to show some capabilities of the proposed platform in a 3ST laboratory test-bed.
{"title":"ARM+FPGA platform to manage solid-state-smart transformer in smart grid application","authors":"N. Nila-Olmedo, F. Mendoza-Mondragón, A. Espinosa-Calderón, Moreno","doi":"10.1109/ReConFig.2016.7857155","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857155","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"123004689"}
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857194
S. M. H. Ho, Maolin Wang, Ho-Cheung Ng, Hayden Kwok-Hay So
A system that augments the Apache Spark data processing framework with FPGA accelerators is presented as a way to model and deploy FPGA-assisted applications in large-scale clusters. In our proposed framework, FPGAs can optionally be used in place of the host CPU for Resilient Distributed Dataset (RDD) transformations, allowing seamless integration between gateware and software processing. Using the training of a Support Vector Machine (SVM) cell image classifier as a case study, we explore the feasibility, benefits, and challenges of such a technique. In our experiments, where data communication between CPU and FPGA is tightly controlled, a consistent speedup of up to 1.6x can be achieved for the target SVM training application as the cluster size increases. Hardware-software techniques that are crucial to achieving acceleration, such as the management of data partition size, are explored. We demonstrate the benefits of the proposed framework, while also illustrating the importance of careful hardware-software management to avoid excessive CPU-FPGA communication that can quickly diminish the benefits of FPGA acceleration.
{"title":"Towards FPGA-assisted spark: An SVM training acceleration case study","authors":"S. M. H. Ho, Maolin Wang, Ho-Cheung Ng, Hayden Kwok-Hay So","doi":"10.1109/ReConFig.2016.7857194","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857194","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"126268889"}
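The Spark abstract's remark about partition size and CPU-FPGA communication can be illustrated with a simple cost model. This is a sketch under stated assumptions, not the authors' model: offloading a partition is assumed to pay a fixed setup cost plus per-record transfer and compute costs, and every constant below is a made-up value chosen only to show the trend.

```python
def partition_time(n_records: int, use_fpga: bool,
                   cpu_s: float = 8e-6, fpga_s: float = 1e-6,
                   transfer_s: float = 3e-6, fixed_s: float = 5e-3) -> float:
    """Estimated seconds to transform one partition.

    All per-record and fixed costs are illustrative, not measurements.
    """
    if not use_fpga:
        return n_records * cpu_s
    # Offload pays a fixed setup cost plus per-record transfer and compute.
    return fixed_s + n_records * (transfer_s + fpga_s)


def speedup(n_records: int) -> float:
    """CPU time divided by FPGA time for one partition of n_records."""
    return (partition_time(n_records, use_fpga=False)
            / partition_time(n_records, use_fpga=True))


for size in (100, 10_000, 1_000_000):
    print(size, round(speedup(size), 2))
```

With these constants, small partitions run slower on the accelerator because the fixed cost dominates, while large partitions approach the ratio of per-record costs. Transfer cost caps the achievable speedup regardless of how fast the accelerator itself is.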
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857187
Robert Karam, Tamzidul Hoque, S. Ray, M. Tehranipoor, S. Bhunia
Reconfigurable hardware, such as Field Programmable Gate Arrays (FPGAs), is being increasingly deployed in diverse application areas including automotive systems, critical infrastructures, and the emerging Internet of Things (IoT), to implement customized designs. However, securing FPGA-based designs against piracy, reverse engineering, and tampering is challenging, especially for systems that require remote upgrade. In many cases, existing solutions based on bitstream encryption may not provide sufficient protection against these attacks. In this paper, we present a novel obfuscation approach for provably robust protection of FPGA bitstreams at low overhead that goes well beyond the protection offered by bitstream encryption. The approach works with existing FPGA architectures and synthesis flows, and can be used with encryption techniques, or by itself for power- and area-constrained systems. It leverages "FPGA dark silicon" (unused resources within the configurable logic blocks) to efficiently obfuscate the true functionality. We provide a detailed threat model and security analysis for the approach. We have developed a complete application mapping framework that integrates with the Altera Quartus II software. Using this CAD framework, we achieve provably strong security against all major attacks on FPGA bitstreams with an average 13% latency and 2% total power overhead for a set of benchmark circuits, as well as several large-scale open-source IP blocks on a commercial FPGA.
{"title":"Robust bitstream protection in FPGA-based systems through low-overhead obfuscation","authors":"Robert Karam, Tamzidul Hoque, S. Ray, M. Tehranipoor, S. Bhunia","doi":"10.1109/ReConFig.2016.7857187","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857187","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"117288606"}
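One way to picture how unused logic-block resources could hide a function is a look-up table with a spare input. The sketch below is an illustration only and is not claimed to be the scheme from the obfuscation paper: it assumes a 2-input function placed in a 3-input LUT, with the spare input acting as a key bit that selects between the true function and a decoy. The function name and the choice of decoy are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Lut = Dict[Tuple[int, int, int], int]


def build_obfuscated_lut(true_fn: Callable[[int, int], int],
                         decoy_fn: Callable[[int, int], int],
                         key_bit: int) -> Lut:
    """Fill a 3-input LUT so the third input selects the real function.

    The LUT contents alone do not reveal which half is the intended
    behavior; that depends on the key value applied to the spare input.
    """
    lut: Lut = {}
    for a, b, k in product((0, 1), repeat=3):
        lut[(a, b, k)] = true_fn(a, b) if k == key_bit else decoy_fn(a, b)
    return lut


# True function AND, decoy XOR, correct key bit 1 (all illustrative).
lut = build_obfuscated_lut(lambda a, b: a & b, lambda a, b: a ^ b, key_bit=1)
print([lut[(a, b, 1)] for a, b in product((0, 1), repeat=2)])  # AND outputs
print([lut[(a, b, 0)] for a, b in product((0, 1), repeat=2)])  # XOR outputs
```

The spare input costs no extra LUTs, which is in the spirit of the low area overhead the abstract reports, although the actual mechanism and its security argument are described only in the paper itself.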
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857171
Kledermon Garcia, D. L. Oliveira, R. d'Amore, L. Faria, J. L. V. Oliveira
The asynchronous paradigm is an alternative for digital system design because it eliminates the problems related to the clock signal, such as clock skew, clock distribution, and the power dissipated by the clock. An interesting style for asynchronous design, which is familiar to designers, divides the system into an asynchronous controller and a synchronous datapath. A specification known as Extended Burst-Mode (XBM) is the most suitable one for describing the asynchronous controllers in this design style. The XBM specification must meet a number of properties to be implementable. A property known as signal polarity may affect the controller performance. To satisfy signal polarity, the designer must often introduce some state transitions that do not perform any operation, which this paper calls "dead transitions". An XBM specification with dead transitions can reduce the controller performance. In this paper, we propose an algorithm that eliminates dead transitions in an XBM specification. This elimination occurs by transforming the original XBM specification, which leads to an optimization of the system performance. The algorithm was applied to seven well-known benchmarks and obtained a reduction of up to 37% in processing time.
{"title":"FPGA implementation of optimized XBM specifications by transformation for AFSMs","authors":"Kledermon Garcia, D. L. Oliveira, R. d'Amore, L. Faria, J. L. V. Oliveira","doi":"10.1109/ReConFig.2016.7857171","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857171","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"114868780"}
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857145
Nobuyuki Yahiro, Bo Liu, Atsushi Nanri, S. Nakatake, Y. Takashima, Gong Chen
Application-specific use of memory is key to developing embedded systems for IoT devices. A functional memory unit such as content addressable memory (CAM) is a good solution for network-specific applications. This work proposes a novel functional memory unit in which the function of the memory decoder can be reconfigured. In our reconfigurable mechanism, uni-switch cells, each of which can act as either logic or a wire, are embedded in an SRAM memory array. A set of connected uni-switches constitutes a programmable logic array (PLA) unit. A PLA suits a decoder because it can realize multi-input, multi-output functions in a small area compared with a look-up table (LUT). Hence, an extended decoder function is realized by PLA units inside the memory array, and a combination of PLA units makes it possible to configure various functions for stored data such as sorting, filtering, error correction, and encryption/decryption. In this paper, we present the fundamental architecture of our functional memory unit with PLA units, and demonstrate an implementation of a 32-bit full adder and a 2-bit counter using PLA units.
{"title":"A multi-functional memory unit with PLA-based reconfigurable decoder","authors":"Nobuyuki Yahiro, Bo Liu, Atsushi Nanri, S. Nakatake, Y. Takashima, Gong Chen","doi":"10.1109/ReConFig.2016.7857145","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857145","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"115613890"}
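To make the PLA idea from the memory-unit abstract concrete, the sketch below models a PLA as an AND plane of product terms feeding an OR plane, programs it as a 1-bit full adder, and chains 32 evaluations into a ripple-carry addition. The `PLA` class, the term encoding, and the ripple-carry arrangement are assumptions for illustration; the abstract does not say how the authors map their 32-bit adder onto PLA units.

```python
from typing import List, Optional, Sequence, Tuple

# A product term has one entry per input:
# 1 = input must be 1, 0 = input must be 0, None = don't care.
Term = Tuple[Optional[int], ...]


class PLA:
    """Two-level logic: AND plane of product terms, OR plane of outputs."""

    def __init__(self, and_plane: Sequence[Term], or_plane: Sequence[List[int]]):
        self.and_plane = list(and_plane)
        self.or_plane = list(or_plane)  # per output: indices of terms to OR

    def evaluate(self, inputs: Sequence[int]) -> Tuple[int, ...]:
        fired = [all(lit is None or lit == bit for lit, bit in zip(term, inputs))
                 for term in self.and_plane]
        return tuple(int(any(fired[i] for i in rows)) for rows in self.or_plane)


# Inputs (a, b, carry_in); outputs (sum, carry_out).
FULL_ADDER = PLA(
    and_plane=[
        (0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1),   # sum minterms
        (1, 1, None), (1, None, 1), (None, 1, 1),     # carry terms
    ],
    or_plane=[[0, 1, 2, 3], [4, 5, 6]],
)


def add(x: int, y: int, width: int = 32) -> Tuple[int, int]:
    """Ripple-carry addition built from repeated full-adder evaluations."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = FULL_ADDER.evaluate(((x >> i) & 1, (y >> i) & 1, carry))
        result |= s << i
    return result, carry


print(add(5, 7))            # prints (12, 0)
print(add(0xFFFFFFFF, 1))   # prints (0, 1)
```

Seven product terms cover both outputs here, whereas a LUT-based version would store a full truth table per output. That is the kind of area advantage for multi-output functions that the abstract attributes to PLAs.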
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857192
Steffen Vaas, M. Reichenbach, Ulrich Margull, D. Fey
Safety-critical applications require reliable hardware platforms with deterministic behavior. Given the increasing demand for performance, current single-core solutions are no longer sufficient. Classical multi-core processors are designed for the general case and provide high performance at the expense of determinism and reliability. In safety-critical applications, all required tasks are already known at development time. They are specified by a system description, such as AUTOSAR. Thus, a hardware architecture providing one core for each task and one physical link for each data exchange between different tasks can be derived. However, such a highly application-specific architecture is not available. The latest FPGA technologies now provide enough resources to integrate several soft-core processors in one low-cost chip. Furthermore, the cores and their connections can be arranged flexibly in an FPGA. To bridge the gap between safety-critical applications and FPGAs, this approach provides a toolchain as an addition to existing AUTOSAR design tools for automatically generating a specific hardware architecture from the metadata of an AUTOSAR description. By drastically reducing the complexity of the hardware platform, a reconfigurable, reliable, deterministic, distributed (R2-D2) hardware architecture can be created. The results show that safety-critical tasks can be executed deterministically on one chip in parallel and that multiple applications can be mapped to one low-cost FPGA. Furthermore, the latency of the system could be reduced extensively, so new application areas can be accessed.
{"title":"The R2-D2 toolchain — Automated porting of safety-critical applications to FPGAs","authors":"Steffen Vaas, M. Reichenbach, Ulrich Margull, D. Fey","doi":"10.1109/ReConFig.2016.7857192","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857192","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"123234612"}
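The derivation rule stated in the R2-D2 abstract (one core per task, one physical link per data exchange between different tasks) can be sketched in a few lines. The input format below, a task list plus `(sender, receiver, signal)` tuples, is a hypothetical stand-in for AUTOSAR metadata, and `derive_architecture` is an illustrative name rather than part of the toolchain.

```python
from typing import Dict, List, Tuple

Exchange = Tuple[str, str, str]  # (sender task, receiver task, signal name)
Link = Tuple[str, str, str]      # (sender core, receiver core, signal name)


def derive_architecture(tasks: List[str],
                        exchanges: List[Exchange]) -> Tuple[Dict[str, str], List[Link]]:
    """Assign one core per task and one dedicated link per data exchange."""
    cores = {task: f"core_{i}" for i, task in enumerate(tasks)}
    links: List[Link] = []
    for sender, receiver, signal in exchanges:
        if sender not in cores or receiver not in cores:
            raise ValueError(f"exchange {signal!r} references an unknown task")
        if sender == receiver:
            continue  # data stays on one core, so no physical link is needed
        links.append((cores[sender], cores[receiver], signal))
    return cores, links


cores, links = derive_architecture(
    ["read_sensor", "control", "actuate"],
    [("read_sensor", "control", "speed"), ("control", "actuate", "torque")],
)
print(cores)
print(links)
```

Because no core or link is shared, tasks cannot interfere with one another through scheduling or bus contention, which is the source of the determinism the abstract emphasizes.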
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857158
Thaddeus Koehn, P. Athanas
Structured matrices in which at least one element is known to always be zero commonly appear in a variety of applications, including Markov processes, MIMO communications, and eigenvalue decomposition. Since matrices with known zeros require fewer computations, generating hardware to take advantage of this allows increased throughput. The approach in this paper can generate hardware for anything ranging from very sparse to completely full matrices. When dense (all elements non-zero) matrix multiplication hardware is generated, throughput is comparable to commercially available generators. As sparsity increases, throughput improves proportionally. This method also achieves a shorter processing delay compared with other techniques for sparse matrices.
{"title":"Automating structured matrix-matrix multiplication for stream processing","authors":"Thaddeus Koehn, P. Athanas","doi":"10.1109/ReConFig.2016.7857158","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857158","journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)"},"publicationDate":"2016-11-01","platform":"Semanticscholar","paperid":"124537459"}
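The saving from known zeros described in the structured-matrix abstract can be shown in software: given masks that mark which elements may be non-zero, only the multiply-accumulate (MAC) operations that can contribute to the result are scheduled. This is a sketch of the general idea, not the authors' hardware generator; the function names and the example matrices are illustrative.

```python
from typing import List, Tuple

Mask = List[List[bool]]      # True = element may be non-zero
Matrix = List[List[int]]


def mac_schedule(a_mask: Mask, b_mask: Mask) -> List[Tuple[int, int, int]]:
    """List the (i, j, k) MACs needed for C = A @ B given structural zeros."""
    rows, inner, cols = len(a_mask), len(b_mask), len(b_mask[0])
    return [(i, j, k)
            for i in range(rows) for j in range(cols) for k in range(inner)
            if a_mask[i][k] and b_mask[k][j]]


def multiply(a: Matrix, b: Matrix, schedule: List[Tuple[int, int, int]]) -> Matrix:
    """Compute C = A @ B by executing only the scheduled MACs."""
    c = [[0] * len(b[0]) for _ in a]
    for i, j, k in schedule:
        c[i][j] += a[i][k] * b[k][j]
    return c


# Upper-triangular A (three known zeros) times a dense B.
a = [[1, 2, 3], [0, 4, 5], [0, 0, 6]]
b = [[1, 0, 2], [3, 1, 0], [0, 2, 1]]
a_mask = [[True, True, True], [False, True, True], [False, False, True]]
b_mask = [[True] * 3 for _ in range(3)]

schedule = mac_schedule(a_mask, b_mask)
print(len(schedule))             # prints 18 (a dense product needs 27)
print(multiply(a, b, schedule))  # prints [[7, 8, 5], [12, 14, 5], [0, 12, 6]]
```

In a hardware generator the schedule would be fixed at design time, so skipped MACs translate into fewer operations per result rather than a run-time test. That matches the abstract's claim that throughput improves as sparsity increases.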