Noxim: An open, extensible and cycle-accurate network on chip simulator
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245728 · pp. 162-163
V. Catania, Andrea Mineo, Salvatore Monteleone, M. Palesi, Davide Patti
Emerging on-chip communication technologies such as wireless Networks-on-Chip (WiNoCs) have been proposed as candidate solutions for addressing the scalability limitations of conventional multi-hop NoC architectures. In a WiNoC, a subset of the network nodes is equipped with a wireless interface that allows them to communicate over long distances in a single hop. This paper presents Noxim, an open, configurable, extensible, cycle-accurate NoC simulator developed in SystemC that makes it possible to analyze the performance and power figures of both conventional wired NoCs and emerging WiNoC architectures.
{"title":"Noxim: An open, extensible and cycle-accurate network on chip simulator","authors":"V. Catania, Andrea Mineo, Salvatore Monteleone, M. Palesi, Davide Patti","doi":"10.1109/ASAP.2015.7245728","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245728","url":null,"abstract":"Emerging on-chip communication technologies like wireless Networks-on-Chip (WiNoCs) have been proposed as candidate solutions for addressing the scalability limitations of conventional multi-hop NoC architectures. In a WiNoC, a subset of network nodes are equipped with a wireless interface which allows them long-range communication in a single hop. This paper presents Noxim, an open, configurable, extendible, cycle-accurate NoC simulator developed in SystemC which allows to analyze the performance and power figures of both conventional wired NoC and emerging WiNoC architectures.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"48 1","pages":"162-163"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82311856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Programmable RNS lattice-based parallel cryptographic decryption
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245723 · pp. 149-153
P. Martins, L. Sousa, J. Eynard, J. Bajard
Should quantum computing become viable, current public-key cryptographic schemes will no longer be secure. Since cryptosystems take many years to mature, research on post-quantum cryptography is now more important than ever. Here we focus on lattice-based cryptography as an alternative post-quantum cryptosystem and on improving its efficiency. We bring together several theoretical developments to produce an efficient implementation that solves the Closest Vector Problem (CVP) on Goldreich-Goldwasser-Halevi (GGH)-like cryptosystems based on the Residue Number System (RNS). We achieve speed-ups of up to 5.9× and 11.2× on the GTX 780 Ti and i7 4770K devices, respectively, compared to a single-core optimized implementation. Finally, we show that the proposed implementation is a competitive alternative to Rivest-Shamir-Adleman (RSA).
{"title":"Programmable RNS lattice-based parallel cryptographic decryption","authors":"P. Martins, L. Sousa, J. Eynard, J. Bajard","doi":"10.1109/ASAP.2015.7245723","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245723","url":null,"abstract":"Should quantum computing become viable, current public-key cryptographic schemes will no longer be valid. Since cryptosystems take many years to mature, research on post-quantum cryptography is now more important than ever. Herein, lattice-based cryptography is focused on, as an alternative post-quantum cryptosystem, to improve its efficiency. We put together several theoretical developments so as to produce an efficient implementation that solves the Closest Vector Problem (CVP) on Goldreich-Goldwasser-Halevi (GGH)-like cryptosystems based on the Residue Number System (RNS). We were able to produce speed-ups of up to 5.9 and 11.2 on the GTX 780 Ti and i7 4770K devices, respectively, when compared to a single-core optimized implementation. Finally, we show that the proposed implementation is a competitive alternative to the Rivest-Shamir-Adleman (RSA).","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"27 1","pages":"149-153"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90691957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stochastic circuit design and performance evaluation of vector quantization
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245717 · pp. 111-115
Ran Wang, Jie Han, B. Cockburn, D. Elliott
Vector quantization (VQ) is a general data compression technique with scalable implementation complexity and a potentially high compression ratio. In this paper, a novel implementation of VQ using stochastic circuits is proposed and its performance is evaluated. The stochastic and binary designs are compared at the same compression quality, and the circuits are synthesized for an industrial 28-nm cell library. The effect of varying the sequence length of the stochastic design is studied with respect to the performance metric of throughput per area (TPA). When a shortened 512-bit encoding sequence is used to obtain a lower-quality compression, the TPA is about 2.60 times that of a binary implementation of matching quality, as measured by the L1-norm error (i.e., the first-order error). Thus, the stochastic implementation outperforms the conventional binary design in terms of TPA at relatively low compression quality. By exploiting the progressive-precision feature of a stochastic circuit, readily scalable processing quality can be attained by simply halting the computation after different numbers of clock cycles.
{"title":"Stochastic circuit design and performance evaluation of vector quantization","authors":"Ran Wang, Jie Han, B. Cockburn, D. Elliott","doi":"10.1109/ASAP.2015.7245717","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245717","url":null,"abstract":"Vector quantization (VQ) is a general data compression technique that has a scalable implementation complexity and potentially a high compression ratio. In this paper, a novel implementation of VQ using stochastic circuits is proposed and its performance is evaluated. The stochastic and binary designs are compared for the same compression quality and the circuits are synthesized for an industrial 28-nm cell library. The effects of varying the sequence length of the stochastic design are studied with respect to the performance metric of throughput per area (TPA). When a shortened 512-bit encoding sequence is used to obtain a lower quality compression, the TPA is about 2.60 times that of the binary implementation with the same quality as that of the stochastic implementation measured by the L1 norm error (i.e., the first-order error). Thus, the stochastic implementation outperforms the conventional binary design in terms of TPA for a relatively low compression quality. By exploiting the progressive precision feature of a stochastic circuit, a readily scalable processing quality can be attained by simply halting the computation after different numbers of clock cycles.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"38 1","pages":"111-115"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74640030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale packet classification on FPGA
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245738 · pp. 226-233
Shijie Zhou, Yun Qu, V. Prasanna
Packet classification is a key network function enabling a variety of network applications, such as network security, Quality of Service (QoS) routing, and other value-added services. Routers perform packet classification based on a predefined rule set. Packet classification faces two challenges: (1) the data rate of network traffic keeps increasing, and (2) rule sets are becoming very large. In this paper, we propose an FPGA-based packet classification engine for large rule sets. We present a decomposition-based approach in which each field of the packet header is searched separately; the partial search results from all the fields are then merged using a merging network. Experimental results show that our design achieves a throughput of 147 Million Packets Per Second (MPPS) while supporting up to 256K rules on a state-of-the-art FPGA. Compared to prior work on FPGAs or multi-core processors, our design demonstrates significant performance improvements.
{"title":"Large-scale packet classification on FPGA","authors":"Shijie Zhou, Yun Qu, V. Prasanna","doi":"10.1109/ASAP.2015.7245738","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245738","url":null,"abstract":"Packet classification is a key network function enabling a variety of network applications, such as network security, Quality of Service (QoS) routing, and other value-added services. Routers perform packet classification based on a predefined rule set. Packet classification faces two challenges: (1) the data rate of the network traffic keeps increasing, and (2) the size of the rule sets are becoming very large. In this paper, we propose an FPGA-based packet classification engine for large rule sets. We present a decomposition-based approach, where each field of the packet header is searched separately. Then we merge the partial search results from all the fields using a merging network. Experimental results show that our design can achieve a throughput of 147 Million Packets Per Second (MPPS), while supporting upto 256K rules on a state-of-the-art FPGA. Compared to the prior works on FPGA or multi-core processors, our design demonstrates significant performance improvements.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"37 1","pages":"226-233"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81127622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245713 · pp. 82-89
Yongchao Liu, B. Schmidt
Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency, which has motivated the development of alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed through fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches are investigated, at the vector and warp levels, with atomic operations and warp shuffle functions as the fundamental building blocks. We evaluated LightSpMV on various sparse matrices and compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. On the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with speedups of up to 2.60× and 2.63× over CUSP, and up to 1.93× and 1.79× over cuSPARSE, for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.
{"title":"LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs","authors":"Yongchao Liu, B. Schmidt","doi":"10.1109/ASAP.2015.7245713","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245713","url":null,"abstract":"Compressed sparse row (CSR) is a frequently used format for sparse matrix storage. However, the state-of-the-art CSR-based sparse matrix-vector multiplication (SpMV) implementations on CUDA-enabled GPUs do not exhibit very high efficiency. This has motivated the development of some alternative storage formats for GPU computing. Unfortunately, these alternatives are incompatible with most CPU-centric programs and require dynamic conversion from CSR at runtime, thus incurring significant computational and storage overheads. We present LightSpMV, a novel CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high speed by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors. In LightSpMV, two dynamic row distribution approaches have been investigated at the vector and warp levels with atomic operations and warp shuffle functions as the fundamental building blocks. We have evaluated LightSpMV using various sparse matrices and further compared it to the CSR-based SpMV subprograms in the state-of-the-art CUSP and cuSPARSE libraries. Performance evaluation reveals that on the same Tesla K40c GPU, LightSpMV is superior to both CUSP and cuSPARSE, with a speedup of up to 2.60 and 2.63 over CUSP, and up to 1.93 and 1.79 over cuSPARSE for single and double precision, respectively. LightSpMV is available at http://lightspmv.sourceforge.net.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"82-89"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82739057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An IEEE 754 double-precision floating-point multiplier for denormalized and normalized floating-point numbers
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245706 · pp. 62-63
S. Thompson, J. Stine
This paper discusses an optimized double-precision floating-point multiplier that can handle both denormalized and normalized IEEE 754 floating-point numbers. The optimizations are described and compared against similar implementations; the main objective, however, is maintaining compliance for denormalized IEEE 754 floating-point numbers while still delivering high-performance operation on normalized numbers.
{"title":"An IEEE 754 double-precision floating-point multiplier for denormalized and normalized floating-point numbers","authors":"S. Thompson, J. Stine","doi":"10.1109/ASAP.2015.7245706","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245706","url":null,"abstract":"This paper discusses an optimized double-precision floating-point multiplier that can handle both denormalized and normalized IEEE 754 floating-point numbers. Discussions of the optimizations are given and compared versus similar implementations, however, the main objective is keeping compliant for denormalized IEEE 754 floating-point numbers while still maintaining high performance operations for normalized numbers.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"254 1","pages":"62-63"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73194842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient architecture solution for low-power real-time background subtraction
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245737 · pp. 218-225
H. Tabkhi, Majid Sabbagh, G. Schirner
Embedded vision is a rapidly growing market with a host of challenging algorithms. Among vision algorithms, Mixture of Gaussians (MoG) background subtraction is a frequently used kernel involving massive computation and communication. Tremendous challenges must be resolved to meet MoG's high computation and communication demands at the minimal power consumption that embedded deployment requires. This paper proposes a customized architecture for power-efficient realization of MoG background subtraction operating at Full-HD resolution. Our design process benefits from system-level design principles: an SLDL-captured specification (the result of high-level explorations) serves as the specification for architecture realization and hand-crafted RTL design. To optimize the architecture, this paper employs a set of optimization techniques including parallelism extraction, algorithm tuning, operation width sizing, and deep pipelining. The final MoG implementation consists of 77 pipeline stages operating at 148.5 MHz on a Zynq-7000 SoC. Furthermore, our background subtraction solution is flexible, allowing end users to adjust algorithm parameters according to scene complexity. Our results demonstrate very high efficiency for both indoor and outdoor scenes, with 145 mW on-chip power consumption and more than 600× speedup over software execution on an ARM Cortex-A9 core.
{"title":"An efficient architecture solution for low-power real-time background subtraction","authors":"H. Tabkhi, Majid Sabbagh, G. Schirner","doi":"10.1109/ASAP.2015.7245737","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245737","url":null,"abstract":"Embedded vision is a rapidly growing market with a host of challenging algorithms. Among vision algorithms, Mixture of Gaussian (MoG) background subtraction is a frequently used kernel involving massive computation and communication. Tremendous challenges need to be reolved to provide MoG's high computation and communication demands with minimal power consumption allowing its embedded deployment. This paper proposes a customized architecture for power-efficient realization of MoG background subtraction operating at Full-HD resolution. Our design process benefits from system-level design principles. An SLDL-captured specification (result of high-level explorations) serves as a specification for architecture realization and hand-crafted RTL design. To optimize the architecture, this paper employs a set of optimization techniques including parallelism extraction, algorithm tuning, operation width sizing and deep pipelining. The final MoG implementation consists of 77 pipeline stages operating at 148.5 MHz implemented on a Zynq-7000 SoC. Furthermore, our background subtraction solution is flexible allowing end users to adjust algorithm parameters according to scene complexity. Our results demonstrate a very high efficiency for both indoor and outdoor scenes with 145 mW on-chip power consumption and more than 600× speedup over software execution on ARM Cortex A9 core.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"13 1","pages":"218-225"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88200441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How can Garbage Collection be energy efficient by dynamic offloading?
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245725 · pp. 156-157
Jie Tang, Chen Liu, J. Gaudiot
Garbage Collection (GC) is still a major issue in the JVM for both mobile and cluster computing. GC offloading has been proposed to improve GC performance by delegating some or all of its operations to dedicated GC hardware. However, traditional offloading transfers work directly, without considering the phase changes in GC behavior, which falls into two groups: minor GC and major GC. Minor GC is fast and frequently invoked, while major GC is time-consuming but rare. Direct offloading makes the GC workload hop frequently between the main processor and the GC hardware, introducing noticeable overhead that offsets any potential benefit of offloading. To address this issue, we propose to offload GC dynamically through careful selection of profitable and harmful GC operations. We also present a case study on Apache Spark, a lightning-fast cluster computing platform, showing that dynamic offloading can yield nearly a 42.6% performance improvement with a concurrent 32.1% reduction in energy cost.
{"title":"How can Garbage Collection be energy efficient by dynamic offloading?","authors":"Jie Tang, Chen Liu, J. Gaudiot","doi":"10.1109/ASAP.2015.7245725","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245725","url":null,"abstract":"Garbage Collection (GC) is still a major issue in JVM for both mobile and cluster computing. GC offloading is proposed to improve the performance of GC by delivering part or all of the operations into another dedicated GC hardware. However, the traditional offloading just offloads directly not considering the phase change of GC behavior, which can be classified into two different groups: minor GC and major GC. The minor GC is fast and frequently invoked, while major GC is expensive in terms of time but seldom takes place. The direct offloading made GC workload frequently hopping between main processor and GC hardware, introduced a noticeable overhead and offset any possible benefits of workload loading. To solve this issue, we propose to offload GC dynamically by a careful selection of profitable and harmful GC operations. We also made a case study on Apache Spark, a lightning-fast cluster computing platform. It shows dynamic offloading can yield nearly 42.6% performance improvement with a concurrent 32.1% in energy cost reduction.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"11 1","pages":"156-157"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83548740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixed-signal implementation of differential decoding using binary message passing algorithms
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245718 · pp. 116-119
G. Cowan, Kevin Cushon, W. Gross
This paper presents a mixed-signal circuit implementation of reduced-complexity algorithms for decoding low-density parity-check (LDPC) codes. Based on modified differential decoding using binary message passing (MDD-BMP), binary addition using discrete-time digital circuits is replaced by continuous-time analog current summation. Potential degradation due to mismatch between current sources, P/N strength mismatch, and inverter-threshold mismatch is considered in behavioural simulation and shown to be tolerable. Area estimates suggest a reduction from 0.27 mm² to 0.11 mm² for the FG(273, 191) code. Finally, transistor-level simulation of the FG(273, 191) code in TSMC 65 nm technology shows an efficiency of 0.56 pJ/bit.
{"title":"Mixed-signal implementation of differential decoding using binary message passing algorithms","authors":"G. Cowan, Kevin Cushon, W. Gross","doi":"10.1109/ASAP.2015.7245718","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245718","url":null,"abstract":"This paper presents the mixed-signal circuit implementation of reduced complexity algorithms for decoding low-density parity check (LDPC) codes. Based on modified differential decoding using binary message passing (MDD-BMP), binary addition using discrete-time digital circuits is replaced by continuous-time analog-current summation. Potential degradation due to the mismatch between current sources, P/N strength mismatch and inverter-threshold mismatch is considered in behavioural simulation and shown to be tolerable. Area estimates suggest a reduction from 0.27 mm2 to 0.11 mm2 for the FG(273, 191) code. Finally, transistor level simulation of the FG(273, 191) code using TSMC 65 nm technology shows an efficiency of 0.56 pJ/bit.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"49 1","pages":"116-119"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79777986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On-demand fault-tolerant loop processing on massively parallel processor arrays
Pub Date: 2015-07-27 · DOI: 10.1109/ASAP.2015.7245734 · pp. 194-201
Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig, Vahid Lari
We present a compilation-based technique for providing on-demand structural redundancy for massively parallel processor arrays. Thereby, application programmers gain the capability to trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose to apply the well-known fault tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than individual processing elements. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants in terms of performance overheads and error detection latency.
{"title":"On-demand fault-tolerant loop processing on massively parallel processor arrays","authors":"Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig, Vahid Lari","doi":"10.1109/ASAP.2015.7245734","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245734","url":null,"abstract":"We present a compilation-based technique for providing on-demand structural redundancy for massively parallel processor arrays. Thereby, application programmers gain the capability to trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose to apply the well-known fault tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than individual processing elements. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants in terms of performance overheads and error detection latency.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"70 1","pages":"194-201"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90563008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}