Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016159
Adrián Dominguez, P. P. Carballo, A. Núñez
This paper describes the work done to design a SoC platform for real-time on-line pattern search in TCP packets for Deep Packet Inspection (DPI) applications. The platform is based on a Xilinx Zynq programmable SoC and includes an accelerator that implements a pattern search engine that extends the original Boyer-Moore algorithm with timing and logical rules, that produces a very complex set of rules. Also, the platform implements different modes of operation, including SIMD and MISD parallelism, which can be configured on-line. The platform is scalable depending of the analysis requirement up to 8 Gbps. High-Level synthesis and platform based design methodologies have been used to reduce the time to market of the completed system.
{"title":"Programmable SoC platform for deep packet inspection using enhanced Boyer-Moore algorithm","authors":"Adrián Dominguez, P. P. Carballo, A. Núñez","doi":"10.1109/ReCoSoC.2017.8016159","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016159","url":null,"abstract":"This paper describes the work done to design a SoC platform for real-time on-line pattern search in TCP packets for Deep Packet Inspection (DPI) applications. The platform is based on a Xilinx Zynq programmable SoC and includes an accelerator that implements a pattern search engine that extends the original Boyer-Moore algorithm with timing and logical rules, that produces a very complex set of rules. Also, the platform implements different modes of operation, including SIMD and MISD parallelism, which can be configured on-line. The platform is scalable depending of the analysis requirement up to 8 Gbps. High-Level synthesis and platform based design methodologies have been used to reduce the time to market of the completed system.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114414486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016147
Zakarya Guettatfi, Philipp Hübner, M. Platzner, B. Rinner
Visual sensor networks (VSNs) represent distributed embedded systems with tight constraints on sensing, processing, memory, communications and power consumption. VSNs are expected to scale up in the number of nodes, be required to offer more complex functionality, a higher degree of flexibility and increased autonomy. The engineering of such VSNs capable of (self-)adapting on the application and platform levels poses a formidable challenge. In this paper, we introduce a novel design approach for visual sensor nodes which is founded on computational self-awareness. Computational self-awareness maintains knowledge about the system's state and environment with models and then uses this knowledge to reason about and adapt behaviours. We discuss the concept of computational self-awareness and present our novel design approach that is centred on a reference architecture for individual VSN nodes, but can be naturally extended to networks. We present the VSN node implementation with its platform architecture and resource adaptivity and report on preliminary implementation results of a Zynq-based VSN node prototype.
{"title":"Computational self-awareness as design approach for visual sensor nodes","authors":"Zakarya Guettatfi, Philipp Hübner, M. Platzner, B. Rinner","doi":"10.1109/ReCoSoC.2017.8016147","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016147","url":null,"abstract":"Visual sensor networks (VSNs) represent distributed embedded systems with tight constraints on sensing, processing, memory, communications and power consumption. VSNs are expected to scale up in the number of nodes, be required to offer more complex functionality, a higher degree of flexibility and increased autonomy. The engineering of such VSNs capable of (self-)adapting on the application and platform levels poses a formidable challenge. In this paper, we introduce a novel design approach for visual sensor nodes which is founded on computational self-awareness. Computational self-awareness maintains knowledge about the system's state and environment with models and then uses this knowledge to reason about and adapt behaviours. We discuss the concept of computational self-awareness and present our novel design approach that is centred on a reference architecture for individual VSN nodes, but can be naturally extended to networks. We present the VSN node implementation with its platform architecture and resource adaptivity and report on preliminary implementation results of a Zynq-based VSN node prototype.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130341864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016156
S. Oyeniran, R. Ubar, Siavoosh Payandeh Azad, J. Raik
The advent of many-core system-on-chips (SoC) will involve new scalable hardware/software mechanisms that can efficiently utilize the abundance of interconnected processing elements found in these SoCs. These trends will have a great impact on the strategies for testing the systems and improving their reliability by exploiting system's re-configurability to achieve graceful degradation of system's performance. We propose a strategy of Software-Based Self-Test (SBST) to be used for testing of processing elements in many-core systems with the goal to increase fault coverage and structuring the test routines in a way which makes test-data delivery in many-core systems more efficient. A new high-level fault model is introduced, which covers a broad class of gate-level Stuck-at-Faults (SAF), conditional SAF, and bridging faults of any multiplicity in processor control paths. Two algorithms for high-level simulation-based test generation for the control path and a bit-wise pseudo-exhaustive test approach for data path are proposed. No implementation details are needed for test data generation. A novel method for proving the redundancy of high-level functional faults is presented, which allows for precise evaluation of fault coverage.
{"title":"High-level test generation for processing elements in many-core systems","authors":"S. Oyeniran, R. Ubar, Siavoosh Payandeh Azad, J. Raik","doi":"10.1109/ReCoSoC.2017.8016156","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016156","url":null,"abstract":"The advent of many-core system-on-chips (SoC) will involve new scalable hardware/software mechanisms that can efficiently utilize the abundance of interconnected processing elements found in these SoCs. These trends will have a great impact on the strategies for testing the systems and improving their reliability by exploiting system's re-configurability to achieve graceful degradation of system's performance. We propose a strategy of Software-Based Self-Test (SBST) to be used for testing of processing elements in many-core systems with the goal to increase fault coverage and structuring the test routines in a way which makes test-data delivery in many-core systems more efficient. A new high-level fault model is introduced, which covers a broad class of gate-level Stuck-at-Faults (SAF), conditional SAF, and bridging faults of any multiplicity in processor control paths. Two algorithms for high-level simulation-based test generation for the control path and a bit-wise pseudo-exhaustive test approach for data path are proposed. No implementation details are needed for test data generation. A novel method for proving the redundancy of high-level functional faults is presented, which allows for precise evaluation of fault coverage.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116069664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016155
Poona Bahrebar, D. Stroobandt
Networks with torus interconnection topology are widely used due to the symmetry in traffic distribution. In order to ensure deadlock-freedom and provide adaptive routing in torus, at least two Virtual Channels (VCs) per physical channel are required to break the cyclic channel dependencies. However, VCs increase the arbitration latency and consume large power/area overheads which is undesirable, particularly for on-chip networks with limited power/area budgets. In this paper, we propose a novel technique for routing in wormhole-switched 2D torus networks. The proposed method relies on the Abacus Turn Model (AbTM) and Worm-Bubble Flow Control (WBFC) to support adaptive and deadlock-free routing without using VCs. Furthermore, the network blocking is reduced by providing on-demand routing adaptiveness through reconfiguration. The experimental results demonstrate the efficiency of the proposed scheme in terms of performance and hardware overhead.
{"title":"Adaptive and reconfigurable bubble routing technique for 2D Torus interconnection networks","authors":"Poona Bahrebar, D. Stroobandt","doi":"10.1109/ReCoSoC.2017.8016155","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016155","url":null,"abstract":"Networks with torus interconnection topology are widely used due to the symmetry in traffic distribution. In order to ensure deadlock-freedom and provide adaptive routing in torus, at least two Virtual Channels (VCs) per physical channel are required to break the cyclic channel dependencies. However, VCs increase the arbitration latency and consume large power/area overheads which is undesirable, particularly for on-chip networks with limited power/area budgets. In this paper, we propose a novel technique for routing in wormhole-switched 2D torus networks. The proposed method relies on the Abacus Turn Model (AbTM) and Worm-Bubble Flow Control (WBFC) to support adaptive and deadlock-free routing without using VCs. Furthermore, the network blocking is reduced by providing on-demand routing adaptiveness through reconfiguration. The experimental results demonstrate the efficiency of the proposed scheme in terms of performance and hardware overhead.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133044775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016150
Martha Johanna Sepúlveda, Mathieu Gross, A. Zankl, G. Sigl
The growing complexity of Systems-on-Chips (SoCs) increases the risk of software attacks during runtime. A critical threat to system security are so-called side-channel attacks based on the processor cache and its usage during the execution of cryptographic algorithms. Recent publications have analyzed cache attacks on mobile devices and network-on-chip platforms. In this work, we investigate cache attacks on bus-like tile-based Multi-Processor Systems-on-Chips (MPSoCs). This work presents two contributions. First, we demonstrate a trace-driven cache attack on AES-128 based on the exploitation of bus communication. Second, we integrate two countermeasures (Shuffling and Mini-table) and evaluate their impact on the trace-based cache attack and on the performance of the system. The results illustrate that trace-driven attacks based on bus communication are a non-negligible threat in SoC environments. The results also show that the protection techniques are feasible to implement and that they are able to mitigate the attacks.
{"title":"Towards trace-driven cache attacks on Systems-on-Chips — exploiting bus communication","authors":"Martha Johanna Sepúlveda, Mathieu Gross, A. Zankl, G. Sigl","doi":"10.1109/ReCoSoC.2017.8016150","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016150","url":null,"abstract":"The growing complexity of Systems-on-Chips (SoCs) increases the risk of software attacks during runtime. A critical threat to system security are so-called side-channel attacks based on the processor cache and its usage during the execution of cryptographic algorithms. Recent publications have analyzed cache attacks on mobile devices and network-on-chip platforms. In this work, we investigate cache attacks on bus-like tile-based Multi-Processor Systems-on-Chips (MPSoCs). This work presents two contributions. First, we demonstrate a trace-driven cache attack on AES-128 based on the exploitation of bus communication. Second, we integrate two countermeasures (Shuffling and Mini-table) and evaluate their impact on the trace-based cache attack and on the performance of the system. The results illustrate that trace-driven attacks based on bus communication are a non-negligible threat in SoC environments. The results also show that the protection techniques are feasible to implement and that they are able to mitigate the attacks.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133170165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016153
Marcel Essig, K. F. Ackermann
State of the art FPGAs comprise various architectural features providing the performance and flexibility required to comply with growing real-time demands of today's industrial applications. Nevertheless, the requirements on engineering expertise in order to exploit these platform features significantly increased during the past few years, consequently raising product costs and the time-to-market as well. Especially the feature of dynamic partial reconfiguration, enabling timedivision multiplexing of resources within the reconfigurable fabric, is barely adopted by industry yet. This paper introduces a lightweight co-processing framework, taking advantage of an embedded processor closely coupled with the programmable logic inside the FPGA. The basic idea of this concept is to implement the sequential control flow of applications in software, while reconfigurable hardware accelerators may be utilized on-demand, in order to increase the performance on computation-intensive tasks. A hardware abstraction layer hides complex architectural processes and provides software engineers with a set of routines, enabling run-time requests and the interfacing of co-processors from within the code. Implementation details and sequences of operations are given and discussed.
{"title":"On-demand instantiation of co-processors on dynamically reconfigurable FPGAs","authors":"Marcel Essig, K. F. Ackermann","doi":"10.1109/ReCoSoC.2017.8016153","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016153","url":null,"abstract":"State of the art FPGAs comprise various architectural features providing the performance and flexibility required to comply with growing real-time demands of today's industrial applications. Nevertheless, the requirements on engineering expertise in order to exploit these platform features significantly increased during the past few years, consequently raising product costs and the time-to-market as well. Especially the feature of dynamic partial reconfiguration, enabling timedivision multiplexing of resources within the reconfigurable fabric, is barely adopted by industry yet. This paper introduces a lightweight co-processing framework, taking advantage of an embedded processor closely coupled with the programmable logic inside the FPGA. The basic idea of this concept is to implement the sequential control flow of applications in software, while reconfigurable hardware accelerators may be utilized on-demand, in order to increase the performance on computation-intensive tasks. A hardware abstraction layer hides complex architectural processes and provides software engineers with a set of routines, enabling run-time requests and the interfacing of co-processors from within the code. Implementation details and sequences of operations are given and discussed.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133192953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016146
A. Nocua, Florent Bruguier, G. Sassatelli, A. Gamatie
Multicore system analysis requires efficient solutions for architectural parameter and scalability exploration. Long simulation time is the main drawback of current simulation approaches. In order to reduce the simulation time while keeping the accuracy levels, trace-driven simulation approaches have been developed. However, existing approaches do not allow multicore exploration or do not capture the behavior of multithreaded programs. Based on the gem5 simulator, we developed a novel synchronization mechanism for multicore analysis based on the trace collection of synchronization events, instruction and dependencies. It allows efficient architectural parameter and scalability exploration with acceptable simulation speed and accuracy.
{"title":"ElasticSimMATE: A fast and accurate gem5 trace-driven simulator for multicore systems","authors":"A. Nocua, Florent Bruguier, G. Sassatelli, A. Gamatie","doi":"10.1109/ReCoSoC.2017.8016146","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016146","url":null,"abstract":"Multicore system analysis requires efficient solutions for architectural parameter and scalability exploration. Long simulation time is the main drawback of current simulation approaches. In order to reduce the simulation time while keeping the accuracy levels, trace-driven simulation approaches have been developed. However, existing approaches do not allow multicore exploration or do not capture the behavior of multithreaded programs. Based on the gem5 simulator, we developed a novel synchronization mechanism for multicore analysis based on the trace collection of synchronization events, instruction and dependencies. It allows efficient architectural parameter and scalability exploration with acceptable simulation speed and accuracy.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115344488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016158
Yidi Liu, M. Villaverde, F. Moreno, Benjamin Carrión Schäfer
This work presents a method to characterize and optimize hardware accelerators (HWaccs) given as Behavioral IPs (BIPs) mapped as loosely coupled HWaccs in heterogenous MPSoCs. The proposed HWacc exploration flow is composed of two main stages. The first stage characterizes each BIPs individually by performing a High-Level Synthesis (HLS) Design Space Exploration (DSE) on each of the BIPs to obtain a trade-off curve of Pareto-optimal designs. It then continues by exploring the system-level design space using these Pareto-optimal designs and finding configurations with unique area vs. performance trade-offs. Our proposed system-level explorer makes use of cycle-accurate simulation models to explore the search space fast and accurately. Experimental results show that our proposed method works well for MPSoCs of different sizes ranging from systems with 1 to 4 masters and with 3 to 7 HWaccs.
{"title":"Characterization and optimization of behavioral hardware accelerators in heterogeneous MPSoCs","authors":"Yidi Liu, M. Villaverde, F. Moreno, Benjamin Carrión Schäfer","doi":"10.1109/ReCoSoC.2017.8016158","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016158","url":null,"abstract":"This work presents a method to characterize and optimize hardware accelerators (HWaccs) given as Behavioral IPs (BIPs) mapped as loosely coupled HWaccs in heterogenous MPSoCs. The proposed HWacc exploration flow is composed of two main stages. The first stage characterizes each BIPs individually by performing a High-Level Synthesis (HLS) Design Space Exploration (DSE) on each of the BIPs to obtain a trade-off curve of Pareto-optimal designs. It then continues by exploring the system-level design space using these Pareto-optimal designs and finding configurations with unique area vs. performance trade-offs. Our proposed system-level explorer makes use of cycle-accurate simulation models to explore the search space fast and accurately. Experimental results show that our proposed method works well for MPSoCs of different sizes ranging from systems with 1 to 4 masters and with 3 to 7 HWaccs.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125064189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016161
Tsotne Putkaradze, Siavoosh Payandeh Azad, Behrad Niazmand, J. Raik, G. Jervan
The current trend of aggressive technology scaling results in a decrease in system's reliability. This motivates investigation of fault-resilient architectures which provide graceful degradation of system's functionality. In this paper, three novel fault-resilient Network-on-Chip (NoC) router architectures are proposed. These architectures, exploit the regularity of the router and reallocate available existing and spare units to maintain functionality of certain turns. The resource reallocation is performed transparently from system's resource manager and is based on predefined priorities. A new metric for architecture reliability comparison based on reliability block diagrams is introduced. In contrast to Silicone Protection Factor (SPF) metric, the proposed metric also takes into account the areas of different units. Area overhead and reliability of proposed architectures are compared with Triple Modular Redundancy (TMR) and Unit-Duplication mechanisms. All proposed architectures showed remarkable reliability improvement compared to original, TMR and Unit Duplication architectures; while at the same time, their area overhead is less than or equal to unit-duplication mechanisms.
{"title":"Fault-resilient NoC router with transparent resource allocation","authors":"Tsotne Putkaradze, Siavoosh Payandeh Azad, Behrad Niazmand, J. Raik, G. Jervan","doi":"10.1109/ReCoSoC.2017.8016161","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016161","url":null,"abstract":"The current trend of aggressive technology scaling results in a decrease in system's reliability. This motivates investigation of fault-resilient architectures which provide graceful degradation of system's functionality. In this paper, three novel fault-resilient Network-on-Chip (NoC) router architectures are proposed. These architectures, exploit the regularity of the router and reallocate available existing and spare units to maintain functionality of certain turns. The resource reallocation is performed transparently from system's resource manager and is based on predefined priorities. A new metric for architecture reliability comparison based on reliability block diagrams is introduced. In contrast to Silicone Protection Factor (SPF) metric, the proposed metric also takes into account the areas of different units. Area overhead and reliability of proposed architectures are compared with Triple Modular Redundancy (TMR) and Unit-Duplication mechanisms. All proposed architectures showed remarkable reliability improvement compared to original, TMR and Unit Duplication architectures; while at the same time, their area overhead is less than or equal to unit-duplication mechanisms.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124157693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2017-07-01DOI: 10.1109/ReCoSoC.2017.8016148
Antoniette Mondigo, Tomohiro Ueno, Daichi Tanaka, K. Sano, S. Yamamoto
Stream computing in Field Programmable Gate Arrays (FPGAs) is seen as a promising solution in delivering the necessary performance and energy efficiency requirements of compute-intensive applications like numerical simulations. The inherent structure and customizability of FPGAs naturally make them the better alternative in achieving a highly-scalable computing design solution. This paper presents a scalable custom computing approach through temporal parallelism by increasing the depth of a computing pipeline in a 1D ring of cascaded FPGAs with high-speed, low-latency communication links. Spatial parallelism is also explored by replicating the computing core inside the FPGAs to further increase throughput. Due to communication bandwidth limitations, a hardware-based lossless bandwidth compression scheme was utilized in order to alleviate this bottleneck and transfer more data streams. A performance model is presented for the scalability analysis and performance estimation of this approach. For evaluation and verification, an actual numerical simulation was implemented on an Intel Arria 10 FPGA with spatially paralleled computing cores. Initial results show that the measured performance ratings are close to the predicted values using the performance model. Similarly, it was also demonstrated that the 1D ring topology of multiple FPGAs with bandwidth-compressed links can scale the performance when a sufficiently large data set is computed, even with a deeper pipeline and insufficient inter-FPGA bandwidth.
{"title":"Design and scalability analysis of bandwidth-compressed stream computing with multiple FPGAs","authors":"Antoniette Mondigo, Tomohiro Ueno, Daichi Tanaka, K. Sano, S. Yamamoto","doi":"10.1109/ReCoSoC.2017.8016148","DOIUrl":"https://doi.org/10.1109/ReCoSoC.2017.8016148","url":null,"abstract":"Stream computing in Field Programmable Gate Arrays (FPGAs) is seen as a promising solution in delivering the necessary performance and energy efficiency requirements of compute-intensive applications like numerical simulations. The inherent structure and customizability of FPGAs naturally make them the better alternative in achieving a highly-scalable computing design solution. This paper presents a scalable custom computing approach through temporal parallelism by increasing the depth of a computing pipeline in a 1D ring of cascaded FPGAs with high-speed, low-latency communication links. Spatial parallelism is also explored by replicating the computing core inside the FPGAs to further increase throughput. Due to communication bandwidth limitations, a hardware-based lossless bandwidth compression scheme was utilized in order to alleviate this bottleneck and transfer more data streams. A performance model is presented for the scalability analysis and performance estimation of this approach. For evaluation and verification, an actual numerical simulation was implemented on an Intel Arria 10 FPGA with spatially paralleled computing cores. Initial results show that the measured performance ratings are close to the predicted values using the performance model. Similarly, it was also demonstrated that the 1D ring topology of multiple FPGAs with bandwidth-compressed links can scale the performance when a sufficiently large data set is computed, even with a deeper pipeline and insufficient inter-FPGA bandwidth.","PeriodicalId":393701,"journal":{"name":"2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129566660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}