With the wide application of FPGAs in adaptive computing systems, there is an increasing need to support design automation for partially reconfigurable (PR) FPGAs. However, there is a missing link between CAD tools for PR FPGAs and existing widely used CAD tools such as VPR. Hence, in this work we propose a modular placer for FPGAs: each PR region must be identified during partial reconfiguration and treated as a single entity during placement and routing, which is not well supported by current CAD tools. Our tool is built on top of VPR. It takes pre-synthesized module information, such as area and delay, from a library and performs modular placement to minimize the total area and delay of the application. Module information is represented in a B*-tree structure to allow fast placement, and we amend the B*-tree operations to fit the hardware characteristics of FPGAs. Different width-height ratios of the modules are exploited to optimize the area-delay product. Experimental results compare area, delay, and execution time with the original VPR. Although our placer may incur an area penalty because of blank space between modules, it improves the delay of most benchmarks compared to VPR. Finally, we show PR-aware routing based on the modular placement.
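To make the placement representation concrete, here is a minimal Python sketch of B*-tree packing with a contour structure, in the spirit of the abstract's approach; the node layout, contour model, and all names are illustrative assumptions, not the authors' code.

```python
class Node:
    """One pre-synthesized module; w and h come from the library."""
    def __init__(self, name, w, h):
        self.name, self.w, self.h = name, w, h
        self.left = None   # child packed to the right (x = parent.x + parent.w)
        self.right = None  # child packed above (same x as parent)
        self.x = self.y = 0

def pack(root):
    """Assign (x, y) to every module via DFS over the B*-tree; returns
    the bounding-box area. A dict-based contour tracks the skyline."""
    contour, nodes = {}, []

    def place(n, x):
        n.x = x
        n.y = max(contour.get(c, 0) for c in range(x, x + n.w))
        for c in range(x, x + n.w):
            contour[c] = n.y + n.h
        nodes.append(n)
        if n.left:
            place(n.left, x + n.w)
        if n.right:
            place(n.right, x)

    place(root, 0)
    return max(n.x + n.w for n in nodes) * max(n.y + n.h for n in nodes)

# Three modules: b to the right of a, c stacked above a.
a, b, c = Node("a", 4, 2), Node("b", 2, 3), Node("c", 3, 1)
a.left, a.right = b, c
print(pack(a))  # bounding box 6 wide x 3 tall -> 18
```

An annealing loop would perturb this tree (swap nodes, move subtrees, or exchange a module's width and height to exploit different aspect ratios) and re-run pack() to evaluate each candidate.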
{"title":"BMP: a fast B*-tree based modular placer for FPGAs (abstract only)","authors":"Fubing Mao, Yi-Chung Chen, Wei Zhang, Hai Helen Li","doi":"10.1145/2554688.2554755","DOIUrl":"https://doi.org/10.1145/2554688.2554755","url":null,"abstract":"With the wide application of FPGAs in adaptive computing systems, there is an increasing need to support design automation for PR FPGAs. However, there is a missing link between CAD tools for PR FPGA and existing widely used CAD tools, such as VPR. Hence, in this work we propose a modular placer for FPGAs because each PR region needs to be identified during partial reconfiguration and treated as an entity during placement and routing, which is not well supported by the current CAD tools. Our proposed tool is built on top of VPR. It takes the pre-synthesized module information from library, such as area, delay, etc, and performs modular placement to minimize total area and delay of the application. Modular information is represented in B*-Tree structure to allow fast placement. We amend the operations of B*-Tree to fit hardware characteristic of FPGAs. Different width-height ratios of the modules are exploited to achieve area-delay product optimization. Experimental results show comparisons of area, delay and execution time with original VPR. Though it may have disadvantage in area because of blank area among modules, it improves the delay of most of benchmarks comparing to results from VPR. At the end, we show PR-aware routing based on the modular placement.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127373785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-FPGA boards are widely used for rapid system prototyping. Although prototyping aims for maximum performance, performance is limited by inter-FPGA communication. As the logic capacity per I/O pin increases with each FPGA generation, FPGA I/Os are becoming a scarce resource. The design is partitioned into several parts, each of which fits in a single FPGA. Signals crossing between parts located in different FPGAs are called cut nets. To resolve the pin-limitation problem, cut nets are sent between FPGAs in a pipelined fashion using Time-Division Multiplexing (TDM). The maximum number of cut nets passing through one FPGA I/O is called the TDM ratio. Two multiplexing architectures are currently used for multi-FPGA-based prototyping: logic multiplexing and ISERDES/OSERDES. In this paper, a new multiplexing architecture based on Multi-Gigabit Transceivers (MGTs) is proposed. Experiments are performed on a multi-FPGA board with an LFSR testbench to validate the achieved performance. Assuming that in the future all FPGA I/Os used for inter-FPGA communication are MGT-capable, our analysis shows that the proposed multiplexing architecture achieves higher performance when the TDM ratio exceeds 67, and its performance advantage over the existing architectures grows as the TDM ratio increases.
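The crossover claim can be illustrated with a back-of-the-envelope timing model: serial logic multiplexing scales linearly with the TDM ratio, while an MGT pays a large fixed overhead but a much smaller per-bit cost. All numbers below are assumed round figures for illustration, not the paper's measurements.

```python
# Illustrative model of logic-mux vs. MGT transfer time per TDM frame.
F_IO_MHZ = 800.0        # assumed single-ended I/O toggle rate (logic muxing)
MGT_GBPS = 10.0         # assumed MGT line rate (coding folded in)
MGT_OVERHEAD_NS = 80.0  # assumed fixed MGT serialization/latency overhead

def cycle_time_ns(tdm_ratio):
    logic = tdm_ratio * 1e3 / F_IO_MHZ          # one bit-slot per cut net
    mgt = MGT_OVERHEAD_NS + tdm_ratio / MGT_GBPS  # overhead + payload
    return logic, mgt

for ratio in (16, 32, 64, 128, 256):
    logic, mgt = cycle_time_ns(ratio)
    winner = "MGT" if mgt < logic else "logic mux"
    print(f"TDM ratio {ratio:4d}: logic {logic:7.1f} ns, "
          f"MGT {mgt:7.1f} ns -> {winner}")
```

With these assumed constants the crossover lands near a TDM ratio of 70, close to the paper's reported threshold of 67, and the MGT's advantage keeps widening as the ratio grows.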
{"title":"Future inter-FPGA communication architecture for multi-FPGA based prototyping (abstract only)","authors":"Qingshan Tang, M. Tuna, H. Mehrez","doi":"10.1145/2554688.2554747","DOIUrl":"https://doi.org/10.1145/2554688.2554747","url":null,"abstract":"Multi-FPGA boards are widely used for rapid system prototyping. Even though the prototyping is trying to reach the maximum performance, the performance is limited by the inter-FPGA communication. As the capacity per I/O for each FPGA generation is increasing, FPGA I/Os are becoming a scarce resource. The design is divided into several parts, each part's capacity fits in a single FPGA. Signals crossing design's parts located in different FPGAs are called cut nets. In order to resolve pin limitation problem, cut nets are sent between FPGAs in pipelined way using the Time-Division-Multiplexing technique. The maximum number of cut nets passing through one FPGA I/O is called the TDM ratio. There are two multiplexing architectures used for multi-FPGA based prototyping: Logic Multiplexing and ISERDES/OSERDES. In this paper, a new multiplexing architecture Multi-Gigabit Transceiver (MGT) is proposed. Experiments are done in a multi-FPGA board with the testbench LFSR to validate the achieved performance. Assume that all the FPGA I/Os used for inter-FPGA communication are MGT capable in the future. Analyses show that the proposed multiplexing architecture can achieve higher performance when the TDM ratio exceeds 67. The gain in performance of the proposed architecture over the existing architecture augments as the TDM ratio increases.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133748362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interposer-based multi-FPGA systems are composed of multiple FPGA dice connected through a silicon interposer. Such devices allow larger FPGA systems to be built than one monolithic die can accommodate and are now commercially available. An open question, however, is how efficient such systems are compared to a monolithic FPGA, as the number of signals passing between dice is reduced and the signal delay between dice is increased in an interposer system vs. a monolithic FPGA. We create a new version of VPR to investigate the architecture of such systems, and show that by modifying the placement cost function to minimize the number of signals that must cross between dice we can reduce routing demand by 18% and delay by 2%. We also show that the signal count between dice and the signal delay between dice are key architecture parameters for interposer-based FPGA systems. We find that if an interposer supplies (between dice) 60% of the routing capacity that the normal (within-die) FPGA routing channels supply, there is little impact on the routability of circuits. Smaller routing capacities in the interposer do impact routability, however: minimum channel width increases by 20% and 50% when an interposer supplies only 40% and 30% of the within-die routing, respectively. The interposer also impacts delay, increasing circuit delay by 34% on average for a 1 ns interposer signal delay and a four-die system. Reducing the interposer delay has a greater benefit in improving circuit speed than does reducing the number of dice in the system.
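As an illustration of the modified cost function idea, here is a hedged sketch of a VPR-style net cost with a die-crossing penalty; the weight, the die mapping, and the net model are assumptions, not the tool's actual implementation.

```python
# Sketch of a placement cost penalizing nets that cross between dice.
def net_cost(net_blocks, die_of, crossing_weight=2.0):
    xs = [x for x, y in net_blocks]
    ys = [y for x, y in net_blocks]
    hpwl = (max(xs) - min(xs)) + (max(ys) - min(ys))  # bounding-box wirelength
    dice = {die_of(x, y) for x, y in net_blocks}
    crossings = len(dice) - 1  # a net spanning k dice crosses k-1 boundaries
    return hpwl + crossing_weight * crossings

# Example: four dice stacked vertically, each 25 grid rows tall.
die_of = lambda x, y: y // 25
print(net_cost([(3, 10), (40, 30), (12, 70)], die_of))  # -> 101.0
```

The annealer then trades ordinary wirelength against die crossings, pulling tightly connected logic onto the same die while letting loosely connected logic spread out.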
{"title":"Cad and routing architecture for interposer-based multi-FPGA systems","authors":"A. H. Pereira, Vaughn Betz","doi":"10.1145/2554688.2554776","DOIUrl":"https://doi.org/10.1145/2554688.2554776","url":null,"abstract":"Interposer-based multi-FPGA systems are composed of multiple FPGA dice connected through a silicon interposer. Such devices allow larger FPGA systems to be built than one monolithic die can accomodate and are now commercially available. An open question, however, is how efficient such systems are compared to a monolithic FPGA, as the number of signals passing between dice is reduced and the signal delay between dice is increased in an interposer system vs. a monolithic FPGA. We create a new version of VPR to investigate the architecture of such systems, and show that by modifying the placement cost function to minimize the number of signals that must cross between dice we can reduce routing demand by 18% and delay by 2%. We also show that the signal count between dice and the signal delay between dice are key architecture parameters for interposer-based FPGA systems. We find that if an interposer supplies (between dice) 60% of the routing capacity that the normal (within-die) FPGA routing channels supply, there is little impact on the routability of circuits. Smaller routing capacities in the interposer do impact routability however: minimum channel width increases by 20% and 50% when an interposer supplies only 40% and 30% of the within-die routing, respectively. The interposer also impacts delay, increasing circuit delay by 34% on average for a 1 ns interposer signal delay and a four-die system. Reducing the interposer delay has a greater benefit in improving circuit speed than does reducing the number of dice in the system.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130816996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With CMOS technology aggressively scaling towards the 22nm node, modern FPGA devices face tremendous aging-induced reliability challenges due to Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI). This paper presents a novel anti-aging technique at the logic level that is both scalable and applicable to VLSI digital circuits implemented on FPGA devices. The key idea is to prolong the lifetime of FPGA-mapped designs by strategically elevating the VDD values of some LUTs based on their modular criticality values. Although the idea of scaling VDD to improve either energy efficiency or circuit reliability has been explored extensively, our study distinguishes itself by approaching this challenge through an analytical procedure, and is therefore able to maximize the overall reliability of the target FPGA design by rigorously modelling BTI-induced device reliability and optimally solving the VDD assignment problem.
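The shape of the VDD assignment problem can be pictured with a toy greedy heuristic; the paper solves the assignment optimally via an analytical formulation, so the sketch below only illustrates the trade-off, and every name and number in it is assumed.

```python
# Toy greedy sketch: elevate VDD on the most critical LUTs first,
# subject to a power budget. Not the paper's optimal method.
def assign_vdd(luts, power_budget):
    """luts: list of (name, criticality, extra_power_at_high_vdd)."""
    chosen, spent = set(), 0.0
    for name, crit, extra_pw in sorted(luts, key=lambda t: -t[1]):
        if spent + extra_pw <= power_budget:
            chosen.add(name)       # this LUT runs at the elevated VDD
            spent += extra_pw
    return chosen

luts = [("lut_a", 0.95, 1.2), ("lut_b", 0.40, 0.8), ("lut_c", 0.88, 1.5)]
print(assign_vdd(luts, power_budget=2.8))
# -> {'lut_a', 'lut_c'} (set order may vary)
```

A greedy pass like this can get stuck in locally good choices; the abstract's point is precisely that modeling BTI-induced degradation analytically lets the assignment be solved to optimality instead.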
{"title":"Optimally mitigating BTI-induced FPGA device aging with discriminative voltage scaling (abstract only)","authors":"Yu Bai, Mohammed Alawad, Mingjie Lin","doi":"10.1145/2554688.2554752","DOIUrl":"https://doi.org/10.1145/2554688.2554752","url":null,"abstract":"With the CMOS technology aggressively scaling towards the 22nm node, modern FPGA devices face tremendous aging- induced reliability challenges due to Bias Temperature In- stability (BTI) and Hot Carrier Injection (HCI). This paper presents a novel antiaging technique at logic level that is both scalable and applicable for VLSI digital circuits implemented with FPGA devices. The key idea is to prolong the lifetime of FPGA-mapped designs by strategically elevating the VDD values of some LUTs based on their modular criticality values. Although the idea of scaling VDD in order to improve either energy efficiency or circuit reliability has been explored extensively, our study distinguishes itself by approaching this challenge through analytical procedure, therefore able to maximize the overall reliability of target FPGA design by rigorously modelling the BTI-induce de- vice reliability and optimally solving the VDD assignment problem.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134192810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, since the ALUs are used in a time-multiplexed fashion, this does not exploit a key strength of FPGA performance: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of an SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. However, the SVP must also perform the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than with a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath and presents multiple techniques for generating them. Using a custom pipeline, we demonstrate speedups of over 7,000 times and performance-per-ALM over 100 times better than a Nios II/f. The custom pipeline is also 50 times faster than a naive implementation on an Intel Core i7 processor.
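The housekeeping role described above can be modeled abstractly as a double-buffered loop: fetch the next tile while the pipeline processes the previous one. In the sketch below, dma_read, pipeline_run, and dma_write are hypothetical stand-ins, and on real hardware the prefetch would overlap with pipeline execution rather than run sequentially.

```python
# Hedged model of the SVP's double-buffered orchestration pattern.
def process(dram_tiles, dma_read, pipeline_run, dma_write):
    results = []
    pending = None                 # tile currently resident in scratchpad
    for tile in dram_tiles:
        fetched = dma_read(tile)   # on HW this overlaps with pipeline_run
        if pending is not None:
            results.append(dma_write(pipeline_run(pending)))
        pending = fetched
    if pending is not None:        # drain the last tile
        results.append(dma_write(pipeline_run(pending)))
    return results

# Toy usage: identity DMA and a squaring "pipeline".
print(process([1, 2, 3], lambda t: t, lambda t: t * t, lambda r: r))
# -> [1, 4, 9]
```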
{"title":"Soft vector processors with streaming pipelines","authors":"Aaron Severance, Joe Edwards, Hossein Omidian, G. Lemieux","doi":"10.1145/2554688.2554774","DOIUrl":"https://doi.org/10.1145/2554688.2554774","url":null,"abstract":"Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, since ALUs are used in a time-multiplexed fashion, this does not exploit a key strength of FPGA performance: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of a SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. However, the SVP must also perform the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than using a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath and multiple techniques for generating them. Using a custom pipeline, we demonstrate speedups over 7,000 times and performance-per-ALM over 100 times better than Nios II/f. The custom pipeline is also 50 times faster than a naive Intel Core i7 processor implementation.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124820797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the challenges in designing high-performance FPGA applications is fine-tuning the use of limited on-chip memory storage among the many buffers in an application. To achieve the desired performance, the designer faces the burden of packing these buffers into on-chip memories and manually optimizing the utilization of each memory and the throughput of each buffer. In addition, the application's buffers may not match the word width or depth of the physical on-chip memories available on the FPGA. This process is time consuming and non-trivial, particularly with a large number of buffers of various depths and bit widths. We propose a tool, MPack, which globally optimizes on-chip memory use across all buffers for stream applications. The goal is to speed up development time by providing rapid design space exploration and relieving the designer of lengthy low-level iterations. We introduce new high-level pragmas that allow the user to specify global memory requirements, such as an application's on-chip memory budget and data throughput. The user can quickly generate a large number of memory solutions and explore the trade-off between memory usage and achievable throughput. To demonstrate the effectiveness of our tool, we apply the new high-level pragmas to an image processing benchmark. MPack effectively explores the design space and produces a large number of memory solutions ranging from 10% to 100% in throughput and from 12% to 100% in on-chip memory usage.
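To see the kind of packing arithmetic the tool automates, here is a toy model of fitting logical buffers into fixed-geometry block RAMs; the 18 Kb geometry and the width/depth splitting rule are simplified assumptions for illustration, not MPack's actual algorithm.

```python
# Toy BRAM-count model: wide buffers split into parallel slices,
# deep buffers cascade multiple BRAMs.
BRAM_BITS = 18 * 1024              # assumed 18 Kb block RAM

def brams_needed(depth, width, max_width=36):
    slices = -(-width // max_width)               # ceil(width / max_width)
    per_slice_depth = BRAM_BITS // min(width, max_width)  # rough model
    cascades = -(-depth // per_slice_depth)       # ceil(depth / slice depth)
    return slices * cascades

buffers = {"line_buf": (1920, 24), "coeffs": (64, 18), "fifo": (4096, 32)}
total = sum(brams_needed(d, w) for d, w in buffers.values())
print(total, "BRAMs for", list(buffers))  # -> 12 BRAMs for this toy set
```

Even this crude model shows why hand-tuning is painful: each buffer's BRAM count depends non-linearly on its shape, and a global budget couples every buffer's choice to all the others.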
{"title":"MPack: global memory optimization for stream applications in high-level synthesis","authors":"Jasmina Vasiljevic, P. Chow","doi":"10.1145/2554688.2554761","DOIUrl":"https://doi.org/10.1145/2554688.2554761","url":null,"abstract":"One of the challenges in designing high-performance FPGA applications is fine-tuning the use of limited on-chip memory storage among many buffers in an application. To achieve desired performance the designer faces the burden of packaging such buffers into on-chip memories and manually optimizing the utilization of each memory and the throughput of each buffer. In addition, the application memories may not match the word width or depth of the physical on-chip memories available on the FPGA. This process is time consuming and non-trivial, particularly with a large number of buffers of various depths and bit widths. We propose a tool, MPack, which globally optimizes on-chip memory use across all buffers for stream applications. The goal is to speed up development time by providing rapid design space exploration and relieving the designer of lengthy low-level iterations. We introduce new high-level pragmas allowing the user to specify global memory requirements, such as an application's on-chip memory budget and data throughput. We allow the user to quickly generate a large number of memory solutions and explore the trade-off between memory usage and achievable throughput. To demonstrate the effectiveness of our tool, we apply the new high-level pragmas to an image processing benchmark. MPack effectively explores the design space and is able to produce a large number of memory solutions ranging from 10 to 100% in throughput, and from 12 to 100% in on-chip memory usage.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131727862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse Matrix-Vector Multiplication (SpMxV) is a widely used mathematical operation in many high-performance scientific and engineering applications. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for computing SpMxV. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than for dense matrices, mostly because the compression formats required to store sparse matrices efficiently are a mismatch for traditional computing architectures. This paper describes an FPGA-based SpMxV kernel that is scalable and efficiently utilizes the available memory bandwidth and computing resources. Benchmarking on a Virtex-5 SX95T FPGA demonstrates an average computational efficiency of 91.85%. The kernel achieves a peak computational efficiency of 99.8%, a >50x improvement over two Intel Core i7 processors (i7-2600 and i7-4770) and a >300x improvement over two NVIDIA GPUs (GTX 660 and GTX Titan) running the MKL and cuSPARSE sparse-BLAS libraries, respectively. In addition, the SpMxV FPGA kernel achieves higher performance than its CPU and GPU counterparts while using only 64 single-precision processing elements, for an overall 38-50x improvement in energy efficiency.
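For reference, the operation being accelerated is the standard compressed sparse row (CSR) formulation of y = Ax; a plain Python version makes the irregular, index-driven access pattern that hurts CPUs and GPUs easy to see. This is the textbook kernel, not the paper's FPGA implementation.

```python
# Reference CSR sparse matrix-vector multiply: y = A @ x.
def spmv_csr(row_ptr, col_idx, vals, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Each row touches x at scattered columns: the irregular access.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 2x3 example: [[5, 0, 2], [0, 3, 0]] times [1, 2, 3]
print(spmv_csr([0, 2, 3], [0, 2, 1], [5.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# -> [11.0, 6.0]
```

The indirect read x[col_idx[k]] defeats caches and coalesced memory access, which is why a custom FPGA datapath with tailored on-chip buffering can run so much closer to peak efficiency.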
{"title":"A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs","authors":"R. Dorrance, Fengbo Ren, D. Markovic","doi":"10.1145/2554688.2554785","DOIUrl":"https://doi.org/10.1145/2554688.2554785","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMxV) is a widely used mathematical operation in many high-performance scientific and engineering applications. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for computing SpMxV. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than that of dense matrices, mostly due to the fact that the compression formats required to efficiently store sparse matrices mismatches traditional computing architectures. This paper describes an FPGA-based SpMxV kernel that is scalable to efficiently utilize the available memory bandwidth and computing resources. Benchmarking on a Virtex-5 SX95T FPGA demonstrates an average computational efficiency of 91.85%. The kernel achieves a peak computational efficiency of 99.8%, a >50x improvement over two Intel Core i7 processors (i7-2600 and i7-4770) and showing a >300x improvement over two NVIDA GPUs (GTX 660 and GTX Titan), when running the MKL and cuSPARSE sparse-BLAS libraries, respectively. In addition, the SpMxV FPGA kernel is able to achieve higher performance than its CPU and GPU counterparts, while using only 64 single-precision processing elements, with an overall 38-50x improvement in energy efficiency.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128357276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, we propose a modified DEFENSE architecture, termed xDEFENSE, that can detect and react to hardware attacks in real time. In the past, several Root of Trust architectures, such as DEFENSE and RETC, have been proposed to foil attempts by hardware Trojans to leak sensitive information. In a typical Root of Trust architecture, hardware is allowed to access memory only by responding properly to a challenge issued by the memory guard. However, in a recent effort we observed that these architectures can in fact be susceptible to a variety of threats, ranging from denial-of-service attacks and privilege escalation to information leakage, when a Trojan is injected into Root of Trust modules such as the memory guard or the authorized hardware. In our work, we propose a security monitor that observes all transactions between the authorized hardware, the memory guard, and memory. It also authenticates these components using Hashed Message Authentication Codes (HMACs) to detect any invalid memory access, or any denial-of-service attack that disrupts the challenge-response pairs. The proposed xDEFENSE architecture was implemented on a Xilinx Spartan-3 FPGA evaluation board, and our results indicate that xDEFENSE requires 143 additional slices compared to DEFENSE and incurs a monitoring latency of 22 ns.
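The challenge-response authentication the monitor relies on can be sketched with standard HMAC primitives; the key handling and message layout below are illustrative assumptions, not xDEFENSE's wire format.

```python
# Minimal HMAC challenge-response sketch: the guard issues a fresh
# challenge, authorized hardware returns a keyed MAC over it.
import hashlib
import hmac
import os

KEY = os.urandom(16)  # assumed shared secret between guard and hardware

def respond(challenge: bytes) -> bytes:
    return hmac.new(KEY, challenge, hashlib.sha256).digest()

def guard_allows(challenge: bytes, response: bytes) -> bool:
    expected = hmac.new(KEY, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)  # constant-time compare

c = os.urandom(8)                         # fresh challenge per access
assert guard_allows(c, respond(c))        # authorized hardware passes
assert not guard_allows(c, b"\x00" * 32)  # tampered response is rejected
```

A Trojan that cannot produce valid MACs can neither impersonate the authorized hardware nor silently suppress transactions, since a dropped or corrupted response is itself a detectable event.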
{"title":"xDEFENSE: an extended DEFENSE for mitigating next generation intrusions (abstract only)","authors":"J. Lamberti, D. Shila, V. Venugopal","doi":"10.1145/2554688.2554714","DOIUrl":"https://doi.org/10.1145/2554688.2554714","url":null,"abstract":"In this work, we propose a modified DEFENSE architecture termed as xDEFENSE that can detect and react to hardware attacks in real-time. In the past, several Root of Trust architectures such as DEFENSE and RETC have been proposed to foil attempts by hardware Trojans to leak sensitive information. In a typical Root of Trust architecture scenario, hardware is allowed to access the memory only by responding properly to a challenge requested by the memory guard. However in a recent effort, we observed that these architectures can in fact be susceptible to a variety of threats ranging from denial of service attacks, privilege escalation to information leakage, by injecting a Trojan into the Root of Trust modules such as memory guards and authorized hardware. In our work, we propose a security monitor that monitors all transactions between the authorized hardware, memory guard and memory. It also authenticates these components through the use of Hashed Message Authentication Codes (HMAC) to detect any invalid memory access or denial of service attack by disrupting the challenge-response pairs. The proposed xDEFENSE architecture was implemented on a Xilinx SPARTAN 3 FPGA evaluation board and our results indicate that xDEFENSE requires 143 additional slices as compared to DEFENSE and incurs a monitoring latency of 22ns.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129894814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Applications 2","authors":"Lesley Shannon","doi":"10.1145/3260941","DOIUrl":"https://doi.org/10.1145/3260941","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132884536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present APMC, the Advanced Pattern based Memory Controller, which uses descriptors to support both regular and irregular memory access patterns without requiring a master core. It keeps pattern descriptors in memory and prefetches complex 1D/2D/3D data structures into its special scratchpad memory. Memory accesses, including irregular ones, are arranged in the pattern descriptors at program time, and APMC manages multiple patterns at run time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of open SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board, and its performance is compared with a processor using a high-performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster, respectively, than the baseline system. Compared to the baseline system, APMC consumes 17% less hardware resources and 32% less on-chip power, and achieves speedups of 3.5x to 52x for regular applications and 1.4x to 2.9x for irregular applications. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.
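A pattern descriptor for strided 1D/2D/3D accesses can be pictured as follows; the field names and expansion logic are assumptions for illustration, not APMC's actual descriptor format.

```python
# Hedged sketch of a strided pattern descriptor and the address
# stream it expands to.
from dataclasses import dataclass

@dataclass
class PatternDescriptor:
    base: int          # starting byte address
    size: int          # contiguous bytes per row (burst length)
    stride: int        # byte distance between rows (dim 1)
    count: int         # rows per plane (dim 1)
    stride2: int = 0   # byte distance between planes (dim 2)
    count2: int = 1    # number of planes (dim 2)

def addresses(d):
    """Yield the start address of each burst the controller would issue."""
    for p in range(d.count2):
        for r in range(d.count):
            yield d.base + p * d.stride2 + r * d.stride

# 2D tile: 4 rows of 64 bytes, rows 1024 bytes apart.
tile = PatternDescriptor(base=0x1000, size=64, stride=1024, count=4)
print([hex(a) for a in addresses(tile)])
# -> ['0x1000', '0x1400', '0x1800', '0x1c00']
```

Because the whole stream is known from the descriptor, a controller can sort and batch these bursts to keep SDRAM rows open, which is the bank-reuse effect the abstract describes.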
{"title":"APMC: advanced pattern based memory controller (abstract only)","authors":"Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero, S. Rethinagiri","doi":"10.1145/2554688.2554732","DOIUrl":"https://doi.org/10.1145/2554688.2554732","url":null,"abstract":"In this paper, we present APMC, the Advanced Pattern based Memory Controller, that uses descriptors to support both regular and irregular memory access patterns without using a master core. It keeps pattern descriptors in memory and prefetches the complex 1D/2D/3D data structure into its special scratchpad memory. Support for irregular Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of opened SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the specialized scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller. In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that uses descriptors to supports both regular and irregular memory access patterns. support of the master core. It keeps pattern descriptors in memory and prefetches the complex data structure into its special scratchpad memory. Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.memory accesses. 
In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that support","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134049087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}