Georgios Smaragdos, S. Isaza, M. F. V. Eijk, I. Sourdis, C. Strydis
The Inferior-Olivary nucleus (ION) is a well-charted region of the brain, heavily associated with sensorimotor control of the body. It comprises ION cells with unique properties which facilitate sensory processing and motor-learning skills. Various simulation models of ION-cell networks have been written in an attempt to unravel their mysteries. However, simulations rapidly become intractable when biophysically plausible models and meaningful network sizes (>=100 cells) are modeled. To overcome this problem, in this work we port a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip. It was first converted to ANSI C code and extensively profiled. It was then translated to HLS C code for the Xilinx Vivado toolflow, and various algorithmic and arithmetic optimizations were applied. The design was implemented in a Virtex 7 (XC7VX485T) device and can simulate a 96-cell network at real-time speed, yielding a speedup of x700 compared to the original Matlab code and x12.5 compared to the reference C implementation running on an Intel Xeon 2.66GHz machine with 20GB RAM. For a 1,056-cell network (non-real-time), an FPGA speedup of x45 against the C code can be achieved, demonstrating the design's usefulness in accelerating neuroscience research. Limited by the available on-chip memory, the FPGA can maximally support a 14,400-cell network (non-real-time) with online parameter configurability for cell state and network size. The maximum throughput of the FPGA ION-network accelerator can reach 2.13 GFLOPS.
{"title":"FPGA-based biophysically-meaningful modeling of olivocerebellar neurons","authors":"Georgios Smaragdos, S. Isaza, M. F. V. Eijk, I. Sourdis, C. Strydis","doi":"10.1145/2554688.2554790","DOIUrl":"https://doi.org/10.1145/2554688.2554790","url":null,"abstract":"The Inferior-Olivary nucleus (ION) is a well-charted region of the brain, heavily associated with sensorimotor control of the body. It comprises ION cells with unique properties which facilitate sensory processing and motor-learning skills. Various simulation models of ION-cell networks have been written in an attempt to unravel their mysteries. However, simulations become rapidly intractable when biophysically plausible models and meaningful network sizes (>=100 cells) are modeled. To overcome this problem, in this work we port a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip. It was first converted to ANSI C code and extensively profiled. It was, then, translated to HLS C code for the Xilinx Vivado toolflow and various algorithmic and arithmetic optimizations were applied. The design was implemented in a Virtex 7 (XC7VX485T) device and can simulate a 96-cell network at real-time speed, yielding a speedup of x700 compared to the original Matlab code and x12.5 compared to the reference C implementation running on a Intel Xeon 2.66GHz machine with 20GB RAM. For a 1,056-cell network (non-real-time), an FPGA speedup of x45 against the C code can be achieved, demonstrating the design's usefulness in accelerating neuroscience research. Limited by the available on-chip memory, the FPGA can maximally support a 14,400-cell network (non-real-time) with online parameter configurability for cell state and network size. 
The maximum throughput of the FPGA ION-network accelerator can reach 2.13 GFLOPS.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133459584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Timing margins in FPGAs are already significant and as process scaling continues they will have to grow to guarantee operation under increased variation. Margins enforce worst-case operation even in typical conditions and result in devices operating more slowly and consuming more energy than necessary. This paper presents a method of dynamic voltage and frequency scaling that uses online slack measurement to determine timing headroom in a circuit while it is operating and scale the voltage and/or frequency in response. Doing so can significantly reduce power consumption or increase throughput with a minimal overhead. The method is demonstrated on a number of benchmark circuits under a range of operating conditions, constraints and optimisation targets.
{"title":"Dynamic voltage & frequency scaling with online slack measurement","authors":"Joshua M. Levine, Edward A. Stott, P. Cheung","doi":"10.1145/2554688.2554784","DOIUrl":"https://doi.org/10.1145/2554688.2554784","url":null,"abstract":"Timing margins in FPGAs are already significant and as process scaling continues they will have to grow to guarantee operation under increased variation. Margins enforce worst-case operation even in typical conditions and result in devices operating more slowly and consuming more energy than necessary. This paper presents a method of dynamic voltage and frequency scaling that uses online slack measurement to determine timing headroom in a circuit while it is operating and scale the voltage and/or frequency in response. Doing so can significantly reduce power consumption or increase throughput with a minimal overhead. The method is demonstrated on a number of benchmark circuits under a range of operating conditions, constraints and optimisation targets.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130104243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
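The feedback idea in the abstract above (measure timing slack while the circuit runs, then scale the supply voltage in response) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the delay model, the thresholds, and the step size below are all hypothetical and chosen only to show the shape of the control loop.

```python
def toy_slack_ns(voltage, period_ns=10.0):
    """Hypothetical circuit model: critical-path delay grows as voltage drops."""
    delay_ns = 6.0 / voltage  # assumed delay law, for illustration only
    return period_ns - delay_ns


def dvfs_step(voltage, slack_ns, target_slack_ns=0.5, step_v=0.01,
              v_min=0.5, v_max=1.0):
    """Return the next supply voltage given the measured timing slack."""
    if slack_ns > target_slack_ns:
        voltage -= step_v      # headroom available: save power
    elif slack_ns < target_slack_ns / 2:
        voltage += step_v      # too close to a timing failure: back off
    # otherwise hold the current voltage (dead band avoids oscillation)
    return min(v_max, max(v_min, voltage))


def settle(v_start=1.0, steps=200):
    """Run the loop until it has had ample time to settle."""
    v = v_start
    for _ in range(steps):
        v = dvfs_step(v, toy_slack_ns(v))
    return v


if __name__ == "__main__":
    v = settle()
    print(f"settled at {v:.2f} V with {toy_slack_ns(v):.3f} ns slack")
```

The dead band between the two thresholds is a design choice of this sketch: it keeps the controller from toggling the voltage up and down on every measurement once the slack is close to the target.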
Hongbin Zheng, S. Gurumani, K. Rupnow, Deming Chen
Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGAs), because of its impact on design latency and throughput. Fmax is limited by critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement and routing. However, for high-level synthesis (HLS) design flows, it is challenging to evaluate the real critical delay at the behavioral level. Current HLS flows typically use module pre-characterization for delay estimates. However, we will demonstrate that such delay estimates are not sufficient to obtain high fmax and also minimize total execution latency. In this paper, we introduce a new HLS flow that integrates with Altera's Quartus synthesis and fast placement and routing (PAR) tool to obtain realistic post-PAR delay estimates. This integration enables an iterative flow that improves the performance of the design with both behavioral-level and circuit-level optimizations using realistic delay information. We demonstrate that our HLS flow produces up to 24% (on average 20%) improvement in fmax and up to 22% (on average 20%) improvement in execution latency. Furthermore, results demonstrate that our flow is able to achieve from 65% to 91% of the theoretical fmax on Stratix IV devices (550MHz).
{"title":"Fast and effective placement and routing directed high-level synthesis for FPGAs","authors":"Hongbin Zheng, S. Gurumani, K. Rupnow, Deming Chen","doi":"10.1145/2554688.2554775","DOIUrl":"https://doi.org/10.1145/2554688.2554775","url":null,"abstract":"Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGA), because of its impact on design latency and throughput. Fmax is limited by critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement and routing. However, for high-level synthesis~(HLS) design flows, it is challenging to evaluate the real critical delay at the behavioral level. Current HLS flows typically use module pre-characterization for delay estimates. However, we will demonstrate that such delay estimates are not sufficient to obtain high fmax and also minimize total execution latency. In this paper, we introduce a new HLS flow that integrates with Altera's Quartus synthesis and fast placement and routing (PAR) tool to obtain realistic post-PAR delay estimates. This integration enables an iterative flow that improves the performance of the design with both behavioral-level and circuit-level optimizations using realistic delay information. We demonstrate our HLS flow produces up to 24% (on average 20%) improvement in fmax and upto 22% (on average 20%) improvement in execution latency. 
Furthermore, results demonstrate that our flow is able to achieve from 65% to 91% of the theoretical fmax on Stratix IV devices (550MHz).","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116172185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
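To make the closing figures of the abstract above concrete, the short sketch below converts the quoted fractions of the 550 MHz theoretical fmax into absolute frequencies and the corresponding clock periods, using only the standard relation period = 1 / frequency. The function and constant names are illustrative and do not come from the paper.

```python
THEORETICAL_FMAX_MHZ = 550.0  # Stratix IV figure quoted in the abstract


def period_ns(fmax_mhz):
    """Clock period in nanoseconds for a given fmax in MHz."""
    return 1000.0 / fmax_mhz


def achieved_range_mhz(low_frac=0.65, high_frac=0.91):
    """Absolute fmax range implied by the quoted 65% to 91% fractions."""
    return (low_frac * THEORETICAL_FMAX_MHZ,
            high_frac * THEORETICAL_FMAX_MHZ)


if __name__ == "__main__":
    low, high = achieved_range_mhz()
    print(f"{low:.1f} MHz to {high:.1f} MHz")
    print(f"{period_ns(high):.3f} ns to {period_ns(low):.3f} ns clock period")
```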
{"title":"Session details: Tools and models 1","authors":"Deming Chen","doi":"10.1145/3260942","DOIUrl":"https://doi.org/10.1145/3260942","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133398091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As modern circuits grow ever larger, implementing them on FPGAs becomes arduous. The state-of-the-art academic FPGA place-and-route tool, VPR, produces good-quality results but needs around a whole day to complete a placement when the input circuit contains millions of lookup tables, excluding the runtime for routing. To expedite the placement process, we propose a routability-driven placement algorithm for FPGAs that adopts techniques used in ASIC global placers. Our placer follows the lower-bound-and-upper-bound iterative optimization process in ASIC placers like Ripple. In the lower-bound computation, the total HPWL, modeled using the Bound2Bound net model, is minimized using the conjugate gradient method. In the upper-bound computation, an almost-legalized result is produced by spreading cells linearly in the placement area. Those positions then serve as fixed-point anchors and are fed into the next lower-bound computation. Furthermore, global routing is performed in the upper-bound computation to estimate the routing segment usage, as a means of considering congestion in placement. We tested our approach using 20 MCNC benchmarks and 4 large benchmarks for performance and scalability. Experimental results show that, on the island-style architecture for which VPR is most optimized, our approach can obtain a placement result 8x faster than VPR with 2% more in channel width, or 3x faster with 1% more in channel width when congestion is being considered. Our approach is even 14x faster than VPR in placing large benchmarks with over 10,000 lookup tables, with only 7% more in channel width.
{"title":"A scalable routability-driven analytical placer with global router integration for FPGAs (abstract only)","authors":"Ka-Chun Lam, W. Tang, Evangeline F. Y. Young","doi":"10.1145/2554688.2554711","DOIUrl":"https://doi.org/10.1145/2554688.2554711","url":null,"abstract":"As the sizes of modern circuits become bigger and bigger, implementing those large circuits into FPGA becomes arduous. The state-of-the-art academic FPGA place-and-route tool, VPR, has good quality but needs around a whole day to complete a placement when the input circuit contains millions of lookup tables, excluding the runtime for routing. To expedite the placement process, we propose a routability-driven placement algorithm for FPGA that adopts techniques used in ASIC global placer. Our placer follows the lower-bound-and-upper-bound iterative optimization process in ASIC placers like Ripple. In the lower-bound computation, the total HPWL, modeled using the Bound2Bound net model, is minimized using the conjugate gradient method. In the upper-bound computation, an almost-legalized result is produced by spreading cells linearly in the placement area. Those positions are then served as fixed-point anchors and fed into the next lower-bound computation. Furthermore, global routing will be performed in the upper-bound computation to estimate the routing segment usage, as a mean to consider congestion in placement. We tested our approach using 20 MCNC benchmarks and 4 large benchmarks for performance and scalability. Experimental results show that based on the island-style architecture which VPR is most optimized for, our approach can obtain a placement result 8x faster than VPR with 2% more in channel width, or 3x faster with 1% more in channel width when congestion is being considered. 
Our approach is even 14x faster than VPR in placing large benchmarks with over 10,000 lookup tables, with only 7% more in channel width.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116756506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
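The objective minimized in the lower-bound step of the abstract above is the total half-perimeter wirelength (HPWL). As a minimal sketch of that metric only, the code below sums the bounding-box half-perimeter of every net. It does not implement the Bound2Bound model, the conjugate gradient solve, or the spreading step; the data layout (a dict of cell coordinates and a list of nets) is an assumption made for illustration.

```python
def hpwl(nets, positions):
    """Total half-perimeter wirelength.

    nets:      iterable of nets, each a list of cell names
    positions: dict mapping cell name -> (x, y) coordinate
    """
    total = 0.0
    for net in nets:
        xs = [positions[cell][0] for cell in net]
        ys = [positions[cell][1] for cell in net]
        # Half the perimeter of the net's bounding box
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


if __name__ == "__main__":
    positions = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
    nets = [["a", "b"], ["b", "c"]]
    print(hpwl(nets, positions))
```

An analytical placer cannot minimize this function directly with a gradient method because max and min are not differentiable, which is why a smooth or quadratic net model is substituted during the solve.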
Fan Zhang, Lei Chen, Wenyao Xu, Yuanfu Zhao, Zhiping Wen
The significance of FPGA testing and the challenge of its increasing cost cannot be ignored. In island-style FPGA architectures, hex lines are the principal interconnect resources. Testing hex lines and hex Programmable Interconnect Points (PIPs) has remained a major technical difficulty in FPGA testing due to complex interconnect rules. In particular, testing hex PIPs in the oblique direction has rarely been addressed in previous studies. To address this challenge, this paper for the first time proposes a coordinate system and formulates the interconnect rules of hex lines as mathematical equations. For hex PIPs in the horizontal and vertical directions, an efficient circle test structure is formed by coordinate equations. For hex PIPs in the oblique direction, the coordinate method is used to generate the partial-cascade pattern. The corresponding test vector is also generated, which ensures the ergodicity of hex PIPs in the oblique direction. In addition to hex PIPs, hex lines are also covered without extra effort. Compared with previous research, the number of configurations for hex lines is decreased significantly. We evaluate this method on a Xilinx XC2V1000, and experimental results show that our proposed method achieves 100% fault coverage for hex PIPs and can be generally applied to all current mainstream island-style FPGAs with a similar interconnect structure.
{"title":"Coordinating routing resources for hex pips test in island-style FPGAs (abstract only)","authors":"Fan Zhang, Lei Chen, Wenyao Xu, Yuanfu Zhao, Zhiping Wen","doi":"10.1145/2554688.2554740","DOIUrl":"https://doi.org/10.1145/2554688.2554740","url":null,"abstract":"The significance of FPGA test and the challenge of its increasing cost can never be ignored. In island-style FPGA architectures, hex lines are the principal interconnect resources. Testing hex lines and hex Programmable Interconnect Points (PIPs) have remained as the major technical difficulty in FPGAs test due to complex interconnect rules. Particularly, test in oblique direction of hex PIPs has rarely been addressed in previous studies. Towards this challenge, this paper for the first time proposes a coordinate system and formulates the interconnect rules of hex lines as mathematical equations. For hex PIPs in horizontal and vertical direction, an efficient circle test structure is formed by coordinate equations. For hex PIPs in oblique direction, the coordinate method is used to generate the partial-cascade pattern. The corresponding test vector is also generated, which ensures the ergodicity of hex PIPs in oblique direction. In addition to hex PIPs, hex lines are also covered without extra effort. Compared to previous researches, the configuration number for hex lines is decreased significantly. 
We evaluate this method on Xilinx XC2V1000, and experimental results show that our proposed method achieves 100% fault coverage for hex PIPs and can be generally applied to all mainstream island-style FPGAs with a similar interconnect structure currently.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124786441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keita Ito, T. Yoneda, Yuta Yamato, K. Hatayama, M. Inoue
This paper presents a scan-based BIST architecture for FPGAs used as application-specific embedded devices for low-volume products. The proposed architecture efficiently utilizes memory blocks, instead of logic elements, to build up BIST components such as LFSR, MISR and scan chains for test points. It also provides enhanced scan functionality for test points and performs a hybrid test application of LOC and enhanced scan to improve delay test quality. Experimental results show that the proposed BIST architecture achieves high delay test quality with efficient resource utilization.
{"title":"Memory block based scan-BIST architecture for application-dependent FPGA testing","authors":"Keita Ito, T. Yoneda, Yuta Yamato, K. Hatayama, M. Inoue","doi":"10.1145/2554688.2554764","DOIUrl":"https://doi.org/10.1145/2554688.2554764","url":null,"abstract":"This paper presents a scan-based BIST architecture for FPGAs used as application-specific embedded devices for low-volume products. The proposed architecture efficiently utilizes memory blocks, instead of logic elements, to build up BIST components such as LFSR, MISR and scan chains for test points. It also provides enhanced scan functionality for test points and performs a hybrid test application of LOC and enhanced scan to improve delay test quality. Experimental results show that the proposed BIST architecture achieves high delay test quality with efficient resource utilization.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127978121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
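For readers unfamiliar with the BIST components named in the abstract above, the following is a small software model of what an LFSR (pattern generator) and a MISR (response compactor) compute. It is a sketch under stated assumptions: the 4-bit width and the feedback polynomial x^4 + x^3 + 1 are chosen for illustration, and nothing here reflects how the paper maps these structures onto FPGA memory blocks.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1


def lfsr_step(state):
    """One shift of a 4-bit Fibonacci LFSR with polynomial x^4 + x^3 + 1."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1  # XOR of the two tap bits
    return ((state << 1) | feedback) & MASK


def lfsr_patterns(seed=1, count=15):
    """Generate pseudo-random test patterns from a non-zero seed."""
    state = seed
    out = []
    for _ in range(count):
        out.append(state)
        state = lfsr_step(state)
    return out


def misr_signature(responses, seed=0):
    """Compact a sequence of 4-bit circuit responses into one signature."""
    state = seed
    for word in responses:
        # Shift like the LFSR, then fold in the parallel response word
        state = lfsr_step(state) ^ (word & MASK)
    return state


if __name__ == "__main__":
    print([f"{p:04b}" for p in lfsr_patterns()])
    print(f"{misr_signature([3, 5, 7, 1]):04b}")
```

With this polynomial the LFSR cycles through all 15 non-zero states before repeating. Because the MISR update is linear and invertible, a single corrupted response word always changes the final signature in this model; multiple errors can still cancel (aliasing).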
Non-trivial hardware architectures consist of a significant number of fine-grained modules that communicate with each other via dedicated signal lines. In field-programmable gate arrays (FPGAs), these communication lines are provided in the form of global vertical and horizontal routing channels, and are subject to the routing process. Since the effects of physical properties on the signal skew along these lines are well understood, this paper investigates the observable effects on a signal's duty cycle. Practical experiments show that the distortion of the duty cycle progressively increases along such wires (connections) and that, in the extreme case, a signal may vanish entirely.
{"title":"Exploring duty cycle distortions along signal paths in FPGAs (abstract only)","authors":"Matthias Hinkfoth, R. Joost, R. Salomon","doi":"10.1145/2554688.2554737","DOIUrl":"https://doi.org/10.1145/2554688.2554737","url":null,"abstract":"Non-trivial hardware architectures consist of a significant number of fine-grained modules that communication with each other via dedicated signal lines. In field-programmable gate arrays (FPGAs), these communication lines are provided in forms of global vertical and horizontal routing channels, and are subject to the routing process. Since the effects of physical properties on the signal skew along these lines is well understood, this paper investigates the observable effects on a signal's duty cycle. Practical experiments show that the distortion on the duty cycle progressively increases along such wires (connections) and that in the extreme case, a signal may entirely vanish.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129777560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Lerner, Zane R. Franklin, W. Baumann, C. Patterson
Industrial control systems (ICSes) have the conflicting requirements of security and network access. In the event of large-scale hostilities, factories and infrastructure would more likely be targeted by computer viruses than the bomber squadrons used in WWII. ICS zero-day exploits are now a commodity sold on brokerages to interested parties including nations. We mitigate these threats not by bolstering perimeter security, but rather by assuming that potentially all layers of ICS software have already been compromised and are capable of launching a latent attack while reporting normal system status to human operators. In our approach, application-specific configurable hardware is the final authority for scrutinizing controller commands and process sensors, and can monitor and override operations at the lowest (I/O pin) level of a configurable system-on-chip platform. The process specifications, stability-preserving backup controller, and switchover logic are specified and formally verified as C code, and synthesized into hardware to resist software reconfiguration attacks. To provide greater assurance that the backup controller can be invoked before the physical process becomes unstable, copies of the production controller task and plant model are accelerated to preview the controller's behavior in the near future.
{"title":"Using high-level synthesis and formal analysis to predict and preempt attacks on industrial control systems","authors":"L. Lerner, Zane R. Franklin, W. Baumann, C. Patterson","doi":"10.1145/2554688.2554759","DOIUrl":"https://doi.org/10.1145/2554688.2554759","url":null,"abstract":"Industrial control systems (ICSes) have the conflicting requirements of security and network access. In the event of large-scale hostilities, factories and infrastructure would more likely be targeted by computer viruses than the bomber squadrons used in WWII. ICS zero-day exploits are now a commodity sold on brokerages to interested parties including nations. We mitigate these threats not by bolstering perimeter security, but rather by assuming that potentially all layers of ICS software have already been compromised and are capable of launching a latent attack while reporting normal system status to human operators. In our approach, application-specific configurable hardware is the final authority for scrutinizing controller commands and process sensors, and can monitor and override operations at the lowest (I/O pin) level of a configurable system-on-chip platform. The process specifications, stability-preserving backup controller, and switchover logic are specified and formally verified as C code, and synthesized into hardware to resist software reconfiguration attacks. 
To provide greater assurance that the backup controller can be invoked before the physical process becomes unstable, copies of the production controller task and plant model are accelerated to preview the controller's behavior in the near future.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122600010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When are FPGAs more energy efficient than processors? This question is complicated by technology factors and the wide range of application characteristics that can be exploited to minimize energy. Using a wire-dominated energy model to estimate the absolute energy required for programmable computations, we determine when spatially organized programmable computations (FPGAs) require less energy than temporally organized programmable computations (processors). The point of crossover will depend on the metal layers available, the locality, the SIMD wordwidth regularity, and the compactness of the instructions. When the Rent Exponent, p, is less than 0.7, the spatial design is always more energy efficient. When p=0.8, the technology offers 8 metal layers for routing, and data can be organized into 16b words and processed in tight loops of no more than 128 instructions, the temporal design uses less energy when the number of LUTs is greater than 64K. We further show that heterogeneous multicontext architectures can use even less energy than the p=0.8, 16b word temporal case.
{"title":"Wordwidth, instructions, looping, and virtualization: the role of sharing in absolute energy minimization","authors":"A. DeHon","doi":"10.1145/2554688.2554781","DOIUrl":"https://doi.org/10.1145/2554688.2554781","url":null,"abstract":"When are FPGAs more energy efficient than processors? This question is complicated by technology factors and the wide range of application characteristics that can be exploited to minimize energy. Using a wire-dominated energy model to estimate the absolute energy required for programmable computations, we determine when spatially organized programmable computations (FPGAs) require less energy than temporally organized programmable computations (processors). The point of crossover will depend on the metal layers available, the locality, the SIMD wordwidth regularity, and the compactness of the instructions. When the Rent Exponent, p, is less than 0.7, the spatial design is always more energy efficient. When p=0.8, the technology offers 8-metal layers for routing, and data can be organized into 16b words and processed in tight loops of no more than 128 instructions, the temporal design uses less energy when the number of LUTs is greater than 64K. We further show that heterogeneous multicontext architectures can use even less energy than the p=0.8, 16b word temporal case.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130937684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
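The abstract above uses the Rent Exponent p without defining it. As background, the standard statement of Rent's rule is given below; the symbol names T, t, and N are the conventional ones and are an assumption here, since the abstract itself only names p.

$$
T = t \, N^{p}, \qquad 0 \le p \le 1
$$

Here T is the number of external connections (terminals) of a region containing N logic blocks, t is the average number of terminals per block, and p is the Rent exponent. A smaller p means interconnect is more local, so less wiring crosses region boundaries as the design grows.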