Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357309
Parallel implementation of finite state machines for reducing the latency of stochastic computing
Cong Ma, D. Lilja
Stochastic computing, which employs random bit streams for computation, offers low hardware cost and high fault tolerance compared to computation using a conventional binary encoding. Finite state machine (FSM) based stochastic computing elements can compute complex functions, such as the exponentiation and hyperbolic tangent functions, more efficiently than those built from combinational logic. However, an FSM is sequential logic and cannot be parallelized directly the way combinational logic can, so reducing the long latency of the computation is difficult; applications in relatively high-frequency domains would therefore require an extremely fast clock rate for the FSM. This paper proposes a parallel implementation of the FSM that uses an estimator and a dispatcher to initialize the FSM directly to its steady state. Experimental results show that the outputs of four typical functions computed with the parallel implementation are very close to those of the serial version. The parallel FSM scheme further shows equivalent or better image quality than the serial implementation in two image-processing applications, edge detection and frame difference.
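To make the parallelization idea concrete, below is a minimal software sketch (an illustration, not the authors' implementation) of a serial FSM-based stochastic tanh element and a parallel variant in which an estimated input probability initializes each FSM copy near its steady state; the state count, estimator, and dispatch policy are assumptions.

```python
import random

def fsm_stanh(bitstream, n_states=8, init_state=None):
    """Serial FSM-based stochastic tanh element: the state moves up on a 1
    and down on a 0; the output bit is 1 when the state is in the upper half."""
    state = n_states // 2 if init_state is None else init_state
    out = []
    for b in bitstream:
        state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        out.append(1 if state >= n_states // 2 else 0)
    return out

def parallel_fsm_stanh(bitstream, n_copies=4, n_states=8):
    """Split the stream across FSM copies; each copy starts from an estimate
    of the steady state (here: the state implied by the input probability),
    so no warm-up transient is wasted."""
    p = sum(bitstream) / len(bitstream)                        # estimator: input probability
    init = min(n_states - 1, int(round(p * (n_states - 1))))   # dispatcher's initial state
    chunk = len(bitstream) // n_copies
    out = []
    for i in range(n_copies):                                  # each chunk could run in parallel
        out += fsm_stanh(bitstream[i*chunk:(i+1)*chunk], n_states, init_state=init)
    return out

# Example: a 0.7-probability stream of 256 bits
stream = [1 if random.random() < 0.7 else 0 for _ in range(256)]
print(sum(fsm_stanh(stream)) / 256, sum(parallel_fsm_stanh(stream)) / 256)
```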
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357286
A modified method of logical effort for FinFET circuits considering impact of fin-extension effects
A. Pandey, Pitul Garg, Shobhit Tyagi, R. Ranjan, B. Anand
For the transistor sizing of multistage digital circuits with predictable delays, the relationship between the effective input capacitances of all stages and the stage size ratios must be known. The effective capacitances of FinFET logic gates depend strongly on the transition times at their input and output nodes and are therefore not directly proportional to stage size, unlike conventional planar transistors. As a result, the methods developed for transistor sizing of planar logic circuits are not valid for FinFET logic circuits. Although this effect is absent in FinFET devices with large gate-drain overlap, their performance is highly compromised (higher power consumption and larger delay). We propose a modification of the existing logical-effort-based delay model for FinFET inverter chains that accounts for these characteristics of FinFET devices. We also discuss branching loads and the transistor sizing of non-critical paths. We observe that our FinFET sizing scheme leads to a significant reduction in inverter-chain delay, and that the delay-estimation error incurred by ignoring the transition-time dependency in a two-stage FO4 inverter chain is 31.8% and 15.3% for the two stages, respectively (from mixed-mode TCAD simulations).
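For reference, the conventional logical-effort delay model that the paper modifies, in its standard textbook form (the FinFET-specific corrections are not reproduced here):

```latex
\[
  d_i = g_i h_i + p_i, \qquad
  D = \sum_{i=1}^{N}\bigl(g_i h_i + p_i\bigr), \qquad
  \hat{f} = (GBH)^{1/N}
\]
% g_i: logical effort, h_i = C_{\mathrm{out},i}/C_{\mathrm{in},i}: electrical effort,
% p_i: parasitic delay; G, B, H are the path logical, branching, and electrical efforts,
% and the path delay is minimized when every stage carries the stage effort \hat{f}.
% For FinFET gates, C_{\mathrm{in},i} (and hence h_i) additionally depends on the
% input/output transition times, which breaks the fixed-effort assumption above.
```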
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357276
Power and performance aware memory-controller voting mechanism
M. Vratonjic, H. Singh, G. Kumar, R. Mohamed, Ashish Bajaj, Ken Gainey
Modern systems-on-chip (SoCs) integrate a graphics unit (GPU) with many application processor cores (CPUs), communication cores (modem, WiFi), and device interfaces (USB, HDMI) on a single die. The primary memory system is fast becoming a major performance bottleneck as more and more of these units share this critical resource. An integrated memory controller (IMC) is responsible for buffering and servicing memory requests from the CPU cores, the GPU, and other processing blocks that require DDR memory access. Previous work [2] focused on appropriately prioritizing memory requests and increasing the IMC/DDR memory frequency to improve system performance, which came at the expense of higher power consumption. Recent work has addressed this problem with a demand-based approach: the IMC is made aware of the application characteristics and scales its frequency based on the memory-access demand [1], leading to lower IMC and DDR frequencies and thus lower power. The work presented here shows that, instead of lowering the frequency, greater total system power savings can be achieved by increasing the IMC frequency at the beginning of a use-case with moderate GPU utilization. The motivation is that the GPU, with its inherent ability to execute a large number of parallel threads, can then access memory faster and complete its portion of the execution pipeline sooner. This, in turn, allows the timing requirements imposed on the CPU portion of the pipeline and on subsequent cycles to be relaxed, saving total system power. An algorithm for this technique, along with silicon results on an SoC implemented in an industrial 28nm process, is presented in this paper.
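The abstract does not give the voting rule itself; the toy sketch below only illustrates the direction of the decision it describes (boost the IMC frequency at the start of a moderately GPU-loaded use-case, otherwise fall back to a demand-based vote). The utilization thresholds, frequency table, and function name are hypothetical.

```python
def imc_frequency_vote(gpu_util, current_freq_mhz, freq_table=(400, 800, 1200, 1600)):
    """Hypothetical voting rule in the spirit of the paper: at the start of a
    use-case with moderate GPU utilization, vote the integrated memory
    controller up so the GPU drains its memory requests quickly and the
    CPU-side timing can then be relaxed.  Thresholds and the frequency table
    are assumptions, not silicon values."""
    MODERATE_LOW, MODERATE_HIGH = 0.3, 0.7
    if MODERATE_LOW <= gpu_util <= MODERATE_HIGH:
        higher = [f for f in freq_table if f > current_freq_mhz]
        return higher[0] if higher else current_freq_mhz   # boost to the next operating point
    lower = [f for f in freq_table if f < current_freq_mhz]
    return lower[-1] if lower else current_freq_mhz        # otherwise demand-based (lower) vote

print(imc_frequency_vote(0.5, 800))   # moderate GPU load -> 1200
print(imc_frequency_vote(0.1, 800))   # light GPU load    -> 400
```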
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357271
Clock buffer and flip-flop co-optimization for reducing peak current noise
Joohan Kim, Taewhan Kim
In high-speed digital circuits, the activation of all flip-flops used to store data must be strictly synchronized by clock signals delivered through clock networks. However, because many clock pins of flip-flops switch simultaneously at high frequency, a high peak power/ground noise (i.e., voltage drop) is induced at the clock boundary. To mitigate this current noise, we employ four different types of hardware component, each implementing a set of flip-flops and their driving buffer as a single unit; such components were previously used to reduce clock power consumption. (The idea behind the four types of clock boundary component is that one of the two inverters in a driving buffer and one of the two inverters in each of its driven flip-flops can be nullified without altering the circuit functionality.) This gives us the flexibility of selecting (i.e., allocating) clock boundary components so as to reduce peak current under a timing constraint. We formulate the component allocation problem of minimizing peak current as a multi-objective shortest-path problem and solve it efficiently with an approximation algorithm. We have implemented the proposed approach and tested it on ISCAS benchmark circuits. The experimental results confirm that our approach reduces the peak current by 27.7%∼30.9% on average.
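As a rough illustration of the allocation problem only (not the authors' approximation algorithm), the sketch below enumerates Pareto-optimal (delay, peak-current) labels over a chain of flip-flop groups, each choosing one of four component types, and returns the minimum-peak choice that meets a timing budget; all cost numbers are invented.

```python
from itertools import product

def allocate_components(groups, timing_budget):
    """Toy multi-objective allocation sketch: keep Pareto-optimal
    (total_delay, peak_current) labels per group, then pick the minimum-peak
    label meeting the timing budget; an exhaustive stand-in for the paper's
    approximation algorithm.  Peak contributions are assumed additive
    (simultaneous switching)."""
    labels = {(0.0, 0.0)}
    for types in groups:                          # types: list of (delay, peak) per group
        new = set()
        for (d, p), (dt, pt) in product(labels, types):
            new.add((d + dt, p + pt))
        # prune dominated labels
        labels = {l for l in new
                  if not any(o != l and o[0] <= l[0] and o[1] <= l[1] for o in new)}
    feasible = [l for l in labels if l[0] <= timing_budget]
    return min(feasible, key=lambda l: l[1]) if feasible else None

# two groups, four component types each: (delay_ps, peak_mA)
types = [(10, 9.0), (12, 7.5), (14, 6.2), (16, 5.8)]
print(allocate_components([types, types], timing_budget=26))
```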
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357318
Deep neural network acceleration framework under hardware uncertainty
M. Imani, Pushen Wang, T. Simunic
Deep neural networks (DNNs) are known to be effective models for cognitive tasks. However, DNNs are computationally expensive in both training and inference modes, as they require the precision of floating-point operations. Although several prior works proposed approximate hardware to accelerate DNN inference, they did not consider the impact of training on accuracy. In this paper, we propose a general framework called FramNN, which adjusts the DNN training model to make it appropriate for the underlying hardware. To accelerate training, FramNN applies adaptive approximation, which dynamically changes the level of hardware approximation depending on the DNN error rate. We test the efficiency of the proposed design on six popular DNN applications. Our evaluation shows that in inference, our design can achieve a 1.9× energy-efficiency improvement and a 1.7× speedup while ensuring less than 1% quality loss. Similarly, in training mode FramNN can achieve a 5.0× energy-delay product improvement compared to a baseline AMD GPU.
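A minimal sketch of the adaptive-approximation idea as described in the abstract, not the FramNN code: the hardware approximation level is raised while the training error stays low and lowered when it grows. The thresholds, the toy model, and the `train_step` interface are hypothetical.

```python
class _ToyModel:
    """Stand-in model whose training error decays with the number of steps
    and grows slightly with the approximation level."""
    def __init__(self):
        self.steps = 0
    def train_step(self, batch, approx_level):
        self.steps += 1
        return max(0.02, 0.5 / self.steps) + 0.01 * approx_level

def adaptive_approx_training(model, batches, levels=(0, 1, 2, 3),
                             raise_thr=0.05, lower_thr=0.15):
    """Raise the hardware approximation level when the error is low,
    back off to more precise hardware when the error is high."""
    level = 0                                   # start exact
    for batch in batches:
        err = model.train_step(batch, approx_level=levels[level])
        if err < raise_thr and level < len(levels) - 1:
            level += 1                          # error low  -> approximate more
        elif err > lower_thr and level > 0:
            level -= 1                          # error high -> approximate less
    return level

print(adaptive_approx_training(_ToyModel(), batches=range(50)))
```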
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357292
An automated flow for design validation of switched mode power supply
P. Chawda, S. Srinivasan
Traditionally, SPICE simulations have been used extensively to validate switched-mode power supply designs. Each simulation type requires multiple iterations at a design corner, and many such simulations are required across several design corners to find and fix issues, which is tedious and error prone. This paper presents a novel methodology that uses detailed analytical circuit equations in addition to simulations for automated validation of power supply designs. We propose automatic detection and correction of design issues over the required design corners. This significantly reduces the number of simulations required for design signoff and therefore drastically reduces overall design cycle time.
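As an example of the kind of closed-form check that can sit alongside SPICE corner runs (textbook buck-converter ripple equations, not the paper's equation set):

```python
def buck_ripple_checks(v_in, v_out, f_sw, L, C_out, i_ripple_max, v_ripple_max):
    """Analytical sanity check for a buck converter operating point:
    compute inductor current ripple and output voltage ripple from the
    standard continuous-conduction-mode equations and compare against spec."""
    duty = v_out / v_in
    di_L = v_out * (1 - duty) / (L * f_sw)      # inductor current ripple (A)
    dv_out = di_L / (8 * C_out * f_sw)          # output voltage ripple (V)
    return {
        "duty": duty,
        "inductor_ripple_A": di_L,
        "output_ripple_V": dv_out,
        "pass": di_L <= i_ripple_max and dv_out <= v_ripple_max,
    }

# 12 V -> 1.8 V buck, 500 kHz, 4.7 uH, 100 uF, hypothetical ripple limits
print(buck_ripple_checks(12.0, 1.8, 500e3, 4.7e-6, 100e-6, 1.0, 0.02))
```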
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357272
Parasitic-aware gm/ID-based many-objective analog/RF circuit sizing
Tuotian Liao, Lihong Zhang
Accurate consideration of parasitics in analog/RF circuit synthesis is becoming more essential as layout-dependent effects grow more influential in advanced technologies. In this paper, a gm/ID-based circuit sizing method that takes into account both device intrinsic parasitics and interconnect parasitics is proposed as the first stage of a hybrid sizing optimization. In the second stage, a many-objective evolutionary algorithm is applied to refine the sizing solutions. The proposed methodology has been used to optimize multiple performance metrics of an analog dynamic differential comparator and an RF circuit in an advanced CMOS technology. The experimental results demonstrate the high efficacy of our parasitic-aware hybrid sizing methodology.
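The gm/ID step of such a flow is usually a table lookup against pre-simulated device data; the sketch below shows that step in its generic textbook form under assumed characterization data, not the paper's parasitic-aware formulation.

```python
import numpy as np

def size_from_gm_over_id(gm_target, gm_id_choice, lut_gm_id, lut_id_per_um):
    """Generic gm/ID sizing step: pick an inversion level via gm/ID, look up
    the current density ID/W from characterization data, then derive the
    drain current and device width."""
    id_per_um = np.interp(gm_id_choice, lut_gm_id[::-1], lut_id_per_um[::-1])
    i_d = gm_target / gm_id_choice        # drain current from gm = (gm/ID) * ID
    width_um = i_d / id_per_um            # device width from current density
    return i_d, width_um

# Hypothetical characterization data: gm/ID (1/V) vs. ID/W (A/um)
lut_gm_id     = np.array([25, 20, 15, 10, 5])
lut_id_per_um = np.array([0.5e-6, 2e-6, 8e-6, 25e-6, 80e-6])
print(size_from_gm_over_id(gm_target=2e-3, gm_id_choice=15,
                           lut_gm_id=lut_gm_id, lut_id_per_um=lut_id_per_um))
```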
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357257
A droop measurement built-in self-test circuit for digital low-dropout regulators
Aydin Dirican, Cagatay Ozmen, M. Margala
Today's highly integrated systems-on-chip (SoCs) employ several integrated voltage regulators to achieve higher power efficiency and smaller board area. Testing of these voltage regulators is essential to validate the final product. In this work, we present a droop-measurement built-in self-test (BIST) circuit for digital low-dropout regulators (DLDOs). The proposed BIST system can store transient droop information with less than 1.05% error for droop voltages ranging from 45 mV to 520 mV, for a nominal DLDO output voltage of 1.6 V with a 1.8 V supply. Additionally, a reuse-based 10-bit successive-approximation-register (SAR) analog-to-digital converter (ADC) is incorporated to generate a digital output corresponding to the stored droop information as the BIST measurement result. The on-chip DLDO decoupling capacitor (∼1 nF) is reconfigured as a charge-scaling array for ADC operation during testing to increase reusability. The proposed BIST circuit is designed in a 0.18 μm CMOS process in Cadence Virtuoso and verified with corner simulations.
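During test, the reused capacitor array performs an ordinary successive-approximation search; a generic software model of that conversion (the standard SAR algorithm, not the reconfigured-decap circuit itself) looks like this:

```python
def sar_convert(v_sample, v_ref, n_bits=10):
    """Plain successive-approximation search: each step tests one bit of the
    DAC code against the sampled droop voltage via a comparator decision."""
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)
        if v_sample >= trial * v_ref / (1 << n_bits):   # comparator decision
            code = trial                                 # keep the bit
    return code

# A 300 mV stored droop against a 1.8 V reference, 10-bit resolution
code = sar_convert(0.300, 1.8)
print(code, code * 1.8 / 1024)     # digital code and reconstructed voltage
```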
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357287
Generic system-level modeling and optimization for beyond CMOS device applications
V. Huang, C. Pan, A. Naeemi
In this work, a fast, generic system-level design and optimization methodology is presented for emerging beyond-CMOS devices. The work evaluates the GaN heterojunction TFET, the WTe2 two-dimensional heterojunction interlayer TFET (ThinTFET), and the WTe2 transition-metal-dichalcogenide TFET (TMD TFET) in terms of performance and energy-delay product (EDP), and investigates the impact of device-level performance on system-level performance and power dissipation. The system-level methodology uses a generic model that relies on a stochastic wire distribution to estimate system performance. The optimum supply voltage and gate count for maximum throughput are determined for each device using an empirical CPI model under different power-budget constraints. Based on this study, the optimal design of each beyond-CMOS device technology is shown to improve EDP. The results delineate the optimal EDP over a range of power budgets and provide insightful trends in key design parameters, as well as optimal performance and power metrics, obtained from fast system-level optimization at the early design stage.
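A toy version of such a design-space sweep, with placeholder device constants and a crude delay/power model rather than the paper's empirical CPI model, might look as follows:

```python
import numpy as np

def best_design_point(power_budget_w, vdd_grid, gate_grid,
                      c_gate=1e-15, i_leak=1e-9, v_th=0.25, activity=0.1):
    """For every (Vdd, gate count) pair compute a rough throughput and
    dynamic+leakage power, keep the points under the power budget, and
    return the max-throughput one.  All constants are placeholders."""
    best = None
    for vdd, n_gates in ((v, g) for v in vdd_grid for g in gate_grid):
        delay = 2e-11 * vdd / (vdd - v_th) ** 2              # alpha-power-law style gate delay (s)
        freq = 1.0 / (delay * 30)                            # ~30 gate delays per cycle (assumed)
        power = activity * n_gates * c_gate * vdd**2 * freq + n_gates * i_leak * vdd
        if power <= power_budget_w:
            throughput = freq * n_gates                      # crude parallelism proxy
            if best is None or throughput > best[0]:
                best = (throughput, vdd, n_gates, power)
    return best

print(best_design_point(1.0, np.linspace(0.4, 1.0, 7), [1e6, 5e6, 1e7]))
```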
Pub Date: 2018-03-13  DOI: 10.1109/ISQED.2018.8357305
A deep learning based approach for analog hardware implementation of delayed feedback reservoir computing system
Jialing Li, Kangjun Bai, Lingjia Liu, Y. Yi
As the 2020 roadblock approaches, the need for breakthroughs in computing systems has directed researchers toward novel computing paradigms. The recently emerged reservoir computing model known as delayed feedback reservoir (DFR) computing utilizes only one nonlinear neuron along with a delay loop. It not only offers ease of hardware implementation but also achieves high performance owing to the inherent delay and its rich intrinsic dynamics. The field of deep learning has attracted worldwide attention because its hierarchical architecture allows more efficient performance than a shallow structure. Along with our analog hardware implementation of the DFR, we investigate the possibility of merging deep learning and DFR computing systems. In our evaluation, deep DFR models demonstrate 50%–81% better performance during training and a 39%–64% performance improvement during testing compared with a shallow leaky echo state network (ESN) model. Due to the difference in architecture, the training time of the MI (multiple-input) deep DFR model is approximately 21% longer than that of the deep DFR model. Our approach shows great potential for realizing analog hardware implementations of deep DFR systems.
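A minimal software model of a delayed feedback reservoir (illustrative only; the paper's contribution is the analog implementation), with one nonlinear node, a delay loop of virtual nodes, and a random input mask; the nonlinearity and coupling constants are arbitrary choices.

```python
import numpy as np

def dfr_states(inputs, n_virtual=50, eta=0.5, gamma=0.05, seed=0):
    """Single nonlinear neuron time-multiplexed over a delay loop: each
    virtual node is driven by its own delayed value plus the masked input.
    The returned state matrix would feed a linear readout layer."""
    rng = np.random.default_rng(seed)
    mask = rng.choice([-1.0, 1.0], size=n_virtual)   # time-multiplexing mask
    delay_line = np.zeros(n_virtual)                 # state held in the delay loop
    states = []
    for u in inputs:
        for i in range(n_virtual):
            delay_line[i] = np.tanh(eta * delay_line[i] + gamma * mask[i] * u)
        states.append(delay_line.copy())
    return np.array(states)

X = dfr_states(np.sin(np.linspace(0, 6 * np.pi, 200)))
print(X.shape)                                       # (200, 50)
```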