Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00077
Matthew Gaalswyk, James W. Stine
This paper presents a novel radix-4 division-by-recurrence architecture that uses a hierarchical Signed-Digit (SD) adder. Implementations are easily generated from the methodology because it is well suited to digital design. Results are generated for several designs using GlobalFoundries 45 nm SOI technology and ARM standard cells. The results indicate that these architectures can reduce the power dissipation of division by recurrence because the area is significantly decreased.
Title: A Low-Power Recurrence-Based Radix 4 Divider Using Signed-Digit Addition. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 391-396. https://doi.org/10.1109/ISVLSI.2019.00077
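The radix-4 recurrence named in the abstract above can be illustrated in software. What follows is a minimal sketch, not the paper's architecture: it uses exact rational arithmetic and an idealized digit-selection rule (rounding, clamped to the digit set {-2, ..., 2}), whereas hardware dividers typically select digits from a truncated residual estimate and keep the residual in a redundant form such as signed-digit. The function name `radix4_divide` and the input bound `|x| <= 2d/3` are assumptions made for this illustration.

```python
from fractions import Fraction


def radix4_divide(x, d, steps):
    """Radix-4 digit-recurrence division (illustrative model only).

    Recurrence: w[j+1] = 4*w[j] - q[j+1]*d, with q[j+1] in {-2,...,2}.
    Assumes d > 0 and |x| <= (2/3)*d so the residual stays bounded.
    Returns (quotient, final_residual, digits).
    """
    x, d = Fraction(x), Fraction(d)
    assert d > 0 and abs(x) <= Fraction(2, 3) * d
    w = x                      # residual, w[0] = x
    quotient = Fraction(0)
    digits = []
    for j in range(1, steps + 1):
        shifted = 4 * w        # multiply the residual by the radix
        # Idealized selection: nearest integer, clamped to the digit set.
        digit = max(-2, min(2, round(shifted / d)))
        w = shifted - digit * d
        quotient += Fraction(digit, 4 ** j)
        digits.append(digit)
    return quotient, w, digits


if __name__ == "__main__":
    q, w, digits = radix4_divide(Fraction(1, 3), Fraction(3, 4), 8)
    print(digits, float(q))
```

Each iteration retires two quotient bits, and the invariant `x == q*d + w / 4**steps` holds exactly, so the quotient error after `steps` iterations is at most `(2/3) / 4**steps`.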
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00081
H. Thapliyal, Zachary Kahleifeh
Approximate computing is a circuit design technique that reduces area and power dissipation at the cost of accuracy. In this paper, we investigate how to further reduce the power dissipation of approximate circuits while maintaining high speed by using a form of energy recovery (ER) computing known as Pulse Boost Logic (PBL). To demonstrate the power savings and speed capabilities, we have constructed an approximate 4-2 compressor circuit using PBL-based ER computing. Simulations were performed in Cadence Spectre using a 45 nm technology. At 800 MHz, the PBL-based approximate 4-2 compressor shows an average power saving of 64% compared with its standard CMOS design. We also show that combining approximate and ER computing achieves a power saving of 89% compared with a CMOS design of an accurate 4-2 compressor. Further, the proposed PBL-based approximate 4-2 compressor consumes 65% less energy than the CMOS-based approximate 4-2 compressor. We have verified the functionality of the proposed compressor up to 1 GHz to illustrate its suitability for low-power, low-energy sub-GHz IoT applications.
Title: Approximate Energy Recovery 4-2 Compressor for Low-Power Sub-GHz IoT Applications. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 414-418. https://doi.org/10.1109/ISVLSI.2019.00081
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00082
Sarit Chakraborty, Susanta Chakraborty
Digital Microfluidic Biochips (DMFBs) offer automation, reconfigurability, low operational cost, and accurate results. Such Lab-on-Chips (LoCs) are now extensively used in point-of-care diagnosis and other monitoring applications. Routing droplets of micro- or nanolitre (10^-6 or 10^-9 litre) volume on such chips raises several critical challenges because of the blockages caused by microfluidic modules present on the chip. A Micro-Electrode Dot Array (MEDA) based DMFB architecture can facilitate cross-contamination-free routing and eliminate other routing issues found on conventional DMF chips. This paper proposes a novel heuristic routing technique for MEDA-based DMFB architectures to tackle routing complexities due to overlapping nets, interfering blockages, and deadlock zones formed by conflicting nets. We categorize the region-based movements of droplets on a MEDA chip and derive a metric named Snooping Index (SIn) to improve droplet routing performance in the first phase. Next, an exhaustive search finds routing paths for the remaining nets under constraints specific to the MEDA platform. Finally, we compute another measure called 'Zone Compaction Factor' (ZCF) to overcome blockage-intensive route paths. Experimental results on benchmark suites I and III show that the proposed technique significantly reduces latest arrival time, average assay execution time, and number of used cells compared with earlier methods.
Title: Routing Performance Optimization for Homogeneous Droplets on MEDA-based Digital Microfluidic Biochips. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 419-424. https://doi.org/10.1109/ISVLSI.2019.00082
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00046
Gopabandhu Hota, Hardik Agrawal, M. Sharad
Acquisition and analysis of neural signals have greatly changed our understanding of the brain. Neural implants must be as small as possible so that they are minimally invasive to normal body functioning. The neural signal contains frequency components from 0.1-10 kHz and amplitudes in the 10-100 µV range, which is very small and easily distorted by external noise sources. This demands a highly area-efficient, low-noise Front-End Amplifier (FEA). A low supply voltage and low power dissipation are further critical requirements to ensure safe implantation and prolonged battery life. With these requirements in mind, we propose a programmable, area-efficient, low-noise FEA design together with a gain tuning and offset cancellation scheme, both manual and SAR-based, that is robust to temperature and process variations. The designed FEA occupies an area of 0.05 mm², which compares favorably with switched-capacitor based and closed-loop front-end amplifiers. In simulation, the maximum voltage gain is 87.6 dB, the input-referred noise density is 20 nV/√Hz, and the power consumption is 43.2 µW at a 1.8 V supply, with a Noise Efficiency Factor (NEF) of 1.84. The proposed scheme can cancel offsets of up to 30 mV using a 7-bit transistor bank.
Title: An Area Effective Programmable Front-end Amplifier for Neural Signal Acquisition. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 207-211. https://doi.org/10.1109/ISVLSI.2019.00046
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00086
Yasuhiro Takahashi, Hiroki Koyasu, S. D. Kumar, H. Thapliyal
The silicon-based Physical Unclonable Function (PUF) is a popular hardware security primitive for mitigating security vulnerabilities. The quasi-adiabatic logic based physical unclonable function (QUALPUF) was recently proposed by Kumar and Thapliyal. QUALPUF has ultra-low power dissipation; hence it is suitable for low-power portable electronic devices such as RFIDs, wireless sensor nodes, etc. In this paper, we present post-layout simulation results of a 4-bit QUALPUF for low-power portable electronic devices. To evaluate uniqueness and reliability, the 4-bit QUALPUF is implemented in a 0.18 µm standard CMOS process with a 1.8 V supply voltage. The QUALPUF occupies 58.7 × 15.7 µm² of layout area. The post-layout simulation results show that the 4-bit QUALPUF has good uniqueness and reliability with an energy consumption of 29.73 fJ/cycle/bit.
Title: Post-Layout Simulation of Quasi-Adiabatic Logic Based Physical Unclonable Function. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 443-446. https://doi.org/10.1109/ISVLSI.2019.00086
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00097
Changlu Liu, T. Lan, Qin Li, Kaige Jia, Yidian Fan, Xing Wu, F. Qiao, W. Qi, Xinjun Liu, Huazhong Yang
Direction of arrival (DOA) estimation is a critical component of conventional smart acoustic systems for navigation, noise-canceling hearing aids, and so on. However, conventional DOA estimation faces power consumption and processing speed bottlenecks dominated by the analog-to-digital converter (ADC) and the fast Fourier transform (FFT). Especially in always-on applications, the power-hungry ADC and time-consuming FFT account for most of the system's computation cost. We propose a novel architecture that performs DOA processing in the analog domain. The whole DOA procedure is implemented in the analog domain, without an ADC or frequency-domain transformation. To verify the performance of the architecture, we simulate a generic DOA algorithm. In a 0.18 µm CMOS process, the results show a 94.5% reduction in power consumption and a 4724× improvement in processing speed compared with a conventional digital realization. On a simple task, the simulated direction accuracy is 80.74%, and the approach can be extended to more complex scenarios.
Title: Energy-efficient Analog Processing Architecture for Direction of Arrival with Microphone Array. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 507-512. https://doi.org/10.1109/ISVLSI.2019.00097
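For readers unfamiliar with DOA estimation, the following is a minimal digital sketch of one generic approach: estimate the inter-microphone delay by cross-correlation, then convert it to an angle with the far-field relation sin(θ) = c·τ/d. It is not the paper's method, which avoids the ADC and digital processing entirely; the function names, the two-microphone geometry, and the 343 m/s speed of sound are assumptions made for illustration.

```python
import math


def best_lag(a, b, max_lag):
    """Lag in samples (b relative to a) that maximizes cross-correlation."""
    def corr(lag):
        return sum(a[i] * b[i + lag]
                   for i in range(len(a)) if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=corr)


def doa_degrees(lag, sample_rate_hz, mic_spacing_m, speed_of_sound=343.0):
    """Far-field arrival angle from broadside for a two-microphone pair."""
    s = speed_of_sound * lag / (sample_rate_hz * mic_spacing_m)
    s = max(-1.0, min(1.0, s))  # guard against |s| > 1 from noisy lags
    return math.degrees(math.asin(s))


if __name__ == "__main__":
    mic_a = [0, 0, 1, 0, 0, 0, 0, 0]
    mic_b = [0, 0, 0, 0, 1, 0, 0, 0]   # same pulse, two samples later
    lag = best_lag(mic_a, mic_b, max_lag=3)
    print(lag, doa_degrees(lag, sample_rate_hz=48000, mic_spacing_m=0.1))
```

The brute-force correlation above costs O(N · max_lag) multiplications per estimate; digital implementations often move this step to the frequency domain with an FFT, which is one of the two bottlenecks the abstract identifies.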
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00075
Kevin Vaca, Archit Gajjar, Xiaokun Yang
A real-time automatic music transcription (AMT) system has great potential for applications and interactions between people and music, for example on popular devices such as Amazon Echo and Google Home. This paper presents a chord recognition design on the Zynq-7000 Field-Programmable Gate Array (FPGA) that samples analog audio signals through a microphone and, in real time, shows sheet music corresponding to the user's playing on a smartphone app. We demonstrate the design of audio sampling in the programmable logic and the implementation of the frequency transform and vector building on the processing system, which is an embedded ARM core on the Zynq FPGA. Experimental results show that the logic design uses 574 slices of look-up tables (LUTs) and 792 slices of flip-flops. Because the dynamic power consumption of the processing system (1399 mW) is significantly higher than that of the programmable logic (7 mW), future work on this platform is to design intellectual property (IP) blocks in a hardware description language (HDL) for the frequency transform, pitch class profile (PCP), and pattern matching algorithms, so that the entire system-on-chip (SoC) can be taped out as an application-specific design for consumer electronics.
Title: Real-Time Automatic Music Transcription (AMT) with Zync FPGA. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 378-384. https://doi.org/10.1109/ISVLSI.2019.00075
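The pitch class profile (PCP) step mentioned in the abstract above can be sketched as follows. This is the common textbook formulation, not necessarily the one used in the paper: each spectral peak is mapped to one of 12 pitch classes by its distance in semitones from a reference pitch, and peak magnitudes are accumulated per class. The function names, the A4 = 440 Hz reference, and the indexing (0 = A) are assumptions for this illustration.

```python
import math


def pitch_class(freq_hz, ref_hz=440.0):
    """Pitch class 0..11 of a frequency, with 0 = the class of ref_hz (A)."""
    semitones = round(12 * math.log2(freq_hz / ref_hz))
    return semitones % 12


def pitch_class_profile(peaks):
    """Fold (frequency, magnitude) spectral peaks into a 12-bin profile."""
    profile = [0.0] * 12
    for freq_hz, magnitude in peaks:
        profile[pitch_class(freq_hz)] += magnitude
    return profile


if __name__ == "__main__":
    # C major triad: C4, E4, G4 with equal (made-up) magnitudes.
    chord = [(261.63, 1.0), (329.63, 1.0), (392.00, 1.0)]
    print(pitch_class_profile(chord))
```

A chord recognizer would then compare the resulting 12-element vector against stored chord templates, which corresponds to the pattern matching stage the abstract lists as a candidate for hardware implementation.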
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00042
R. Cai, Xiaolong Ma, O. Chen, Ao Ren, Ning Liu, N. Yoshikawa, Yanzhi Wang
Josephson Junction (JJ) based superconductor logic families have been proposed and implemented to process analog and digital signals [1] because of their low energy dissipation and ultrafast switching speed. Thanks to its construction from resistance-less wires and ultrafast switches, such logic can operate at clock frequencies of several tens of gigahertz and be hundreds of thousands of times more energy efficient than its CMOS counterparts. It is perceived as an important candidate to replace state-of-the-art CMOS because of its superior potential in operation speed and energy efficiency, as recognized by the U.S. IARPA C3 and SuperTools Programs and the Japan MEXT-JSPS Project. The design and fabrication of superconducting circuits are already established [2]-[4]. In addition, a prototype superconducting microprocessor, "Core 1", was demonstrated in 2004 [3]; it executes instructions at a clock frequency of several tens of gigahertz with extremely low power dissipation. These achievements make superconducting electronics highly promising for future high-performance computing applications. One of the most mature superconducting technologies, Rapid-Single-Flux-Quantum (RSFQ), was proposed by K. Likharev, O. Mukhanov, and V. Semenov in 1985 [1]. Despite its ability to operate at ultra-high speeds of hundreds of GHz while maintaining extremely low switching energy (10^-19 J), it suffers from increasing static power due to the on-chip resistors required to provide a constant DC bias supply to the main RSFQ circuit. Numerous methods have been proposed to resolve the static power dissipation problem of RSFQ, including low-voltage RSFQ (LV-RSFQ) [5], reciprocal quantum logic (RQL) [6], LR-biased RSFQ [7], and energy-efficient single-flux quantum (eSFQ) [8].
The Adiabatic Quantum-Flux-Parametron (AQFP) technology, on the other hand, uses AC bias/excitation currents as both multiphase clock signal and power supply [9] to avoid the power consumption overhead of DC bias while operating at a frequency of a few GHz. Consequently, AQFP is remarkably energy efficient compared with RSFQ, albeit at a lower operating frequency. The energy-delay product (EDP) of AQFP circuits fabricated using processes such as the AIST standard process 2 (STP2) and the MIT-LL SFQ process [10], [11] is at least 200 times smaller than that of the other energy-efficient superconductor logics and is only three orders of magnitude larger than the quantum limit [9]. Physical testing results of an AQFP 8-bit carry-look-ahead adder and of large-scale circuits consisting of up to 10,000 AQFP logic gates have demonstrated that AQFP is a promising technology that is robust against circuit parameter variations [12]. Despite the high application potential of AQFP in VLSI circuits, a systematic, automatic synthesis framework for AQFP is still urgently needed. Two features of AQFP prevent conventional CMOS synthesis methods from being applied to it directly. In spite of the fact that conventional CMOS circuits rely heavily on AND-OR-Inverter (AOI) based representations, AQFP circuits favor majority gates. In fact, two-input AND and OR gates in AQFP are themselves built from three-input majority gates with one input held constant. In addition, because AQFP propagates data synchronously with the clock, all inputs of any gate must have equal delay. To satisfy this balanced timing requirement, splitters and buffers must be inserted into the circuit. Some circuits can double in size even when the optimal number of buffers and splitters is inserted, so the insertion method has a large impact on overall resource consumption. As design complexity grows, an unoptimized insertion method can add a large number of unnecessary buffers and splitters. Beyond a complete synthesis framework, an integrated development environment (IDE) for AQFP design is also lacking. An IDE that integrates schematic and layout editors with AQFP simulation and analysis tools is urgently needed to enable a better, more efficient AQFP design flow. In this paper, we propose a complete AQFP design tool comprising an IDE, a complete majority-based synthesis framework, and a buffer and splitter insertion framework. The majority-gate synthesis framework converts any AOI netlist into a corresponding MAJ netlist by mapping all feasible three-input subnetworks to their MAJ-based implementations. The automatic buffer and splitter insertion method adds the optimal number of buffers and splitters to any given gate-level netlist: it finds the minimum number of buffers and splitters needed to achieve equal delay under any library restriction on splitter size.
Title: IDE Development, Logic Synthesis and Buffer/Splitter Insertion Framework for Adiabatic Quantum-Flux-Parametron Superconducting Circuits. Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 187-192. https://doi.org/10.1109/ISVLSI.2019.00042
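Two points from the abstract above lend themselves to a small sketch: two-input AND/OR gates realized as three-input majority gates with one constant input, and the equal-delay requirement that forces buffer insertion. The code below is an illustrative model only. The buffer count uses simple as-soon-as-possible levelization, ignores splitters and fan-out limits, and is not the optimal insertion method the paper proposes; the netlist format and all names are assumptions.

```python
def maj(a, b, c):
    """Three-input majority of 0/1 values."""
    return (a & b) | (a & c) | (b & c)


def and2(a, b):
    return maj(a, b, 0)   # constant 0 turns majority into AND


def or2(a, b):
    return maj(a, b, 1)   # constant 1 turns majority into OR


def count_balancing_buffers(netlist, primary_inputs):
    """Buffers needed so every gate input arrives one level before the gate.

    netlist maps gate name -> list of source names (gates or inputs).
    Each gate sits at level 1 + max(level of its sources); inputs at 0.
    """
    level = {name: 0 for name in primary_inputs}

    def depth(node):
        if node not in level:
            level[node] = 1 + max(depth(src) for src in netlist[node])
        return level[node]

    total = 0
    for gate, sources in netlist.items():
        for src in sources:
            total += depth(gate) - 1 - depth(src)
    return total


if __name__ == "__main__":
    # g2 consumes g1 (level 1) plus inputs d and e (level 0).
    netlist = {"g1": ["a", "b", "c"], "g2": ["g1", "d", "e"]}
    print(count_balancing_buffers(netlist, ["a", "b", "c", "d", "e"]))
```

In the example, inputs `d` and `e` reach `g2` one level early, so one buffer is needed on each of those two edges. Even this tiny netlist shows how path balancing adds cells that perform no logic, which is why the insertion strategy matters for total circuit size.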
Pub Date: 2019-07-01 DOI: 10.1109/ISVLSI.2019.00043
Adarsha Balaji, Anup Das
Spiking neural networks (SNNs) are efficient computation models for spatio-temporal pattern recognition applications on neuromorphic hardware. Neuromorphic hardware is typically designed using interconnected crossbars, with each crossbar containing a structure of fully connected neurons. To ensure application performance such as accuracy, and system performance such as throughput and resource utilization, SNNs need to be mapped efficiently onto neuromorphic hardware. To address this, we propose a design flow to partition and map SNN-based applications onto neuromorphic hardware, with the aim of enhancing application and system performance. The design flow operates in two steps: (1) a two-step clustering technique partitions trained SNNs into clusters of neurons and synapses so as to minimize inter-cluster spike communication; (2) the clusters are mapped and scheduled onto crossbar-based architectures, modeled using Synchronous Data-flow Graphs (SDFGs). The SDFG model incorporates hardware constraints such as the I/O bandwidth of crossbars and synaptic memory while analyzing the throughput of the modeled system. Our design flow integrates CARLsim, a GPU-accelerated application-level SNN simulator, with SDF3, a tool to map SDFGs onto hardware. We evaluate the design flow using synthetic and realistic SNN-based applications. For throughput-constrained applications, we achieve reductions of 21.74% in memory usage and 15.03% in utilization of the time-multiplexed interconnect, respectively, compared with a state-of-the-art approach.
{"title":"A Framework for the Analysis of Throughput-Constraints of SNNs on Neuromorphic Hardware","authors":"Adarsha Balaji, Anup Das","doi":"10.1109/ISVLSI.2019.00043","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00043","url":null,"abstract":"Spiking neural networks (SNN) are efficient computation models to infer spacio-temporal pattern recognition applications on neuromorphic hardware. Neuromorphic hardware are typically designed using interconnected crossbars, with each crossbar containing a structure of fully connected neurons. In order to ensure application performance such as accuracy and system performance such as throughput and resource utilization, SNNs need to be efficiently mapped on neuromorphic hardware. To address this, we propose a design flow to partition and map SNN-based applications on neuromorphic hardware, with an aim to enhance application and system performance. The design flow operates in two steps : (1) a two-step clustering technique to partition trained SNNs into clusters of neurons and synapses, with an aim to minimize inter-cluster spike communication, (2) mapping and scheduling the clusters on to crossbars-based architectures, modeled using Synchronous Data-flow Graphs (SDFGs). The SDFG model incorporates hardware constraints such as I/O bandwidth of crossbars and synaptic memory while analyzing the throughput of the modeled system. Our design-flow integrates CARLsim, a GPU-accelerated application-level SNN simulator with SDF3, a tool to map SDFG on hardware. We evaluate the design-flow using synthetic and realistic SNN-based applications. 
We show that, for throughput constrained applications, we achieve a 21.74% and 15.03% reduction in memory usage and utilization of the time-multiplexed interconnect, compared to a state of the art approach.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"1 1","pages":"193-196"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75049569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
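The clustering objective named in the abstract above, minimizing inter-cluster spike communication, can be illustrated with a small cost function. This is only a sketch under stated assumptions, not the authors' clustering technique: the synapse list, the spike counts, and the names `inter_cluster_spikes`, `partition_a`, and `partition_b` are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple

# Each synapse is (pre_neuron, post_neuron, spike_count).
# The spike counts below are made-up illustrative values.
Synapse = Tuple[int, int, int]


def inter_cluster_spikes(synapses: List[Synapse], cluster_of: Dict[int, int]) -> int:
    """Sum the spikes carried by synapses whose endpoints lie in different clusters."""
    return sum(
        spikes
        for pre, post, spikes in synapses
        if cluster_of[pre] != cluster_of[post]
    )


synapses = [(0, 1, 50), (1, 2, 5), (2, 3, 40), (0, 3, 2)]

# Two candidate assignments of four neurons to two clusters (hypothetical).
partition_a = {0: 0, 1: 0, 2: 1, 3: 1}  # keeps the heavily used synapses local
partition_b = {0: 0, 1: 1, 2: 0, 3: 1}  # cuts every synapse

cost_a = inter_cluster_spikes(synapses, partition_a)
cost_b = inter_cluster_spikes(synapses, partition_b)
print(cost_a, cost_b)
```

A partitioner pursuing this objective would prefer `partition_a`, since fewer spikes have to cross the shared interconnect between crossbars.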
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00101
Ruben Vazquez, Islam Badreldin, Mohamad Hammam Alsafrjalani, A. Gordon-Ross
Embedded computing systems are becoming increasingly complex, now performing tasks that were generally limited to desktop computing systems. However, embedded system designers are still required to adhere to stringent embedded design constraints (e.g., energy and area requirements) when designing such increasingly complex systems. To meet these constraints, configurable hardware components introduce configurable parameters (e.g., CPU voltage and frequency, cache size, cache associativity, cache line size, pipeline depth/width, etc.) that can be tuned to specific values to meet different design constraints (e.g., area, energy, performance, etc.) and user demands (e.g., increased battery life, increased performance, or a desired trade-off), which translates to a better quality of user experience. However, determining these specific parameter values is increasingly difficult and time-consuming as the configurable parameter design space increases. This issue is further complicated by the fact that each application has a different set of optimal/best parameter values based on these demands and requirements. Furthermore, repetitious application behavior, known as phases, which occurs throughout an application's runtime, can be exploited by tracking each phase's unique optimal parameter values, resulting in a multiplicative or exponential increase in the size of the configuration space. In this paper, we propose a machine learning-based methodology to significantly reduce the time required to find the optimal configurable parameter values for the instruction and data caches for each application phase. In our method, we use artificial neural networks (ANNs) to predict the optimal configuration for application phases. We collect execution statistics for use as features for an application phase and use feature reduction to significantly reduce the feature size. We show that ANNs exhibit high, stable accuracy over multiple training and testing iterations. We also show that applications exhibit low energy degradation (less than 1%) for both the instruction and data caches using our methodology.
{"title":"Machine Learning-based Prediction for Phase-Based Dynamic Architectural Specialization","authors":"Ruben Vazquez, Islam Badreldin, Mohamad Hammam Alsafrjalani, A. Gordon-Ross","doi":"10.1109/ISVLSI.2019.00101","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00101","url":null,"abstract":"Embedded computing systems are becoming increasingly complex, now performing tasks that were generally limited to desktop computing systems. However, embedded system designers are still required to adhere to stringent embedded design constraints (e.g., energy and area requirements) when designing such increasingly complex systems. To meet these constraints, configurable hardware components introduce configurable parameters (e.g., CPU voltage and frequency, cache size, cache associativity, cache line size, pipeline depth/width, etc.) that can be tuned to specific values to meet different design constraints (e.g., area, energy, performance, etc.) and user demands (e.g., increased battery life, increased performance, or a desired trade off), which translates to a better quality of the user experience. However, determining these specific parameter values is increasingly difficult and time-consuming as the configurable parameter design space increases. This issue is further complicated when considering that each application has a different set of optimal/best parameter values based on these demands and requirements. Furthermore, repetitious application behavior, known as phases, which occur throughout an application's runtime, can be exploited by tracking each phase's unique optimal parameter values; resulting in a multiplicative increase or an exponential increase in the size of the size of the configuration space. In this paper, we propose a machine learning-based methodology to significantly reduce the time required to find the optimal configurable parameter values for the instruction and data caches for each application phase. 
In our method, we use artificial neural networks (ANNs) to predict the optimal configuration for application phases. We collect execution statistics for use as features for an application phase and use feature reduction to significantly reduce the features size. We show that ANNs exhibit high, stable accuracy over multiple training and testing iterations. We also show that applications exhibit low energy degradations (less than 1%) for both the instruction and data caches using our methodology.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"5 1","pages":"529-534"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91506287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
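The growth of the configuration space described in the abstract above can be made concrete with a short count. This is a sketch under assumed values: the abstract does not list the tunable cache parameter values or the number of phases, so the lists and `num_phases` below are hypothetical.

```python
from itertools import product

# Hypothetical tunable values for one cache; the abstract does not specify them.
cache_sizes_kb = [2, 4, 8]
associativities = [1, 2, 4]
line_sizes_b = [16, 32, 64]

# Every combination of the three parameters is one candidate configuration.
configs = list(product(cache_sizes_kb, associativities, line_sizes_b))

num_phases = 4  # assumed number of distinct application phases

# Searching each phase independently multiplies the work by the phase count.
per_phase_evaluations = num_phases * len(configs)

# Treating the per-phase choices as one joint assignment grows exponentially.
joint_assignments = len(configs) ** num_phases

print(len(configs), per_phase_evaluations, joint_assignments)
```

Even with these small assumed ranges, exhaustive exploration becomes costly quickly, which is the motivation the abstract gives for predicting the best configuration per phase instead of searching for it.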