Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365935
Min-Woo Ko, Hyunki Han, Hyunsik Kim
The applications of large-area electronics (LAEs) based on thin-film transistors (TFTs) are rapidly expanding from displays to sensors. The TFT gate drivers require high-voltage bipolar supply rails (approximately ±15V); so far, these have typically been generated from a battery ($V_{BAT}$) by switched-capacitor converters (SCCs) [1]. Since a high SNR is crucial for TFT-based sensors such as an under-display fingerprint sensor [2], the noise and ripple of the SCC output, which are prone to coupling into the readout AFE, should be minimized. A straightforward method is to use an LDO as a post-regulator in series with the SCC. However, a relatively large dropout voltage ($V_{DO}$) across the LDO significantly degrades the efficiency [3]. Conversely, a small $V_{DO}$ reduces the LDO loop gain because the pass transistor operates in the triode region, resulting in decreased supply-ripple rejection (PSR). From the perspective of the SCC, its fixed voltage conversion ratio (VCR) means that $V_{DO}$ cannot be finely regulated over a wide variation of $V_{BAT}$; fine regulation would increase either the complexity (cost) overhead or the power loss in the SC circuit. In this work, an energy-recycled optimal $V_{DO}$ control (EROC) technique in the SC bipolar step-up stage is proposed for higher efficiency. In addition, a load-current-reused (LCR) post-regulator is presented that achieves high PSR while minimizing the power loss in the pass transistor.
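The efficiency penalty of the dropout voltage can be illustrated with the textbook first-order LDO model, in which the load current flows through the pass transistor and the quiescent current is ignored, so that efficiency is roughly $V_{OUT}/(V_{OUT}+V_{DO})$. The sketch below is only an illustration of that trade-off under those assumptions; the function name and the dropout values are hypothetical and are not taken from the paper.

```python
def ldo_efficiency(v_out, v_dropout):
    """Ideal series-regulator efficiency, ignoring quiescent current.

    Input and output currents are assumed equal, so the efficiency
    reduces to V_OUT / V_IN with V_IN = V_OUT + V_DO.
    """
    return v_out / (v_out + v_dropout)


if __name__ == "__main__":
    V_OUT = 15.0  # volts, matching the +15V rail mentioned in the abstract
    for v_do in (0.1, 0.5, 1.0, 2.0):  # hypothetical dropout voltages
        eta = ldo_efficiency(V_OUT, v_do)
        print(f"V_DO = {v_do:.1f} V -> LDO efficiency = {eta * 100:.1f}%")
```

In this simple model, every extra volt of dropout on a 15V rail costs roughly six percentage points of post-regulator efficiency, which is why keeping $V_{DO}$ small but well controlled matters.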
{"title":"17.8 A 90.5%-Efficiency 28.7µVRMS-Noise Bipolar-Output High-Step-Up SC DC-DC Converter with Energy-Recycled Regulation and Post-Filtering for ±15V TFT-Based LAE Sensors","doi":"10.1109/ISSCC42613.2021.9365935","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366045
Zhengyu Chen, X. Chen, Jie Gu
Computing-In-Memory (CIM) techniques, which incorporate analog computing inside memory macros, have shown significant advantages in computing efficiency for deep learning applications. While earlier CIM macros were limited to low bit precision, e.g., binary weights in [1], recent works have shown 4-to-8b precision for the weights/inputs and up to 20b for the output values [2], [3]. Sparsity and application features have also been exploited at the system level to further improve the computation efficiency [4], [5]. To enable higher precision, bit-wise operations have commonly been used [3], [4]. However, existing solutions that use bit-wise operations with SRAM cells have limitations. Fig. 15.3.1 summarizes the challenges and the solutions in this work. First, all existing solutions use a 6T/8T/10T SRAM as the CIM cell, which fundamentally limits the size of the CIM array. In this work, we replace the commonly used SRAM cell with a 3-transistor (3T) analog memory cell, referred to as dynamic-analog-RAM (DARAM), which represents a 4b weight value as an analog voltage. This leads to a $\sim 10\times$ reduction in transistor count and achieves an effective CIM single-bit area smaller than the foundry-supplied 6T SRAM cell. Second, as no bit-wise calculation is needed in this work, only single-phase MAC operations are performed, removing the throughput degradation associated with previous multi-phase approaches and digital accumulation in [3], [4]. Furthermore, analog linearity issues are mitigated by highly linear time-based activation, removal of matching requirements for critical multi-bit caps [4], [6], and a special read-current compensation technique. Third, to mitigate the power bottleneck of the ADC or SA, this work applies analog sparsity-based low-power methods, which include a compute-adaptive ADC skipping operation when the analog MAC value is small (or “sparse”) and a special weight-shifting technique, leading to an additional $\sim 2\times$ reduction in CIM-macro power. We demonstrate the proposed techniques using a 65nm CIM-based CNN accelerator showing state-of-the-art energy efficiency.
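The transistor-count claim can be sanity-checked with simple arithmetic. The sketch below assumes that a conventional design stores a 4b weight in four SRAM cells, while one 3T DARAM cell holds the whole 4b weight as an analog voltage. The abstract does not say which SRAM baseline the roughly 10× figure refers to, so the three baselines below are an assumption for illustration only, and the function name is hypothetical.

```python
def transistors_per_weight(weight_bits, transistors_per_cell):
    """Transistors needed to store one weight with one digital cell per bit."""
    return weight_bits * transistors_per_cell


DARAM_TRANSISTORS = 3  # one 3T cell stores the full 4b weight as a voltage
WEIGHT_BITS = 4

if __name__ == "__main__":
    for name, per_cell in (("6T", 6), ("8T", 8), ("10T", 10)):
        sram = transistors_per_weight(WEIGHT_BITS, per_cell)
        ratio = sram / DARAM_TRANSISTORS
        print(f"{name} SRAM: {sram} transistors per 4b weight, "
              f"{ratio:.1f}x more than the 3T cell")
```

Under these assumptions the ratio ranges from 8× (6T) to about 13× (10T), which is consistent with the order of magnitude quoted in the abstract.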
{"title":"15.3 A 65nm 3T Dynamic Analog RAM-Based Computing-in-Memory Macro and CNN Accelerator with Retention Enhancement, Adaptive Analog Sparsity and 44TOPS/W System Energy Efficiency","doi":"10.1109/ISSCC42613.2021.9366045","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365988
T. Song, W. Rim, Hoonki Kim, K. Cho, Taeyeong Kim, Taejung Lee, Geumjong Bae, Dong-Won Kim, S. Kwon, S. Baek, Jonghoon Jung, J. Kye, Hakchul Jung, Hyungtae Kim, Soon-Moon Jung, Jaehong Park
Advanced technologies help to improve SRAM performance via recent transistor breakthroughs [1], which alleviate device performance impediments and allow SRAM designers to focus on handling metal resistance. Since SRAM margins are more vulnerable to the increasing metal resistance that comes with smaller critical dimensions, SRAM-assist circuits have been proposed to overcome the impact of metal resistance in recent technologies [2]–[5]. One of the challenges is a design limitation such as the quantized transistor, which requires SRAM-assist circuits to optimize SRAM margins. In this paper, gate-all-around (GAA) SRAM design techniques are proposed, which improve SRAM margins more freely, in addition to power, performance, and area (PPA). Moreover, SRAM-assist schemes are proposed to overcome metal resistance, which maximizes the benefit of GAA devices.
{"title":"24.3 A 3nm Gate-All-Around SRAM Featuring an Adaptive Dual-BL and an Adaptive Cell-Power Assist Circuit","doi":"10.1109/ISSCC42613.2021.9365988","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365933
Zongming Duan, Bowen Wu, Chuanming Zhu, Yan Wang, Weiwei Jin, Y. Liu, Yanhui Wu, Tao Zhang, Ming Liu, B. Dou, Bingbing Liao, Wei Lv, Dongfang Pan, Yongjie Li, Changwei Wang, Yuefei Dai, Pei Li, Hao Gao
Millimeter-wave (mm-wave) radar is an essential sensor of advanced driver assistance systems and autonomous driving. Its detection requirement extends from traditional long-to-medium range to emerging short and ultra-short range for surround sensing, which spans sub-1 meter to 40 meters and requires a compact and low-cost solution with fast chirp generation to improve the range resolution [1]–[5]. In [2], 3 transmitters (TX) and 4 receivers (RX) are used to create a multi-mode radar transceiver; however, its on-board antenna integration limits its size and cost for short-distance applications. In [3], a 12TX and 16RX phase-domain multiple-input multiple-output (MIMO) radar is presented. However, an external PLL is required, and the digital-processing power consumption is high. Furthermore, its size and cost are not suitable for short and ultra-short range applications. In [4], a compact 1TX and 1RX frequency-modulated continuous-wave (FMCW) radar with an antenna-in-package (AiP) is applied to short-distance applications; however, its detection range is limited to only 20 meters. This work presents a compact 76-to-81GHz FMCW MIMO radar with fast chirp generation, integrated with an AiP array in embedded glass fan-out (eGFO) technology for short and ultra-short range applications. A fast-modulated chirp signal is especially crucial in MIMO radar for improving range resolution with update rate, and it also moves the IF frequency away from the 1/f noise corner. Thanks to the dynamic bias technique, the chirp rate is improved to 312.5MHz/μs and the maximum modulation bandwidth is 7.2GHz. In addition, the 2TX and 6RX array antennas are integrated in package, and the effective isotropic radiated power (EIRP) of the TX antenna is improved by 7.5dB by a coaxial feeding structure in eGFO technology with a multi-feed antenna technique.
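The quoted chirp rate and bandwidth can be related to range resolution and IF frequency with the standard FMCW relations $\Delta R = c/(2B)$ and $f_{IF} = 2RS/c$, where $B$ is the swept bandwidth, $S$ the chirp slope and $R$ the target range. The sketch below simply plugs the abstract's numbers into these textbook formulas; the function names and the example ranges of 1m and 40m are illustrative choices, not values measured in the paper.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution_m(bandwidth_hz):
    """Textbook FMCW range resolution: c / (2 * B)."""
    return C / (2.0 * bandwidth_hz)


def beat_frequency_hz(target_range_m, chirp_slope_hz_per_s):
    """IF (beat) frequency of a stationary target: 2 * R * S / c."""
    return 2.0 * target_range_m * chirp_slope_hz_per_s / C


if __name__ == "__main__":
    bandwidth = 7.2e9   # Hz, maximum modulation bandwidth from the abstract
    slope = 312.5e12    # Hz/s, i.e. 312.5 MHz per microsecond
    print(f"Range resolution: {range_resolution_m(bandwidth) * 100:.2f} cm")
    print(f"Full-bandwidth chirp duration: {bandwidth / slope * 1e6:.2f} us")
    for r in (1.0, 40.0):  # hypothetical target ranges in meters
        print(f"IF at {r:4.1f} m: {beat_frequency_hz(r, slope) / 1e6:.2f} MHz")
```

Because the IF frequency scales with the chirp slope, a faster chirp pushes even a 1m target to an IF of a few MHz, which illustrates the abstract's point about moving the IF away from the 1/f noise corner.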
{"title":"14.6 A 76-to-81GHz 2×8 FMCW MIMO Radar Transceiver with Fast Chirp Generation and Multi-Feed Antenna-in-Package Array","doi":"10.1109/ISSCC42613.2021.9365933","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366048
Kelvin McCollough, S. Huss, J. Vandersand, Randall Smith, C. Moscone, Qazi Omar Farooq
With recent AI and big data developments, quickly moving massive amounts of data is paramount to future technologies. Scalable solutions are needed that can sustain higher-performance systems consuming ever-increasing amounts of data. Low power, low latency, and aggregation are essential in highly scalable data-distribution solutions [2].
{"title":"A 480Gb/s/mm 1.7pJ/b Short-Reach Wireline Transceiver Using Single-Ended NRZ for Die-to-Die Applications","doi":"10.1109/ISSCC42613.2021.9366048","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366053
Junjie Mu, Bongjin Kim
Partial differential equations (PDEs) are ubiquitous in physics and engineering and are used for understanding various physical phenomena, including heat, diffusion, fluid dynamics and electrodynamics, and quantum mechanics. Analytical PDE solutions are rare, and hence solutions are approximated using numerical methods. The finite difference method (FDM) approximates PDEs by computing finite differences between discretized solutions. Since finite differences only approximate the derivatives in a PDE, many iterations of high-precision computation are required to achieve higher accuracy in the numerical solution. Hence, the computationally expensive FDM necessitates the use of high-performance computers, whose energy consumption is excessive (e.g., 15mJ per iteration and $>320\mathrm{J}$ in total for solving a PDE with a $128\times128$ grid using a GPU [1]). Consequently, there is an ever-increasing need for a dedicated hardware accelerator for solving PDEs.
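To make the iterative nature of FDM concrete, the sketch below solves the 2D Laplace equation on a small grid with plain Jacobi iteration, where each interior point is repeatedly replaced by the average of its four neighbors. This is a generic textbook illustration, not the accelerator's algorithm; the grid size, boundary values, tolerance and function name are all assumptions chosen for the example. For scale, the GPU figures cited above imply more than about 21,000 iterations (320J divided by 15mJ).

```python
def solve_laplace_jacobi(n=21, top=1.0, tol=1e-8, max_iters=10_000):
    """Solve Laplace's equation on an n x n grid with Jacobi iteration.

    The top edge is held at `top`, the other three edges at 0.
    Returns the final grid and the number of iterations performed.
    """
    u = [[0.0] * n for _ in range(n)]
    u[0] = [top] * n  # Dirichlet boundary on the top edge
    for iteration in range(1, max_iters + 1):
        new = [row[:] for row in u]
        max_delta = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Five-point stencil: average of the four neighbors
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1])
                max_delta = max(max_delta, abs(new[i][j] - u[i][j]))
        u = new
        if max_delta < tol:
            return u, iteration
    return u, max_iters


if __name__ == "__main__":
    grid, iters = solve_laplace_jacobi()
    mid = len(grid) // 2
    print(f"Stopped after {iters} iterations, center value {grid[mid][mid]:.4f}")
```

By symmetry the center of the square should settle at one quarter of the top-edge value, which gives a quick correctness check. Even this small grid needs on the order of a thousand sweeps to converge, which hints at why larger grids become expensive.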
{"title":"A 21×21 Dynamic-Precision Bit-Serial Computing Graph Accelerator for Solving Partial Differential Equations Using Finite Difference Method","doi":"10.1109/ISSCC42613.2021.9366053","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365859
Junyao Tang, Lei Zhao, Cheng Huang
Wireless power transfer (WPT) systems are becoming increasingly popular for sub-100mW biomedical applications [1]–[5]. Because the received power is sensitive to coupling and loading conditions, power/voltage regulation is essential to achieve stable and accurate power delivery, fast transient response, and high end-to-end (E2E) efficiency, which includes all the power losses in the transmitter (TX), the wireless power link, and the receiver (RX). Many existing WPT designs operated in open loop [3]–[5], or achieved voltage regulation only in the RX [6], with the TX remaining unregulated and designed to operate at full capacity, which degraded the E2E efficiency at light-load conditions. Because low-power or standby mode typically accounts for the majority of the operation time, light-load efficiency is always an important specification of power management circuits, especially to extend the run time of battery-powered devices, e.g., a wearable/portable WPT transmitter supporting bioimplants. [1], [2], [7]–[9] have reported different approaches to achieve TX regulation; however, all required extra discrete components, which increased the form factor and cost. [7], [8] required a wire to close the loop. [1], [2], [9] utilized load-shift-keying (LSK) backscattering for TX regulation, which has proved to be an effective solution. However, [2], [9] relied on many off-chip components, including power inductors, diodes, DACs, FPGAs, etc., due to their analog control methodologies. The linear control also introduced small-signal bandwidth limitations, which required careful design to ensure stability at different loading/coupling conditions with PVT/component variations, and resulted in a significant compromise in dynamic performance. [1] introduced a nonlinear constant-idle-time control to eliminate the bandwidth limitations and most of the off-chip components; however, the light-load efficiency still suffered. In addition, [1] still required an extra sensing coil to extract LSK signals, which increased the TX coil area by 86%.
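The light-load argument can be illustrated with a very simple loss model: E2E efficiency is the product of the TX, link and RX stage efficiencies, and an unregulated TX is represented here by an additional fixed standing loss that does not shrink with the load. This is a hedged sketch of the reasoning only; the stage efficiencies, the 10mW standing loss and the function name are hypothetical values, not figures from the paper.

```python
def e2e_efficiency(p_load_w, eta_tx, eta_link, eta_rx, p_fixed_w=0.0):
    """End-to-end efficiency of a TX -> link -> RX chain.

    p_fixed_w models a load-independent loss (for example a TX that
    keeps running at full capacity). With p_fixed_w = 0 the result is
    simply the product of the three stage efficiencies.
    """
    p_in = p_load_w / (eta_tx * eta_link * eta_rx) + p_fixed_w
    return p_load_w / p_in


if __name__ == "__main__":
    stages = dict(eta_tx=0.9, eta_link=0.5, eta_rx=0.8)  # hypothetical values
    for p_load in (0.050, 0.005):  # 50 mW and 5 mW loads
        eta = e2e_efficiency(p_load, p_fixed_w=0.010, **stages)
        print(f"Load {p_load * 1e3:4.0f} mW -> E2E efficiency {eta * 100:.1f}%")
```

With these example numbers the efficiency drops from about 34% at 50mW to about 21% at 5mW, which shows why a TX that cannot scale its power with the load hurts light-load operation.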
{"title":"A Wireless Power Transfer System with Up-to-20% Light-Load Efficiency Enhancement and Instant Dynamic Response by Fully Integrated Wireless Hysteretic Control for Bioimplants","doi":"10.1109/ISSCC42613.2021.9365859","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366037
Alessandro Novello, Gabriele Atzeni, Giorgio Cristiano, Mathieu Coustans, Taekwang Jang
Over the past years, the constant reduction in the size of consumer electronics has strengthened the demand for fully integrated power management circuits. Buck converters offer high efficiency, but they cannot satisfy the stringent size requirements because they require bulky off-chip inductors [1]. Switched-capacitor (SC) approaches provide fully integrated power management solutions; however, their power density is limited by the on-chip capacitance density [2]. Resonant switched-capacitor (ReSC) converters need 3D die-stacked inductors or PCB-integrated inductors to achieve appropriate power density values, posing challenges for monolithic integration [3]. A fully integrated ReSC has been presented [4], which implements an on-chip resonator, avoiding any external or 3D-stacked passive components. However, the switching losses associated with the four transistors driving the resonator limit the switching frequency to tens of MHz, bounding the power density scaling to 0.097W/mm².
{"title":"17.3 A 1.25GHz Fully Integrated DC-DC Converter Using Electromagnetically Coupled Class-D LC Oscillators","doi":"10.1109/ISSCC42613.2021.9366037","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365803
Jack Choquette, Ming-Ju Edward Lee, R. Krashinsky, V. Balan, Brucek Khailany
The diversity of compute-intensive applications in modern cloud data centers has driven the explosion of GPU-accelerated cloud computing. Such applications include AI deep learning training and inference, data analytics, scientific computing, genomics, edge video analytics and 5G services, graphics rendering, and cloud gaming. The A100 GPU introduces several features targeting these workloads: a $3^{rd}$-generation Tensor Core with support for fine-grained sparsity, new BFloat16 (BF16), TensorFloat-32 (TF32), and FP64 datatypes, scale-out support with multi-instance GPU (MIG) virtualization, and scale-up support with a $3^{rd}$-generation 50Gbps NVLink I/O interface (NVLink3) and NVSwitch inter-GPU communication. As shown in Fig. 3.2.1, A100 contains 108 Streaming Multiprocessors (SMs) and 6912 CUDA cores. The SMs are fed by a 40MB L2 cache and 1.56TB/s of HBM2 memory bandwidth (BW). At 1.41GHz, A100 provides an effective peak of 1248TOPS (8b integers), 624TFLOPS (FP16) and 312TFLOPS (TF32) when including sparsity optimizations. Implemented in a TSMC 7nm N7 process, the A100 die (Fig. 3.2.7) contains 54B transistors and measures 826mm².
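A few of the headline numbers can be cross-checked with back-of-the-envelope arithmetic. The sketch below assumes one fused multiply-add (two floating-point operations) per CUDA core per clock for the FP32 estimate, and assumes that the sparsity optimization doubles effective Tensor Core throughput, so that dense rates are half of the quoted effective peaks. Both assumptions are consistent with NVIDIA's public A100 material but are not stated in the abstract itself, and the variable names are purely illustrative.

```python
SMS = 108
CUDA_CORES = 6912
CLOCK_GHZ = 1.41
TRANSISTORS = 54e9
DIE_AREA_MM2 = 826

# CUDA cores per streaming multiprocessor
cores_per_sm = CUDA_CORES // SMS

# Assumption: 1 FMA = 2 FLOPs per CUDA core per cycle (GFLOPS -> TFLOPS)
fp32_peak_tflops = CUDA_CORES * 2 * CLOCK_GHZ / 1e3

# Assumption: sparsity doubles effective throughput, so dense = quoted / 2
effective_peaks = {"INT8 (TOPS)": 1248, "FP16 (TFLOPS)": 624, "TF32 (TFLOPS)": 312}
dense_peaks = {name: value / 2 for name, value in effective_peaks.items()}

# Average transistor density in millions of transistors per mm^2
density_mtr_per_mm2 = TRANSISTORS / DIE_AREA_MM2 / 1e6

if __name__ == "__main__":
    print(f"CUDA cores per SM: {cores_per_sm}")
    print(f"Estimated FP32 peak: {fp32_peak_tflops:.2f} TFLOPS")
    for name, value in dense_peaks.items():
        print(f"Dense {name}: {value:.0f}")
    print(f"Transistor density: {density_mtr_per_mm2:.1f} MTr/mm^2")
```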
{"title":"3.2 The A100 Datacenter GPU and Ampere Architecture","doi":"10.1109/ISSCC42613.2021.9365803","journal":{"name":"2021 IEEE International Solid-State Circuits Conference (ISSCC)"}}
Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365928
Jun-Seok Park, Jun-Woo Jang, Heonsoo Lee, Dongwook Lee, Sehwan Lee, Hanwoong Jung, Seungwon Lee, S. Kwon, Kyung-Ah Jeong, Joonho Song, Sukhwan Lim, Inyup Kang
On-device machine learning is critical for mobile products as it enables real-time applications (e.g., AI-powered camera applications), which need to be responsive, always available (i.e., not requiring network connectivity), and privacy-preserving. The platforms used in such situations have limited computing resources, power, and memory bandwidth. Enabling such on-device machine learning has triggered wide development of efficient neural-network accelerators that promise high energy and area efficiency compared to general-purpose processors, such as CPUs. The need to support a comprehensive range of neural networks has been important as well because the field of deep learning is evolving rapidly, as depicted in Fig. 9.5.1. Recent work on neural-network accelerators has focused on improving energy efficiency while obtaining high performance in order to meet the needs of real-time applications. For example, weight zero-skipping and pruning have been deployed in recent accelerators [2]–[7]. SIMD or systolic array-based accelerators [2]–[4], [6] provide flexibility to support various types of compute across a wide range of Deep Neural Network (DNN) models.
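To make the zero-skipping idea mentioned above concrete, the following is a minimal software sketch, assuming a simple dot product in which a multiply-accumulate (MAC) is skipped whenever either operand is zero. The function name `zero_skipping_dot` and the sample data are hypothetical, and this is not the hardware scheme of the accelerators cited in the abstract; it only illustrates why sparsity reduces the number of MACs that must be executed.

```python
def zero_skipping_dot(weights, activations):
    """Dot product that skips MACs with a zero operand.

    Returns (result, macs_executed). Illustrative sketch only.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have equal length")
    acc = 0
    macs_executed = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue  # the product is zero, so the MAC can be skipped
        acc += w * a
        macs_executed += 1
    return acc, macs_executed


# Hypothetical pruned weights and a sparse feature map.
weights = [0, 3, 0, -2, 0, 0, 1, 0]
activations = [5, 0, 7, 4, 0, 2, 6, 1]

result, macs = zero_skipping_dot(weights, activations)
dense = sum(w * a for w, a in zip(weights, activations))
print(f"result={result}, dense={dense}, MACs executed={macs} of {len(weights)}")
```

The skipped result matches the dense computation while executing only the MACs whose operands are both non-zero, which is the source of the energy savings such techniques aim for.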
{"title":"9.5 A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC","authors":"Jun-Seok Park, Jun-Woo Jang, Heonsoo Lee, Dongwook Lee, Sehwan Lee, Hanwoong Jung, Seungwon Lee, S. Kwon, Kyung-Ah Jeong, Joonho Song, Sukhwan Lim, Inyup Kang","doi":"10.1109/ISSCC42613.2021.9365928","DOIUrl":"https://doi.org/10.1109/ISSCC42613.2021.9365928","url":null,"abstract":"On-device machine learning is critical for mobile products as it enables real-time applications (e.g., AI-powered camera applications), which need to be responsive, always available (i.e., not requiring network connectivity), and privacy-preserving. The platforms used in such situations have limited computing resources, power, and memory bandwidth. Enabling such on-device machine learning has triggered wide development of efficient neural-network accelerators that promise high energy and area efficiency compared to general-purpose processors, such as CPUs. The need to support a comprehensive range of neural networks has been important as well because the field of deep learning is evolving rapidly, as depicted in Fig. 9.5.1. Recent work on neural-network accelerators has focused on improving energy efficiency while obtaining high performance in order to meet the needs of real-time applications. For example, weight zero-skipping and pruning have been deployed in recent accelerators [2]–[7]. SIMD or systolic array-based accelerators [2]–[4], [6] provide flexibility to support various types of compute across a wide range of Deep Neural Network (DNN) models.","PeriodicalId":371093,"journal":{"name":"2021 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133995096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}