2021 IEEE International Solid-State Circuits Conference (ISSCC): Latest Papers, Page 2

17.8 A 90.5%-Efficiency 28.7µVRMS-Noise Bipolar-Output High-Step-Up SC DC-DC Converter with Energy-Recycled Regulation and Post-Filtering for ±15V TFT-Based LAE Sensors

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365935

Min-Woo Ko, Hyunki Han, Hyunsik Kim

The applications of large-area electronics (LAEs) based on thin-film transistors (TFTs) are rapidly expanding from displays to sensors. TFT gate drivers require high-voltage bipolar supply rails (approximately ±15V); so far, these have typically been generated from a battery $(V_{BAT})$ by switched-capacitor converters (SCCs) [1]. Since a high SNR is crucial for TFT-based sensors such as an under-display fingerprint sensor [2], the noise and ripple of the SCC output, which tend to couple into the readout AFE, should be minimized. A straightforward method is to place an LDO as a post-regulator in series with the SCC. However, a relatively large dropout voltage $(V_{DO})$ across the LDO significantly degrades the efficiency [3]. Conversely, a small $V_{DO}$ reduces the LDO loop gain because the pass transistor operates in the triode region, which degrades the supply-ripple rejection (PSR). From the perspective of the SCC, its fixed voltage conversion ratio (VCR) means that $V_{DO}$ cannot be finely regulated over a wide variation of $V_{BAT}$. Fine regulation increases either the complexity (cost) overhead or the power loss of the SC circuit. In this work, an energy-recycled optimal $V_{DO}$ control (EROC) technique in the SC bipolar step-up stage is proposed for higher efficiency. In addition, a load-current-reused (LCR) post-regulator is presented to achieve high PSR while minimizing the power loss at the pass transistor.
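The efficiency penalty of a fixed conversion ratio can be illustrated numerically. The following is a minimal sketch, not the paper's method: it assumes an ideal LDO whose efficiency is $V_{OUT}/(V_{OUT}+V_{DO})$ with quiescent current ignored, and an ideal lossless SCC. The conversion ratio of 5 and the battery voltages are hypothetical values chosen only for illustration.

```python
def ldo_efficiency(v_out: float, v_do: float) -> float:
    """Ideal LDO efficiency: input and output carry the same current,
    so efficiency is the ratio of output voltage to input voltage."""
    return v_out / (v_out + v_do)


def dropout_for_fixed_vcr(v_bat: float, vcr: int, v_out: float) -> float:
    """Dropout the LDO must absorb when an ideal SCC multiplies v_bat by vcr."""
    return vcr * v_bat - v_out


V_OUT = 15.0  # positive rail target taken from the abstract
VCR = 5       # hypothetical fixed conversion ratio (assumption)

for v_bat in (3.2, 3.6, 4.2):  # hypothetical battery voltages (assumption)
    v_do = dropout_for_fixed_vcr(v_bat, VCR, V_OUT)
    eff = ldo_efficiency(V_OUT, v_do)
    print(f"VBAT={v_bat:.1f} V  VDO={v_do:.1f} V  LDO efficiency={eff:.1%}")
```

Under these assumptions the dropout grows with the battery voltage, so the post-regulator efficiency falls as the battery charges, which is the motivation for regulating $V_{DO}$ in the SC stage.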

Citations: 3

15.3 A 65nm 3T Dynamic Analog RAM-Based Computing-in-Memory Macro and CNN Accelerator with Retention Enhancement, Adaptive Analog Sparsity and 44TOPS/W System Energy Efficiency

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366045

Zhengyu Chen, X. Chen, Jie Gu

Computing-in-memory (CIM) techniques, which incorporate analog computing inside memory macros, have shown significant advantages in computing efficiency for deep learning applications. While earlier CIM macros were limited by lower bit precision, e.g., binary weights in [1], recent works have shown 4-to-8b precision for the weights/inputs and up to 20b for the output values [2], [3]. Sparsity and application features have also been exploited at the system level to further improve the computation efficiency [4], [5]. To enable higher precision, bit-wise operations were commonly utilized [3], [4]. However, existing solutions that use bit-wise operations with SRAM cells have limitations. Fig. 15.3.1 summarizes the challenges and solutions in this work. First, all existing solutions utilize 6T/8T/10T SRAM as a CIM cell, which fundamentally limits the size of the CIM array. In this work, we replace the commonly used SRAM cell with a 3-transistor (3T) analog memory cell, referred to as dynamic-analog-RAM (DARAM), which represents a 4b weight value as an analog voltage. This leads to a $\sim 10\times$ reduction in transistor count and achieves an effective CIM single-bit area smaller than the foundry-supplied 6T SRAM cell. Second, as no bit-wise calculation is needed in this work, only single-phase MAC operations are performed, removing the throughput degradation associated with previous multi-phase approaches and digital accumulation in [3], [4]. Furthermore, analog linearity issues are mitigated by highly linear time-based activation, removal of matching requirements for critical multi-bit caps [4], [6], and a special read-current compensation technique. Third, to mitigate the power bottleneck of the ADC or SA, this work applies analog sparsity-based low-power methods, which include a compute-adaptive ADC skipping operation when the analog MAC value is small (or "sparse") and a special weight-shifting technique, leading to an additional $\sim 2\times$ reduction in CIM-macro power. We demonstrate the proposed techniques using a 65nm CIM-based CNN accelerator showing state-of-the-art energy efficiency.

Citations: 37

24.3 A 3nm Gate-All-Around SRAM Featuring an Adaptive Dual-BL and an Adaptive Cell-Power Assist Circuit

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365988

T. Song, W. Rim, Hoonki Kim, K. Cho, Taeyeong Kim, Taejung Lee, Geumjong Bae, Dong-Won Kim, S. Kwon, S. Baek, Jonghoon Jung, J. Kye, Hakchul Jung, Hyungtae Kim, Soon-Moon Jung, Jaehong Park

Advanced technologies help to improve SRAM performance via recent transistor breakthroughs [1], which alleviate device performance impediments and allow SRAM designers to focus on handling metal resistance. Since SRAM margins are more vulnerable to the increasing metal resistance that comes with smaller critical dimensions, SRAM-assist circuits have been proposed to overcome the impact of metal resistance in recent technologies [2]–[5]. One of the challenges is a design limitation such as the quantized transistor, which requires SRAM-assist to optimize SRAM margins. In this paper, gate-all-around (GAA) SRAM design techniques are proposed, which allow SRAM margins to be improved more freely, in addition to power, performance, and area (PPA). Moreover, SRAM-assist schemes are proposed to overcome metal resistance, which maximizes the benefit of GAA devices.

Citations: 7

14.6 A 76-to-81GHz 2×8 FMCW MIMO Radar Transceiver with Fast Chirp Generation and Multi-Feed Antenna-in-Package Array

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365933

Zongming Duan, Bowen Wu, Chuanming Zhu, Yan Wang, Weiwei Jin, Y. Liu, Yanhui Wu, Tao Zhang, Ming Liu, B. Dou, Bingbing Liao, Wei Lv, Dongfang Pan, Yongjie Li, Changwei Wang, Yuefei Dai, Pei Li, Hao Gao

Millimeter-wave (mm-wave) radar is an essential sensor for advanced driver assistance systems and autonomous driving. Its detection requirement extends from the traditional long-to-medium range to the emerging short and ultra-short range for surround sensing, which spans from below 1 meter to 40 meters and requires a compact, low-cost solution with fast chirp generation to improve the range resolution [1]–[5]. In [2], 3 transmitters (TX) and 4 receivers (RX) are used to create a multi-mode radar transceiver; however, its on-board antenna integration limits its size and cost for short-distance applications. In [3], a 12TX and 16RX phase-domain multiple-input multiple-output (MIMO) radar is presented. However, an external PLL is required, and the digital processing power consumption is high. Furthermore, its size and cost are not suitable for short and ultra-short range applications. In [4], a compact 1TX and 1RX frequency-modulated continuous-wave (FMCW) radar is combined with an antenna-in-package (AiP) for short-distance applications; however, its detection range is limited to only 20 meters. This work presents a compact 76-to-81GHz FMCW MIMO radar with fast chirp generation, integrated with an AiP array in embedded glass fan-out (eGFO) technology for short and ultra-short range applications. A fast-modulated chirp signal is especially crucial in MIMO radar for improving range resolution with update rate, and it also moves the IF frequency away from the 1/f noise corner. Thanks to the dynamic bias technique, the chirp rate is improved to 312.5MHz/μs and the maximum modulation bandwidth is 7.2GHz. In addition, the 2TX and 6RX array antennas are integrated in the package, and the effective isotropic radiated power (EIRP) of the TX antenna is improved by 7.5dB by a coaxial feeding structure in eGFO technology with a multi-feed antenna technique.
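The reported chirp rate and bandwidth can be related to range resolution and IF frequency with the standard FMCW relations $\Delta R = c/(2B)$ and $f_{IF} = 2SR/c$. The sketch below is a worked example using only the two figures quoted in the abstract; treating the full 7.2GHz as swept in a single chirp at the maximum rate is an assumption made for illustration, and the target ranges are the short-range limits mentioned in the text.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution(bandwidth_hz: float) -> float:
    """Standard FMCW range resolution for a swept bandwidth B."""
    return C / (2.0 * bandwidth_hz)


def chirp_duration(bandwidth_hz: float, slope_hz_per_s: float) -> float:
    """Time needed to sweep the bandwidth at a constant chirp slope."""
    return bandwidth_hz / slope_hz_per_s


def beat_frequency(slope_hz_per_s: float, target_range_m: float) -> float:
    """IF (beat) frequency of a static target: slope times round-trip delay."""
    return slope_hz_per_s * 2.0 * target_range_m / C


BANDWIDTH_HZ = 7.2e9             # maximum modulation bandwidth from the abstract
SLOPE_HZ_PER_S = 312.5e6 / 1e-6  # 312.5 MHz/us chirp rate from the abstract

print(f"range resolution: {range_resolution(BANDWIDTH_HZ) * 100:.2f} cm")
print(f"chirp duration:   {chirp_duration(BANDWIDTH_HZ, SLOPE_HZ_PER_S) * 1e6:.2f} us")
for r in (1.0, 40.0):
    f_if = beat_frequency(SLOPE_HZ_PER_S, r)
    print(f"IF at {r:>4.0f} m:     {f_if / 1e6:.2f} MHz")
```

A steeper slope raises the beat frequency of even a nearby target into the MHz range, which is consistent with the abstract's remark that fast chirps move the IF away from the 1/f noise corner.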

Citations: 14

A 480Gb/s/mm 1.7pJ/b Short-Reach Wireline Transceiver Using Single-Ended NRZ for Die-to-Die Applications

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366048

Kelvin McCollough, S. Huss, J. Vandersand, Randall Smith, C. Moscone, Qazi Omar Farooq

With recent AI and big data developments, quickly moving massive amounts of data is paramount to future technologies. Scalable solutions that can sustain higher performance systems that consume ever-increasing amounts of data are necessary. Low power, low latency, and aggregation are essential in highly scalable data-distribution solutions [2].

Citations: 6

A 21×21 Dynamic-Precision Bit-Serial Computing Graph Accelerator for Solving Partial Differential Equations Using Finite Difference Method

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366053

Junjie Mu, Bongjin Kim

Partial differential equations (PDEs) are ubiquitous in physics and engineering and are used for understanding various physical phenomena, including heat, diffusion, fluid dynamics and electrodynamics, and quantum mechanics. Analytical PDE solutions are rare, and hence, we approximate using numerical methods. The finite difference method (FDM) approximates PDEs by computing finite differences between discretized solutions. Since finite differences approximate the derivatives of PDEs, many iterations of high-precision computations are required to achieve higher accuracy in their numerical solutions. Hence, computationally expensive FDM necessitates the use of high-performance computers. As such, their energy consumption is excessive (e.g., 15mJ per iteration and $>320\mathrm{J}$ in total for solving a PDE with a $128 \times 128$ grid using a GPU [1]). Consequently, there is an ever-increasing need for a dedicated hardware accelerator for solving PDEs.
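The iterative character of FDM can be seen in a small software example. The sketch below solves Laplace's equation on a square grid with plain floating-point Jacobi iteration and a five-point stencil. It is an illustration of the general method only and does not model the accelerator's dynamic-precision bit-serial datapath; the grid size, boundary condition, and tolerance are assumptions chosen for the example.

```python
def jacobi_laplace(n: int, top: float = 1.0, tol: float = 1e-6,
                   max_iters: int = 10_000):
    """Solve Laplace's equation on an n x n grid with Jacobi iteration.

    Dirichlet boundary: the top edge is held at `top`, the other edges at 0.
    Returns the grid and the number of iterations performed.
    """
    u = [[0.0] * n for _ in range(n)]
    u[0] = [top] * n
    for iteration in range(1, max_iters + 1):
        new = [row[:] for row in u]
        max_delta = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Five-point stencil: each interior point becomes the
                # mean of its four neighbours from the previous sweep.
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1])
                max_delta = max(max_delta, abs(new[i][j] - u[i][j]))
        u = new
        if max_delta < tol:
            return u, iteration
    return u, max_iters


grid, iterations = jacobi_laplace(21)
print(f"converged after {iterations} iterations")
print(f"centre value: {grid[10][10]:.4f}")
```

By symmetry the centre of a square with one edge at 1 and three edges at 0 settles near 0.25, which gives a quick sanity check. The number of sweeps needed even for this small grid hints at why larger grids are costly on general-purpose hardware.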

Citations: 4

A Wireless Power Transfer System with Up-to-20% Light-Load Efficiency Enhancement and Instant Dynamic Response by Fully Integrated Wireless Hysteretic Control for Bioimplants

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365859

Junyao Tang, Lei Zhao, Cheng Huang

Wireless power transfer (WPT) systems are becoming increasingly popular for sub-100mW biomedical applications [1]–[5]. Because the received power is sensitive to coupling and loading conditions, power/voltage regulation is essential to achieve stable and accurate power delivery, fast transient response, and high end-to-end (E2E) efficiency, which includes all the power losses in the transmitter (TX), the wireless power link, and the receiver (RX). Many existing WPT designs operated in open loop [3]–[5], or achieved voltage regulation only in the RX [6], with the TX remaining unregulated and designed to operate at full capacity, which degraded the E2E efficiency at light-load conditions. Because low-power or standby mode typically accounts for the majority of the operation time, light-load efficiency is always an important specification of power management circuits, especially to extend the run time of battery-powered devices, e.g., a wearable/portable WPT transmitter supporting bioimplants. [1], [2], [7]–[9] have reported different approaches to achieve TX regulation; however, all required extra discrete components, which increased the form factor and cost. [7], [8] required a wire to close the loop. [1], [2], [9] utilized load-shift-keying (LSK) backscattering for TX regulation, which proved to be an effective solution. However, [2], [9] relied on many off-chip components, including power inductors, diodes, DACs, FPGAs, etc., due to their analog control methodologies. The linear control also introduced small-signal bandwidth limitations, which required careful design to ensure stability at different loading/coupling conditions with PVT/component variations, and resulted in a significant compromise in dynamic performance. [1] introduced a nonlinear constant-idle-time control to eliminate the bandwidth limitations and most of the off-chip components; however, the light-load efficiency still suffered. In addition, [1] still required an extra sensing coil to extract LSK signals, which increased the TX coil area by 86%.

Citations: 8

17.3 A 1.25GHz Fully Integrated DC-DC Converter Using Electromagnetically Coupled Class-D LC Oscillators

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9366037

Alessandro Novello, Gabriele Atzeni, Giorgio Cristiano, Mathieu Coustans, Taekwang Jang

Over the past years, the constant reduction in the size of consumer electronics has strengthened the demand for fully integrated power management circuits. Buck converters offer high efficiency, but they cannot satisfy the stringent size requirements because bulky off-chip inductors are required [1]. Switched-capacitor (SC) approaches provide fully integrated power management solutions; however, their power density is limited by the on-chip capacitance density [2]. Resonant switched-capacitor (ReSC) converters need 3D die-stacked inductors or PCB-integrated inductors to achieve appropriate power density values, posing challenges for monolithic integration [3]. A fully integrated ReSC has been presented [4], which implements an on-chip resonator, avoiding any external or 3D-stacked passive components. However, the switching losses associated with the four transistors driving the resonator limit the switching frequency to tens of MHz, bounding the power density scaling to 0.097W/mm².

Citations: 0

3.2 The A100 Datacenter GPU and Ampere Architecture

2021 IEEE International Solid-State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365803

Jack Choquette, Ming-Ju Edward Lee, R. Krashinsky, V. Balan, Brucek Khailany

The diversity of compute-intensive applications in modern cloud data centers has driven the explosion of GPU-accelerated cloud computing. Such applications include AI deep learning training and inference, data analytics, scientific computing, genomics, edge video analytics and 5G services, graphics rendering, and cloud gaming. The A100 GPU introduces several features targeting these workloads: a $3^{rd}$-generation Tensor Core with support for fine-grained sparsity, new BFloat16 (BF16), TensorFloat-32 (TF32), and FP64 datatypes, scale-out support with multi-instance GPU (MIG) virtualization, and scale-up support with a $3^{rd}$-generation 50Gbps NVLink I/O interface (NVLink3) and NVSwitch inter-GPU communication. As shown in Fig. 3.2.1, A100 contains 108 Streaming Multiprocessors (SMs) and 6912 CUDA cores. The SMs are fed by a 40MB L2 cache and 1.56TB/s of HBM2 memory bandwidth (BW). At 1.41GHz, A100 provides an effective peak 1248TOPS (8b integers), 624TFLOPS (FP16) and 312TFLOPS (TF32) when including sparsity optimizations. Implemented in a TSMC 7nm N7 process, the A100 die (Fig. 3.2.7) contains 54B transistors and measures 826mm².
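The headline figures can be cross-checked against each other with simple arithmetic. The sketch below uses only numbers quoted in the abstract (SM count, CUDA core count, clock, and effective peak rates with sparsity) and derives per-SM quantities from them. The derived values are implied by the quoted figures and are not stated in the abstract itself; the comparison against powers of two is an assumption about how the peak rates were rounded.

```python
SM_COUNT = 108
CUDA_CORES = 6912
CLOCK_HZ = 1.41e9

# Effective peak rates with sparsity, in operations per second (from the abstract).
PEAK_WITH_SPARSITY = {
    "INT8": 1248e12,
    "FP16": 624e12,
    "TF32": 312e12,
}

# CUDA cores divide evenly across the SMs.
cores_per_sm = CUDA_CORES // SM_COUNT

# Operations each SM must complete per clock cycle to reach the quoted peak.
ops_per_sm_per_clock = {
    dtype: rate / (SM_COUNT * CLOCK_HZ)
    for dtype, rate in PEAK_WITH_SPARSITY.items()
}

print(f"CUDA cores per SM: {cores_per_sm}")
for dtype, ops in ops_per_sm_per_clock.items():
    print(f"{dtype}: {ops:.0f} ops per SM per clock")
```

The per-clock values land within a fraction of a percent of 8192, 4096, and 2048, with the small residual attributable to the rounded clock frequency, and each step down in throughput matches the doubling in operand width.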

Citations: 19

9.5 A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC

2021 IEEE International Solid- State Circuits Conference (ISSCC)

Pub Date : 2021-02-13 DOI: 10.1109/ISSCC42613.2021.9365928

Jun-Seok Park, Jun-Woo Jang, Heonsoo Lee, Dongwook Lee, Sehwan Lee, Hanwoong Jung, Seungwon Lee, S. Kwon, Kyung-Ah Jeong, Joonho Song, Sukhwan Lim, Inyup Kang

On-device machine learning is critical for mobile products as it enables real-time applications (e.g., AI-powered camera applications), which need to be responsive, always available (i.e., they do not require network connectivity), and privacy-preserving. The platforms used in such situations have limited computing resources, power, and memory bandwidth. Enabling such on-device machine learning has triggered wide development of efficient neural-network accelerators that promise high energy and area efficiency compared to general-purpose processors, such as CPUs. The need to support a comprehensive range of neural networks has been important as well, because the field of deep learning is evolving rapidly, as depicted in Fig. 9.5.1. Recent work on neural-network accelerators has focused on improving energy efficiency while obtaining high performance in order to meet the needs of real-time applications. For example, weight zero-skipping and pruning have been deployed in recent accelerators [2]–[7]. SIMD or systolic-array-based accelerators [2]–[4], [6] provide flexibility to support various types of compute across a wide range of Deep Neural Network (DNN) models.
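The weight zero-skipping and pruning mentioned above can be illustrated with a minimal Python sketch. This shows only the general idea, not the NPU described in this paper or any of the cited accelerators. The function names, the magnitude threshold, and the sample values are all assumptions made for illustration.

```python
def prune_by_magnitude(weights, threshold):
    """Set weights whose magnitude is below the threshold to zero."""
    return [0 if abs(w) < threshold else w for w in weights]


def zero_skipping_dot(weights, activations):
    """Dot product that skips zero weights.

    Returns (result, number of multiply-accumulates actually performed).
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have equal length")
    acc = 0
    macs = 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue  # a zero weight contributes nothing, so skip the MAC
        acc += w * a
        macs += 1
    return acc, macs


if __name__ == "__main__":
    weights = [3, -1, 0, 2, 1, 0, -4, 1]      # hypothetical sample values
    activations = [1, 2, 3, 4, 5, 6, 7, 8]    # hypothetical sample values

    result, macs = zero_skipping_dot(weights, activations)
    print(f"unpruned: result={result}, MACs={macs} of {len(weights)}")

    pruned = prune_by_magnitude(weights, threshold=2)
    result, macs = zero_skipping_dot(pruned, activations)
    print(f"pruned:   result={result}, MACs={macs} of {len(weights)}")
```

Pruning increases the number of zero weights, so more multiply-accumulates are skipped, but the result changes. In practice that accuracy loss has to be weighed against the saved work.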

Citations: 34