
Latest publications: 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)

AI Processor based Data Correction for Enhancing Accuracy of Ultrasonic Sensor
Jin Young Shin, Sang Ho Lee, Kwang Hyun Go, Soo-Gon Kim, Seung Eun Lee
The use of various sensors in vehicles has increased with the widespread adoption of advanced driver assistance systems (ADAS). To ensure the safety of drivers and pedestrians, the accuracy of measured sensor data is essential. In this paper, we propose a data correction system that enhances the accuracy of distance data from an ultrasonic sensor by utilizing an AI processor. The proposed system detects the motion of an object and adjusts the obtained distance data to align with an ideal gradient of the sequential data. Experimental results show that the proposed system achieves an error detection rate of 90.6%.
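As a rough illustration of the idea described above — detecting motion and pulling outlier readings back onto the ideal gradient of the sequence — here is a minimal sketch. The function name, the `max_step` threshold, and the linear-extrapolation rule are all illustrative assumptions, not the paper's actual correction algorithm:

```python
def correct_distances(samples, max_step):
    """Replace any reading whose jump from the previous accepted value
    exceeds max_step with a linear extrapolation of the recent gradient
    (an assumed stand-in for the paper's 'ideal gradient' alignment)."""
    corrected = [samples[0]]
    for x in samples[1:]:
        prev = corrected[-1]
        if abs(x - prev) > max_step:
            # Outlier: continue the trend of the accepted samples instead.
            grad = corrected[-1] - corrected[-2] if len(corrected) > 1 else 0
            corrected.append(prev + grad)
        else:
            corrected.append(x)
    return corrected
```

For a target approaching at a steady rate, a single corrupted echo is replaced by the trend value: `correct_distances([100, 98, 96, 250, 92, 90], 5)` yields `[100, 98, 96, 94, 92, 90]`.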
DOI: 10.1109/AICAS57966.2023.10168652 · Published 2023-06-11
Citations: 0
Memory-Immersed Collaborative Digitization for Area-Efficient Compute-in-Memory Deep Learning
Shamma Nasrin, Maeesha Binte Hashem, Nastaran Darabi, Benjamin Parpillon, F. Fahim, Wilfred Gomes, A. Trivedi
This work discusses memory-immersed collaborative digitization among compute-in-memory (CiM) arrays to minimize the area overheads of a conventional analog-to-digital converter (ADC) for deep learning inference. Thereby, using the proposed scheme, significantly more CiM arrays can be accommodated within limited footprint designs to improve parallelism and minimize external memory accesses. Under the digitization scheme, CiM arrays exploit their parasitic bit lines to form a within-memory capacitive digital-to-analog converter (DAC) that facilitates area-efficient successive approximation (SA) digitization. CiM arrays collaborate where a proximal array digitizes the analog-domain product-sums when an array computes the scalar product of input and weights. We discuss various networking configurations among CiM arrays where Flash, SA, and their hybrid digitization steps can be efficiently implemented using the proposed memory-immersed scheme. The results are demonstrated using a 65 nm CMOS test chip. Compared to a 40 nm-node 5-bit SAR ADC, our 65 nm design requires ∼25× less area and ∼1.4× less energy by leveraging in-memory computing structures. Compared to a 40 nm-node 5-bit Flash ADC, our design requires ∼51× less area and ∼13× less energy.
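The successive-approximation (SA) step mentioned above resolves one bit per comparison, MSB first. A behavioral sketch of that control loop follows (plain SA logic only — the within-memory capacitive DAC and array collaboration are not modeled, and the names are assumptions):

```python
def sar_digitize(v, n_bits, v_ref):
    """Successive approximation: try each bit MSB-first and keep it if the
    binary-weighted DAC level stays at or below the input."""
    lsb = v_ref / (1 << n_bits)
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)
        if v >= trial * lsb:  # comparator decision
            code = trial
    return code
```

For example, `sar_digitize(0.6, 5, 1.0)` returns 19, i.e. the input divided by one LSB (1/32) and truncated.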
DOI: 10.1109/AICAS57966.2023.10168632 · Published 2023-06-11
Citations: 3
AI-assisted ISP hyperparameter auto tuning
Fa Xu, Zihao Liu, YanHeng Lu, Sicheng Li, Susong Xu, Yibo Fan, Yen-Kuang Chen
Images and videos are vital visual information carriers, and the image signal processor (ISP) is an essential hardware component for capturing and processing these visual signals. ISPs convert raw data into high-quality color images, which requires various function modules to control different aspects of image quality. However, the results of these modules are interdependent and have crosstalk with each other, making it tedious and time-consuming for manual tuning to obtain a set of ideal parameter configurations to achieve stable performance. In this paper, we introduce xkISP, a self-developed open-source ISP project which includes both a C model and hardware implementation of an 8-stage ISP pipeline. Most importantly, we present a novel proxy function-based AI-assisted ISP tuning solution that is demonstrated to accelerate the ISP parameter configuration process and improve performance for both human vision and computer vision tasks.
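At its core, proxy-based tuning replaces slow hardware-in-the-loop evaluation with a cheap stand-in quality model that a search loop can call many times. A minimal sketch with random search standing in for the paper's optimizer (all names and the search strategy are assumptions, not the xkISP implementation):

```python
import random

def tune_isp(proxy_quality, param_ranges, iters=200, seed=0):
    """Random-search tuning loop: sample hyperparameter sets and keep the
    one the proxy quality model scores highest."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(iters):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in param_ranges.items()}
        score = proxy_quality(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With a toy proxy whose optimum sits at `denoise=0.3, sharpen=0.7`, the loop converges toward that configuration without ever touching the real ISP pipeline.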
DOI: 10.1109/AICAS57966.2023.10168574 · Published 2023-06-11
Citations: 0
HNSG – An SNN Training Method Utilizing Hidden Network
Chunhui Wu, Wenbing Fang, Yi Kang
Spiking Neural Networks (SNNs) are more energy efficient than traditional ANNs, and many SNN training methods have been proposed over the past decades. However, traditional backpropagation-based training methods are difficult to deploy on SNNs because their gradients are discontinuous. Previous works mainly focused on weight training or weight transfer. The Hidden Network approach, inspired by the Lottery Ticket Hypothesis and originally proposed for convolutional neural networks, opens the possibility of training network connections in SNNs. In this article, a Hidden Network-based training algorithm is applied to SNNs to show its potential on neuromorphic spiking networks. A novel training method called HNSG is proposed that modifies the hidden-network search using surrogate-gradient-based backpropagation. The proposed HNSG method is tested on MNIST image classification with a simple two-fully-connected-layer SNN model. Simulation shows HNSG reaches 93.73% accuracy at an average fire intensity of 0.138 with LIF neurons.
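The surrogate-gradient trick named above keeps the hard threshold in the forward pass but substitutes a smooth derivative in the backward pass. A minimal sketch of such a pair (the fast-sigmoid surrogate and its slope constant are common choices assumed here, not taken from the paper):

```python
def spike_forward(v, threshold=1.0):
    """Non-differentiable spiking nonlinearity: fire iff the membrane
    potential reaches the threshold."""
    return 1.0 if v >= threshold else 0.0

def spike_surrogate_grad(v, threshold=1.0, slope=5.0):
    """Surrogate gradient for the backward pass: the derivative of a fast
    sigmoid centered on the threshold stands in for the true
    (zero-almost-everywhere) derivative of the step function."""
    x = slope * (v - threshold)
    return slope / (1.0 + abs(x)) ** 2
```

The surrogate peaks at the firing threshold (value `slope`) and decays symmetrically on both sides, giving a non-zero learning signal near threshold crossings.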
DOI: 10.1109/AICAS57966.2023.10168579 · Published 2023-06-11
Citations: 0
A Systolic Array with Activation Stationary Dataflow for Deep Fully-Connected Networks
Haochuan Wan, Chaolin Rao, Yueyang Zheng, Pingqiang Zhou, Xin Lou
This paper presents an activation stationary (AS) dataflow suitable for networks with pure fully-connected (FC) layers. The proposed AS dataflow reduces the memory size required in hardware designs and improves energy efficiency by reducing data movement. Based on the AS dataflow, an output stationary (OS) systolic array is proposed to compute FC networks. To evaluate the proposed design, we implement an accelerator for the FC-based implicit representation for MRI (IREM) algorithm and develop a proof-of-concept demonstration system on a field-programmable gate array (FPGA). We also map the IREM accelerator to 40 nm CMOS technology and compare it with CPU-, GPU-based, and ASIC implementations.
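The benefit of an activation-stationary dataflow for FC layers is that each input activation is fetched once and reused across every output row. A cycle-free behavioral sketch of that reuse pattern (a functional model of the dataflow, not the systolic timing):

```python
def fc_layer_activation_stationary(weights, activations):
    """Model of an AS dataflow: each PE pins one input activation while
    weight rows stream past it, so every activation is read exactly once
    and partial sums accumulate in the output registers."""
    outputs = [0.0] * len(weights)
    for col, a in enumerate(activations):      # PE `col` holds activation `a`
        for row, w_row in enumerate(weights):  # weights stream through
            outputs[row] += w_row[col] * a
    return outputs
```

The result matches an ordinary matrix-vector product; only the loop order (and hence the memory traffic) differs from a weight-stationary formulation.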
DOI: 10.1109/AICAS57966.2023.10168602 · Published 2023-06-11
Citations: 0
Searching Tiny Neural Networks for Deployment on Embedded FPGA
Haiyan Qin, Yejun Zeng, Jinyu Bai, Wang Kang
Embedded FPGAs have become increasingly popular as acceleration platforms for the deployment of edge-side artificial intelligence (AI) applications, due in part to their flexible and configurable heterogeneous architectures. However, the complex deployment process hinders the realization of AI democratization, particularly at the edge. In this paper, we propose a software-hardware co-design framework that enables simultaneous searching for neural network architectures and corresponding accelerator designs on embedded FPGAs. The proposed framework comprises a hardware-friendly neural architecture search space, a reconfigurable streaming-based accelerator architecture, and a model performance estimator. An evolutionary algorithm targeting multi-objective optimization is employed to identify the optimal neural architecture and corresponding accelerator design. We evaluate our framework on various datasets and demonstrate that, in a typical edge AI scenario, the searched network and accelerator can achieve up to a 2.9% accuracy improvement and up to a 21× speedup compared to manually designed networks based on common accelerator designs when deployed on a widely used embedded FPGA (Xilinx XC7Z020).
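The multi-objective evolutionary search described above can be sketched as mutate-then-filter on a Pareto front over (accuracy, latency). A toy version with integer "architectures" (the representation, mutation, and selection details are assumptions for illustration; the framework's actual search space is far richer):

```python
import random

def dominates(a, b):
    """Pareto dominance on (accuracy, latency): higher accuracy and lower
    latency wins; equal scores never dominate each other."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def evolve(init_pop, evaluate, mutate, generations=10, seed=0):
    """Toy multi-objective evolutionary loop: mutate survivors, then keep
    only the non-dominated (Pareto-optimal) candidates each generation."""
    rng = random.Random(seed)
    pop = list(init_pop)
    for _ in range(generations):
        pop += [mutate(rng.choice(pop), rng) for _ in range(len(pop))]
        scores = {c: evaluate(c) for c in pop}
        pop = [c for c in pop
               if not any(dominates(scores[o], scores[c]) for o in pop)]
    return pop
```

For a toy evaluator whose accuracy peaks at width 6 while latency grows with width, the surviving candidates are mutually non-dominated trade-off points.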
DOI: 10.1109/AICAS57966.2023.10168571 · Published 2023-06-11
Citations: 0
Three Challenges in ReRAM-Based Process-In-Memory for Neural Network
Ziyi Yang, Kehan Liu, Yiru Duan, Mingjia Fan, Qiyue Zhang, Zhou Jin
Artificial intelligence (AI) has been successfully applied in many fields of natural science. One of the biggest challenges in AI acceleration is the performance and energy bottleneck caused by the limited capacity and bandwidth available for massive data movement between memory and processing units. In the past decade, much AI accelerator work based on process-in-memory (PIM) has been studied, especially on emerging non-volatile resistive random access memory (ReRAM). In this paper, we provide a comprehensive perspective on ReRAM-based AI accelerators, including software-hardware co-design, the status of chip fabrication, research on ReRAM non-idealities, and support for the EDA tool chain. Finally, we summarize three directions for future work: supporting complex model patterns; addressing the impact of non-idealities such as limited endurance, process perturbations, and leakage current; and addressing the lack of EDA tools.
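To make the ReRAM non-idealities discussed above concrete: an ideal crossbar computes a matrix-vector product through Ohm's and Kirchhoff's laws, and effects such as leakage show up as perturbations of that product. A minimal sketch (the column-major layout and uniform `g_leak` term are simplifying assumptions):

```python
def crossbar_mvm(conductances, voltages, g_leak=0.0):
    """Ideal ReRAM crossbar MVM: each column current is the dot product of
    input voltages and cell conductances; `g_leak` adds a uniform leakage
    conductance per cell, one of the non-idealities discussed above.
    `conductances` is a list of columns, each a per-row conductance list."""
    return [sum(v * (g + g_leak) for v, g in zip(voltages, col))
            for col in conductances]
```

Comparing the output with and without `g_leak` shows how a physical non-ideality translates directly into computational error that device, circuit, or EDA-level techniques must compensate.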
DOI: 10.1109/AICAS57966.2023.10168640 · Published 2023-06-11
Citations: 0
High-Accuracy and Energy-Efficient Acoustic Inference using Hardware-Aware Training and a 0.34nW/Ch Full-Wave Rectifier
Sheng Zhou, Xi Chen, Kwantae Kim, Shih-Chii Liu
A full-wave rectifier (FWR) is a necessary component of many analog acoustic feature extractor (FEx) designs targeted at edge audio applications. However, analog circuits that perform close-to-ideal rectification contribute a significant portion of the total power of the FEx. This work presents an energy-efficient FWR design that uses a dynamic comparator and scales the comparator clock frequency with the input signal bandwidth. Simulated in a 65 nm CMOS process, the rectifier circuit consumes 0.34 nW per channel from a 0.6 V supply. Although the FWR does not perform ideal rectification, an acoustic FEx behavioral model in Python, based on our FWR design, is proposed, and a neural network trained with the output of this behavioral model recovers high classification accuracy in an audio keyword spotting (KWS) task. The behavioral model also includes comparator noise and offset extracted from transistor-level simulation.
The whole KWS chain using our behavioral model achieves 89.45% accuracy for 12-class KWS on the Google Speech Commands Dataset.
DOI: 10.1109/AICAS57966.2023.10168561 · Published 2023-06-11
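The abstract's key point — a rectifier whose sign decision is refreshed only at the comparator clock rate, with training absorbing the resulting error — can be illustrated with a tiny behavioral model. This is an assumed simplification in the spirit of the paper's Python model, not the model itself (no comparator noise or offset here):

```python
def clocked_fwr(samples, decim):
    """Behavioral model of a comparator-based FWR: the sign decision is
    refreshed only every `decim` samples (the comparator clock), so the
    rectifier can briefly apply a stale sign — a non-ideality that
    hardware-aware training must absorb."""
    out, sign = [], 1.0
    for i, x in enumerate(samples):
        if i % decim == 0:           # comparator fires on its own clock
            sign = 1.0 if x >= 0 else -1.0
        out.append(sign * x)         # pass-through or invert
    return out
```

With `decim=1` the model reduces to ideal |x|; with a slower comparator clock, stale sign decisions let sign errors through, which is exactly the behavior a network trained on this model learns to tolerate.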
Citations: 0
A 1W8R 20T SRAM Codebook for 20% Energy Reduction in Mixed-Precision Deep-Learning Inference Processor System
Ryotaro Ohara, Masaya Kabuto, Masakazu Taichi, Atsushi Fukunaga, Yuto Yasuda, Riku Hamabe, S. Izumi, H. Kawaguchi
This study introduces a 1W8R 20T multiport memory for codebook quantization in deep-learning processors. We fabricated the memory in a 40 nm process and achieved a memory read-access time of 2.75 ns at an energy consumption of 2.7 pJ/byte. In addition, we used NVDLA, NVIDIA's deep-learning processor, as a motif and simulated it based on the power obtained from the actual proposed memory. The resulting power and area reductions are 20.24% and 26.24%, respectively.
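Codebook quantization, which the multiport SRAM above serves, stores per-weight indices into a small table of shared values; a 1-write/8-read port arrangement lets many decoders read the table concurrently. A minimal sketch of the encode/decode pair (the names and the nearest-value rule are illustrative assumptions):

```python
def build_codebook_quantizer(codebook):
    """Codebook (look-up) quantization: store only the index of the nearest
    codeword per weight; dequantization is a table lookup."""
    def quantize(weights):
        return [min(range(len(codebook)),
                    key=lambda i: abs(codebook[i] - w)) for w in weights]
    def dequantize(indices):
        return [codebook[i] for i in indices]
    return quantize, dequantize
```

A 16-entry codebook, for instance, shrinks each weight to a 4-bit index, and mixed precision follows naturally from using differently sized codebooks per layer.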
DOI: 10.1109/AICAS57966.2023.10168555 · Published 2023-06-11
Citations: 0
A 115.1 TOPS/W, 12.1 TOPS/mm2 Computation-in-Memory using Ring-Oscillator based ADC for Edge AI
Abhairaj Singh, R. Bishnoi, A. Kaichouhi, Sumit Diware, R. Joshi, S. Hamdioui
Analog computation-in-memory (CIM) architectures alleviate massive data movement between the memory and the processor, promising great prospects for accelerating certain computational tasks in an energy-efficient manner. However, the data converters involved in these architectures typically achieve the required computing accuracy at the expense of a high area and energy footprint, which can determine CIM candidacy for low-power and compact edge-AI devices. In this work, we present a memory-periphery co-design that performs accurate A/D conversions of analog matrix-vector-multiplication (MVM) outputs. We introduce a scheme in which select-lines and bit-lines in the memory are virtually fixed to improve conversion accuracy and aid a ring-oscillator-based A/D conversion, equipped with component sharing and inter-matching of the reference blocks. In addition, we deploy a self-timed technique to further ensure high robustness against global design and cycle-to-cycle variations. Based on measurement results of a 4 Kb CIM chip prototype fabricated in TSMC 40 nm technology, a relative accuracy of up to 99.71% is achieved with an energy efficiency of 115.1 TOPS/W and a computational density of 12.1 TOPS/mm2 for the MNIST dataset.
This represents improvements of up to 11.3× and 7.5×, respectively, over the state of the art.
DOI: 10.1109/AICAS57966.2023.10168647 · Published 2023-06-11
Cited by: 0
Venue: 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)