
2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS): Latest Papers

A Novel Transpose 2T-DRAM based Computing-in-Memory Architecture for On-chip DNN Training and Inference
Yuansheng Zhao, Zixuan Shen, Jiarui Xu, K. Chai, Yanqing Wu, Chao Wang
Recently, DRAM-based Computing-in-Memory (CIM) has emerged as one of the potential CIM solutions owing to its high bit-cell density, large memory capacity, and CMOS compatibility. This paper proposes a 2T-DRAM-based CIM architecture that efficiently performs both CIM inference and training for deep neural networks (DNNs). The architecture uses 2T-DRAM-based transpose circuitry to implement a transpose weight memory array, and digital logic in the array periphery to perform digital DNN computation in memory. A novel mapping method maps the convolutional and fully connected computations of forward and back propagation onto the transpose 2T-DRAM CIM array to achieve digital weight multiplexing and parallel computing. Simulation results show that, with a 16K DRAM array accelerating a 4CONV+3FC network at 100 MHz, the proposed architecture delivers an estimated 11.26 GOPS and achieves 82.15% accuracy on the CIFAR-10 dataset, both much higher than state-of-the-art DRAM-based CIM accelerators without CIM learning capability. A preliminary evaluation of retention time in DRAM CIM also shows that, with a suitably sized CIM array and the proposed mapping strategy, refresh-free training and inference of lightweight networks can be realized with negligible refresh-induced performance loss or power increase.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168641 · Published: 2023-06-11 · Citations: 0
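The transpose 2T-DRAM abstract above rests on one observation: forward propagation multiplies by the weight matrix W, while back propagation multiplies by its transpose, so a single stored array that can be read along rows or along columns can serve both passes. The following is a minimal software sketch of that idea only. The function names `forward` and `backward` are hypothetical, and the sketch does not model the 2T-DRAM circuitry or the paper's mapping method.

```python
# Illustrative sketch only: one stored weight array serves both passes.
# Names (forward, backward) are hypothetical and not taken from the paper.

def forward(weights, x):
    """Row-wise access: y[i] = sum_j W[i][j] * x[j]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]

def backward(weights, delta_out):
    """Column-wise (transposed) access: d[j] = sum_i W[i][j] * delta_out[i]."""
    n_cols = len(weights[0])
    return [
        sum(weights[i][j] * delta_out[i] for i in range(len(weights)))
        for j in range(n_cols)
    ]

W = [[1, 2, 3],
     [4, 5, 6]]                  # 2 outputs x 3 inputs, stored once
y = forward(W, [1, 0, -1])       # forward pass uses W
d = backward(W, [1, 1])          # backward pass uses W^T, no second copy
```

The point of the sketch is that `backward` never builds a transposed copy of `W`; it only changes the access direction, which is the role the abstract assigns to the transpose circuitry.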
F-CNN: Faster CNN Exploiting Data Re-Use with Statistical Analysis
Fatmah Alantali, Y. Halawani, B. Mohammad, M. Al-Qutayri
Many current edge computing devices need efficient implementations of Artificial Intelligence (AI) applications due to strict latency, security, and power requirements. Nonetheless, such devices face various challenges when executing AI applications because of their limited computing and energy resources. In particular, the Convolutional Neural Network (CNN) is a popular machine learning method that derives a high-level function from training on various visual input examples. This paper contributes to enabling offline use of CNNs on resource-constrained devices, where a trade-off between accuracy, running time, and power efficiency is verified. The paper investigates minimal pre-processing of the input data to identify nonessential computations in the convolutional layers. The spatial locality of the input data is considered along with an efficient pre-processing method to mitigate the accuracy loss caused by the computational re-use approach. The technique was tested on LeNet and CIFAR-10 structures and caused 1.9% and 1.6% accuracy loss while reducing processing time by 38.3% and 20.9% and energy by 38.3% and 20.7%, respectively. The models were deployed and verified on a Raspberry Pi 4 B platform using MATLAB Coder to measure time and energy.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168606 · Published: 2023-06-11 · Citations: 0
Group Vectored Absolute-Value-Subtraction Cell Array for the Efficient Acceleration of AdderNet
Jiahao Chen, Wanbo Hu, Wenling Ma, Zhilin Zhang, Mingqiang Huang
Convolutional neural networks (CNNs) have been widely used to boost the performance of Artificial Intelligence (AI) tasks. However, CNN models are usually computationally intensive. Recently, AdderNet, a novel CNN based on the absolute-value-subtraction (ABS) operation, was proposed to reduce computational complexity and energy burden, but dedicated hardware design for it has rarely been explored. In this work, we propose an energy-efficient AdderNet accelerator to address this issue. At the hardware architecture level, we develop a flexible, group-vectored systolic array to balance circuit area, power, and speed. Thanks to the low delay of the ABS operation, the systolic array can reach an extremely high frequency of up to 2 GHz, while its power and area efficiency show about a 3× improvement over the CNN counterpart. At the processing element level, we propose a new ABS cell based on algorithm optimization, which shows about 10% higher performance than the naive design. Finally, the accelerator is deployed on an FPGA platform to accelerate the AdderNet ResNet-18 network as a case study. The peak throughput is 424.2 GOP/s, which is much higher than that of previous works.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168637 · Published: 2023-06-11 · Citations: 0
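To make the absolute-value-subtraction (ABS) operation in the AdderNet abstract above concrete, the sketch below contrasts it with the multiply-accumulate response of an ordinary convolution. It assumes the usual AdderNet formulation, in which the response is the negative L1 distance between an input patch and a kernel. The function names are hypothetical, and nothing here reflects the accelerator's cell design.

```python
# Illustrative comparison of a multiply-accumulate (MAC) response with the
# absolute-value-subtraction (ABS) response used by AdderNet-style layers.
# Function names are hypothetical.

def mac_response(patch, kernel):
    """Conventional convolution response: sum of products."""
    return sum(x * w for x, w in zip(patch, kernel))

def abs_response(patch, kernel):
    """AdderNet-style response: negative L1 distance, no multiplications."""
    return -sum(abs(x - w) for x, w in zip(patch, kernel))

patch = [3, -1, 4, 0]
kernel = [2, -1, 1, 2]
mac = mac_response(patch, kernel)     # 6 + 1 + 4 + 0
adder = abs_response(patch, kernel)   # -(1 + 0 + 3 + 2)
```

Each ABS term needs only a subtraction and a sign handling step, which is consistent with the abstract's remark that the operation has low delay compared with a multiplier.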
GPIL: Gradient with PseudoInverse Learning for High Accuracy Fine-Tuning
Gilha Lee, N. Kim, Hyun Kim
PseudoInverse learning (PIL) was proposed to increase the convergence speed of conventional gradient descent. PIL can train convolutional neural networks (CNNs) quickly and reliably without gradients by using a pseudoinverse matrix. However, PIL has several problems when training a network. First, it runs out of memory because all batches are required during one training epoch. Second, the network cannot be made deeper because the input pseudoinverse matrices become less reliable as more PIL layers are stacked. Therefore, PIL has not yet been applied effectively to widely used deep models. Motivated by these limitations, we propose a novel error propagation methodology that allows fine-tuning, which is often used in resource-constrained environments, to be performed more accurately. Specifically, by using both PIL and gradient descent, we not only enable mini-batch training, which was impossible in PIL, but also achieve higher accuracy through more accurate error propagation. Moreover, unlike the existing PIL, which uses only the pseudoinverse matrix of the CNN input, we additionally use the pseudoinverse matrix of the weights to compensate for the limitations of PIL; thus, the proposed method enables faster and more accurate error propagation during CNN training. As a result, it is efficient for fine-tuning in resource-constrained environments, such as mobile/edge devices that require comparable accuracy within a small number of training epochs. Experimental results show that the proposed method improves accuracy after ResNet-101 fine-tuning on the CIFAR-100 dataset by 2.78% compared to the baseline.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168584 · Published: 2023-06-11 · Citations: 0
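The GPIL abstract above builds on pseudoinverse learning, where a layer's weights are obtained in closed form instead of by gradient steps. The sketch below shows only that basic building block for a single linear layer, assuming the input matrix has full column rank so that the pseudoinverse equals (XᵀX)⁻¹Xᵀ. It is not the GPIL method itself, and all helper names are hypothetical.

```python
# Illustrative sketch: fit one linear layer in closed form, without gradients.
# Assumes X has full column rank, so X+ = (X^T X)^-1 X^T.
# Helper names are hypothetical; this is not the GPIL algorithm.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(m)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def pinv_full_column_rank(x):
    xt = transpose(x)
    return matmul(inverse(matmul(xt, x)), xt)

X = [[1, 0], [0, 1], [1, 1]]      # 3 samples, 2 features
Y = [[2], [3], [5]]               # targets generated by weights [2, 3]
W = matmul(pinv_full_column_rank(X), Y)
```

Note that the solve needs every sample in `X` at once, which mirrors the memory problem the abstract attributes to plain PIL and motivates combining it with mini-batch gradient descent.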
Configurable Multi-Precision Floating-Point Multiplier Architecture Design for Computation in Deep Learning
Pei-Hsuan Kuo, Yu-Hsiang Huang, Juinn-Dar Huang
The growing number of AI applications demands efficient computing capabilities to support a huge amount of calculation. Among the related arithmetic operations, multiplication is indispensable in most deep learning applications. To support computing at the different precisions demanded by various applications, a multiplier architecture must meet the multi-precision demand while still achieving high utilization of the multiplication array and high power efficiency. In this paper, a configurable multi-precision floating-point (FP) multiplier architecture with minimized redundant bits is presented. It can execute 16× FP8 operations, 8× brain-floating-point (BF16) operations, 4× half-precision (FP16) operations, or 1× single-precision (FP32) operation every cycle while maintaining a 100% multiplication hardware utilization ratio. Moreover, the results can also be represented in higher-precision formats for subsequent high-precision computations. The proposed design has been implemented in the TSMC 40 nm process at a 1 GHz clock frequency and consumes only 16.78 mW on average. Compared to existing multi-precision FP multiplier architectures, the proposed design achieves the highest hardware utilization ratio with only 4.9K logic gates in the multiplication array. It also achieves high energy efficiencies of 1212.1, 509.6, 207.3, and 42.6 GFLOPS/W in FP8, BF16, FP16, and FP32 modes, respectively.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168572 · Published: 2023-06-11 · Citations: 1
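The per-cycle operation counts and the 1 GHz clock quoted in the multiplier abstract above translate directly into peak throughput per mode. The short calculation below is a back-of-the-envelope sketch that assumes one multiplication counts as one operation, which the abstract does not state. The names `OPS_PER_CYCLE` and `peak_gops` are hypothetical.

```python
# Back-of-the-envelope peak throughput from figures quoted in the abstract.
# Assumption (not stated in the abstract): one multiplication = one operation.

CLOCK_HZ = 1_000_000_000                                       # 1 GHz
OPS_PER_CYCLE = {"FP8": 16, "BF16": 8, "FP16": 4, "FP32": 1}   # per abstract

def peak_gops(mode):
    """Peak multiplications per second, in units of 10^9."""
    return OPS_PER_CYCLE[mode] * CLOCK_HZ / 1e9

throughput = {mode: peak_gops(mode) for mode in OPS_PER_CYCLE}
```

Under this assumption the FP8 mode peaks at 16 times the FP32 rate, while the reported energy-efficiency ratio between the two modes is larger, which suggests that power also differs between modes; the abstract gives only the average power.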
Live Demonstration: An Integrated Computing and Communication Platform for Vehicle-Infrastructure Cooperative Autonomous Driving
Yuhang Gu, Wei Zhang, Yi-xing Shi, Limin Jiang, Shan-Guo Li, Sha Cao, Zhiyuan Jiang, Ruiqing Mao, Zhewen Lou, Sheng Zhou
Perception, computing, and communication are usually decoupled in today’s vehicle-road coordination applications, which significantly adds to system delay and cost. In contrast, we showcase a platform that integrates perception, communication, and computing to provide timely roadside bird's-eye-view (BEV) maps to vehicles for vision fusion. A neural processing unit and a cellular vehicle-to-everything (C-V2X) wireless baseband are both implemented on an FPGA.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168600 · Published: 2023-06-11 · Citations: 0
EpilepsyNet: Interpretable Self-Supervised Seizure Detection for Low-Power Wearable Systems
Baichuan Huang, R. Zanetti, A. Abtahi, D. Atienza, A. Aminifar
Epilepsy is one of the most common neurological disorders that is characterized by recurrent and unpredictable seizures. Wearable systems can be used to detect the onset of a seizure and notify family members and emergency units for rescue. The majority of state-of-the-art studies in the epilepsy domain currently explore modern machine learning techniques, e.g., deep neural networks, to accurately detect epileptic seizures. However, training deep learning networks requires a large amount of data and computing resources, which is a major challenge for resource-constrained wearable systems. In this paper, we propose EpilepsyNet, the first interpretable self-supervised network tailored to resource-constrained devices without using any seizure data in its initial offline training. At runtime, however, once a seizure is detected, it can be incorporated into our self-supervised technique to improve seizure detection performance, without the need to retrain our learning model, hence incurring no energy overheads. Our self-supervised approach can reach a detection performance of 79.2%, which is on par with the state-of-the-art fully-supervised deep neural networks trained on seizure data. At the same time, our proposed approach can be deployed in resource-constrained wearable devices, reaching up to 1.3 days of battery life on a single charge.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168560 · Published: 2023-06-11 · Citations: 0
Architecture-Aware Optimization of Layer Fusion for Latency-Optimal CNN Inference
Minyong Yoon, Jungwook Choi
Layer fusion is an effective technique for accelerating latency-sensitive CNN inference tasks on resource-constrained accelerators that exploit distributed, on-chip integrated memory-accelerator processing-in-memory (PIM). However, previous research primarily focused on optimizing memory access, neglecting the significant impact of hardware architecture on latency. This study presents an analytical latency model for a 2D systolic array accelerator that takes into account hardware factors such as array dimensions, buffer size, and bandwidth. We then investigate the influence of hardware architecture and fusion strategies, including weight and overlap reuse, on performance; these aspects are insufficiently addressed in existing access-based fusion models. By combining layer fusion with our proposed latency model across different architectures, dataflows, and workloads, we achieve up to a 53.1% reduction in end-to-end network latency compared to an access-based model.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168659 · Published: 2023-06-11 · Citations: 0
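The layer-fusion abstract above argues that latency depends on array dimensions and bandwidth, not on memory access counts alone. The sketch below is a hypothetical first-order model written for illustration; it is not the paper's analytical model. It assumes a weight-stationary array, a simple pipeline fill and drain term, and memory traffic that fully overlaps with compute. The function name and all parameters are assumptions.

```python
# Hypothetical first-order latency model for one layer on a 2D systolic array.
# This is NOT the paper's model; every formula below is a simplifying assumption.
from math import ceil

def layer_latency_cycles(m, k, n, rows, cols, bytes_moved, bytes_per_cycle):
    """Estimate cycles for an (m x k) @ (k x n) matrix multiply.

    Assumes a weight-stationary array holding a (rows x cols) weight tile,
    m input rows streamed through each tile plus a pipeline fill/drain term,
    and off-chip traffic fully overlapped with compute.
    """
    tiles = ceil(k / rows) * ceil(n / cols)
    compute = tiles * (m + rows + cols - 2)
    memory = ceil(bytes_moved / bytes_per_cycle)
    return max(compute, memory)     # the slower of the two bounds the layer

compute_bound = layer_latency_cycles(64, 32, 32, 16, 16, 1024, 64)
memory_bound = layer_latency_cycles(64, 32, 32, 16, 16, 65536, 64)
```

In a model of this kind, fusing layers lowers `bytes_moved` by keeping intermediate activations on chip, which helps only while the layer is memory bound. Once the compute term dominates, further access reduction does not shorten latency, and that is the kind of effect an access-only fusion model would miss.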
Live Demonstration: An Efficient Neural Network Processor with Reduced Data Transmission and On-chip Shortcut Mapping
Yichuan Bai, Zhuang Shao, Chenshuo Zhang, Aojie Jiang, Yuan Du, Li Du
This demonstration showcases an efficient neural network processor implemented in TSMC 28 nm CMOS technology. The processor performs neural network inference with 16-bit dynamic fixed-point activations and 10-bit dynamic fixed-point weights. A reconfigurable streaming architecture is employed to reduce off-chip data transmission and to enable on-chip shortcut mapping. An integrated neural network toolchain, comprising a network model converter, a quantitative analysis tool, and a deep learning compiler, has also been developed for fast network deployment.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168666 | Published: 2023-06-11
Citations: 0
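As a rough illustration of the 16-bit and 10-bit dynamic fixed-point formats mentioned in the abstract above, the sketch below quantizes a list of values to signed integers that share one fractional length, chosen from the data range. The abstract does not describe how the processor or its toolchain picks that length, so the rule used here (the largest fractional length that avoids overflow) and all function names are assumptions made for illustration only.

```python
import math


def choose_frac_bits(values, total_bits):
    """Assumed rule: largest fractional length for which max |value| still fits."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    qmax = 2 ** (total_bits - 1) - 1
    # Largest f such that max_abs * 2**f <= qmax.
    return math.floor(math.log2(qmax / max_abs))


def quantize(values, total_bits):
    """Map floats to signed total_bits integers sharing one fractional length."""
    frac_bits = choose_frac_bits(values, total_bits)
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    # Round to nearest and saturate to the representable range.
    ints = [min(qmax, max(qmin, round(v * scale))) for v in values]
    return ints, frac_bits


def dequantize(ints, frac_bits):
    """Recover approximate real values from the shared fractional length."""
    return [i / 2.0 ** frac_bits for i in ints]


# Hypothetical data: 10-bit weights and 16-bit activations.
w_ints, w_frac = quantize([0.5, -0.25, 0.125], total_bits=10)
a_ints, a_frac = quantize([3.0, -1.5], total_bits=16)
print(w_ints, w_frac)  # [256, -128, 64] 9
print(a_ints, a_frac)  # [24576, -12288] 13
```

Because the fractional length follows the data, a tensor with a small range keeps more fractional precision than one with a large range, which is the usual motivation for a dynamic rather than a fixed binary point.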
Read-disturb Detection Methodology for RRAM-based Computation-in-Memory Architecture
Mohammad Amin Yaldagard, Sumit Diware, R. Joshi, S. Hamdioui, R. Bishnoi
Resistive random access memory (RRAM) based computation-in-memory (CIM) architectures can meet the unprecedented energy-efficiency requirements of executing AI algorithms directly on edge devices. However, the read-disturb problem associated with these architectures can lead to accumulated computational errors. To maintain the necessary level of computational accuracy, these devices must be reprogrammed after a specific number of read cycles, which is a static approach and requires a large counter. This paper proposes a circuit-level RRAM read-disturb detection technique that monitors the real-time conductance drift of RRAM devices and initiates reprogramming only when it is actually needed. Moreover, an analytic method is presented to determine the minimum conductance detection requirements, and the proposed read-disturb detection technique is tuned to these requirements so that read-disturb is detected dynamically. SPICE simulation results using TSMC 40 nm technology show the correct functionality of the proposed detection technique.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168638 | Published: 2023-06-11
Citations: 0
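The contrast the abstract above draws between a static read counter and drift-based detection can be sketched with a toy behavioral model. This is not the paper's circuit: the constant per-read drift, the conductance units, the threshold, and every name below are hypothetical values chosen only to show why a counter sized for worst-case drift may reprogram more often than a detector that reacts to the observed drift.

```python
def simulate(reads, drift_per_read, policy):
    """Count reprogramming events over `reads` read cycles (toy model)."""
    target = 100.0  # hypothetical programmed conductance, arbitrary units
    g = target
    reprograms = 0
    for cycle in range(1, reads + 1):
        g += drift_per_read  # assumed: each read shifts conductance slightly
        if policy(cycle, g, target):
            g = target  # reprogramming restores the target conductance
            reprograms += 1
    return reprograms


def static_counter(period):
    """Reprogram every `period` reads, regardless of the actual drift."""
    return lambda cycle, g, target: cycle % period == 0


def drift_detector(threshold):
    """Reprogram only once the observed drift reaches `threshold`."""
    return lambda cycle, g, target: abs(g - target) >= threshold


# Counter period of 10 assumes a worst-case drift of 0.5 per read for a
# tolerance of 5.0; the actual drift here is a milder 0.125 per read.
print(simulate(400, 0.125, static_counter(10)))    # 40
print(simulate(400, 0.125, drift_detector(5.0)))   # 10
```

Under these assumed numbers the drift-based policy reprograms a quarter as often while keeping the deviation within the same tolerance, and it does not need a read counter at all.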