Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168641
Yuansheng Zhao, Zixuan Shen, Jiarui Xu, K. Chai, Yanqing Wu, Chao Wang
Recently, DRAM-based Computing-in-Memory (CIM) has emerged as one of the potential CIM solutions due to its unique advantages of high bit-cell density, large memory capacity and CMOS compatibility. This paper proposes a 2T-DRAM based CIM architecture that can perform both CIM inference and training for deep neural networks (DNNs) efficiently. The proposed CIM architecture employs 2T-DRAM based transpose circuitry to implement a transpose weight memory array and uses digital logic in the array periphery to implement digital DNN computation in memory. A novel mapping method is proposed to map the convolutional and fully-connected computation of the forward-propagation and back-propagation processes onto the transpose 2T-DRAM CIM array, achieving digital weight multiplexing and parallel computing. Simulation results show that the computing power of the proposed transpose 2T-DRAM based CIM architecture is estimated to be 11.26 GOPS for a 16K DRAM array accelerating a 4CONV+3FC network at 100 MHz, with an accuracy of 82.15% on the CIFAR-10 dataset, figures much higher than those of state-of-the-art DRAM-based CIM accelerators, which lack CIM learning capability. A preliminary evaluation of retention time in DRAM CIM also shows that a refresh-less training-inference process for lightweight networks can be realized with a suitably sized CIM array through the proposed mapping strategy, with negligible refresh-induced performance loss or power increase.
{"title":"A Novel Transpose 2T-DRAM based Computing-in-Memory Architecture for On-chip DNN Training and Inference","authors":"Yuansheng Zhao, Zixuan Shen, Jiarui Xu, K. Chai, Yanqing Wu, Chao Wang","doi":"10.1109/AICAS57966.2023.10168641","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168641","url":null,"abstract":"Recently, DRAM-based Computing-in-Memory (CIM) has emerged as one of the potential CIM solutions due to its unique advantages of high bit-cell density, large memory capacity and CMOS compatibility. This paper proposes a 2T-DRAM based CIM architecture, which can perform both CIM inference and training for deep neural networks (DNNs) efficiently. The proposed CIM architecture employs 2T-DRAM based transpose circuitry to implement transpose weight memory array and uses digital logic in the array peripheral to implement digital DNN computation in memory. A novel mapping method is proposed to map the convolutional and full-connection computation of the forward propagation and back propagation process into the transpose 2T-DRAM CIM array to achieve digital weight multiplexing and parallel computing. Simulation results show that the computing power of proposed transpose 2T-DRAM based CIM architecture is estimated to 11.26 GOPS by a 16K DRAM array to accelerate 4CONV+3FC @100 MHz and has an 82.15% accuracy on CIFAR-10 dataset, which are much higher than the state-of-the-art DRAM-based CIM accelerators without CIM learning capability. Preliminary evaluation of retention time in DRAM CIM also shows that a refresh-less training-inference process of lightweight networks can be realized by a suitable scale of CIM array through the proposed mapping strategy with negligible refresh-induced performance loss or power increase.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133249341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168606
Fatmah Alantali, Y. Halawani, B. Mohammad, M. Al-Qutayri
Many current edge computing devices need efficient implementations of Artificial Intelligence (AI) applications due to strict latency, security and power requirements. Nonetheless, such devices face various challenges when executing AI applications because of their limited computing and energy resources. In particular, Convolutional Neural Networks (CNNs) are a popular machine learning method that derives high-level functions from training on various visual input examples. This paper contributes to enabling the use of CNNs offline on resource-constrained devices, where a trade-off between accuracy, running time and power efficiency is verified. The paper investigates the use of minimal pre-processing of input data to identify non-essential computations in the convolutional layers. In this work, the spatial locality of input data is exploited together with an efficient pre-processing method to mitigate the accuracy loss caused by the computational re-use approach. The technique was tested on LeNet and CIFAR-10 structures, incurring 1.9% and 1.6% accuracy loss while reducing processing time by 38.3% and 20.9% and energy by 38.3% and 20.7%, respectively. The models were deployed and verified on a Raspberry Pi 4 B platform using the MATLAB Coder to measure time and energy.
{"title":"F-CNN: Faster CNN Exploiting Data Re-Use with Statistical Analysis","authors":"Fatmah Alantali, Y. Halawani, B. Mohammad, M. Al-Qutayri","doi":"10.1109/AICAS57966.2023.10168606","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168606","url":null,"abstract":"Many of the current edge computing devices need efficient implementation of Artificial Intelligence (AI) applications due to strict latency, security and power requirements. Nonetheless, such devices, face various challenges when executing AI applications due to their limited computing and energy resources. In particular, Convolutional Neural Networks (CNN) is a popular machine learning method that derives a high-level function from being trained on various visual input examples. This paper contributes to enabling the use of CNN on resource-constrained devices offline, where a trade-off between accuracy, running time and power efficiency is verified. The paper investigates the use of minimum pre-processing methods of input data to identify nonessential computations in the convolutional layers. In this work, Spatial locality of input data is considered along with an efficient pre-processing method to mitigate the accuracy loss caused by the computational re-use approach. This technique was tested on LeNet and CIFAR-10 structures and was responsible for 1.9% and 1.6% accuracy loss while reducing the processing time by 38.3% and 20.9% and reducing the energy by 38.3%, and 20.7%, respectively. The models were deployed and verified on Raspberry Pi 4 B platform using the MATLAB coder to measure time and energy.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130546974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168637
Jiahao Chen, Wanbo Hu, Wenling Ma, Zhilin Zhang, Mingqiang Huang
Convolutional neural networks (CNNs) have been widely used to boost the performance of Artificial Intelligence (AI) tasks. However, CNN models are usually computationally intensive. Recently, the novel absolute-value-subtraction (ABS) based CNN, namely AdderNet, has been proposed to reduce computation complexity and energy burden, but dedicated hardware designs for it have rarely been explored. In this work, we propose an energy-efficient AdderNet accelerator to address this issue. At the hardware architecture level, we develop a flexible, group-vectored systolic array to balance circuit area, power, and speed. Thanks to the low delay of the ABS operation, the systolic array can reach frequencies as high as 2 GHz, while its power and area efficiency show about a 3× improvement over a comparable CNN design. At the processing element level, we propose a new ABS cell based on algorithm optimization, which delivers about 10% higher performance than the naive design. Finally, the accelerator is deployed on an FPGA platform to accelerate the AdderNet ResNet-18 network as a case study. The peak throughput is 424.2 GOP/s, which is much higher than previous works.
{"title":"Group Vectored Absolute-Value-Subtraction Cell Array for the Efficient Acceleration of AdderNet","authors":"Jiahao Chen, Wanbo Hu, Wenling Ma, Zhilin Zhang, Mingqiang Huang","doi":"10.1109/AICAS57966.2023.10168637","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168637","url":null,"abstract":"Convolutional neural networks (CNN) have been widely used for boosting the performance of Artificial Intelligence (AI) tasks. However, the CNN models are usually computational intensive. Recently, the novel absolute-value-subtraction (ABS) operation based CNN, namely the AdderNet is proposed to reduce the computation complexity and energy burden. But the specific hardware design has rarely been explored. In this work, we propose an energy-efficient AdderNet accelerator to address such issue. At the hardware architecture level, we develop a flexible and group vectored systolic array to balance the circuit area, power, and speed. Thanks to the low delay of ABS operation, the systolic array can reach extremely high frequency up to 2GHz. Meanwhile the power- and area- efficiency exhibits about 3× improvement compared with its CNN counterpart. At the processing element level, we propose new ABS cell based on algorithm optimization, which shows about 10% higher performance than the naive design. Finally, the accelerator is practically deployed on FPGA platform to accelerate the AdderNet ResNet-18 network as a case study. The peak throughput is 424.2 GOP/s, which is much higher than previous works.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129368423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168584
Gilha Lee, N. Kim, Hyun Kim
PseudoInverse Learning (PIL) was proposed to increase the convergence speed of conventional gradient descent. PIL can train convolutional neural networks (CNNs) quickly and reliably without gradients by using a pseudoinverse matrix. However, PIL has several problems when training a network. First, there is an out-of-memory problem because all batches are required during one epoch of training. Second, the network cannot be made deeper because increasingly unreliable input pseudoinverse matrices are used as more PIL layers are stacked. Therefore, PIL has not yet been effectively applied to widely used deep models. Motivated by these limitations of existing PIL, we propose a novel error-propagation methodology that allows the fine-tuning process, often used in resource-constrained environments, to be performed more accurately. In detail, by using both PIL and gradient descent, we not only enable mini-batch training, which was impossible in PIL, but also achieve higher accuracy through more accurate error propagation. Moreover, unlike existing PIL, which uses only the pseudoinverse matrix of the CNN input, we additionally use the pseudoinverse matrix of the weights to compensate for the limitations of PIL; thus, the proposed method enables faster and more accurate error propagation in the CNN training process. As a result, it is efficient for fine-tuning in resource-constrained environments, such as mobile/edge devices that must reach comparable accuracy within a small number of training epochs. Experimental results show that the proposed method improves accuracy after ResNet-101 fine-tuning on the CIFAR-100 dataset by 2.78% compared to the baseline.
{"title":"GPIL: Gradient with PseudoInverse Learning for High Accuracy Fine-Tuning","authors":"Gilha Lee, N. Kim, Hyun Kim","doi":"10.1109/AICAS57966.2023.10168584","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168584","url":null,"abstract":"PseudoInverse learning (PIL) is proposed to increase the convergence speed of conventional gradient descent. PIL can be trained with fast and reliable convolutional neural networks (CNNs) without a gradient using a pseudoinverse matrix. However, PIL has several problems when training a network. First, there is an out-of-memory problem because all batches are required during one epoch of training. Second, the network cannot be deeper because more unreliable input pseudoinverse matrices are used as the deeper PIL layer is stacked. Therefore, PIL has not yet been effectively applied to widely used deep models. Inspired by the limitation of the existing PIL, we propose a novel error propagation methodology that allows the fine-tuning process, which is often used in a resource-constrained environment, to be performed more accurately. In detail, by using both PIL and gradient descent, we not only enable mini-batch training, which was impossible in PIL, but also achieve higher accuracy through more accurate error propagation. Moreover, unlike the existing PIL, which uses only the pseudoinverse matrix of the CNN input, we additionally use the pseudoinverse matrix of weights to compensate for the limitations of PIL; thus, the proposed method enables faster and more accurate error propagation in the CNN training process. As a result, it is efficient for fine-tuning in resource-constrained environments, such as mobile/edge devices that require an accuracy comparable to small training epochs. Experimental results show that the proposed method improves the accuracy after ResNet-101 fine-tuning on the CIFAR-100 dataset by 2.78% compared to the baseline.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116140787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168572
Pei-Hsuan Kuo, Yu-Hsiang Huang, Juinn-Dar Huang
The growing number of AI applications demands efficient computing capabilities to support an enormous amount of calculation. Among the related arithmetic operations, multiplication is an indispensable part of most deep learning applications. To support computing in the different precisions demanded by various applications, a multiplier architecture must meet multi-precision requirements while still achieving high utilization of the multiplication array and high power efficiency. In this paper, a configurable multi-precision floating-point (FP) multiplier architecture with minimized redundant bits is presented. It can execute 16× FP8 operations, 8× brain-floating-point (BF16) operations, 4× half-precision (FP16) operations, or 1× single-precision (FP32) operation every cycle while maintaining a 100% multiplication-hardware utilization ratio. Moreover, the computed results can also be represented in higher-precision formats for subsequent high-precision computations. The proposed design has been implemented in a TSMC 40 nm process at a 1 GHz clock frequency and consumes only 16.78 mW on average. Compared to existing multi-precision FP multiplier architectures, the proposed design achieves the highest hardware utilization ratio with only 4.9K logic gates in the multiplication array. It also achieves high energy efficiencies of 1212.1, 509.6, 207.3, and 42.6 GFLOPS/W in FP8, BF16, FP16 and FP32 modes, respectively.
{"title":"Configurable Multi-Precision Floating-Point Multiplier Architecture Design for Computation in Deep Learning","authors":"Pei-Hsuan Kuo, Yu-Hsiang Huang, Juinn-Dar Huang","doi":"10.1109/AICAS57966.2023.10168572","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168572","url":null,"abstract":"The increasing AI applications demands efficient computing capabilities to support a huge amount of calculations. Among the related arithmetic operations, multiplication is an indispensable part in most of deep learning applications. To support computing in different precisions demanded by various applications, it is essential for a multiplier architecture to meet the multi-precision demand while still achieving high utilization of the multiplication array and power efficiency. In this paper, a configurable multi-precision FP multiplier architecture with minimized redundant bits is presented. It can execute 16× FP8 operations, or 8× brain-floating-point (BF16) operations, or 4× half-precision (FP16) operations, or 1× single-precision (FP32) operation every cycle while maintaining a 100% multiplication hardware utilization ratio. Moreover, the computing results can also be represented in higher precision formats for succeeding high-precision computations. The proposed design has been implemented using the TSMC 40nm process with 1GHz clock frequency and consumes only 16.78mW on average. Compared to existing multi-precision FP multiplier architectures, the proposed design achieves the highest hardware utilization ratio with only 4.9K logic gates in the multiplication array. It also achieves high energy efficiencies of 1212.1, 509.6, 207.3, and 42.6 GFLOPS/W at FP8, BF16, FP16 and FP32 modes, respectively.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122212060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168600
Yuhang Gu, Wei Zhang, Yi-xing Shi, Limin Jiang, Shan-Guo Li, Sha Cao, Zhiyuan Jiang, Ruiqing Mao, Zhewen Lou, Sheng Zhou
Perception, computing and communication are usually decoupled in today's vehicle-road coordination applications, which significantly adds to system delay and cost. In contrast, we showcase a platform that integrates perception, communication and computing to provide timely roadside bird's-eye-view (BEV) maps to vehicles for vision fusion. A neural processing unit and a cellular vehicle-to-everything (C-V2X) wireless baseband are both implemented on an FPGA.
{"title":"Live Demonstration: An Integrated Computing and Communication Platform for Vehicle-Infrastructure Cooperative Autonomous Driving","authors":"Yuhang Gu, Wei Zhang, Yi-xing Shi, Limin Jiang, Shan-Guo Li, Sha Cao, Zhiyuan Jiang, Ruiqing Mao, Zhewen Lou, Sheng Zhou","doi":"10.1109/AICAS57966.2023.10168600","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168600","url":null,"abstract":"Perception, computing and communication are usually decoupled in today’s vehicle-road coordination applications, which significantly adds to the system delay and cost. In contrast, we showcase a platform that integrates perception, communication and computing to provide timely roadside bird-eye-view (BEV) maps to vehicles for vision fusion. A neural processing unit and a cellular vehicle-to-everything (C-V2X) wireless baseband are both implemented on FPGA.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125778701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168560
Baichuan Huang, R. Zanetti, A. Abtahi, D. Atienza, A. Aminifar
Epilepsy is one of the most common neurological disorders that is characterized by recurrent and unpredictable seizures. Wearable systems can be used to detect the onset of a seizure and notify family members and emergency units for rescue. The majority of state-of-the-art studies in the epilepsy domain currently explore modern machine learning techniques, e.g., deep neural networks, to accurately detect epileptic seizures. However, training deep learning networks requires a large amount of data and computing resources, which is a major challenge for resource-constrained wearable systems. In this paper, we propose EpilepsyNet, the first interpretable self-supervised network tailored to resource-constrained devices without using any seizure data in its initial offline training. At runtime, however, once a seizure is detected, it can be incorporated into our self-supervised technique to improve seizure detection performance, without the need to retrain our learning model, hence incurring no energy overheads. Our self-supervised approach can reach a detection performance of 79.2%, which is on par with the state-of-the-art fully-supervised deep neural networks trained on seizure data. At the same time, our proposed approach can be deployed in resource-constrained wearable devices, reaching up to 1.3 days of battery life on a single charge.
{"title":"EpilepsyNet: Interpretable Self-Supervised Seizure Detection for Low-Power Wearable Systems","authors":"Baichuan Huang, R. Zanetti, A. Abtahi, D. Atienza, A. Aminifar","doi":"10.1109/AICAS57966.2023.10168560","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168560","url":null,"abstract":"Epilepsy is one of the most common neurological disorders that is characterized by recurrent and unpredictable seizures. Wearable systems can be used to detect the onset of a seizure and notify family members and emergency units for rescue. The majority of state-of-the-art studies in the epilepsy domain currently explore modern machine learning techniques, e.g., deep neural networks, to accurately detect epileptic seizures. However, training deep learning networks requires a large amount of data and computing resources, which is a major challenge for resource-constrained wearable systems. In this paper, we propose EpilepsyNet, the first interpretable self-supervised network tailored to resource-constrained devices without using any seizure data in its initial offline training. At runtime, however, once a seizure is detected, it can be incorporated into our self-supervised technique to improve seizure detection performance, without the need to retrain our learning model, hence incurring no energy overheads. Our self-supervised approach can reach a detection performance of 79.2%, which is on par with the state-of-the-art fully-supervised deep neural networks trained on seizure data. At the same time, our proposed approach can be deployed in resource-constrained wearable devices, reaching up to 1.3 days of battery life on a single charge.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124827070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168659
Minyong Yoon, Jungwook Choi
Layer fusion is an effective technique for accelerating latency-sensitive CNN inference tasks on resource-constrained accelerators that exploit distributed on-chip integrated memory-accelerator processing-in-memory (PIM). However, previous research has primarily focused on optimizing memory access, neglecting the significant impact of the hardware architecture on latency. This study presents an analytical latency model for a 2D systolic-array accelerator, taking into account hardware factors such as array dimensions, buffer size, and bandwidth. We then investigate the influence of hardware architecture and fusion strategies, including weight and overlap reuse, on performance; these aspects are insufficiently addressed in existing access-based fusion models. By combining layer fusion with our proposed latency model across different architectures, dataflows, and workloads, we achieve up to a 53.1% reduction in end-to-end network latency compared to an access-based model.
{"title":"Architecture-Aware Optimization of Layer Fusion for Latency-Optimal CNN Inference","authors":"Minyong Yoon, Jungwook Choi","doi":"10.1109/AICAS57966.2023.10168659","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168659","url":null,"abstract":"Layer fusion is an effective technique for accelerating latency-sensitive CNN inference tasks on resource-constrained accelerators that exploit distributed on-chip integrated memory-accelerator processing-in memory (PIM). However, previous research primarily focused on optimizing memory access, neglecting the significant impact of hardware architecture on latency. This study presents an analytical latency model for a 2D systolic array accelerator, taking into account various hardware factors such as array dimensions, buffer size, and bandwidth. We then investigate the influence of hardware architecture and fusion strategies, including weight and overlap reuse, on performance; these aspects are insufficiently addressed in existing access-based fusion models. By incorporating layer fusion with our proposed latency model across different architectures, dataflows, and workloads, we achieve up to a 53.1% reduction in end-to-end network latency compared to an access-based model.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125029728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168666
Yichuan Bai, Zhuang Shao, Chenshuo Zhang, Aojie Jiang, Yuan Du, Li Du
This demonstration showcases an efficient neural network processor implemented in TSMC 28 nm CMOS technology. The processor performs neural network inference with 16-bit dynamic fixed-point activations and 10-bit dynamic fixed-point weights. A reconfigurable streaming architecture is employed to reduce off-chip data transmission and to map shortcuts on chip. An integrated neural network toolchain, including a network model converter, a quantitative analysis tool, and a deep learning compiler, has also been developed for fast network deployment.
{"title":"Live Demonstration: An Efficient Neural Network Processor with Reduced Data Transmission and On-chip Shortcut Mapping","authors":"Yichuan Bai, Zhuang Shao, Chenshuo Zhang, Aojie Jiang, Yuan Du, Li Du","doi":"10.1109/AICAS57966.2023.10168666","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168666","url":null,"abstract":"This demonstration showcases an efficient neural network processor implemented in TSMC 28nm CMOS technology. The processor conducts neural network inference with 16-bit dynamic fix-point activation and 10-bit dynamic fix-point weight. The reconfigurable streaming architecture is employed for off-chip data transmission reduction and on-chip shortcut mapping. An integrated neural network toolchain, including network model converter, quantitative analysis tool, and deep learning compiler, is also developed for fast network deployment.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125078682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-11 | DOI: 10.1109/AICAS57966.2023.10168638
Mohammad Amin Yaldagard, Sumit Diware, R. Joshi, S. Hamdioui, R. Bishnoi
Resistive random access memory (RRAM) based computation-in-memory (CIM) architectures can meet the unprecedented energy-efficiency requirements of executing AI algorithms directly on edge devices. However, the read-disturb problem associated with these architectures can lead to accumulated computational errors. To maintain the necessary level of computational accuracy, these devices must be reprogrammed after a specific number of read cycles, which is a static approach and requires a large counter. This paper proposes a circuit-level RRAM read-disturb detection technique that monitors the real-time conductance drift of RRAM devices and initiates reprogramming only when it is actually needed. Moreover, an analytic method is presented to determine the minimum conductance-detection requirements, and the proposed read-disturb detection technique is tuned accordingly to detect disturbance dynamically. SPICE simulation results using TSMC 40 nm technology confirm the correct functionality of the proposed detection technique.
{"title":"Read-disturb Detection Methodology for RRAM-based Computation-in-Memory Architecture","authors":"Mohammad Amin Yaldagard, Sumit Diware, R. Joshi, S. Hamdioui, R. Bishnoi","doi":"10.1109/AICAS57966.2023.10168638","DOIUrl":"https://doi.org/10.1109/AICAS57966.2023.10168638","url":null,"abstract":"Resistive random access memory (RRAM) based computation-in-memory (CIM) architectures can meet the unprecedented energy efficiency requirements to execute AI algorithms directly on edge devices. However, the read-disturb problem associated with these architectures can lead to accumulated computational errors. To achieve the necessary level of computational accuracy, after a specific number of read cycles, these devices must undergo a reprogramming process which is a static approach and needs a large counter. This paper proposes a circuit-level RRAM read-disturb detection technique by monitoring real-time conductance drifts of RRAM devices, which initiate the reprogramming when actually it needs. Moreover, an analytic method is presented to determine the minimum conductance detection requirements, and our proposed read-disturb detection technique is tuned for the same to detect it dynamically. SPICE simulation result using TSMC 40 nm shows the correct functionality of our proposed detection technique.","PeriodicalId":296649,"journal":{"name":"2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125181076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}