
2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS) — Latest Publications

Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge
Georg Rutishauser, Francesco Conti, L. Benini
Mixed-precision quantization, where a deep neural network’s layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology: a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization that finds mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. We achieve up to a 28.6% reduction in end-to-end latency compared to an 8-bit model, at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to the 8-bit baseline even on systems with no hardware support for sub-byte arithmetic, again at negligible accuracy drop. Furthermore, we show the superiority of our approach over differentiable search that targets reduced binary operation counts as a proxy for latency.
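As a rough illustration of the hardware-aware heuristic stage, the sketch below greedily lowers per-layer precisions by latency gain per unit of estimated accuracy cost. The lookup tables `latency_lut` and `sensitivity`, the greedy scoring rule, and the latency budget are all assumptions for illustration; the abstract does not specify the paper's actual heuristic.

```python
# Hypothetical sketch of a latency-driven mixed-precision refinement pass.
# latency_lut[layer][bits] and sensitivity[layer][bits] are assumed inputs;
# the paper's real cost model and heuristic are not given in the abstract.

def total_latency(config, latency_lut):
    return sum(latency_lut[l][b] for l, b in config.items())

def refine_precisions(layers, latency_lut, sensitivity, bits_options=(8, 4, 2), budget=None):
    """Greedily lower layer precisions, taking the best latency gain
    per unit of estimated accuracy cost at each step."""
    config = {layer: 8 for layer in layers}           # start from the 8-bit baseline
    while True:
        best = None
        for layer in layers:
            cur = config[layer]
            for b in (x for x in bits_options if x < cur):
                gain = latency_lut[layer][cur] - latency_lut[layer][b]
                cost = sensitivity[layer][b] - sensitivity[layer][cur]
                if gain <= 0:
                    continue                           # no speedup on this target
                score = gain / max(cost, 1e-9)
                if best is None or score > best[0]:
                    best = (score, layer, b)
        if best is None:                               # nothing left to lower profitably
            break
        _, layer, b = best
        config[layer] = b
        if budget is not None and total_latency(config, latency_lut) <= budget:
            break                                      # latency target reached
    return config
```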
Citations: 2
In-memory Activation Compression for GPT Training
Seungyong Lee, Geonu Yun, Hyuk-Jae Lee
Recently, the large number of parameters in Transformer-based language models has caused memory shortages during training. Although solutions such as mixed precision and model parallelism have been proposed, they have the limitations of inducing communication overhead and requiring the programmer to modify the model. To address this issue, we propose a scheme that compresses activation data in memory, reducing memory usage during training in a user-transparent manner. The compression algorithm gathers activation data into a block and compresses it, using base-delta compression for the exponent and bit-plane zero compression for the sign and mantissa. The important bits are then arranged in order, and LSB truncation is applied to fit the target size. The proposed compression algorithm achieves compression ratios of 2.09 for the sign, 2.04 for the exponent, and 1.21 for the mantissa. Applying truncation as well yields a compression ratio of 3.2, and we confirm that GPT-2 training converges under the compression.
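The base-delta step on exponents is concrete enough to sketch. The block size and bit extraction below are assumptions, and the bit-plane zero compression and LSB truncation stages are omitted; the paper's exact formats are not given in the abstract.

```python
import numpy as np

# Hedged sketch of base-delta compression of the 8-bit exponent fields
# of a block of float32 activations. Block size is an assumption.

def base_delta_exponents(block: np.ndarray):
    """Return (base, deltas) for the exponent fields of a float32 block."""
    bits = block.view(np.uint32)
    exponents = ((bits >> 23) & 0xFF).astype(np.int16)  # extract exponent field
    base = int(exponents.min())
    deltas = exponents - base                            # small if the block is coherent
    return base, deltas

block = np.random.randn(64).astype(np.float32) * 0.1     # activations in a narrow range
base, deltas = base_delta_exponents(block)
bits_per_delta = int(deltas.max()).bit_length()          # few bits suffice per delta
```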
Citations: 0
Reducing Overhead of Feature Importance Visualization via Static GradCAM Computation
Ashwin Bhat, A. Raychowdhury
Explainable AI (XAI) methods provide insights into the operation of black-box Deep Neural Network (DNN) models. GradCAM, an XAI algorithm, provides an explanation by highlighting the regions of the input feature space that were relevant to the model’s output. It involves a gradient computation step that adds significant overhead compared to inference and hinders providing explanations to end-users. In this work, we identify the root cause of the problem to be dynamic run-time automatic differentiation. To overcome this issue, we propose to offload the gradient computation step to compile time via analytic evaluation. We validate the idea by designing an FPGA implementation of GradCAM that schedules the entire computation graph statically. For a TinyML ResNet18 model, we reduce the explanation-generation overhead from >2× with software frameworks on CPU/GPU systems to <0.01× on the FPGA using our designed hardware and static scheduling.
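For reference, the GradCAM map itself is a fixed, differentiation-free computation once the gradients of the class score with respect to the chosen layer's feature maps are available; the paper's contribution is producing those gradients statically and analytically instead of via run-time autodiff. A minimal numpy sketch of the map computation, taking precomputed gradients as input:

```python
import numpy as np

# GradCAM map given feature maps and precomputed gradients; in the paper's
# setting the gradients come from a statically scheduled analytic evaluation.

def gradcam(feature_maps: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """feature_maps, grads: (C, H, W) activations of the chosen conv layer
    and gradients of the class score w.r.t. those activations."""
    weights = grads.mean(axis=(1, 2))                  # global-average-pool the gradients
    cam = np.tensordot(weights, feature_maps, axes=1)  # channel-weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                         # ReLU: keep positively relevant regions
    if cam.max() > 0:
        cam /= cam.max()                               # normalize to [0, 1] for display
    return cam
```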
Citations: 0
Grand Challenge on Software and Hardware Co-Optimization for E-Commerce Recommendation System
Jianing Li, Jiabin Liu, Xingyuan Hu, Yuhang Zhang, Guosheng Yu, Shimeng Qian, Wei Mao, Li Du, Yongfu Li, Yuan Du
E-commerce has rapidly expanded to become an indispensable part of the commodity economy, yet customers spend a great deal of time searching for products manually. A good automatic recommendation system not only gives customers a better shopping experience but also helps companies grow their profits. At the IEEE AICAS 2023 conference, we organized the grand challenge on software and hardware co-optimization for an e-commerce recommendation system. Desensitized data from Alibaba Group recording the online purchase behavior of shopping users in China is provided. We organize the challenge in two rounds over two different parts of the data, encouraging participating teams to propose novel ideas for recommendation-algorithm design and deployment. In the preliminary round, participating teams are required to design a recommendation system with high accuracy. In the final round, the teams that qualify from the preliminary round are offered an ARM-based multi-core Yitian 710 CPU cloud server and are required to design an acceleration scheme for their solution on this hardware. The six best teams in the final are awarded according to standard evaluation criteria.
Citations: 0
CPGAN: Collective Punishment Generative Adversarial Network for Dry Fingerprint Image Enhancement
Yu-Chi Su, Ching-Te Chiu, Chih-Han Cheng, Kuan-Hsien Liu, Tsung-Chan Lee, Jia-Lin Chen, Jie-Yu Luo, Wei-Chang Chung, Yao-Ren Chang, Kuan-Ying Ho
Fingerprint recognition is widely used in daily life, for example on mobile phones. However, some conditions lead to low unlocking rates, such as fingers at low temperature (dry fingerprints) or freshly washed fingers. Our method focuses on the former, transforming dry fingerprints so that they resemble normal-temperature fingerprints. The main idea of our method, called "CPGAN", is to improve the GAN so as to boost the quality of the enhanced fingerprints; our objective is to make the generator produce high-quality enhanced fingerprints. The method has two parts: strengthening the discriminator and strengthening the generator. To strengthen the generator, we adopt a "collective punishment" mechanism. To strengthen the discriminator, we utilize two generators and a feature extractor. In our experiments, the results surpass the state of the art on FVC2002 by about 75%.
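Since the abstract does not define the collective-punishment loss, the sketch below shows only a generic paired enhancement-GAN training step (pix2pix-style adversarial plus L1 losses) as scaffolding. `G`, `D`, the loss weights, and the pairing of dry/normal images are illustrative assumptions, not CPGAN's actual formulation.

```python
import torch
import torch.nn.functional as F

# Generic paired enhancement-GAN step for reference only; CPGAN's
# collective-punishment mechanism and dual-generator discriminator
# strengthening are not specified in the abstract.

def train_step(G, D, opt_g, opt_d, dry, normal):
    # Discriminator: real normal-temperature prints vs. enhanced dry prints.
    fake = G(dry).detach()
    d_real, d_fake = D(normal), D(fake)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool D and stay close to the paired normal print.
    fake = G(dry)
    d_out = D(fake)
    loss_g = (F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
              + 100.0 * F.l1_loss(fake, normal))   # pix2pix-style L1 weight (assumption)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```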
Citations: 0
WeightLock: A Mixed-Grained Weight Encryption Approach Using Local Decrypting Units for Ciphertext Computing in DNN Accelerators
Jianfeng Wang, Zhonghao Chen, Yiming Chen, Yixin Xu, Tian Wang, Yao Yu, N. Vijaykrishnan, Sumitha George, Huazhong Yang, Xueqing Li
With the wide use of NVM-based DNN accelerators for higher computing efficiency, their long data retention time creates a high risk of unauthorized weight stealing by attackers. Weight encryption is an effective countermeasure, but existing ciphertext-computing accelerators cannot achieve high encryption complexity and flexibility. This paper proposes WeightLock, a mixed-grained hardware-software co-design approach based on local decrypting units (LDUs). This work proposes a key-controlled, cell-level hardware design for higher granularity and two weight-selection schemes for higher flexibility. Simulation results show that the accuracy of VGG-8 and ResNet-18 on Cifar-10 classification drops from 80% to only 10% even if 80% of the keys are leaked. This represents >20% higher key-leakage tolerance and >17× longer retraining-latency protection compared with prior state-of-the-art hardware and software approaches, respectively. The area cost of the encryption function is negligible: ~600×, 2.2×, and 2.4× lower than the state-of-the-art cell-wise, column-wise, and 1T4R structures, respectively.
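WeightLock is a circuit-level scheme, so the sketch below is only a functional analogy: per-group XOR keys stand in for local decrypting units, to show why the tolerated fraction of leaked keys matters. Group size, key width, and the int8 weight encoding are assumptions.

```python
import numpy as np

# Functional model only: per-group XOR keys as a stand-in for LDUs.
# An attacker who leaked only some keys recovers only those groups.

def encrypt(weights_q: np.ndarray, keys: np.ndarray, group: int = 16) -> np.ndarray:
    """XOR each group of int8-encoded weights with its own key byte.
    Assumes weights_q.size is a multiple of `group`."""
    flat = weights_q.view(np.uint8).copy().reshape(-1, group)
    flat ^= keys[:, None]                      # one key byte per group
    return flat.reshape(weights_q.shape).view(np.int8)

def decrypt_with_leak(cipher: np.ndarray, keys: np.ndarray, leaked: np.ndarray,
                      group: int = 16) -> np.ndarray:
    """Attacker decryption: groups with unleaked keys come out garbled."""
    guess = np.where(leaked, keys,
                     np.random.randint(0, 256, size=keys.shape, dtype=np.uint8))
    flat = cipher.view(np.uint8).copy().reshape(-1, group)
    flat ^= guess[:, None]
    return flat.reshape(cipher.shape).view(np.int8)
```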
Citations: 0
Landmark-Based Adversarial Network for RGB-D Pose Invariant Face Recognition
Wei-Jyun Chen, Ching-Te Chiu, Ting-Chun Lin
Even though numerous studies have been conducted, face recognition still performs poorly under pose variation. Besides the fine appearance details of the face available in RGB images, we use depth images, which capture the 3D contour of the face, to improve recognition performance at large poses. First, we propose a dual-path RGB-D face recognition model that learns features from separate RGB and depth images and fuses the two features into one identity feature; we add an associate loss to strengthen their complementarity and improve performance. Second, we propose a landmark-based adversarial network that helps the face recognition model extract a pose-invariant identity feature. It contains a feature generator, a pose discriminator, and a landmark module. After 2-stage optimization of the pose discriminator and feature generator, we remove the pose factor from the feature extracted by the generator. We conduct experiments on KinectFaceDB, RealSensetest, and LiDARtest. On KinectFaceDB, we achieve a recognition accuracy of 99.41%, 1.31% higher than other methods. On RealSensetest, we achieve a classification accuracy of 92.57%, 30.51% higher than other methods. On LiDARtest, we achieve 98.21%, 21.88% higher than other methods.
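A compact sketch of the described 2-stage optimization: stage 1 trains the pose discriminator on frozen identity features; stage 2 trains the feature generator for identity while pushing the discriminator toward a uniform pose posterior. The handles `G`, `Dp`, `head`, the uniform-posterior confusion loss, and the 0.1 weight are assumptions; the landmark module is omitted.

```python
import torch
import torch.nn.functional as F

# Assumed interfaces: G maps fused RGB-D input to an identity feature,
# Dp classifies discrete pose bins from that feature, head classifies identity.

def stage1_discriminator(G, Dp, opt_dp, x, pose_labels):
    feat = G(x).detach()                       # freeze generator features
    loss = F.cross_entropy(Dp(feat), pose_labels)
    opt_dp.zero_grad(); loss.backward(); opt_dp.step()
    return loss.item()

def stage2_generator(G, Dp, head, opt_g, x, id_labels):
    opt_g.zero_grad()
    feat = G(x)
    id_loss = F.cross_entropy(head(feat), id_labels)
    # Push Dp toward a uniform pose posterior so feat carries no pose cue.
    log_probs = F.log_softmax(Dp(feat), dim=1)
    pose_confusion = -log_probs.mean()         # cross-entropy against a uniform target
    (id_loss + 0.1 * pose_confusion).backward()  # 0.1 weight is an assumption
    opt_g.step()
    return id_loss.item()
```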
Citations: 0
Task-aware Scheduling and Performance Optimization on Yitian710 SoC for GEMM-based Workloads on the Cloud
Guosheng Yu, Zhihong Lv, Haijiang Wang, Zilong Huang, Jicheng Chen
The Yitian710 SoC is a server processor based on the ARM Neoverse N2 architecture, developed by T-HEAD Semiconductor Co., Ltd. to accelerate compute-intensive tasks in Alicloud, where ML-related workloads play an important role in various applications. General matrix multiplication (GEMM) is the fundamental and most important compute kernel in these ML workloads. Typically, a GEMM workload is partitioned into a series of blocks, and the sub-tasks are carefully assembled to exploit the parallel hardware. However, this does not hold for cloud workloads, which process multiple tasks concurrently and must guarantee QoS for commercial reasons. We introduce a task-aware parallel scheduling method to process ML workloads and balance the response delay and throughput of a Yitian710 ECS instance. We further design a multi-thread scheduling algorithm with a two-level division of GEMM sub-tasks to achieve high efficiency, and develop optimized GEMM kernels to attain the best performance. We evaluate the performance of different applications on Yitian710-based Alicloud ECS instances. The results show that our method achieves remarkable performance improvements across applications.
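The two-level division can be pictured as follows: level 1 splits the output into row panels that become parallel tasks, and level 2 blocks the reduction dimension inside each task for cache reuse. The sketch below uses illustrative block sizes and a plain thread pool; the paper's tuned parameters and QoS-aware thread management are not given in the abstract.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hedged sketch of two-level GEMM task division. Block sizes mb/kb and the
# thread count are placeholders for the paper's tuned parameters.

def gemm_parallel(A, B, n_threads=4, mb=64, kb=256):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)

    def row_panel(i0):                          # level-1 task: one row panel of C
        i1 = min(i0 + mb, M)
        for k0 in range(0, K, kb):              # level-2: K-blocking for cache reuse
            k1 = min(k0 + kb, K)
            C[i0:i1] += A[i0:i1, k0:k1] @ B[k0:k1]

    # Tasks write disjoint row panels, so no locking is needed.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(row_panel, range(0, M, mb)))
    return C
```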
Citations: 0
A Systolic Computing-in-Memory Array based Accelerator with Predictive Early Activation for Spatiotemporal Convolutions
Xiaofeng Chen, Ruiqi Guo, Zhiheng Yue, Yang Hu, Leibo Liu, Shaojun Wei, S. Yin
The residual (2+1)-dimensional convolutional neural network (R(2+1)D CNN) has achieved great success in video recognition thanks to its spatiotemporal convolution structure. However, R(2+1)D CNNs incur large energy and latency overheads because of intensive computation and frequent memory access. To solve these issues, we propose a digital SRAM-CIM-based accelerator with two key features: (1) a systolic CIM array that efficiently maps massive computations onto a regular architecture; (2) a digital CIM circuit design with output sparsity prediction to avoid redundant computations. The proposed design is implemented in 28nm technology and achieves an energy efficiency of 21.87 TOPS/W at 200 MHz and a 0.9 V supply voltage.
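Output-sparsity prediction for ReLU layers can be sketched functionally: a cheap partial sum over the high-order weight bits predicts each output's sign, and predicted-negative outputs skip the remaining computation. The bit split and zero threshold below are assumptions; the paper's circuit-level predictor is not detailed in the abstract.

```python
import numpy as np

# Functional sketch of predictive early activation for a ReLU matvec.
# hi_bits/total_bits and margin are illustrative, not the paper's values.

def predicted_relu_matvec(W, x, hi_bits=4, total_bits=8, margin=0):
    W = W.astype(np.int32)                     # avoid overflow in accumulation
    x = x.astype(np.int32)
    shift = total_bits - hi_bits
    W_hi = (W >> shift) << shift               # keep only high-order weight bits
    partial = W_hi @ x                         # cheap MSB-only partial sums
    active = partial > margin                  # outputs predicted to survive ReLU
    y = np.zeros(W.shape[0], dtype=np.int32)
    y[active] = np.maximum(W[active] @ x, 0)   # full precision only where needed
    return y
```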
Citations: 0
An Efficient Design Framework for 2×2 CNN Accelerator Chiplet Cluster with SerDes Interconnects
Yajie Wu, Tianze Li, Zhuang Shao, Li Du, Yuan Du
Multi-chiplet integrated systems for high-performance computing with dedicated CNN accelerators are in high demand due to ever-increasing AI training and inference tasks; however, many design challenges hinder their large-scale application, such as complicated multi-task scheduling, high-speed die-to-die SerDes (Serializer/Deserializer) link modeling, and detailed communication/computation hardware co-simulation. This paper presents an optimized 2×2 CNN accelerator chiplet framework with a SerDes link model that addresses these challenges. We also propose a methodology for designing such a framework and conduct several experiments. The system performance of different designs is compared and analyzed across different computation-hardware parameters, SerDes links, and improved scheduling algorithms. The results show that, with the same interconnect structure and bandwidth, every 1 TFLOPS increase in one chiplet’s computing power brings an average 3.7% reduction in execution time.
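A back-of-envelope version of the compute/communication trade-off such a framework explores: per-layer time is bounded by the slower of computation and SerDes transfer when the two overlap. All parameters below are illustrative, not the paper's calibrated models.

```python
# Toy latency model for a 2x2 chiplet cluster, assuming a layer's work is
# split evenly across chiplets and activation exchange rides the SerDes
# links. Every constant here is an illustrative assumption.

def layer_time(flops, xfer_bytes, n_chiplets=4, tflops_per_chiplet=2.0,
               serdes_GBps=25.0, overlap=True):
    t_comp = flops / (n_chiplets * tflops_per_chiplet * 1e12)   # seconds
    t_comm = xfer_bytes / (serdes_GBps * 1e9)
    return max(t_comp, t_comm) if overlap else t_comp + t_comm

# Example: a 512x512x512 GEMM-like layer exchanging one activation tile.
t = layer_time(flops=2 * 512 * 512 * 512, xfer_bytes=4 * 512 * 512)
```

Sweeping `tflops_per_chiplet` in such a model shows diminishing returns once the transfer term dominates, qualitatively consistent with the modest average gain (~3.7% per extra TFLOPS) reported above.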
Citations: 0