2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS): Latest Papers

Grand Challenge on Software and Hardware Co-Optimization for E-Commerce Recommendation System
Jianing Li, Jiabin Liu, Xingyuan Hu, Yuhang Zhang, Guosheng Yu, Shimeng Qian, Wei Mao, Li Du, Yongfu Li, Yuan Du
E-commerce has expanded rapidly and become an indispensable part of the commodity economy. Searching for products manually costs customers a great deal of time. A good automatic recommendation system not only gives customers a better shopping experience but also helps companies grow their profits. At the IEEE AICAS 2023 conference, we organized a grand challenge on software and hardware co-optimization for e-commerce recommendation systems. Desensitized data from Alibaba Group, recording the online purchase behavior of online shoppers in China, are provided. The challenge has two rounds, each using a different part of the data, which respectively encourage participating teams to propose novel ideas for recommendation algorithm design and for deployment. In the preliminary round, participating teams are required to design a recommendation system with high accuracy. In the final round, the teams that qualify from the preliminary round are given access to a cloud server with an ARM-based multi-core Yitian 710 CPU and are required to design an acceleration scheme for the hardware solution. The 6 best teams in the final are awarded according to standard evaluation criteria.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168648
Published: 2023-06-11
Citations: 0
Reducing Overhead of Feature Importance Visualization via Static GradCAM Computation
Ashwin Bhat, A. Raychowdhury
Explainable AI (XAI) methods provide insights into the operation of black-box Deep Neural Network (DNN) models. GradCAM, an XAI algorithm, provides an explanation by highlighting regions in the input feature space that were relevant to the model’s output. It involves a gradient computation step that adds a significant overhead compared to inference and hinders providing explanations to end-users. In this work, we identify the root cause of the problem to be the dynamic run-time automatic differentiation. To overcome this issue, we propose to offload the gradient computation step to compile time via analytic evaluation. We validate the idea by designing an FPGA implementation of GradCAM that schedules the entire computation graph statically. For a TinyML ResNet18 model, we achieve a reduction in the explanation generation overhead from > 2× using software frameworks on CPU/GPU systems to < 0.01× on the FPGA using our designed hardware and static scheduling.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168594
Published: 2023-06-11
Citations: 0
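The static GradCAM abstract above argues that the gradient step can be evaluated analytically ahead of time instead of by run-time automatic differentiation. The following is a minimal sketch of that idea in the simplest case only, and it is not the paper's FPGA design: it assumes a network whose last convolutional feature maps feed a global average pool and then a linear classifier. Under that assumption the gradient of the class score with respect to every feature-map element is the constant `w_k / (H * W)`, so the GradCAM channel weights are known before any input arrives. The function name `gradcam_static` and the list-of-lists data layout are illustrative choices.

```python
def gradcam_static(feature_maps, class_weights):
    """GradCAM map for a global-average-pool + linear head (assumed architecture).

    feature_maps: list of K feature maps, each an H x W list of lists.
    class_weights: the K linear-head weights of the target class.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    z = h * w
    # For y_c = sum_k w_k * mean(A_k), dy_c/dA_k[i][j] = w_k / z at every (i, j).
    # The GradCAM channel weight alpha_k is the spatial mean of that gradient,
    # i.e. the same constant w_k / z, so no run-time autodiff is required.
    alphas = [wk / z for wk in class_weights]
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only the regions that contribute positively to the class score.
    return [[max(0.0, v) for v in row] for row in cam]


maps = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
heatmap = gradcam_static(maps, class_weights=[4.0, -4.0])
```

For deeper layers or other head structures the gradient depends on the input, so a full static schedule such as the one the abstract describes needs considerably more machinery than this closed form.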
Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge
Georg Rutishauser, Francesco Conti, L. Benini
Mixed-precision quantization, where a deep neural network’s layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization to find mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. We achieve up to a 28.6% reduction in end-to-end latency compared to an 8-bit model, at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to an 8-bit baseline at a negligible accuracy drop, even on systems with no hardware support for sub-byte arithmetic. Furthermore, we show the superiority of our approach over differentiable search that targets reduced binary operation counts as a proxy for latency.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168577
Published: 2023-06-11
Citations: 2
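To make the term "mixed-precision quantization" in the Free Bits abstract above concrete, here is a small sketch of quantizing different layers to different bit widths and comparing the resulting model size. It assumes symmetric uniform quantization with one scale per layer, which is a common scheme but is not stated in the abstract; the helper names `quantize`, `dequantize`, and `model_size_bits` are hypothetical, and the search procedure the paper proposes is not reproduced here.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate real values."""
    return [v * scale for v in q]


def model_size_bits(layer_sizes, layer_bits):
    """Total weight storage when layer i holds layer_sizes[i] weights at layer_bits[i] bits."""
    return sum(n * b for n, b in zip(layer_sizes, layer_bits))


q4, s4 = quantize([7.0, -3.0, 1.0], bits=4)   # fine-grained: all values survive
q2, s2 = quantize([7.0, -3.0, 1.0], bits=2)   # coarse: small values collapse to 0
uniform = model_size_bits([100, 200], [8, 8])
mixed = model_size_bits([100, 200], [8, 4])
```

The example shows the trade-off the abstract refers to: lowering the bit width of the larger layer shrinks storage, while the 2-bit case illustrates how aggressive quantization discards small weights.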
TPE: A High-Performance Edge-Device Inference with Multi-level Transformational Mechanism
Zhou Wang, Jingchuang Wei, Xiaonan Tang, Boxiao Han, Hongjun He, Leibo Liu, Shaojun Wei, S. Yin
DNN inference on edge devices has long been important and places heavy demands on computation and energy. This paper proposes a TPE (Transformation Process Element) with three characteristics. First, TPE uses a Data Segmentation Skip and Pre-Reorganization (DSSPR) method. Second, TPE has a Typical Value Matching and Calibration Computer (TVMCC) system, which converts direct calculation into matching and calibration calculation. Third, TPE includes a Data Format Pre-Configuration and Self-Adjustment (DFPCSA) scheme. Compared with UNPU, the most typical pure inference processor, TPE achieves 1.25× better energy consumption.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168614
Published: 2023-06-11
Citations: 0
WeightLock: A Mixed-Grained Weight Encryption Approach Using Local Decrypting Units for Ciphertext Computing in DNN Accelerators
Jianfeng Wang, Zhonghao Chen, Yiming Chen, Yixin Xu, Tian Wang, Yao Yu, N. Vijaykrishnan, Sumitha George, Huazhong Yang, Xueqing Li
With the wide use of NVM-based DNN accelerators for higher computing efficiency, the long data retention time essentially causes a high risk of unauthorized weight stealing by attackers. Weight encryption is an effective countermeasure, but existing ciphertext computing accelerators cannot achieve high encryption complexity and flexibility. This paper proposes WeightLock, a mixed-grained hardware-software co-design approach based on local decrypting units (LDUs). This work proposes a key-controlled cell-level hardware design for higher granularity and two weight selection schemes for higher flexibility. The simulation results show that the accuracy of VGG-8 and ResNet-18 in CIFAR-10 classification drops from 80% to only 10% even if 80% of keys are leaked. This shows >20% higher key leakage tolerance and >17x longer retraining latency protection, compared with the prior state-of-the-art hardware and software approaches, respectively. The area cost of the encryption function is negligible, with ~600x, 2.2x, and 2.4x reduction from the state-of-the-art cell-wise, column-wise, and 1T4R structures, respectively.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168612
Published: 2023-06-11
Citations: 0
Online low-power large-scale real-time decision-making all at once
Thomas Pontoizeau, Éric Jacopin
In this paper, we set up a simulation under Unreal Engine 5 that communicates with an Optical Processing Unit (OPU) in order to make real-time decisions on the current state of the actors of the simulation. Our experiment shows that the OPU is able to manage at least 50 000 actors in real time and is able to make decisions depending on the current state of the actors.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168570
Published: 2023-06-11
Citations: 0
Task-aware Scheduling and Performance Optimization on Yitian710 SoC for GEMM-based Workloads on the Cloud
Guosheng Yu, Zhihong Lv, Haijiang Wang, Zilong Huang, Jicheng Chen
The YiTian710 SoC is a server processor based on the ARM Neoverse N2 architecture and developed by T-HEAD Semiconductor Co., Ltd. to accelerate compute-intensive tasks in Alicloud, where ML-related workloads play an important role in various applications. General Matrix Multiplication (GEMM) is the fundamental and most important compute kernel used extensively in ML workloads. Generally, the whole GEMM workload is partitioned into a series of blocks, and the sub-tasks are carefully assembled to exploit the parallel hardware. However, this does not hold for cloud workloads, which process multiple tasks concurrently and expect guaranteed QoS for commercial reasons. We introduce a task-aware parallel scheduling method to process the ML workloads and balance the response delay and the throughput of the YiTian710 ECS instance. We further design a multi-thread scheduling algorithm with a two-level division of the GEMM sub-tasks to achieve high efficiency. Optimized GEMM kernels are developed to attain the best performance. We evaluate the performance on YiTian710-based Alicloud ECS for different applications. The results show that our method achieves remarkable performance improvements across these applications.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168586
Published: 2023-06-11
Citations: 0
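The Yitian710 abstract above describes partitioning a GEMM workload into blocks and scheduling the sub-tasks over threads with a two-level division. The abstract does not say how the two levels are defined, so the sketch below is only one plausible reading: level 1 assigns each worker a contiguous band of output rows, and level 2 walks that band tile by tile. The names `gemm_block` and `blocked_gemm` and the parameters `n_workers` and `tile` are hypothetical, and pure-Python threads show the partitioning structure rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def gemm_block(a, b, c, rows, cols, k_dim):
    """Compute C[i][j] = sum_k A[i][k] * B[k][j] for one tile of C."""
    for i in rows:
        for j in cols:
            acc = 0.0
            for k in range(k_dim):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc


def blocked_gemm(a, b, n_workers=2, tile=2):
    """Multiply A (m x k) by B (k x n) using a two-level partition of C."""
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Level 1: split the rows of C into one contiguous band per worker.
    band = max(1, (m + n_workers - 1) // n_workers)
    bands = [range(s, min(s + band, m)) for s in range(0, m, band)]

    def run_band(rows):
        # Level 2: cover the band with tile x tile blocks; tiles never overlap,
        # so workers write to disjoint elements of C.
        for r0 in range(rows.start, rows.stop, tile):
            for c0 in range(0, n, tile):
                gemm_block(a, b, c,
                           range(r0, min(r0 + tile, rows.stop)),
                           range(c0, min(c0 + tile, n)), k_dim)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(run_band, bands))  # list() forces completion and surfaces errors
    return c


product = blocked_gemm([[1, 2], [3, 4], [5, 6]], [[1, 0, 2], [0, 1, 3]])
```

In a task-aware setting such as the one the abstract describes, the band and tile sizes would presumably be chosen per task to balance latency against throughput; here they are fixed constants.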
A 12-Lead ECG Delineation Algorithm based on a Quantized CNN-BiLSTM Auto-encoder with 1-12 Mapping
Xinzi Xu, Qiao Cai, Hongqian Wang, Yanxing Suo, Yang Zhao, T. Wan, Guoxing Wang, Yong Lian
12-lead electrocardiogram (ECG) delineation is a critical step in diagnosing various heart diseases. Current practices for 12-lead ECG delineation typically involve processing each of the 12 leads separately using a network, which is computationally expensive. To solve this issue, a 1-12 mapping strategy is proposed that directly maps the network predictions for one lead to the other leads and then fine-tunes the boundaries. A CNN-BiLSTM autoencoder architecture is employed to model the sequential dependencies of the ECG signal. In addition, data augmentation and mixed losses are used to enhance the robustness of the network. Evaluated on QTDB and LUDB, the delineation results for 12-lead ECG achieve a Se of 97%, 99%, and 98% and a DS of 95.3%, 96.2%, and 94.4% for the P-wave, QRS complex, and T-wave, respectively. Finally, quantization-aware training is employed to convert the float32 model to an int8 one with only about a 2% drop in accuracy.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168552
Published: 2023-06-11
Citations: 0
Image Frequency Separation Residual Network for End-to-end RAW to RGB Mapping
Mengchuan Dong, Weiti Zhou, Cong Pang, Xiangyu Zhang, Xin Lou
Due to the limitations of the hardware specifications of smartphone camera systems, there is still a visible gap in imaging quality between smartphones and digital single-lens reflex (DSLR) cameras. Sophisticated learning-based image processing is a promising way to close this gap. In this paper, we propose an Image Frequency Separation Residual Network (IFS Net) to perform end-to-end RAW to RGB image mapping. Unlike existing methods that train directly on the input image and the ground truth image one-to-one as a whole, our proposed method first divides the input image and the ground truth into high-frequency and low-frequency parts by discrete wavelet transform (DWT). These two parts are then trained separately using different networks for details and global information, and finally synthesized into the output image using inverse DWT. Experimental results show that the proposed IFS Net outperforms other existing algorithms in both PSNR and SSIM. Visual comparison shows that the images produced by IFS Net preserve more details and look close to those captured by DSLR cameras.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168597
Published: 2023-06-11
Citations: 0
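The IFS Net abstract above splits a signal into low-frequency and high-frequency parts with a DWT, processes the parts separately, and recombines them with the inverse DWT. The sketch below illustrates that pipeline on a 1-D signal. Several things are assumptions made for brevity: the abstract does not name the wavelet, so an unnormalized Haar transform is used; real images would need a 2-D transform; and `process_low` and `process_high` are hypothetical identity stand-ins for the two trained networks.

```python
def haar_dwt(signal):
    """One level of an unnormalized Haar transform on an even-length list."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuild each sample pair from its average and difference."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out


def process_low(low):
    """Placeholder for the network handling global (low-frequency) information."""
    return low


def process_high(high):
    """Placeholder for the network handling detail (high-frequency) information."""
    return high


row = [4, 2, 6, 6]
low, high = haar_dwt(row)                     # pairwise averages and differences
restored = haar_idwt(process_low(low), process_high(high))
```

Because the transform is invertible, any change in the output comes only from what the two processing branches do to their respective bands, which is what makes separate treatment of detail and global content possible.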
Landmark-Based Adversarial Network for RGB-D Pose Invariant Face Recognition
Wei-Jyun Chen, Ching-Te Chiu, Ting-Chun Lin
Even though numerous studies have been conducted, face recognition still performs poorly under pose variation. In addition to the fine appearance details of the face from RGB images, we use depth images that present the 3D contour of the face to improve recognition performance in large poses. First, we propose a dual-path RGB-D face recognition model which learns features from separate RGB and depth images and fuses the two features into one identity feature. We add an associate loss to strengthen the complementarity of the two paths and improve performance. Second, we propose a landmark-based adversarial network to help the face recognition model extract the pose-invariant identity feature. Our landmark-based adversarial network contains a feature generator, a pose discriminator, and a landmark module. After a 2-stage optimization of the pose discriminator and feature generator, the pose factor is removed from the feature extracted by the generator. We conduct experiments on KinectFaceDB, RealSensetest and LiDARtest. On KinectFaceDB, we achieve a recognition accuracy of 99.41%, which is 1.31% higher than other methods. On RealSensetest, we achieve a classification accuracy of 92.57%, which is 30.51% higher than other methods. On LiDARtest, we achieve 98.21%, which is 21.88% higher than other methods.
DOI: https://doi.org/10.1109/AICAS57966.2023.10168669 | Published: 2023-06-11 | Source: Semantic Scholar (paper ID 122530409)
Citations: 0