{"title":"用于图像分类的压缩域视觉变换器","authors":"Ruolei Ji;Lina J. Karam","doi":"10.1109/JETCAS.2024.3394878","DOIUrl":null,"url":null,"abstract":"Compressed-domain visual task schemes, where visual processing or computer vision are directly performed on the compressed-domain representations, were shown to achieve a higher computational efficiency during training and deployment by avoiding the need to decode the compressed visual information while resulting in a competitive or even better performance as compared to corresponding spatial-domain visual tasks. This work is concerned with learning-based compressed-domain image classification, where the image classification is performed directly on compressed-domain representations, also known as latent representations, that are obtained using a learning-based visual encoder. In this paper, a compressed-domain Vision Transformer (cViT) is proposed to perform image classification in the learning-based compressed-domain. For this purpose, the Vision Transformer (ViT) architecture is adopted and modified to perform classification directly in the compressed-domain. As part of this work, a novel feature patch embedding is introduced leveraging the within- and cross-channel information in the compressed-domain. Also, an adaptation training strategy is designed to adopt the weights from the pre-trained spatial-domain ViT and adapt these to the compressed-domain classification task. Furthermore, the pre-trained ViT weights are utilized through interpolation for position embedding initialization to further improve the performance of cViT. The experimental results show that the proposed cViT outperforms the existing compressed-domain classification networks in terms of Top-1 and Top-5 classification accuracies. Moreover, the proposed cViT can yield competitive classification accuracies with a significantly higher computational efficiency as compared to pixel-domain approaches.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":null,"pages":null},"PeriodicalIF":3.7000,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Compressed-Domain Vision Transformer for Image Classification\",\"authors\":\"Ruolei Ji;Lina J. Karam\",\"doi\":\"10.1109/JETCAS.2024.3394878\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Compressed-domain visual task schemes, where visual processing or computer vision are directly performed on the compressed-domain representations, were shown to achieve a higher computational efficiency during training and deployment by avoiding the need to decode the compressed visual information while resulting in a competitive or even better performance as compared to corresponding spatial-domain visual tasks. This work is concerned with learning-based compressed-domain image classification, where the image classification is performed directly on compressed-domain representations, also known as latent representations, that are obtained using a learning-based visual encoder. In this paper, a compressed-domain Vision Transformer (cViT) is proposed to perform image classification in the learning-based compressed-domain. For this purpose, the Vision Transformer (ViT) architecture is adopted and modified to perform classification directly in the compressed-domain. 
As part of this work, a novel feature patch embedding is introduced leveraging the within- and cross-channel information in the compressed-domain. Also, an adaptation training strategy is designed to adopt the weights from the pre-trained spatial-domain ViT and adapt these to the compressed-domain classification task. Furthermore, the pre-trained ViT weights are utilized through interpolation for position embedding initialization to further improve the performance of cViT. The experimental results show that the proposed cViT outperforms the existing compressed-domain classification networks in terms of Top-1 and Top-5 classification accuracies. Moreover, the proposed cViT can yield competitive classification accuracies with a significantly higher computational efficiency as compared to pixel-domain approaches.\",\"PeriodicalId\":48827,\"journal\":{\"name\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2024-04-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10510316/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10510316/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citation count: 0
Abstract
Compressed-domain visual task schemes, in which visual processing or computer vision tasks are performed directly on compressed-domain representations, have been shown to achieve higher computational efficiency during training and deployment by avoiding the need to decode the compressed visual information, while achieving competitive or even better performance than the corresponding spatial-domain visual tasks. This work is concerned with learning-based compressed-domain image classification, where classification is performed directly on compressed-domain representations, also known as latent representations, obtained using a learning-based visual encoder. In this paper, a compressed-domain Vision Transformer (cViT) is proposed to perform image classification in the learning-based compressed domain. For this purpose, the Vision Transformer (ViT) architecture is adopted and modified to perform classification directly in the compressed domain. As part of this work, a novel feature patch embedding is introduced that leverages the within- and cross-channel information in the compressed domain. In addition, an adaptation training strategy is designed to take the weights of a pre-trained spatial-domain ViT and adapt them to the compressed-domain classification task. Furthermore, the pre-trained ViT position embeddings are reused through interpolation to initialize the cViT position embeddings, further improving performance. Experimental results show that the proposed cViT outperforms existing compressed-domain classification networks in terms of Top-1 and Top-5 classification accuracy. Moreover, the proposed cViT yields competitive classification accuracy with significantly higher computational efficiency compared to pixel-domain approaches.
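To illustrate the position-embedding initialization step mentioned in the abstract, the Python (PyTorch) sketch below shows one common way to resize a pre-trained ViT's learned position embeddings to a different token-grid size via 2-D interpolation. This is a minimal illustration of the general technique, not the authors' implementation; the grid sizes, embedding dimension, and function name are assumptions chosen for the example.

# Minimal sketch (assumed setup, not the paper's code): resize a pre-trained ViT's
# learned position embeddings to a new token grid by 2-D bicubic interpolation, so
# they can initialize a model whose inputs (e.g., compressed-domain latents)
# produce a different number of tokens.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + N, D) with a leading class-token embedding, N a perfect square."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]   # keep the class token unchanged
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, old_grid, old_grid) so F.interpolate sees a spatial grid
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # back to (1, new_grid * new_grid, D) and re-attach the class-token embedding
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# Example: a ViT-B/16 trained on 224x224 images has a 14x14 patch grid (embedding dim 768);
# here it is resized to a hypothetical 16x16 grid of compressed-domain feature tokens.
pretrained_pos = torch.randn(1, 1 + 14 * 14, 768)
adapted_pos = interpolate_pos_embed(pretrained_pos, new_grid=16)
print(adapted_pos.shape)  # torch.Size([1, 257, 768])

In the paper this interpolation-based reuse of pre-trained position embeddings is combined with an adaptation training strategy; the actual target grid size would depend on the spatial resolution of the learned encoder's latent representation.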
Journal introduction:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.