
Journal of Visual Communication and Image Representation — Latest Articles

Multi-modal deep facial expression recognition framework combining knowledge distillation and retrieval-augmented generation
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-11-15 | DOI: 10.1016/j.jvcir.2025.104645
Beibei Jiang, Yu Zhou
In recent years, significant progress has been made in facial expression recognition (FER) methods based on deep learning. However, existing models still face challenges in computational efficiency and generalization when dealing with diverse emotional expressions and complex environmental variations. Recently, large-scale vision-language pre-training models such as CLIP have achieved remarkable success in multi-modal learning, and their rich visual and textual representations offer valuable insights for downstream tasks. Consequently, transferring this knowledge to build efficient and accurate FER systems has emerged as a key research direction. To this end, this paper proposes a novel model, termed Knowledge Distillation and Retrieval-Augmented Generation (KDRAG), which combines knowledge distillation and retrieval-augmented generation (RAG) to improve the efficiency and accuracy of FER. Through knowledge distillation, the teacher model (ViT-L/14) transfers its rich knowledge to the smaller student model (ViT-B/32); an additional linear projection layer maps the teacher model’s output features to the student model’s feature dimensions for feature alignment. Moreover, a RAG mechanism enhances the student model’s emotional understanding by retrieving text descriptions related to the input image. The framework combines a soft loss (from the teacher model’s knowledge) and a hard loss (from the ground-truth labels) to improve generalization. Extensive experiments on multiple datasets demonstrate that KDRAG achieves significant improvements in accuracy and computational efficiency, providing new insights for real-time FER systems.
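The combination of a projection-aligned soft loss and a label-based hard loss can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the feature dimensions, the cosine-based alignment term, the loss weighting, and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledFERHead(nn.Module):
    """Student-side head trained with a soft (teacher-alignment) loss and a hard (label) loss."""

    def __init__(self, student_dim=512, teacher_dim=768, num_classes=7):
        super().__init__()
        # Linear projection mapping teacher features into the student's feature dimension.
        self.teacher_proj = nn.Linear(teacher_dim, student_dim)
        self.classifier = nn.Linear(student_dim, num_classes)

    def forward(self, student_feat, teacher_feat, labels, alpha=0.5):
        projected_teacher = self.teacher_proj(teacher_feat)
        # Soft loss: align student features with the projected teacher features.
        soft_loss = 1.0 - F.cosine_similarity(student_feat, projected_teacher, dim=-1).mean()
        # Hard loss: cross-entropy against the ground-truth expression labels.
        hard_loss = F.cross_entropy(self.classifier(student_feat), labels)
        return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: random tensors stand in for ViT-B/32 (student) and ViT-L/14 (teacher) embeddings.
head = DistilledFERHead()
loss = head(torch.randn(4, 512), torch.randn(4, 768), torch.randint(0, 7, (4,)))
loss.backward()
```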
Citations: 0
Visual saliency fixation via deeply tri-layered multi blended trans-encoder framework
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-12-13 | DOI: 10.1016/j.jvcir.2025.104676
S. Caroline, Y. Jacob Vetha Raj
The ability to predict where viewers look when observing a scene, also known as saliency or fixation prediction, has attracted considerable interest in computer vision. Incorporating saliency prediction into traditional CNN-based models is challenging. To address this, we developed the Deeply Tri-Layered Multi-Blended Trans-Encoder Framework (DTMBTE) to improve human eye-fixation prediction in image saliency tasks. Unlike existing CNN-based methods that struggle with contextual encoding, our model integrates local feature extraction with global attention mechanisms to forecast saliency regions more accurately. We construct a new trans-encoder, the Multi Blended Trans-Encoder (MBTE), by combining three different convolution types with multi-head-attention encoders, which effectively localizes human eye-fixation (saliency) regions. This combined design efficiently extracts both spatial and contextual information for saliency estimation. Experiments on MIT1003 and CAT2000 show that DTMBTE achieves higher NSS and SIM scores and lower EMD than competing methods.
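A rough idea of blending several convolution types with multi-head attention can be sketched as below. This is an illustrative guess at the kind of block the abstract describes, not the published MBTE design; the channel count, the three chosen convolution variants (standard, dilated, depthwise), and the head count are assumptions.

```python
import torch
import torch.nn as nn

class BlendedTransEncoderBlock(nn.Module):
    """Illustrative block: three convolution variants blended, then multi-head self-attention."""

    def __init__(self, channels=64, heads=4):
        super().__init__()
        # Three convolution types capturing local structure at different receptive fields.
        self.conv_standard = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.conv_depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                               # x: (B, C, H, W)
        blended = self.conv_standard(x) + self.conv_dilated(x) + self.conv_depthwise(x)
        b, c, h, w = blended.shape
        tokens = blended.flatten(2).transpose(1, 2)      # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)  # global context via self-attention
        tokens = self.norm(tokens + attended)            # residual connection + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)

block = BlendedTransEncoderBlock()
saliency_features = block(torch.randn(1, 64, 32, 32))   # later decoded into a fixation map
```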
Citations: 0
FFTDiff: Tuning-free image texture transfer based on diffusion model
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-12-22 | DOI: 10.1016/j.jvcir.2025.104681
Shilin Li, Hao Wang, Anna Zhu
Image texture transfer is pivotal in computer vision, holding extensive application potential. Existing methods typically transfer color alongside texture, lacking inherent color preservation and thus requiring a cumbersome two-stage process: color alignment followed by style transfer. The recent emergence of diffusion models has significantly advanced this field; however, current diffusion-based approaches usually necessitate additional training. To address this, we propose FFTDiff, a novel texture transfer model leveraging pre-trained diffusion models and the Fast Fourier Transform (FFT), eliminating extra training requirements. FFTDiff disentangles texture from content and color within the frequency domain, independently extracting texture from reference images while preserving original colors and semantics. This extracted texture is then seamlessly integrated into the content image within the diffusion model’s latent space during denoising. Comprehensive experimental results demonstrate FFTDiff’s effectiveness, highlighting its capability for realistic, aesthetically pleasing texture transfer without compromising the original semantic content or color integrity.
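Frequency-domain disentanglement of texture from content and color can be illustrated with a standalone FFT sketch: keep the content image's phase and blend in the reference image's amplitude spectrum. FFTDiff performs this kind of operation within the diffusion model's latent space during denoising; the sketch below instead works directly on image tensors, and the blending weight is an assumption.

```python
import torch

def transfer_texture_fft(content, reference, alpha=0.7):
    """Illustrative frequency-domain mix: preserve the content image's phase (structure,
    semantics) and blend in the reference image's amplitude spectrum (texture statistics)."""
    # content, reference: (B, C, H, W) tensors of the same spatial size, values in [0, 1].
    fc = torch.fft.fft2(content, dim=(-2, -1))
    fr = torch.fft.fft2(reference, dim=(-2, -1))
    amp_c, phase_c = torch.abs(fc), torch.angle(fc)
    amp_r = torch.abs(fr)
    # Blend amplitudes; the content phase is left untouched.
    amp_mix = (1 - alpha) * amp_c + alpha * amp_r
    mixed = amp_mix * torch.exp(1j * phase_c)
    out = torch.fft.ifft2(mixed, dim=(-2, -1)).real
    return out.clamp(0, 1)

stylized = transfer_texture_fft(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```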
Citations: 0
SAM-FireAdapter: An adapter for fire segmentation with SAM
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104678
Yanan Wu, Chaoqun Hong, Yongfeng Chen, Haixi Cheng
With the rise of large foundation models, significant advancements have been made in the field of artificial intelligence. The Segment Anything Model (SAM) was specifically designed for image segmentation. However, experiments have demonstrated that SAM may encounter performance limitations in handling specific tasks, such as fire segmentation. To address this challenge, our study explores solutions to effectively adapt the pre-trained SAM model for fire segmentation. The adapter-enhanced approach is introduced to SAM, incorporating effective adapter modules into the segmentation network. The resulting approach, SAM-FireAdapter, incorporates fire-specific features into SAM, significantly enhancing its performance on fire segmentation. Additionally, we propose Fire-Adaptive Attention (FAA), a lightweight attention mechanism module to enhance feature representation. This module reweights the input features before decoding, emphasizing critical spatial features and suppressing less relevant ones. Experimental results demonstrate that SAM-FireAdapter surpasses existing fire segmentation networks including the base SAM.
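The two ingredients named in the abstract, bottleneck adapters inserted into a frozen backbone and a lightweight feature-reweighting attention, can be sketched as follows. The layer sizes and the sigmoid-gate form of the reweighting are illustrative assumptions, not the paper's exact modules.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: only these small layers are trained; the SAM backbone stays frozen."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual keeps the pre-trained behaviour

class FireAdaptiveAttention(nn.Module):
    """Lightweight spatial reweighting applied before decoding."""

    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens):                        # tokens: (B, N, dim)
        weights = torch.sigmoid(self.score(tokens))   # per-token importance in [0, 1]
        return tokens * weights                       # emphasize critical spatial features

adapter, faa = Adapter(), FireAdaptiveAttention()
tokens = torch.randn(2, 196, 768)                    # stand-in for frozen SAM encoder tokens
refined = faa(adapter(tokens))
```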
Citations: 0
3D human mesh recovery: Comparative review, models, and prospects
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2026-01-03 | DOI: 10.1016/j.jvcir.2025.104699
Wonjun Kim
As the demand for immersive services increases in various fields, the ability to express objects or scenes in 3D has become essential. In particular, 3D human modeling has gained considerable attention due to its many possibilities for daily life as well as industrial applications. The first step of 3D human modeling is to restore a mesh, commonly defined as a set of connected vertices in 3D space, from images and videos; this is so-called human mesh recovery (HMR). HMR was long studied with complicated optimization techniques; however, owing to the great success of deep learning in recent years, it has been reformulated as a simple regression problem, and numerous studies are now being actively conducted. This paper aims to provide a comprehensive review with a special focus on deep learning-based methods for HMR. Specifically, it presents a systematic taxonomy along with the questions at the heart of each research period, covers diverse methodologies, reports extensive qualitative and quantitative performance evaluations on benchmark datasets, and offers constructive discussion toward realizing HMR-based commercial services. This review is intended to serve as a concise handbook for HMR rather than a vast collection of existing studies.
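The "simple regression problem" formulation can be made concrete with a minimal sketch: a head that regresses 3D vertex coordinates from a global image feature. The feature dimension and vertex count below are illustrative; in practice most methods regress low-dimensional body-model parameters (e.g., pose and shape) rather than raw vertices.

```python
import torch
import torch.nn as nn

class MeshRegressionHead(nn.Module):
    """HMR as plain regression: map a global image feature to the 3D coordinates of a
    fixed-topology template mesh (feature size and vertex count are illustrative)."""

    def __init__(self, feat_dim=2048, num_vertices=6890):
        super().__init__()
        self.num_vertices = num_vertices
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_vertices * 3),
        )

    def forward(self, image_feat):                        # image_feat: (B, feat_dim) from a backbone
        return self.regressor(image_feat).view(-1, self.num_vertices, 3)

vertices = MeshRegressionHead()(torch.randn(2, 2048))     # (2, 6890, 3) vertex coordinates
```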
Citations: 0
Turbo principles meet compression: Rethinking nonlinear transformations in learned image compression
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-11-15 | DOI: 10.1016/j.jvcir.2025.104643
Chao Li, Wen Tan, Fanyang Meng, Runwei Ding, Ye Wang, Wei Liu, Yongsheng Liang
Learned image compression (LIC) has emerged as a powerful approach for achieving high rate–distortion performance. Most existing LIC techniques attempt to address performance limitations associated with downsampling and quantization-induced information loss by employing intricate nonlinear transformations and increasing the feature dimensions in entropy models. In this paper, we introduce a novel perspective by modeling the quantizer as a generalized channel with uniform noise, shifting LIC design toward minimizing the channel’s negative impact on compact feature representations. Drawing inspiration from turbo codes, we propose a turbo-like nonlinear transformation (TLNT). On the encoder side, TLNT-E disperses information loss through parallel component coding units, random interleaving, and puncturing, preserving the integrity of encoded features. At the decoder side, TLNT-D iteratively refines feature representations through interactive processing, enabling accurate reconstruction. Experimental results show that our method outperforms several state-of-the-art nonlinear transformation techniques while maintaining efficiency in parameter count and computational complexity.
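The "quantizer as a channel with uniform noise" view corresponds to the standard training-time relaxation used in learned image compression, sketched below; the latent shape is illustrative, and the turbo-style interleaving and puncturing of TLNT are not shown.

```python
import torch

def quantize(latent, training):
    """Treat the quantizer as a channel that adds uniform noise during training,
    and apply hard rounding when actually encoding."""
    if training:
        noise = torch.empty_like(latent).uniform_(-0.5, 0.5)
        return latent + noise                         # differentiable proxy for integer rounding
    return torch.round(latent)

# Toy usage: y stands in for the analysis-transform (encoder) output.
y = torch.randn(1, 192, 16, 16, requires_grad=True)
y_hat = quantize(y, training=True)
y_hat.sum().backward()                                # gradients flow through the noisy "channel"
```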
Citations: 0
Blind compressed image diffusion restoration based on content prior and dense residual connection driven transformer
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-12-10 | DOI: 10.1016/j.jvcir.2025.104674
Shuang Yue, Zhe Chen, Fuliang Yin
JPEG blind compressed image restoration (CIR) aims to restore high-quality images from compressed low-quality images, a long-standing low-level vision problem in image processing. Existing blind CIR methods often overlook basic content details, degrading the restoration quality of blind compressed images. To address this issue, this paper proposes a blind compressed image diffusion restoration model (BCDR) based on a content prior and a dense-residual-connection-driven transformer. Specifically, we first utilize an image content restoration prior (ICP), learned from low-quality and high-quality images, to refine detail features. Then, a diffusion model estimator is used to reconstruct image texture and enhance visual coherence. Finally, dense residual connections are applied to capture global information and generate more realistic image details. The proposed model greatly improves the quality of blind compressed images and performs well in restoring image content details. Experimental results demonstrate that the proposed method performs excellently on both the benchmark datasets and the blind CIR task in real-world scenarios.
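A dense residual connection of the kind the title refers to can be sketched as a block in which every layer sees the concatenation of all earlier feature maps and a residual path preserves the input. The channel count, growth rate, and depth below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DenseResidualBlock(nn.Module):
    """Each conv layer receives the concatenation of all previous feature maps (dense
    connections), and a residual connection preserves the block input."""

    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels + i * growth, growth, 3, padding=1) for i in range(layers)]
        )
        self.fuse = nn.Conv2d(channels + layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return x + self.fuse(torch.cat(feats, dim=1))   # residual over the dense fusion

out = DenseResidualBlock()(torch.randn(1, 64, 48, 48))
```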
Citations: 0
Theft model-based black-box adversarial attack in embedding space
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-12-31 | DOI: 10.1016/j.jvcir.2025.104702
Rui Zhang, Shuliang Jiang, Zi Kang, Shuo Xu, Yuanlong Lv, Hui Xia
Existing transfer-based adversarial attacks suffer from poor transferability due to limitations of the proxy dataset or inaccurate imitation of the target model by the substitute model. We therefore propose a theft model-based black-box adversarial attack in embedding space. The substitute model acts as the discriminator of a generative adversarial network, and we introduce a diversity loss to train the generator without relying on a proxy dataset, enabling it to imitate the target model more closely. Furthermore, we design a combined adversarial attack that integrates a gradient-based attack and the natural evolution strategy to construct adversarial examples via search in the embedding space. This ensures that the adversarial examples are effective against both the target and the substitute models. Experimental results demonstrate that our method has good imitation ability and transferability. When using VGG16, our method outperforms TREMBA by 14.71% in untargeted attack success rate and shows a 13.49% improvement in targeted attacks.
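The natural evolution strategy component can be illustrated with a minimal pixel-space sketch that estimates a gradient from black-box queries only; the paper performs the search in the generator's embedding space, and the toy scoring function, sample count, and step size here are assumptions.

```python
import torch

def nes_gradient(query_fn, x, sigma=0.01, samples=20):
    """Natural-evolution-strategies estimate of the loss gradient using only
    black-box queries `query_fn(x) -> scalar loss`, no backpropagation through the target."""
    grad = torch.zeros_like(x)
    for _ in range(samples):
        u = torch.randn_like(x)
        # Antithetic sampling: evaluate the loss at x + sigma*u and x - sigma*u.
        grad += (query_fn(x + sigma * u) - query_fn(x - sigma * u)) * u
    return grad / (2 * sigma * samples)

# Toy target: treat this scoring function as the unknown black-box model.
target = lambda x: (x ** 2).sum()
x = torch.rand(3, 32, 32)
g = nes_gradient(target, x)
x_adv = (x + 0.01 * g.sign()).clamp(0, 1)   # one signed-gradient ascent step
```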
Citations: 0
Detecting human-object interactions with image category-guided and query denoising
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-12-01 | DOI: 10.1016/j.jvcir.2025.104660
Jing Han, Hongyu Li, Xiaoying Wang, Xueqiang Lyu, Zangtai Cai, Yuzhong Chen
Existing Detection Transformer (DETR)-based algorithms for Human-Object Interaction (HOI) detection learn instance-level human-object pair features to infer interaction behaviors, ignoring the influence of image macrostructure on those behaviors. Meanwhile, the instability of the Hungarian matching process affects model convergence. Spurred by these concerns, this paper presents a novel HOI detection method featuring image category guidance and enhanced by query denoising. The proposed method constructs an image-level category query, which enhances the instance-level query with image-level contextual features to infer the interactions between humans and objects. Additionally, we introduce a query denoising training mechanism: controlled noise is added to ground-truth queries, and the model is trained to reconstruct the original targets. This approach stabilizes matching and accelerates convergence. Furthermore, a branching shortcut is added to the triplet Hungarian matching process to stabilize training. Experiments on the HICO-DET and V-COCO datasets demonstrate superior performance: our method achieves 37.71% on HICO-DET and 67.1% on V-COCO while reducing training rounds from 500 to 25. The 95% reduction in training time significantly lowers computational cost and energy consumption, enhancing the feasibility of practical deployment and accelerating experimental cycles. The code is available at https://github.com/lihy000/CADN-HOTR.
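The query-denoising mechanism (adding controlled noise to ground-truth queries and training the decoder to reconstruct the clean targets) can be sketched for box queries as follows; the box encoding and noise scale are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def make_denoising_queries(gt_boxes, box_noise=0.1):
    """Build noised copies of ground-truth boxes (cx, cy, w, h in [0, 1]); the decoder is
    then trained to reconstruct the clean boxes from these queries, bypassing Hungarian matching."""
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * box_noise    # uniform in [-box_noise, box_noise]
    noised = gt_boxes.clone()
    noised[:, :2] += noise[:, :2] * gt_boxes[:, 2:]            # shift centers relative to box size
    noised[:, 2:] *= 1 + noise[:, 2:]                          # jitter widths and heights
    return noised.clamp(1e-4, 1.0)

gt = torch.tensor([[0.5, 0.5, 0.2, 0.3], [0.3, 0.6, 0.1, 0.2]])
dn_queries = make_denoising_queries(gt)                        # fed to the decoder as extra queries
```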
Citations: 0
SAST: Semantic-Aware stylized Text-to-Image generation
IF 3.1 | CAS Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | Epub Date: 2025-12-15 | DOI: 10.1016/j.jvcir.2025.104685
Xinyue Sun, Jing Guo, Yongzhen Ke, Shuai Yang, Kai Wang, Yemeng Wu
Pre-trained text-to-image diffusion probabilistic models have achieved excellent quality, providing users with good visual effects and attracting many users to control the generated results with creative text. For detailed generation requirements, using reference images to "stylize" text-to-image generation is common, because such requirements cannot be fully expressed in limited language. However, there is a style deviation between the images generated by existing methods and the style reference images, contrary to the human perception that semantically similar object regions in two images of the same style should share that style. To solve this problem, this paper proposes a semantic-aware style transfer method (SAST) that strengthens semantic-level style alignment between the generated image and the style reference image. First, we introduce language-driven semantic segmentation trained on the COCO dataset into a general style transfer model to capture the mask of the regions in the style reference image that the text focuses on. Similarly, we use the same text to extract masks from the cross-attention layers of the text-to-image model. Based on the two obtained mask maps, we modify the self-attention layers in the diffusion model to control the injection of style features. Experiments show that our method achieves better style fidelity and style alignment metrics, indicating that the generated images are more consistent with human perception. Code is available at https://gitee.com/yongzhenke/SAST. Keywords: text-to-image, image style transfer, diffusion model, semantic alignment.
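The mask-guided modification of self-attention can be sketched as restricting style injection to semantically matching token regions; the function below is an illustrative stand-in, not the SAST implementation, and the token counts, feature size, and fallback rule are assumptions.

```python
import torch

def masked_style_injection(attn_scores, style_value, content_value, gen_mask, style_mask):
    """Restrict attention-driven style injection to semantically matching regions:
    generated-image tokens inside gen_mask attend only to reference tokens inside
    style_mask; tokens outside the region keep their original content values."""
    # attn_scores: (N_gen, N_ref) pre-softmax scores, *_value: (N, C), masks: bool (N,)
    allowed = gen_mask[:, None] & style_mask[None, :]
    probs = attn_scores.masked_fill(~allowed, float("-inf")).softmax(dim=-1)
    injected = probs @ style_value
    return torch.where(gen_mask[:, None], injected, content_value)

# Toy usage with explicit masks so every masked query row has at least one allowed key.
gen_mask = torch.zeros(16, dtype=torch.bool)
gen_mask[:8] = True
style_mask = torch.zeros(16, dtype=torch.bool)
style_mask[4:12] = True
out = masked_style_injection(torch.rand(16, 16), torch.randn(16, 8),
                             torch.randn(16, 8), gen_mask, style_mask)
```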
Citations: 0