Compared to its predecessor HEVC, VVC adopts the Quad-Tree plus Multi-type Tree (QTMT) structure for partitioning Coding Units (CUs) and integrates a wider range of inter-frame prediction modes into its inter-frame coding framework. These techniques enable VVC to achieve a substantial bitrate reduction of approximately 40% compared to HEVC; however, the efficiency gain comes with a more than tenfold increase in encoding time. To accelerate inter-frame prediction mode selection, this paper proposes a Fast Prediction Mode Selection Network (FPMSN)-based method that targets encoding acceleration during the non-partitioning mode testing phase. First, the execution results of the affine mode are collected as input features for the network. Next, FPMSN extracts critical information from the multi-dimensional data and outputs a probability for each mode. Finally, multiple trade-off strategies terminate low-probability mode candidates early.
Experimental results show that, under the Random Access (RA) configuration, the proposed method achieves a reduction in encoding time ranging from 3.22% to 19.3%, with a corresponding BDBR increase of only 0.12% to 1.363%, surpassing the performance of state-of-the-art methods.
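As a rough illustration of the early-termination idea described above, the sketch below (not the authors' code) keeps only the prediction modes whose network-predicted probability clears a threshold; the mode names, threshold value, and fallback rule are hypothetical placeholders.

```python
# Illustrative sketch: skip low-probability inter prediction modes based on
# per-mode probabilities from a classifier such as FPMSN.
from typing import Dict, List

def select_modes_to_test(mode_probs: Dict[str, float],
                         skip_threshold: float = 0.05,
                         min_modes: int = 2) -> List[str]:
    """Keep modes whose predicted probability exceeds the threshold,
    but always retain at least `min_modes` of the most likely ones."""
    ranked = sorted(mode_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = [m for m, p in ranked if p >= skip_threshold]
    if len(kept) < min_modes:                    # safety fallback
        kept = [m for m, _ in ranked[:min_modes]]
    return kept

# Hypothetical probabilities for a CU's candidate modes
probs = {"merge": 0.52, "affine": 0.31, "amvp": 0.12, "geo": 0.03, "mmvd": 0.02}
print(select_modes_to_test(probs))               # ['merge', 'affine', 'amvp']
```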
{"title":"Accelerating inter-frame prediction in Versatile Video Coding via deep learning-based mode selection","authors":"Xudong Zhang , Jing Chen , Huanqiang Zeng , Wenjie Xiang , Yuting Zuo","doi":"10.1016/j.jvcir.2025.104653","DOIUrl":"10.1016/j.jvcir.2025.104653","url":null,"abstract":"<div><div>Compared to its predecessor HEVC, VVC utilizes the Quad-Tree plus Multitype Tree (QTMT) structure for partitioning Coding Units (CU) and integrates a wider range of inter-frame prediction modes within its inter-frame coding framework. The incorporation of these innovative techniques enables VVC to achieve a substantial bitrate reduction of approximately 40% compared to HEVC. However, this efficiency boost is accompanied by a more than tenfold increase in encoding time. To accelerate the inter-frame prediction mode selection process, a FPMSN (Fast Prediction Mode Selection Network)-based method focusing on encoding acceleration during the non-partitioning mode testing phase is proposed in this paper. First, the execution results of the affine mode are collected as neural network input features. Next, FPMSN is proposed to extract critical information from multi-dimensional data and output the probabilities for each mode. Finally, multiple trade-off strategies are implemented to early terminate low-probability mode candidates.</div><div>Experimental results show that, under the Random Access (RA) configuration, the proposed method achieves a reduction in encoding time ranging from 3.22% to 19.3%, with a corresponding BDBR increase of only 0.12% to 1.363%, surpassing the performance of state-of-the-art methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104653"},"PeriodicalIF":3.1,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145694284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image restoration is a fundamental task in computer vision that recovers clean images from degraded inputs. However, preserving fine details while maintaining global structural consistency is challenging. Traditional convolutional neural network (CNN)-based methods capture local features but fail to model long-range dependencies and often overlook small objects within similar backgrounds. Transformers, conversely, model global context effectively but lack local detail precision. To overcome these limitations, this paper proposes a Degradation-aware All-in-one Image Restoration Network (DAIRNet) that integrates both CNNs and Transformers. Beginning with a multiscale feature extraction block, the network captures diverse features across different resolutions, enhancing its ability to handle complex image structures. The features from the CNN encoder are subsequently passed to a Transformer decoder. Notably, an interleaved Transformer is applied to the features extracted by the CNN encoder, fostering cross-interaction between features and helping to propagate similar texture signals across the entire feature space, making them more distinguishable. These enhanced features are then concatenated into the Transformer decoder blocks, with degradation-aware information serving as prompts, enriching the restoration process. On average, across various restoration tasks, DAIRNet surpasses the state-of-the-art PromptIR and AirNet methods by 0.76 dB and 1.62 dB, respectively. Specifically, compared to PromptIR, it achieves gains of 1.74 dB in image deraining, 0.26 dB in high-noise-level denoising, and 0.84 dB in image dehazing. Single-task benchmarks further confirm the model’s effectiveness and generalizability.
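To make the prompt idea concrete, here is a minimal PyTorch sketch of concatenating a learnable degradation-aware prompt with decoder features and fusing back to the original channel width; the module name, prompt size, and 1x1-convolution fusion are assumptions, not the DAIRNet implementation.

```python
import torch
import torch.nn as nn

class DegradationPromptFusion(nn.Module):
    """Concatenate a learnable degradation prompt with decoder features,
    then fuse back to the original channel width with a 1x1 convolution."""
    def __init__(self, feat_channels: int, prompt_channels: int = 16):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(1, prompt_channels, 1, 1))
        self.fuse = nn.Conv2d(feat_channels + prompt_channels, feat_channels, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feats.shape
        prompt = self.prompt.expand(b, -1, h, w)      # broadcast over space
        return self.fuse(torch.cat([feats, prompt], dim=1))

x = torch.randn(2, 64, 32, 32)
print(DegradationPromptFusion(64)(x).shape)           # torch.Size([2, 64, 32, 32])
```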
{"title":"DAIRNet: Degradation-aware All-in-one Image Restoration Network with cross-channel feature interaction","authors":"Amit Monga , Hemkant Nehete , Tharun Kumar Reddy Bollu , Balasubramanian Raman","doi":"10.1016/j.jvcir.2025.104659","DOIUrl":"10.1016/j.jvcir.2025.104659","url":null,"abstract":"<div><div>Image restoration is a fundamental task in computer vision that recovers clean images from degraded inputs. However, preserving fine-details and maintaining global structural consistency are challenging tasks. Traditional convolutional neural network (CNN)-based methods capture local features but fail to model long-range dependencies and often overlook small objects within similar backgrounds. Transformers, conversely, model global context effectively but lack local detail precision. To overcome these limitations, this paper proposes a Degradation-aware All-in-one Image Restoration Network that integrates both CNNs and Transformers. Beginning with a multiscale feature extraction block, the network captures diverse features across different resolutions, enhancing its ability to handle complex image structures. The features from CNN encoder are subsequently passed through a Transformer decoder. Notably, an interleaved Transformer is applied to the features extracted by the CNN encoder, fostering cross-interaction between features and helping to propagate similar texture signals across the entire feature space, making them more distinguishable. These improved features are then concatenated with the transformer decoder blocks with degradation-aware information as prompts, enriching the restoration process. On average, across various restoration tasks, DAIRNet surpasses the state-of-the-art PromptIR and AirNet methods by 0.76 dB and 1.62 dB, respectively. Specifically, it achieves gains of 1.74 dB in image deraining, 0.26 dB in high-noise level denoising, and 0.84 dB in image dehazing tasks as compared to PromptIR. Single-task benchmarks further confirm the model’s effectiveness and generalizability.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104659"},"PeriodicalIF":3.1,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145694283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-27. DOI: 10.1016/j.jvcir.2025.104657
DaoPeng Zhang, Li Yu
3D point cloud generation plays a pivotal role in a wide range of applications, including robotics, medical imaging, autonomous driving, and virtual/augmented reality (VR/AR). However, generating high-quality point clouds remains highly challenging due to the irregularity and unordered nature of point cloud data. Existing Transformer-based generative models suffer from quadratic computational complexity, which limits their ability to capture global contextual dependencies and often leads to the loss of critical geometric information. To address these limitations, we propose a novel diffusion-based framework for point cloud generation that integrates the Mamba state-space model — known for its linear complexity and strong long-sequence modeling capability — with convolutional layers. Specifically, Mamba is employed to capture global structural dependencies across time steps, while the convolutional layers refine local geometric details. To effectively leverage the strengths of both components, we introduce a learnable masking mechanism that dynamically fuses global and local features at optimal time steps, thereby exploiting their complementary advantages. Extensive experiments demonstrate that our model outperforms previous point cloud generative approaches such as TIGER and PVD in terms of both quality and diversity. On the airplane category, our model achieves a 9.28% improvement in 1-NNA accuracy based on EMD compared to PVD, and a 1.72% improvement based on CD compared to TIGER. Compared with recent baseline models, our method consistently achieves significant gains across multiple evaluation metrics.
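A minimal PyTorch sketch of the learnable masking idea follows: a time-conditioned gate blends a global branch with a local convolutional branch. The global branch here is a plain linear layer standing in for a Mamba block, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedGlobalLocalFusion(nn.Module):
    """Blend global and local point features with a learnable, time-conditioned
    mask: out = m * global + (1 - m) * local. Both branches are simple stand-ins,
    not a real Mamba block or the paper's convolutional stack."""
    def __init__(self, dim: int, time_dim: int = 64):
        super().__init__()
        self.global_branch = nn.Linear(dim, dim)            # placeholder for Mamba
        self.local_branch = nn.Conv1d(dim, dim, 3, padding=1)
        self.mask_mlp = nn.Sequential(nn.Linear(time_dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) unordered points, t_emb: (B, time_dim) diffusion-step embedding
        g = self.global_branch(x)
        l = self.local_branch(x.transpose(1, 2)).transpose(1, 2)
        m = self.mask_mlp(t_emb).unsqueeze(1)                # (B, 1, C), values in [0, 1]
        return m * g + (1.0 - m) * l

pts = torch.randn(4, 2048, 128)
t = torch.randn(4, 64)
print(GatedGlobalLocalFusion(128)(pts, t).shape)             # torch.Size([4, 2048, 128])
```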
{"title":"Enhancing 3D point cloud generation via Mamba-based time-varying denoising diffusion","authors":"DaoPeng Zhang, Li Yu","doi":"10.1016/j.jvcir.2025.104657","DOIUrl":"10.1016/j.jvcir.2025.104657","url":null,"abstract":"<div><div>3D point cloud generation plays a pivotal role in a wide range of applications, including robotics, medical imaging, autonomous driving, and virtual/augmented reality (VR/AR). However, generating high-quality point clouds remains highly challenging due to the irregularity and unordered nature of point cloud data. Existing Transformer-based generative models suffer from quadratic computational complexity, which limits their ability to capture global contextual dependencies and often leads to the loss of critical geometric information. To address these limitations, we propose a novel diffusion-based framework for point cloud generation that integrates the Mamba state-space model — known for its linear complexity and strong long-sequence modeling capability — with convolutional layers. Specifically, Mamba is employed to capture global structural dependencies across time steps, while the convolutional layers refine local geometric details. To effectively leverage the strengths of both components, we introduce a learnable masking mechanism that dynamically fuses global and local features at optimal time steps, thereby exploiting their complementary advantages. Extensive experiments demonstrate that our model outperforms previous point cloud generative approaches such as TIGER and PVD in terms of both quality and diversity. On the airplane category, our model achieves a 9.28% improvement in 1-NNA accuracy based on EMD compared to PVD, and a 1.72% improvement based on CD compared to TIGER. Compared with recent baseline models, our method consistently achieves significant gains across multiple evaluation metrics.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104657"},"PeriodicalIF":3.1,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145610298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-25. DOI: 10.1016/j.jvcir.2025.104654
Ahmed Alasri , Zhixiang Chen , Yalong Xiao , Chengzhang Zhu , Abdulrahman Noman , Raeed Alsabri , Harrison Xiao Bai
Retinal diseases are a significant global health concern, requiring advanced diagnostic tools for early detection and treatment. Automated diagnosis of retinal diseases using deep learning can significantly enhance early detection and intervention efforts. However, conventional deep learning models that concentrate on localized perspectives often develop feature representations that lack sufficient semantic discriminative capability. Conversely, models that prioritize global semantic-level information may fail to capture essential, subtle local pathological features. To address this issue, we propose BIAN, a novel Bidirectional Interwoven Attention Network designed for the classification of retinal Optical Coherence Tomography (OCT) images. BIAN synergistically combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) by integrating a ResNet backbone with a ViT backbone through a bidirectional interwoven attention block. This design enables the model to effectively capture both local features and global contextual information. Specifically, the bidirectional interwoven attention block allows the ResNet and ViT components to attend to each other’s feature representations, enhancing the network’s overall learning capacity. We evaluated BIAN on the OCTID and OCTDL datasets for retinal disease classification. The OCTID dataset includes conditions such as Age-related Macular Degeneration (AMD), Macular Hole (MH), and Central Serous Retinopathy (CSR), while OCTDL covers AMD, Diabetic Macular Edema (DME), Epiretinal Membrane (ERM), and Retinal Vein Occlusion (RVO), among others. On OCTID, the proposed model achieved 95.7% accuracy for five-class classification, outperforming existing state-of-the-art models. On OCTDL, BIAN attained 94.7% accuracy, with consistently high F1-scores (95.6% on OCTID, 94.6% on OCTDL) and AUC values (99.3% and 99.0%, respectively). These results highlight the potential of BIAN as a robust network for retinal OCT image classification in medical applications.
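The interweaving can be pictured as two cross-attention passes, one in each direction, as in the hedged PyTorch sketch below; the token shapes, head count, and residual updates are assumptions rather than BIAN's exact design.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Let CNN tokens attend to ViT tokens and vice versa (a sketch of the
    'interwoven' idea; projection sizes and head count are placeholders)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cnn_to_vit = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vit_to_cnn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cnn_tokens, vit_tokens):
        # cnn_tokens, vit_tokens: (B, N, dim) after flattening/projection
        cnn_out, _ = self.cnn_to_vit(cnn_tokens, vit_tokens, vit_tokens)
        vit_out, _ = self.vit_to_cnn(vit_tokens, cnn_tokens, cnn_tokens)
        return cnn_tokens + cnn_out, vit_tokens + vit_out    # residual updates

c = torch.randn(2, 49, 256)     # e.g. a 7x7 CNN feature map, flattened
v = torch.randn(2, 50, 256)     # e.g. ViT patch tokens plus CLS
a, b = BidirectionalCrossAttention()(c, v)
print(a.shape, b.shape)
```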
{"title":"BIAN: Bidirectional interwoven attention network for retinal OCT image classification","authors":"Ahmed Alasri , Zhixiang Chen , Yalong Xiao , Chengzhang Zhu , Abdulrahman Noman , Raeed Alsabri , Harrison Xiao Bai","doi":"10.1016/j.jvcir.2025.104654","DOIUrl":"10.1016/j.jvcir.2025.104654","url":null,"abstract":"<div><div>Retinal diseases are a significant global health concern, requiring advanced diagnostic tools for early detection and treatment. Automated diagnosis of retinal diseases using deep learning can significantly enhance early detection and intervention efforts. However, conventional deep learning models that concentrate on localized perspectives often develop feature representations that lack sufficient semantic discriminative capability. Conversely, models that prioritize global semantic-level information may fail to capture essential, subtle local pathological features. To address this issue, we propose BIAN, a novel Bidirectional Interwoven Attention Network designed for the classification of retinal Optical Coherence Tomography (OCT) images. BIAN synergistically combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) by integrating a ResNet architecture backbone with a ViT backbone through a bidirectional interwoven attention block. This network enables the model to effectively capture both local features and global contextual information. Specifically, the bidirectional interwoven attention block allow the ResNet and ViT components to attend to each other’s feature representations, enhancing the network’s overall learning capacity. We evaluated BIAN on both the OCTID and OCTDL datasets for retinal disease classification. The OCTID dataset includes conditions such as Age-related Macular Degeneration (AMD), Macular Hole (MH), Central Serous Retinopathy (CSR), etc., while OCTDL covers AMD, Diabetic Macular Edema (DME), Epiretinal Membrane (ERM), Retinal Vein Occlusion (RVO), etc. On OCTID, the proposed model achieved 95.7% accuracy for five-class classification, outperforming existing state-of-the-art models. On OCTDL, BIAN attained 94.7% accuracy, with consistently high F1-scores (95.6% on OCTID, 94.6% on OCTDL) and AUC values (99.3% and 99.0%, respectively). These results highlight the potential of BIAN as a robust network for retinal OCT image classification in medical applications.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104654"},"PeriodicalIF":3.1,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145748550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-25. DOI: 10.1016/j.jvcir.2025.104656
Xingfa Wang , Chengjun Chen , Chenggang Dai , Kunhua Liu , Mingxing Lin
Due to inherent absorption and scattering effects, underwater images often exhibit low visibility and significant color deviation. These issues hinder the extraction of discriminative features and adversely impact instance-level segmentation accuracy. To address these challenges, this study proposes HySaM, a hybrid SAM and Mask R-CNN framework for underwater instance segmentation that integrates the strong generalization capability of SAM with the structural decoding strength of Mask R-CNN. The powerful global modeling ability of SAM effectively mitigates the impact of underwater image degradation, thereby enabling more robust feature representation. Moreover, a novel underwater feature weighted enhancer is introduced to strengthen multi-scale feature fusion and improve the detection of small and scale-varying objects in underwater environments. To provide benchmark data, a large-scale underwater instance segmentation dataset, UW10K, is also constructed, comprising 13,551 images and 22,968 annotated instances across 15 categories. Comprehensive experiments validate the superiority of the proposed model across various instance segmentation tasks. Specifically, it achieves precisions of 74.2%, 40.5%, and 70.6% on the UW10K, USIS10K, and WHU Building datasets, respectively. This study is expected to advance ocean exploration and fisheries, while providing valuable training samples for instance segmentation tasks. Datasets and code are available at https://github.com/xfwang-qut/HySaM.
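One plausible reading of a "feature weighted enhancer" is a learnable, softmax-weighted fusion of multi-scale features; the PyTorch sketch below illustrates that reading and is not the HySaM module itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFeatureEnhancer(nn.Module):
    """Fuse multi-scale features with learnable per-scale weights, then refine
    with a 3x3 convolution (an assumed design, not the paper's exact module)."""
    def __init__(self, num_scales: int, channels: int):
        super().__init__()
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):                    # list of (B, C, Hi, Wi) tensors
        target = feats[0].shape[-2:]             # fuse at the finest resolution
        w = torch.softmax(self.scale_logits, dim=0)
        fused = sum(w[i] * F.interpolate(f, size=target, mode="bilinear",
                                         align_corners=False)
                    for i, f in enumerate(feats))
        return self.refine(fused)

feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32),
         torch.randn(1, 256, 16, 16)]
print(WeightedFeatureEnhancer(3, 256)(feats).shape)   # torch.Size([1, 256, 64, 64])
```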
{"title":"HySaM: An improved hybrid SAM and Mask R-CNN for underwater instance segmentation","authors":"Xingfa Wang , Chengjun Chen , Chenggang Dai , Kunhua Liu , Mingxing Lin","doi":"10.1016/j.jvcir.2025.104656","DOIUrl":"10.1016/j.jvcir.2025.104656","url":null,"abstract":"<div><div>Due to inherent absorption and scattering effects, underwater images often exhibit low visibility and significant color deviation. These issues hinder the extraction of discriminative features and adversely impact instance-level segmentation accuracy. To address these challenges, this study proposes a novel Hybrid SAM and Mask R-CNN framework for underwater instance segmentation, integrating the strong generalization capability of SAM with the structural decoding strength of Mask R-CNN. The powerful global modeling ability of SAM effectively mitigates the impact of underwater image degradation, thereby enabling more robust feature representation. Moreover, a novel underwater feature weighted enhancer is introduced in the framework to enhance multi-scale feature fusion and improve the detection of small and scale-varying objects in underwater environments. To provide benchmark data, a large-scale underwater instance segmentation dataset, UW10K, is also constructed, comprising 13,551 images and 22,968 annotated instances across 15 categories. Comprehensive experiments validate the superiority of the proposed model across various instance segmentation tasks. Specifically, it achieves precisions of 74.2 %, 40.5 %, and 70.6 % on UW10K, USIS10K, and WHU Building datasets, respectively. This study is expected to advance ocean exploration and fisheries, while providing valuable training samples for instance segmentation tasks. Datasets and codes are available at <span><span>https://github.com/xfwang-qut/HySaM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104656"},"PeriodicalIF":3.1,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145694273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-25. DOI: 10.1016/j.jvcir.2025.104655
Yue Que , Chen Qiu , Hanqing Xiong , Xue Xia , Zhiwei Liu
Convolutional neural networks (CNNs) have demonstrated impressive performance in image deraining tasks. However, CNNs have a limited receptive field, which restricts their ability to adapt to spatial variations in the input. Recently, Transformers have shown promising results in image deraining, surpassing CNNs in several cases. However, attribution analysis reveals that most existing methods exploit only limited spatial information from the input. In this paper, we investigate the construction of multi-scale feature representations within Transformers to fully exploit their potential for image deraining. We propose a multi-scale interleaved Transformer framework that reconstructs high-quality images by leveraging information across different scales, enabling it to better capture the size and distribution of rain. In addition, we introduce a hybrid cross-attention mechanism to replace traditional feature fusion, facilitating global feature interaction while capturing complementary information across scales. Our approach surpasses state-of-the-art methods in image deraining performance on two types of benchmark datasets.
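As a hedged sketch of cross-attention replacing conventional fusion, the PyTorch snippet below lets coarse-scale tokens query fine-scale tokens and upsamples the result; the tokenization, head count, and normalization are assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttentionFusion(nn.Module):
    """Fuse two scales by letting coarse tokens attend to fine tokens,
    then upsampling back to the fine resolution."""
    def __init__(self, dim: int = 96, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine: (B, C, H, W), coarse: (B, C, H/2, W/2)
        b, c, h, w = fine.shape
        q = coarse.flatten(2).transpose(1, 2)          # (B, Nc, C)
        kv = fine.flatten(2).transpose(1, 2)           # (B, Nf, C)
        out, _ = self.attn(self.norm(q), kv, kv)
        out = (q + out).transpose(1, 2).reshape(b, c, h // 2, w // 2)
        return F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)

fine = torch.randn(1, 96, 64, 64)
coarse = torch.randn(1, 96, 32, 32)
print(CrossScaleAttentionFusion()(fine, coarse).shape)  # torch.Size([1, 96, 64, 64])
```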
{"title":"Multi-scale interleaved transformer network for image deraining","authors":"Yue Que , Chen Qiu , Hanqing Xiong , Xue Xia , Zhiwei Liu","doi":"10.1016/j.jvcir.2025.104655","DOIUrl":"10.1016/j.jvcir.2025.104655","url":null,"abstract":"<div><div>Convolutional neural networks (CNNs) have demonstrated impressive performance in image deraining tasks. However, CNNs have a limited receptive field, which restricts their ability to adapt to spatial variations in the input. Recently, Transformers have demonstrated promising results in image deraining, surpassing CNN in several cases. However, most existing methods leverage limited spatial input information through attribution analysis. In this paper, we investigated the construction of multi-scale feature representations within Transformers to fully exploit their potential in image deraining. We propose a multi-scale interleaved Transformer framework, which aims to reconstruct high-quality images by leveraging information across different scales, thereby enabling it to better capture the size and distribution of rain. In addition, we introduce a hybrid cross-attention mechanism to replace traditional feature fusion, facilitating global feature interaction and capturing complementary information across scales simultaneously. Our approach surpasses state-of-the-art methods in terms of image deraining performance on two types of benchmark datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104655"},"PeriodicalIF":3.1,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145694272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-20. DOI: 10.1016/j.jvcir.2025.104602
Ziyu Hu , Xiaoguang Jiang , Qiong Liu , Xin Ding
Recent methods based on Neural Radiance Fields (NeRFs) have excelled in real-time novel view synthesis for small-scale scenes but struggle with fast rendering for large-scale scenes. Achieving a balance in performance between small-scale and large-scale scenes has emerged as a challenging problem. To address this, we propose PatchNeRF, a patch-based NeRF representation for wide-scale scenes. PatchNeRF uses small 2D patches to fit surfaces, learning a 2D neural radiance field for local geometry and texture. To make the most of patch sampling and to skip empty space, we propose strategies for initializing and progressively updating the patch structure, and perform end-to-end training with both large and tiny MLPs. After training, we prebake the implicit 2D neural radiance fields as feature maps to accelerate the rendering process. Experiments demonstrate that our approach outperforms state-of-the-art methods in both small-scale and large-scale scenes, while achieving superior rendering speeds.
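The prebaking step can be illustrated as evaluating a small MLP once on a regular grid of 2D patch coordinates and replacing it with bilinear lookups at render time; the sketch below shows only that caching idea under assumed resolutions and feature sizes, not PatchNeRF itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchField(nn.Module):
    """Tiny MLP mapping 2D patch coordinates in [-1, 1]^2 to a feature vector
    (a stand-in for one patch's implicit 2D radiance field)."""
    def __init__(self, feat_dim: int = 8, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, uv: torch.Tensor) -> torch.Tensor:
        return self.mlp(uv)

def prebake(field: PatchField, res: int = 64) -> torch.Tensor:
    """Evaluate the MLP on a regular grid once and cache it as a feature map."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, res),
                            torch.linspace(-1, 1, res), indexing="ij")
    uv = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    with torch.no_grad():
        feats = field(uv).reshape(res, res, -1).permute(2, 0, 1)  # (C, H, W)
    return feats.unsqueeze(0)                                     # (1, C, H, W)

baked = prebake(PatchField())
# At render time, a bilinear lookup replaces the MLP evaluation:
query_uv = torch.rand(1, 1, 100, 2) * 2 - 1       # 100 random patch coordinates
sampled = F.grid_sample(baked, query_uv, align_corners=True)
print(sampled.shape)                               # torch.Size([1, 8, 1, 100])
```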
{"title":"PatchNeRF: Patch-based Neural Radiance Fields for real time view synthesis in wide-scale scenes","authors":"Ziyu Hu , Xiaoguang Jiang , Qiong Liu , Xin Ding","doi":"10.1016/j.jvcir.2025.104602","DOIUrl":"10.1016/j.jvcir.2025.104602","url":null,"abstract":"<div><div>Recent methods based on Neural Radiance Fields (NeRFs) have excelled in real-time novel view synthesis for small-scale scenes but struggle with fast rendering for large-scale scenes. Achieving a balance in performance between small-scale and large-scale scenes has emerged as a challenging problem. To address this, we propose PatchNeRF, a patch-based NeRF representation for wide-scale scenes. PatchNeRF uses small 2D patches to fit surfaces, learning a 2D neural radiance field for local geometry and texture. To make the most of sampling patches and skip empty space, we propose strategies for initializing and progressively updating the patch structure, along with performing end-to-end training using both large and tiny MLPs. After training, we prebake the implicit 2D neural radiance fields as feature maps to accelerate the rendering process. Experiments demonstrate that our approach outperforms state-of-the-art methods in both small-scale and large-scale scenes, while achieving superior rendering speeds.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104602"},"PeriodicalIF":3.1,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145610299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-15. DOI: 10.1016/j.jvcir.2025.104641
Junyong You , Yuan Lin , Bin Hu
Generative models such as stable diffusion excel at producing compelling images but remain highly dependent on carefully crafted prompts. Refining prompts for specific objectives, especially aesthetic quality, is time-consuming and inconsistent. We propose a novel approach that leverages large language models (LLMs) to enhance the prompt refinement process for stable diffusion. First, we propose a model to predict aesthetic image quality, examining various aesthetic elements in the spatial, channel, and color domains. Reinforcement learning is then employed to refine the prompt, starting from a rudimentary version and iteratively improving it with the LLM's assistance. This iterative process is guided by a policy network that updates prompts based on interactions with the generated images, with a reward function measuring aesthetic improvement and adherence to the prompt. Our experimental results demonstrate that this method significantly boosts the visual quality of images generated from the refined prompts. Beyond image synthesis, the approach provides a broader framework for improving prompts across diverse applications with the support of LLMs.
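A skeleton of the refine-generate-score loop might look like the following; llm_refine, generate_image, and aesthetic_score are hypothetical stubs standing in for an LLM call, a stable-diffusion call, and the aesthetic quality model, and the policy-gradient update is omitted.

```python
import random

def llm_refine(prompt: str, feedback: float) -> str:
    # Placeholder: a real system would ask an LLM to rewrite the prompt,
    # conditioned on the reward feedback.
    return prompt + ", highly detailed, balanced composition"

def generate_image(prompt: str):
    return None                      # placeholder for a diffusion-model call

def aesthetic_score(image) -> float:
    return random.random()           # placeholder for the aesthetic predictor

def refine_prompt(initial_prompt: str, steps: int = 5) -> str:
    """Iteratively generate, score, and rewrite the prompt, keeping the best."""
    best_prompt, best_score = initial_prompt, float("-inf")
    prompt = initial_prompt
    for _ in range(steps):
        image = generate_image(prompt)
        reward = aesthetic_score(image)          # plus an adherence term in the paper
        if reward > best_score:
            best_prompt, best_score = prompt, reward
        prompt = llm_refine(prompt, reward)      # policy update omitted here
    return best_prompt

print(refine_prompt("a mountain lake at dawn"))
```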
{"title":"Enhancing aesthetic image generation with reinforcement learning guided prompt optimization in stable diffusion","authors":"Junyong You , Yuan Lin , Bin Hu","doi":"10.1016/j.jvcir.2025.104641","DOIUrl":"10.1016/j.jvcir.2025.104641","url":null,"abstract":"<div><div>Generative models, e.g., stable diffusion, excel at producing compelling images but remain highly dependent on crafted prompts. Refining prompts for specific objectives, especially aesthetic quality, is time-consuming and inconsistent. We propose a novel approach that leverages LLMs to enhance prompt refinement process for stable diffusion. First, we propose a model to predict aesthetic image quality, examining various aesthetic elements in spatial, channel, and color domains. Reinforcement learning is employed to refine the prompt, starting from a rudimentary version and iteratively improving them with LLM’s assistance. This iterative process is guided by a policy network updating prompts based on interactions with the generated images, with a reward function measuring aesthetic improvement and adherence to the prompt. Our experimental results demonstrate that this method significantly boosts the visual quality of generated images when using these refined prompts. Beyond image synthesis, this approach provides a broader framework for improving prompts across diverse applications with the support of LLMs.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104641"},"PeriodicalIF":3.1,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-15. DOI: 10.1016/j.jvcir.2025.104644
Shuo Hu , Tongtong Liu , Liyang Han , Run Xing
Most existing visual tracking methods employ image patches as target references and seek to enhance tracking performance by maximizing the use of visual information through various deep networks. However, due to the intrinsic limitations of visual information, the performance of such trackers deteriorates significantly when confronted with drastic target variations or complex background environments. To address these issues, we propose a vision-language multimodal fusion tracker for object tracking. First, we use semantic information from language descriptions to compensate for the instability of visual information, and establish multimodal cross-relations by fusing visual and language features. Second, we propose an attention-based token screening mechanism that uses semantic-guided attention and masking operations to eliminate irrelevant search tokens devoid of target information, thereby improving both accuracy and efficiency. Furthermore, we optimize the localization head by introducing channel attention, which effectively improves the accuracy of target positioning. Extensive experiments on the OTB99, LaSOT, and TNL2K datasets demonstrate the effectiveness of the proposed tracking method, which achieves success rates of 71.2%, 69.5%, and 58.9%, respectively.
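A simplified stand-in for the semantic-guided token screening is to rank search tokens by their similarity to a language embedding and keep only the top fraction, as sketched below; the keep ratio and scoring rule are assumptions.

```python
import torch

def screen_tokens(search_tokens: torch.Tensor,
                  text_embed: torch.Tensor,
                  keep_ratio: float = 0.7):
    """Drop search tokens with low similarity to the language embedding
    (a simplified stand-in for the screening mechanism described above)."""
    # search_tokens: (B, N, C), text_embed: (B, C)
    scores = torch.einsum("bnc,bc->bn", search_tokens, text_embed)   # (B, N)
    k = max(1, int(search_tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                               # (B, k)
    kept = torch.gather(search_tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, search_tokens.shape[-1]))
    return kept, idx

tokens = torch.randn(2, 256, 768)
text = torch.randn(2, 768)
kept, idx = screen_tokens(tokens, text)
print(kept.shape)                                   # torch.Size([2, 179, 768])
```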
{"title":"Vision-language tracking with attention-based optimization","authors":"Shuo Hu , Tongtong Liu , Liyang Han , Run Xing","doi":"10.1016/j.jvcir.2025.104644","DOIUrl":"10.1016/j.jvcir.2025.104644","url":null,"abstract":"<div><div>Most existing visual tracking methods typically employ image patches as target references and endeavor to enhance tracking performance by maximizing the utilization of visual information through various deep networks. However, due to the intrinsic limitations of visual information, the performance of the trackers significantly deteriorates when confronted with drastic target variations or complex background environments. To address these issues, we propose a vision-language multimodal fusion tracker for object tracking. Firstly, we use semantic information from language descriptions to compensate for the instability of visual information, and establish multimodal cross-relations through the fusion of visual and language features. Secondly, we propose an attention-based token screening mechanism that utilizes semantic-guided attention and masking operations to eliminate irrelevant search tokens devoid of target information, thereby enhancing both accuracy and efficiency. Furthermore, we optimize the localization head by introducing channel attention, which effectively improves the accuracy of target positioning. Extensive experiments conducted on the OTB99, LaSOT, and TNL2K datasets demonstrate the effectiveness of our proposed tracking method, achieving success rates of 71.2%, 69.5%, and 58.9%, respectively.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104644"},"PeriodicalIF":3.1,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-15. DOI: 10.1016/j.jvcir.2025.104645
Beibei Jiang, Yu Zhou
In recent years, significant progress has been made in facial expression recognition (FER) methods based on deep learning. However, existing models still face challenges in computational efficiency and generalization when dealing with diverse emotional expressions and complex environmental variations. Recently, large-scale vision-language pre-training models such as CLIP have achieved remarkable success in multi-modal learning, and their rich visual and textual representations offer valuable insights for downstream tasks. Consequently, transferring this knowledge to build efficient and accurate FER systems has emerged as a key research direction. To this end, this paper proposes a novel model, termed Knowledge Distillation and Retrieval-Augmented Generation (KDRAG), which combines knowledge distillation and Retrieval-Augmented Generation (RAG) to improve the efficiency and accuracy of FER. Through knowledge distillation, the teacher model (ViT-L/14) transfers its rich knowledge to the smaller student model (ViT-B/32); an additional linear projection layer maps the teacher model’s output features to the student model’s feature dimensions for feature alignment. Moreover, the RAG mechanism enhances the student model’s emotional understanding by retrieving text descriptions related to the input image. The framework further combines a soft loss (from the teacher model’s knowledge) and a hard loss (from the true labels) to improve generalization. Extensive experimental results on multiple datasets demonstrate that KDRAG achieves significant improvements in accuracy and computational efficiency, providing new insights for real-time FER systems.
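A hedged sketch of the loss combination described above is given below: a feature-alignment term through a linear projection, a temperature-softened distillation term, and the hard cross-entropy term; the weighting, temperature, and exact form of the soft loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_losses(student_feat, teacher_feat, student_logits, teacher_logits,
              labels, proj, T: float = 2.0, alpha: float = 0.5):
    """Combine feature alignment, a softened distillation term, and the hard
    label term (the weights and temperature are illustrative)."""
    feat_loss = F.mse_loss(student_feat, proj(teacher_feat))      # align dimensions
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         F.softmax(teacher_logits / T, dim=-1),
                         reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)
    return feat_loss + alpha * soft_loss + (1 - alpha) * hard_loss

proj = nn.Linear(768, 512)             # teacher dim -> student dim (illustrative)
s_feat, t_feat = torch.randn(8, 512), torch.randn(8, 768)
s_log, t_log = torch.randn(8, 7), torch.randn(8, 7)
y = torch.randint(0, 7, (8,))          # 7 hypothetical expression classes
print(kd_losses(s_feat, t_feat, s_log, t_log, y, proj).item())
```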
{"title":"Multi-modal deep facial expression recognition framework combining knowledge distillation and retrieval-augmented generation","authors":"Beibei Jiang, Yu Zhou","doi":"10.1016/j.jvcir.2025.104645","DOIUrl":"10.1016/j.jvcir.2025.104645","url":null,"abstract":"<div><div>In recent years, significant progress has been made in facial expression recognition (FER) methods based on deep learning. However, existing models still face challenges in terms of computational efficiency and generalization performance when dealing with diverse emotional expressions and complex environmental variations. Recently, large-scale vision-language pre-training models such as CLIP have achieved remarkable success in multi-modal learning. Their rich visual and textual representations offer valuable insights for downstream tasks. Consequently, transferring the knowledge to develop efficient and accurate facial expression recognition (FER) systems has emerged as a key research direction. To the end, this paper proposes a novel model, termed Knowledge Distillation and Retrieval-Augmented Generation (KDRAG), which combines Distillation and Retrieval-Augmented Generation (RAG) techniques to improve the efficiency and accuracy of FER. Through knowledge distillation, the teacher model (ViT-L/14) transfers its rich knowledge to the smaller student model (ViT-B/32). An additional linear projection layer is added to map the teacher model’s output features to the student model’s feature dimensions for feature alignment. Moreover, the RAG mechanism is developed to enhance the emotional understanding of students by retrieving text descriptions related to the input image. Additionally, this framework combines soft loss (from the teacher model’s knowledge) and hard loss (from the true targets of the labels) to enhance the model’s generalization ability. Extensive experimental results on multiple datasets demonstrate that the KDRAG framework can achieve significant improvements in accuracy and computational efficiency, providing new insights for real-time FER systems.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"114 ","pages":"Article 104645"},"PeriodicalIF":3.1,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}