Future object localization using multi-modal ego-centric video
Pub Date: 2025-12-18 | DOI: 10.1016/j.jvcir.2025.104684
Jee-Ye Yoon, Je-Won Kang
Future object localization (FOL) seeks to predict the future locations of objects using information from past and present video frames. Ego-centric videos from vehicle-mounted cameras serve as a key source. However, these videos are constrained by a limited field of view and susceptibility to external conditions. To address these challenges, this paper presents a novel FOL approach that combines ego-centric video data with point cloud data, enhancing both robustness and accuracy. The proposed model is based on a deep neural network that prioritizes front-camera ego-centric videos, exploiting their rich visual cues. By integrating point cloud data, the system improves three-dimensional (3D) object localization. Furthermore, the paper introduces a novel method for ego-motion prediction. The ego-motion prediction network employs multi-modal sensors to comprehensively capture physical displacement in both 2D and 3D spaces, effectively handling occlusions and the limited perspective inherent in ego-centric videos. Experimental results indicate that the proposed FOL system with ego-motion prediction (MS-FOLe) outperforms existing methods on large-scale open datasets for intelligent driving.
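The abstract describes a two-stream design (front-camera video features fused with point-cloud features, plus an ego-motion prediction branch) but gives no implementation details. As a rough, hypothetical sketch of such a fusion pipeline, the PyTorch snippet below fuses per-frame image features with pooled point features and decodes future boxes and an ego-motion vector; all module names, dimensions, and the GRU-based temporal model are illustrative assumptions, not the authors' MS-FOLe architecture.

```python
# Hypothetical sketch of a two-stream FOL model: fuses per-frame front-camera
# features with point-cloud features and decodes future bounding boxes.
# Names, sizes, and the GRU-based decoder are illustrative assumptions only.
import torch
import torch.nn as nn

class FOLSketch(nn.Module):
    def __init__(self, img_dim=512, pc_dim=256, hid=256, horizon=10):
        super().__init__()
        self.pc_encoder = nn.Sequential(            # PointNet-style shared MLP
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, pc_dim))
        self.fuse = nn.Linear(img_dim + pc_dim, hid)
        self.temporal = nn.GRU(hid, hid, batch_first=True)
        self.ego_head = nn.Linear(hid, 6)            # assumed 2D + 3D ego displacement
        self.box_head = nn.Linear(hid, horizon * 4)  # future (x, y, w, h) per step
        self.horizon = horizon

    def forward(self, img_feats, points):
        # img_feats: (B, T, img_dim) per-frame CNN features of the front camera
        # points:    (B, T, N, 3) LiDAR points per frame
        pc_feats = self.pc_encoder(points).max(dim=2).values          # (B, T, pc_dim)
        fused = torch.relu(self.fuse(torch.cat([img_feats, pc_feats], dim=-1)))
        h, _ = self.temporal(fused)                  # (B, T, hid)
        last = h[:, -1]                              # summary of past/present frames
        ego_motion = self.ego_head(last)
        future_boxes = self.box_head(last).view(-1, self.horizon, 4)
        return future_boxes, ego_motion

boxes, ego = FOLSketch()(torch.randn(2, 8, 512), torch.randn(2, 8, 1024, 3))
```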
{"title":"Future object localization using multi-modal ego-centric video","authors":"Jee-Ye Yoon , Je-Won Kang","doi":"10.1016/j.jvcir.2025.104684","DOIUrl":"10.1016/j.jvcir.2025.104684","url":null,"abstract":"<div><div>Future object localization (FOL) seeks to predict the future locations of objects using information from past and present video frames. Ego-centric videos from vehicle-mounted cameras serve as a key source. However, these videos are constrained by a limited field of view and susceptibility to external conditions. To address these challenges, this paper presents a novel FOL approach that combines ego-centric video data with point cloud data, enhancing both robustness and accuracy. The proposed model is based on a deep neural network that prioritizes front-camera ego-centric videos, exploiting their rich visual cues. By integrating point cloud data, the system improves three-dimensional (3D) object localization. Furthermore, the paper introduces a novel method for ego-motion prediction. The ego-motion prediction network employs multi-modal sensors to comprehensively capture physical displacement in both 2D and 3D spaces, effectively handling occlusions and the limited perspective inherent in ego-centric videos. Experimental results indicate that the proposed FOL system with ego-motion prediction (MS-FOLe) outperforms existing methods on large-scale open datasets for intelligent driving.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104684"},"PeriodicalIF":3.1,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CSCA: Channel-specific information contrast and aggregation for weakly supervised semantic segmentation
Pub Date: 2025-12-18 | DOI: 10.1016/j.jvcir.2025.104690
Guoqing Zhang, Wenxin Sun, Long Wang, Yuhui Zheng, Zhonglin Ye
Existing multi-stage weakly supervised semantic segmentation (WSSS) methods typically use refined class activation maps (CAMs) to generate pseudo labels. However, CAMs are prone to misactivating background regions associated with foreground objects (e.g., train and railroad). Some previous efforts introduce additional supervisory signals as background cues but do not consider the rich foreground–background discrimination insights present in different channels of CAMs. In this work, we present a novel framework that explicitly models channel-specific information to enhance foreground–background discrimination and contextual understanding in CAM generation. By effectively capturing and integrating channel-wise local and global cues, our approach mitigates common misactivation issues without requiring additional supervision. Experiments on the PASCAL VOC 2012 dataset show that our method alleviates misactivation in CAMs without additional supervision, providing significant improvements over off-the-shelf methods and achieving strong segmentation performance.
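The abstract does not give the exact contrast-and-aggregation formulas, so the snippet below only illustrates the general idea of deriving channel-wise local and global cues from a feature tensor and using them to reweight channels before CAM generation; the cue definitions and the sigmoid gating are assumptions, not the CSCA method itself.

```python
# Illustrative only: derive per-channel cues from backbone features and reweight
# them before class-activation-map (CAM) generation. The specific contrast and
# aggregation used by CSCA is not given in the abstract; this is an assumption.
import torch
import torch.nn.functional as F

def channel_reweighted_cam(feats, classifier_weights):
    # feats: (B, C, H, W) backbone features; classifier_weights: (num_cls, C)
    flat = feats.flatten(2)                              # (B, C, H*W)
    global_cue = flat.mean(dim=2)                        # channel-wise global context
    local_cue = flat.max(dim=2).values                   # channel-wise peak response
    weights = torch.sigmoid(local_cue - global_cue)      # crude fg/bg contrast per channel
    reweighted = feats * weights[:, :, None, None]
    cams = torch.einsum('bchw,kc->bkhw', reweighted, classifier_weights)
    return F.relu(cams)                                  # (B, num_cls, H, W)

cams = channel_reweighted_cam(torch.randn(2, 256, 32, 32), torch.randn(20, 256))
```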
{"title":"CSCA: Channel-specific information contrast and aggregation for weakly supervised semantic segmentation","authors":"Guoqing Zhang , Wenxin Sun , Long Wang , Yuhui Zheng , Zhonglin Ye","doi":"10.1016/j.jvcir.2025.104690","DOIUrl":"10.1016/j.jvcir.2025.104690","url":null,"abstract":"<div><div>Existing multi-stage weakly supervised semantic segmentation (WSSS) methods typically use refined class activation maps (CAMs) to generate pseudo labels. However, CAMs are prone to misactivating background regions associated with foreground objects (e.g., train and railroad). Some previous efforts introduce additional supervisory signals as background cues but do not consider the rich foreground–background discrimination insights present in different channels of CAMs. In this work, we present a novel framework that explicitly models channel-specific information to enhance foreground–background discrimination and contextual understanding in CAM generation. By effectively capturing and integrating channel-wise local and global cues, our approach mitigates common misactivation issues without requiring additional supervision. Experiments on the PASCAL VOC 2012 dataset show that our method alleviates misactivation in CAMs without additional supervision, providing significant improvements over off-the-shelf methods and achieving strong segmentation performance.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104690"},"PeriodicalIF":3.1,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-distance near-infrared face recognition
Pub Date: 2025-12-17 | DOI: 10.1016/j.jvcir.2025.104691
Da Ai, Yunqiao Wang, Kai Jia, Zhike Ji, Ying Liu
In practical video surveillance scenarios, the imaging difference between the near-infrared (NIR) and visible-light (VIS) spectra and the capture distance are two important factors that limit the accuracy of NIR face recognition. In this paper, we first use a fixed-focus near-infrared camera to capture NIR face images at different distances, constructing a large Cross-Spectral and Cross-Distance Face dataset (CSCD-F); to improve recognition accuracy, we apply image enhancement techniques to preprocess low-quality face images. Furthermore, we adjust the sampling depth of the generator in the CycleGAN network and introduce an additional edge loss, proposing a general framework that combines generative models and transfer learning to achieve spectral feature translation between NIR and VIS images. The proposed method effectively converts NIR face images into VIS images while retaining sufficient identity information. Experimental results demonstrate that the proposed method achieves significant performance improvements on the self-built CSCD-F dataset, and experiments on public datasets such as HFB and Oulu-CASIA NIR-VIS confirm its generalization capability and effectiveness.
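The abstract mentions an additional edge loss on top of the CycleGAN objective without specifying its form. A common choice is an L1 penalty between Sobel edge maps of the translated VIS image and the NIR input; the sketch below shows that assumed variant only.

```python
# A minimal sketch of an auxiliary edge loss that could be added to a CycleGAN
# objective, using Sobel gradients; the paper's exact formulation is not given
# in the abstract, so treat this as an assumption.
import torch
import torch.nn.functional as F

def sobel_edges(x):
    # x: (B, 1, H, W) grayscale image in [0, 1]
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device)
    ky = kx.t()
    gx = F.conv2d(x, kx.view(1, 1, 3, 3), padding=1)
    gy = F.conv2d(x, ky.view(1, 1, 3, 3), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_loss(fake_vis, real_nir):
    # Encourage the translated VIS image to keep the edge structure of the NIR input.
    return F.l1_loss(sobel_edges(fake_vis), sobel_edges(real_nir))

loss = edge_loss(torch.rand(2, 1, 128, 128), torch.rand(2, 1, 128, 128))
```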
{"title":"Cross-distance near-infrared face recognition","authors":"Da Ai , Yunqiao Wang , Kai Jia , Zhike Ji , Ying Liu","doi":"10.1016/j.jvcir.2025.104691","DOIUrl":"10.1016/j.jvcir.2025.104691","url":null,"abstract":"<div><div>In the actual video surveillance application scenarios, the imaging difference between near infrared (NIR) and visible light (VIS) spectrum and the photo distance are two important factors that restrict the accuracy of near infrared face recognition. In this paper, we first use a fixed focus near-infrared camera to capture NIR face images at different distances, constructing a large Cross-Spectral and Cross-Distance Face dataset (CSCD-F), and in order to improve recognition accuracy, we employ image enhancement techniques to preprocess low-quality face images. Furthermore, we adjusted the sampling depth of the generator in the CycleGAN network and introduced additional edge loss, proposing a general framework that combines generative models and transfer learning to achieve spectral feature translation between NIR and VIS images. The proposed method can effectively convert NIR face images into VIS images while retaining sufficient identity information. Various experimental results demonstrate that the proposed method achieves significant performance improvements on the self-built CSCD-F dataset. Additionally, it validates the generalization capability and effectiveness of the proposed method on public datasets such as HFB and Oulu-CASIA NIR-VIS.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104691"},"PeriodicalIF":3.1,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aligning computational and human perceptions of image complexity: A dual-task framework for prediction and localization
Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104686
Xiaoying Guo, Liang Li, Tao Yan, Lu Wang, Yuhua Qian
Perceptual analysis of image complexity bridges affective computing and visual perception, providing deeper insights into visual content. Conventional approaches mainly focus on global complexity scoring, neglecting the localization of region-specific complexity cues crucial for human perception. To address these challenges, we propose ICCORN, a dual-task framework that predicts image complexity scores while simultaneously detecting complexity regions. By integrating a modified ICNet and rank-consistent ordinal regression (CORN), ICCORN generates complexity activation maps that are highly consistent with eye movement heatmaps. Comprehensive cross-dataset evaluations on four datasets demonstrate ICCORN’s robust performance across diverse image types, supporting its applicability in visual complexity analysis. Additionally, we introduce ICEye, a novel eye-tracking dataset of 1200 images across eight semantic categories, annotated with gaze trajectories, heatmaps, and segmented regions. This dataset facilitates advanced research into computational modeling of human visual complexity perception. The ICEye dataset is available at https://github.com/gxyeagle19850102/ICEye.
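As a hedged illustration of the dual-task idea (a complexity activation map plus a rank-consistent ordinal score), the sketch below combines a 1x1 map head with CORN-style conditional logits whose sigmoids are cumulatively multiplied; the backbone, channel sizes, and head design are assumptions rather than the published ICCORN configuration.

```python
# Sketch of a dual-task head in the spirit of ICCORN: one branch produces a
# complexity activation map, the other produces CORN-style ordinal logits whose
# conditional probabilities are multiplied to get a rank-consistent score.
# Backbone choice, channel sizes, and head design are assumptions.
import torch
import torch.nn as nn

class DualComplexityHead(nn.Module):
    def __init__(self, in_ch=512, num_levels=5):
        super().__init__()
        self.map_head = nn.Conv2d(in_ch, 1, kernel_size=1)       # complexity activation map
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.ordinal_head = nn.Linear(in_ch, num_levels - 1)      # K-1 conditional logits

    def forward(self, feats):
        # feats: (B, in_ch, H, W) backbone features
        cam = torch.sigmoid(self.map_head(feats))                 # (B, 1, H, W)
        logits = self.ordinal_head(self.pool(feats).flatten(1))   # (B, K-1)
        cond_probs = torch.sigmoid(logits)                        # P(y > k | y > k-1)
        cum_probs = torch.cumprod(cond_probs, dim=1)              # rank-consistent P(y > k)
        score = cum_probs.sum(dim=1)                              # expected ordinal level in [0, K-1]
        return cam, score

cam, score = DualComplexityHead()(torch.randn(2, 512, 14, 14))
```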
{"title":"Aligning computational and human perceptions of image complexity: A dual-task framework for prediction and localization","authors":"Xiaoying Guo , Liang Li , Tao Yan , Lu Wang , Yuhua Qian","doi":"10.1016/j.jvcir.2025.104686","DOIUrl":"10.1016/j.jvcir.2025.104686","url":null,"abstract":"<div><div>Perceptual analysis of image complexity bridges affective computing and visual perception, providing deeper insights into visual content. Conventional approaches mainly focus on global complexity scoring, neglecting the localization of region-specific complexity cues crucial for human perception. To address these challenges, we propose ICCORN, a dual-task framework that predicts image complexity scores while detecting complexity regions simultaneously. By integrating modified ICNet and rank-consistent ordinal regression (CORN), ICCORN generates complexity activation maps that are highly consistent with eye movement heatmaps. Comprehensive cross-dataset evaluations on four datasets demonstrate that ICCORN’s robust performance across diverse image types, enhancing its applicability in visual complexity analysis. Additionally, we introduce ICEye, a novel eye-tracking dataset of 1200 images across eight semantic categories, annotated with gaze trajectories, heatmaps, and segmented regions. This dataset facilitates advanced research into computational modeling of human visual complexity perception. ICEye dataset is available at <span><span>https://github.com/gxyeagle19850102/ICEye</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104686"},"PeriodicalIF":3.1,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Point cloud accumulation via multi-dimensional pseudo label and progressive instance association
Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104688
Shujuan Huang, Jie Pan, Chunyu Lin, Lang Nie, Meiqin Liu, Yao Zhao
Point cloud accumulation aligns and merges 3D LiDAR frames to create dense, comprehensive scene representations that are critical for applications like autonomous driving. Effective accumulation relies on accurate scene flow estimation, yet error propagation and drift, particularly from noise and fast-moving objects, pose significant challenges. Existing clustering-based methods for instance association often falter under these conditions and depend heavily on manual labels, limiting scalability. To address these issues, we propose a Progressive Instance Association (PIA) method that integrates single-frame clustering with an enhanced Unscented Kalman Filter, improving tracking robustness in dynamic scenes. Additionally, our Multi-Dimensional Pseudo Label (MDPL) strategy leverages cross-modal supervision to reduce reliance on manual labels, enhancing scene flow accuracy. Evaluated on the Waymo Open Dataset, our method surpasses state-of-the-art LiDAR-based approaches and performs comparably to multi-modal methods. Qualitative visualizations further demonstrate denser, well-aligned accumulated point clouds.
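The paper's PIA module couples single-frame clustering with an enhanced Unscented Kalman Filter; as a much-simplified stand-in, the sketch below tracks cluster centroids with a plain constant-velocity linear Kalman filter and greedy nearest-centroid association. It is meant only to illustrate the predict-associate-update loop, not the authors' algorithm.

```python
# Simplified illustration only: the paper uses an enhanced Unscented Kalman
# Filter; here a plain constant-velocity linear Kalman filter tracks cluster
# centroids and greedily associates them frame to frame.
import numpy as np

class CentroidTrack:
    def __init__(self, centroid, dt=0.1):
        self.x = np.hstack([centroid, np.zeros(3)])        # [px, py, pz, vx, vy, vz]
        self.P = np.eye(6)
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)                    # constant-velocity motion model
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe position only
        self.Q = 0.01 * np.eye(6)
        self.R = 0.1 * np.eye(3)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P

def associate(tracks, detections, max_dist=2.0):
    # Greedy nearest-centroid association between predicted tracks and new clusters.
    if not tracks:
        return
    preds = np.array([t.predict() for t in tracks])
    for det in detections:
        d = np.linalg.norm(preds - det, axis=1)
        j = int(d.argmin())
        if d[j] < max_dist:
            tracks[j].update(det)

tracks = [CentroidTrack(np.array([1.0, 2.0, 0.0]))]
associate(tracks, [np.array([1.1, 2.0, 0.0])])
```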
{"title":"Point cloud accumulation via multi-dimensional pseudo label and progressive instance association","authors":"Shujuan Huang, Jie Pan, Chunyu Lin, Lang Nie, Meiqin Liu, Yao Zhao","doi":"10.1016/j.jvcir.2025.104688","DOIUrl":"10.1016/j.jvcir.2025.104688","url":null,"abstract":"<div><div>Point cloud accumulation aligns and merges 3D LiDAR frames to create dense, comprehensive scene representations that are critical for applications like autonomous driving. Effective accumulation relies on accurate scene flow estimation, yet error propagation and drift, particularly from noise and fast-moving objects, pose significant challenges. Existing clustering-based methods for instance association often falter under these conditions and depend heavily on manual labels, limiting scalability. To address these issues, we propose a Progressive Instance Association (PIA) method that integrates single-frame clustering with an enhanced Unscented Kalman Filter, improving tracking robustness in dynamic scenes. Additionally, our Multi-Dimensional Pseudo Label (MDPL) strategy leverages cross-modal supervision to reduce reliance on manual labels, enhancing scene flow accuracy. Evaluated on the Waymo Open Dataset, our method surpasses state-of-the-art LiDAR-based approaches and performs comparably to multi-modal methods. Qualitative visualizations further demonstrate denser, well-aligned accumulated point clouds.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104688"},"PeriodicalIF":3.1,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depth error points optimization for 3D Gaussian Splatting in few-shot synthesis
Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104682
Xu Jiang, Huiping Deng, Sen Xiang, Li Yu
Few-view 3D reconstruction aims to recover the 3D geometric shape of objects or scenes from only a limited number of views. In recent years, with the development of deep learning and 3D rendering technologies, this field has achieved significant progress. Because repetitive texture regions share highly similar geometric and appearance features, current few-view 3D reconstruction methods fail to distinguish their local differences during global reconstruction, frequently producing floating artifacts in these regions of the synthesized views. We propose a method that combines monocular depth supervision with depth-error-guided point optimization within the 3D Gaussian Splatting framework to solve the floating-artifact problem in repetitive texture regions under few-view input. Specifically, we compute a loss between rendered depth maps and pseudo-ground-truth depth maps to impose depth constraints, and we identify erroneous Gaussian points through depth error maps. For these erroneous point regions, we apply more effective point densification to guide the model toward learning correct geometry in these regions and synthesizing views with fewer floating artifacts. We validate our method on the NeRF-LLFF dataset with different numbers of input images, running multiple experiments on randomly selected training images and reporting average values to ensure fairness. The experimental results on the LLFF dataset show that our method outperforms the baseline DRGS, achieving 0.53 dB higher PSNR and 0.021 higher SSIM. This confirms that our method effectively reduces floating artifacts in repetitive texture regions for few-view novel view synthesis.
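As a hedged sketch of the depth constraint and the depth-error-guided point selection described above, the snippet below aligns a monocular pseudo-depth map to the rendered depth with a least-squares scale/shift fit, takes an L1-style loss, and marks the highest-error pixels as candidates for densification; the alignment step, the top-k threshold, and the function name are assumptions rather than the paper's exact criterion.

```python
# Hedged sketch: a depth-supervision loss between rendered depth and a monocular
# pseudo-depth map, plus a mask of high-error pixels that could be used to pick
# Gaussians for extra densification. Scale alignment and the threshold are assumptions.
import torch

def depth_loss_and_error_mask(rendered_depth, mono_depth, top_frac=0.05):
    # rendered_depth, mono_depth: (H, W). Monocular depth is only defined up to
    # scale/shift, so align it to the rendering with a least-squares fit first.
    x = mono_depth.flatten()
    y = rendered_depth.flatten()
    A = torch.stack([x, torch.ones_like(x)], dim=1)           # (H*W, 2)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution       # scale, shift
    aligned = (A @ sol).view_as(rendered_depth)

    error = (rendered_depth - aligned).abs()
    loss = error.mean()

    k = max(1, int(top_frac * error.numel()))                  # top-k highest-error pixels
    thresh = error.flatten().topk(k).values.min()
    error_mask = error >= thresh                                # candidate region for densification
    return loss, error_mask

loss, mask = depth_loss_and_error_mask(torch.rand(64, 64) + 1.0, torch.rand(64, 64) + 1.0)
```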
{"title":"Depth error points optimization for 3D Gaussian Splatting in few-shot synthesis","authors":"Xu Jiang , Huiping Deng , Sen Xiang , Li Yu","doi":"10.1016/j.jvcir.2025.104682","DOIUrl":"10.1016/j.jvcir.2025.104682","url":null,"abstract":"<div><div>Few-view 3D reconstruction technology aims to recover the 3D geometric shape of objects or scenes using only a limited number of views. In recent years, with the development of deep learning and 3D rendering technologies, this field has achieved significant progress. Due to the highly similar geometric and appearance features of repetitive texture regions, current few-view 3D reconstruction methods fail to distinguish their local differences during the global reconstruction process, thus frequently resulting in floating artifacts in these regions of the synthesized new views. We propose a method combining monocular depth supervision with depth-error-guided point optimization within the framework of 3D Gaussian Splatting to solve the floating artifact problem in repetitive texture regions under few-view input conditions. Specifically, we calculate a loss function using rendered depth maps and pseudo-true depth maps to achieve depth constraints, and we identify erroneous Gaussian points through depth error maps. For these erroneous point regions, we implement more effective point densification to guide the model in learning more correct geometric shapes in these regions and to synthesize views with fewer floating artifacts. We validate our method on the NeRF-LLFF dataset with different numbers of images. We conduct multiple experiments on randomly selected training images and provide average values to ensure fairness. The experimental results on the LLFF dataset show that our method outperforms the baseline method DRGS, achieving 0.53 dB higher PSNR and 0.021 higher SSIM. This confirms that we effectively reduce floating artifacts in the repetitive texture regions of few-view novel view synthesis.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104682"},"PeriodicalIF":3.1,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAM-FireAdapter: An adapter for fire segmentation with SAM
Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104678
Yanan Wu, Chaoqun Hong, Yongfeng Chen, Haixi Cheng
With the rise of large foundation models, significant advancements have been made in the field of artificial intelligence. The Segment Anything Model (SAM) was specifically designed for image segmentation. However, experiments have demonstrated that SAM may encounter performance limitations on specific tasks, such as fire segmentation. To address this challenge, our study explores solutions to effectively adapt the pre-trained SAM model for fire segmentation. An adapter-enhanced approach is introduced into SAM, incorporating effective adapter modules into the segmentation network. The resulting approach, SAM-FireAdapter, incorporates fire-specific features into SAM, significantly enhancing its performance on fire segmentation. Additionally, we propose Fire-Adaptive Attention (FAA), a lightweight attention mechanism module to enhance feature representation. This module reweights the input features before decoding, emphasizing critical spatial features and suppressing less relevant ones. Experimental results demonstrate that SAM-FireAdapter surpasses existing fire segmentation networks, including the base SAM.
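The snippet below gives generic stand-ins for the two components named in the abstract: a residual bottleneck adapter that could be inserted into a frozen transformer block, and a squeeze-and-excitation-style channel reweighting as one plausible form of Fire-Adaptive Attention. Dimensions, placement, and the SE-style formulation are assumptions based only on the abstract, not the released SAM-FireAdapter code.

```python
# Generic stand-ins, not the actual SAM-FireAdapter code: a bottleneck adapter
# for a frozen transformer block and a lightweight channel-attention module that
# reweights features before decoding. Sizes and placement are assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                             # x: (B, N, dim) token features
        return x + self.up(self.act(self.down(x)))    # residual adapter

class FireAdaptiveAttention(nn.Module):
    """Squeeze-and-excitation-style channel reweighting (an assumed form of FAA)."""
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                             # x: (B, C, H, W) decoder input features
        w = self.fc(x.mean(dim=(2, 3)))               # (B, C) channel weights
        return x * w[:, :, None, None]

tokens = Adapter()(torch.randn(2, 196, 768))
feats = FireAdaptiveAttention()(torch.randn(2, 256, 64, 64))
```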
{"title":"SAM-FireAdapter: An adapter for fire segmentation with SAM","authors":"Yanan Wu, Chaoqun Hong, Yongfeng Chen, Haixi Cheng","doi":"10.1016/j.jvcir.2025.104678","DOIUrl":"10.1016/j.jvcir.2025.104678","url":null,"abstract":"<div><div>With the rise of large foundation models, significant advancements have been made in the field of artificial intelligence. The Segment Anything Model (SAM) was specifically designed for image segmentation. However, experiments have demonstrated that SAM may encounter performance limitations in handling specific tasks, such as fire segmentation. To address this challenge, our study explores solutions to effectively adapt the pre-trained SAM model for fire segmentation. The adapter-enhanced approach is introduced to SAM, incorporating effective adapter modules into the segmentation network. The resulting approach, SAM-FireAdapter, incorporates fire-specific features into SAM, significantly enhancing its performance on fire segmentation. Additionally, we propose Fire-Adaptive Attention (FAA), a lightweight attention mechanism module to enhance feature representation. This module reweights the input features before decoding, emphasizing critical spatial features and suppressing less relevant ones. Experimental results demonstrate that SAM-FireAdapter surpasses existing fire segmentation networks including the base SAM.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104678"},"PeriodicalIF":3.1,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep semi-supervised learning method based on sample adaptive weights and discriminative feature learning
Pub Date: 2025-12-15 | DOI: 10.1016/j.jvcir.2025.104689
Jiawei Wang, Weiwei Shi, Xiaofan Wang, Xinhong Hei
Semi-supervised learning has achieved significant success through various approaches based on pseudo-labeling and consistency regularization. Despite these efforts, effectively utilizing both labeled and unlabeled data remains a significant challenge. In this study, to make more efficient use of limited and valuable labeled data, we propose a self-adaptive weight redistribution strategy within a batch. This strategy accounts for the heterogeneity of labeled data, adjusting each sample’s contribution to the overall loss based on its individual loss, which enables the model to identify challenging samples more accurately. Our experiments demonstrate that this weight reallocation strategy significantly enhances the model’s generalization ability. Additionally, to enhance intra-class compactness and inter-class separation of the learned features, we introduce a cosine similarity-based discriminative feature learning regularization term. This regularization term reinforces feature consistency within the same class and enhances feature distinctiveness across different classes. Through this mechanism, the model is encouraged to prioritize learning discriminative feature representations, ensuring that features with authentic labels and those with high-confidence pseudo-labels are grouped together while features belonging to different clusters are separated. The method can be combined with mainstream semi-supervised learning methods, which we evaluate experimentally. Our experimental findings illustrate the efficacy of our approach in enhancing the performance of semi-supervised learning tasks across widely used image classification datasets.
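As one plausible instantiation of the two ideas above, the sketch below redistributes per-sample weights inside a labeled batch with a softmax over detached per-sample losses, and adds a cosine-similarity regularizer that pulls same-class features together and pushes different-class features apart by a margin; the exact weighting scheme, margin, and loss combination are assumptions, not the paper's formulas.

```python
# Sketch only: batch-wise adaptive sample weights from per-sample losses, plus a
# cosine-similarity term that tightens same-class features and separates classes.
# The paper's exact weighting and regularizer are not given in the abstract.
import torch
import torch.nn.functional as F

def adaptive_weighted_loss(logits, labels, temperature=1.0):
    per_sample = F.cross_entropy(logits, labels, reduction='none')       # (B,)
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)    # harder samples get more weight
    return (weights * per_sample).sum()

def cosine_discriminative_reg(features, labels, margin=0.2):
    f = F.normalize(features, dim=1)                  # (B, D)
    sim = f @ f.t()                                   # pairwise cosine similarity
    same = labels[:, None].eq(labels[None, :]).float()
    eye = torch.eye(len(labels), device=labels.device)
    pos = ((1.0 - sim) * same * (1 - eye)).sum() / ((same * (1 - eye)).sum() + 1e-6)
    neg = (F.relu(sim - margin) * (1 - same)).sum() / ((1 - same).sum() + 1e-6)
    return pos + neg

logits, feats = torch.randn(16, 10), torch.randn(16, 128)
labels = torch.randint(0, 10, (16,))
total = adaptive_weighted_loss(logits, labels) + 0.1 * cosine_discriminative_reg(feats, labels)
```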
{"title":"Deep semi-supervised learning method based on sample adaptive weights and discriminative feature learning","authors":"Jiawei Wang, Weiwei Shi, Xiaofan Wang, Xinhong Hei","doi":"10.1016/j.jvcir.2025.104689","DOIUrl":"10.1016/j.jvcir.2025.104689","url":null,"abstract":"<div><div>Semi-supervised learning has achieved significant success through various approaches based on pseudo-labeling and consistency regularization. Despite efforts, effectively utilizing both labeled and unlabeled data remains a significant challenge. In this study, to enhance the efficient utilization of limited and valuable labeled data, we propose a self-adaptive weight redistribution strategy within a batch. This operation takes into account the heterogeneity of labeled data, adjusting its contribution to the overall loss based on sample-specific losses. This enables the model to more accurately identify challenging samples. Our experiments demonstrate that this weight reallocation strategy significantly enhances the model’s generalization ability. Additionally, to enhance intra-class compactness and inter-class separation of the learned features, we introduce a cosine similarity-based discriminative feature learning regularization term. This regularization term aims to reinforce feature consistency within the same class and enhance feature distinctiveness across different classes. Through this mechanism, we facilitate the model to prioritize learning discriminative feature representations, ensuring that features with authentic labels and those with high-confidence pseudo-labels are grouped together, while simultaneously separating features belonging to different clusters. The method can be combined with mainstream Semi-supervised learning methods, which we evaluate experimentally. Our experimental findings illustrate the efficacy of our approach in enhancing the performance of semi-supervised learning tasks across widely utilized image classification datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104689"},"PeriodicalIF":3.1,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAST: Semantic-Aware stylized Text-to-Image generation
Pub Date: 2025-12-15 | DOI: 10.1016/j.jvcir.2025.104685
Xinyue Sun, Jing Guo, Yongzhen Ke, Shuai Yang, Kai Wang, Yemeng Wu
Pre-trained text-to-image diffusion probabilistic models achieve excellent generation quality, and many users control the results with creative text prompts. For detailed generation requirements, reference images are commonly used to “stylize” text-to-image generation, since such requirements cannot be fully expressed in limited language. However, images generated by existing methods deviate in style from the reference images, contrary to the human perception that semantically similar object regions in two images of the same style should share that style. To solve this problem, this paper proposes a semantic-aware style transfer method (SAST) that strengthens semantic-level style alignment between the generated image and the style reference image. First, we introduce language-driven semantic segmentation trained on the COCO dataset into a general style transfer model to capture the mask of the regions the text focuses on in the style reference image. Similarly, we use the same text to extract a mask from the cross-attention layer of the text-to-image model. Based on the two obtained masks, we modify the self-attention layer in the diffusion model to control the injection of style features. Experiments show that our method achieves better style fidelity and style alignment metrics, indicating that the generated images are more consistent with human perception. Code is available at https://gitee.com/yongzhenke/SAST. Keywords: text-to-image, image style transfer, diffusion model, semantic alignment.
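The actual implementation is available at the repository linked above; purely as a simplified stand-in for the masked style injection in the self-attention layer, the sketch below attends separately to content and style keys/values and blends the two outputs using a text-derived semantic mask. Shapes, the dual-softmax formulation, and the gating are assumptions, not the SAST code.

```python
# Simplified stand-in for semantic-masked style injection in a self-attention
# layer: style keys/values only contribute at query positions covered by the
# semantic mask. Shapes and the gating scheme are illustrative assumptions.
import torch

def masked_style_attention(q, k_content, v_content, k_style, v_style, query_mask):
    # q: (B, Nq, D); k_*/v_*: (B, Nk, D); query_mask: (B, Nq) in {0, 1},
    # 1 where the text-derived semantic mask says the region should take the style.
    d = q.shape[-1]
    attn_c = torch.softmax(q @ k_content.transpose(1, 2) / d ** 0.5, dim=-1)
    attn_s = torch.softmax(q @ k_style.transpose(1, 2) / d ** 0.5, dim=-1)
    out_c = attn_c @ v_content
    out_s = attn_s @ v_style
    m = query_mask[..., None]                     # (B, Nq, 1)
    return m * out_s + (1 - m) * out_c            # inject style only inside the mask

out = masked_style_attention(
    torch.randn(1, 64, 320), torch.randn(1, 64, 320), torch.randn(1, 64, 320),
    torch.randn(1, 64, 320), torch.randn(1, 64, 320),
    (torch.rand(1, 64) > 0.5).float())
```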
{"title":"SAST: Semantic-Aware stylized Text-to-Image generation","authors":"Xinyue Sun , Jing Guo , yongzhen Ke , Shuai Yang , Kai Wang , Yemeng Wu","doi":"10.1016/j.jvcir.2025.104685","DOIUrl":"10.1016/j.jvcir.2025.104685","url":null,"abstract":"<div><div>The pre-trained text-to-image diffusion probabilistic model has achieved excellent quality, providing users with good visual effects and attracting many users to use creative text to control the generated results. For users’ detailed generation requirements, using reference images to “stylize” text-to-image is more common because they cannot be fully explained in limited language. However, there is a style deviation between the images generated by existing methods and the style reference images, contrary to the human perception that similar semantic object regions in two images with the same style should share style. To solve this problem, this paper proposes a semantic-aware style transfer method (SAST) to strengthen the semantic-level style alignment between the generated image and style reference image. First, we lead language-driven semantic segmentation trained on the COCO dataset into a general style transfer model to capture the mask that the text in the style reference image focuses on. Similarly, we use the same text to perform mask extraction on the cross-attention layer of the text-to-image model. Based on the two obtained mask maps, we modify the self-attention layer in the diffusion model to control the injection process of style features. Experiments show that we achieve better style fidelity and style alignment metrics, indicating that the generated images are more consistent with human perception. Code is available at https://gitee.com/yongzhenke/SAST. Additional Keywords and <strong>Phrases</strong>:Text-to-image, Image style transfer, Diffusion model, Semantic alignment.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104685"},"PeriodicalIF":3.1,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual saliency fixation via deeply tri-layered multi blended trans-encoder framework
Pub Date: 2025-12-13 | DOI: 10.1016/j.jvcir.2025.104676
S. Caroline, Y. Jacob Vetha Raj
The ability to predict where viewers look when observing a scene, also known as saliency prediction, has attracted considerable interest in computer vision. Incorporating saliency prediction modeling into traditional CNN-based models is challenging. To address this, we develop the Deeply Tri-Layered Multi-Blended Trans-Encoder Framework (DTMBTE) to improve human eye fixation prediction in image saliency tasks. Unlike existing CNN-based methods that struggle with contextual encoding, our model integrates local feature extraction with global attention mechanisms to forecast saliency regions more accurately. We design a new trans-encoder, the Multi-Blended Trans-Encoder (MBTE), by combining three different convolution types with multi-head attention encoders, which effectively localizes the human eye fixation or saliency area. This combined design efficiently extracts both spatial and contextual information for saliency estimation. Experiments on MIT1003 and CAT2000 show that DTMBTE achieves superior NSS and SIM scores and the lowest EMD.
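The abstract does not name the three convolution types that are blended, so the sketch below assumes standard, dilated, and depthwise convolutions, blends them with a 1x1 convolution, and follows with multi-head self-attention; it is an illustrative guess at an MBTE-style block, not the published design.

```python
# Illustrative guess at a "multi-blended" encoder block: three convolution
# variants (assumed: standard, dilated, depthwise) are blended and passed
# through multi-head self-attention. The actual DTMBTE/MBTE design may differ.
import torch
import torch.nn as nn

class MultiBlendedBlock(nn.Module):
    def __init__(self, ch=64, heads=4):
        super().__init__()
        self.conv_std = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv_dil = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.conv_dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)   # depthwise
        self.blend = nn.Conv2d(3 * ch, ch, 1)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, x):
        # x: (B, ch, H, W)
        b, c, h, w = x.shape
        mixed = self.blend(torch.cat(
            [self.conv_std(x), self.conv_dil(x), self.conv_dw(x)], dim=1))
        tokens = mixed.flatten(2).transpose(1, 2)        # (B, H*W, ch)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # global attention over positions
        tokens = self.norm(tokens + attn_out)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

sal_feats = MultiBlendedBlock()(torch.randn(1, 64, 32, 32))
```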
{"title":"Visual saliency fixation via deeply tri-layered multi blended trans-encoder framework","authors":"S. Caroline , Y.Jacob Vetha Raj","doi":"10.1016/j.jvcir.2025.104676","DOIUrl":"10.1016/j.jvcir.2025.104676","url":null,"abstract":"<div><div>The capacity to guess where viewers look while reviewing a scene, likewise called saliency prediction or observation, has created critical interest in the fields of computer vision. Incorporating saliency prediction modeling into traditional CNN-based models is challenging. To address this, we developed the Deeply Tri-Layered Multi-Blended Trans-Encoder Framework (DTMBTE) to improve human eye fixation prediction in image saliency tasks. Unlike existing CNN-based methods that struggle with contextual encoding, our model integrates local feature extraction with global attention mechanisms to more accurately forecast saliency regions. We created a new <em>trans</em>-encoder called the Multi Blended Trans-Encoder (MBTE) by combining three different convolution types with encoders that use multiple heads of attention, which effectively localize the human eye fixation or saliency area. This combined design efficiently extracts both spatial and contextual information for saliency estimation. Experiments on MIT1003 and CAT2000 show that DTMBTE outperforms NSS and SIM scores and minimum EMD.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104676"},"PeriodicalIF":3.1,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}