
Journal of Visual Communication and Image Representation — Latest Articles

Future object localization using multi-modal ego-centric video
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-18 | DOI: 10.1016/j.jvcir.2025.104684
Jee-Ye Yoon, Je-Won Kang
Future object localization (FOL) seeks to predict the future locations of objects using information from past and present video frames. Ego-centric videos from vehicle-mounted cameras serve as a key source. However, these videos are constrained by a limited field of view and susceptibility to external conditions. To address these challenges, this paper presents a novel FOL approach that combines ego-centric video data with point cloud data, enhancing both robustness and accuracy. The proposed model is based on a deep neural network that prioritizes front-camera ego-centric videos, exploiting their rich visual cues. By integrating point cloud data, the system improves three-dimensional (3D) object localization. Furthermore, the paper introduces a novel method for ego-motion prediction. The ego-motion prediction network employs multi-modal sensors to comprehensively capture physical displacement in both 2D and 3D spaces, effectively handling occlusions and the limited perspective inherent in ego-centric videos. Experimental results indicate that the proposed FOL system with ego-motion prediction (MS-FOLe) outperforms existing methods on large-scale open datasets for intelligent driving.
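The abstract does not detail the network, but the core multi-modal idea can be pictured with a toy fusion module: project per-frame 2D video embeddings and 3D point-cloud embeddings into a shared space, summarize the observed window with a recurrent layer, and regress future 2D/3D displacements. Everything below (dimensions, layer choices, the 5-dimensional per-step output) is an assumption for illustration, not the MS-FOLe architecture.

```python
import torch
import torch.nn as nn

class EgoMotionFusion(nn.Module):
    """Toy multi-modal ego-motion regressor (illustrative sizes only)."""
    def __init__(self, img_dim=256, pc_dim=128, hidden=256, horizon=5):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # front-camera video embedding
        self.pc_proj = nn.Linear(pc_dim, hidden)     # point-cloud embedding
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        # per future step: 2D displacement (du, dv) + 3D displacement (dx, dy, dz)
        self.head = nn.Linear(hidden, horizon * 5)
        self.horizon = horizon

    def forward(self, img_feat, pc_feat):
        # img_feat: (B, T, img_dim), pc_feat: (B, T, pc_dim) over past frames
        fused = self.img_proj(img_feat) + self.pc_proj(pc_feat)
        _, h = self.temporal(fused)                  # summarize the observed window
        return self.head(h[-1]).view(-1, self.horizon, 5)

model = EgoMotionFusion()
pred = model(torch.randn(2, 8, 256), torch.randn(2, 8, 128))
print(pred.shape)  # torch.Size([2, 5, 5])
```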
Citations: 0
CSCA: Channel-specific information contrast and aggregation for weakly supervised semantic segmentation
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-18 | DOI: 10.1016/j.jvcir.2025.104690
Guoqing Zhang, Wenxin Sun, Long Wang, Yuhui Zheng, Zhonglin Ye
Existing multi-stage weakly supervised semantic segmentation (WSSS) methods typically use refined class activation maps (CAMs) to generate pseudo labels. However, CAMs are prone to misactivating background regions associated with foreground objects (e.g., train and railroad). Some previous efforts introduce additional supervisory signals as background cues but do not consider the rich foreground–background discrimination insights present in different channels of CAMs. In this work, we present a novel framework that explicitly models channel-specific information to enhance foreground–background discrimination and contextual understanding in CAM generation. By effectively capturing and integrating channel-wise local and global cues, our approach mitigates common misactivation issues without requiring additional supervision. Experiments on the PASCAL VOC 2012 dataset show that our method alleviates misactivation in CAMs without additional supervision, providing significant improvements over off-the-shelf methods and achieving strong segmentation performance.
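As a rough illustration of channel-specific processing (the abstract does not specify CSCA's actual operators), the sketch below contrasts each feature channel with its global average, reweights the channels, and only then forms class activation maps; the contrast rule and normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def channel_contrast_cam(feat, classifier_w):
    # feat: (B, C, H, W) backbone features; classifier_w: (num_classes, C)
    g = feat.mean(dim=(2, 3), keepdim=True)                   # channel-wise global cue
    weighted = feat * torch.sigmoid(feat - g)                 # contrast local activation with global context
    cam = F.relu(torch.einsum('bchw,kc->bkhw', weighted, classifier_w))
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-5)  # per-class max normalization

cams = channel_contrast_cam(torch.randn(2, 512, 32, 32), torch.randn(20, 512))
print(cams.shape)  # torch.Size([2, 20, 32, 32])
```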
Citations: 0
Cross-distance near-infrared face recognition
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-17 | DOI: 10.1016/j.jvcir.2025.104691
Da Ai, Yunqiao Wang, Kai Jia, Zhike Ji, Ying Liu
In real video-surveillance scenarios, the imaging difference between the near-infrared (NIR) and visible-light (VIS) spectra and the capture distance are two important factors that limit the accuracy of NIR face recognition. In this paper, we first use a fixed-focus near-infrared camera to capture NIR face images at different distances, constructing a large Cross-Spectral and Cross-Distance Face dataset (CSCD-F), and, to improve recognition accuracy, we employ image enhancement techniques to preprocess low-quality face images. Furthermore, we adjust the sampling depth of the generator in the CycleGAN network and introduce an additional edge loss, proposing a general framework that combines generative models and transfer learning to achieve spectral feature translation between NIR and VIS images. The proposed method can effectively convert NIR face images into VIS images while retaining sufficient identity information. Extensive experimental results demonstrate that the proposed method achieves significant performance improvements on the self-built CSCD-F dataset, and its generalization capability and effectiveness are further validated on public datasets such as HFB and Oulu-CASIA NIR-VIS.
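The abstract mentions an additional edge loss in the CycleGAN framework without giving its form; a common Sobel-based edge-consistency term is sketched below as one plausible instantiation. The gradient operator, grayscale conversion, and loss weighting are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sobel_edges(gray):
    # gray: (B, 1, H, W); returns gradient-magnitude map
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=gray.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def edge_loss(nir, fake_vis):
    # keep the edge structure of the NIR input in the translated VIS image
    to_gray = lambda x: x.mean(dim=1, keepdim=True)
    return F.l1_loss(sobel_edges(to_gray(fake_vis)), sobel_edges(to_gray(nir)))

nir, fake_vis = torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128)
print(edge_loss(nir, fake_vis).item())  # added to the usual adversarial + cycle-consistency losses
```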
Citations: 0
Aligning computational and human perceptions of image complexity: A dual-task framework for prediction and localization
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104686
Xiaoying Guo, Liang Li, Tao Yan, Lu Wang, Yuhua Qian
Perceptual analysis of image complexity bridges affective computing and visual perception, providing deeper insights into visual content. Conventional approaches mainly focus on global complexity scoring, neglecting the localization of region-specific complexity cues crucial for human perception. To address these challenges, we propose ICCORN, a dual-task framework that predicts image complexity scores while simultaneously detecting complexity regions. By integrating a modified ICNet and rank-consistent ordinal regression (CORN), ICCORN generates complexity activation maps that are highly consistent with eye-movement heatmaps. Comprehensive cross-dataset evaluations on four datasets demonstrate ICCORN's robust performance across diverse image types, supporting its applicability in visual complexity analysis. Additionally, we introduce ICEye, a novel eye-tracking dataset of 1200 images across eight semantic categories, annotated with gaze trajectories, heatmaps, and segmented regions. This dataset facilitates advanced research into computational modeling of human visual complexity perception. The ICEye dataset is available at https://github.com/gxyeagle19850102/ICEye.
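For readers unfamiliar with rank-consistent ordinal regression, the minimal head below shows the ordinal idea that CORN builds on: K-1 binary "greater than level k" tasks whose positive responses are summed at inference. It omits CORN's conditional training scheme and the rest of the ICCORN pipeline; feature size and level count are placeholders.

```python
import torch
import torch.nn as nn

class OrdinalComplexityHead(nn.Module):
    def __init__(self, feat_dim=512, num_levels=5):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_levels - 1)  # one logit per ordinal threshold

    def forward(self, feat):
        return self.fc(feat)                           # (B, K-1) logits

    @staticmethod
    def predict_level(logits):
        probs = torch.sigmoid(logits)
        return 1 + (probs > 0.5).sum(dim=1)            # predicted rank in {1, ..., K}

head = OrdinalComplexityHead()
logits = head(torch.randn(3, 512))
print(OrdinalComplexityHead.predict_level(logits))    # e.g. tensor([2, 4, 1])
```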
Citations: 0
Point cloud accumulation via multi-dimensional pseudo label and progressive instance association
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104688
Shujuan Huang, Jie Pan, Chunyu Lin, Lang Nie, Meiqin Liu, Yao Zhao
Point cloud accumulation aligns and merges 3D LiDAR frames to create dense, comprehensive scene representations that are critical for applications like autonomous driving. Effective accumulation relies on accurate scene flow estimation, yet error propagation and drift, particularly from noise and fast-moving objects, pose significant challenges. Existing clustering-based methods for instance association often falter under these conditions and depend heavily on manual labels, limiting scalability. To address these issues, we propose a Progressive Instance Association (PIA) method that integrates single-frame clustering with an enhanced Unscented Kalman Filter, improving tracking robustness in dynamic scenes. Additionally, our Multi-Dimensional Pseudo Label (MDPL) strategy leverages cross-modal supervision to reduce reliance on manual labels, enhancing scene flow accuracy. Evaluated on the Waymo Open Dataset, our method surpasses state-of-the-art LiDAR-based approaches and performs comparably to multi-modal methods. Qualitative visualizations further demonstrate denser, well-aligned accumulated point clouds.
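To make the association step concrete, here is a stripped-down stand-in: each tracked instance keeps a constant-velocity state, predicts its next centroid, and is greedily matched to the nearest new cluster centroid. A plain linear predict step replaces the paper's enhanced Unscented Kalman Filter, and the distance threshold and state layout are illustrative assumptions.

```python
import numpy as np

class Track:
    def __init__(self, centroid):
        self.x = np.hstack([centroid, np.zeros(3)])      # state: [px, py, pz, vx, vy, vz]

    def predict(self, dt=0.1):
        F = np.eye(6)
        F[:3, 3:] = dt * np.eye(3)                       # constant-velocity motion model
        self.x = F @ self.x
        return self.x[:3]

    def update(self, centroid):
        vel = (centroid - self.x[:3]) / 0.1              # crude velocity refresh
        self.x = np.hstack([centroid, vel])

def associate(tracks, centroids, max_dist=2.0):
    # greedy nearest-neighbour matching between predicted tracks and new cluster centroids
    assignments = {}
    for ti, t in enumerate(tracks):
        pred = t.predict()
        d = np.linalg.norm(centroids - pred, axis=1)
        j = int(np.argmin(d))
        if d[j] < max_dist and j not in assignments.values():
            t.update(centroids[j])
            assignments[ti] = j
    return assignments

tracks = [Track(np.array([1.0, 0.0, 0.0]))]
print(associate(tracks, np.array([[1.05, 0.02, 0.0], [8.0, 3.0, 0.0]])))  # {0: 0}
```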
Citations: 0
Depth error points optimization for 3D Gaussian Splatting in few-shot synthesis
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104682
Xu Jiang, Huiping Deng, Sen Xiang, Li Yu
Few-view 3D reconstruction technology aims to recover the 3D geometric shape of objects or scenes using only a limited number of views. In recent years, with the development of deep learning and 3D rendering technologies, this field has achieved significant progress. Due to the highly similar geometric and appearance features of repetitive texture regions, current few-view 3D reconstruction methods fail to distinguish their local differences during the global reconstruction process, thus frequently resulting in floating artifacts in these regions of the synthesized new views. We propose a method combining monocular depth supervision with depth-error-guided point optimization within the framework of 3D Gaussian Splatting to solve the floating artifact problem in repetitive texture regions under few-view input conditions. Specifically, we calculate a loss function using rendered depth maps and pseudo-true depth maps to achieve depth constraints, and we identify erroneous Gaussian points through depth error maps. For these erroneous point regions, we implement more effective point densification to guide the model in learning more correct geometric shapes in these regions and to synthesize views with fewer floating artifacts. We validate our method on the NeRF-LLFF dataset with different numbers of images. We conduct multiple experiments on randomly selected training images and provide average values to ensure fairness. The experimental results on the LLFF dataset show that our method outperforms the baseline method DRGS, achieving 0.53 dB higher PSNR and 0.021 higher SSIM. This confirms that we effectively reduce floating artifacts in the repetitive texture regions of few-view novel view synthesis.
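The depth-error-guided selection can be pictured as follows: compute the per-pixel discrepancy between the rendered depth and the monocular pseudo-depth, threshold it into an error mask, and flag Gaussians whose projected centres land in masked pixels as candidates for densification. The relative-error threshold and the `pixel_uv` projection input are assumptions; in practice this logic sits inside the 3D Gaussian Splatting training loop.

```python
import torch

def flag_error_gaussians(rendered_depth, pseudo_depth, pixel_uv, tau=0.1):
    # rendered_depth, pseudo_depth: (H, W); pixel_uv: (N, 2) projected Gaussian centres (assumed given)
    err = torch.abs(rendered_depth - pseudo_depth)
    depth_loss = err.mean()                              # depth supervision term
    mask = err > tau * pseudo_depth.clamp(min=1e-3)      # relative depth-error mask
    u = pixel_uv[:, 0].long().clamp(0, err.shape[1] - 1)
    v = pixel_uv[:, 1].long().clamp(0, err.shape[0] - 1)
    return depth_loss, mask[v, u]                        # (N,) bool: candidates for densification

H, W, N = 64, 64, 100
rendered, pseudo = torch.rand(H, W) * 5, torch.rand(H, W) * 5
uv = torch.rand(N, 2) * torch.tensor([W - 1., H - 1.])
loss, flags = flag_error_gaussians(rendered, pseudo, uv)
print(loss.item(), int(flags.sum()))
```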
Citations: 0
SAM-FireAdapter: An adapter for fire segmentation with SAM
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-16 | DOI: 10.1016/j.jvcir.2025.104678
Yanan Wu, Chaoqun Hong, Yongfeng Chen, Haixi Cheng
With the rise of large foundation models, significant advancements have been made in the field of artificial intelligence. The Segment Anything Model (SAM) was specifically designed for image segmentation. However, experiments have demonstrated that SAM may encounter performance limitations in handling specific tasks, such as fire segmentation. To address this challenge, our study explores solutions to effectively adapt the pre-trained SAM model for fire segmentation. The adapter-enhanced approach is introduced to SAM, incorporating effective adapter modules into the segmentation network. The resulting approach, SAM-FireAdapter, incorporates fire-specific features into SAM, significantly enhancing its performance on fire segmentation. Additionally, we propose Fire-Adaptive Attention (FAA), a lightweight attention mechanism module to enhance feature representation. This module reweights the input features before decoding, emphasizing critical spatial features and suppressing less relevant ones. Experimental results demonstrate that SAM-FireAdapter surpasses existing fire segmentation networks including the base SAM.
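Neither module is specified in detail in the abstract, but their general shapes are standard: a residual bottleneck adapter that can be inserted into a frozen SAM encoder block, and a 1x1-convolution spatial gate that reweights features before decoding. The sketch below is generic; the actual SAM-FireAdapter and FAA designs may differ.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter for a frozen transformer block (illustrative)."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                                 # x: (B, N_tokens, dim)
        return x + self.up(self.act(self.down(x)))

class FireAdaptiveAttention(nn.Module):
    """Spatial gate that reweights features before the decoder (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feat):                              # feat: (B, C, H, W)
        w = torch.sigmoid(self.score(feat))               # spatial importance map
        return feat * w                                   # emphasize fire-like regions

print(Adapter(768)(torch.randn(2, 196, 768)).shape)       # torch.Size([2, 196, 768])
print(FireAdaptiveAttention(256)(torch.randn(2, 256, 32, 32)).shape)
```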
Citations: 0
Deep semi-supervised learning method based on sample adaptive weights and discriminative feature learning
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-15 | DOI: 10.1016/j.jvcir.2025.104689
Jiawei Wang, Weiwei Shi, Xiaofan Wang, Xinhong Hei
Semi-supervised learning has achieved significant success through various approaches based on pseudo-labeling and consistency regularization. Despite these efforts, effectively utilizing both labeled and unlabeled data remains a significant challenge. In this study, to enhance the efficient utilization of limited and valuable labeled data, we propose a self-adaptive weight redistribution strategy within a batch. This operation takes into account the heterogeneity of labeled data, adjusting its contribution to the overall loss based on sample-specific losses. This enables the model to more accurately identify challenging samples. Our experiments demonstrate that this weight reallocation strategy significantly enhances the model’s generalization ability. Additionally, to enhance intra-class compactness and inter-class separation of the learned features, we introduce a cosine similarity-based discriminative feature learning regularization term. This regularization term aims to reinforce feature consistency within the same class and enhance feature distinctiveness across different classes. Through this mechanism, we facilitate the model to prioritize learning discriminative feature representations, ensuring that features with authentic labels and those with high-confidence pseudo-labels are grouped together, while simultaneously separating features belonging to different clusters. The method can be combined with mainstream semi-supervised learning methods, which we evaluate experimentally. Our experimental findings illustrate the efficacy of our approach in enhancing the performance of semi-supervised learning tasks across widely utilized image classification datasets.
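Under assumed formulations, the two ingredients could look like the sketch below: a softmax over per-sample losses redistributes weight toward harder labelled samples within a batch, and a cosine-similarity term pulls same-class features together while pushing different-class pairs below a margin. Temperature, margin, and the combination weight are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def adaptive_weighted_ce(logits, labels, temperature=1.0):
    per_sample = F.cross_entropy(logits, labels, reduction='none')       # (B,)
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)    # harder samples get more weight
    return (weights * per_sample).sum()

def cosine_discriminative_reg(features, labels, margin=0.2):
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                                      # (B, B) cosine similarities
    same = (labels[:, None] == labels[None, :]).float()
    eye = torch.eye(len(labels), device=labels.device)
    pull = ((1 - sim) * same * (1 - eye)).sum() / ((same - eye).sum() + 1e-6)   # intra-class compactness
    push = (F.relu(sim - margin) * (1 - same)).sum() / ((1 - same).sum() + 1e-6)  # inter-class separation
    return pull + push

logits, feats = torch.randn(8, 10), torch.randn(8, 128)
labels = torch.randint(0, 10, (8,))
loss = adaptive_weighted_ce(logits, labels) + 0.1 * cosine_discriminative_reg(feats, labels)
print(loss.item())
```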
Citations: 0
SAST: Semantic-Aware stylized Text-to-Image generation
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-15 | DOI: 10.1016/j.jvcir.2025.104685
Xinyue Sun, Jing Guo, Yongzhen Ke, Shuai Yang, Kai Wang, Yemeng Wu
Pre-trained text-to-image diffusion models now generate images of excellent quality, and many users control the generated results with creative text prompts. Because detailed generation requirements cannot be fully expressed in limited language, "stylizing" text-to-image generation with reference images is a common practice. However, there is a style deviation between the images generated by existing methods and the style reference images, contrary to the human expectation that semantically similar object regions in two images of the same style should share that style. To solve this problem, this paper proposes a semantic-aware style transfer method (SAST) that strengthens semantic-level style alignment between the generated image and the style reference image. First, we plug language-driven semantic segmentation trained on the COCO dataset into a general style transfer model to capture the mask of the regions that the text highlights in the style reference image. Similarly, we use the same text to extract masks from the cross-attention layers of the text-to-image model. Based on the two obtained mask maps, we modify the self-attention layers in the diffusion model to control how style features are injected. Experiments show that we achieve better style fidelity and style alignment metrics, indicating that the generated images are more consistent with human perception. Code is available at https://gitee.com/yongzhenke/SAST. Keywords: Text-to-image, Image style transfer, Diffusion model, Semantic alignment.
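One way to picture mask-guided attention injection (the paper's exact modification of the diffusion U-Net's self-attention is not reproduced here) is to bias attention scores so that each query token attends preferentially to key tokens in the same semantic region, as in the simplified function below; the additive bias value and token layout are assumptions.

```python
import torch
import torch.nn.functional as F

def region_masked_attention(q, k, v, region_ids, bias=4.0):
    # q, k, v: (B, N, D) spatial tokens; region_ids: (B, N) semantic-region label per token
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(1, 2)) * scale                     # (B, N, N) attention scores
    same_region = (region_ids[:, :, None] == region_ids[:, None, :]).float()
    attn = attn + bias * same_region                           # favour tokens from the same region
    return F.softmax(attn, dim=-1) @ v

B, N, D = 1, 64, 32
q = k = v = torch.randn(B, N, D)
ids = torch.randint(0, 3, (B, N))                              # e.g. derived from a text-driven segmentation mask
out = region_masked_attention(q, k, v, ids)
print(out.shape)                                               # torch.Size([1, 64, 32])
```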
Citations: 0
Visual saliency fixation via deeply tri-layered multi blended trans-encoder framework
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-13 | DOI: 10.1016/j.jvcir.2025.104676
S. Caroline, Y. Jacob Vetha Raj
The ability to predict where viewers look when examining a scene, also known as saliency prediction or fixation prediction, has attracted considerable interest in computer vision. Incorporating saliency prediction modeling into traditional CNN-based models is challenging. To address this, we developed the Deeply Tri-Layered Multi-Blended Trans-Encoder Framework (DTMBTE) to improve human eye fixation prediction in image saliency tasks. Unlike existing CNN-based methods that struggle with contextual encoding, our model integrates local feature extraction with global attention mechanisms to more accurately forecast saliency regions. We created a new trans-encoder, the Multi Blended Trans-Encoder (MBTE), by combining three different convolution types with encoders that use multi-head attention, which effectively localizes the human eye fixation or saliency area. This combined design efficiently extracts both spatial and contextual information for saliency estimation. Experiments on MIT1003 and CAT2000 show that DTMBTE achieves superior NSS and SIM scores and the minimum EMD compared with competing methods.
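A hypothetical reading of the "multi blended" design is sketched below: three convolution types (standard, dilated, depthwise) are blended by a 1x1 convolution and followed by multi-head self-attention over the spatial tokens, combining local and global cues for saliency. Channel width, head count, and the blending rule are assumptions rather than the DTMBTE specification.

```python
import torch
import torch.nn as nn

class MultiBlendedBlock(nn.Module):
    def __init__(self, ch=64, heads=4):
        super().__init__()
        self.std = nn.Conv2d(ch, ch, 3, padding=1)              # standard convolution
        self.dil = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)  # dilated convolution
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)    # depthwise convolution
        self.blend = nn.Conv2d(3 * ch, ch, 1)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, x):                                        # x: (B, C, H, W)
        local = self.blend(torch.cat([self.std(x), self.dil(x), self.dw(x)], dim=1))
        B, C, H, W = local.shape
        tokens = local.flatten(2).transpose(1, 2)                # (B, H*W, C)
        glob, _ = self.attn(tokens, tokens, tokens)              # global context
        tokens = self.norm(tokens + glob)
        return tokens.transpose(1, 2).reshape(B, C, H, W)

sal_feat = MultiBlendedBlock()(torch.randn(2, 64, 32, 32))
print(sal_feat.shape)                                            # torch.Size([2, 64, 32, 32])
```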
Citations: 0