Deep low light image enhancement via Multi-Task Learning of Few Shot Exposure Imaging
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104693
Yi Wang, Haonan Su, Zhaolin Xiao, Haiyan Jin
Deep low-light enhancement methods typically learn from long-exposure ground truths. While effective for dark scenes, this approach often causes overexposure in HDR scenarios and lacks adaptability to varying illumination levels. Therefore, we develop deep low-light image enhancement via Multi-task Learning of Few-Shot Exposure Imaging (MLFSEI), which is formulated as a Bayesian multi-task directed graphical model and predicts enhanced images by learning few-shot tasks comprising multi-exposure images and their corresponding exposure vectors. The proposed method predicts the enhanced image from the selected exposure vector and the latent variable learned across the few-shot tasks. The exposure vectors are defined as characteristics of the few-shot exposure datasets, comprising the mean, variance, and contrast of the images. Moreover, multi-order gradients are developed to constrain the structure and details with respect to the ground truth. Experimental results demonstrate significant improvements, with average gains of 4.64 dB in PSNR and 0.071 in SSIM, along with an average reduction of 1.12 in NIQE across multiple benchmark datasets compared with state-of-the-art methods. Furthermore, the proposed method can be extended to produce multiple outputs with varying exposure levels within a single model.
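For illustration, the exposure vector described above (mean, variance, and contrast of an image) can be computed in a few lines; the RMS-contrast definition and the [0, 1] intensity range below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def exposure_vector(image):
    """Illustrative exposure descriptor: (mean, variance, contrast).

    `image` is an HxW or HxWxC array with values in [0, 1]. The RMS
    contrast used here is an assumption; the paper does not specify
    its exact contrast measure.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    mean = float(gray.mean())
    var = float(gray.var())
    contrast = float(gray.std())          # RMS contrast
    return np.array([mean, var, contrast], dtype=np.float32)

# Example: descriptors for a few exposures of the same scene
exposures = [np.clip(np.random.rand(64, 64, 3) * s, 0, 1) for s in (0.3, 1.0, 1.6)]
vectors = np.stack([exposure_vector(x) for x in exposures])
print(vectors.shape)  # (3, 3)
```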
{"title":"Deep low light image enhancement via Multi-Task Learning of Few Shot Exposure Imaging","authors":"Yi Wang, Haonan Su, Zhaolin Xiao, Haiyan Jin","doi":"10.1016/j.jvcir.2025.104693","DOIUrl":"10.1016/j.jvcir.2025.104693","url":null,"abstract":"<div><div>Deep low-light enhancement methods typically learn from long-exposure ground truths. While effective for dark scenes, this approach often causes overexposure in HDR scenarios and lacks adaptability to varying illumination levels. Therefore, we develop a deep low light image enhancement via Multi-task Learning of Few Shot Exposure Imaging (MLFSEI) which is formulated as Bayesian multi-task directed graphical model and predict the enhanced images by learning few-shot tasks comprising multi-exposure images and their corresponding exposure vectors. The proposed method predicts the enhanced image from the selected exposure vector and the learned latent variable among few tasks. The exposure vectors are defined as the characteristics of few shot exposure datasets containing mean, variance and contrast of images. Moreover, the multi order gradients are developed to constrain the structure and details from the ground truth. Experimental results demonstrate significant improvements, with average gains of 4.64 dB in PSNR and 0.071 in SSIM, along with an average reduction of 1.12 in NIQE across multiple benchmark datasets compared to state-of-the-art methods. Furthermore, the proposed method can be extended to accommodate multiple outputs with varying exposure levels among one model.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104693"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Position-rotation graph and elevation partitioning strategy for traffic police gesture recognition
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104698
Jian He, Rongqi Cao, Cheng Zhang, Suyu Wang
Traffic police gesture recognition is important in autonomous driving. Most existing methods rely on extracting pixel-level features from RGB images, which lack interpretability due to the absence of explicit skeletal gesture features. Current deep learning approaches often fail to effectively model skeletal gesture information because they ignore the inherent connections between joint coordinate data and gesture semantics. Additionally, many methods fail to integrate multi-modal skeletal information (such as joint positions, rotations, and root orientation), limiting their ability to capture cross-modal correlations. Beyond methodological limitations, existing datasets often lack diversity in commanding directions, hindering fine-grained recognition of gestures intended for different traffic flows. To address these limitations, this paper presents the CTPGesture v2 dataset with Chinese traffic police gestures that command vehicles in four directions and proposes a skeleton-based graph convolution method for continuous gesture recognition. Specifically, a position-rotation graph (PR-Graph) is constructed with joint positions, rotations, and root rotations all in the same graph to enrich the graph’s representational power. An elevation partitioning strategy (EPS) is introduced to address the shortcutting issue of the conventional spatial configuration partitioning strategy (SCPS). Experiments demonstrate that our method achieves a Jaccard score of 0.842 on CTPGesture v2 at 31.9 FPS, improving on previous works. The proposed PR-Graph and EPS establish a more descriptive graph for the GCN and help capture cross-modality correlations during the graph convolution stages. Our code is available at https://github.com/crq0528/RT-VIBT. Our datasets are available at https://github.com/crq0528/traffic-gesture-datasets.
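The elevation partitioning strategy is only named above, so the sketch below is a hedged guess at the general idea: grouping each joint's skeleton neighbors by whether they lie above, level with, or below it along the vertical axis, by analogy with the spatial configuration partitioning it replaces. The joint layout, vertical-axis convention, and threshold are hypothetical.

```python
import numpy as np

def elevation_partition(joints_xyz, edges, eps=0.02):
    """Hypothetical elevation-based neighbor partitioning.

    joints_xyz: (J, 3) array of joint coordinates (y assumed vertical).
    edges: list of (i, j) skeleton bones.
    Returns three adjacency masks (neighbors above, level with, and below
    each joint) that a GCN could use as separate partitions.
    """
    J = joints_xyz.shape[0]
    above = np.zeros((J, J), dtype=np.float32)
    level = np.eye(J, dtype=np.float32)       # self-loops go to the "level" group
    below = np.zeros((J, J), dtype=np.float32)
    for i, j in edges:
        for a, b in ((i, j), (j, i)):
            dy = joints_xyz[b, 1] - joints_xyz[a, 1]
            if dy > eps:
                above[a, b] = 1.0
            elif dy < -eps:
                below[a, b] = 1.0
            else:
                level[a, b] = 1.0
    return above, level, below

# Toy example: 4 joints in a chain
coords = np.array([[0.0, 0.0, 0.0], [0.0, 0.3, 0.0], [0.0, 0.6, 0.0], [0.1, 0.6, 0.0]])
masks = elevation_partition(coords, edges=[(0, 1), (1, 2), (2, 3)])
print([m.sum() for m in masks])
```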
{"title":"Position-rotation graph and elevation partitioning strategy for traffic police gesture recognition","authors":"Jian He , Rongqi Cao , Cheng Zhang , Suyu Wang","doi":"10.1016/j.jvcir.2025.104698","DOIUrl":"10.1016/j.jvcir.2025.104698","url":null,"abstract":"<div><div>Traffic police gesture recognition is important in autonomous driving. Most existing methods rely on extracting pixel-level features from RGB images, which lack interpretability due to the absence of explicit skeletal gesture features. Current deep learning approaches often fail to effectively model skeletal gesture information because they ignore the inherent connections between joint coordinate data and gesture semantics. Additionally, many methods fail to integrate multi-modal skeletal information (such as joint positions, rotations, and root orientation), limiting their ability to capture cross-modal correlations. Beyond methodological limitations, existing datasets often lack diversity in commanding directions, hindering fine-grained recognition of gestures intended for different traffic flows. To address these limitations, this paper presents the CTPGesture v2 dataset with Chinese traffic police gestures that command vehicles in four directions and proposes a skeleton-based graph convolution method for continuous gesture recognition. Specifically, a position-rotation graph (PR-Graph) is constructed with joint positions, rotations, and root rotations all in the same graph to enrich the graph’s representational power. An elevation partitioning strategy (EPS) is introduced to address the shortcutting issue of the conventional spatial configuration partitioning strategy (SCPS). Experiments demonstrate our method achieves 0.842 Jaccard score on CTPGesture v2 at 31.9 FPS, improving over previous works. The proposed PR-Graph and EPS establish a more descriptive graph for GCN and help capture cross-modality correlations during the graph convolution stages. Our code is available at <span><span>https://github.com/crq0528/RT-VIBT</span><svg><path></path></svg></span>. Our datasets are available at <span><span>https://github.com/crq0528/traffic-gesture-datasets</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104698"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Regional decay attention for image shadow removal
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104694
Xiujin Zhu, Chee-Onn Chow, Joon Huang Chuah
Existing Transformer-based shadow removal methods are limited by fixed window sizes, making it difficult to effectively model global information. In addition, they do not fully utilize the distance prior in shadow images. This study argues that shadow removal should model brightness variations between regions from a global perspective. Non-shadow areas near the shadow boundaries are the most important for restoring brightness in shadow regions, and their importance gradually decreases as the distance increases. To achieve this, a regional decay attention mechanism is proposed, which introduces a positional decay bias into the self-attention computation to enable dynamic modeling of contributions from different spatial positions. A local perception module is introduced to improve the model’s ability to capture local details, and a shadow removal model named FW-Former is developed. This model achieves superior performance across multiple datasets, demonstrates stable generalization capability, and maintains a low parameter count.
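A minimal sketch of the stated mechanism, adding a distance-based decay bias to the self-attention logits so that positions farther from a query contribute less, is given below. The linear-decay form of the bias, the single head, and the NumPy formulation are illustrative assumptions rather than the FW-Former design.

```python
import numpy as np

def regional_decay_attention(q, k, v, coords, decay=0.1):
    """Single-head attention with a distance-based decay bias (illustrative).

    q, k, v: (N, d) token features for N spatial positions.
    coords:  (N, 2) pixel coordinates of those positions.
    The bias -decay * ||p_i - p_j|| down-weights far-away positions;
    the exact bias in the paper's regional decay attention may differ.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                       # (N, N) similarity
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    logits = logits - decay * dist                      # positional decay bias
    logits -= logits.max(axis=-1, keepdims=True)        # stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

# Toy usage on an 8x8 feature map with 16-dim tokens
h = w = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((h * w, 16))
yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
coords = np.stack([yy.ravel(), xx.ravel()], axis=-1).astype(np.float32)
print(regional_decay_attention(x, x, x, coords).shape)  # (64, 16)
```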
{"title":"Regional decay attention for image shadow removal","authors":"Xiujin Zhu , Chee-Onn Chow , Joon Huang Chuah","doi":"10.1016/j.jvcir.2025.104694","DOIUrl":"10.1016/j.jvcir.2025.104694","url":null,"abstract":"<div><div>Existing Transformer-based shadow removal methods are limited by fixed window sizes, making it difficult to effectively model global information. In addition, they do not fully utilize the distance prior in shadow images. This study argues that shadow removal should model brightness variations between regions from a global perspective. Non-shadow areas near the shadow boundaries are the most important for restoring brightness in shadow regions, and their importance gradually decreases as the distance increases. To achieve this, a regional decay attention mechanism is proposed, which introduces a positional decay bias into the self-attention computation to enable dynamic modeling of contributions from different spatial positions. A local perception module is introduced to improve the model’s ability to capture local details, and a shadow removal model named FW-Former is developed. This model achieves superior performance across multiple datasets, demonstrates stable generalization capability, and maintains a low parameter count.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104694"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DSFusion: A dual-branch step-by-step fusion network for medical image fusion
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104700
Nili Tian, YuLong Ling, Qing Pan
To address the insufficient consideration of multi-dimensional modal image features in existing medical image fusion methods, this paper proposes a novel dual-branch step-by-step fusion (DSFusion) network. The dual sequence extraction block (DSE) is used for initial feature extraction, followed by the multi-scale lightweight residual (MLRes) block for enhanced efficiency and generalization. Features are then fused through the global pixel-level multi-dimensional fusion (GPMF) module, comprising a multi-dimensional feature extraction block (MFEB) and a pixel-level global fusion branch (PGFB). Finally, fused features are reconstructed into the final image. Experiments performed on datasets of different modalities demonstrate that DSFusion achieves competitive or superior performance on multiple evaluation indicators, including the Q_MI, PSNR, and Q_P metrics.
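For context, PSNR and a mutual-information score of the kind Q_MI is built on can be sketched as follows; the histogram-based estimate is a simplified, unnormalized stand-in (published Q_MI variants normalize the two MI terms, e.g. by image entropies), so this is illustrative only.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio between two images of the same shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def mutual_information(a, b, bins=64):
    """Histogram-based MI estimate between two 8-bit images (simplified)."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def q_mi_like(a, b, fused, bins=64):
    """Rough Q_MI-style score: MI(A, F) + MI(B, F), without normalization."""
    return mutual_information(a, fused, bins) + mutual_information(b, fused, bins)

a = np.random.randint(0, 256, (128, 128))
b = np.random.randint(0, 256, (128, 128))
print(psnr(a, a), q_mi_like(a, b, (a + b) // 2))
```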
{"title":"DSFusion: A dual-branch step-by-step fusion network for medical image fusion","authors":"Nili Tian, YuLong Ling, Qing Pan","doi":"10.1016/j.jvcir.2025.104700","DOIUrl":"10.1016/j.jvcir.2025.104700","url":null,"abstract":"<div><div>To address the limitations of insufficient consideration of multi-dimensional modal image features in existing medical image fusion methods, this paper proposes a novel dual-branch step-by-step fusion (DSFusion) network. The dual sequence extraction block (DSE) is used for initial feature extraction, followed by the multi-scale lightweight residual (MLRes) block for enhanced efficiency and generalization. Features are then fused through the global pixel-level multi-dimensional fusion (GPMF) module, comprising a multi-dimensional feature extraction block (MFEB) and a pixel-level global fusion branch (PGFB). Finally, fused features are reconstructed into the final image. Experiments performed on different modality datasets demonstrate that DSFusion achieves competitive or superior performance across multiple evaluation metrics in the four indicators as <span><math><msub><mrow><mi>Q</mi></mrow><mrow><mi>M</mi><mi>I</mi></mrow></msub></math></span>, <span><math><mrow><mi>P</mi><mi>S</mi><mi>N</mi><mi>R</mi></mrow></math></span>, and <span><math><msub><mrow><mi>Q</mi></mrow><mrow><mi>P</mi></mrow></msub></math></span> metrics.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104700"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PV3M-YOLO: A triple attention-enhanced model for detecting pedestrians and vehicles in UAV-enabled smart transport networks
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104701
Noor Ul Ain Tahir, Li Kuang, Melikamu Liyih Sinishaw, Muhammad Asim
Pedestrian and vehicle detection in aerial images remains a challenging task due to small object sizes, scale variation, and occlusion, often resulting in missed detections or the need for overly complex models. This study introduces the PV3M-YOLO approach, which incorporates three key focus modules to enhance the detection of small and hard-to-detect objects. The proposed model integrates lightweight Ghost convolution, Convolutional Block Attention Module (CBAM), and Coordination Attention (CA) modules with optimized feature aggregation (C2f) into a unified architecture. These modifications improve the model's capacity to capture essential spatial details and contextual dependencies without increasing computational complexity. Furthermore, the Wise-IoUv3 loss function is employed to reduce the influence of low-quality examples, enhancing localization and reducing erroneous identifications by attenuating negative gradients. Experimental evaluations on the VisDrone2019 dataset demonstrate that PV3M-YOLO achieves a mAP@0.5 of 45.4% and a mAP@0.5:95 of 27.9%, surpassing the baseline by 3.9% and 2.7%, respectively. The model maintains efficiency with a compact size of 45.3 MB and a runtime of 8.8 ms. However, the detection of extremely small objects remains a limitation due to the high-altitude nature of aerial imagery, indicating the need for future model enhancements targeting ultra-small object detection.
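Of the components named above, CBAM is a standard building block; a generic PyTorch sketch under the usual formulation (channel attention from average- and max-pooled descriptors, then spatial attention from channel-pooled maps) is shown below. The channel count, reduction ratio, and kernel size are placeholders, and the paper's integration with Ghost convolution, CA, and C2f is not reproduced.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Generic CBAM block: channel attention followed by spatial attention."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention: MLP over global average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: conv over channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(1, 64, 40, 40)    # e.g. a neck feature map
print(CBAM(64)(feat).shape)          # torch.Size([1, 64, 40, 40])
```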
{"title":"PV3M-YOLO: A triple attention-enhanced model for detecting pedestrians and vehicles in UAV-enabled smart transport networks","authors":"Noor Ul Ain Tahir , Li Kuang , Melikamu Liyih Sinishaw , Muhammad Asim","doi":"10.1016/j.jvcir.2025.104701","DOIUrl":"10.1016/j.jvcir.2025.104701","url":null,"abstract":"<div><div>Pedestrian and vehicle detection in aerial images remains a challenging task due to small object sizes, scale variation, and occlusion, often resulting in missed detections or the need for overly complex models. This study introduces the PV3M-YOLO approach, which incorporates three key focus modules to enhance the detection of small and undetectable objects. The proposed model integrates lightweight Ghost convolution, Convolutional Block Attention Module (CBAM), and Coordination Attention (CA) modules with optimized feature aggregation (C2f) into a unified architecture. These modifications improve the capacity of model to capture essential spatial details and contextual dependencies without increasing computational complexity. Furthermore, the Wise-IoUv3 loss calculus is employed to reduce the influence of subpar cases, enhancing localization and reducing erroneous identifications by reducing negative gradients. Experimental evaluations on VisDrone2019 dataset demonstrate that PV3M-YOLO achieves a [email protected] of 45.4% and [email protected]:95 of 27.9%, surpassing the baseline by 3.9% and 2.7% respectively. The model maintains efficiency with a compact size of 45.3 MB and a runtime of 8.8 ms. However, the detection of extremely small objects remains a limitation due to high-altitude nature of aerial imagery, indicating the need for future model enhancements targeting ultra-small object detection.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104701"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective face recognition from video using enhanced social collie optimization-based deep convolutional neural network technique
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104639
Jitendra Chandrakant Musale, Anuj kumar Singh, Swati Shirke
A key feature of video surveillance systems is face recognition, which allows the identification and verification of people who appear in scenes frequently captured by a distributed network of cameras. The scientific community is interested in recognizing individuals' faces in videos, partly because of the potential applications and partly because of the difficulty this poses for artificial vision algorithms. A deep convolutional neural network is used to recognize faces from the set of provided video samples via the hybrid weighted texture pattern descriptor (HWTP). The deep CNN parameters are tuned by Enhanced Social Collie Optimization (ESCO), which determines a better solution through various strategies; in this way, an individual's face is identified using the optimal parameters. The attained accuracy, precision, recall, and F-measure of the proposed model are 87.92%, 88.01%, 88.01%, and 88.01%, respectively, for 500 retrievals.
{"title":"Effective face recognition from video using enhanced social collie optimization-based deep convolutional neural network technique","authors":"Jitendra Chandrakant Musale , Anuj kumar Singh , Swati Shirke","doi":"10.1016/j.jvcir.2025.104639","DOIUrl":"10.1016/j.jvcir.2025.104639","url":null,"abstract":"<div><div>A key feature of video surveillance systems is face recognition, which allows the identification and verification of people who appear in scenes frequently collected by a distributed network of cameras. The scientific community is interested in recognizing the individuals faces in videos, in part due to the potential applications also due to the difficulty in the artificial vision algorithms. The deep convolutional neural network is utilized to recognize the face from the set of provided video samples by the hybrid weighted texture pattern descriptor (HWTP). The deep CNN parameter is tuned by the Enhanced social collie optimization (ESCO), which determines the better solution by the various strategies, similar to this, the face of an individual is identified using optimum parameters. The attained accuracy, precision, recall, and F-measure of the proposed model is 87.92 %, 88.01 %, 88.01 %, and 88.01 % for the number of retrieval 500, respectively.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104639"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Knowledge distillation meets video foundation models: A video saliency prediction case study
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2026.104706
Morteza Moradi, Mohammad Moradi, Concetto Spampinato, Ali Borji, Simone Palazzo
Modeling spatio-temporal dynamics remains a major challenge and critical factor for effective video saliency prediction (VSP). The evolution from LSTM and 3D convolutional networks to vision transformers has sparked numerous innovations for tackling this complex video understanding task. However, current technologies still struggle to capture short- and long-term frame dependencies simultaneously. The emergence of large-scale video models has introduced unprecedented opportunities to overcome these limitations but poses significant practical challenges due to their substantial parameter counts and computational costs. To address this, we propose leveraging knowledge distillation—an approach yet to be fully explored in VSP solutions. Specifically, we employ THTD-Net, a leading transformer-based VSP architecture, as the student network, guided by a newly developed large-scale VSP model serving as the teacher. Evaluations on benchmark datasets confirm the efficacy of this novel approach, demonstrating promising performance and substantially reducing the complexity required for real-world applications.
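A minimal sketch of the distillation objective implied here, assuming the standard response-distillation recipe: the student saliency map is trained against the ground truth and, via a temperature-softened KL term, against the frozen teacher's prediction. The loss weighting, temperature, and pixel-wise softmax treatment are assumptions; THTD-Net and the teacher model are not reproduced.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_map, teacher_map, gt_map, alpha=0.5, tau=2.0):
    """KL-based response distillation for saliency maps (illustrative).

    student_map, teacher_map: (B, 1, H, W) predicted saliency logits.
    gt_map: (B, 1, H, W) ground-truth saliency in [0, 1].
    Each map is treated as a distribution over pixels, softened by tau.
    """
    b = student_map.size(0)
    s = F.log_softmax(student_map.view(b, -1) / tau, dim=1)
    t = F.softmax(teacher_map.view(b, -1) / tau, dim=1)
    kd = F.kl_div(s, t, reduction="batchmean") * (tau ** 2)
    sup = F.binary_cross_entropy_with_logits(student_map, gt_map)
    return alpha * kd + (1 - alpha) * sup

student = torch.randn(2, 1, 32, 32, requires_grad=True)
with torch.no_grad():                      # the large teacher is frozen
    teacher = torch.randn(2, 1, 32, 32)
gt = torch.rand(2, 1, 32, 32)
loss = distillation_loss(student, teacher, gt)
loss.backward()
print(float(loss))
```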
{"title":"Knowledge distillation meets video foundation models: A video saliency prediction case study","authors":"Morteza Moradi , Mohammad Moradi , Concetto Spampinato , Ali Borji , Simone Palazzo","doi":"10.1016/j.jvcir.2026.104706","DOIUrl":"10.1016/j.jvcir.2026.104706","url":null,"abstract":"<div><div>Modeling spatio-temporal dynamics remains a major challenge and critical factor for effective video saliency prediction (VSP). The evolution from LSTM and 3D convolutional networks to vision transformers has sparked numerous innovations for tackling this complex video understanding task. However, current technologies still struggle to capture short- and long-term frame dependencies simultaneously. The emergence of large-scale video models has introduced unprecedented opportunities to overcome these limitations but poses significant practical challenges due to their substantial parameter counts and computational costs. To address this, we propose leveraging knowledge distillation—an approach yet to be fully explored in VSP solutions. Specifically, we employ THTD-Net, a leading transformer-based VSP architecture, as the student network, guided by a newly developed large-scale VSP model serving as the teacher. Evaluations on benchmark datasets confirm the efficacy of this novel approach, demonstrating promising performance and substantially reducing the complexity required for real-world applications.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104706"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iterative mutual voting matching for efficient and accurate Structure-from-Motion
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104697
Suning Ge, Chunlin Ren, Ya’nan He, Linjie Li, Jiaqi Yang, Kun Sun, Yanning Zhang
As a crucial topic in 3D vision, Structure-from-Motion (SfM) aims to recover camera poses and 3D structures from unconstrained images. Performing pairwise image matching is a critical step. Typically, matching relationships are represented as a view graph, but the initial graph often contains redundant or potentially false edges, affecting both efficiency and accuracy. We propose an efficient incremental SfM method that optimizes the critical image matching step. Specifically, given an image similarity graph, an initialized weighted view graph is constructed. Next, the vertices and edges of the graph are treated as candidates and voters, with iterative mutual voting performed to score image pairs until convergence. Then, the optimal subgraph is extracted using the maximum spanning tree (MST). Finally, incremental reconstruction is carried out based on the selected images. We demonstrate the efficiency and accuracy of our method on general datasets and ambiguous datasets.
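A rough sketch of the view-graph pruning pipeline described above, under an assumed voting rule: image (node) scores are accumulated from the edges that vote for them, edge scores are then re-weighted by their endpoint images, the two steps alternate until the edge scores stabilize, and a maximum spanning tree of the weighted view graph is kept as the optimal subgraph. The update and convergence rules are assumptions; networkx is used only for the MST step.

```python
import networkx as nx
import numpy as np

def vote_and_prune(similarity_edges, n_images, iters=50, tol=1e-6):
    """Iterative mutual voting between images (nodes) and pairs (edges),
    followed by maximum-spanning-tree extraction (illustrative sketch).

    similarity_edges: list of (i, j, sim) tuples with sim in (0, 1].
    """
    edge = np.array([s for _, _, s in similarity_edges], dtype=float)
    for _ in range(iters):
        # Images (nodes) are scored by the pairs (edges) that vote for them...
        node = np.zeros(n_images)
        for w, (i, j, _) in zip(edge, similarity_edges):
            node[i] += w
            node[j] += w
        node /= max(node.max(), 1e-12)
        # ...and pairs are re-scored by the images they connect.
        new_edge = np.array([s * 0.5 * (node[i] + node[j])
                             for i, j, s in similarity_edges])
        new_edge /= max(new_edge.max(), 1e-12)
        if np.abs(new_edge - edge).max() < tol:
            break
        edge = new_edge
    g = nx.Graph()
    g.add_weighted_edges_from((i, j, w) for w, (i, j, _) in zip(edge, similarity_edges))
    return nx.maximum_spanning_tree(g)   # optimal subgraph for reconstruction

tree = vote_and_prune([(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.3), (2, 3, 0.7)], 4)
print(sorted(tree.edges()))
```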
{"title":"Iterative mutual voting matching for efficient and accurate Structure-from-Motion","authors":"Suning Ge , Chunlin Ren , Ya’nan He , Linjie Li , Jiaqi Yang , Kun Sun , Yanning Zhang","doi":"10.1016/j.jvcir.2025.104697","DOIUrl":"10.1016/j.jvcir.2025.104697","url":null,"abstract":"<div><div>As a crucial topic in 3D vision, Structure-from-Motion (SfM) aims to recover camera poses and 3D structures from unconstrained images. Performing pairwise image matching is a critical step. Typically, matching relationships are represented as a view graph, but the initial graph often contains redundant or potentially false edges, affecting both efficiency and accuracy. We propose an efficient incremental SfM method that optimizes the critical image matching step. Specifically, given an image similarity graph, an initialized weighted view graph is constructed. Next, the vertices and edges of the graph are treated as candidates and voters, with iterative mutual voting performed to score image pairs until convergence. Then, the optimal subgraph is extracted using the maximum spanning tree (MST). Finally, incremental reconstruction is carried out based on the selected images. We demonstrate the efficiency and accuracy of our method on general datasets and ambiguous datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104697"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An optical remote sensing ship detection model based on feature diffusion and higher-order relationship modeling
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104695
Chunman Yan, Ningning Qi
Ship detection plays an increasingly important role in the field of marine monitoring, with Optical Remote Sensing (ORS) technology providing high-resolution spatial and texture information support. However, existing ship detection methods still face significant challenges in accurately detecting small targets, suppressing complex background interference, and modeling cross-scale semantic relationships, limiting their effectiveness in practical applications. Inspired by feature diffusion theory and higher-order spatial interaction mechanisms, this paper proposes a ship detection model for Optical Remote Sensing imagery. Specifically, to address the problem of fine-grained information loss during feature downsampling, the Single-branch and Dual-branch Residual Feature Downsampling (SRFD and DRFD) modules are designed to enhance small target preservation and multi-scale robustness. To capture long-range spatial dependencies and improve robustness against target rotation variations, the Fast Spatial Pyramid Pooling module based on Large Kernel Separable Convolution Attention (SPPF-LSKA) is introduced, enabling efficient large receptive field modeling with rotation-invariant constraints. Furthermore, to dynamically model complex semantic dependencies between different feature scales, the Feature Diffusion Pyramid Network (FDPN) is proposed based on continuous feature diffusion and cross-scale graph reasoning. Experimental results show that the model achieves an AP50 of 86.2% and an AP50-95 of 58.0% on multiple remote sensing ship detection datasets, with the number of parameters reduced to 2.6 M and the model size compressed to 5.5 MB, significantly outperforming several state-of-the-art models in terms of both detection accuracy and lightweight deployment. These results demonstrate the detection capability, robustness, and application potential of the proposed model in Optical Remote Sensing ship monitoring tasks.
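As a rough illustration of the large-kernel separable attention idea behind SPPF-LSKA, the block below approximates a large depthwise kernel with 1D horizontal and vertical depthwise convolutions and uses the result as a multiplicative attention map. The kernel size, the absence of dilation, and how the block attaches to the SPPF module are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LargeKernelSeparableAttention(nn.Module):
    """Depthwise 1D decomposition of a large-kernel attention map (sketch)."""

    def __init__(self, channels, kernel=23):
        super().__init__()
        pad = kernel // 2
        # A k x k depthwise kernel is approximated by 1 x k and k x 1 passes.
        self.h = nn.Conv2d(channels, channels, (1, kernel),
                           padding=(0, pad), groups=channels, bias=False)
        self.v = nn.Conv2d(channels, channels, (kernel, 1),
                           padding=(pad, 0), groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)  # channel mixing

    def forward(self, x):
        attn = self.pw(self.v(self.h(x)))   # large-receptive-field attention map
        return x * attn                     # modulate the input features

x = torch.randn(1, 128, 20, 20)
print(LargeKernelSeparableAttention(128)(x).shape)  # torch.Size([1, 128, 20, 20])
```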
{"title":"An optical remote sensing ship detection model based on feature diffusion and higher-order relationship modeling","authors":"Chunman Yan, Ningning Qi","doi":"10.1016/j.jvcir.2025.104695","DOIUrl":"10.1016/j.jvcir.2025.104695","url":null,"abstract":"<div><div>Ship detection plays an increasingly important role in the field of marine monitoring, with Optical Remote Sensing (ORS) technology providing high-resolution spatial and texture information support. However, existing ship detection methods still face significant challenges in accurately detecting small targets, suppressing complex background interference, and modeling cross-scale semantic relationships, limiting their effectiveness in practical applications. Inspired by feature diffusion theory and higher-order spatial interaction mechanisms, this paper proposes a ship detection model for Optical Remote Sensing imagery. Specifically, to address the problem of fine-grained information loss during feature downsampling, the Single-branch and Dual-branch Residual Feature Downsampling (SRFD and DRFD) modules are designed to enhance small target preservation and multi-scale robustness. To capture long-range spatial dependencies and improve robustness against target rotation variations, the Fast Spatial Pyramid Pooling module based on Large Kernel Separable Convolution Attention (SPPF-LSKA) is introduced, enabling efficient large receptive field modeling with rotation-invariant constraints. Furthermore, to dynamically model complex semantic dependencies between different feature scales, the Feature Diffusion Pyramid Network (FDPN) is proposed based on continuous feature diffusion and cross-scale graph reasoning. Experimental results show that model achieves an <em>AP<sub>50</sub></em> of 86.2 % and an <em>AP<sub>50-95</sub></em> of 58.0 % on multiple remote sensing ship detection datasets, with the number of parameters reduced to 2.6 M and the model size compressed to 5.5 MB, significantly outperforming several state-of-the-art models in terms of both detection accuracy and lightweight deployment. These results demonstrate the detection capability, robustness, and application potential of the proposed model in Optical Remote Sensing ship monitoring tasks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104695"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Corrigendum to “Lightweight macro-pixel quality enhancement network for light field images compressed by versatile video coding” [J. Vis. Commun. Image Represent. 105 (2024) 104329]
Pub Date: 2026-01-01, DOI: 10.1016/j.jvcir.2025.104673
Hongyue Huang, Chen Cui, Chuanmin Jia, Xinfeng Zhang, Siwei Ma
{"title":"Corrigendum to “Lightweight macro-pixel quality enhancement network for light field images compressed by versatile video coding” [J. Vis. Commun. Image Represent. 105 (2024) 104329]","authors":"Hongyue Huang , Chen Cui , Chuanmin Jia , Xinfeng Zhang , Siwei Ma","doi":"10.1016/j.jvcir.2025.104673","DOIUrl":"10.1016/j.jvcir.2025.104673","url":null,"abstract":"","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104673"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146037533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}