Evaluating the effect of image quantity on Gaussian Splatting: A statistical perspective
Pub Date: 2025-11-25 | DOI: 10.1016/j.cviu.2025.104575
Anurag Dalal, Daniel Hagen, Kjell Gunnar Robbersmyr, Kristian Muri Knausgård
3D reconstruction is now a key capability in computer vision. With the advancements in NeRFs and Gaussian Splatting, there is an increasing need to properly capture data to feed these algorithms and use them in real-world scenarios. Most publicly available datasets that can be used for Gaussian Splatting are not suitable for a proper statistical analysis of reducing the number of cameras, or of the effect of uniformly placed cameras versus randomly placed cameras. The number of cameras in the scene significantly affects the accuracy and resolution of the final 3D reconstruction. Thus, designing a proper data capture system with a certain number of cameras is crucial for 3D reconstruction. In this paper, the UnrealGaussianStat dataset is introduced, and a statistical analysis is performed on the effect that decreasing the number of viewpoints has on Gaussian Splatting. It is found that when the number of cameras is increased beyond 100, the train and test metrics saturate and additional cameras do not have a significant impact on the reconstruction quality.
{"title":"Evaluating the effect of image quantity on Gaussian Splatting: A statistical perspective","authors":"Anurag Dalal, Daniel Hagen, Kjell Gunnar Robbersmyr, Kristian Muri Knausgård","doi":"10.1016/j.cviu.2025.104575","DOIUrl":"10.1016/j.cviu.2025.104575","url":null,"abstract":"<div><div>3D reconstruction is now a key capability in computer vision. With the advancements in NeRFs and Gaussian Splatting, there is an increasing need on properly capturing data to feed these algorithms and use them in real world scenarios. Most publicly available datasets that can be used for Gaussian Splatting are not suitable to do proper statistical analysis on reducing the number of cameras or the effect of uniformly placed cameras versus randomly placed cameras. The number of cameras in the scene significantly affects the accuracy and resolution of the final 3D reconstruction. Thus, designing a proper data capture system with a certain number of cameras is crucial for 3D reconstruction. In this paper UnrealGaussianStat dataset is introduced, and a statistical analysis is performed on decreasing viewpoints have on Gaussian splatting. It is found that when the number of cameras is increased after 100 the train and test metrics saturates, and does not have significant impact on the reconstruction quality.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104575"},"PeriodicalIF":3.5,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLIE-Face: A multi-modal dataset for low-light facial image enhancement
Pub Date: 2025-11-24 | DOI: 10.1016/j.cviu.2025.104576
Haoyuan Sun, Dahua Gao, Pengfei He, Xiaoqian Li, Fuming Wang
Low-Light Image Enhancement (LLIE) plays a crucial role in the field of computer vision, particularly in tasks such as face recognition and surveillance systems, where clear visual information is essential. However, existing LLIE paired datasets are scarce, especially those focused on faces, which has somewhat limited the development of robust LLIE methods. Furthermore, existing LLIE methods still have performance bottlenecks under extreme low-light conditions. To address these challenges, inspired by the ability of infrared images to provide additional details and contrast information unaffected by lighting conditions, we propose a new dataset, LLIE-Face, which contains 500 pairs of low-light, infrared, and normal-light facial images. Based on this dataset, we design a Brightness and Structure Decoupling Network (BSDNet), which uses two branches to process the brightness and structural information of the image separately. The goal is to enhance the brightness while simultaneously recovering fine details. Additionally, we introduce the Cross Attention State Space Model (CASSM) module, designed to facilitate effective interaction between brightness and structural information. Finally, we fully consider the intrinsic relationship between low-light image enhancement and image fusion to achieve effective image enhancement. Using the LLIE-Face dataset, we train and evaluate both BSDNet and SOTA models, conducting comprehensive benchmarking. Experimental results demonstrate that the proposed method significantly improves image contrast, detail clarity, and visual quality under extreme low-light conditions.
{"title":"LLIE-Face: A multi-modal dataset for low-light facial image enhancement","authors":"Haoyuan Sun , Dahua Gao , Pengfei He , Xiaoqian Li , Fuming Wang","doi":"10.1016/j.cviu.2025.104576","DOIUrl":"10.1016/j.cviu.2025.104576","url":null,"abstract":"<div><div>Low-Light Image Enhancement (LLIE) plays a crucial role in the field of computer vision, particularly in tasks such as face recognition and surveillance systems, where clear visual information is significant. However, existing LLIE paired datasets are scarce, especially those focused on facial datasets, which has somewhat limited the development of robust LLIE methods. Furthermore, existing LLIE methods still have performance bottlenecks under extreme low-light conditions. To address these challenges, inspired by the ability of infrared images to provide additional details and contrast information unaffected by lighting conditions, we propose a new dataset, LLIE-Face, which contains 500 pairs of low-light, infrared, and normal-light facial images. Based on this dataset, we design a Brightness and Structure Decoupling Network (BSDNet), which uses two branches to process the brightness and structural information of the image separately. The goal is to enhance the brightness while simultaneously recovering fine details. Additionally, we introduce the Cross Attention State Space Model (CASSM) module, designed to facilitate effective interaction between brightness and structural information. Finally, we fully consider the intrinsic relationship between low-light image enhancement and image fusion, achieve effective image enhancement. Using the LLIE-Face dataset, we train and evaluate both BSDNet and SOTA models, conducting comprehensive benchmarking. Experimental results demonstrate that the proposed method significantly improves image contrast, detail clarity, and visual quality under extreme low-light conditions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104576"},"PeriodicalIF":3.5,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Question-guided multigranular visual augmentation for knowledge-based visual question answering
Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104569
Jing Liu, Lizong Zhang, Chong Mu, Guangxi Lu, Ben Zhang, Junsong Li
In knowledge-based visual question answering, most current research focuses on the integration of external knowledge with VQA systems. However, the extraction of visual features within knowledge-based VQA remains relatively unexplored. This is surprising since, even for the same image, answering different questions requires attention to different visual regions. In this paper, we propose a novel question-guided multigranular visual augmentation method for knowledge-based VQA tasks. Our method uses input questions to identify and focus on question-related regions within the image, which improves prediction quality. Specifically, our method first performs semantic embedding learning for questions at both the word level and the phrase level. To preserve rich visual information for QA, our method uses questions as a guide to extract question-related visual features. This is implemented by multiple convolution operations, in which the convolutional kernels are dynamically derived from the representations of questions. By capturing visual information from diverse perspectives, our method extracts information at the word level, phrase level, and common level more comprehensively. Additionally, relevant knowledge is retrieved from a knowledge graph through entity linking and random walk techniques to respond to the question. A series of experiments are conducted on public knowledge-based VQA datasets to demonstrate the effectiveness of our model. The experimental results show that our method achieves state-of-the-art performance.
{"title":"Question-guided multigranular visual augmentation for knowledge-based visual question answering","authors":"Jing Liu , Lizong Zhang , Chong Mu , Guangxi Lu , Ben Zhang , Junsong Li","doi":"10.1016/j.cviu.2025.104569","DOIUrl":"10.1016/j.cviu.2025.104569","url":null,"abstract":"<div><div>In knowledge-based visual question answering, most current research focuses on the integration of external knowledge with VQA systems. However, the extraction of visual features within knowledge-based VQA remains relatively unexplored. This is surprising since even for the same image, answering different questions requires attention to different visual regions. In this paper, we propose a novel question-guided multigranular visual augmentation method for knowledge-based VQA tasks. Our method uses input questions to identify and focus on question-related regions within the image, which improves prediction quality. Specifically, our method first performs semantic embedding learning for questions at both the word-level and the phrase-level. To preserve rich visual information for QA, our method uses questions as a guide to extract question-related visual features. This is implemented by multiple convolution operations. In these operations, the convolutional kernels are dynamically derived from the representations of questions. By capturing visual information from diverse perspectives, our method extract information at the word level, phrase level, and common level more comprehensively. Additionally, relevant knowledge is retrieved from knowledge graph through entity linking and random walk techniques to respond to the question. A series of experiments are conducted on public knowledge-based VQA datasets to demonstrate the effectiveness of our model. The experimental results show that our method achieves state-of-the-art performance.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104569"},"PeriodicalIF":3.5,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CLAFusion: Misaligned infrared and visible image fusion based on contrastive learning and collaborative attention
Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104574
Linli Ma, Suzhen Lin, Jianchao Zeng, Yanbo Wang, Zanxia Jin
Due to differences in imaging principles and shooting positions, achieving strict spatial alignment between images from different sensors is challenging. Existing fusion methods often introduce artifacts in fusion results when there are slight shifts or deformations between source images. Although joint training schemes of registration and fusion improve fusion results through the feedback of fusion on registration, they still face challenges of unstable registration accuracy and artifacts caused by local non-rigid distortions. To this end, we propose a new misaligned infrared and visible image fusion method, named CLAFusion. It introduces a contrastive learning-based multi-scale feature extraction module (CLMFE) to enhance the similarity between images of different modalities from the same scene and increase the differences between images from different scenes, improving the stability of registration accuracy. Meanwhile, a collaborative attention fusion module (CAFM) is designed to combine window attention, gradient channel attention, and the feedback of fusion on registration to realize the precise alignment of features and the suppression of misaligned redundant features, alleviating artifacts in fusion results. Extensive experiments show that the proposed method outperforms state-of-the-art methods in misaligned image fusion and semantic segmentation.
{"title":"CLAFusion: Misaligned infrared and visible image fusion based on contrastive learning and collaborative attention","authors":"Linli Ma, Suzhen Lin, Jianchao Zeng, Yanbo Wang, Zanxia Jin","doi":"10.1016/j.cviu.2025.104574","DOIUrl":"10.1016/j.cviu.2025.104574","url":null,"abstract":"<div><div>Due to differences in imaging principles and shooting positions, achieving strict spatial alignment between images from different sensors is challenging. Existing fusion methods often introduce artifacts in fusion results when there are slight shifts or deformations between source images. Although joint training schemes of registration and fusion improves fusion results through the feedback of fusion on registration, it still faces challenges of unstable registration accuracy and artifacts caused by local non-rigid distortions. For this, we proposes a new misaligned infrared and visible image fusion method, named CLAFusion. It introduce a contrastive learning-based multi-scale feature extraction module (CLMFE) to enhance the similarity between images of different modalities from same scene and increase the differences between images from different scenes, improving stability of registration accuracy. Meanwhile, a collaborative attention fusion module (CAFM) is designed to combine window attention, gradient channel attention, and the feedback of fusion on registration to realize the precise alignment of features and suppression of misaligned redundant features, alleviating artifacts in fusion results. Extensive experiments show that the proposed method outperforms state-of-the-art methods in misaligned image fusion and semantic segmentation.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104574"},"PeriodicalIF":3.5,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145618396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-shot Medical Image Segmentation via Boundary-extended Prototypes and Momentum Inference
Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104571
Bin Xu, Yazhou Zhu, Shidong Wang, Yang Long, Haofeng Zhang
Few-Shot Medical Image Segmentation (FSMIS) aims to achieve precise segmentation of different organs using minimal annotated data. Current prototype-based FSMIS methods primarily extract prototypes from support samples through random sampling or local averaging. However, due to the extremely small proportion of boundary features, traditional methods have difficulty generating boundary prototypes, resulting in poorly delineated boundaries in segmentation results. Moreover, their reliance on a single support image for segmenting all query images leads to significant performance degradation when substantial discrepancies exist between support and query images. To address these challenges, we propose an innovative solution, namely Boundary-extended Prototypes and Momentum Inference (BePMI), which includes two key modules: a Boundary-extended Prototypes (BePro) module and a Momentum Inference (MoIf) module. BePro constructs boundary prototypes by explicitly clustering the internal and external boundary features to alleviate the problem of boundary ambiguity. MoIf employs the spatial consistency of adjacent slices in 3D medical images to dynamically optimize the prototype representation, thereby reducing the reliance on a single sample. Extensive experiments on three publicly available medical image datasets demonstrate that our method outperforms the state-of-the-art methods. Code is available at https://github.com/xubin471/BePMI.
{"title":"Few-shot Medical Image Segmentation via Boundary-extended Prototypes and Momentum Inference","authors":"Bin Xu , Yazhou Zhu , Shidong Wang , Yang Long , Haofeng Zhang","doi":"10.1016/j.cviu.2025.104571","DOIUrl":"10.1016/j.cviu.2025.104571","url":null,"abstract":"<div><div>Few-Shot Medical Image Segmentation (<strong>FSMIS</strong>) aims to achieve precise segmentation of different organs using minimal annotated data. Current prototype-based FSMIS methods primarily extract prototypes from support samples through random sampling or local averaging. However, due to the extremely small proportion of boundary features, traditional methods have difficulty generating boundary prototypes, resulting in poorly delineated boundaries in segmentation results. Moreover, their reliance on a single support image for segmenting all query images leads to significant performance degradation when substantial discrepancies exist between support and query images. To address these challenges, we propose an innovative solution namely Boundary-extended Prototypes and Momentum Inference (<strong>BePMI</strong>), which includes two key modules: a Boundary-extended Prototypes (<strong>BePro</strong>) module and a Momentum Inference (<strong>MoIf</strong>) module. BePro constructs boundary prototypes by explicitly clustering the internal and external boundary features to alleviate the problem of boundary ambiguity. MoIf employs the spatial consistency of adjacent slices in 3D medical images to dynamically optimize the prototype representation, thereby reducing the reliance on a single sample. Extensive experiments on three publicly available medical image datasets demonstrate that our method outperforms the state-of-the-art methods. Code is available at <span><span>https://github.com/xubin471/BePMI</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104571"},"PeriodicalIF":3.5,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-aware 3D CNN for action recognition based on semantic segmentation (CARS)
Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104570
Baqar Abbas, Abderrazak Chahi, Yassine Ruichek
Human action recognition is a prominent area of research in computer vision due to its wide-ranging applications, including surveillance, human–computer interaction, and autonomous systems. Although recent 3D CNN approaches have shown promising results by capturing both spatial and temporal information, they often struggle to incorporate the environmental context in which actions occur, limiting their ability to discriminate between similar actions and accurately recognize complex scenarios. To overcome these challenges, a novel and effective approach called Context-aware 3D CNN for Action Recognition based on Semantic segmentation (CARS) is presented in this paper. The CARS approach consists of an intermediate scene recognition module that uses a semantic segmentation model to capture contextual cues from video sequences. This information is then encoded and linked to the features captured by the 3D CNN model, resulting in a comprehensive global feature map. CARS integrates a Convolutional Block Attention Module (CBAM) that utilizes channel and spatial attention mechanisms to focus on the most relevant parts of the 3D CNN feature map. We also replace the traditional cross-entropy loss with a focal loss that can better deal with underrepresented and hard-to-classify human actions. Extensive experiments on well-known benchmark datasets, including HMDB51 and UCF101, show that the proposed CARS approach outperforms current 3D CNN-based state-of-the-art approaches. Moreover, the context extraction module in CARS is a generic plug-and-play network that can improve the classification performance of any 3D CNN architecture.
{"title":"Context-aware 3D CNN for action recognition based on semantic segmentation (CARS)","authors":"Baqar Abbas, Abderrazak Chahi, Yassine Ruichek","doi":"10.1016/j.cviu.2025.104570","DOIUrl":"10.1016/j.cviu.2025.104570","url":null,"abstract":"<div><div>Human action recognition is a prominent area of research in computer vision due to its wide-ranging applications, including surveillance, human–computer interaction, and autonomous systems. Although recent 3D CNN approaches have shown promising results by capturing both spatial and temporal information, they often struggle to incorporate the environmental context in which actions occur, limiting their ability to discriminate between similar actions and accurately recognize complex scenarios. To overcome these challenges, a novel and effective approach called Context-aware 3D CNN for Action Recognition based on Semantic segmentation (CARS) is presented in this paper. The CARS approach consists of an intermediate scene recognition module that uses a semantic segmentation model to capture contextual cues from video sequences. This information is then encoded and linked to the features captured by the 3D CNN model, resulting in a comprehensive global feature map. CARS integrates a Convolutional Block Attention Module (CBAM) that utilizes channel and spatial attention mechanisms to focus on the most relevant parts of the relevant 3D CNN feature map. We also replace the traditional cross-entropy loss with a focal loss that can better deal with underrepresented and hard- to-classify human actions. Extensive experiments on well-known benchmark datasets, including HMD51 and UCF101, show that the proposed CARS approach outperforms current 3D CNN-based state-of-the-art approaches. Moreover, the context extraction module in CARS is a generic plug-and-play network that can improve the classification performance of any 3D CNN architecture.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104570"},"PeriodicalIF":3.5,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145571879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient direct solution of the perspective-three-point problem
Pub Date: 2025-11-13 | DOI: 10.1016/j.cviu.2025.104568
Qida Yu, Rongrong Jiang, Xiaoyan Zhou, Yiru Wang, Guili Xu, Wu Quan
In this paper, we examine the Perspective-3-Point (P3P) problem, which involves determining the absolute pose of a calibrated camera using three known 3D–2D point correspondences. Traditionally, this problem reduces to solving a quartic polynomial. Recently, some cubic formulations based on degenerate conic curves have emerged, offering notable improvements in computational efficiency and avoiding repeated solutions. However, existing cubic formulations typically utilize a two-stage solution framework, which is inherently less efficient and stable than the single-stage framework. Motivated by this observation, we propose a novel single-stage degenerate-conic-based method. Our core idea is to algebraically and directly formulate the P3P problem as finding the intersection of two degenerate conic curves. Specifically, we first parameterize the rotation matrix and translation vector as linear combinations of three known vectors, leaving only two unknown elements of the rotation matrix. Next, leveraging the orthogonality constraints of the rotation matrix, we derive two conic equations. Finally, we efficiently solve these equations under the degenerate condition. Since our method combines the advantages of both single-stage approaches and degenerate-conic-based techniques, it is efficient, accurate, and stable. Extensive experiments validate the superior performance of our method.
{"title":"An efficient direct solution of the perspective-three-point problem","authors":"Qida Yu , Rongrong Jiang , Xiaoyan Zhou , Yiru Wang , Guili Xu , Wu Quan","doi":"10.1016/j.cviu.2025.104568","DOIUrl":"10.1016/j.cviu.2025.104568","url":null,"abstract":"<div><div>In this paper, we examine the Perspective-3-Point (P3P) problem, which involves determining the absolute pose of a calibrated camera using three known 3D–2D point correspondences. Traditionally, this problem reduces to solving a quartic polynomial. Recently, some cubic formulations based on degenerate conic curves have emerged, offering notable improvements in computational efficiency and avoiding repeated solutions. However, existing cubic formulations typically utilize a two-stage solution framework, which is inherently less efficient and stable compared to the single-stage framework. Motivated by this observation, we propose a novel single-stage degenerate-conic-based method. Our core idea is to algebraically and directly formulate the P3P problem as finding the intersection of two degenerate conic curves. Specifically, we first parameterize the rotation matrix and translation vector as linear combinations of three known vectors, leaving only two unknown elements of the rotation matrix. Next, leveraging orthogonality constraints of the rotation matrix, we derive two conic equations. Finally, we efficiently solve these equations under the degenerate condition. Since our method combines advantage of both single-stage approaches and degenerate-conic-based techniques, it is efficient, accurate, and stable. Extensive experiments validate the superior performance of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"262 ","pages":"Article 104568"},"PeriodicalIF":3.5,"publicationDate":"2025-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TOODIB: Task-aligned one-stage object detection with interactions between branches
Pub Date: 2025-11-12 | DOI: 10.1016/j.cviu.2025.104567
Simin Chen, Qinxia Hu, Mingjin Zhu, Qiming Wu, Xiao Hu
Traditional one-stage object detection (OOD) methods often simultaneously perform independent classification (cls) and localization (loc) tasks. As a result, spatial misalignment occurs between the cls and loc results. To address this spatial misalignment issue, a novel task-aligned one-stage object detection method with interactions between branches (TOODIB) is proposed to learn interactive features from both tasks and then encourage the features of the two forward cls and loc branches to interact. Inspired by the human retina neural network, in which lateral paths exist between longitudinal paths, an interaction-between-branches (IBB) module is developed to encourage the interactive features between the cls and loc branches to interact. In addition, this paper improves the interactive convolution layers (IICLs) to produce more interactive features and designs a task-related spatial decoupling (TRSD) module to decouple interactive features for specific tasks, thereby providing task-specific features. The MS-COCO2017 dataset is used to evaluate TOODIB. TOODIB significantly reduces the degree of spatial misalignment, and the task misalignment metric decreases from 19.85 to 3.41 pixels. TOODIB also improves one-stage object detection and achieves average precision (AP) values of 43.3 and 47.6 with the ResNet-50 and ResNet-101 backbones, respectively.
{"title":"TOODIB: Task-aligned one-stage object detection with interactions between branches","authors":"Simin Chen , Qinxia Hu , Mingjin Zhu , Qiming Wu , Xiao Hu","doi":"10.1016/j.cviu.2025.104567","DOIUrl":"10.1016/j.cviu.2025.104567","url":null,"abstract":"<div><div>Traditional one-stage object detection (OOD) methods often simultaneously perform independent classification (<span><math><mrow><mi>c</mi><mi>l</mi><mi>s</mi></mrow></math></span>) and localization (<span><math><mrow><mi>l</mi><mi>o</mi><mi>c</mi></mrow></math></span>) tasks. As a result, spatial misalignment occurs between the <span><math><mrow><mi>c</mi><mi>l</mi><mi>s</mi></mrow></math></span> and <span><math><mrow><mi>l</mi><mi>o</mi><mi>c</mi></mrow></math></span> results. To address this spatial misalignment issue, a method of novel task-aligned one-stage object detection with interactions between branches (TOODIB) is proposed to learn some interactive features from both tasks and then encourage the features between two forward <span><math><mrow><mi>c</mi><mi>l</mi><mi>s</mi></mrow></math></span> and <span><math><mrow><mi>l</mi><mi>o</mi><mi>c</mi></mrow></math></span> branches to interact. Inspired by a human retina neural network in which lateral paths exist between longitudinal paths, an interaction-between-branches (IBB) module is developed to encourage the interactive features between <span><math><mrow><mi>c</mi><mi>l</mi><mi>s</mi></mrow></math></span> and <span><math><mrow><mi>l</mi><mi>o</mi><mi>c</mi></mrow></math></span> branches to interact. In addition, this paper improves the interactive convolution layers (IICLs) to produce more interactive features and designs a task-related spatial decoupling (TRSD) module to decouple interactive features for specific tasks, thereby providing task-specific features. The MS-COCO2017 dataset is used to evaluate TOODIB. TOODIB significantly reduces the degree of spatial misalignment, and the task misalignment metric decreases from 19.85 to 3.41 pixels. TOODIB also improves one-stage object detection and achieves average precision (AP) values of 43.3 and 47.6 on the ResNet-50 and ResNet-101 backbones, respectively.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"262 ","pages":"Article 104567"},"PeriodicalIF":3.5,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STARS: Semantics-Aware Text-guided Aerial Image Refinement and Synthesis
Pub Date: 2025-11-12 | DOI: 10.1016/j.cviu.2025.104561
Douglas Townsell, Lingwei Chen, Mimi Xie, Chen Pan, Wen Zhang
Aerial imagery, with its vast applications in environmental monitoring, disaster management, and autonomous navigation, demands advanced data engineering solutions to overcome the scarcity and limited diversity of publicly available datasets. These limitations hinder the development of robust models capable of addressing dynamic, complex aerial scenarios. While recent text-guided generative models have shown promise in synthesizing high-quality images, they fall short in handling the unique challenges of aerial imagery, including densely packed objects, intricate spatial relationships, and the absence of paired text-aerial image datasets. To tackle these limitations, we propose STARS, a groundbreaking framework for Semantic-Aware Text-Guided Aerial Image Refinement and Synthesis. STARS introduces a three-pronged approach: context-aware text generation using chain-of-thought prompting for precise and diverse text annotations, feature-augmented image representation with multi-head attention to preserve small-object details and spatial coherence, and a latent diffusion mechanism conditioned on multi-modal embedding fusion for high-fidelity image synthesis. These innovations enable STARS to generate semantically accurate and visually complex aerial images, even in scenarios with extreme complexity. Our extensive evaluation across multiple benchmarks demonstrates that STARS outperforms state-of-the-art models such as Stable Diffusion and ARLDM, achieving superior FID scores and setting a new standard for aerial image synthesis.
{"title":"STARS: Semantics-Aware Text-guided Aerial Image Refinement and Synthesis","authors":"Douglas Townsell , Lingwei Chen , Mimi Xie , Chen Pan , Wen Zhang","doi":"10.1016/j.cviu.2025.104561","DOIUrl":"10.1016/j.cviu.2025.104561","url":null,"abstract":"<div><div>Aerial imagery, with its vast applications in environmental monitoring, disaster management, and autonomous navigation, demands advanced data engineering solutions to overcome the scarcity and limited diversity of publicly available datasets. These limitations hinder the development of robust models capable of addressing dynamic, complex aerial scenarios. While recent text-guided generative models have shown promise in synthesizing high-quality images, they fall short in handling the unique challenges of aerial imagery, including densely packed objects, intricate spatial relationships, and the absence of paired text-aerial image datasets. To tackle these limitations, we propose <strong>STARS</strong>, a groundbreaking framework for <strong>S</strong>emantic-Aware <strong>T</strong>ext-Guided <strong>A</strong>erial Image <strong>R</strong>efinement and <strong>S</strong>ynthesis. STARS introduces a three-pronged approach: context-aware text generation using chain-of-thought prompting for precise and diverse text annotations, feature-augmented image representation with multi-head attention to preserve small-object details and spatial coherence, and a latent diffusion mechanism conditioned on multi-modal embedding fusion for high-fidelity image synthesis. These innovations enable STARS to generate semantically accurate and visually complex aerial images, even in scenarios with extreme complexity. Our extensive evaluation across multiple benchmarks demonstrates that STARS outperforms state-of-the-art models such as Stable Diffusion and ARLDM, achieving superior FID scores and setting a new standard for aerial image synthesis.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"262 ","pages":"Article 104561"},"PeriodicalIF":3.5,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Swin Transformer-based maritime objects instance segmentation with dual attention and multi-scale fusion
Pub Date: 2025-11-11 | DOI: 10.1016/j.cviu.2025.104556
Haoke Yin, Changdong Yu, Chengshang Wu, Kexin Dai, Junfeng Shi, Yifan Xu, Yuan Zhu
The rapid development of marine environmental sensing technologies has significantly advanced applications such as unmanned surface vehicles (USVs), maritime surveillance, and autonomous navigation, all of which increasingly require precise and robust instance-level segmentation of maritime objects. However, real-world maritime scenes pose substantial challenges, including dynamic backgrounds, scale variation, and the frequent occurrence of small objects. To address these issues, we propose DAMFFNet, a one-stage instance segmentation framework based on the Swin Transformer backbone architecture. First, we introduce a Dual Attention Module (DAM) that effectively suppresses background interference and enhances salient feature representation in complex marine environments. Second, we design a Bottom-up Path Aggregation Module (BPAM) to facilitate fine-grained multi-scale feature fusion, which significantly improves segmentation accuracy, particularly for small and scale-variant objects. Third, we construct MOISD, a new large-scale maritime instance segmentation dataset comprising 7,938 high-resolution images with pixel-level annotations across 12 representative object categories under diverse sea states and lighting conditions. Extensive experiments conducted on both the MOISD and the public MariShipInsSeg datasets demonstrate that DAMFFNet outperforms existing methods in complex background and small-object segmentation tasks, achieving an AP of 82.71% on the MOISD dataset while maintaining an inference speed of 83 ms, thus establishing an effective balance between segmentation precision and computational efficiency.
{"title":"Swin Transformer-based maritime objects instance segmentation with dual attention and multi-scale fusion","authors":"Haoke Yin , Changdong Yu , Chengshang Wu , Kexin Dai , Junfeng Shi , Yifan Xu , Yuan Zhu","doi":"10.1016/j.cviu.2025.104556","DOIUrl":"10.1016/j.cviu.2025.104556","url":null,"abstract":"<div><div>The rapid development of marine environmental sensing technologies has significantly advanced applications such as unmanned surface vehicles (USVs), maritime surveillance, and autonomous navigation, all of which increasingly require precise and robust instance-level segmentation of maritime objects. However, real-world maritime scenes pose substantial challenges, including dynamic backgrounds, scale variation, and the frequent occurrence of small objects. To address these issues, we propose DAMFFNet, a one-stage instance segmentation framework based on the Swin Transformer backbone architecture. First, we introduce a Dual Attention Module (DAM) that effectively suppresses background interference and enhances salient feature representation in complex marine environments. Second, we design a Bottom-up Path Aggregation Module (BPAM) to facilitate fine-grained multi-scale feature fusion, which significantly improves segmentation accuracy, particularly for small and scale-variant objects. Third, we construct MOISD, a new large-scale maritime instance segmentation dataset comprising 7,938 high-resolution images with pixel-level annotations across 12 representative object categories under diverse sea states and lighting conditions. Extensive experiments conducted on both the MOISD and the public MariShipInsSeg datasets demonstrate that DAMFFNet outperforms existing methods in complex background and small-object segmentation tasks, achieving an <em>AP</em> of 82.71% on the MOISD dataset while maintaining an inference speed of 83 ms, thus establishing an effective balance between segmentation precision and computational efficiency.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"262 ","pages":"Article 104556"},"PeriodicalIF":3.5,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}