
Computer Vision and Image Understanding: Latest Articles

Question-guided multigranular visual augmentation for knowledge-based visual question answering
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104569
Jing Liu, Lizong Zhang, Chong Mu, Guangxi Lu, Ben Zhang, Junsong Li
In knowledge-based visual question answering (VQA), most current research focuses on the integration of external knowledge with VQA systems. However, the extraction of visual features within knowledge-based VQA remains relatively unexplored. This is surprising, since even for the same image, answering different questions requires attention to different visual regions. In this paper, we propose a novel question-guided multigranular visual augmentation method for knowledge-based VQA tasks. Our method uses the input question to identify and focus on question-related regions within the image, which improves prediction quality. Specifically, our method first performs semantic embedding learning for questions at both the word level and the phrase level. To preserve rich visual information for question answering, our method uses questions as a guide to extract question-related visual features. This is implemented by multiple convolution operations in which the convolutional kernels are dynamically derived from the question representations. By capturing visual information from diverse perspectives, our method extracts information at the word, phrase, and common levels more comprehensively. Additionally, relevant knowledge is retrieved from a knowledge graph through entity linking and random-walk techniques to answer the question. A series of experiments on public knowledge-based VQA datasets demonstrates the effectiveness of our model. The experimental results show that our method achieves state-of-the-art performance.
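The question-conditioned convolution described above can be sketched as follows: a pooled question embedding is mapped to a depthwise kernel that is then applied to the visual feature map. This is a minimal sketch under assumed shapes; the class name QuestionGuidedConv, the depthwise-kernel design, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedConv(nn.Module):
    """Illustrative sketch: derive a conv kernel from a question embedding
    and apply it to visual features at one granularity level."""
    def __init__(self, q_dim=512, v_channels=256, k=3):
        super().__init__()
        self.v_channels, self.k = v_channels, k
        # maps a question vector to one k x k depthwise filter per channel
        self.kernel_gen = nn.Linear(q_dim, v_channels * k * k)

    def forward(self, visual_feats, q_emb):
        # visual_feats: (B, C, H, W); q_emb: (B, q_dim)
        B, C, H, W = visual_feats.shape
        kernels = self.kernel_gen(q_emb).view(B * C, 1, self.k, self.k)
        # grouped-conv trick: fold the batch into channels so each sample
        # is convolved with its own question-conditioned kernels
        x = visual_feats.reshape(1, B * C, H, W)
        out = F.conv2d(x, kernels, padding=self.k // 2, groups=B * C)
        return out.view(B, C, H, W)

# toy usage with word-level question embeddings and grid visual features
feats = torch.randn(2, 256, 14, 14)
word_q = torch.randn(2, 512)
print(QuestionGuidedConv()(feats, word_q).shape)  # torch.Size([2, 256, 14, 14])
```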
Citations: 0
CLAFusion: Misaligned infrared and visible image fusion based on contrastive learning and collaborative attention
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104574
Linli Ma, Suzhen Lin, Jianchao Zeng, Yanbo Wang, Zanxia Jin
Due to differences in imaging principles and shooting positions, achieving strict spatial alignment between images from different sensors is challenging. Existing fusion methods often introduce artifacts into the fusion results when there are slight shifts or deformations between the source images. Although joint training schemes for registration and fusion improve fusion results through the feedback of fusion on registration, they still face unstable registration accuracy and artifacts caused by local non-rigid distortions. To address this, we propose a new misaligned infrared and visible image fusion method, named CLAFusion. It introduces a contrastive learning-based multi-scale feature extraction module (CLMFE) to enhance the similarity between images of different modalities from the same scene and increase the differences between images from different scenes, improving the stability of registration accuracy. Meanwhile, a collaborative attention fusion module (CAFM) is designed to combine window attention, gradient channel attention, and the feedback of fusion on registration to achieve precise feature alignment and suppress misaligned redundant features, alleviating artifacts in the fusion results. Extensive experiments show that the proposed method outperforms state-of-the-art methods in misaligned image fusion and semantic segmentation.
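The contrastive objective behind a module like CLMFE, pulling same-scene infrared/visible features together and pushing different scenes apart, can be written as a standard InfoNCE-style loss. This is a minimal sketch assuming pooled per-image features; the function name, temperature, and symmetric form are illustrative, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(ir_feats, vis_feats, temperature=0.07):
    """InfoNCE-style loss: the i-th infrared and i-th visible feature come from
    the same scene (positive pair); all other pairs in the batch are negatives."""
    ir = F.normalize(ir_feats.flatten(1), dim=1)    # (B, D)
    vis = F.normalize(vis_feats.flatten(1), dim=1)  # (B, D)
    logits = ir @ vis.t() / temperature             # (B, B) cosine similarities
    targets = torch.arange(ir.size(0), device=ir.device)
    # symmetric loss: infrared-to-visible and visible-to-infrared
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage with pooled features from the two modalities
loss = cross_modal_info_nce(torch.randn(8, 256), torch.randn(8, 256))
```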
Citations: 0
Few-shot Medical Image Segmentation via Boundary-extended Prototypes and Momentum Inference
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104571
Bin Xu, Yazhou Zhu, Shidong Wang, Yang Long, Haofeng Zhang
Few-Shot Medical Image Segmentation (FSMIS) aims to achieve precise segmentation of different organs using minimal annotated data. Current prototype-based FSMIS methods primarily extract prototypes from support samples through random sampling or local averaging. However, because boundary features make up an extremely small proportion of the mask, traditional methods have difficulty generating boundary prototypes, resulting in poorly delineated boundaries in the segmentation results. Moreover, their reliance on a single support image for segmenting all query images leads to significant performance degradation when substantial discrepancies exist between support and query images. To address these challenges, we propose an innovative solution, namely Boundary-extended Prototypes and Momentum Inference (BePMI), which includes two key modules: a Boundary-extended Prototypes (BePro) module and a Momentum Inference (MoIf) module. BePro constructs boundary prototypes by explicitly clustering the internal and external boundary features to alleviate the problem of boundary ambiguity. MoIf exploits the spatial consistency of adjacent slices in 3D medical images to dynamically optimize the prototype representation, thereby reducing the reliance on a single sample. Extensive experiments on three publicly available medical image datasets demonstrate that our method outperforms the state-of-the-art methods. Code is available at https://github.com/xubin471/BePMI.
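The prototype pipeline such methods build on, masked average pooling of support features followed by cosine-similarity scoring of query pixels, can be sketched as below, with a crude boundary band (dilation minus erosion via max pooling) standing in for the boundary-prototype idea. All names, the band width, and the pooling-based morphology are illustrative assumptions, not BePMI's actual modules.

```python
import torch
import torch.nn.functional as F

def masked_avg_pool(feats, mask):
    # feats: (B, C, H, W), mask: (B, 1, H, W) in {0,1} -> (B, C) prototype
    return (feats * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)

def prototypes_with_boundary(sup_feats, sup_mask, band=3):
    """Foreground prototype plus a crude boundary prototype taken from a thin
    band around the mask edge (dilation minus erosion via max pooling)."""
    pad = band // 2
    dilated = F.max_pool2d(sup_mask, band, stride=1, padding=pad)
    eroded = 1.0 - F.max_pool2d(1.0 - sup_mask, band, stride=1, padding=pad)
    boundary = (dilated - eroded).clamp(0, 1)
    return masked_avg_pool(sup_feats, sup_mask), masked_avg_pool(sup_feats, boundary)

def cosine_score(query_feats, prototype):
    # query_feats: (B, C, H, W), prototype: (B, C) -> (B, H, W) similarity map
    return F.cosine_similarity(query_feats, prototype[:, :, None, None], dim=1)

# toy usage: one support/query pair with random features and mask
sup_f, qry_f = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
sup_m = (torch.rand(1, 1, 32, 32) > 0.7).float()
fg_proto, bd_proto = prototypes_with_boundary(sup_f, sup_m)
pred = cosine_score(qry_f, fg_proto)   # thresholded or softmaxed over classes in practice
```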
Citations: 0
Context-aware 3D CNN for action recognition based on semantic segmentation (CARS)
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104570
Baqar Abbas, Abderrazak Chahi, Yassine Ruichek
Human action recognition is a prominent area of research in computer vision due to its wide-ranging applications, including surveillance, human–computer interaction, and autonomous systems. Although recent 3D CNN approaches have shown promising results by capturing both spatial and temporal information, they often struggle to incorporate the environmental context in which actions occur, limiting their ability to discriminate between similar actions and accurately recognize complex scenarios. To overcome these challenges, a novel and effective approach called Context-aware 3D CNN for Action Recognition based on Semantic segmentation (CARS) is presented in this paper. The CARS approach consists of an intermediate scene recognition module that uses a semantic segmentation model to capture contextual cues from video sequences. This information is then encoded and linked to the features captured by the 3D CNN model, resulting in a comprehensive global feature map. CARS integrates a Convolutional Block Attention Module (CBAM) that utilizes channel and spatial attention mechanisms to focus on the most relevant parts of the 3D CNN feature map. We also replace the traditional cross-entropy loss with a focal loss that better handles underrepresented and hard-to-classify human actions. Extensive experiments on well-known benchmark datasets, including HMDB51 and UCF101, show that the proposed CARS approach outperforms current 3D CNN-based state-of-the-art approaches. Moreover, the context extraction module in CARS is a generic plug-and-play network that can improve the classification performance of any 3D CNN architecture.
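The focal-loss replacement mentioned above takes the standard multi-class form shown below; the gamma value, the optional per-class alpha weights, and the 51-class toy example are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights well-classified samples so rare and
    hard action classes contribute more to the gradient.
    logits: (B, num_classes), targets: (B,) class indices."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:                      # optional per-class weighting
        loss = loss * alpha.to(logits.device)[targets]
    return loss.mean()

# toy usage: 8 clips, 51 action classes (e.g. an HMDB51-sized label space)
loss = focal_loss(torch.randn(8, 51), torch.randint(0, 51, (8,)))
```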
Citations: 0
An efficient direct solution of the perspective-three-point problem
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-13 | DOI: 10.1016/j.cviu.2025.104568
Qida Yu, Rongrong Jiang, Xiaoyan Zhou, Yiru Wang, Guili Xu, Wu Quan
In this paper, we examine the Perspective-3-Point (P3P) problem, which involves determining the absolute pose of a calibrated camera using three known 3D–2D point correspondences. Traditionally, this problem reduces to solving a quartic polynomial. Recently, some cubic formulations based on degenerate conic curves have emerged, offering notable improvements in computational efficiency and avoiding repeated solutions. However, existing cubic formulations typically rely on a two-stage solution framework, which is inherently less efficient and stable than a single-stage framework. Motivated by this observation, we propose a novel single-stage degenerate-conic-based method. Our core idea is to algebraically and directly formulate the P3P problem as finding the intersection of two degenerate conic curves. Specifically, we first parameterize the rotation matrix and translation vector as linear combinations of three known vectors, leaving only two unknown elements of the rotation matrix. Next, leveraging the orthogonality constraints of the rotation matrix, we derive two conic equations. Finally, we efficiently solve these equations under the degenerate condition. Since our method combines the advantages of both single-stage approaches and degenerate-conic-based techniques, it is efficient, accurate, and stable. Extensive experiments validate the superior performance of our method.
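The pencil-of-conics recipe that underlies degenerate-conic formulations can be stated compactly as below. This is a textbook restatement under the assumption that the two unknown rotation entries are collected in x = (u, v, 1)^T; it is not necessarily the exact parameterization or cubic used in the paper.

```latex
% Two conic constraints from the orthogonality of the rotation matrix,
% with x = (u, v, 1)^T holding the two unknown rotation entries:
\[
\mathbf{x}^{\top} C_1 \mathbf{x} = 0, \qquad \mathbf{x}^{\top} C_2 \mathbf{x} = 0 .
\]
% Every conic in the pencil C(\lambda) = C_1 + \lambda C_2 passes through the
% common intersection points. Requiring the pencil member to be degenerate,
\[
\det\!\left( C_1 + \lambda C_2 \right) = 0 ,
\]
% gives a cubic in \lambda; a real root yields a degenerate conic that factors
% into two lines, and intersecting those lines with C_1 (one quadratic per line)
% recovers the candidate (u, v) solutions, from which the pose follows by
% back-substitution into the linear parameterization.
```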
Citations: 0
TOODIB: Task-aligned one-stage object detection with interactions between branches
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-12 | DOI: 10.1016/j.cviu.2025.104567
Simin Chen, Qinxia Hu, Mingjin Zhu, Qiming Wu, Xiao Hu
Traditional one-stage object detection (OOD) methods often simultaneously perform independent classification (cls) and localization (loc) tasks. As a result, spatial misalignment occurs between the cls and loc results. To address this spatial misalignment issue, a novel task-aligned one-stage object detection method with interactions between branches (TOODIB) is proposed to learn interactive features from both tasks and encourage the features of the two forward cls and loc branches to interact. Inspired by the human retinal neural network, in which lateral paths exist between longitudinal paths, an interaction-between-branches (IBB) module is developed to encourage the interactive features of the cls and loc branches to interact. In addition, this paper improves the interactive convolution layers (IICLs) to produce more interactive features and designs a task-related spatial decoupling (TRSD) module to decouple interactive features for specific tasks, thereby providing task-specific features. The MS-COCO2017 dataset is used to evaluate TOODIB. TOODIB significantly reduces the degree of spatial misalignment, and the task misalignment metric decreases from 19.85 to 3.41 pixels. TOODIB also improves one-stage object detection and achieves average precision (AP) values of 43.3 and 47.6 on the ResNet-50 and ResNet-101 backbones, respectively.
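One common way to let the cls and loc towers of a one-stage head exchange information is a pair of lateral 1x1 projections with residual addition, sketched below. This is only a plausible illustration of branch interaction; the class name, tower depth, and lateral design are assumptions and not the IBB module as defined in the paper.

```python
import torch
import torch.nn as nn

class InteractionBetweenBranches(nn.Module):
    """Illustrative lateral exchange between classification and localization
    towers: each branch receives a 1x1-projected summary of the other before
    producing its own output (names and shapes are assumptions)."""
    def __init__(self, channels=256):
        super().__init__()
        self.cls_tower = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.loc_tower = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.cls_from_loc = nn.Conv2d(channels, channels, 1)   # lateral path loc -> cls
        self.loc_from_cls = nn.Conv2d(channels, channels, 1)   # lateral path cls -> loc

    def forward(self, feats):
        c, l = self.cls_tower(feats), self.loc_tower(feats)
        c_out = c + self.cls_from_loc(l)   # cls branch sees localization evidence
        l_out = l + self.loc_from_cls(c)   # loc branch sees classification evidence
        return c_out, l_out

# toy usage on one FPN level
cls_feat, loc_feat = InteractionBetweenBranches()(torch.randn(1, 256, 40, 40))
```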
Citations: 0
STARS: Semantics-Aware Text-guided Aerial Image Refinement and Synthesis
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-12 | DOI: 10.1016/j.cviu.2025.104561
Douglas Townsell, Lingwei Chen, Mimi Xie, Chen Pan, Wen Zhang
Aerial imagery, with its vast applications in environmental monitoring, disaster management, and autonomous navigation, demands advanced data engineering solutions to overcome the scarcity and limited diversity of publicly available datasets. These limitations hinder the development of robust models capable of addressing dynamic, complex aerial scenarios. While recent text-guided generative models have shown promise in synthesizing high-quality images, they fall short in handling the unique challenges of aerial imagery, including densely packed objects, intricate spatial relationships, and the absence of paired text-aerial image datasets. To tackle these limitations, we propose STARS, a groundbreaking framework for Semantic-Aware Text-Guided Aerial Image Refinement and Synthesis. STARS introduces a three-pronged approach: context-aware text generation using chain-of-thought prompting for precise and diverse text annotations, feature-augmented image representation with multi-head attention to preserve small-object details and spatial coherence, and a latent diffusion mechanism conditioned on multi-modal embedding fusion for high-fidelity image synthesis. These innovations enable STARS to generate semantically accurate and visually complex aerial images, even in scenarios with extreme complexity. Our extensive evaluation across multiple benchmarks demonstrates that STARS outperforms state-of-the-art models such as Stable Diffusion and ARLDM, achieving superior FID scores and setting a new standard for aerial image synthesis.
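The multi-modal embedding fusion that conditions the latent diffusion step can be illustrated with a small cross-attention fuser over text and visual tokens, as below; the module name, dimensions, token counts, and residual design are assumptions for illustration rather than the STARS architecture.

```python
import torch
import torch.nn as nn

class MultiModalConditioner(nn.Module):
    """Sketch of fusing text-token embeddings with auxiliary visual tokens via
    multi-head cross-attention to build the conditioning sequence that a latent
    diffusion U-Net would attend to (the interface is an assumption)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, Nt, D) from a text encoder; visual_tokens: (B, Nv, D)
        fused, _ = self.cross_attn(query=text_tokens, key=visual_tokens,
                                   value=visual_tokens)
        return self.norm(text_tokens + fused)   # residual fusion, used as conditioning

# toy usage: 77 text tokens fused with 49 visual/layout tokens
cond = MultiModalConditioner()(torch.randn(2, 77, 768), torch.randn(2, 49, 768))
```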
Citations: 0
Swin Transformer-based maritime objects instance segmentation with dual attention and multi-scale fusion
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-11 | DOI: 10.1016/j.cviu.2025.104556
Haoke Yin, Changdong Yu, Chengshang Wu, Kexin Dai, Junfeng Shi, Yifan Xu, Yuan Zhu
The rapid development of marine environmental sensing technologies has significantly advanced applications such as unmanned surface vehicles (USVs), maritime surveillance, and autonomous navigation, all of which increasingly require precise and robust instance-level segmentation of maritime objects. However, real-world maritime scenes pose substantial challenges, including dynamic backgrounds, scale variation, and the frequent occurrence of small objects. To address these issues, we propose DAMFFNet, a one-stage instance segmentation framework based on the Swin Transformer backbone architecture. First, we introduce a Dual Attention Module (DAM) that effectively suppresses background interference and enhances salient feature representation in complex marine environments. Second, we design a Bottom-up Path Aggregation Module (BPAM) to facilitate fine-grained multi-scale feature fusion, which significantly improves segmentation accuracy, particularly for small and scale-variant objects. Third, we construct MOISD, a new large-scale maritime instance segmentation dataset comprising 7,938 high-resolution images with pixel-level annotations across 12 representative object categories under diverse sea states and lighting conditions. Extensive experiments conducted on both the MOISD and the public MariShipInsSeg datasets demonstrate that DAMFFNet outperforms existing methods in complex-background and small-object segmentation tasks, achieving an AP of 82.71% on the MOISD dataset while maintaining an inference time of 83 ms, thus establishing an effective balance between segmentation precision and computational efficiency.
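Bottom-up multi-scale aggregation of an FPN-style pyramid, where each finer level is downsampled and merged into the next coarser one so small-object detail propagates upward, can be sketched as follows; the class name, the three-level toy pyramid, and the plain additive fusion are illustrative assumptions, not the paper's BPAM.

```python
import torch
import torch.nn as nn

class BottomUpPathAggregation(nn.Module):
    """Generic bottom-up aggregation over an FPN-style pyramid: each finer level
    is strided-convolved and added into the next coarser one."""
    def __init__(self, channels=256, num_levels=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1))
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels))

    def forward(self, pyramid):               # pyramid[0] is the finest level
        outs = [self.smooth[0](pyramid[0])]
        for i in range(1, len(pyramid)):
            outs.append(self.smooth[i](pyramid[i] + self.down[i - 1](outs[-1])))
        return outs

# toy usage on a 3-level pyramid with shared channel width
p2, p3, p4 = torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16)
n2, n3, n4 = BottomUpPathAggregation()([p2, p3, p4])
```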
Citations: 0
SPSC-Net: Shared parallel space-channel attention mechanism transformer network for cell sequence image segmentation
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-10 | DOI: 10.1016/j.cviu.2025.104562
Shengwen Chen, Haixing Song, Huiling Feng, Jieyuan Hu, Qiuyi Bai, Fuyun He
In recent years, Transformer networks have achieved significant success in the field of computer vision, mainly due to their self-attention mechanism, which computes the attention weights of each element in the input sequence with respect to all other elements. This allows the Transformer to consider the information of the entire sequence, effectively capturing global contextual information and thereby excelling at modeling long-distance dependencies. To overcome the high computational cost of the self-attention mechanism and its insufficient consideration of channel information, this study proposes a shared parallel space-channel attention mechanism, known as the SPSC module. This module integrates spatial and channel attention, effectively capturing global attention and enhancing feature representation through a sharing mechanism. Additionally, hierarchical feature design and overlapping patch design are introduced to further strengthen the spatial and channel dimensions of the feature representation. Concurrently, Haar wavelet downsampling is employed to replace the patch-embedding part of the base network, utilizing its multi-scale analysis capability to enhance the model's ability to capture image hierarchy and detail information. In the decoder design, a multi-resolution Transformer cascade decoder based on the SPSC module is used. This design significantly enhances the performance of the Transformer model in cell image segmentation tasks. Experimental results indicate that the proposed SPSC-Net model significantly outperforms other methods in segmentation performance across multiple cell datasets. Furthermore, the importance of each module in the model is verified through ablation studies.
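Haar wavelet downsampling replaces strided pooling by rearranging each 2x2 block into four frequency sub-bands stacked along the channel axis, so detail is preserved as extra channels instead of being discarded. The sketch below shows the generic operator with the standard orthonormal Haar normalization; the sign conventions and the absence of the follow-up channel-reducing convolution are assumptions with respect to the paper.

```python
import torch
import torch.nn as nn

class HaarDownsample(nn.Module):
    """2x2 Haar wavelet downsampling: split the feature map into LL/LH/HL/HH
    sub-bands and stack them along channels; output is (B, 4C, H/2, W/2)."""
    def forward(self, x):                       # x: (B, C, H, W), H and W even
        a = x[..., 0::2, 0::2]                  # top-left of each 2x2 block
        b = x[..., 0::2, 1::2]                  # top-right
        c = x[..., 1::2, 0::2]                  # bottom-left
        d = x[..., 1::2, 1::2]                  # bottom-right
        ll = (a + b + c + d) / 2                # low-frequency average
        lh = (-a - b + c + d) / 2               # vertical detail
        hl = (-a + b - c + d) / 2               # horizontal detail
        hh = (a - b - c + d) / 2                # diagonal detail
        return torch.cat([ll, lh, hl, hh], dim=1)

# toy usage; a 1x1 convolution would typically follow to reduce the 4x channels
out = HaarDownsample()(torch.randn(1, 64, 56, 56))   # -> (1, 256, 28, 28)
```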
Citations: 0
BAP-DETR: Efficient drone object detection network based on bipartite attentive processing and dual fusion encoder
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-10 | DOI: 10.1016/j.cviu.2025.104565
Zhiping Wang, Peng Yu, Xuchong Zhang, Hongbin Sun
Object detection in drone aerial imagery faces critical challenges including extreme scale variance, clustered small objects, and complex backgrounds, leading to notable performance gaps in general detectors. The most effective solution is to increase the input resolution, but this substantially increases computational load. Existing methods are unable to achieve a satisfactory balance between accuracy and speed due to architectural inadequacies in preserving fine-grained features essential for small objects. Thus, we present an optimized model architecture based on the RT-DETR framework. By proposing the Bipartite Attentive Processing Block, which employs a channel-splitting strategy that allows parallel convolution and attention refinement, we improve the model’s ability to extract discriminative features from complex aerial images. A novel dual-fusion encoder with a Frequency-Aware Fusion Module further improves the model’s performance by retaining critical low-level features while effectively merging them with high-level semantic information. Additionally, we optimize the loss function by combining the Reciprocal Normalized Wasserstein Distance with CIoU. Extensive experiments on the VisDrone, UAVDT and AI-TOD datasets demonstrate the efficiency and effectiveness of our method. In particular, our method achieves a 6.9% higher AP than the baseline, requires 17.5% less computational load and provides superior accuracy compared to state-of-the-art methods.
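The Normalized Wasserstein Distance used in the loss treats each box as a 2-D Gaussian and compares them in closed form. The sketch below follows the commonly used NWD definition for tiny-object detection; the normalizing constant and the (1 - NWD) loss shaping are assumptions, since the paper's reciprocal variant and its exact blend with CIoU are not specified here.

```python
import torch

def normalized_wasserstein_distance(boxes1, boxes2, constant=12.8):
    """NWD between axis-aligned boxes modeled as 2-D Gaussians
    N([cx, cy], diag((w/2)^2, (h/2)^2)). Boxes are (N, 4) in (cx, cy, w, h);
    returns (N,) similarities in (0, 1]. The constant is dataset-dependent."""
    cx1, cy1, w1, h1 = boxes1.unbind(-1)
    cx2, cy2, w2, h2 = boxes2.unbind(-1)
    w2_sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2 +
             ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)   # squared 2-Wasserstein
    return torch.exp(-torch.sqrt(w2_sq) / constant)

# toy usage: a loss term such as (1 - NWD), possibly blended with a CIoU term
b1 = torch.tensor([[10.0, 10.0, 4.0, 6.0]])
b2 = torch.tensor([[11.0, 12.0, 5.0, 5.0]])
loss_nwd = 1.0 - normalized_wasserstein_distance(b1, b2)
```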
Citations: 0