
Latest publications in Computer Vision and Image Understanding

Action-conditioned contrastive learning for 3D human pose and shape estimation in videos
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-12 · DOI: 10.1016/j.cviu.2024.104149
Inpyo Song, Moonwook Ryu, Jangwon Lee
The aim of this research is to estimate 3D human pose and shape in videos, which is a challenging task due to the complex nature of the human body and the wide range of possible pose and shape variations. This problem is also difficult to solve satisfactorily because of the trade-off between the accuracy and the temporal consistency of the estimated 3D pose and shape, so previous studies have prioritized one objective over the other. In contrast, we propose a novel approach called the action-conditioned mesh recovery (ACMR) model, which improves accuracy without compromising temporal consistency by leveraging human action information. Our ACMR model outperforms existing methods that prioritize temporal consistency in terms of accuracy, while also achieving comparable temporal consistency with other state-of-the-art methods. Significantly, the action-conditioned learning process occurs only during training, requiring no additional resources at inference time, thereby enhancing performance without increasing computational demands.
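The abstract does not spell out the form of the action-conditioned objective, so the following is only a minimal sketch of how an action-conditioned contrastive term could be wired during training: video clips sharing an action label are treated as positives in a supervised-contrastive loss over pose/shape embeddings. All names and the temperature value are illustrative assumptions, not the ACMR formulation.

```python
import torch
import torch.nn.functional as F

def action_conditioned_contrastive_loss(embeddings, action_labels, temperature=0.1):
    """Supervised-contrastive-style loss: clips sharing an action label are positives.

    embeddings:    (N, D) pose/shape features from the video encoder (training only).
    action_labels: (N,)   integer action ids.
    Illustrative sketch only, not the ACMR paper's exact objective.
    """
    z = F.normalize(embeddings, dim=1)                 # cosine-similarity space
    sim = z @ z.t() / temperature                      # (N, N) similarity logits
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (action_labels.unsqueeze(0) == action_labels.unsqueeze(1)) & ~self_mask

    # log-softmax over all other samples, then average over positives per anchor
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()            # skip anchors with no positive
```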
Citations: 0
Triple-Stream Commonsense Circulation Transformer Network for Image Captioning
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-12 · DOI: 10.1016/j.cviu.2024.104165
Jianchao Li, Wei Zhou, Kai Wang, Haifeng Hu

Traditional image captioning methods have only a local perspective at the dataset level, so they explore the dispersed information within individual images. However, the lack of a global perspective prevents them from capturing common characteristics among similar images. To address this limitation, this paper introduces a novel Triple-stream Commonsense Circulating Transformer Network (TCCTN). It incorporates a contextual stream into the encoder, combining an enhanced channel stream and a spatial stream for comprehensive feature learning. The proposed commonsense-aware contextual attention (CCA) module queries commonsense contextual features from the dataset, obtaining global contextual association information by projecting grid features into the contextual space. The pure semantic channel attention (PSCA) module leverages the compressed spatial domain for channel pooling, focusing on the attention weights of pure channel features to capture inherent semantic features. The region spatial attention (RSA) module enhances spatial concepts in semantic learning by incorporating region position information. Furthermore, leveraging the complementary differences among the three features, TCCTN introduces a mixture-of-experts strategy to enhance the unique discriminative ability of the features and promote their integration in textual feature learning. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of the contextual commonsense stream and the superior performance of TCCTN.
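As a rough illustration of the CCA idea of querying dataset-level commonsense context, the sketch below lets grid features cross-attend to a small learned memory bank of context vectors. The module name, slot count, and dimensions are assumptions chosen for illustration; the paper's actual design may differ.

```python
import torch
import torch.nn as nn

class CommonsenseContextAttention(nn.Module):
    """Illustrative CCA-style module: grid features query a learned, dataset-level
    memory of commonsense context vectors (not the paper's released code)."""

    def __init__(self, dim=512, num_slots=64, num_heads=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # dataset-level context
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, grid_feats):                        # grid_feats: (B, N, dim)
        b = grid_feats.size(0)
        mem = self.memory.unsqueeze(0).expand(b, -1, -1)  # (B, num_slots, dim)
        ctx, _ = self.attn(query=grid_feats, key=mem, value=mem)
        return self.norm(grid_feats + ctx)                # residual fusion of global context

# usage: feats = CommonsenseContextAttention()(torch.randn(2, 49, 512))
```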

Citations: 0
Delving into CLIP latent space for Video Anomaly Recognition
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-12 · DOI: 10.1016/j.cviu.2024.104163
Luca Zanella, Benedetta Liberatori, Willi Menapace, Fabio Poiesi, Yiming Wang, Elisa Ricci
We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Vision and Language Models (VLMs), such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also leverage a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods considering three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies. Project website and code are available at https://lucazanella.github.io/AnomalyCLIP/.
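A minimal sketch of the scoring idea described here, assuming precomputed CLIP frame features, text-derived class directions, and a normal-subspace centroid: frames are centered against the normal subspace and projected onto the class directions, and the projection magnitude serves as the per-class anomaly score. This illustrates the stated mechanism and is not AnomalyCLIP's released code.

```python
import torch
import torch.nn.functional as F

def anomaly_scores(frame_feats, class_directions, normal_center):
    """Score frames by the magnitude of their projection onto text-driven
    anomaly-class directions in CLIP space (illustrative sketch).

    frame_feats:      (T, D) precomputed CLIP image features of the video frames.
    class_directions: (C, D) directions derived from class text prompts.
    normal_center:    (D,)   centroid of the normal-event subspace.
    Returns (T, C) scores: larger magnitude -> frame more likely anomalous for that class.
    """
    centered = frame_feats - normal_center       # remove the normal-event component
    dirs = F.normalize(class_directions, dim=1)  # unit-length class directions
    return centered @ dirs.t()                   # signed projection magnitudes
```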
Citations: 0
A convex Kullback–Leibler optimization for semi-supervised few-shot learning
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-12 · DOI: 10.1016/j.cviu.2024.104152
Yukun Liu, Zhaohui Luo, Daming Shi

Few-shot learning has achieved great success in many fields, thanks to its requirement of only a limited number of labeled data. However, most state-of-the-art few-shot learning techniques employ transfer learning, which still requires massive labeled data to train a meta-learning system. To simulate the human learning mechanism, a deep model of few-shot learning is proposed to learn from one, or a few, examples. In this paper, we first analyze and note that representative semi-supervised few-shot learning methods tend to get stuck in local optima and neglect the intra-class compactness problem. To address these issues, we propose a novel semi-supervised few-shot learning method with Convex Kullback–Leibler optimization, hereafter referred to as CKL, in which KL divergence is employed to reach the global optimum by optimizing a strictly convex function to perform clustering, whereas a sample selection strategy is employed to achieve intra-class compactness. In training, CKL is optimized iteratively via deep learning and the expectation–maximization algorithm. Intensive experiments have been conducted on three popular benchmark data sets; on the miniImagenet data set, for example, our proposed CKL achieves 76.83% and 85.78% under the 5-way 1-shot and 5-way 5-shot settings, respectively. The experimental results show that this method significantly improves the classification ability of few-shot learning tasks and obtains state-of-the-art performance.
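To make the described iterative loop concrete, here is a hedged, minimal EM-style sketch of KL-divergence-based prototype refinement for semi-supervised few-shot classification. The choice of distributions, the responsibility rule, and the update step are assumptions for illustration, not the CKL paper's exact convex objective.

```python
import torch
import torch.nn.functional as F

def kl_em_refine(features, prototypes, n_iters=10):
    """EM-style prototype refinement driven by KL divergence (illustrative sketch).

    features:   (N, D) support + unlabeled query embeddings.
    prototypes: (K, D) initial class prototypes from the labeled support set.
    Returns soft assignments (N, K) and refined prototypes (K, D).
    """
    p = F.softmax(features, dim=1).clamp(min=1e-12)        # rows as distributions over D
    log_p = p.log()
    for _ in range(n_iters):
        q = F.softmax(prototypes, dim=1).clamp(min=1e-12)  # (K, D) prototype distributions
        # E-step: responsibility of class k for sample n ~ exp(-KL(p_n || q_k))
        kl = (p.unsqueeze(1) * (log_p.unsqueeze(1) - q.log().unsqueeze(0))).sum(-1)  # (N, K)
        resp = F.softmax(-kl, dim=1)
        # M-step: prototypes become responsibility-weighted means of the raw features
        prototypes = (resp.t() @ features) / resp.sum(dim=0, keepdim=True).t().clamp(min=1e-8)
    return resp, prototypes
```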

Citations: 0
CAFNet: Context aligned fusion for depth completion
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-11 · DOI: 10.1016/j.cviu.2024.104158
Zhichao Fu, Anran Wu, Shuwen Yang, Tianlong Ma, Liang He

Depth completion aims at reconstructing a dense depth map from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient context for reconstructing focal contexts such as the shape of objects. The RGB images contain redundant contexts, including details that are useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as to the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: the Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples input RGB-D pairs to a scale at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps with fused multimodal context from the previous stage are upsampled to the original scale and subsequently fused with full-scale depth features by the GF module using a dynamic masked fusion strategy. Ultimately, accurate dense depth maps are reconstructed by leveraging the GF module's resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.
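To make the dynamic masked fusion idea concrete, here is a small hedged sketch of a gating module that predicts a per-pixel mask from concatenated depth and context features and blends the two accordingly. The module name and channel sizes are assumptions; the GF module in CAFNet may be structured differently.

```python
import torch
import torch.nn as nn

class DynamicMaskedFusion(nn.Module):
    """Sketch of a dynamic masked fusion step: a predicted per-pixel mask decides how
    much image-guided context versus depth feature to keep (an assumption about the
    gating idea, not CAFNet's released code)."""

    def __init__(self, channels=64):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),                                    # per-pixel fusion weight in [0, 1]
        )

    def forward(self, depth_feat, context_feat):             # both (B, C, H, W)
        m = self.mask_head(torch.cat([depth_feat, context_feat], dim=1))
        return m * context_feat + (1.0 - m) * depth_feat     # mask-weighted blend
```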

Citations: 0
HBANet: A hybrid boundary-aware attention network for infrared and visible image fusion
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-10 · DOI: 10.1016/j.cviu.2024.104161
Xubo Luo, Jinshuo Zhang, Liping Wang, Dongmei Niu

Infrared and visible image fusion is an extensively investigated problem in infrared image processing, aiming to extract useful information from source images. However, the automatic fusion of these images presents a significant challenge due to the large domain difference and ambiguous boundaries. In this article, we propose a novel image fusion approach based on hybrid boundary-aware attention, termed HBANet, which models global dependencies across the image and leverages boundary-wise prior knowledge to supplement local details. Specifically, we design a novel mixed boundary-aware attention module that is capable of leveraging spatial information to the fullest extent and integrating long-range dependencies across different domains. To preserve the integrity of texture and structural information, we introduce a composite loss function that comprises structure, intensity, and variation losses. In our experiments on public datasets, our method outperforms state-of-the-art methods in terms of both visual and quantitative metrics. Furthermore, our approach exhibits strong generalization capability, achieving satisfactory results in CT and MRI image fusion tasks.
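Since the loss is described only as comprising structure, intensity, and variation terms, the sketch below shows one plausible way such a composite fusion objective can be assembled: L1 to the element-wise maximum for intensity, a gradient-magnitude term for variation, and a global correlation term for structure. The specific terms and weights are assumptions, not HBANet's published loss.

```python
import torch
import torch.nn.functional as F

def sobel_grad(x):
    """Approximate horizontal/vertical gradients of a single-channel image batch (B, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)

def fusion_loss(fused, ir, vis, w_int=1.0, w_struct=1.0, w_var=1.0):
    """Hedged sketch of a structure + intensity + variation objective for IR/visible fusion.
    All tensors are (B, 1, H, W) images in [0, 1]; the exact terms in HBANet may differ."""
    # intensity: follow the brighter (more salient) of the two sources at each pixel
    loss_int = F.l1_loss(fused, torch.maximum(ir, vis))
    # structure: stay globally correlated with both sources
    cos_ir = F.cosine_similarity(fused.flatten(1), ir.flatten(1), dim=1)
    cos_vis = F.cosine_similarity(fused.flatten(1), vis.flatten(1), dim=1)
    loss_struct = 1.0 - 0.5 * (cos_ir + cos_vis).mean()
    # variation: preserve the strongest gradient of either source at every pixel
    fgx, fgy = sobel_grad(fused)
    igx, igy = sobel_grad(ir)
    vgx, vgy = sobel_grad(vis)
    loss_var = F.l1_loss(fgx.abs(), torch.maximum(igx.abs(), vgx.abs())) + \
               F.l1_loss(fgy.abs(), torch.maximum(igy.abs(), vgy.abs()))
    return w_int * loss_int + w_struct * loss_struct + w_var * loss_var
```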

Citations: 0
Multi-modal transformer with language modality distillation for early pedestrian action anticipation
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-10 · DOI: 10.1016/j.cviu.2024.104144
Nada Osman, Guglielmo Camporese, Lamberto Ballan

Language-vision integration has become an increasingly popular research direction within the computer vision field. In recent years, there has been growing recognition of the importance of incorporating linguistic information into visual tasks, particularly in domains such as action anticipation. This integration allows anticipation models to leverage textual descriptions to gain deeper contextual understanding, leading to more accurate predictions. In this work, we focus on pedestrian action anticipation, where the objective is the early prediction of pedestrians' future actions in urban environments. Our method relies on a multi-modal transformer model that encodes past observations and produces predictions at different anticipation times, employing a learned mask technique to filter out redundancy in the observed frames. Instead of relying solely on visual cues extracted from images or videos, we explore the impact of integrating textual information to enrich the input modalities of our pedestrian action anticipation model. We investigate various techniques for generating descriptive captions corresponding to input images, aiming to enhance anticipation performance. Evaluation results on available public benchmarks demonstrate the effectiveness of our method in improving prediction performance at different anticipation times compared to previous works. Additionally, incorporating the language modality into our anticipation model yields a significant improvement, reaching a 29.5% increase in the F1 score at 1-second anticipation and a 16.66% increase at 4-second anticipation. These results underscore the potential of language-vision integration in advancing pedestrian action anticipation in complex urban environments.
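A minimal sketch of the learned-mask idea mentioned above: a small gating head predicts a soft keep-probability per observed frame token, down-weighting redundant frames before the temporal transformer. The module and dimensions are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class LearnedFrameMask(nn.Module):
    """Learned per-frame mask that down-weights redundant observed frames before the
    temporal transformer (a hedged reading of the described technique)."""

    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1))

    def forward(self, frame_tokens):                   # (B, T, dim)
        g = torch.sigmoid(self.gate(frame_tokens))     # (B, T, 1) soft keep-probability
        return frame_tokens * g, g.squeeze(-1)

# usage: a transformer encoder can then attend over the masked tokens
# tokens, gates = LearnedFrameMask()(torch.randn(4, 16, 256))
```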

Citations: 0
Human–object interaction detection algorithm based on graph structure and improved cascade pyramid network
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-07 · DOI: 10.1016/j.cviu.2024.104162
Qing Ye, Xiuju Xu, Rui Li, Yongmei Zhang

Aiming at the problem of the insufficient use of human–object interaction (HOI) information and spatial location information in images, we propose a human–object interaction detection network based on a graph structure and an improved cascade pyramid. This network is composed of three branches, namely a graph branch, a human–object branch, and a human pose branch. In the graph branch, we propose a Graph-based Interactive Feature Generation Algorithm (GIFGA) to address the inadequate utilization of interaction information. GIFGA constructs an initial dense graph model by taking humans and objects as nodes and their interaction relationships as edges. Then, by traversing each node, the graph model is updated to generate the final interaction features. In the human pose branch, we propose an Improved Cascade Pyramid Network (ICPN) to tackle the underutilization of spatial location information. ICPN extracts human pose features and maps both the object bounding boxes and the extracted human pose maps onto the global feature map to capture the most discriminative interaction-related region features within the global context. Finally, the features from the three branches are fed into a Multi-Layer Perceptron (MLP) for fusion and then classified for recognition. Experimental results demonstrate that our network achieves mAP of 54.93% and 28.69% on the V-COCO and HICO-DET datasets, respectively.
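To make the graph-branch description concrete, here is a hedged sketch of one message-passing round over a dense human/object graph, where node features are updated from their interaction edges. The layer structure is an assumption chosen to illustrate the idea, not the GIFGA algorithm itself.

```python
import torch
import torch.nn as nn

class InteractionGraphLayer(nn.Module):
    """One round of message passing over a human/object graph: nodes are human and
    object features, edges are interaction relationships (illustrative sketch)."""

    def __init__(self, dim=256):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, nodes, adj):
        # nodes: (N, dim) human + object features; adj: (N, N) float 0/1 interaction edges
        n = nodes.size(0)
        pair = torch.cat([nodes.unsqueeze(1).expand(n, n, -1),
                          nodes.unsqueeze(0).expand(n, n, -1)], dim=-1)
        messages = self.edge_mlp(pair) * adj.unsqueeze(-1)         # zero out missing edges
        agg = messages.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.node_mlp(torch.cat([nodes, agg], dim=-1))      # updated node features
```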

Citations: 0
VIDF-Net: A Voxel-Image Dynamic Fusion method for 3D object detection
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-07 · DOI: 10.1016/j.cviu.2024.104164
Xuezhi Xiang, Dianang Li, Xi Wang, Xiankun Zhou, Yulong Qiao

In recent years, multi-modal fusion methods have shown excellent performance in the field of 3D object detection; these methods select voxel centers and globally fuse them with image features across the scene. However, such approaches have two issues. First, the distribution of voxel density is highly heterogeneous due to the discrete volumes. Additionally, there are significant differences between the features of images and point clouds. Global fusion does not take the correspondence between these two modalities into account, which leads to insufficient fusion. In this paper, we propose a new multi-modal fusion method named Voxel-Image Dynamic Fusion (VIDF). Specifically, VIDF-Net is composed of the Voxel Centroid Mapping module (VCM) and the Deformable Attention Fusion module (DAF). The Voxel Centroid Mapping module is used to calculate the centroid of voxel features and map them onto the image plane, which locates the position of voxel features more effectively. We then use the Deformable Attention Fusion module to dynamically calculate the offset of each voxel centroid from the image position and combine these two modalities. Furthermore, we propose a Region Proposal Network with Channel-Spatial Aggregate to combine channel and spatial attention maps for improved multi-scale feature interaction. We conduct extensive experiments on the KITTI dataset to demonstrate the outstanding performance of the proposed VIDF network. In particular, significant improvements are observed in the Hard categories of Cars and Pedestrians, which shows the effectiveness of our approach in dealing with complex scenarios.
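The Voxel Centroid Mapping step can be illustrated with a generic pinhole projection: average the points inside each voxel and project the centroid into the image with a 3x4 camera matrix. The tensor layout and calibration handling below are assumptions; VIDF-Net's implementation on KITTI may differ.

```python
import torch

def project_voxel_centroids(voxel_points, calib_P):
    """Average the LiDAR points in each voxel and project the centroid onto the image
    plane with a pinhole camera matrix (illustrative sketch of the VCM idea).

    voxel_points: (V, N, 3) padded points per voxel (all-zero rows where empty).
    calib_P:      (3, 4)    camera projection matrix.
    Returns (V, 2) pixel coordinates of each voxel centroid.
    """
    counts = (voxel_points.abs().sum(-1) > 0).sum(dim=1, keepdim=True).clamp(min=1)  # (V, 1)
    centroids = voxel_points.sum(dim=1) / counts                                     # (V, 3)
    homo = torch.cat([centroids, torch.ones_like(centroids[:, :1])], dim=1)          # (V, 4)
    uvw = homo @ calib_P.t()                                                         # (V, 3)
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                                  # (V, 2) pixels
```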

Citations: 0
HAD-Net: An attention U-based network with hyper-scale shifted aggregating and max-diagonal sampling for medical image segmentation
IF 4.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-07 · DOI: 10.1016/j.cviu.2024.104151
Junding Sun, Yabei Li, Xiaosheng Wu, Chaosheng Tang, Shuihua Wang, Yudong Zhang

Objectives:

Accurate extraction of regions of interest (ROI) with variable shapes and scales is one of the primary challenges in medical image segmentation. Current U-based networks mostly aggregate multi-stage encoding outputs as an improved multi-scale skip connection. Although this design has been proven to provide scale diversity and contextual integrity, several intuitive limits remain: (i) the encoding outputs are simply resampled to the same size, which destroys fine-grained information, so the advantages of utilizing multiple scales are not fully realized; (ii) certain redundant information, proportional to the feature dimension size, is introduced and causes multi-stage interference; and (iii) the precision of information delivery relies on the up-sampling and down-sampling layers, but guidance on maintaining consistency in feature locations and trends between them is lacking.

Methods:

To improve these situations, this paper proposes a U-based CNN network named HAD-Net, assembling a new hyper-scale shifted aggregating module (HSAM) paradigm and progressive reusing attention (PRA) for skip connections, and employing a novel pair of dual-branch parameter-free sampling layers, i.e., max-diagonal pooling (MDP) and max-diagonal un-pooling (MDUP). That is, the aggregating scheme additionally combines five subregions with certain offsets in the shallower stage, since the lower scale-down ratios of the subregions enrich the scales and the fine-grained context. The attention scheme contains a partial-to-global channel attention (PGCA) and a multi-scale reusing spatial attention (MRSA); it builds reusing connections internally and adjusts the focus toward more useful dimensions. Finally, MDP and MDUP are explored in pairs to improve texture delivery and feature consistency, enhancing information retention and avoiding positional confusion.
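MDP and MDUP are described as a paired, parameter-free down-/up-sampling scheme that keeps feature locations consistent. The sketch below shows the standard index-sharing mechanism with plain max pooling and unpooling in PyTorch; the paper's max-diagonal selection rule would replace the plain max here, so this is an analogy rather than the authors' layer.

```python
import torch
import torch.nn as nn

# Paired pooling/unpooling with shared indices: the decoder restores values to the
# exact positions the encoder selected, which is the positional-consistency mechanism
# the MDP/MDUP pair builds on (illustrative analogy, not the paper's layer).
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 16, 64, 64)
down, idx = pool(x)      # encoder side: remember which positions were kept
up = unpool(down, idx)   # decoder side: place values back at the same positions
assert up.shape == x.shape
```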

Results:

Compared to state-of-the-art networks, HAD-Net has achieved comparable and even better performances with Dice of 90.13%, 81.51%, and 75.43% for each class on BraTS20, 89.59% Dice and 98.56% AUC on Kvasir-SEG, as well as 82.17% Dice and 98.05% AUC on DRIVE.

Conclusions:

The scheme of HSAM+PRA+MDP+MDUP has been proven to be a remarkable improvement and leaves room for further research.

Citations: 0