Stimulating Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling
Tong Li, Hansen Feng, Lizhi Wang, Lin Zhu, Zhiwei Xiong, Hua Huang
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432812
Image denoising is a fundamental problem in computational photography, where achieving high perceptual quality with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, emerging diffusion models have achieved state-of-the-art performance on various tasks and demonstrate great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. First, input inconsistency hinders the connection between diffusion models and image denoising. Second, content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising. The code is available at https://github.com/Li-Tong-621/DMID.
{"title":"Stimulating Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling.","authors":"Tong Li, Hansen Feng, Lizhi Wang, Lin Zhu, Zhiwei Xiong, Hua Huang","doi":"10.1109/TPAMI.2024.3432812","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432812","url":null,"abstract":"<p><p>Image denoising is a fundamental problem in computational photography, where achieving high perception with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, the emerging diffusion model has achieved state-of-the-art performance in various tasks and demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. For one thing, the input inconsistency hinders the connection between diffusion models and image denoising. For another, the content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising. The code is available at https://github.com/Li-Tong-621/DMID.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DifFace: Blind Face Restoration with Diffused Error Contraction
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432651
Zongsheng Yue, Chen Change Loy
While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations outside their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, whose influences must be stabilized and balanced through laborious hyper-parameter tuning. In this work, we propose a novel method named DifFace that copes with unseen and complex degradations more gracefully, without complicated loss designs. The key to our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model, and then gradually transition from this intermediate state to the HQ target by recursively applying the pre-trained diffusion model. The transition distribution relies only on a restoration backbone trained with an L1 loss on some synthetic data, which favorably avoids the cumbersome training process of existing methods. Moreover, the transition distribution can contract the error of the restoration backbone, making our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Code and model are available at https://github.com/zsyOAOA/DifFace.
{"title":"DifFace: Blind Face Restoration with Diffused Error Contraction.","authors":"Zongsheng Yue, Chen Change Loy","doi":"10.1109/TPAMI.2024.3432651","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432651","url":null,"abstract":"<p><p>While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations out of their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which require laborious hyper-parameter tuning to stabilize and balance their influences. In this work, we propose a novel method named DifFace that is capable of coping with unseen and complex degradations more gracefully without complicated loss designs. The key of our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model and then gradually transmit from this intermediate state to the HQ target by recursively applying a pre-trained diffusion model. The transition distribution only relies on a restoration backbone that is trained with L<sub>1</sub> loss on some synthetic data, which favorably avoids the cumbersome training process in existing methods. Moreover, the transition distribution can contract the error of the restoration backbone and thus makes our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Code and model are available at https://github.com/zsyOAOA/DifFace.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to Cut via Hierarchical Sequence/Set Model for Efficient Mixed-Integer Programming
Jie Wang, Zhihai Wang, Xijun Li, Yufei Kuang, Zhihao Shi, Fangzhou Zhu, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, Feng Wu
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432716
Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) with human-designed heuristics, machine learning carries the potential to learn more effective ones. However, many existing learning-based methods learn which cuts to prefer while neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) the order of the selected cuts also significantly impacts the efficiency of MILP solvers. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, and (2) a lower-level module that formulates cut selection as a sequence/set-to-sequence learning problem and learns policies that select an ordered subset whose cardinality is determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two real-world problems from Huawei.
{"title":"Learning to Cut via Hierarchical Sequence/Set Model for Efficient Mixed-Integer Programming.","authors":"Jie Wang, Zhihai Wang, Xijun Li, Yufei Kuang, Zhihao Shi, Fangzhou Zhu, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, Feng Wu","doi":"10.1109/TPAMI.2024.3432716","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432716","url":null,"abstract":"<p><p>Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) by human-designed heuristics, machine learning carries the potential to learn more effective heuristics. However, many existing learning-based methods learn which cuts to prefer, neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) what order of selected cuts to prefer significantly impacts the efficiency of MILP solvers as well. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, (2) and a lower-level module-that formulates the cut selection as a sequence/set to sequence learning problem-to learn policies selecting an ordered subset with the cardinality determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that well tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two Huawei's real problems.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inductive State-Relabeling Adversarial Active Learning with Heuristic Clique Rescaling
Beichen Zhang, Liang Li, Shuhui Wang, Shaofei Cai, Zheng-Jun Zha, Qi Tian, Qingming Huang
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432099
Active learning (AL) designs label-efficient algorithms by labeling only the most representative samples. It reduces annotation cost and has attracted increasing attention from the community. However, previous AL methods suffer from inadequate annotations and unreliable uncertainty estimation. Moreover, we find that they ignore the intra-diversity of the selected samples, which leads to sampling redundancy. In view of these challenges, we propose an inductive state-relabeling adversarial AL model (ISRA) that consists of a unified representation generator, an inductive state-relabeling discriminator, and a heuristic clique rescaling module. The generator introduces contrastive learning to leverage unlabeled samples for self-supervised training, where mutual information is utilized to improve the representation quality for AL selection. Then, we design an inductive uncertainty indicator that learns a state score from labeled data and relabels unlabeled data with different importance for better discrimination of instructive samples. To solve the problem of sampling redundancy, the heuristic clique rescaling module measures the intra-diversity of candidate samples and recurrently rescales them to select the most informative samples. Experiments conducted on eight datasets and two imbalanced scenarios show that our model outperforms previous state-of-the-art AL methods. As an extension to the cross-modal AL task, we apply ISRA to image captioning, where it also achieves superior performance.
Latent Semantic and Disentangled Attention
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432631
Jen-Tzung Chien, Yu-Han Huang
Sequential learning with transformers has achieved state-of-the-art performance on natural language tasks and many others. The key to this success is multi-head self-attention, which encodes and gathers features from the individual tokens of an input sequence. Mapping or decoding is then performed to produce an output sequence via cross-attention. Such an attention framework has three weaknesses. First, since attention mixes up the features of different tokens in the input and output sequences, redundant information is likely to exist in the sequence representation. Second, the attention-weight patterns of different heads tend to be similar, so the model capacity is bounded. Third, the robustness of the encoder-decoder network against model uncertainty is disregarded. To handle these weaknesses, this paper presents a Bayesian semantic and disentangled mask attention that learns latent disentanglement in multi-head attention, where the redundant features in the transformer are compensated with latent topic information. The attention weights are filtered by a mask that is optimized through semantic clustering. This attention mechanism is implemented according to Bayesian learning for clustered disentanglement. Experiments on machine translation and speech recognition show the merit of Bayesian clustered disentanglement for mask attention.
{"title":"Latent Semantic and Disentangled Attention.","authors":"Jen-Tzung Chien, Yu-Han Huang","doi":"10.1109/TPAMI.2024.3432631","DOIUrl":"10.1109/TPAMI.2024.3432631","url":null,"abstract":"<p><p>Sequential learning using transformer has achieved state-of-the-art performance in natural language tasks and many others. The key to this success is the multi-head self attention which encodes and gathers the features from individual tokens of an input sequence. The mapping or decoding is performed to produce an output sequence via cross attention. There are threefold weaknesses by using such an attention framework. First, since the attention would mix up the features of different tokens in input and output sequences, it is likely that redundant information exists in sequence data representation. Second, the patterns of attention weights among different heads tend to be similar. The model capacity is bounded. Third, the robustness in an encoder-decoder network against the model uncertainty is disregarded. To handle these weaknesses, this paper presents a Bayesian semantic and disentangled mask attention to learn latent disentanglement in multi-head attention where the redundant features in transformer are compensated with the latent topic information. The attention weights are filtered by a mask which is optimized through semantic clustering. This attention mechanism is implemented according to Bayesian learning for clustered disentanglement. The experiments on machine translation and speech recognition show the merit of Bayesian clustered disentanglement for mask attention.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HDF-Net: Capturing Homogeny Difference Features to Localize the Tampered Image
Ruidong Han, Xiaofeng Wang, Ningning Bai, Yihang Wang, Jianpeng Hou, Jianru Xue
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432551
Modern image editing software enables anyone to alter the content of an image to deceive the public, which can pose a security hazard to personal privacy and public safety. The detection and localization of image tampering has become an urgent issue. We reveal that, after manipulations such as splicing, copy-move, and removal, the tampered region exhibits homogeny differences (changes in the image's metadata organization form and structure) from the real region. Therefore, we propose a novel end-to-end network named HDF-Net to extract these homogeny difference features for precise localization of tampering artifacts. HDF-Net is composed of RGB and SRM dual-stream networks comprising three complementary modules: the suspicious tampering-artifact prominent (STP) module, the fine tampering-artifact salient (FTS) module, and the tampering-artifact edge refined (TER) module. We utilize the fully attentional block (FLA) to enhance the characterization ability of the homogeny difference features extracted by each module and to preserve the specifics of tampering artifacts. These modules are gradually merged according to a "coarse-fine-finer" strategy, which significantly improves localization accuracy and edge refinement. Extensive experiments demonstrate that HDF-Net performs better than state-of-the-art tampering localization models on five benchmarks, achieving satisfactory generalization and robustness. Code can be found at https://github.com/ruidonghan/HDF-Net/.
{"title":"HDF-Net: Capturing Homogeny Difference Features to Localize the Tampered Image.","authors":"Ruidong Han, Xiaofeng Wang, Ningning Bai, Yihang Wang, Jianpeng Hou, Jianru Xue","doi":"10.1109/TPAMI.2024.3432551","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432551","url":null,"abstract":"<p><p>Modern image editing software enables anyone to alter the content of an image to deceive the public, which can pose a security hazard to personal privacy and public safety. The detection and localization of image tampering is becoming an urgent issue to be addressed. We have revealed that the tampered region exhibits homogenous differences (the changes in metadata organization form and organization structure of the image) from the real region after manipulations such as splicing, copy-move, and removal. Therefore, we propose a novel end-to-end network named HDF-Net to extract these homogeny difference features for precise localization of tampering artifacts. The HDF-Net is composed of RGB and SRM dual-stream networks, including three complementary modules, namely the suspicious tampering-artifact prominent (STP) module, the fine tampering-artifact salient (FTS) module, and the tampering-artifact edge refined (TER) module. We utilize the fully attentional block (FLA) to enhance the characterization ability of homogeny difference features extracted by each module and preserve the specifics of tampering artifacts. These modules are gradually merged according to the strategy of \"coarse-fine-finer\", which significantly improves the localization accuracy and edge refinement. Extensive experiments demonstrate that HDF-Net performs better than state-of-the-art tampering localization models on five benchmarks, achieving satisfactory generalization and robustness. Code can be found at https://github.com/ruidonghan/HDF-Net/.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432326
Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang
The challenge of semantic segmentation with scarce pixel-level annotations has motivated many self-supervised works; however, most of them essentially train an image encoder or a segmentation head that produces finer dense representations, and when performing segmentation inference they must resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates from real-time, end-to-end inference practice, but also escalates the problem from segmenting each image to clustering all pixels at once, which results in degraded performance. To remedy this issue, we propose a novel self-supervised semantic segmentation training and inference paradigm in which inference is performed end to end. Specifically, based on our observations from probing the dense representations of an image-level self-supervised ViT, i.e., semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with an attention-map constraint to train a tailored Transformer decoder with learnable prototypes, and we utilize adaptive prototypes for per-image segmentation inference. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and generalizability of our proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.
FEditNet++: Few-Shot Editing of Latent Semantics in GAN Spaces with Correlated Attribute Disentanglement
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432529
Ran Yi, Teng Hu, Mengfei Xia, Yizhe Tang, Yong-Jin Liu
Generative Adversarial Networks have achieved significant advancements in generating and editing high-resolution images. However, most methods either require extensive labeled datasets or strong prior knowledge, and it is challenging for them to disentangle correlated attributes with few-shot data. In this paper, we propose FEditNet++, a GAN-based approach to explore latent semantics. It aims to enable attribute editing with limited labeled data and to disentangle the correlated attributes. We propose a layer-wise feature contrastive objective, which takes content consistency into consideration and facilitates the invariance of the unrelated attributes before and after editing. Furthermore, we harness the knowledge from the pretrained discriminative model to prevent overfitting. In particular, to solve the entanglement problem between correlated attributes arising from data and semantic latent correlation, we extend our model to jointly optimize multiple attributes and propose a novel decoupling loss and cross-assessment loss to disentangle them in both latent and image space. We further propose a novel-attribute disentanglement strategy to enable editing of novel attributes with unknown entanglements. Finally, we extend our model to accurately edit fine-grained attributes. Qualitative and quantitative assessments demonstrate that our method outperforms state-of-the-art approaches across various datasets, including CelebA-HQ, RaFD, Danbooru2018 and LSUN Church.
Towards a Flexible Semantic Guided Model for Single Image Enhancement and Restoration
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432308
Yuhui Wu, Guoqing Wang, Shaochong Liu, Yang Yang, Wei Li, Xiongxin Tang, Shuhang Gu, Chongyi Li, Heng Tao Shen
Low-light image enhancement (LLIE) investigates how to improve the brightness of an image captured in illumination-insufficient environments. The majority of existing methods enhance low-light images in a global, uniform manner, without taking into account the semantic information of different regions. Consequently, a network may easily deviate from the original color of local regions. To address this issue, we propose a semantic-aware knowledge-guided framework (SKF) that assists a low-light enhancement model in learning the rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that adaptively integrates semantic priors in the feature representation space, a semantic-guided color histogram loss that preserves the color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures from semantic priors. Our SKF is appealing as a general framework for the LLIE task. We further present a refined framework, SKF++, with two new techniques: (a) an extra convolutional branch for intra-class illumination and color recovery that extracts local information, and (b) an equalization-based histogram transformation for contrast enhancement and high-dynamic-range adjustment. Extensive experiments on various benchmarks of the LLIE task and other image processing tasks show that models equipped with SKF/SKF++ significantly outperform the baselines, and that SKF/SKF++ generalizes well to different models and scenes. Besides, the potential benefits of our method for face detection and semantic segmentation in low-light conditions are discussed. The code and pre-trained models are publicly available at https://github.com/langmanbusi/Semantic-Aware-Low-Light-Image-Enhancement.
{"title":"Towards a Flexible Semantic Guided Model for Single Image Enhancement and Restoration.","authors":"Yuhui Wu, Guoqing Wang, Shaochong Liu, Yang Yang, Wei Li, Xiongxin Tang, Shuhang Gu, Chongyi Li, Heng Tao Shen","doi":"10.1109/TPAMI.2024.3432308","DOIUrl":"10.1109/TPAMI.2024.3432308","url":null,"abstract":"<p><p>Low-light image enhancement (LLIE) investigates how to improve the brightness of an image captured in illumination-insufficient environments. The majority of existing methods enhance low-light images in a global and uniform manner, without taking into account the semantic information of different regions. Consequently, a network may easily deviate from the original color of local regions. To address this issue, we propose a semantic-aware knowledge-guided framework (SKF) that can assist a low-light enhancement model in learning rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that adaptively integrates semantic priors in feature representation space, a semantic-guided color histogram loss that preserves color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures by semantic priors. Our SKF is appealing in acting as a general framework in the LLIE task. We further present a refined framework SKF++ with two new techniques: (a) Extra convolutional branch for intra-class illumination and color recovery through extracting local information and (b) Equalization-based histogram transformation for contrast enhancement and high dynamic range adjustment. Extensive experiments on various benchmarks of LLIE task and other image processing tasks show that models equipped with the SKF/SKF++ significantly outperform the baselines and our SKF/SKF++ generalizes to different models and scenes well. Besides, the potential benefits of our method in face detection and semantic segmentation in low-light conditions are discussed. The code and pre-trained models have been publicly available at https://github.com/langmanbusi/Semantic-Aware-Low-Light-Image-Enhancement.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unpaired Image-text Matching via Multimodal Aligned Conceptual Knowledge
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432552
Yan Huang, Yuming Wang, Yunan Zeng, Junshi Huang, Zhenhua Chai, Liang Wang
Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, human brains can readily match images with texts using stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method named Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit specific datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on this knowledge, we represent the parsed words in the texts by prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated by bidirectional similarity pooling into an image-text similarity score, which can be used directly for unpaired image-text matching. The proposed MACK is complementary to existing models and can easily be extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.
{"title":"Unpaired Image-text Matching via Multimodal Aligned Conceptual Knowledge.","authors":"Yan Huang, Yuming Wang, Yunan Zeng, Junshi Huang, Zhenhua Chai, Liang Wang","doi":"10.1109/TPAMI.2024.3432552","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432552","url":null,"abstract":"<p><p>Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. Different from them, human brains can well match images with texts using their stored multimodal knowledge. Inspired by that, this paper studies a new scenario as unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we accordingly propose a simple yet effective method namely Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit for certain datasets, we refine it using unpaired images and texts in a self-supervised learning manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on the knowledge, we represent parsed words in the texts by prototypical region representations, and compute region-word similarity scores. At last, the scores are aggregated based on bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary with existing models, which can be easily extended as a re-ranking method to substantially improve their performance of zero-shot and cross-dataset image-text matching.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}