
IEEE transactions on pattern analysis and machine intelligence: Latest Publications

Frequency-aware Feature Fusion for Dense Image Prediction.
Pub Date : 2024-08-26 DOI: 10.1109/TPAMI.2024.3449959
Linwei Chen, Ying Fu, Lin Gu, Chenggang Yan, Tatsuya Harada, Gao Huang

Dense image prediction tasks demand features with strong category information and precise spatial boundary details at high resolution. To achieve this, modern hierarchical models often utilize feature fusion, directly adding upsampled coarse features from deep layers and high-resolution features from lower levels. In this paper, we observe rapid variations in fused feature values within objects, resulting in intra-category inconsistency due to disturbed high-frequency features. Additionally, blurred boundaries in fused features lack accurate high frequency, leading to boundary displacement. Building upon these observations, we propose Frequency-Aware Feature Fusion (FreqFusion), integrating an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator. The ALPF generator predicts spatially-variant low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistency during upsampling. The offset generator refines large inconsistent features and thin boundaries by replacing inconsistent features with more consistent ones through resampling, while the AHPF generator enhances high-frequency detailed boundary information lost during downsampling. Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries. Extensive experiments across various dense prediction tasks confirm its effectiveness. The code is made publicly available at https://github.com/Linwei-Chen/FreqFusion.
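To make the fusion mechanism concrete, the sketch below applies a spatially-variant, softmax-normalized (hence smoothing/low-pass) kernel, predicted from the high-resolution feature, to the upsampled coarse feature before adding the two maps. It is a minimal illustration under assumed names and sizes (ALPFGenerator, a 3x3 kernel, matching channel counts), not the authors' implementation; that is available at the repository linked above.

```python
# Minimal PyTorch sketch of frequency-aware fusion with a spatially-variant
# low-pass filter; class names, kernel size and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALPFGenerator(nn.Module):
    """Predicts a KxK low-pass kernel for every output pixel from a guide feature."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.pred = nn.Conv2d(channels, k * k, kernel_size=3, padding=1)

    def forward(self, guide):
        # Softmax over the kernel taps keeps every filter non-negative and
        # sum-to-one, i.e. a smoothing (low-pass) filter.
        return F.softmax(self.pred(guide), dim=1)              # (B, k*k, H, W)

def spatially_variant_filter(x, kernels, k=3):
    """Filter x with a different KxK kernel at each spatial location."""
    b, c, h, w = x.shape
    patches = F.unfold(x, kernel_size=k, padding=k // 2)        # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h, w)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)          # (B, C, H, W)

class NaiveFreqAwareFusion(nn.Module):
    """Fuse an upsampled coarse feature with a high-resolution feature."""
    def __init__(self, channels):
        super().__init__()
        self.alpf = ALPFGenerator(channels)

    def forward(self, coarse, fine):
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:], mode='nearest')
        kernels = self.alpf(fine)                 # filters guided by the fine feature
        return spatially_variant_filter(coarse_up, kernels) + fine

fused = NaiveFreqAwareFusion(64)(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```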

Citations: 0
Say No to Freeloader: Protecting Intellectual Property of Your Deep Model.
Pub Date : 2024-08-26 DOI: 10.1109/TPAMI.2024.3450282
Lianyu Wang, Meng Wang, Huazhu Fu, Daoqaing Zhang

Model intellectual property (IP) protection has attracted growing attention as science and technology advancements stem from human intellectual labor and computational expenses. Ensuring IP safety for trainers and owners is of utmost importance, particularly in domains where ownership verification and applicability authorization are required. A notable approach to safeguarding model IP involves proactively preventing the use of well-trained models of authorized domains from unauthorized domains. In this paper, we introduce a novel Compact Un-transferable Pyramid Isolation Domain (CUPI-Domain) which serves as a barrier against illegal transfers from authorized to unauthorized domains. Drawing inspiration from human transitive inference and learning abilities, the CUPI-Domain is designed to obstruct cross-domain transfers by emphasizing the distinctive style features of the authorized domain. This emphasis leads to failure in recognizing irrelevant private style features on unauthorized domains. To this end, we propose novel CUPI-Domain generators, which select features from both authorized and CUPI-Domain as anchors. Then, we fuse the style features and semantic features of these anchors to generate labeled and style-rich CUPI-Domain. Additionally, we design external Domain-Information Memory Banks (DIMB) for storing and updating labeled pyramid features to obtain stable domain class features and domain class-wise style features. Based on the proposed whole method, the novel style and discriminative loss functions are designed to effectively enhance the distinction in style and discriminative features between authorized and unauthorized domains, respectively. Moreover, we provide two solutions for utilizing CUPI-Domain based on whether the unauthorized domain is known: target-specified CUPI-Domain and target-free CUPI-Domain. By conducting comprehensive experiments on various public datasets, we validate the effectiveness of our proposed CUPI-Domain approach with different backbone models. The results highlight that our method offers an efficient model intellectual property protection solution.
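As a rough illustration of what "fuse the style features and semantic features of these anchors" can look like, the toy sketch below re-stylizes one feature map with the channel-wise statistics of another (AdaIN-style statistics transfer). This is a generic stand-in chosen for clarity, not the paper's CUPI-Domain generator, and all shapes are assumptions.

```python
# Toy AdaIN-style fusion of "style" statistics and semantic content between two
# feature maps; an illustrative stand-in, not the authors' CUPI-Domain code.
import torch

def channel_stats(feat, eps=1e-5):
    # Per-channel mean/std are a common proxy for "style".
    b, c = feat.shape[:2]
    flat = feat.reshape(b, c, -1)
    mean = flat.mean(dim=2).reshape(b, c, 1, 1)
    std = flat.var(dim=2).add(eps).sqrt().reshape(b, c, 1, 1)
    return mean, std

def fuse_style_and_content(content_feat, style_feat):
    """Keep the content of content_feat but impose the channel statistics of style_feat."""
    c_mean, c_std = channel_stats(content_feat)
    s_mean, s_std = channel_stats(style_feat)
    return (content_feat - c_mean) / c_std * s_std + s_mean

authorized = torch.randn(4, 256, 14, 14)   # anchor features from the authorized domain
isolation = torch.randn(4, 256, 14, 14)    # anchor features from the isolation domain
cupi_like = fuse_style_and_content(content_feat=authorized, style_feat=isolation)
print(cupi_like.shape)  # torch.Size([4, 256, 14, 14])
```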

Citations: 0
Gaseous Object Detection.
Pub Date : 2024-08-26 DOI: 10.1109/TPAMI.2024.3449994
Kailai Zhou, Yibo Wang, Tao Lv, Qiu Shen, Xun Cao

Object detection, a fundamental and challenging problem in computer vision, has experienced rapid development due to the effectiveness of deep learning. The current objects to be detected are mostly rigid solid substances with apparent and distinct visual characteristics. In this paper, we endeavor on a scarcely explored task named Gaseous Object Detection (GOD), which is undertaken to explore whether the object detection techniques can be extended from solid substances to gaseous substances. Nevertheless, the gas exhibits significantly different visual characteristics: 1) saliency deficiency, 2) arbitrary and ever-changing shapes, 3) lack of distinct boundaries. To facilitate the study on this challenging task, we construct a GOD-Video dataset comprising 600 videos (141,017 frames) that cover various attributes with multiple types of gases. A comprehensive benchmark is established based on this dataset, allowing for a rigorous evaluation of frame-level and video-level detectors. Deduced from the Gaussian dispersion model, the physics-inspired Voxel Shift Field (VSF) is designed to model geometric irregularities and ever-changing shapes in potential 3D space. By integrating VSF into Faster RCNN, the VSF RCNN serves as a simple but strong baseline for gaseous object detection. Our work aims to attract further research into this valuable albeit challenging area.
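For readers unfamiliar with the physics the VSF is "deduced from", the snippet below evaluates the textbook Gaussian plume (dispersion) model for a continuous point source with ground reflection. It shows only that background model; it is not the VSF or the VSF RCNN.

```python
# Textbook Gaussian plume (dispersion) model; background for the physics that
# motivates the Voxel Shift Field, unrelated to the paper's detector code.
import numpy as np

def gaussian_plume(q, u, y, z, sigma_y, sigma_z, stack_height):
    """Steady-state concentration downwind of a continuous point source.

    q: emission rate, u: wind speed, (y, z): crosswind / vertical offsets,
    sigma_y, sigma_z: dispersion coefficients (they grow with downwind distance),
    stack_height: effective release height. The second vertical term models
    reflection at the ground.
    """
    lateral = np.exp(-y**2 / (2 * sigma_y**2))
    vertical = (np.exp(-((z - stack_height) ** 2) / (2 * sigma_z**2))
                + np.exp(-((z + stack_height) ** 2) / (2 * sigma_z**2)))
    return q / (2 * np.pi * u * sigma_y * sigma_z) * lateral * vertical

print(gaussian_plume(q=1.0, u=2.0, y=0.0, z=1.5, sigma_y=5.0, sigma_z=3.0, stack_height=1.5))
```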

Citations: 0
LTM-NeRF: Embedding 3D Local Tone Mapping in HDR Neural Radiance Field.
Pub Date : 2024-08-23 DOI: 10.1109/TPAMI.2024.3448620
Xin Huang, Qi Zhang, Ying Feng, Hongdong Li, Qing Wang

Recent advances in Neural Radiance Fields (NeRF) have provided a new geometric primitive for novel view synthesis. High Dynamic Range NeRF (HDR NeRF) can render novel views with a higher dynamic range. However, effectively displaying the scene contents of HDR NeRF on diverse devices with limited dynamic range poses a significant challenge. To address this, we present LTM-NeRF, a method designed to recover HDR NeRF and support 3D local tone mapping. LTM-NeRF allows for the synthesis of HDR views, tone-mapped views, and LDR views under different exposure settings, using only the multi-view multi-exposure LDR inputs for supervision. Specifically, we propose a differentiable Camera Response Function (CRF) module for HDR NeRF reconstruction, globally mapping the scene's HDR radiance to LDR pixels. Moreover, we introduce a Neural Exposure Field (NeEF) to represent the spatially varying exposure time of an HDR NeRF to achieve 3D local tone mapping, for compatibility with various displays. Comprehensive experiments demonstrate that our method can not only synthesize HDR views and exposure-varying LDR views accurately but also render locally tone-mapped views naturally. For video results, please view our supplemental materials or visit the project page at: https://xhuangcv.github.io/LTM-NeRF/.
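The CRF module's role can be pictured with the standard image-formation assumption LDR = g(radiance x exposure time), with g learned end-to-end. The sketch below uses an assumed parameterization (a sigmoid MLP on log exposure) purely for illustration; in the paper the per-ray exposure would come from the proposed NeEF, and the real module may differ.

```python
# Sketch of a learnable camera response function mapping HDR radiance to LDR
# pixels; the sigmoid-MLP parameterization is an assumption for illustration.
import torch
import torch.nn as nn

class LearnableCRF(nn.Module):
    """Tone curve applied elementwise: LDR = g(log(radiance * exposure_time))."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # LDR values live in [0, 1]
        )

    def forward(self, hdr_radiance, exposure_time):
        # Classic image formation: the sensor integrates radiance over the
        # exposure time before the response curve is applied.
        log_exposure = torch.log(hdr_radiance * exposure_time + 1e-8)
        shape = log_exposure.shape
        return self.mlp(log_exposure.reshape(-1, 1)).reshape(shape)

crf = LearnableCRF()
hdr = torch.rand(1024, 3) * 10.0            # HDR radiance along 1024 rays
exposure = torch.full((1024, 3), 1.0 / 60)  # could instead be predicted per ray
ldr = crf(hdr, exposure)
print(ldr.shape, float(ldr.min()), float(ldr.max()))
```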

Citations: 0
LIA: Latent Image Animator.
Pub Date : 2024-08-23 DOI: 10.1109/TPAMI.2024.3449075
Yaohui Wang, Di Yang, Francois Bremond, Antitza Dantcheva

Previous animation techniques mainly focus on leveraging explicit structure representations (e.g., meshes or keypoints) for transferring motion from driving videos to source images. However, such methods are challenged with large appearance variations between source and driving data, as well as require complex additional modules to respectively model appearance and motion. Towards addressing these issues, we introduce the Latent Image Animator (LIA), streamlined to animate high-resolution images. LIA is designed as a simple autoencoder that does not rely on explicit representations. Motion transfer in the pixel space is modeled as linear navigation of motion codes in the latent space. Specifically such navigation is represented as an orthogonal motion dictionary learned in a self-supervised manner based on proposed Linear Motion Decomposition (LMD). Extensive experimental results demonstrate that LIA outperforms state-of-the-art on VoxCeleb, TaichiHD, and TED-talk datasets with respect to video quality and spatio-temporal consistency. In addition LIA is well equipped for zero-shot high-resolution image animation.
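A minimal sketch of the "linear navigation of motion codes" idea: motion directions form an orthogonal dictionary, and a driven latent code is the source code plus a weighted sum of those directions. The dimensions and the QR-based orthogonalization below are assumptions for illustration, not the LIA architecture.

```python
# Sketch of linear motion navigation with an orthogonal motion dictionary;
# a toy illustration of the idea, not the LIA model.
import torch
import torch.nn as nn

class LinearMotionDecomposition(nn.Module):
    def __init__(self, latent_dim=512, num_directions=20):
        super().__init__()
        self.directions = nn.Parameter(torch.randn(num_directions, latent_dim))

    def orthogonal_dictionary(self):
        # QR factorization yields orthonormal directions, so each one acts as a
        # separate "axis" of motion in the latent space.
        q, _ = torch.linalg.qr(self.directions.t())
        return q.t()                                   # (num_directions, latent_dim)

    def forward(self, z_source, magnitudes):
        d = self.orthogonal_dictionary()
        return z_source + magnitudes @ d               # linear navigation in latent space

lmd = LinearMotionDecomposition()
z_src = torch.randn(2, 512)   # latent code of the source image
a = torch.randn(2, 20)        # motion magnitudes, e.g. predicted from the driving frame
z_driven = lmd(z_src, a)
print(z_driven.shape)  # torch.Size([2, 512])
```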

Citations: 0
Correlation-Embedded Transformer Tracking: A Single-Branch Framework.
Pub Date : 2024-08-22 DOI: 10.1109/TPAMI.2024.3448254
Fei Xie, Wankou Yang, Chunyu Wang, Lei Chu, Yue Cao, Chao Ma, Wenjun Zeng

Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, thereby hindering them from being robust and discriminative simultaneously. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike the Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be directly used to predict target locations without additional correlation steps. Thus, we reformulate the two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, integrating temporal modeling, and equipping with dedicated prediction heads. Thus, SuperSBT outperforms the SBT baseline by 4.7%,3.0%, and 4.5% AUC scores in LaSOT, TrackingNet, and GOT-10K. Notably, SuperSBT greatly raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks. Code is available at https://github.com/phiphiphi31/SBT.
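The single-branch idea, computing template-search correlation inside the backbone rather than with a separate correlation head, can be sketched as joint self-attention over the concatenated tokens of both images. The block below is a generic illustration with arbitrary sizes; the released SBT/SuperSBT code in the linked repository is the reference implementation.

```python
# Generic joint-attention block: template and search tokens attend to each
# other inside the feature extractor; arbitrary sizes, not the SBT code.
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, template_tokens, search_tokens):
        n_t = template_tokens.shape[1]
        x = torch.cat([template_tokens, search_tokens], dim=1)
        # Every token attends to both images, so cross-image correlation is
        # computed inside the backbone rather than by a separate correlation step.
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x[:, :n_t], x[:, n_t:]

template = torch.randn(1, 64, 256)   # 8x8 template-patch tokens
search = torch.randn(1, 256, 256)    # 16x16 search-region tokens
t_out, s_out = JointAttentionBlock()(template, search)
print(t_out.shape, s_out.shape)
```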

Citations: 0
Multimodal Cross-lingual Summarization for Videos: A Revisit in Knowledge Distillation Induced Triple-stage Training Method.
Pub Date : 2024-08-22 DOI: 10.1109/TPAMI.2024.3447778
Nayu Liu, Kaiwen Wei, Yong Yang, Jianhua Tao, Xian Sun, Fanglong Yao, Hongfeng Yu, Li Jin, Zhao Lv, Cunhang Fan

Multimodal summarization (MS) for videos aims to generate summaries from multi-source information (e.g., video and text transcript), and this technique has made promising progress recently. However, existing works are limited to monolingual video scenarios, overlooking the demands of non-native language video viewers to understand cross-lingual videos in practical applications. It stimulates us to introduce multimodal cross-lingual summarization for videos (MCLS), which aims at generating cross-lingual summarization from multimodal input of videos. Considering the challenge of high annotation cost and resource constraints in MCLS, we propose a knowledge distillation (KD) induced triple-stage training method to assist MCLS by transferring knowledge from abundant monolingual MS data to those data with insufficient volumes. In the triple-stage training method, a video-guided dual fusion network (VDF) is designed as the backbone network to integrate multimodal and cross-lingual information through different fusion strategies of encoder and decoder; what's more, we propose two cross-lingual knowledge distillation strategies: adaptive pooling distillation and language-adaptive warping distillation (LAWD). These strategies are tailored for distillation objects (i.e., encoder-level and vocab-level KD) to facilitate effective knowledge transfer across cross-lingual sequences of varying lengths between MS and MCLS models. Specifically, to tackle the challenge of unequal length of parallel cross-language sequences in KD, our proposed LAWD can directly conduct cross-language distillation while keeping the language feature shape unchanged to reduce potential information loss. We meticulously annotated the How2-MCLS dataset based on the How2 dataset to simulate the MCLS scenario. The experimental results show that the proposed method achieves competitive performance compared to strong baselines, and can bring substantial performance improvements to MCLS models by transferring knowledge from the MS model.
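The length-mismatch problem that the two distillation strategies address can be illustrated with the simpler of them, adaptive pooling distillation: pool teacher and student hidden-state sequences of unequal length to a shared length and penalize their difference. The function below is a hedged sketch of that idea with assumed shapes, not the paper's implementation.

```python
# Sketch of distillation between parallel sequences of unequal length via
# adaptive pooling to a shared length; shapes and loss choice are assumptions.
import torch
import torch.nn.functional as F

def adaptive_pooling_distillation(teacher_hidden, student_hidden, pooled_len=32):
    """teacher_hidden: (B, Lt, D), student_hidden: (B, Ls, D), with Lt != Ls."""
    # adaptive_avg_pool1d expects (B, C, L), so treat the hidden size as channels.
    t = F.adaptive_avg_pool1d(teacher_hidden.transpose(1, 2), pooled_len)
    s = F.adaptive_avg_pool1d(student_hidden.transpose(1, 2), pooled_len)
    return F.mse_loss(s, t.detach())      # only the student receives gradients

teacher = torch.randn(4, 57, 768)                       # e.g. monolingual MS encoder states
student = torch.randn(4, 43, 768, requires_grad=True)   # e.g. cross-lingual MCLS encoder states
loss = adaptive_pooling_distillation(teacher, student)
loss.backward()
print(float(loss))
```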

Citations: 0
A Survey on Continual Semantic Segmentation: Theory, Challenge, Method and Application.
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3446949
Bo Yuan, Danpei Zhao

Continual learning, also known as incremental learning or life-long learning, stands at the forefront of deep learning and AI systems. It breaks through the obstacle of one-way training on close sets and enables continuous adaptive learning on open-set conditions. In the recent decade, continual learning has been explored and applied in multiple fields especially in computer vision covering classification, detection and segmentation tasks. Continual semantic segmentation (CSS), of which the dense prediction peculiarity makes it a challenging, intricate and burgeoning task. In this paper, we present a review of CSS, committing to building a comprehensive survey on problem formulations, primary challenges, universal datasets, neoteric theories and multifarious applications. Concretely, we begin by elucidating the problem definitions and primary challenges. Based on an in-depth investigation of relevant approaches, we sort out and categorize current CSS models into two main branches including data-replay and data-free sets. In each branch, the corresponding approaches are similarity-based clustered and thoroughly analyzed, following qualitative comparison and quantitative reproductions on relevant datasets. Besides, we also introduce four CSS specialities with diverse application scenarios and development tendencies. Furthermore, we develop a benchmark for CSS encompassing representative references, evaluation results and reproductions, which is available at https://github.com/YBIO/SurveyCSS. We hope this survey can serve as a reference-worthy and stimulating contribution to the advancement of the life-long learning field, while also providing valuable perspectives for related fields.
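As a concrete anchor for the survey's "data-replay" branch, the sketch below shows the generic ingredient such methods share: a small buffer of earlier-task samples (filled here by reservoir sampling) that is mixed into each new-task batch. It is method-agnostic and illustrative only.

```python
# Bare-bones reservoir-sampling replay buffer, the common ingredient of the
# "data-replay" family of continual segmentation methods; illustrative only.
import random

class ReplayBuffer:
    def __init__(self, capacity=200):
        self.capacity = capacity
        self.storage = []        # (image_path, mask_path) pairs from earlier tasks
        self.seen = 0

    def add(self, sample):
        # Reservoir sampling keeps a uniform subset of everything seen so far.
        self.seen += 1
        if len(self.storage) < self.capacity:
            self.storage.append(sample)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.storage[j] = sample

    def sample(self, k):
        return random.sample(self.storage, min(k, len(self.storage)))

buffer = ReplayBuffer(capacity=4)
for i in range(100):
    buffer.add((f"img_{i}.png", f"mask_{i}.png"))
print(buffer.sample(2))   # old-task samples to mix into the current training batch
```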

Citations: 0
U-Match: Exploring Hierarchy-aware Local Context for Two-view Correspondence Learning.
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3447048
Zizhuo Li, Shihua Zhang, Jiayi Ma

Rejecting outlier correspondences is one of the critical steps for successful feature-based two-view geometry estimation, and contingent heavily upon local context exploration. Recent advances focus on devising elaborate local context extractors whereas typically adopting explicit neighborhood relationship modeling at a specific scale, which is intrinsically flawed and inflexible, because 1) severe outliers often populated in putative correspondences and 2) the uncertainty in the distribution of inliers and outliers make the network incapable of capturing adequate and reliable local context from such neighborhoods, therefore resulting in the failure of pose estimation. This prospective study proposes a novel network called U-Match that has the flexibility to enable implicit local context awareness at multiple levels, naturally circumventing the aforementioned issues that plague most existing studies. Specifically, to aggregate multi-level local context implicitly, a hierarchy-aware graph representation module is designed to flexibly encode and decode hierarchical features. Moreover, considering that global context always works collaboratively with local context, an orthogonal local-and-global information fusion module is presented to integrate complementary local and global context in a redundancy-free manner, thus yielding compact feature representations to facilitate correspondence learning. Thorough experimentation across relative pose estimation, homography estimation, visual localization, and point cloud registration affirms U-Match's remarkable capabilities. Our code is publicly available at https://github.com/ZizhuoLi/U-Match.
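A generic building block in this line of correspondence learning, and a rough proxy for the local-context aggregation discussed above, is kNN-graph neighborhood pooling over putative matches. The sketch below shows that operation with assumed shapes; it is not U-Match itself, whose code is linked above.

```python
# kNN-graph local-context aggregation over putative correspondences; a generic
# operation in this research line, shown with assumed shapes (not U-Match).
import torch

def knn_local_context(coords, feats, k=8):
    """coords: (N, 4) putative matches (x1, y1, x2, y2); feats: (N, D) per-match features."""
    dist = torch.cdist(coords, coords)                    # (N, N) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]  # k nearest neighbors, self excluded
    neighbor_feats = feats[idx]                           # (N, k, D)
    local = neighbor_feats.max(dim=1).values              # permutation-invariant pooling
    return torch.cat([feats, local], dim=-1)              # fuse per-match and local context

coords = torch.rand(500, 4)
feats = torch.randn(500, 128)
enriched = knn_local_context(coords, feats)
print(enriched.shape)  # torch.Size([500, 256])
```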

Citations: 0
Medical Image Segmentation Review: The Success of U-Net.
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3435571
Reza Azad, Ehsan Khodapanah Aghdam, Amelie Rauland, Yiwei Jia, Atlas Haddadi Avval, Afshin Bozorgpour, Sanaz Karimijafarbigloo, Joseph Paul Cohen, Ehsan Adeli, Dorit Merhof

Automatic medical image segmentation is a crucial topic in the medical domain and successively a critical counterpart in the computer-aided diagnosis paradigm. U-Net is the most widespread image segmentation architecture due to its flexibility, optimized modular design, and success in all medical image modalities. Over the years, the U-Net model has received tremendous attention from academic and industrial researchers who have extended it to address the scale and complexity created by medical tasks. These extensions are commonly related to enhancing the U-Net's backbone, bottleneck, or skip connections, or including representation learning, or combining it with a Transformer architecture, or even addressing probabilistic prediction of the segmentation map. Having a compendium of different previously proposed U-Net variants makes it easier for machine learning researchers to identify relevant research questions and understand the challenges of the biological tasks that challenge the model. In this work, we discuss the practical aspects of the U-Net model and organize each variant model into a taxonomy. Moreover, to measure the performance of these strategies in a clinical application, we propose fair evaluations of some unique and famous designs on well-known datasets. Furthermore, we provide a comprehensive implementation library with trained models. In addition, for ease of future studies, we created an online list of U-Net papers with their possible official implementation. All information is gathered in a GitHub repository https://github.com/NITR098/Awesome-U-Net.
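For readers new to the architecture under review, the sketch below is a deliberately tiny two-scale U-Net that makes the encoder/decoder-with-skip-connections pattern concrete; widths and depth are arbitrary and far smaller than practical medical segmentation models.

```python
# Deliberately tiny two-scale U-Net showing the encoder/decoder + skip pattern;
# sizes are arbitrary and chosen only for readability.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 skip channels + 64 upsampled channels
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)

logits = TinyUNet()(torch.randn(1, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64])
```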

Citations: 0