Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang
Foundation models like ChatGPT and Sora that are trained on data at a huge scale have made a revolutionary social impact. However, it is extremely challenging for sensors in many other fields to collect data at a scale comparable to natural images for training strong foundation models. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. Specifically, SimMAT improves the segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields in obtaining better results with vision foundation models.
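As an illustration only, the sketch below shows one way a modality-agnostic transfer layer could map an input with an arbitrary number of channels (e.g., a 4-channel polarization image) into the patch-token space of a frozen ViT-based foundation model such as SAM. The layer design, embedding size, and patch size are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityAgnosticTransfer(nn.Module):
    """Project an arbitrary-channel modality into ViT-style patch tokens."""

    def __init__(self, in_channels: int, embed_dim: int = 768, patch_size: int = 16):
        super().__init__()
        # A single strided convolution plays the role of a patch embedding for the
        # new modality; a frozen foundation model would consume the resulting tokens.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, H, W) -> (B, num_patches, embed_dim)
        tokens = self.proj(x)                     # (B, D, H/ps, W/ps)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D)

# Example: a 4-channel polarization image mapped to tokens for a frozen backbone.
mat = ModalityAgnosticTransfer(in_channels=4)
print(mat(torch.randn(2, 4, 1024, 1024)).shape)  # torch.Size([2, 4096, 768])
```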
{"title":"SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality","authors":"Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang","doi":"arxiv-2409.08083","DOIUrl":"https://doi.org/arxiv-2409.08083","url":null,"abstract":"Foundation models like ChatGPT and Sora that are trained on a huge scale of\u0000data have made a revolutionary social impact. However, it is extremely\u0000challenging for sensors in many different fields to collect similar scales of\u0000natural images to train strong foundation models. To this end, this work\u0000presents a simple and effective framework SimMAT to study an open problem: the\u0000transferability from vision foundation models trained on natural RGB images to\u0000other image modalities of different physical properties (e.g., polarization).\u0000SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained\u0000foundation model. We apply SimMAT to a representative vision foundation model\u0000Segment Anything Model (SAM) to support any evaluated new image modality. Given\u0000the absence of relevant benchmarks, we construct a new benchmark to evaluate\u0000the transfer learning performance. Our experiments confirm the intriguing\u0000potential of transferring vision foundation models in enhancing other sensors'\u0000performance. Specifically, SimMAT can improve the segmentation performance\u0000(mIoU) from 22.15% to 53.88% on average for evaluated modalities and\u0000consistently outperforms other baselines. We hope that SimMAT can raise\u0000awareness of cross-modal transfer learning and benefit various fields for\u0000better results with vision foundation models.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi
In the context of few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective: they typically rely on a single distance value between the query feature and the support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach that utilizes a multi-output embedding network to map samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism refines the features at each stage, leading to even more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves the results. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. Code: https://github.com/FatemehAskari/MSENet
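A minimal sketch of the stage-weighting idea described above: distances computed from several embedding stages are combined with learnable, softmax-normalized weights. The backbone, the number of stages, and the use of cosine distance are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageWeightedDistance(nn.Module):
    """Combine per-stage query/support distances with learnable weights."""

    def __init__(self, num_stages: int = 4):
        super().__init__()
        # One learnable scalar per stage; softmax keeps the mixture convex.
        self.stage_logits = nn.Parameter(torch.zeros(num_stages))

    def forward(self, query_feats, support_feats):
        # query_feats / support_feats: lists of per-stage features, each (B, D_s)
        weights = F.softmax(self.stage_logits, dim=0)
        dists = [1.0 - F.cosine_similarity(q, s, dim=-1)       # (B,) per stage
                 for q, s in zip(query_feats, support_feats)]
        return (weights * torch.stack(dists, dim=-1)).sum(dim=-1)  # (B,)

# Example with random 4-stage features for a batch of 3 query/support pairs.
module = StageWeightedDistance(num_stages=4)
q = [torch.randn(3, d) for d in (64, 128, 256, 512)]
s = [torch.randn(3, d) for d in (64, 128, 256, 512)]
print(module(q, s).shape)  # torch.Size([3])
```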
{"title":"Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms","authors":"Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi","doi":"arxiv-2409.07989","DOIUrl":"https://doi.org/arxiv-2409.07989","url":null,"abstract":"In the context of few-shot classification, the goal is to train a classifier\u0000using a limited number of samples while maintaining satisfactory performance.\u0000However, traditional metric-based methods exhibit certain limitations in\u0000achieving this objective. These methods typically rely on a single distance\u0000value between the query feature and support feature, thereby overlooking the\u0000contribution of shallow features. To overcome this challenge, we propose a\u0000novel approach in this paper. Our approach involves utilizing multi-output\u0000embedding network that maps samples into distinct feature spaces. The proposed\u0000method extract feature vectors at different stages, enabling the model to\u0000capture both global and abstract features. By utilizing these diverse feature\u0000spaces, our model enhances its performance. Moreover, employing a\u0000self-attention mechanism improves the refinement of features at each stage,\u0000leading to even more robust representations and improved overall performance.\u0000Furthermore, assigning learnable weights to each stage significantly improved\u0000performance and results. We conducted comprehensive evaluations on the\u0000MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way\u00005-shot scenarios. Additionally, we performed a cross-domain task from\u0000MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain.\u0000These evaluations demonstrate the efficacy of our proposed method in comparison\u0000to state-of-the-art approaches. https://github.com/FatemehAskari/MSENet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
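As a hedged illustration of the third technique, the snippet below implements one common reading of SNR-based weighting for diffusion training (a clipped, Min-SNR-style loss weight per sampled timestep); the exact scheme used by TextBoost may differ.

```python
import torch

def min_snr_weights(alphas_cumprod: torch.Tensor, t: torch.Tensor, gamma: float = 5.0):
    """Per-sample loss weights min(SNR(t), gamma) / SNR(t) for epsilon-prediction."""
    a_bar = alphas_cumprod[t]                  # (B,)
    snr = a_bar / (1.0 - a_bar)                # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
    return torch.clamp(snr, max=gamma) / snr   # (B,)

# Example with a linear beta schedule of 1000 steps and a batch of 8 sampled timesteps.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
t = torch.randint(0, 1000, (8,))
print(min_snr_weights(alphas_cumprod, t))
```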
{"title":"TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder","authors":"NaHyeon Park, Kunhee Kim, Hyunjung Shim","doi":"arxiv-2409.08248","DOIUrl":"https://doi.org/arxiv-2409.08248","url":null,"abstract":"Recent breakthroughs in text-to-image models have opened up promising\u0000research avenues in personalized image generation, enabling users to create\u0000diverse images of a specific subject using natural language prompts. However,\u0000existing methods often suffer from performance degradation when given only a\u0000single reference image. They tend to overfit the input, producing highly\u0000similar outputs regardless of the text prompt. This paper addresses the\u0000challenge of one-shot personalization by mitigating overfitting, enabling the\u0000creation of controllable images through text prompts. Specifically, we propose\u0000a selective fine-tuning strategy that focuses on the text encoder. Furthermore,\u0000we introduce three key techniques to enhance personalization performance: (1)\u0000augmentation tokens to encourage feature disentanglement and alleviate\u0000overfitting, (2) a knowledge-preservation loss to reduce language drift and\u0000promote generalizability across diverse prompts, and (3) SNR-weighted sampling\u0000for efficient training. Extensive experiments demonstrate that our approach\u0000efficiently generates high-quality, diverse images using only a single\u0000reference image while significantly reducing memory and storage requirements.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kerem Cekmeceli, Meva Himmetoglu, Guney I. Tombak, Anna Susmelj, Ertunc Erdil, Ender Konukoglu
Neural networks achieve state-of-the-art performance in many supervised learning tasks when the training data distribution matches the test data distribution. However, their performance drops significantly under domain (covariate) shift, a prevalent issue in medical image segmentation due to varying acquisition settings across scanner models and protocols. Recently, foundation models (FMs) trained on large datasets have gained attention for their ability to be adapted to downstream tasks, achieving state-of-the-art performance with excellent generalization on natural images. However, their effectiveness in medical image segmentation remains underexplored. In this paper, we investigate the domain generalization performance of various FMs, including DinoV2, SAM, MedSAM, and MAE, when fine-tuned with parameter-efficient fine-tuning (PEFT) techniques such as Ladder and Rein (+LoRA) and different decoder heads. We introduce a novel decoder head architecture, HQHSAM, which simply integrates elements from two state-of-the-art decoder heads, HSAM and HQSAM, to enhance segmentation performance. Our extensive experiments on multiple datasets, encompassing various anatomies and modalities, reveal that FMs, particularly with the HQHSAM decoder head, improve domain generalization for medical image segmentation. Moreover, we find that the effectiveness of PEFT techniques varies across different FMs. These findings underscore the potential of FMs to enhance the domain generalization performance of neural networks in medical image segmentation across diverse clinical settings, providing a solid foundation for future research. Code and models are available for research purposes at https://github.com/kerem-cekmeceli/Foundation-Models-for-Medical-Imagery.
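For readers unfamiliar with PEFT, the following is a minimal LoRA-style adapter sketch: a frozen pretrained linear layer augmented with a trainable low-rank update. The rank, scaling, and placement of such adapters inside any particular foundation model are assumptions here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank update is zero at initialization, so training starts from the base model.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```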
{"title":"Do Vision Foundation Models Enhance Domain Generalization in Medical Image Segmentation?","authors":"Kerem Cekmeceli, Meva Himmetoglu, Guney I. Tombak, Anna Susmelj, Ertunc Erdil, Ender Konukoglu","doi":"arxiv-2409.07960","DOIUrl":"https://doi.org/arxiv-2409.07960","url":null,"abstract":"Neural networks achieve state-of-the-art performance in many supervised\u0000learning tasks when the training data distribution matches the test data\u0000distribution. However, their performance drops significantly under domain\u0000(covariate) shift, a prevalent issue in medical image segmentation due to\u0000varying acquisition settings across different scanner models and protocols.\u0000Recently, foundational models (FMs) trained on large datasets have gained\u0000attention for their ability to be adapted for downstream tasks and achieve\u0000state-of-the-art performance with excellent generalization capabilities on\u0000natural images. However, their effectiveness in medical image segmentation\u0000remains underexplored. In this paper, we investigate the domain generalization\u0000performance of various FMs, including DinoV2, SAM, MedSAM, and MAE, when\u0000fine-tuned using various parameter-efficient fine-tuning (PEFT) techniques such\u0000as Ladder and Rein (+LoRA) and decoder heads. We introduce a novel decode head\u0000architecture, HQHSAM, which simply integrates elements from two\u0000state-of-the-art decoder heads, HSAM and HQSAM, to enhance segmentation\u0000performance. Our extensive experiments on multiple datasets, encompassing\u0000various anatomies and modalities, reveal that FMs, particularly with the HQHSAM\u0000decode head, improve domain generalization for medical image segmentation.\u0000Moreover, we found that the effectiveness of PEFT techniques varies across\u0000different FMs. These findings underscore the potential of FMs to enhance the\u0000domain generalization performance of neural networks in medical image\u0000segmentation across diverse clinical settings, providing a solid foundation for\u0000future research. Code and models are available for research purposes at\u0000url{https://github.com/kerem-cekmeceli/Foundation-Models-for-Medical-Imagery}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
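The following sketch illustrates the Score Distillation Sampling step referenced above in its generic form: noise a rendered image, query a text-conditioned noise predictor, and push the weighted residual back through the renderer. The noise predictor below is a dummy placeholder, and the timestep weighting is one common choice, not necessarily DreamHOI's.

```python
import torch

def sds_grad(rendered, alphas_cumprod, noise_pred_fn, t):
    """Return the SDS gradient w(t) * (eps_pred - eps) for a rendered image."""
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(rendered)
    noisy = a_bar.sqrt() * rendered + (1.0 - a_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = noise_pred_fn(noisy, t)   # a text-conditioned diffusion model in practice
    w = 1.0 - a_bar                          # one common per-timestep weight
    return w * (eps_pred - noise)

# Dummy usage: treat the "rendering" as a learnable image and inject the gradient.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
img = torch.randn(1, 3, 64, 64, requires_grad=True)
grad = sds_grad(img, alphas_cumprod, lambda x, t: torch.zeros_like(x), torch.tensor(500))
img.backward(gradient=grad)                  # accumulates the SDS gradient into img.grad
print(img.grad.abs().mean())
```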
{"title":"DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors","authors":"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab","doi":"arxiv-2409.08278","DOIUrl":"https://doi.org/arxiv-2409.08278","url":null,"abstract":"We present DreamHOI, a novel method for zero-shot synthesis of human-object\u0000interactions (HOIs), enabling a 3D human model to realistically interact with\u0000any given object based on a textual description. This task is complicated by\u0000the varying categories and geometries of real-world objects and the scarcity of\u0000datasets encompassing diverse HOIs. To circumvent the need for extensive data,\u0000we leverage text-to-image diffusion models trained on billions of image-caption\u0000pairs. We optimize the articulation of a skinned human mesh using Score\u0000Distillation Sampling (SDS) gradients obtained from these models, which predict\u0000image-space edits. However, directly backpropagating image-space gradients into\u0000complex articulation parameters is ineffective due to the local nature of such\u0000gradients. To overcome this, we introduce a dual implicit-explicit\u0000representation of a skinned mesh, combining (implicit) neural radiance fields\u0000(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\u0000we transition between implicit and explicit forms, grounding the NeRF\u0000generation while refining the mesh articulation. We validate our approach\u0000through extensive experiments, demonstrating its effectiveness in generating\u0000realistic HOIs.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hassan Rasheed, Reuben Dorent, Maximilian Fehrentz, Tina Kapur, William M. Wells III, Alexandra Golby, Sarah Frisken, Julia A. Schnabel, Nazim Haouchine
In this paper, we propose a texture-invariant 2D keypoint descriptor specifically designed for matching preoperative Magnetic Resonance (MR) images with intraoperative Ultrasound (US) images. We introduce a matching-by-synthesis strategy, where intraoperative US images are synthesized from MR images while accounting for multiple MR modalities and intraoperative US variability. We build our training set by enforcing keypoint localization over all images, and then train a patient-specific descriptor network that learns texture-invariant discriminant features in a supervised contrastive manner, leading to robust keypoint descriptors. Our experiments on real cases with ground truth show the effectiveness of the proposed approach, which outperforms state-of-the-art methods and achieves 80.35% matching precision on average.
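As an illustrative sketch of supervised contrastive descriptor learning, the snippet below treats matched MR/US keypoints as positives and all other keypoints in the batch as negatives using an InfoNCE-style loss; the temperature and exact loss form are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_descriptor_loss(desc_mr, desc_us, temperature: float = 0.07):
    # desc_mr, desc_us: (N, D) descriptors; row i of each tensor is the same keypoint.
    desc_mr = F.normalize(desc_mr, dim=-1)
    desc_us = F.normalize(desc_us, dim=-1)
    logits = desc_mr @ desc_us.T / temperature   # (N, N) cross-modal similarity matrix
    targets = torch.arange(desc_mr.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_descriptor_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```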
{"title":"Learning to Match 2D Keypoints Across Preoperative MR and Intraoperative Ultrasound","authors":"Hassan Rasheed, Reuben Dorent, Maximilian Fehrentz, Tina Kapur, William M. Wells III, Alexandra Golby, Sarah Frisken, Julia A. Schnabel, Nazim Haouchine","doi":"arxiv-2409.08169","DOIUrl":"https://doi.org/arxiv-2409.08169","url":null,"abstract":"We propose in this paper a texture-invariant 2D keypoints descriptor\u0000specifically designed for matching preoperative Magnetic Resonance (MR) images\u0000with intraoperative Ultrasound (US) images. We introduce a\u0000matching-by-synthesis strategy, where intraoperative US images are synthesized\u0000from MR images accounting for multiple MR modalities and intraoperative US\u0000variability. We build our training set by enforcing keypoints localization over\u0000all images then train a patient-specific descriptor network that learns\u0000texture-invariant discriminant features in a supervised contrastive manner,\u0000leading to robust keypoints descriptors. Our experiments on real cases with\u0000ground truth show the effectiveness of the proposed approach, outperforming the\u0000state-of-the-art methods and achieving 80.35% matching precision on average.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present Sparse R-CNN OBB, a novel framework for the detection of oriented objects in SAR images that leverages sparse learnable proposals. Sparse R-CNN OBB has a streamlined architecture and is easy to train, as it utilizes a sparse set of 300 proposals instead of training a proposal generator on hundreds of thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the first to adopt the concept of sparse learnable proposals for the detection of oriented objects, as well as for the detection of ships in Synthetic Aperture Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is redesigned to enable the model to capture object orientation. We also fine-tune the model on the RSDD-SAR dataset and provide a performance comparison to state-of-the-art models. Experimental results show that Sparse R-CNN OBB achieves outstanding performance, surpassing other models in both inshore and offshore scenarios. The code is available at: www.github.com/ka-mirul/Sparse-R-CNN-OBB.
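A minimal sketch of the sparse learnable proposals idea for oriented boxes: a small fixed set of proposals is stored as learnable parameters, here parameterized as (cx, cy, w, h, angle). The parameterization and sizes are assumptions, and the dynamic detection head that refines these proposals is omitted.

```python
import torch
import torch.nn as nn

class LearnableOrientedProposals(nn.Module):
    """A fixed set of learnable oriented proposals, replacing dense anchors."""

    def __init__(self, num_proposals: int = 300, hidden_dim: int = 256):
        super().__init__()
        # Five values per proposal: normalized (cx, cy, w, h) plus a rotation angle,
        # all trained end to end (initialized uniformly in [0, 1) for illustration).
        self.proposal_boxes = nn.Parameter(torch.rand(num_proposals, 5))
        # One feature vector per proposal, consumed by the (omitted) detection head.
        self.proposal_feats = nn.Parameter(torch.randn(num_proposals, hidden_dim))

    def forward(self, batch_size: int):
        boxes = self.proposal_boxes.unsqueeze(0).expand(batch_size, -1, -1)
        feats = self.proposal_feats.unsqueeze(0).expand(batch_size, -1, -1)
        return boxes, feats

boxes, feats = LearnableOrientedProposals()(batch_size=2)
print(boxes.shape, feats.shape)  # torch.Size([2, 300, 5]) torch.Size([2, 300, 256])
```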
{"title":"Sparse R-CNN OBB: Ship Target Detection in SAR Images Based on Oriented Sparse Proposals","authors":"Kamirul Kamirul, Odysseas Pappas, Alin Achim","doi":"arxiv-2409.07973","DOIUrl":"https://doi.org/arxiv-2409.07973","url":null,"abstract":"We present Sparse R-CNN OBB, a novel framework for the detection of oriented\u0000objects in SAR images leveraging sparse learnable proposals. The Sparse R-CNN\u0000OBB has streamlined architecture and ease of training as it utilizes a sparse\u0000set of 300 proposals instead of training a proposals generator on hundreds of\u0000thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the\u0000first to adopt the concept of sparse learnable proposals for the detection of\u0000oriented objects, as well as for the detection of ships in Synthetic Aperture\u0000Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is\u0000re-designed to enable the model to capture object orientation. We also\u0000fine-tune the model on RSDD-SAR dataset and provide a performance comparison to\u0000state-of-the-art models. Experimental results shows that Sparse R-CNN OBB\u0000achieves outstanding performance, surpassing other models on both inshore and\u0000offshore scenarios. The code is available at:\u0000www.github.com/ka-mirul/Sparse-R-CNN-OBB.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dense-localization Audio-Visual Events (DAVE) aims to identify the time boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. These methods thus inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and to inspire the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. (i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local correlation properties, without any extra annotations; this encourages the unimodal encoders to highlight the semantics shared by audio and visual features. (ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid, which captures the local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
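As a loose, simplified illustration of a locality-aware correspondence signal (not the paper's LCC/CDP formulation), the snippet below encourages audio and visual features from nearby time steps of the same video to agree more than features from distant time steps; the window size and hinge form are assumptions.

```python
import torch
import torch.nn.functional as F

def local_correspondence_loss(audio, visual, window: int = 2, margin: float = 0.2):
    # audio, visual: (T, D) per-frame features from the same untrimmed video.
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual, dim=-1)
    sim = a @ v.T                                    # (T, T) cross-modal similarity
    idx = torch.arange(sim.size(0))
    local = (idx[:, None] - idx[None, :]).abs() <= window
    pos = sim[local].mean()                          # agreement inside the local window
    neg = sim[~local].mean()                         # agreement with distant time steps
    return F.relu(margin + neg - pos)                # hinge: local should beat distant

print(local_correspondence_loss(torch.randn(50, 256), torch.randn(50, 256)))
```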
{"title":"Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization","authors":"Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang","doi":"arxiv-2409.07967","DOIUrl":"https://doi.org/arxiv-2409.07967","url":null,"abstract":"Dense-localization Audio-Visual Events (DAVE) aims to identify time\u0000boundaries and corresponding categories for events that can be heard and seen\u0000concurrently in an untrimmed video. Existing methods typically encode audio and\u0000visual representation separately without any explicit cross-modal alignment\u0000constraint. Then they adopt dense cross-modal attention to integrate multimodal\u0000information for DAVE. Thus these methods inevitably aggregate irrelevant noise\u0000and events, especially in complex and long videos, leading to imprecise\u0000detection. In this paper, we present LOCO, a Locality-aware cross-modal\u0000Correspondence learning framework for DAVE. The core idea is to explore local\u0000temporal continuity nature of audio-visual events, which serves as informative\u0000yet free supervision signals to guide the filtering of irrelevant information\u0000and inspire the extraction of complementary multimodal information during both\u0000unimodal and cross-modal learning stages. i) Specifically, LOCO applies\u0000Locality-aware Correspondence Correction (LCC) to uni-modal features via\u0000leveraging cross-modal local-correlated properties without any extra\u0000annotations. This enforces uni-modal encoders to highlight similar semantics\u0000shared by audio and visual features. ii) To better aggregate such audio and\u0000visual features, we further customize Cross-modal Dynamic Perception layer\u0000(CDP) in cross-modal feature pyramid to understand local temporal patterns of\u0000audio-visual events by imposing local consistency within multimodal features in\u0000a data-driven manner. By incorporating LCC and CDP, LOCO provides solid\u0000performance gains and outperforms existing methods for DAVE. The source code\u0000will be released.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Longfei Liu, Wen Guo, Shihua Huang, Cheng Li, Xi Shen
Reducing false positives is essential for enhancing object detector performance, as reflected in the mean Average Precision (mAP) metric. Although object detectors have achieved notable improvements and high mAP scores on the COCO dataset, analysis reveals limited progress in addressing false positives caused by non-target visual clutter, i.e., background objects not included in the annotated categories. This issue is particularly critical in real-world applications, such as fire and smoke detection, where minimizing false alarms is crucial. In this study, we introduce COCO-FP, a new evaluation dataset derived from the ImageNet-1K dataset and designed to address this issue. By extending the original COCO validation dataset, COCO-FP specifically assesses object detectors' performance in mitigating background false positives. Our evaluation of both standard and advanced object detectors shows a significant number of false positives in both closed-set and open-set scenarios. For example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting from COCO to COCO-FP. The dataset is available at https://github.com/COCO-FP/COCO-FP.
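For reference, detections on a COCO-format dataset such as COCO-FP can be scored with the standard COCO protocol via pycocotools, as sketched below; the file names are placeholders, and the COCO-FP repository should be consulted for the actual annotation files and instructions.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/coco_fp_val.json")              # hypothetical ground-truth file
coco_dt = coco_gt.loadRes("detections/detector_results.json")  # detector outputs, COCO format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP / AP50, etc.; extra background-only images can only add FPs
```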
{"title":"From COCO to COCO-FP: A Deep Dive into Background False Positives for COCO Detectors","authors":"Longfei Liu, Wen Guo, Shihua Huang, Cheng Li, Xi Shen","doi":"arxiv-2409.07907","DOIUrl":"https://doi.org/arxiv-2409.07907","url":null,"abstract":"Reducing false positives is essential for enhancing object detector\u0000performance, as reflected in the mean Average Precision (mAP) metric. Although\u0000object detectors have achieved notable improvements and high mAP scores on the\u0000COCO dataset, analysis reveals limited progress in addressing false positives\u0000caused by non-target visual clutter-background objects not included in the\u0000annotated categories. This issue is particularly critical in real-world\u0000applications, such as fire and smoke detection, where minimizing false alarms\u0000is crucial. In this study, we introduce COCO-FP, a new evaluation dataset\u0000derived from the ImageNet-1K dataset, designed to address this issue. By\u0000extending the original COCO validation dataset, COCO-FP specifically assesses\u0000object detectors' performance in mitigating background false positives. Our\u0000evaluation of both standard and advanced object detectors shows a significant\u0000number of false positives in both closed-set and open-set scenarios. For\u0000example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting\u0000from COCO to COCO-FP. The dataset is available at\u0000https://github.com/COCO-FP/COCO-FP.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rongfeng Lu, Hangyu Chen, Zunjie Zhu, Yuhang Qin, Ming Lu, Le Zhang, Chenggang Yan, Anke Xue
Thermography is especially valuable for the military and other users of surveillance cameras. Some recent methods based on Neural Radiance Fields (NeRF) have been proposed to reconstruct thermal scenes in 3D from a set of thermal and RGB images. However, 3D Gaussian Splatting (3DGS) has come to prevail over NeRF thanks to its rapid training and real-time rendering. In this work, we propose ThermalGaussian, the first thermal 3DGS approach capable of rendering high-quality images in both RGB and thermal modalities. We first calibrate the RGB camera and the thermal camera to ensure that the two modalities are accurately aligned. Subsequently, we use the registered images to learn the multimodal 3D Gaussians. To prevent the overfitting of any single modality, we introduce several multimodal regularization constraints. We also develop smoothing constraints tailored to the physical characteristics of the thermal modality. Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a handheld thermal-infrared camera, facilitating future research on thermal scene reconstruction. We conduct comprehensive experiments showing that ThermalGaussian achieves photorealistic rendering of thermal images and improves the rendering quality of RGB images. With the proposed multimodal regularization constraints, we also reduce the model's storage cost by 90%. The code and dataset will be released.
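As an illustrative example of a smoothing constraint for the thermal modality (the paper's exact formulation is not given in the abstract), the snippet below applies a total-variation-style penalty to a rendered thermal image, reflecting the physically smoother appearance of thermal data.

```python
import torch

def tv_smoothness(thermal: torch.Tensor) -> torch.Tensor:
    """Total-variation penalty on a rendered thermal image of shape (B, 1, H, W)."""
    dh = (thermal[..., 1:, :] - thermal[..., :-1, :]).abs().mean()  # vertical gradients
    dw = (thermal[..., :, 1:] - thermal[..., :, :-1]).abs().mean()  # horizontal gradients
    return dh + dw

print(tv_smoothness(torch.rand(1, 1, 128, 128)))
```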
{"title":"ThermalGaussian: Thermal 3D Gaussian Splatting","authors":"Rongfeng Lu, Hangyu Chen, Zunjie Zhu, Yuhang Qin, Ming Lu, Le Zhang, Chenggang Yan, Anke Xue","doi":"arxiv-2409.07200","DOIUrl":"https://doi.org/arxiv-2409.07200","url":null,"abstract":"Thermography is especially valuable for the military and other users of\u0000surveillance cameras. Some recent methods based on Neural Radiance Fields\u0000(NeRF) are proposed to reconstruct the thermal scenes in 3D from a set of\u0000thermal and RGB images. However, unlike NeRF, 3D Gaussian splatting (3DGS)\u0000prevails due to its rapid training and real-time rendering. In this work, we\u0000propose ThermalGaussian, the first thermal 3DGS approach capable of rendering\u0000high-quality images in RGB and thermal modalities. We first calibrate the RGB\u0000camera and the thermal camera to ensure that both modalities are accurately\u0000aligned. Subsequently, we use the registered images to learn the multimodal 3D\u0000Gaussians. To prevent the overfitting of any single modality, we introduce\u0000several multimodal regularization constraints. We also develop smoothing\u0000constraints tailored to the physical characteristics of the thermal modality.\u0000Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a\u0000hand-hold thermal-infrared camera, facilitating future research on thermal\u0000scene reconstruction. We conduct comprehensive experiments to show that\u0000ThermalGaussian achieves photorealistic rendering of thermal images and\u0000improves the rendering quality of RGB images. With the proposed multimodal\u0000regularization constraints, we also reduced the model's storage cost by 90%.\u0000The code and dataset will be released.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}