
arXiv - CS - Computer Vision and Pattern Recognition: Latest Publications

Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms
Pub Date: 2024-09-12 | DOI: arxiv-2409.07989
Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi
In the context of few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective. These methods typically rely on a single distance value between the query feature and support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach in this paper. Our approach utilizes a multi-output embedding network that maps samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism improves the refinement of features at each stage, leading to even more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves performance. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. Code: https://github.com/FatemehAskari/MSENet
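The key ingredients of the abstract are per-stage query/support distances combined through learnable stage weights. A minimal NumPy sketch of that combination step (the function names and the softmax weighting are illustrative assumptions, not the authors' code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_multistage_distance(query_feats, support_feats, stage_logits):
    """Combine per-stage distances between query and support features
    using learnable stage weights (here: a softmax over raw logits).

    query_feats / support_feats: lists of per-stage feature vectors.
    stage_logits: one learnable scalar per stage.
    """
    weights = softmax(np.asarray(stage_logits, dtype=float))
    dists = np.array([np.linalg.norm(q - s)
                      for q, s in zip(query_feats, support_feats)])
    return float(np.dot(weights, dists))

# Toy example: three stages, with the last stage weighted most heavily.
q = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
s = [np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([0.0, 1.0])]
d = weighted_multistage_distance(q, s, stage_logits=[0.0, 0.0, 2.0])
```

In a real implementation the stage logits would be trained end-to-end alongside the embedding network; the softmax merely keeps the combination a convex one.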
Citations: 0
TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder
Pub Date: 2024-09-12 | DOI: arxiv-2409.08248
NaHyeon Park, Kunhee Kim, Hyunjung Shim
Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
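Of the three techniques, SNR-weighted sampling is the easiest to illustrate: weight each diffusion timestep by its signal-to-noise ratio, SNR(t) = ᾱ_t / (1 − ᾱ_t), and draw training timesteps from the normalized weights. The schedule, the clipping value, and the exact weighting function below are illustrative assumptions, not the paper's recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-beta diffusion schedule (illustrative values only).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)
snr = alphas_cumprod / (1.0 - alphas_cumprod)

# Sample timesteps with probability proportional to a clipped SNR weight,
# biasing training toward more informative (higher-SNR, earlier) steps.
weights = np.clip(snr, 0.0, 5.0)
probs = weights / weights.sum()
timesteps = rng.choice(T, size=10000, p=probs)
```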
Citations: 0
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
Pub Date: 2024-09-12 | DOI: arxiv-2409.08077
Junsung Lee, Minsoo Kang, Bohyung Han
We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions: one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction network is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
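Written out, the scheme amounts to ε̃ = ε(x_t, e_src) + λ · [ε(x_t, interp(e_src, e_tgt, α)) − ε(x_t, e_src)]. A toy sketch with a stand-in linear "network" (the mixing weight λ and all function names are assumptions made for illustration):

```python
import numpy as np

def interpolate(src_emb, tgt_emb, alpha):
    """Progressive interpolation between source and target prompt embeddings."""
    return (1.0 - alpha) * src_emb + alpha * tgt_emb

def corrected_noise(eps_fn, x_t, src_emb, tgt_emb, alpha, lam):
    """Noise prediction with a correction term, in the spirit of the paper:
    the correction is the difference between the prediction under the
    interpolated prompt and the prediction under the source prompt.
    `eps_fn` stands in for a pretrained denoising network (hypothetical)."""
    eps_src = eps_fn(x_t, src_emb)                               # standard denoising term
    eps_interp = eps_fn(x_t, interpolate(src_emb, tgt_emb, alpha))
    correction = eps_interp - eps_src                            # noise correction term
    return eps_src + lam * correction

# Toy linear "network" so the behavior is easy to check by hand.
eps_fn = lambda x, e: x + e
x_t = np.array([0.5, -0.5])
src, tgt = np.array([1.0, 0.0]), np.array([0.0, 1.0])
out = corrected_noise(eps_fn, x_t, src, tgt, alpha=0.5, lam=1.0)
```

With λ = 0 the method degrades to plain source-prompt denoising (preserving content); larger λ pushes the prediction toward the target prompt.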
Citations: 0
SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
Pub Date: 2024-09-12 | DOI: arxiv-2409.08083
Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang
Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. Specifically, SimMAT improves segmentation performance (mIoU) from 22.15% to 53.88% on average over the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields in achieving better results with vision foundation models.
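One plausible reading of a modality-agnostic transfer layer is a learnable projection from an arbitrary channel count to the three input channels a frozen RGB backbone expects. A hypothetical minimal sketch (the 1x1-projection design and the random weights are assumptions for illustration, not SimMAT's actual layer):

```python
import numpy as np

class ModalityAgnosticTransfer:
    """A minimal sketch of a modality-agnostic transfer (MAT) layer:
    a learnable 1x1 projection from an arbitrary number of input channels
    (e.g. 1 for depth, 4 for polarization) to the 3 RGB-like channels a
    pretrained vision foundation model expects. The weights here are
    random stand-ins for learned parameters."""

    def __init__(self, in_channels, out_channels=3, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((out_channels, in_channels)) * 0.1

    def __call__(self, x):
        # x: (in_channels, H, W) -> (out_channels, H, W)
        c, h, w = x.shape
        return (self.w @ x.reshape(c, -1)).reshape(-1, h, w)

mat = ModalityAgnosticTransfer(in_channels=4)   # e.g. a 4-channel polarization image
pol_image = np.ones((4, 8, 8))
rgb_like = mat(pol_image)                       # ready for the frozen RGB backbone
```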
Citations: 0
DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors
Pub Date: 2024-09-12 | DOI: arxiv-2409.08278
Thomas Hanwen Zhu, Ruining Li, Tomas Jakab
We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
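Score Distillation Sampling, in its common form, passes w(t) · (ε̂(x_t, t) − ε) to the generator's parameters, skipping the diffusion model's Jacobian. A toy sketch with a stand-in score network (everything except the SDS gradient form is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def sds_grad(x, eps_hat_fn, t_weight):
    """Score Distillation Sampling (SDS) gradient in its common form:
    w(t) * (eps_hat(x_t, t) - eps), with the U-Net Jacobian omitted.
    `eps_hat_fn` stands in for a pretrained diffusion model (hypothetical)."""
    eps = rng.standard_normal(x.shape)    # freshly sampled noise
    x_t = x + eps                         # toy "forward diffusion" (no schedule)
    return t_weight * (eps_hat_fn(x_t) - eps)

# Toy score network eps_hat(x_t) = x_t, whose implied data mode is x = 0.
# Gradient descent with SDS then pulls the parameters toward that mode.
x = np.array([2.0, -2.0])
for _ in range(200):
    x = x - 0.05 * sds_grad(x, eps_hat_fn=lambda z: z, t_weight=1.0)
```

In DreamHOI the quantity being optimized is not a raw vector but the articulation parameters of the skinned mesh (via its rendered images), which is exactly where the dual implicit-explicit representation comes in.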
Citations: 0
Learning to Match 2D Keypoints Across Preoperative MR and Intraoperative Ultrasound
Pub Date: 2024-09-12 | DOI: arxiv-2409.08169
Hassan Rasheed, Reuben Dorent, Maximilian Fehrentz, Tina Kapur, William M. Wells III, Alexandra Golby, Sarah Frisken, Julia A. Schnabel, Nazim Haouchine
We propose in this paper a texture-invariant 2D keypoint descriptor specifically designed for matching preoperative Magnetic Resonance (MR) images with intraoperative Ultrasound (US) images. We introduce a matching-by-synthesis strategy, where intraoperative US images are synthesized from MR images, accounting for multiple MR modalities and intraoperative US variability. We build our training set by enforcing keypoint localization over all images, then train a patient-specific descriptor network that learns texture-invariant discriminative features in a supervised contrastive manner, leading to robust keypoint descriptors. Our experiments on real cases with ground truth show the effectiveness of the proposed approach, which outperforms state-of-the-art methods and achieves 80.35% matching precision on average.
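The supervised contrastive training signal can be sketched as an InfoNCE-style loss over matched MR/US descriptor pairs (a generic formulation assumed for illustration, not necessarily the paper's exact loss):

```python
import numpy as np

def contrastive_loss(desc_mr, desc_us, temperature=0.1):
    """InfoNCE-style contrastive objective over keypoint descriptors:
    row i of desc_mr and row i of desc_us describe the same keypoint
    (a positive pair); every other pairing is a negative. A minimal
    sketch of the training signal, not the paper's exact loss."""
    a = desc_mr / np.linalg.norm(desc_mr, axis=1, keepdims=True)
    b = desc_us / np.linalg.norm(desc_us, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # pull matched pairs together

# Matched descriptors should give a much lower loss than shuffled ones.
d = np.eye(4) + 0.01
low = contrastive_loss(d, d)
high = contrastive_loss(d, d[::-1])
```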
Citations: 0
Sparse R-CNN OBB: Ship Target Detection in SAR Images Based on Oriented Sparse Proposals
Pub Date: 2024-09-12 | DOI: arxiv-2409.07973
Kamirul Kamirul, Odysseas Pappas, Alin Achim
We present Sparse R-CNN OBB, a novel framework for the detection of oriented objects in SAR images leveraging sparse learnable proposals. Sparse R-CNN OBB has a streamlined architecture and is easy to train, as it utilizes a sparse set of 300 proposals instead of training a proposal generator on hundreds of thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the first to adopt the concept of sparse learnable proposals for the detection of oriented objects, as well as for the detection of ships in Synthetic Aperture Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is re-designed to enable the model to capture object orientation. We also fine-tune the model on the RSDD-SAR dataset and provide a performance comparison to state-of-the-art models. Experimental results show that Sparse R-CNN OBB achieves outstanding performance, surpassing other models in both inshore and offshore scenarios. The code is available at: www.github.com/ka-mirul/Sparse-R-CNN-OBB.
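The sparse-proposal idea replaces dense anchors with a small fixed set of learnable oriented boxes, each plausibly parameterized as (cx, cy, w, h, theta). A sketch of that parameterization and its corner-point decoding (the initial values are stand-ins for learned parameters, and the exact parameterization is an assumption):

```python
import numpy as np

def obb_corners(proposal):
    """Decode an oriented bounding-box proposal (cx, cy, w, h, theta)
    into its four corner points. Sparse R-CNN-style detectors keep a
    small fixed set of such learnable proposals (300 in the paper)
    instead of scoring hundreds of thousands of dense anchors."""
    cx, cy, w, h, theta = proposal
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    half = np.array([[w, h], [w, -h], [-w, -h], [-w, h]]) / 2.0
    return half @ rot.T + np.array([cx, cy])

# A sparse set of learnable proposals, initialized (as a stand-in for
# learned values) around the image center in normalized coordinates.
proposals = np.tile([0.5, 0.5, 0.2, 0.1, 0.0], (300, 1))
corners = obb_corners(proposals[0])
```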
Citations: 0
Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization
Pub Date: 2024-09-12 | DOI: arxiv-2409.07967
Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang
Dense-localization of Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately without any explicit cross-modal alignment constraint, then adopt dense cross-modal attention to integrate multimodal information for DAVE. These methods thus inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and to inspire the extraction of complementary multimodal information during both unimodal and cross-modal learning stages. (i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to uni-modal features by leveraging cross-modal local-correlated properties without any extra annotations. This encourages uni-modal encoders to highlight semantics shared by audio and visual features. (ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid to understand local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
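The locality premise (audio-visual events are temporally continuous, so an isolated one-frame response is likely noise) can be illustrated with a simple neighborhood-consistency smoothing. This is a deliberately simplified stand-in, not the actual LCC or CDP module:

```python
import numpy as np

def local_consistency(sim, window=1):
    """Locality-aware smoothing of a per-frame audio-visual similarity
    score: each frame's score is averaged with its temporal neighbors,
    reflecting the premise that true audio-visual events are locally
    continuous in time. A simplified illustration only."""
    t = len(sim)
    out = np.empty(t)
    for i in range(t):
        lo, hi = max(0, i - window), min(t, i + window + 1)
        out[i] = sim[lo:hi].mean()
    return out

# An isolated one-frame spike (likely noise) is damped, while a
# sustained event keeps most of its score.
scores = np.array([0.0, 1.0, 0.0, 0.9, 1.0, 0.9])
smoothed = local_consistency(scores)
```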
Citations: 0
From COCO to COCO-FP: A Deep Dive into Background False Positives for COCO Detectors
Pub Date : 2024-09-12 DOI: arxiv-2409.07907
Longfei Liu, Wen Guo, Shihua Huang, Cheng Li, Xi Shen
Reducing false positives is essential for enhancing object detector performance, as reflected in the mean Average Precision (mAP) metric. Although object detectors have achieved notable improvements and high mAP scores on the COCO dataset, analysis reveals limited progress in addressing false positives caused by non-target visual clutter (background objects not included in the annotated categories). This issue is particularly critical in real-world applications, such as fire and smoke detection, where minimizing false alarms is crucial. In this study, we introduce COCO-FP, a new evaluation dataset derived from the ImageNet-1K dataset, designed to address this issue. By extending the original COCO validation dataset, COCO-FP specifically assesses object detectors' performance in mitigating background false positives. Our evaluation of both standard and advanced object detectors shows a significant number of false positives in both closed-set and open-set scenarios. For example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting from COCO to COCO-FP. The dataset is available at https://github.com/COCO-FP/COCO-FP.
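A background false positive, in the sense the COCO-FP abstract uses, is a confident detection that matches no annotated object. As a rough illustration of how such detections could be counted on background-only images (where every confident detection is spurious), here is a minimal sketch; the function names, thresholds, and box format are assumptions, not part of the COCO-FP toolkit.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def background_false_positives(dets, gts, score_thr=0.5, iou_thr=0.5):
    """Count detections that pass the score threshold but overlap no
    ground-truth box at the given IoU -- i.e. background false positives.

    dets: list of (box, score) pairs; gts: list of ground-truth boxes.
    """
    fp = 0
    for box, score in dets:
        if score < score_thr:
            continue  # low-confidence detections are ignored
        if not any(iou(box, g) >= iou_thr for g in gts):
            fp += 1
    return fp

# On an image with no annotated objects, every confident detection
# counts as a background false positive.
dets = [((10, 10, 50, 50), 0.9), ((60, 60, 80, 80), 0.3)]
print(background_false_positives(dets, gts=[]))  # 1
```

Adding such background-only images to the evaluation pool is what drags AP down for detectors that hallucinate objects in clutter, which is consistent with the reported YOLOv9-E drop from 72.8 to 65.7 AP50.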
Citations: 0
ThermalGaussian: Thermal 3D Gaussian Splatting
Pub Date : 2024-09-11 DOI: arxiv-2409.07200
Rongfeng Lu, Hangyu Chen, Zunjie Zhu, Yuhang Qin, Ming Lu, Le Zhang, Chenggang Yan, Anke Xue
Thermography is especially valuable for the military and other users of surveillance cameras. Some recent methods based on Neural Radiance Fields (NeRF) have been proposed to reconstruct thermal scenes in 3D from a set of thermal and RGB images. However, unlike NeRF, 3D Gaussian splatting (3DGS) prevails due to its rapid training and real-time rendering. In this work, we propose ThermalGaussian, the first thermal 3DGS approach capable of rendering high-quality images in both RGB and thermal modalities. We first calibrate the RGB camera and the thermal camera to ensure that both modalities are accurately aligned. Subsequently, we use the registered images to learn the multimodal 3D Gaussians. To prevent the overfitting of any single modality, we introduce several multimodal regularization constraints. We also develop smoothing constraints tailored to the physical characteristics of the thermal modality. Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a hand-held thermal-infrared camera, facilitating future research on thermal scene reconstruction. We conduct comprehensive experiments to show that ThermalGaussian achieves photorealistic rendering of thermal images and improves the rendering quality of RGB images. With the proposed multimodal regularization constraints, we also reduce the model's storage cost by 90%. The code and dataset will be released.
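The ThermalGaussian abstract mentions smoothing constraints tailored to the thermal modality; thermal images vary more smoothly than RGB, so a total-variation-style penalty is one plausible form such a constraint could take. The sketch below is an assumed illustration of that idea in numpy, not the paper's actual regularizer.

```python
import numpy as np

def tv_smoothness(img: np.ndarray) -> float:
    """Total-variation style smoothness penalty: the sum of absolute
    differences between horizontally and vertically adjacent pixels.
    Penalizing this term during training would push a rendered thermal
    map toward the smooth intensity profiles real thermograms exhibit.

    img: (H, W) single-channel image.
    """
    dh = np.abs(img[:, 1:] - img[:, :-1]).sum()  # horizontal neighbours
    dv = np.abs(img[1:, :] - img[:-1, :]).sum()  # vertical neighbours
    return float(dh + dv)

# A uniform thermal map has zero variation; a sharp edge is penalised.
flat = np.zeros((4, 4))
edge = np.zeros((4, 4))
edge[:, 2:] = 1.0
print(tv_smoothness(flat), tv_smoothness(edge))  # 0.0 4.0
```

In a multimodal setup like the one described, a term of this form would typically be weighted and added only to the thermal rendering loss, leaving the sharper RGB branch unconstrained.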
Citations: 0
Journal
arXiv - CS - Computer Vision and Pattern Recognition