Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang
Foundation models like ChatGPT and Sora that are trained on data at a huge scale have made a revolutionary social impact. However, it is extremely challenging for sensors in many other fields to collect data at a scale comparable to natural images for training strong foundation models. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. Specifically, SimMAT improves the segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields in obtaining better results with vision foundation models.
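As an illustration only, the sketch below shows one way a modality-agnostic transfer layer could map an input with an arbitrary number of channels (e.g., a 4-channel polarization image) into the patch-token space of a frozen ViT-based foundation model such as SAM. The layer design, embedding size, and patch size are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityAgnosticTransfer(nn.Module):
    """Project an arbitrary-channel modality into ViT-style patch tokens."""

    def __init__(self, in_channels: int, embed_dim: int = 768, patch_size: int = 16):
        super().__init__()
        # A single strided convolution plays the role of a patch embedding for the
        # new modality; a frozen foundation model would consume the resulting tokens.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, H, W) -> (B, num_patches, embed_dim)
        tokens = self.proj(x)                     # (B, D, H/ps, W/ps)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D)

# Example: a 4-channel polarization image mapped to tokens for a frozen backbone.
mat = ModalityAgnosticTransfer(in_channels=4)
print(mat(torch.randn(2, 4, 1024, 1024)).shape)  # torch.Size([2, 4096, 768])
```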
{"title":"SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality","authors":"Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang","doi":"arxiv-2409.08083","DOIUrl":"https://doi.org/arxiv-2409.08083","url":null,"abstract":"Foundation models like ChatGPT and Sora that are trained on a huge scale of\u0000data have made a revolutionary social impact. However, it is extremely\u0000challenging for sensors in many different fields to collect similar scales of\u0000natural images to train strong foundation models. To this end, this work\u0000presents a simple and effective framework SimMAT to study an open problem: the\u0000transferability from vision foundation models trained on natural RGB images to\u0000other image modalities of different physical properties (e.g., polarization).\u0000SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained\u0000foundation model. We apply SimMAT to a representative vision foundation model\u0000Segment Anything Model (SAM) to support any evaluated new image modality. Given\u0000the absence of relevant benchmarks, we construct a new benchmark to evaluate\u0000the transfer learning performance. Our experiments confirm the intriguing\u0000potential of transferring vision foundation models in enhancing other sensors'\u0000performance. Specifically, SimMAT can improve the segmentation performance\u0000(mIoU) from 22.15% to 53.88% on average for evaluated modalities and\u0000consistently outperforms other baselines. We hope that SimMAT can raise\u0000awareness of cross-modal transfer learning and benefit various fields for\u0000better results with vision foundation models.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi
In the context of few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective: they typically rely on a single distance value between the query feature and the support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach that utilizes a multi-output embedding network to map samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism refines the features at each stage, leading to even more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves the results. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. Code: https://github.com/FatemehAskari/MSENet
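A minimal sketch of the stage-weighting idea described above: distances computed from several embedding stages are combined with learnable, softmax-normalized weights. The backbone, the number of stages, and the use of cosine distance are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageWeightedDistance(nn.Module):
    """Combine per-stage query/support distances with learnable weights."""

    def __init__(self, num_stages: int = 4):
        super().__init__()
        # One learnable scalar per stage; softmax keeps the mixture convex.
        self.stage_logits = nn.Parameter(torch.zeros(num_stages))

    def forward(self, query_feats, support_feats):
        # query_feats / support_feats: lists of per-stage features, each (B, D_s)
        weights = F.softmax(self.stage_logits, dim=0)
        dists = [1.0 - F.cosine_similarity(q, s, dim=-1)       # (B,) per stage
                 for q, s in zip(query_feats, support_feats)]
        return (weights * torch.stack(dists, dim=-1)).sum(dim=-1)  # (B,)

# Example with random 4-stage features for a batch of 3 query/support pairs.
module = StageWeightedDistance(num_stages=4)
q = [torch.randn(3, d) for d in (64, 128, 256, 512)]
s = [torch.randn(3, d) for d in (64, 128, 256, 512)]
print(module(q, s).shape)  # torch.Size([3])
```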
{"title":"Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms","authors":"Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi","doi":"arxiv-2409.07989","DOIUrl":"https://doi.org/arxiv-2409.07989","url":null,"abstract":"In the context of few-shot classification, the goal is to train a classifier\u0000using a limited number of samples while maintaining satisfactory performance.\u0000However, traditional metric-based methods exhibit certain limitations in\u0000achieving this objective. These methods typically rely on a single distance\u0000value between the query feature and support feature, thereby overlooking the\u0000contribution of shallow features. To overcome this challenge, we propose a\u0000novel approach in this paper. Our approach involves utilizing multi-output\u0000embedding network that maps samples into distinct feature spaces. The proposed\u0000method extract feature vectors at different stages, enabling the model to\u0000capture both global and abstract features. By utilizing these diverse feature\u0000spaces, our model enhances its performance. Moreover, employing a\u0000self-attention mechanism improves the refinement of features at each stage,\u0000leading to even more robust representations and improved overall performance.\u0000Furthermore, assigning learnable weights to each stage significantly improved\u0000performance and results. We conducted comprehensive evaluations on the\u0000MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way\u00005-shot scenarios. Additionally, we performed a cross-domain task from\u0000MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain.\u0000These evaluations demonstrate the efficacy of our proposed method in comparison\u0000to state-of-the-art approaches. https://github.com/FatemehAskari/MSENet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
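As a hedged illustration of the third technique, the snippet below implements one common reading of SNR-based weighting for diffusion training (a clipped, Min-SNR-style loss weight per sampled timestep); the exact scheme used by TextBoost may differ.

```python
import torch

def min_snr_weights(alphas_cumprod: torch.Tensor, t: torch.Tensor, gamma: float = 5.0):
    """Per-sample loss weights min(SNR(t), gamma) / SNR(t) for epsilon-prediction."""
    a_bar = alphas_cumprod[t]                  # (B,)
    snr = a_bar / (1.0 - a_bar)                # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
    return torch.clamp(snr, max=gamma) / snr   # (B,)

# Example with a linear beta schedule of 1000 steps and a batch of 8 sampled timesteps.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
t = torch.randint(0, 1000, (8,))
print(min_snr_weights(alphas_cumprod, t))
```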
{"title":"TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder","authors":"NaHyeon Park, Kunhee Kim, Hyunjung Shim","doi":"arxiv-2409.08248","DOIUrl":"https://doi.org/arxiv-2409.08248","url":null,"abstract":"Recent breakthroughs in text-to-image models have opened up promising\u0000research avenues in personalized image generation, enabling users to create\u0000diverse images of a specific subject using natural language prompts. However,\u0000existing methods often suffer from performance degradation when given only a\u0000single reference image. They tend to overfit the input, producing highly\u0000similar outputs regardless of the text prompt. This paper addresses the\u0000challenge of one-shot personalization by mitigating overfitting, enabling the\u0000creation of controllable images through text prompts. Specifically, we propose\u0000a selective fine-tuning strategy that focuses on the text encoder. Furthermore,\u0000we introduce three key techniques to enhance personalization performance: (1)\u0000augmentation tokens to encourage feature disentanglement and alleviate\u0000overfitting, (2) a knowledge-preservation loss to reduce language drift and\u0000promote generalizability across diverse prompts, and (3) SNR-weighted sampling\u0000for efficient training. Extensive experiments demonstrate that our approach\u0000efficiently generates high-quality, diverse images using only a single\u0000reference image while significantly reducing memory and storage requirements.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kerem Cekmeceli, Meva Himmetoglu, Guney I. Tombak, Anna Susmelj, Ertunc Erdil, Ender Konukoglu
Neural networks achieve state-of-the-art performance in many supervised learning tasks when the training data distribution matches the test data distribution. However, their performance drops significantly under domain (covariate) shift, a prevalent issue in medical image segmentation due to varying acquisition settings across scanner models and protocols. Recently, foundation models (FMs) trained on large datasets have gained attention for their ability to be adapted to downstream tasks, achieving state-of-the-art performance with excellent generalization on natural images. However, their effectiveness in medical image segmentation remains underexplored. In this paper, we investigate the domain generalization performance of various FMs, including DinoV2, SAM, MedSAM, and MAE, when fine-tuned with parameter-efficient fine-tuning (PEFT) techniques such as Ladder and Rein (+LoRA) and different decoder heads. We introduce a novel decoder head architecture, HQHSAM, which simply integrates elements from two state-of-the-art decoder heads, HSAM and HQSAM, to enhance segmentation performance. Our extensive experiments on multiple datasets, encompassing various anatomies and modalities, reveal that FMs, particularly with the HQHSAM decoder head, improve domain generalization for medical image segmentation. Moreover, we find that the effectiveness of PEFT techniques varies across different FMs. These findings underscore the potential of FMs to enhance the domain generalization performance of neural networks in medical image segmentation across diverse clinical settings, providing a solid foundation for future research. Code and models are available for research purposes at https://github.com/kerem-cekmeceli/Foundation-Models-for-Medical-Imagery.
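For readers unfamiliar with PEFT, the following is a minimal LoRA-style adapter sketch: a frozen pretrained linear layer augmented with a trainable low-rank update. The rank, scaling, and placement of such adapters inside any particular foundation model are assumptions here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank update is zero at initialization, so training starts from the base model.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```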
{"title":"Do Vision Foundation Models Enhance Domain Generalization in Medical Image Segmentation?","authors":"Kerem Cekmeceli, Meva Himmetoglu, Guney I. Tombak, Anna Susmelj, Ertunc Erdil, Ender Konukoglu","doi":"arxiv-2409.07960","DOIUrl":"https://doi.org/arxiv-2409.07960","url":null,"abstract":"Neural networks achieve state-of-the-art performance in many supervised\u0000learning tasks when the training data distribution matches the test data\u0000distribution. However, their performance drops significantly under domain\u0000(covariate) shift, a prevalent issue in medical image segmentation due to\u0000varying acquisition settings across different scanner models and protocols.\u0000Recently, foundational models (FMs) trained on large datasets have gained\u0000attention for their ability to be adapted for downstream tasks and achieve\u0000state-of-the-art performance with excellent generalization capabilities on\u0000natural images. However, their effectiveness in medical image segmentation\u0000remains underexplored. In this paper, we investigate the domain generalization\u0000performance of various FMs, including DinoV2, SAM, MedSAM, and MAE, when\u0000fine-tuned using various parameter-efficient fine-tuning (PEFT) techniques such\u0000as Ladder and Rein (+LoRA) and decoder heads. We introduce a novel decode head\u0000architecture, HQHSAM, which simply integrates elements from two\u0000state-of-the-art decoder heads, HSAM and HQSAM, to enhance segmentation\u0000performance. Our extensive experiments on multiple datasets, encompassing\u0000various anatomies and modalities, reveal that FMs, particularly with the HQHSAM\u0000decode head, improve domain generalization for medical image segmentation.\u0000Moreover, we found that the effectiveness of PEFT techniques varies across\u0000different FMs. These findings underscore the potential of FMs to enhance the\u0000domain generalization performance of neural networks in medical image\u0000segmentation across diverse clinical settings, providing a solid foundation for\u0000future research. Code and models are available for research purposes at\u0000url{https://github.com/kerem-cekmeceli/Foundation-Models-for-Medical-Imagery}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present DreamHOI, a novel method for zero-shot synthesis of human-object interactions (HOIs), enabling a 3D human model to realistically interact with any given object based on a textual description. This task is complicated by the varying categories and geometries of real-world objects and the scarcity of datasets encompassing diverse HOIs. To circumvent the need for extensive data, we leverage text-to-image diffusion models trained on billions of image-caption pairs. We optimize the articulation of a skinned human mesh using Score Distillation Sampling (SDS) gradients obtained from these models, which predict image-space edits. However, directly backpropagating image-space gradients into complex articulation parameters is ineffective due to the local nature of such gradients. To overcome this, we introduce a dual implicit-explicit representation of a skinned mesh, combining (implicit) neural radiance fields (NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization, we transition between implicit and explicit forms, grounding the NeRF generation while refining the mesh articulation. We validate our approach through extensive experiments, demonstrating its effectiveness in generating realistic HOIs.
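The following sketch illustrates the Score Distillation Sampling step referenced above in its generic form: noise a rendered image, query a text-conditioned noise predictor, and push the weighted residual back through the renderer. The noise predictor below is a dummy placeholder, and the timestep weighting is one common choice, not necessarily DreamHOI's.

```python
import torch

def sds_grad(rendered, alphas_cumprod, noise_pred_fn, t):
    """Return the SDS gradient w(t) * (eps_pred - eps) for a rendered image."""
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(rendered)
    noisy = a_bar.sqrt() * rendered + (1.0 - a_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = noise_pred_fn(noisy, t)   # a text-conditioned diffusion model in practice
    w = 1.0 - a_bar                          # one common per-timestep weight
    return w * (eps_pred - noise)

# Dummy usage: treat the "rendering" as a learnable image and inject the gradient.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
img = torch.randn(1, 3, 64, 64, requires_grad=True)
grad = sds_grad(img, alphas_cumprod, lambda x, t: torch.zeros_like(x), torch.tensor(500))
img.backward(gradient=grad)                  # accumulates the SDS gradient into img.grad
print(img.grad.abs().mean())
```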
{"title":"DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors","authors":"Thomas Hanwen Zhu, Ruining Li, Tomas Jakab","doi":"arxiv-2409.08278","DOIUrl":"https://doi.org/arxiv-2409.08278","url":null,"abstract":"We present DreamHOI, a novel method for zero-shot synthesis of human-object\u0000interactions (HOIs), enabling a 3D human model to realistically interact with\u0000any given object based on a textual description. This task is complicated by\u0000the varying categories and geometries of real-world objects and the scarcity of\u0000datasets encompassing diverse HOIs. To circumvent the need for extensive data,\u0000we leverage text-to-image diffusion models trained on billions of image-caption\u0000pairs. We optimize the articulation of a skinned human mesh using Score\u0000Distillation Sampling (SDS) gradients obtained from these models, which predict\u0000image-space edits. However, directly backpropagating image-space gradients into\u0000complex articulation parameters is ineffective due to the local nature of such\u0000gradients. To overcome this, we introduce a dual implicit-explicit\u0000representation of a skinned mesh, combining (implicit) neural radiance fields\u0000(NeRFs) with (explicit) skeleton-driven mesh articulation. During optimization,\u0000we transition between implicit and explicit forms, grounding the NeRF\u0000generation while refining the mesh articulation. We validate our approach\u0000through extensive experiments, demonstrating its effectiveness in generating\u0000realistic HOIs.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hassan Rasheed, Reuben Dorent, Maximilian Fehrentz, Tina Kapur, William M. Wells III, Alexandra Golby, Sarah Frisken, Julia A. Schnabel, Nazim Haouchine
In this paper, we propose a texture-invariant 2D keypoint descriptor specifically designed for matching preoperative Magnetic Resonance (MR) images with intraoperative Ultrasound (US) images. We introduce a matching-by-synthesis strategy, where intraoperative US images are synthesized from MR images while accounting for multiple MR modalities and intraoperative US variability. We build our training set by enforcing keypoint localization over all images, and then train a patient-specific descriptor network that learns texture-invariant discriminant features in a supervised contrastive manner, leading to robust keypoint descriptors. Our experiments on real cases with ground truth show the effectiveness of the proposed approach, which outperforms state-of-the-art methods and achieves 80.35% matching precision on average.
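As an illustrative sketch of supervised contrastive descriptor learning, the snippet below treats matched MR/US keypoints as positives and all other keypoints in the batch as negatives using an InfoNCE-style loss; the temperature and exact loss form are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_descriptor_loss(desc_mr, desc_us, temperature: float = 0.07):
    # desc_mr, desc_us: (N, D) descriptors; row i of each tensor is the same keypoint.
    desc_mr = F.normalize(desc_mr, dim=-1)
    desc_us = F.normalize(desc_us, dim=-1)
    logits = desc_mr @ desc_us.T / temperature   # (N, N) cross-modal similarity matrix
    targets = torch.arange(desc_mr.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_descriptor_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```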
{"title":"Learning to Match 2D Keypoints Across Preoperative MR and Intraoperative Ultrasound","authors":"Hassan Rasheed, Reuben Dorent, Maximilian Fehrentz, Tina Kapur, William M. Wells III, Alexandra Golby, Sarah Frisken, Julia A. Schnabel, Nazim Haouchine","doi":"arxiv-2409.08169","DOIUrl":"https://doi.org/arxiv-2409.08169","url":null,"abstract":"We propose in this paper a texture-invariant 2D keypoints descriptor\u0000specifically designed for matching preoperative Magnetic Resonance (MR) images\u0000with intraoperative Ultrasound (US) images. We introduce a\u0000matching-by-synthesis strategy, where intraoperative US images are synthesized\u0000from MR images accounting for multiple MR modalities and intraoperative US\u0000variability. We build our training set by enforcing keypoints localization over\u0000all images then train a patient-specific descriptor network that learns\u0000texture-invariant discriminant features in a supervised contrastive manner,\u0000leading to robust keypoints descriptors. Our experiments on real cases with\u0000ground truth show the effectiveness of the proposed approach, outperforming the\u0000state-of-the-art methods and achieving 80.35% matching precision on average.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present Sparse R-CNN OBB, a novel framework for the detection of oriented objects in SAR images that leverages sparse learnable proposals. Sparse R-CNN OBB has a streamlined architecture and is easy to train, as it utilizes a sparse set of 300 proposals instead of training a proposal generator on hundreds of thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the first to adopt the concept of sparse learnable proposals for the detection of oriented objects, as well as for the detection of ships in Synthetic Aperture Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is redesigned to enable the model to capture object orientation. We also fine-tune the model on the RSDD-SAR dataset and provide a performance comparison to state-of-the-art models. Experimental results show that Sparse R-CNN OBB achieves outstanding performance, surpassing other models in both inshore and offshore scenarios. The code is available at: www.github.com/ka-mirul/Sparse-R-CNN-OBB.
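A minimal sketch of the sparse learnable proposals idea for oriented boxes: a small fixed set of proposals is stored as learnable parameters, here parameterized as (cx, cy, w, h, angle). The parameterization and sizes are assumptions, and the dynamic detection head that refines these proposals is omitted.

```python
import torch
import torch.nn as nn

class LearnableOrientedProposals(nn.Module):
    """A fixed set of learnable oriented proposals, replacing dense anchors."""

    def __init__(self, num_proposals: int = 300, hidden_dim: int = 256):
        super().__init__()
        # Five values per proposal: normalized (cx, cy, w, h) plus a rotation angle,
        # all trained end to end (initialized uniformly in [0, 1) for illustration).
        self.proposal_boxes = nn.Parameter(torch.rand(num_proposals, 5))
        # One feature vector per proposal, consumed by the (omitted) detection head.
        self.proposal_feats = nn.Parameter(torch.randn(num_proposals, hidden_dim))

    def forward(self, batch_size: int):
        boxes = self.proposal_boxes.unsqueeze(0).expand(batch_size, -1, -1)
        feats = self.proposal_feats.unsqueeze(0).expand(batch_size, -1, -1)
        return boxes, feats

boxes, feats = LearnableOrientedProposals()(batch_size=2)
print(boxes.shape, feats.shape)  # torch.Size([2, 300, 5]) torch.Size([2, 300, 256])
```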
{"title":"Sparse R-CNN OBB: Ship Target Detection in SAR Images Based on Oriented Sparse Proposals","authors":"Kamirul Kamirul, Odysseas Pappas, Alin Achim","doi":"arxiv-2409.07973","DOIUrl":"https://doi.org/arxiv-2409.07973","url":null,"abstract":"We present Sparse R-CNN OBB, a novel framework for the detection of oriented\u0000objects in SAR images leveraging sparse learnable proposals. The Sparse R-CNN\u0000OBB has streamlined architecture and ease of training as it utilizes a sparse\u0000set of 300 proposals instead of training a proposals generator on hundreds of\u0000thousands of anchors. To the best of our knowledge, Sparse R-CNN OBB is the\u0000first to adopt the concept of sparse learnable proposals for the detection of\u0000oriented objects, as well as for the detection of ships in Synthetic Aperture\u0000Radar (SAR) images. The detection head of the baseline model, Sparse R-CNN, is\u0000re-designed to enable the model to capture object orientation. We also\u0000fine-tune the model on RSDD-SAR dataset and provide a performance comparison to\u0000state-of-the-art models. Experimental results shows that Sparse R-CNN OBB\u0000achieves outstanding performance, surpassing other models on both inshore and\u0000offshore scenarios. The code is available at:\u0000www.github.com/ka-mirul/Sparse-R-CNN-OBB.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dense-localization Audio-Visual Events (DAVE) aims to identify the time boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. These methods thus inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and to inspire the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. (i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local correlation properties, without any extra annotations; this encourages the unimodal encoders to highlight the semantics shared by audio and visual features. (ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid, which captures the local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
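As a loose, simplified illustration of a locality-aware correspondence signal (not the paper's LCC/CDP formulation), the snippet below encourages audio and visual features from nearby time steps of the same video to agree more than features from distant time steps; the window size and hinge form are assumptions.

```python
import torch
import torch.nn.functional as F

def local_correspondence_loss(audio, visual, window: int = 2, margin: float = 0.2):
    # audio, visual: (T, D) per-frame features from the same untrimmed video.
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual, dim=-1)
    sim = a @ v.T                                    # (T, T) cross-modal similarity
    idx = torch.arange(sim.size(0))
    local = (idx[:, None] - idx[None, :]).abs() <= window
    pos = sim[local].mean()                          # agreement inside the local window
    neg = sim[~local].mean()                         # agreement with distant time steps
    return F.relu(margin + neg - pos)                # hinge: local should beat distant

print(local_correspondence_loss(torch.randn(50, 256), torch.randn(50, 256)))
```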
{"title":"Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization","authors":"Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang","doi":"arxiv-2409.07967","DOIUrl":"https://doi.org/arxiv-2409.07967","url":null,"abstract":"Dense-localization Audio-Visual Events (DAVE) aims to identify time\u0000boundaries and corresponding categories for events that can be heard and seen\u0000concurrently in an untrimmed video. Existing methods typically encode audio and\u0000visual representation separately without any explicit cross-modal alignment\u0000constraint. Then they adopt dense cross-modal attention to integrate multimodal\u0000information for DAVE. Thus these methods inevitably aggregate irrelevant noise\u0000and events, especially in complex and long videos, leading to imprecise\u0000detection. In this paper, we present LOCO, a Locality-aware cross-modal\u0000Correspondence learning framework for DAVE. The core idea is to explore local\u0000temporal continuity nature of audio-visual events, which serves as informative\u0000yet free supervision signals to guide the filtering of irrelevant information\u0000and inspire the extraction of complementary multimodal information during both\u0000unimodal and cross-modal learning stages. i) Specifically, LOCO applies\u0000Locality-aware Correspondence Correction (LCC) to uni-modal features via\u0000leveraging cross-modal local-correlated properties without any extra\u0000annotations. This enforces uni-modal encoders to highlight similar semantics\u0000shared by audio and visual features. ii) To better aggregate such audio and\u0000visual features, we further customize Cross-modal Dynamic Perception layer\u0000(CDP) in cross-modal feature pyramid to understand local temporal patterns of\u0000audio-visual events by imposing local consistency within multimodal features in\u0000a data-driven manner. By incorporating LCC and CDP, LOCO provides solid\u0000performance gains and outperforms existing methods for DAVE. The source code\u0000will be released.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Longfei Liu, Wen Guo, Shihua Huang, Cheng Li, Xi Shen
Reducing false positives is essential for enhancing object detector performance, as reflected in the mean Average Precision (mAP) metric. Although object detectors have achieved notable improvements and high mAP scores on the COCO dataset, analysis reveals limited progress in addressing false positives caused by non-target visual clutter, i.e., background objects not included in the annotated categories. This issue is particularly critical in real-world applications, such as fire and smoke detection, where minimizing false alarms is crucial. In this study, we introduce COCO-FP, a new evaluation dataset derived from the ImageNet-1K dataset and designed to address this issue. By extending the original COCO validation dataset, COCO-FP specifically assesses object detectors' performance in mitigating background false positives. Our evaluation of both standard and advanced object detectors shows a significant number of false positives in both closed-set and open-set scenarios. For example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting from COCO to COCO-FP. The dataset is available at https://github.com/COCO-FP/COCO-FP.
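For reference, detections on a COCO-format dataset such as COCO-FP can be scored with the standard COCO protocol via pycocotools, as sketched below; the file names are placeholders, and the COCO-FP repository should be consulted for the actual annotation files and instructions.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/coco_fp_val.json")              # hypothetical ground-truth file
coco_dt = coco_gt.loadRes("detections/detector_results.json")  # detector outputs, COCO format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP / AP50, etc.; extra background-only images can only add FPs
```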
{"title":"From COCO to COCO-FP: A Deep Dive into Background False Positives for COCO Detectors","authors":"Longfei Liu, Wen Guo, Shihua Huang, Cheng Li, Xi Shen","doi":"arxiv-2409.07907","DOIUrl":"https://doi.org/arxiv-2409.07907","url":null,"abstract":"Reducing false positives is essential for enhancing object detector\u0000performance, as reflected in the mean Average Precision (mAP) metric. Although\u0000object detectors have achieved notable improvements and high mAP scores on the\u0000COCO dataset, analysis reveals limited progress in addressing false positives\u0000caused by non-target visual clutter-background objects not included in the\u0000annotated categories. This issue is particularly critical in real-world\u0000applications, such as fire and smoke detection, where minimizing false alarms\u0000is crucial. In this study, we introduce COCO-FP, a new evaluation dataset\u0000derived from the ImageNet-1K dataset, designed to address this issue. By\u0000extending the original COCO validation dataset, COCO-FP specifically assesses\u0000object detectors' performance in mitigating background false positives. Our\u0000evaluation of both standard and advanced object detectors shows a significant\u0000number of false positives in both closed-set and open-set scenarios. For\u0000example, the AP50 metric for YOLOv9-E decreases from 72.8 to 65.7 when shifting\u0000from COCO to COCO-FP. The dataset is available at\u0000https://github.com/COCO-FP/COCO-FP.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rongfeng Lu, Hangyu Chen, Zunjie Zhu, Yuhang Qin, Ming Lu, Le Zhang, Chenggang Yan, Anke Xue
Thermography is especially valuable for the military and other users of surveillance cameras. Some recent methods based on Neural Radiance Fields (NeRF) have been proposed to reconstruct thermal scenes in 3D from a set of thermal and RGB images. However, 3D Gaussian Splatting (3DGS) has come to prevail over NeRF thanks to its rapid training and real-time rendering. In this work, we propose ThermalGaussian, the first thermal 3DGS approach capable of rendering high-quality images in both RGB and thermal modalities. We first calibrate the RGB camera and the thermal camera to ensure that the two modalities are accurately aligned. Subsequently, we use the registered images to learn the multimodal 3D Gaussians. To prevent the overfitting of any single modality, we introduce several multimodal regularization constraints. We also develop smoothing constraints tailored to the physical characteristics of the thermal modality. Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a handheld thermal-infrared camera, facilitating future research on thermal scene reconstruction. We conduct comprehensive experiments showing that ThermalGaussian achieves photorealistic rendering of thermal images and improves the rendering quality of RGB images. With the proposed multimodal regularization constraints, we also reduce the model's storage cost by 90%. The code and dataset will be released.
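As an illustrative example of a smoothing constraint for the thermal modality (the paper's exact formulation is not given in the abstract), the snippet below applies a total-variation-style penalty to a rendered thermal image, reflecting the physically smoother appearance of thermal data.

```python
import torch

def tv_smoothness(thermal: torch.Tensor) -> torch.Tensor:
    """Total-variation penalty on a rendered thermal image of shape (B, 1, H, W)."""
    dh = (thermal[..., 1:, :] - thermal[..., :-1, :]).abs().mean()  # vertical gradients
    dw = (thermal[..., :, 1:] - thermal[..., :, :-1]).abs().mean()  # horizontal gradients
    return dh + dw

print(tv_smoothness(torch.rand(1, 1, 128, 128)))
```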
{"title":"ThermalGaussian: Thermal 3D Gaussian Splatting","authors":"Rongfeng Lu, Hangyu Chen, Zunjie Zhu, Yuhang Qin, Ming Lu, Le Zhang, Chenggang Yan, Anke Xue","doi":"arxiv-2409.07200","DOIUrl":"https://doi.org/arxiv-2409.07200","url":null,"abstract":"Thermography is especially valuable for the military and other users of\u0000surveillance cameras. Some recent methods based on Neural Radiance Fields\u0000(NeRF) are proposed to reconstruct the thermal scenes in 3D from a set of\u0000thermal and RGB images. However, unlike NeRF, 3D Gaussian splatting (3DGS)\u0000prevails due to its rapid training and real-time rendering. In this work, we\u0000propose ThermalGaussian, the first thermal 3DGS approach capable of rendering\u0000high-quality images in RGB and thermal modalities. We first calibrate the RGB\u0000camera and the thermal camera to ensure that both modalities are accurately\u0000aligned. Subsequently, we use the registered images to learn the multimodal 3D\u0000Gaussians. To prevent the overfitting of any single modality, we introduce\u0000several multimodal regularization constraints. We also develop smoothing\u0000constraints tailored to the physical characteristics of the thermal modality.\u0000Besides, we contribute a real-world dataset named RGBT-Scenes, captured by a\u0000hand-hold thermal-infrared camera, facilitating future research on thermal\u0000scene reconstruction. We conduct comprehensive experiments to show that\u0000ThermalGaussian achieves photorealistic rendering of thermal images and\u0000improves the rendering quality of RGB images. With the proposed multimodal\u0000regularization constraints, we also reduced the model's storage cost by 90%.\u0000The code and dataset will be released.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}