
arXiv - CS - Computer Vision and Pattern Recognition: Latest Publications

LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models
Pub Date : 2024-09-18 DOI: arxiv-2409.11919
Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord
Vision Language Models (VLMs) have shown impressive performances on numerous tasks but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with limitations as it requires `white-box' access to the model's architecture and weights as well as expertise to design the fine-tuning objectives and optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a `black-box' manner by leveraging large language models (LLMs) so as to reason on their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, resulting in competitive results when compared with classic fine-tuning.
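The abstract describes the black-box recipe only at a high level. Below is a minimal Python sketch of that idea under assumed interfaces: an open-vocabulary detector whose candidate boxes are already available, and a hypothetical `query_llm` callable standing in for any chat-completion API. It is an illustration, not the authors' implementation, which adapts the LLM to reason over VLM outputs for REC.

```python
# Minimal sketch of black-box VLM adaptation via LLM reasoning (hypothetical
# interfaces; not the authors' implementation). The open-vocabulary detector and
# the LLM client are assumed to exist and are injected as callables.

def build_rec_prompt(expression, candidates):
    """Format detector outputs so an LLM can pick the referred box.

    candidates: list of dicts like
        {"id": 0, "label": "dog", "box": [x1, y1, x2, y2], "score": 0.8}
    """
    lines = [f"Referring expression: {expression!r}",
             "Candidate detections (id, label, box [x1, y1, x2, y2], score):"]
    for c in candidates:
        lines.append(f"  {c['id']}: {c['label']}, {c['box']}, {c['score']:.2f}")
    lines.append("Answer with the single id of the box that best matches the expression.")
    return "\n".join(lines)


def llm_wrapper_rec(expression, candidates, query_llm):
    """query_llm: callable str -> str (any chat-completion style API)."""
    answer = query_llm(build_rec_prompt(expression, candidates))
    digits = "".join(ch for ch in answer if ch.isdigit())
    chosen = int(digits) if digits else candidates[0]["id"]  # fall back to top detection
    return next(c for c in candidates if c["id"] == chosen)


# Usage with a stubbed LLM that always answers "1":
if __name__ == "__main__":
    cands = [{"id": 0, "label": "dog", "box": [10, 20, 80, 90], "score": 0.81},
             {"id": 1, "label": "dog", "box": [120, 30, 200, 95], "score": 0.77}]
    picked = llm_wrapper_rec("the dog on the right", cands, query_llm=lambda p: "1")
    print(picked["box"])
```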
Citations: 0
Ultrasound Image Enhancement with the Variance of Diffusion Models
Pub Date : 2024-09-17 DOI: arxiv-2409.11380
Yuxin Zhang, Clément Huneau, Jérôme Idier, Diana Mateus
Ultrasound imaging, despite its widespread use in medicine, often suffers from various sources of noise and artifacts that impact the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that integrates adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach leverages both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate the effectiveness of our method in achieving superior image reconstructions from single plane-wave acquisitions. The code is available at: https://github.com/Yuxin-Zhang-Jasmine/IUS2024_Diffusion.
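As a rough illustration of the variance-imaging step described above, the sketch below draws several stochastic reverse-diffusion samples and takes their per-pixel variance. The beamformed input and the fine-tuned denoising diffusion model are placeholders (assumptions), and the real pipeline also includes EBMV beamforming, which is not shown.

```python
# Sketch of variance imaging over multiple diffusion-denoised samples (assumed
# interfaces; the beamforming stage and the fine-tuned diffusion model are stubs).
import numpy as np

def variance_image(beamformed_image, denoise_once, n_samples=8):
    """denoise_once: callable (image, rng) -> one sample from the stochastic
    reverse-diffusion process. Returns the per-pixel variance across samples,
    which the paper uses to form a despeckled image."""
    samples = np.stack([denoise_once(beamformed_image, np.random.default_rng(s))
                        for s in range(n_samples)], axis=0)
    return samples.var(axis=0)

# Toy usage with a stub "denoiser" that just adds seeded noise to the input.
if __name__ == "__main__":
    img = np.random.rand(128, 128).astype(np.float32)
    stub = lambda x, rng: x + 0.05 * rng.standard_normal(x.shape).astype(np.float32)
    var_map = variance_image(img, stub)
    print(var_map.shape, float(var_map.mean()))
```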
Citations: 0
SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking
Pub Date : 2024-09-17 DOI: arxiv-2409.11235
Siyuan Li, Lei Ke, Yung-Hsu Yang, Luigi Piccinelli, Mattia Segù, Martin Danelljan, Luc Van Gool
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in the large-vocabulary scenarios and unstable classification of the novel objects, the motion and semantics cues are either ignored or applied based on heuristics in the final matching steps by existing methods. In this paper, we present a unified framework SLAck that jointly considers semantics, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods for novel classes tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at https://github.com/siyuanliii/SLAck.
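The abstract states that semantic, location, and appearance priors are fused by a learned spatio-temporal object graph. The sketch below is a much simpler stand-in that combines the three cues with fixed weights and Hungarian matching, only to make the shape of such an association step concrete; the weights and the dictionary fields are assumptions, not SLAck's learned fusion.

```python
# Sketch of association that jointly uses semantic, location, and appearance cues.
# SLAck learns the fusion with a spatio-temporal object graph; here the cues are
# simply weighted and matched with the Hungarian algorithm for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, dets, w_sem=0.3, w_loc=0.4, w_app=0.3):
    """tracks/dets: lists of dicts with 'box' (x1, y1, x2, y2), 'emb' (unit vector),
    and 'cls_prob' (class distribution). Returns matched (track_idx, det_idx) pairs."""
    cost = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            sem = float(np.dot(t["cls_prob"], d["cls_prob"]))   # semantic agreement
            loc = iou(np.asarray(t["box"]), np.asarray(d["box"]))  # location overlap
            app = float(np.dot(t["emb"], d["emb"]))             # appearance similarity
            cost[i, j] = -(w_sem * sem + w_loc * loc + w_app * app)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```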
Citations: 0
STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking
Pub Date : 2024-09-17 DOI: arxiv-2409.11234
Jianbo Ma, Chuanming Tang, Fei Wu, Can Zhao, Jianlin Zhang, Zhiyong Xu
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target reidentification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially for challenging tracking conditions such as object deformation and blurring, etc. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embedding based on adjacent frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate our STCMOT sets a new state-of-the-art performance in MOTA and IDF1 metrics. The source codes are released at https://github.com/ydhcg-BoBo/STCMOT.
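The temporal embedding boosting module is learned in the paper; as a hedged illustration of the underlying idea (stabilising a track's ReID embedding using adjacent frames), the sketch below uses a simple exponential moving average instead of the learned aggregation.

```python
# Sketch of the idea behind temporal embedding boosting: stabilise a track's ReID
# embedding by aggregating it with the embedding from the adjacent frame. STCMOT
# learns this aggregation; an exponential moving average is shown purely for
# illustration, and the momentum value is an assumption.
import torch
import torch.nn.functional as F

def boost_embedding(prev_emb: torch.Tensor, curr_emb: torch.Tensor, momentum: float = 0.9):
    """prev_emb, curr_emb: (D,) L2-normalised ReID features of the same track."""
    fused = momentum * prev_emb + (1.0 - momentum) * curr_emb
    return F.normalize(fused, dim=0)

if __name__ == "__main__":
    prev = F.normalize(torch.randn(128), dim=0)
    curr = F.normalize(torch.randn(128), dim=0)
    print(boost_embedding(prev, curr).norm())  # stays unit-length
```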
Citations: 0
Reducing Catastrophic Forgetting in Online Class Incremental Learning Using Self-Distillation
Pub Date : 2024-09-17 DOI: arxiv-2409.11329
Kotaro Nagata, Hiromu Ono, Kazuhiro Hotta
In continual learning, there is a serious problem of catastrophic forgetting, in which previous knowledge is forgotten when a model learns new tasks. Various methods have been proposed to solve this problem. Replay methods, which replay data from previous tasks in later training, have shown good accuracy. However, replay methods have a generalizability problem from a limited memory buffer. In this paper, we tried to solve this problem by acquiring transferable knowledge through self-distillation using highly generalizable output in the shallow layer as a teacher. Furthermore, when we deal with a large number of classes or challenging data, there is a risk of learning not converging and not experiencing overfitting. Therefore, we attempted to achieve more efficient and thorough learning by prioritizing the storage of easily misclassified samples through a new method of memory update. We confirmed that our proposed method outperformed conventional methods by experiments on CIFAR10, CIFAR100, and MiniimageNet datasets.
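A minimal sketch of the two ingredients named in the abstract, under assumed model structure: a self-distillation loss that uses a shallow-layer classifier as the teacher for the final classifier, and a memory update that prefers samples the model currently misclassifies. The loss weight, temperature, and buffer policy are assumptions, not the authors' settings.

```python
# Sketch only: self-distillation from a shallow head plus a misclassification-first
# replay-buffer update. A full method would also evict/replace stored samples.
import torch
import torch.nn.functional as F

def self_distillation_loss(shallow_logits, deep_logits, targets, alpha=0.5, T=2.0):
    """Cross-entropy on the deep head plus KL distillation from the shallow head."""
    ce = F.cross_entropy(deep_logits, targets)
    kd = F.kl_div(F.log_softmax(deep_logits / T, dim=1),
                  F.softmax(shallow_logits.detach() / T, dim=1),
                  reduction="batchmean") * T * T
    return ce + alpha * kd

def update_memory(buffer, xs, ys, logits, capacity=200):
    """Store misclassified samples first, then the rest, while space remains."""
    wrong = logits.argmax(dim=1) != ys
    order = torch.cat([wrong.nonzero(as_tuple=True)[0],
                       (~wrong).nonzero(as_tuple=True)[0]]).tolist()
    for i in order:
        if len(buffer) >= capacity:
            break
        buffer.append((xs[i].detach(), int(ys[i])))
    return buffer
```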
Citations: 0
Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification
Pub Date : 2024-09-17 DOI: arxiv-2409.11375
Fatema-E- Jannat, Sina Gholami, Jennifer I. Lim, Theodore Leng, Minhaj Nur Alam, Hamed Tabkhi
In the medical domain, acquiring large datasets poses significant challenges due to privacy concerns. Nonetheless, the development of a robust deep-learning model for retinal disease diagnosis necessitates a substantial dataset for training. The capacity to generalize effectively on smaller datasets remains a persistent challenge. The scarcity of data presents a significant barrier to the practical implementation of scalable medical AI solutions. To address this issue, we've combined a wide range of data sources to improve performance and generalization to new data by giving it a deeper understanding of the data representation from multi-modal datasets and developed a self-supervised framework based on large language models (LLMs), SwinV2 to gain a deeper understanding of multi-modal dataset representations, enhancing the model's ability to extrapolate to new data for the detection of eye diseases using optical coherence tomography (OCT) images. We adopt a two-phase training methodology, self-supervised pre-training, and fine-tuning on a downstream supervised classifier. An ablation study conducted across three datasets employing various encoder backbones, without data fusion, with low data availability setting, and without self-supervised pre-training scenarios, highlights the robustness of our method. Our findings demonstrate consistent performance across these diverse conditions, showcasing superior generalization capabilities compared to the baseline model, ResNet-50.
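The two-phase methodology can be pictured as below: self-supervised pre-training of a backbone on pooled, unlabeled multi-source data, followed by supervised fine-tuning with a classification head. The backbone, the self-supervised objective, the feature dimension, and the data loaders are all placeholders; the paper's framework is built around SwinV2 and multi-source OCT datasets.

```python
# Sketch of a generic two-phase recipe (assumptions throughout): phase 1 trains the
# backbone with an injected self-supervised loss on unlabeled data pooled from all
# sources; phase 2 attaches a linear head and fine-tunes on labeled disease classes.
import torch
import torch.nn as nn

def pretrain(backbone, ssl_loss, unlabeled_loader, epochs=1, lr=1e-4):
    opt = torch.optim.AdamW(backbone.parameters(), lr=lr)
    for _ in range(epochs):
        for x in unlabeled_loader:          # unlabeled batches from all sources
            loss = ssl_loss(backbone, x)    # e.g. a masked-image or contrastive loss
            opt.zero_grad(); loss.backward(); opt.step()
    return backbone

def finetune(backbone, n_classes, labeled_loader, epochs=1, lr=1e-5, feat_dim=768):
    head = nn.Linear(feat_dim, n_classes)   # backbone is assumed to output (B, feat_dim)
    model = nn.Sequential(backbone, head)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in labeled_loader:         # labeled retinal-disease batches
            opt.zero_grad(); ce(model(x), y).backward(); opt.step()
    return model
```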
Citations: 0
CLIP Adaptation by Intra-modal Overlap Reduction
Pub Date : 2024-09-17 DOI: arxiv-2409.11338
Alexey Kravets, Vinay Namboodiri
Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit high cosine similarity distribution overlap in the image space between paired and unpaired examples, affecting the performance of few-shot training-free classification methods which rely on similarity in the image space for their predictions. To tackle intra-modal overlap we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, demonstrating that this improves accuracy for few-shot training-free classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift and c) higher feature variance rendering the features more discriminative for downstream tasks.
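The abstract does not detail the adapter architecture or its training objective; the sketch below only shows what a lightweight residual adapter over frozen CLIP image features could look like, and how the adapted features would feed a training-free prototype classifier. All layer sizes and the residual ratio are assumptions.

```python
# Sketch of a lightweight residual adapter on frozen CLIP image embeddings and a
# training-free prototype classifier over the adapted features. The architecture
# and the objective used to train the adapter on generic images are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageAdapter(nn.Module):
    def __init__(self, dim=512, hidden=128, residual=0.8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.residual = residual

    def forward(self, feats):                       # feats: (N, dim) CLIP image features
        out = self.residual * feats + (1 - self.residual) * self.net(feats)
        return F.normalize(out, dim=-1)

def few_shot_classify(adapter, support_feats, support_labels, query_feats, n_classes):
    """Training-free classification: cosine similarity to per-class mean prototypes."""
    s, q = adapter(support_feats), adapter(query_feats)
    protos = torch.stack([s[support_labels == c].mean(0) for c in range(n_classes)])
    protos = F.normalize(protos, dim=-1)
    return (q @ protos.t()).argmax(dim=1)
```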
Citations: 0
OSV: One Step is Enough for High-Quality Image to Video Generation
Pub Date : 2024-09-17 DOI: arxiv-2409.11367
Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang
Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. Efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and by GAN training, but these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency distillation based method, AnimateLCM (FVD 184.79), and approaches the 25-step performance of advanced Stable Video Diffusion (FVD 156.94).
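As a hedged illustration of a discriminator that scores video latents without decoding them, the sketch below applies 3D convolutions directly to latent clips. The layer configuration is invented for illustration and is not the paper's discriminator design.

```python
# Sketch of a discriminator that operates directly on video latents, so no VAE
# decoding is needed during adversarial training. Channel counts, strides, and the
# single-score output are illustrative assumptions only.
import torch
import torch.nn as nn

class LatentVideoDiscriminator(nn.Module):
    def __init__(self, latent_channels=4, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_channels, width, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(width, width * 2, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(width * 2, 1),        # real/fake score per latent clip
        )

    def forward(self, z):                   # z: (B, C, T, H, W) video latents
        return self.net(z)

if __name__ == "__main__":
    d = LatentVideoDiscriminator()
    print(d(torch.randn(2, 4, 8, 32, 32)).shape)  # -> torch.Size([2, 1])
```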
Citations: 0
NSSR-DIL: Null-Shot Image Super-Resolution Using Deep Identity Learning
Pub Date : 2024-09-17 DOI: arxiv-2409.12165
Sree Rama Vamsidhar S, Rama Krishna Gorthi
The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods employ Deep Learning (DL) techniques using a large amount of image data. The primary limitation to extending the existing SotA ISR works for real-world instances is their computational and time complexities. In this paper, contrary to the existing methods, we present a novel and computationally efficient ISR algorithm that is independent of the image dataset to learn the ISR task. The proposed algorithm reformulates the ISR task from generating the Super-Resolved (SR) images to computing the inverse of the kernels that span the degradation space. We introduce Deep Identity Learning, exploiting the identity relation between the degradation and inverse degradation models. The proposed approach neither relies on the ISR dataset nor on a single input low-resolution (LR) image (like the self-supervised method i.e. ZSSR) to model the ISR task. Hence we term our model as Null-Shot Super-Resolution Using Deep Identity Learning (NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources, at least by an order of 10, and demonstrates a competitive performance on benchmark ISR datasets. Another salient aspect of our proposition is that the NSSR-DIL framework detours retraining the model and remains the same for varying scale factors like X2, X3, and X4. This makes our highly efficient ISR model more suitable for real-world applications.
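One way to read Deep Identity Learning is as fitting an inverse kernel whose convolution with a degradation kernel approximates a delta (identity) kernel. The sketch below optimises a single inverse kernel for one given degradation kernel; this is an illustrative simplification of the idea, since the paper learns the inverse over a space of degradation kernels and without any image dataset.

```python
# Sketch of the "identity" idea: learn an inverse kernel k_inv such that
# k_inv * k_deg (full convolution) is close to a delta kernel. Illustrative only;
# the inverse-kernel size, step count, and learning rate are assumptions.
import torch
import torch.nn.functional as F

def learn_inverse_kernel(deg_kernel, inv_size=21, steps=2000, lr=1e-2):
    """deg_kernel: (kh, kw) tensor. Returns an (inv_size, inv_size) inverse kernel."""
    inv = torch.zeros(inv_size, inv_size, requires_grad=True)
    opt = torch.optim.Adam([inv], lr=lr)
    kh, kw = deg_kernel.shape
    # Target: a delta kernel at the centre of the full-convolution output.
    th, tw = kh + inv_size - 1, kw + inv_size - 1
    target = torch.zeros(th, tw)
    target[th // 2, tw // 2] = 1.0
    for _ in range(steps):
        # Flip turns cross-correlation (conv2d) into true convolution.
        composed = F.conv2d(inv.view(1, 1, inv_size, inv_size),
                            deg_kernel.flip((0, 1)).view(1, 1, kh, kw),
                            padding=(kh - 1, kw - 1)).squeeze()
        loss = F.mse_loss(composed, target)
        opt.zero_grad(); loss.backward(); opt.step()
    return inv.detach()

if __name__ == "__main__":
    blur = torch.ones(5, 5) / 25.0          # toy box-blur degradation kernel
    k_inv = learn_inverse_kernel(blur, steps=500)
    print(k_inv.shape)
```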
Citations: 0
Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark
Pub Date : 2024-09-17 DOI: arxiv-2409.11227
Clifford Broni-Bediako, Junshi Xia, Jian Song, Hongruixuan Chen, Mennatullah Siam, Naoto Yokoya
Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that can encourage deep learning models to learn from few labelled examples for novel classes not seen during the training. The generalized few-shot segmentation setting has an additional challenge which encourages models not only to adapt to the novel classes but also to maintain strong performance on the training base classes. While previous datasets and benchmarks discussed the few-shot segmentation setting in remote sensing, we are the first to propose a generalized few-shot segmentation benchmark for remote sensing. The generalized setting is more realistic and challenging, which necessitates exploring it within the remote sensing context. We release the dataset augmenting OpenEarthMap with additional classes labelled for the generalized few-shot evaluation setting. The dataset is released during the OpenEarthMap land cover mapping generalized few-shot challenge in the L3D-IVU workshop in conjunction with CVPR 2024. In this work, we summarize the dataset and challenge details in addition to providing the benchmark results on the two phases of the challenge for the validation and test sets.
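To make the "generalized" part of the evaluation concrete, the sketch below reports per-class IoU separately for base and novel classes, which is what such a setting scores a model on. The official challenge metric and any class weighting are not reproduced here; this is purely illustrative.

```python
# Sketch of a generalized few-shot segmentation report: IoU is computed per class
# and then averaged separately over base and novel class groups. The combined
# "mean" shown here is a plain average and is an assumption, not the official metric.
import numpy as np

def per_class_iou(pred, gt, class_ids):
    ious = {}
    for c in class_ids:
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        ious[c] = np.logical_and(p, g).sum() / union if union else np.nan
    return ious

def generalized_report(pred, gt, base_classes, novel_classes):
    ious = per_class_iou(pred, gt, list(base_classes) + list(novel_classes))
    base = np.nanmean([ious[c] for c in base_classes])
    novel = np.nanmean([ious[c] for c in novel_classes])
    return {"base_mIoU": base, "novel_mIoU": novel, "mean": (base + novel) / 2}

if __name__ == "__main__":
    pred = np.random.randint(0, 6, (64, 64))
    gt = np.random.randint(0, 6, (64, 64))
    print(generalized_report(pred, gt, base_classes=[0, 1, 2, 3], novel_classes=[4, 5]))
```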
Citations: 0