Ye Wang, Yaxiong Wang, Guoshuai Zhao, Xueming Qian
Few-shot class-incremental learning (FSCIL) aims to incrementally recognize new classes from only a few samples while maintaining performance on previously learned classes. One effective approach to this challenge is to construct prototype-evolution classifiers. Despite the advances made by most existing methods, the classifier weights are simply initialized with mean features. Because the representations of new classes are weak and biased, we argue that such a strategy is suboptimal. In this paper, we tackle this issue from two aspects. First, leveraging the development of foundation models, we employ a foundation model, CLIP, as the network backbone to provide a general representation for each class. Second, to generate a more reliable and comprehensive instance representation, we propose a Knowledge Adapter (KA) module that summarizes data-specific knowledge from the training data and fuses it into the general representation. Additionally, to adapt the knowledge learned from the base classes to the upcoming classes, we propose an Incremental Pseudo Episode Learning (IPEL) mechanism that simulates the actual FSCIL setting. Taken together, our proposed method, dubbed Knowledge Adaptation Network (KANet), achieves competitive performance on a wide range of datasets, including CIFAR100, CUB200, and ImageNet-R.
{"title":"Knowledge Adaptation Network for Few-Shot Class-Incremental Learning","authors":"Ye Wang, Yaxiong Wang, Guoshuai Zhao, Xueming Qian","doi":"arxiv-2409.11770","DOIUrl":"https://doi.org/arxiv-2409.11770","url":null,"abstract":"Few-shot class-incremental learning (FSCIL) aims to incrementally recognize\u0000new classes using a few samples while maintaining the performance on previously\u0000learned classes. One of the effective methods to solve this challenge is to\u0000construct prototypical evolution classifiers. Despite the advancement achieved\u0000by most existing methods, the classifier weights are simply initialized using\u0000mean features. Because representations for new classes are weak and biased, we\u0000argue such a strategy is suboptimal. In this paper, we tackle this issue from\u0000two aspects. Firstly, thanks to the development of foundation models, we employ\u0000a foundation model, the CLIP, as the network pedestal to provide a general\u0000representation for each class. Secondly, to generate a more reliable and\u0000comprehensive instance representation, we propose a Knowledge Adapter (KA)\u0000module that summarizes the data-specific knowledge from training data and fuses\u0000it into the general representation. Additionally, to tune the knowledge learned\u0000from the base classes to the upcoming classes, we propose a mechanism of\u0000Incremental Pseudo Episode Learning (IPEL) by simulating the actual FSCIL.\u0000Taken together, our proposed method, dubbed as Knowledge Adaptation Network\u0000(KANet), achieves competitive performance on a wide range of datasets,\u0000including CIFAR100, CUB200, and ImageNet-R.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ultrasound imaging, despite its widespread use in medicine, often suffers from various sources of noise and artifacts that impact the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that integrates adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach leverages both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate the effectiveness of our method in achieving superior image reconstructions from single plane-wave acquisitions. The code is available at: https://github.com/Yuxin-Zhang-Jasmine/IUS2024_Diffusion.
{"title":"Ultrasound Image Enhancement with the Variance of Diffusion Models","authors":"Yuxin Zhang, Clément Huneau, Jérôme Idier, Diana Mateus","doi":"arxiv-2409.11380","DOIUrl":"https://doi.org/arxiv-2409.11380","url":null,"abstract":"Ultrasound imaging, despite its widespread use in medicine, often suffers\u0000from various sources of noise and artifacts that impact the signal-to-noise\u0000ratio and overall image quality. Enhancing ultrasound images requires a\u0000delicate balance between contrast, resolution, and speckle preservation. This\u0000paper introduces a novel approach that integrates adaptive beamforming with\u0000denoising diffusion-based variance imaging to address this challenge. By\u0000applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a\u0000denoising diffusion model fine-tuned on ultrasound data, our method computes\u0000the variance across multiple diffusion-denoised samples to produce high-quality\u0000despeckled images. This approach leverages both the inherent multiplicative\u0000noise of ultrasound and the stochastic nature of diffusion models. Experimental\u0000results on a publicly available dataset demonstrate the effectiveness of our\u0000method in achieving superior image reconstructions from single plane-wave\u0000acquisitions. The code is available at:\u0000https://github.com/Yuxin-Zhang-Jasmine/IUS2024_Diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siyuan Li, Lei Ke, Yung-Hsu Yang, Luigi Piccinelli, Mattia Segù, Martin Danelljan, Luc Van Gool
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not present in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in large-vocabulary scenarios and the unstable classification of novel objects, existing methods either ignore motion and semantic cues or apply them heuristically in the final matching steps. In this paper, we present SLAck, a unified framework that jointly considers semantic, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and significantly boosts association performance for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods on novel-class tracking in the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at https://github.com/siyuanliii/SLAck.
{"title":"SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking","authors":"Siyuan Li, Lei Ke, Yung-Hsu Yang, Luigi Piccinelli, Mattia Segù, Martin Danelljan, Luc Van Gool","doi":"arxiv-2409.11235","DOIUrl":"https://doi.org/arxiv-2409.11235","url":null,"abstract":"Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to\u0000novel categories not in the training set. Currently, the best-performing\u0000methods are mainly based on pure appearance matching. Due to the complexity of\u0000motion patterns in the large-vocabulary scenarios and unstable classification\u0000of the novel objects, the motion and semantics cues are either ignored or\u0000applied based on heuristics in the final matching steps by existing methods. In\u0000this paper, we present a unified framework SLAck that jointly considers\u0000semantics, location, and appearance priors in the early steps of association\u0000and learns how to integrate all valuable information through a lightweight\u0000spatial and temporal object graph. Our method eliminates complex\u0000post-processing heuristics for fusing different cues and boosts the association\u0000performance significantly for large-scale open-vocabulary tracking. Without\u0000bells and whistles, we outperform previous state-of-the-art methods for novel\u0000classes tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code\u0000is available at\u0000href{https://github.com/siyuanliii/SLAck}{github.com/siyuanliii/SLAck}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching for target re-identification (ReID). These methods focus on optimizing the spatial attributes of targets while overlooking temporal cues in modelling object relationships, especially under challenging tracking conditions such as object deformation and blurring. To address these issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent-frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that STCMOT sets a new state of the art in MOTA and IDF1 metrics. The source code is released at https://github.com/ydhcg-BoBo/STCMOT.
{"title":"STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking","authors":"Jianbo Ma, Chuanming Tang, Fei Wu, Can Zhao, Jianlin Zhang, Zhiyong Xu","doi":"arxiv-2409.11234","DOIUrl":"https://doi.org/arxiv-2409.11234","url":null,"abstract":"Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is\u0000important for diverse applications in computer vision. Current MOT trackers\u0000rely on accurate object detection results and precise matching of target\u0000reidentification (ReID). These methods focus on optimizing target spatial\u0000attributes while overlooking temporal cues in modelling object relationships,\u0000especially for challenging tracking conditions such as object deformation and\u0000blurring, etc. To address the above-mentioned issues, we propose a novel\u0000Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which\u0000utilizes historical embedding features to model the representation of ReID and\u0000detection features in a sequential order. Concretely, a temporal embedding\u0000boosting module is introduced to enhance the discriminability of individual\u0000embedding based on adjacent frame cooperation. While the trajectory embedding\u0000is then propagated by a temporal detection refinement module to mine salient\u0000target locations in the temporal field. Extensive experiments on the\u0000VisDrone2019 and UAVDT datasets demonstrate our STCMOT sets a new\u0000state-of-the-art performance in MOTA and IDF1 metrics. The source codes are\u0000released at https://github.com/ydhcg-BoBo/STCMOT.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In continual learning, catastrophic forgetting, in which previous knowledge is forgotten when a model learns new tasks, is a serious problem, and various methods have been proposed to solve it. Replay methods, which replay data from previous tasks during later training, have shown good accuracy; however, they suffer from limited generalizability because of the restricted memory buffer. In this paper, we address this problem by acquiring transferable knowledge through self-distillation, using the highly generalizable output of a shallow layer as a teacher. Furthermore, when dealing with a large number of classes or challenging data, there is a risk that learning does not converge and remains insufficiently trained. We therefore pursue more efficient and thorough learning by prioritizing the storage of easily misclassified samples through a new memory-update method. Experiments on the CIFAR10, CIFAR100, and MiniImageNet datasets confirm that our proposed method outperforms conventional methods.
{"title":"Reducing Catastrophic Forgetting in Online Class Incremental Learning Using Self-Distillation","authors":"Kotaro Nagata, Hiromu Ono, Kazuhiro Hotta","doi":"arxiv-2409.11329","DOIUrl":"https://doi.org/arxiv-2409.11329","url":null,"abstract":"In continual learning, there is a serious problem of catastrophic forgetting,\u0000in which previous knowledge is forgotten when a model learns new tasks. Various\u0000methods have been proposed to solve this problem. Replay methods which replay\u0000data from previous tasks in later training, have shown good accuracy. However,\u0000replay methods have a generalizability problem from a limited memory buffer. In\u0000this paper, we tried to solve this problem by acquiring transferable knowledge\u0000through self-distillation using highly generalizable output in shallow layer as\u0000a teacher. Furthermore, when we deal with a large number of classes or\u0000challenging data, there is a risk of learning not converging and not\u0000experiencing overfitting. Therefore, we attempted to achieve more efficient and\u0000thorough learning by prioritizing the storage of easily misclassified samples\u0000through a new method of memory update. We confirmed that our proposed method\u0000outperformed conventional methods by experiments on CIFAR10, CIFAR100, and\u0000MiniimageNet datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fatema-E- Jannat, Sina Gholami, Jennifer I. Lim, Theodore Leng, Minhaj Nur Alam, Hamed Tabkhi
In the medical domain, acquiring large datasets poses significant challenges due to privacy concerns. Nonetheless, developing a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training, and the capacity to generalize effectively from smaller datasets remains a persistent challenge. The scarcity of data presents a significant barrier to the practical implementation of scalable medical AI solutions. To address this issue, we combine a wide range of data sources to improve performance and generalization to new data, and we develop a self-supervised framework built on the SwinV2 backbone and inspired by large language models (LLMs) to gain a deeper understanding of multi-modal dataset representations, enhancing the model's ability to extrapolate to new data for the detection of eye diseases from optical coherence tomography (OCT) images. We adopt a two-phase training methodology: self-supervised pre-training followed by fine-tuning on a downstream supervised classifier. An ablation study conducted across three datasets, employing various encoder backbones, without data fusion, with a low-data-availability setting, and without self-supervised pre-training, highlights the robustness of our method. Our findings demonstrate consistent performance across these diverse conditions, showcasing superior generalization capabilities compared to the baseline model, ResNet-50.
{"title":"Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification","authors":"Fatema-E- Jannat, Sina Gholami, Jennifer I. Lim, Theodore Leng, Minhaj Nur Alam, Hamed Tabkhi","doi":"arxiv-2409.11375","DOIUrl":"https://doi.org/arxiv-2409.11375","url":null,"abstract":"In the medical domain, acquiring large datasets poses significant challenges\u0000due to privacy concerns. Nonetheless, the development of a robust deep-learning\u0000model for retinal disease diagnosis necessitates a substantial dataset for\u0000training. The capacity to generalize effectively on smaller datasets remains a\u0000persistent challenge. The scarcity of data presents a significant barrier to\u0000the practical implementation of scalable medical AI solutions. To address this\u0000issue, we've combined a wide range of data sources to improve performance and\u0000generalization to new data by giving it a deeper understanding of the data\u0000representation from multi-modal datasets and developed a self-supervised\u0000framework based on large language models (LLMs), SwinV2 to gain a deeper\u0000understanding of multi-modal dataset representations, enhancing the model's\u0000ability to extrapolate to new data for the detection of eye diseases using\u0000optical coherence tomography (OCT) images. We adopt a two-phase training\u0000methodology, self-supervised pre-training, and fine-tuning on a downstream\u0000supervised classifier. An ablation study conducted across three datasets\u0000employing various encoder backbones, without data fusion, with low data\u0000availability setting, and without self-supervised pre-training scenarios,\u0000highlights the robustness of our method. Our findings demonstrate consistent\u0000performance across these diverse conditions, showcasing superior generalization\u0000capabilities compared to the baseline model, ResNet-50.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. Because CLIP is trained on a large corpus, it generalises well when adapted to few-shot classification. In this work, we analyse the intra-modal overlap in image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit a large overlap between the cosine-similarity distributions of paired and unpaired examples in the image space, which hurts the performance of training-free few-shot classification methods that rely on image-space similarity for their predictions. To tackle intra-modal overlap, we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, and we demonstrate that this improves accuracy for training-free few-shot classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift, and c) higher feature variance, rendering the features more discriminative for downstream tasks.
{"title":"CLIP Adaptation by Intra-modal Overlap Reduction","authors":"Alexey Kravets, Vinay Namboodiri","doi":"arxiv-2409.11338","DOIUrl":"https://doi.org/arxiv-2409.11338","url":null,"abstract":"Numerous methods have been proposed to adapt a pre-trained foundational CLIP\u0000model for few-shot classification. As CLIP is trained on a large corpus, it\u0000generalises well through adaptation to few-shot classification. In this work,\u0000we analyse the intra-modal overlap in image space in terms of embedding\u0000representation. Our analysis shows that, due to contrastive learning,\u0000embeddings from CLIP model exhibit high cosine similarity distribution overlap\u0000in the image space between paired and unpaired examples affecting the\u0000performance of few-shot training-free classification methods which rely on\u0000similarity in the image space for their predictions. To tackle intra-modal\u0000overlap we propose to train a lightweight adapter on a generic set of samples\u0000from the Google Open Images dataset demonstrating that this improves accuracy\u0000for few-shot training-free classification. We validate our contribution through\u0000extensive empirical analysis and demonstrate that reducing the intra-modal\u0000overlap leads to a) improved performance on a number of standard datasets, b)\u0000increased robustness to distribution shift and c) higher feature variance\u0000rendering the features more discriminative for downstream tasks.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherently iterative nature leads to substantial computational and time costs. Efforts have been made to accelerate video diffusion by reducing inference steps (through techniques such as consistency distillation) and by GAN training, but these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79) and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
{"title":"OSV: One Step is Enough for High-Quality Image to Video Generation","authors":"Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang","doi":"arxiv-2409.11367","DOIUrl":"https://doi.org/arxiv-2409.11367","url":null,"abstract":"Video diffusion models have shown great potential in generating high-quality\u0000videos, making them an increasingly popular focus. However, their inherent\u0000iterative nature leads to substantial computational and time costs. While\u0000efforts have been made to accelerate video diffusion by reducing inference\u0000steps (through techniques like consistency distillation) and GAN training\u0000(these approaches often fall short in either performance or training\u0000stability). In this work, we introduce a two-stage training framework that\u0000effectively combines consistency distillation with GAN training to address\u0000these challenges. Additionally, we propose a novel video discriminator design,\u0000which eliminates the need for decoding the video latents and improves the final\u0000performance. Our model is capable of producing high-quality videos in merely\u0000one-step, with the flexibility to perform multi-step refinement for further\u0000performance enhancement. Our quantitative evaluation on the OpenWebVid-1M\u0000benchmark shows that our model significantly outperforms existing methods.\u0000Notably, our 1-step performance(FVD 171.15) exceeds the 8-step performance of\u0000the consistency distillation based method, AnimateLCM (FVD 184.79), and\u0000approaches the 25-step performance of advanced Stable Video Diffusion (FVD\u0000156.94).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods employ Deep Learning (DL) techniques using large amounts of image data. The primary limitation to extending the existing SotA ISR works to real-world instances is their computational and time complexity. In this paper, contrary to existing methods, we present a novel and computationally efficient ISR algorithm that does not depend on an image dataset to learn the ISR task. The proposed algorithm reformulates the ISR task from generating Super-Resolved (SR) images to computing the inverse of the kernels that span the degradation space. We introduce Deep Identity Learning, which exploits the identity relation between the degradation and inverse-degradation models. The proposed approach relies neither on an ISR dataset nor on a single input low-resolution (LR) image (as self-supervised methods such as ZSSR do) to model the ISR task; hence, we term our model Null-Shot Super-Resolution Using Deep Identity Learning (NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources, by at least an order of magnitude, and demonstrates competitive performance on benchmark ISR datasets. Another salient aspect of our proposal is that the NSSR-DIL framework avoids retraining the model and remains the same across varying scale factors such as X2, X3, and X4. This makes our highly efficient ISR model more suitable for real-world applications.
{"title":"NSSR-DIL: Null-Shot Image Super-Resolution Using Deep Identity Learning","authors":"Sree Rama Vamsidhar S, Rama Krishna Gorthi","doi":"arxiv-2409.12165","DOIUrl":"https://doi.org/arxiv-2409.12165","url":null,"abstract":"The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods\u0000employ Deep Learning (DL) techniques using a large amount of image data. The\u0000primary limitation to extending the existing SotA ISR works for real-world\u0000instances is their computational and time complexities. In this paper, contrary\u0000to the existing methods, we present a novel and computationally efficient ISR\u0000algorithm that is independent of the image dataset to learn the ISR task. The\u0000proposed algorithm reformulates the ISR task from generating the Super-Resolved\u0000(SR) images to computing the inverse of the kernels that span the degradation\u0000space. We introduce Deep Identity Learning, exploiting the identity relation\u0000between the degradation and inverse degradation models. The proposed approach\u0000neither relies on the ISR dataset nor on a single input low-resolution (LR)\u0000image (like the self-supervised method i.e. ZSSR) to model the ISR task. Hence\u0000we term our model as Null-Shot Super-Resolution Using Deep Identity Learning\u0000(NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources,\u0000at least by an order of 10, and demonstrates a competitive performance on\u0000benchmark ISR datasets. Another salient aspect of our proposition is that the\u0000NSSR-DIL framework detours retraining the model and remains the same for\u0000varying scale factors like X2, X3, and X4. This makes our highly efficient ISR\u0000model more suitable for real-world applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that encourages deep learning models to learn from a few labelled examples for novel classes not seen during training. The generalized few-shot segmentation setting poses an additional challenge: models must not only adapt to the novel classes but also maintain strong performance on the training base classes. While previous datasets and benchmarks have addressed the few-shot segmentation setting in remote sensing, we are the first to propose a generalized few-shot segmentation benchmark for remote sensing. The generalized setting is more realistic and challenging, which necessitates exploring it within the remote sensing context. We release a dataset augmenting OpenEarthMap with additional classes labelled for the generalized few-shot evaluation setting. The dataset was released during the OpenEarthMap land cover mapping generalized few-shot challenge at the L3D-IVU workshop in conjunction with CVPR 2024. In this work, we summarize the dataset and challenge details and provide the benchmark results on the two phases of the challenge for the validation and test sets.
{"title":"Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark","authors":"Clifford Broni-Bediako, Junshi Xia, Jian Song, Hongruixuan Chen, Mennatullah Siam, Naoto Yokoya","doi":"arxiv-2409.11227","DOIUrl":"https://doi.org/arxiv-2409.11227","url":null,"abstract":"Learning with limited labelled data is a challenging problem in various\u0000applications, including remote sensing. Few-shot semantic segmentation is one\u0000approach that can encourage deep learning models to learn from few labelled\u0000examples for novel classes not seen during the training. The generalized\u0000few-shot segmentation setting has an additional challenge which encourages\u0000models not only to adapt to the novel classes but also to maintain strong\u0000performance on the training base classes. While previous datasets and\u0000benchmarks discussed the few-shot segmentation setting in remote sensing, we\u0000are the first to propose a generalized few-shot segmentation benchmark for\u0000remote sensing. The generalized setting is more realistic and challenging,\u0000which necessitates exploring it within the remote sensing context. We release\u0000the dataset augmenting OpenEarthMap with additional classes labelled for the\u0000generalized few-shot evaluation setting. The dataset is released during the\u0000OpenEarthMap land cover mapping generalized few-shot challenge in the L3D-IVU\u0000workshop in conjunction with CVPR 2024. In this work, we summarize the dataset\u0000and challenge details in addition to providing the benchmark results on the two\u0000phases of the challenge for the validation and test sets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}