Transformer Module Networks for Systematic Generalization in Visual Question Answering
Pub Date : 2024-08-06 DOI: 10.1109/TPAMI.2024.3438887
Moyuru Yamada, Vanessa D'Amario, Kentaro Takemoto, Xavier Boix, Tomotake Sasaki
Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, remain unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules each tackling a sub-task, achieve systematic generalization performance better than or comparable to conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task are key to this performance gain.
CamoFormer: Masked Separable Attention for Camouflaged Object Detection
Pub Date : 2024-08-05 DOI: 10.1109/TPAMI.2024.3438565
Bowen Yin, Xuying Zhang, Deng-Ping Fan, Shaohui Jiao, Ming-Ming Cheng, Luc Van Gool, Qibin Hou
Identifying and segmenting camouflaged objects from the background is challenging. Inspired by the multi-head self-attention in Transformers, we present a simple masked separable attention (MSA) for camouflaged object detection. We first separate the multi-head self-attention into three parts, which are responsible for distinguishing the camouflaged objects from the background using different mask strategies. Furthermore, we propose to capture high-resolution semantic representations progressively based on a simple top-down decoder with the proposed MSA to attain precise segmentation results. These structures plus a backbone encoder form a new model, dubbed CamoFormer. Extensive experiments show that CamoFormer achieves new state-of-the-art performance on three widely-used camouflaged object detection benchmarks. To better evaluate the performance of the proposed CamoFormer around the border regions, we propose two new metrics, i.e., BR-M and BR-F. CamoFormer achieves on average ∼5% relative improvements over previous methods in terms of S-measure and weighted F-measure. Our code is available at https://github.com/HVision-NKU/CamoFormer.
Random Permutation Set Reasoning
Pub Date : 2024-08-05 DOI: 10.1109/TPAMI.2024.3438349
Jixiang Deng, Yong Deng, Jian-Bo Yang
In artificial intelligence, it is crucial for pattern recognition systems to process data with uncertain information, necessitating uncertainty reasoning approaches such as evidence theory. As an orderable extension of evidence theory, random permutation set (RPS) theory has received increasing attention. However, RPS theory lacks a suitable generation method for the element order of the permutation mass function (PMF) and an efficient determination method for the fusion order of the permutation orthogonal sum (POS). To solve these two issues, this paper proposes a reasoning model for RPS theory, called random permutation set reasoning (RPSR). RPSR consists of three techniques: the RPS generation method (RPSGM), the RPSR rule of combination, and ordered probability transformation (OPT). Specifically, RPSGM constructs an RPS based on a Gaussian discriminant model and weight analysis; the RPSR rule incorporates POS with a reliability vector, which can combine RPS sources with reliability in fusion order; and OPT converts an RPS into a probability distribution for the final decision. In addition, numerical examples are provided to illustrate the proposed RPSR. Moreover, RPSR is applied to classification problems: an RPSR-based classification algorithm (RPSRCA) and its hyperparameter tuning method are presented. The results demonstrate the efficiency and stability of RPSRCA compared to existing classifiers.
{"title":"Random Permutation Set Reasoning.","authors":"Jixiang Deng, Yong Deng, Jian- Bo Yang","doi":"10.1109/TPAMI.2024.3438349","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3438349","url":null,"abstract":"<p><p>In artificial intelligence, it is crucial for pattern recognition systems to process data with uncertain information, necessitating uncertainty reasoning approaches such as evidence theory. As an orderable extension of evidence theory, random permutation set (RPS) theory has received increasing attention. However, RPS theory lacks a suitable generation method for the element order of permutation mass function (PMF) and an efficient determination method for the fusion order of permutation orthogonal sum (POS). To solve these two issues, this paper proposes a reasoning model for RPS theory, called random permutation set reasoning (RPSR). RPSR consists of three techniques, including RPS generation method (RPSGM), RPSR rule of combination, and ordered probability transformation (OPT). Specifically, RPSGM can construct RPS based on Gaussian discriminant model and weight analysis; RPSR rule incorporates POS with reliability vector, which can combine RPS sources with reliability in fusion order; OPT is used to convert RPS into a probability distribution for the final decision. Besides, numerical examples are provided to illustrate the proposed RPSR. Moreover, the proposed RPSR is applied to classification problems. An RPSR-based classification algorithm (RPSRCA) and its hyperparameter tuning method are presented. The results demonstrate the efficiency and stability of RPSRCA compared to existing classifiers.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141895075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gradient Harmonization in Unsupervised Domain Adaptation
Pub Date : 2024-08-05 DOI: 10.1109/TPAMI.2024.3438154
Fuxiang Huang, Suqi Song, Lei Zhang
Unsupervised domain adaptation (UDA) intends to transfer knowledge from a labeled source domain to an unlabeled target domain. Many current methods focus on learning feature representations that are both discriminative for classification and invariant across domains by simultaneously optimizing domain alignment and classification tasks. However, these methods often overlook a crucial challenge: the inherent conflict between these two tasks during gradient-based optimization. In this paper, we delve into this issue and introduce two effective solutions, known as Gradient Harmonization (GH and GH++), to mitigate the conflict between the domain alignment and classification tasks. GH operates by altering the gradient angle between the two tasks from an obtuse angle to an acute angle, thus resolving the conflict and trading off the two tasks in a coordinated manner. However, this causes both tasks to deviate from their original optimization directions. We therefore propose an improved version, GH++, which adjusts the gradient angle between the tasks from an obtuse angle to a right angle. This not only eliminates the conflict but also minimizes the deviation from the original gradient directions. Finally, for optimization convenience and efficiency, we evolve the gradient harmonization strategies into a dynamically weighted loss function using an integral operator on the harmonized gradient. Notably, GH/GH++ are orthogonal to UDA and can be seamlessly integrated into most existing UDA models. Theoretical insights and experimental analyses demonstrate that the proposed approaches not only enhance popular UDA baselines but also improve recent state-of-the-art models.
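To make the geometric intuition concrete, here is a small sketch of harmonizing two conflicting task gradients: when their inner product is negative (obtuse angle), each gradient is adjusted so that the angle becomes acute ("gh") or right ("gh++"). It follows the common projection recipe and is only an illustration of the idea; the paper's exact update rule and the dynamically weighted loss derived from it may differ.

```python
import torch

def harmonize(g_align, g_cls, mode="gh++", eps=1e-3):
    """Harmonize two flattened task gradients.

    If their inner product is negative (obtuse angle), each gradient is
    adjusted so it no longer conflicts with the other task:
    "gh"   -> the angle becomes (slightly) acute,
    "gh++" -> the conflicting component is removed (right angle).
    """
    dot = torch.dot(g_align, g_cls)
    if dot >= 0:  # already non-conflicting
        return g_align, g_cls
    coef_a = -dot / g_cls.norm().pow(2)
    coef_c = -dot / g_align.norm().pow(2)
    if mode == "gh":  # overshoot slightly so the new inner product is positive
        coef_a, coef_c = coef_a + eps, coef_c + eps
    return g_align + coef_a * g_cls, g_cls + coef_c * g_align

# Toy check: after harmonization each gradient is orthogonal to the other task's original gradient.
g1, g2 = torch.tensor([1.0, 0.0]), torch.tensor([-1.0, 1.0])
h1, h2 = harmonize(g1, g2)
print(torch.dot(h1, g2).item(), torch.dot(h2, g1).item())  # 0.0 0.0
```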
{"title":"Gradient Harmonization in Unsupervised Domain Adaptation.","authors":"Fuxiang Huang, Suqi Song, Lei Zhang","doi":"10.1109/TPAMI.2024.3438154","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3438154","url":null,"abstract":"<p><p>Unsupervised domain adaptation (UDA) intends to transfer knowledge from a labeled source domain to an unlabeled target domain. Many current methods focus on learning feature representations that are both discriminative for classification and invariant across domains by simultaneously optimizing domain alignment and classification tasks. However, these methods often overlook a crucial challenge: the inherent conflict between these two tasks during gradient-based optimization. In this paper, we delve into this issue and introduce two effective solutions known as Gradient Harmonization, including GH and GH++, to mitigate the conflict between domain alignment and classification tasks. GH operates by altering the gradient angle between different tasks from an obtuse angle to an acute angle, thus resolving the conflict and trade-offing the two tasks in a coordinated manner. Yet, this would cause both tasks to deviate from their original optimization directions. We thus further propose an improved version, GH++, which adjusts the gradient angle between tasks from an obtuse angle to a vertical angle. This not only eliminates the conflict but also minimizes deviation from the original gradient directions. Finally, for optimization convenience and efficiency, we evolve the gradient harmonization strategies into a dynamically weighted loss function using an integral operator on the harmonized gradient. Notably, GH/GH++ are orthogonal to UDA and can be seamlessly integrated into most existing UDA models. Theoretical insights and experimental analyses demonstrate that the proposed approaches not only enhance popular UDA baselines but also improve recent state-of-the-art models.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141895074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Essential Number of Principal Components and Nearly Training-Free Model for Spectral Analysis
Pub Date : 2024-08-02 DOI: 10.1109/TPAMI.2024.3436860
Yifeng Bie, Shuai You, Xinrui Li, Xuekui Zhang, Tao Lu
Learning-enabled spectroscopic analysis, promising for automated real-time analysis of chemicals, faces several challenges. First, a typical machine learning model requires a large number of training samples that physical systems cannot provide. Second, it requires the testing samples to lie within the range of the training samples, which is often not the case in the real world. Further, a spectroscopy device is limited by its memory size, computing power, and battery capacity, which calls for highly efficient learning models for on-site analysis. In this paper, by analyzing multi-gas mixtures and multi-molecule suspensions, we first show that an orders-of-magnitude reduction of the data dimension can be achieved, as the number of principal components that need to be retained equals the number of independent constituents in the mixture. Based on this principle, we design highly compact models in which the essential principal components can be directly extracted from the interrelations between the individual chemical properties and the principal components, and only a few training samples are required. Our model can predict constituent concentrations that have not been seen in the training dataset and provide estimates of measurement noise. This approach can be extended as an effective, standardized method for principal component extraction.
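The central claim, that the number of principal components worth retaining matches the number of independent constituents, can be checked on synthetic data. The snippet below builds toy mixture "spectra" from three Gaussian bands and shows that PCA's explained variance concentrates in the first three components; the data generation is entirely illustrative and unrelated to the paper's measurements.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy check: retained principal components vs. number of constituents.
rng = np.random.default_rng(0)
wavelengths = np.linspace(0.0, 1.0, 500)
component_spectra = np.stack(
    [np.exp(-((wavelengths - c) / 0.05) ** 2) for c in (0.25, 0.5, 0.75)]
)  # 3 constituents, one synthetic absorption band each

concentrations = rng.uniform(0.0, 1.0, size=(200, 3))  # random mixtures
spectra = concentrations @ component_spectra + 0.01 * rng.standard_normal((200, 500))

pca = PCA(n_components=10).fit(spectra)
print(np.round(pca.explained_variance_ratio_, 4))
# The first 3 ratios dominate and the rest are near zero, so only as many
# components as constituents need to be retained.
```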
{"title":"Essential Number of Principal Components and Nearly Training-Free Model for Spectral Analysis.","authors":"Yifeng Bie, Shuai You, Xinrui Li, Xuekui Zhang, Tao Lu","doi":"10.1109/TPAMI.2024.3436860","DOIUrl":"10.1109/TPAMI.2024.3436860","url":null,"abstract":"<p><p>Learning-enabled spectroscopic analysis, promising for automated real-time analysis of chemicals, is facing several challenges. Firstly, a typical machine learning model requires a large number of training samples that physical systems can not provide. Secondly, it requires the testing samples to be in range with the training samples, which often is not the case in the real world. Further, a spectroscopy device is limited by its memory size, computing power, and battery capacity. That requires highly efficient learning models for on-site analysis. In this paper, by analyzing multi-gas mixtures and multi-molecule suspensions, we first show that orders of magnitude reduction of data dimension can be achieved as the number of principal components that need to be retained is the same as the independent constituents in the mixture. From this principle, we designed highly compact models in which the essential principal components can be directly extracted from the interrelations between the individual chemical properties and principal components; and only a few training samples are required. Our model can predict the constituent concentrations that have not been seen in the training dataset and provide estimations of measurement noises. This approach can be extended as an effectively standardized method for principle component extraction.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141879980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Pub Date : 2024-08-02 DOI: 10.1109/TPAMI.2024.3437288
Yangyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan Kankanhalli
Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies have explored various aspects of VQA but have somewhat ignored this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions that models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or the question. Specifically, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. By this means, the identification of unanswerable questions becomes challenging, setting our dataset apart from others that involve mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and discover their significant limitations when applied to our dataset. Additionally, we propose a straightforward method to tackle these unanswerable questions. We believe this dataset will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby increasing the trustworthiness of AI systems. We have made the dataset available to facilitate further exploration in this area.
{"title":"UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models.","authors":"Yangyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan Kankanhalli","doi":"10.1109/TPAMI.2024.3437288","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3437288","url":null,"abstract":"<p><p>Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies, though have explored various aspects of VQA but somewhat ignored this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions that models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or question. In specific, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. By this means, the identification of unanswerable questions becomes challenging, setting our dataset apart from others that involve mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and discover their significant limitations when applied to our dataset. Additionally, we also propose a straightforward method to tackle these unanswerable questions. This dataset, we believe, will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby leading to increased trustworthiness of AI systems. We have made the dataset available to facilitate further exploration in this area.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141879981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UniMiSS+: Universal Medical Self-Supervised Learning From Cross-Dimensional Unpaired Data
Pub Date : 2024-07-31 DOI: 10.1109/TPAMI.2024.3436105
Yutong Xie, Jianpeng Zhang, Yong Xia, Qi Wu
Self-supervised learning (SSL) opens up huge opportunities for medical image analysis, which is well known for its lack of annotations. However, aggregating massive (unlabeled) 3D medical images such as computerized tomography (CT) remains challenging due to high imaging cost and privacy restrictions. In our pilot study, we advocated bringing a wealth of 2D images, such as chest X-rays, as compensation for the lack of 3D data, aiming to build a universal medical self-supervised representation learning framework, called UniMiSS. In particular, we designed a pyramid U-like medical Transformer (MiT) as the backbone to make it possible for UniMiSS to perform SSL with both 2D and 3D images. Consequently, the predecessor UniMiSS has two obvious merits compared to current 3D-specific SSL: (1) more effective, learning stronger representations by benefiting from more and more diverse data; and (2) more versatile, suitable for various downstream tasks without the restriction of the dimensionality barrier. Unfortunately, UniMiSS did not dig deeply into the intrinsic anatomical correlation between 2D medical images and 3D volumes due to the lack of paired multi-modal/dimension patient data. In this extension paper, we propose UniMiSS+, in which we introduce the digitally reconstructed radiograph (DRR) technology to simulate X-ray images from a CT volume, giving access to paired CT and X-ray data. Benefiting from the paired data, we introduce an extra pair-wise constraint to boost cross-modality correlation learning, which can also be adopted as a cross-dimension regularization to further improve the representations. We conduct extensive experiments on multiple 3D/2D medical image analysis tasks, including segmentation and classification. The results show that the proposed UniMiSS+ achieves promising performance on various downstream tasks, not only substantially outperforming ImageNet pre-training and other advanced SSL counterparts but also improving on the predecessor UniMiSS pre-training. Code is available at: https://github.com/YtongXie/UniMiSS-code.
{"title":"UniMiSS+: Universal Medical Self-Supervised Learning From Cross-Dimensional Unpaired Data.","authors":"Yutong Xie, Jianpeng Zhang, Yong Xia, Qi Wu","doi":"10.1109/TPAMI.2024.3436105","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3436105","url":null,"abstract":"<p><p>Self-supervised learning (SSL) opens up huge opportunities for medical image analysis that is well known for its lack of annotations. However, aggregating massive (unlabeled) 3D medical images like computerized tomography (CT) remains challenging due to its high imaging cost and privacy restrictions. In our pilot study, we advocated bringing a wealth of 2D images like chest X-rays as compensation for the lack of 3D data, aiming to build a universal medical self-supervised representation learning framework, called UniMiSS. Especially, we designed a pyramid U- like medical Transformer (MiT) as the backbone to make UniMiSS possible to perform SSL with both 2D and 3D images. Consequently, the predecessor UniMiSS has two obvious merits compared to current 3D-specific SSL: (1) more effective - superior to learning strong representations, benefiting from more and diverse data; and (2) more versatile - suitable for various downstream tasks without the restriction on the dimensionality barrier. Unfortunately, UniMiSS did not dig deeply into the intrinsic anatomy correlation between 2D medical images and 3D volumes due to the lack of paired multi-modal/dimension patient data. In this extension paper, we propose the UniMiSS+, in which we introduce the digitally reconstructed radiographs (DRR) technology to simulate X-ray images from a CT volume to access paired CT and X-ray data. Benefiting from the paired group, we introduce an extra pair- wise constraint to boost the cross-modality correlation learning, which also can be adopted as a cross-dimension regularization to further improve the representations. We conduct expensive experiments on multiple 3D/2D medical image analysis tasks, including segmentation and classification. The results show that the proposed UniMiSS+ achieves promising performance on various downstream tasks, not only outperforming the ImageNet pre-training and other advanced SSL counterparts substantially but also improving the predecessor UniMiSS pre-training. Code is available at: https://github.com/YtongXie/UniMiSS-code.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141861980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Prompt Search
Pub Date : 2024-07-30 DOI: 10.1109/TPAMI.2024.3435939
Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained whereas the vast majority obtained from pre-training are frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as "prompt modules" and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. By conducting extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has good few-shot learning ability, and (iii) is domain-generalizable. The code and models are available at https://github.com/ZhangYuanhan-AI/NOAH.
{"title":"Neural Prompt Search.","authors":"Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu","doi":"10.1109/TPAMI.2024.3435939","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3435939","url":null,"abstract":"<p><p>The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained whereas the vast majority obtained from pre-training are frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as \"prompt modules\" and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. By conducting extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has good few-shot learning ability, and (iii) is domain-generalizable. The code and models are available at https://github.com/ZhangYuanhan-AI/NOAH.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141857453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Diffusion Model Translator for Efficient Image-to-Image Translation
Pub Date : 2024-07-30 DOI: 10.1109/TPAMI.2024.3435448
Mengfei Xia, Yu Zhou, Ran Yi, Yong-Jin Liu, Wenping Wang
Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for iterative refinement, resulting in time-consuming implementations. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that, when employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code will be made publicly available.
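A mechanical sketch of the "translate once at an intermediate timestep" idea: diffuse the source image to a chosen step t*, apply a lightweight translator to that noisy latent, then run the remaining DDPM reverse steps with the target-domain denoiser. The tiny conv nets are untrained stand-ins, and t*, the noise schedule, and the shapes are assumptions; this is not the paper's architecture or training procedure.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear beta schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Forward diffusion: draw x_t ~ q(x_t | x_0)."""
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * torch.randn_like(x0)

denoiser = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the target-domain eps-predictor (a real one is conditioned on t)
translator = nn.Conv2d(3, 3, 1)            # stand-in for the lightweight DMT module

@torch.no_grad()
def translate(x_source, t_star=400):
    # 1) Diffuse the source image up to the chosen intermediate step.
    x_t = q_sample(x_source, t_star)
    # 2) Transfer the distribution to the target domain once, at t_star.
    x_t = translator(x_t)
    # 3) Standard DDPM reverse process from t_star down to 0.
    for t in reversed(range(t_star)):
        beta, a_bar = betas[t], alphas_bar[t]
        eps = denoiser(x_t)
        mean = (x_t - beta / (1.0 - a_bar).sqrt() * eps) / (1.0 - beta).sqrt()
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + beta.sqrt() * noise
    return x_t

out = translate(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```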
{"title":"A Diffusion Model Translator for Efficient Image-to-Image Translation.","authors":"Mengfei Xia, Yu Zhou, Ran Yi, Yong-Jin Liu, Wenping Wang","doi":"10.1109/TPAMI.2024.3435448","DOIUrl":"10.1109/TPAMI.2024.3435448","url":null,"abstract":"<p><p>Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative refinement, thus resulting in a time-consuming implementation. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that in employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code will be made publicly available.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141857450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MoIL: Momentum Imitation Learning for Efficient Vision-Language Adaptation
Pub Date : 2024-07-30 DOI: 10.1109/TPAMI.2024.3435790
Gen Luo, Yiyi Zhou, Minglang Huang, Tianhe Ren, Xiaoshuai Sun, Rongrong Ji
Pre-training and fine-tuning have been the de facto paradigm in vision-language domains. Along with the rapid growth of model sizes, fully fine-tuning these large-scale vision-language pre-training (VLP) models incurs prohibitively expensive storage costs. To address this issue, recent advances in NLP offer a promising and efficient adaptation approach called LoRA, which aims to approximate the fine-tuning of a large pre-trained model by updating low-rank parameters. Despite its effectiveness, we identify that LoRA suffers from a large approximation error on VLP models and that its optimization is also inefficient, which greatly limits its performance upper bound. In this paper, we mathematically prove that the approximation error of low-rank adaptation can be optimized by a new optimization objective, i.e., the weight distance between LoRA and fine-tuning. Based on this finding, we propose a novel PETL method for VLP models, namely momentum imitation learning (MoIL). Specifically, MoIL formulates PETL as a weight imitation learning process and directly optimizes the approximation error bound of the low-rank adaptation. Based on this training scheme, we also explore a new hybrid approximation function to reduce the learning difficulty of low-rank adaptations. With these two novel designs, MoIL can greatly improve the optimization efficiency of the low-rank parameters on VLP models. We validate MoIL on three VLP models ranging from an end-to-end network to a two-stage network, and conduct extensive experiments on four VL tasks. Experimental results demonstrate the superior performance and optimization efficiency of MoIL compared with existing PETL methods. For instance, by updating only 6.23% of the parameters, MoIL can even outperform full tuning by +2.3% on the image-text matching task. Meanwhile, its inference efficiency and generalization ability are also validated on multiple VLP models, e.g., VLMO and VinVL.
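The two ingredients named in the abstract, the low-rank (LoRA) parameterization and a weight-distance objective between the low-rank update and fine-tuning, can be sketched as below. How the fine-tuning target is obtained (the momentum imitation scheme) and the hybrid approximation function are the paper's contributions and are not reproduced here; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W0 plus a trainable low-rank update B @ A."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)  # frozen W0
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        self.alpha = alpha

    def delta(self):
        return self.alpha * self.B @ self.A   # low-rank weight update

    def forward(self, x):
        return x @ (self.weight + self.delta()).t()

def weight_imitation_loss(lora_layer, target_delta):
    """Weight-distance objective: how far the low-rank update is from a reference
    fine-tuning update. Here the reference is just a given tensor; in MoIL it is
    maintained by the momentum imitation scheme."""
    return (lora_layer.delta() - target_delta).pow(2).mean()

layer = LoRALinear(768, 768, rank=8)
target_delta = torch.randn(768, 768) * 0.01   # stand-in for a fine-tuning weight change
loss = weight_imitation_loss(layer, target_delta)
loss.backward()
print(loss.item(), layer.B.grad.shape)        # only A and B receive gradients
```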
{"title":"MoIL: Momentum Imitation Learning for Efficient Vision-Language Adaptation.","authors":"Gen Luo, Yiyi Zhou, Minglang Huang, Tianhe Ren, Xiaoshuai Sun, Rongrong Ji","doi":"10.1109/TPAMI.2024.3435790","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3435790","url":null,"abstract":"<p><p>Pre-training and fine-tuning have been the de-facto paradigm in vision-language domains. Along with the rapid growth of model sizes, fully fine-tuning these large-scale vision-language pre-training (VLP) models requires prohibitively expensive storage costs. To address this issue, recent advances in NLP offer a promising and efficient adaptation approach called LoRA, which aims to approximate the fine-tuning of large pre-trained model by updating low-rank parameters. Despite its effectiveness, we identify that LoRA suffers a large approximation error on VLP models and its optimization is also inefficient, which greatly limits its performance upper bound. In this paper, we mathematically prove that the approximation error of low-rank adaptation can be optimized by a new optimization objective, i.e., the weight distance between LoRA and fine-tuning. Based on this finding, we propose a novel PETL method for VLP models, namely momentum imitation learning (MoIL). Specifically, MoIL formulates PETL as a weight imitation learning process and directly optimize the approximation error bound of the low-rank adaptation. Based on this training scheme, we also explore a new hybrid approximation function to reduce the learning difficulty of low-rank adaptations. With these two novel designs, MoIL can greatly improve the optimization efficiency of the low-rank parameters on VLP models. We validate MoIL on three VLP models ranging from end-to-end network to two-stage network, and conduct extensive experiments on four VL tasks. Experimental results demonstrate superior performance and optimization efficiency of MoIL than existing PETL methods. For instance, by updating only 6.23% parameters, MoIL can even outperform full tuning by +2.3% on image-text matching task. Meanwhile, its inference efficiency and generalization ability is also validated by multiple VLP models, e.g., VLMO and VinVL.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141857452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}