Practical object detection systems are highly desired to be open-ended, learning continually on frequently evolving datasets. Moreover, learning with little supervision adds flexibility for real-world applications such as autonomous driving and robotics, where large-scale datasets can be prohibitively expensive to obtain. However, continual adaptation with few training examples often results in catastrophic forgetting and severe overfitting. To address these issues, a compositional learning system is proposed to enable effective incremental object detection from nonstationary and few-shot data streams. First, a novel bilateral-head framework decouples the representation learning of base (pretrained) and novel (few-shot) classes into separate embedding spaces, handling novel concept integration and base knowledge retention simultaneously. Second, to enhance learning stability, a robust parameter updating rule, the recall-and-progress mechanism, constrains the optimization trajectory of sequential model adaptation. Third, to enforce intertask class discrimination with little memory burden, we present a between-class regularization method that expands the decision space of few-shot classes to construct unbiased feature representations. Finally, we investigate in depth the incomplete-annotation issue arising in the realistic scenario of incremental few-shot object detection (iFSOD) and propose a semisupervised object labeling mechanism that accurately recovers missing annotations for previously encountered classes, further strengthening the detector against catastrophic forgetting. Extensive experiments on the PASCAL Visual Object Classes (VOC) and Microsoft Common Objects in Context (MS-COCO) datasets demonstrate the effectiveness of our method.
Title: Bilateral-Head Region-Based Convolutional Neural Networks: A Unified Approach for Incremental Few-Shot Object Detection
Authors: Yiting Li;Haiyue Zhu;Sichao Tian;Jun Ma;Cheng Xiang;Prahlad Vadakkepat
Journal: IEEE Transactions on Artificial Intelligence
Pub Date: 2024-03-26 | DOI: 10.1109/TAI.2024.3381919
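The bilateral-head decoupling can be sketched minimally: a shared backbone feature is scored by two separate linear heads, one frozen for base classes and one trainable for novel few-shot classes, and their logits are concatenated. The dimensions, random weights, and plain concatenation below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared backbone feature for one region proposal (dimensions are illustrative).
feat_dim, n_base, n_novel = 8, 3, 2
feature = rng.normal(size=feat_dim)

# Two decoupled classification heads: the base head stays frozen after
# pretraining, while only the novel head is updated in few-shot sessions.
W_base = rng.normal(size=(n_base, feat_dim))    # frozen weights
W_novel = rng.normal(size=(n_novel, feat_dim))  # trainable weights

def bilateral_logits(x):
    """Concatenate logits from the two separate embedding spaces."""
    return np.concatenate([W_base @ x, W_novel @ x])

logits = bilateral_logits(feature)
```

Because base-class weights never receive gradients from few-shot data in this scheme, base knowledge retention is structural rather than relying on distillation alone.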
Pub Date: 2024-03-25 | DOI: 10.1109/TAI.2024.3379940
Qian Wang;Fanlin Meng;Toby P. Breckon
Domain adaptation solves image classification problems in the target domain by taking advantage of labeled source data and unlabeled target data. Usually, the source and target domains share the same set of classes. As a special case, open-set domain adaptation (OSDA) assumes the target domain contains additional classes that are not present in the source domain. To solve this problem, our proposed method learns discriminative common subspaces for the source and target domains using a novel open-set locality preserving projection (OSLPP) algorithm. The source- and target-domain data are aligned classwise in the learned common subspaces. To handle the open-set classification problem, our method progressively selects target samples to be pseudolabeled as known classes, rejects outliers detected as unknown classes, and leaves the remaining target samples as uncertain. OSLPP simultaneously aligns the labeled source data and pseudolabeled target data from known classes while pushing the rejected target data away from the known classes. The common subspace learning and the pseudolabeled sample selection/rejection facilitate each other in an iterative learning framework, achieving state-of-the-art performance on four benchmark datasets (Office-31, Office-Home, VisDA17, and Syn2Real-O) with average harmonic means of open-set recognition accuracy (HOS) of 87.6%, 67.0%, 76.1%, and 65.6%, respectively.
Title: Progressively Select and Reject Pseudolabeled Samples for Open-Set Domain Adaptation
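The HOS metric reported above is, in the OSDA literature, the harmonic mean of the accuracy on known classes and the accuracy on the unknown class; it penalizes detectors that do well on one side only. A minimal sketch (the exact per-class averaging used by the paper is an assumption here):

```python
from statistics import harmonic_mean

def hos(known_acc, unknown_acc):
    """Harmonic mean of known-class and unknown-class recognition accuracy."""
    return harmonic_mean([known_acc, unknown_acc])

# A balanced detector scores higher than one that ignores unknowns.
print(round(hos(0.90, 0.85), 3))  # 0.874
print(round(hos(0.99, 0.10), 3))  # 0.182
```

The second call shows why HOS is preferred over plain accuracy: 99% on known classes cannot mask a detector that almost never flags unknown-class samples.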
Simulating the way neurons in the human brain process signals is crucial for constructing neural networks with biological interpretability. However, existing deep neural networks simplify the function of a single neuron and do not consider dendritic plasticity. In this article, we present a multidendrite pyramidal neuron model (MDPN) for image classification, which mimics the multilevel dendritic structure of a nerve cell. Unlike traditional feedforward network models, MDPN discards premature linear summation and employs nonlinear dendritic computation, thereby improving neuroplasticity. To build a lightweight and effective classification system, we emphasize the importance of the single neuron and redefine the function of each subcomponent. Experimental results verify the effectiveness and robustness of the proposed MDPN in classifying 16 standardized image datasets with different characteristics. Compared to other state-of-the-art and well-known networks, MDPN is superior in terms of classification accuracy.
Title: A Lightweight Multidendritic Pyramidal Neuron Model With Neural Plasticity on Image Recognition
Authors: Yu Zhang;Pengxing Cai;Yanan Sun;Zhiming Zhang;Zhenyu Lei;Shangce Gao
Pub Date: 2024-03-25 | DOI: 10.1109/TAI.2024.3379968
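Classic dendritic neuron models in this line of work replace linear summation with a layered nonlinearity: sigmoid synapses feed multiplicative dendritic branches, which sum at the membrane before a sigmoid soma. The sketch below shows that generic structure; the branch count, gain `k`, and threshold are illustrative assumptions, not the exact MDPN design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dendritic_neuron(x, w, q, k=5.0):
    """Generic dendritic neuron: sigmoid synapses, multiplicative dendrites,
    summed membrane, sigmoid soma. Shapes: x (n_in,); w, q (n_dend, n_in)."""
    syn = sigmoid(k * (w * x - q))        # synaptic layer, per branch and input
    dend = np.prod(syn, axis=1)           # nonlinear dendritic product per branch
    membrane = dend.sum()                 # membrane accumulates branch outputs
    return sigmoid(k * (membrane - 0.5))  # soma output in (0, 1)

rng = np.random.default_rng(1)
x = rng.uniform(size=4)
out = dendritic_neuron(x, rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
```

The multiplicative dendrite acts like a soft logical AND over its synapses, which is the nonlinearity the abstract contrasts with "premature linear summation."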
Pub Date: 2024-03-25 | DOI: 10.1109/TAI.2024.3379941
Ayush Roy;Sk Mohiuddin;Ram Sarkar
The process of modifying digital images has been made significantly easier by the availability of numerous image-editing software packages. However, in a variety of contexts, including journalism, judicial processes, and historical documentation, the authenticity of images is of utmost importance. In particular, copy–move forgery is a distinct type of image manipulation in which a portion of an image is copied and pasted into another area of the same image, creating a fictitious or altered version of the original. In this research, we present a lightweight MultiResUnet architecture with a similarity-based positional attention module (SPAM) for copy–move forgery detection (CMFD). By applying a similarity measure across feature patches, the attention module identifies the patches where a forged region is present. The lightweight network also enables resource-efficient training and real-time use. We employ four commonly used but extremely challenging CMFD datasets, namely CoMoFoD, COVERAGE, CASIA v2, and MICC-F600, to assess the effectiveness of our model. The proposed model significantly lowers false positives, thereby improving the pixel-level accuracy and dependability of CMFD tools.
Title: A Similarity-Based Positional Attention-Aided Deep Learning Model for Copy–Move Forgery Detection
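The core signal behind a similarity-based attention module of this kind is pairwise patch similarity: a copied-and-pasted region produces two patches with near-identical features. A minimal sketch with cosine similarity (the actual SPAM module's similarity measure and weighting are not specified in the abstract, so this is illustrative):

```python
import numpy as np

def patch_similarity(features):
    """Pairwise cosine similarity between flattened feature patches.

    features: (n_patches, d) array. A high off-diagonal similarity hints
    that one patch may be a copied version of another.
    """
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)  # ignore trivial self-similarity
    return sim

# A duplicated patch (row 2 equals row 0) stands out with similarity ~1.
patches = np.array([[1.0, 0.0, 2.0],
                    [0.0, 3.0, 1.0],
                    [1.0, 0.0, 2.0]])
sim = patch_similarity(patches)
```

In an attention module, such a similarity map would then reweight the feature map so the decoder focuses on the suspected source/target patch pair.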
Pub Date: 2024-03-25 | DOI: 10.1109/TAI.2024.3381102
Jiawei Yang;Sylwan Rahardja;Susanto Rahardja
Outlier ensembles are an important methodology for improving outlier detection but face severe challenges in unsupervised settings. Unlike traditional outlier ensembles, which revise scores by considering only the score values from multiple detectors, we present a novel regional ensemble (RE). RE combines the scores from multiple objects and multiple detectors, simultaneously taking into account both the values and the distribution of these scores. Specifically, RE enhances the score of a given object using the scores of its neighboring objects, under the assumption that the scores of the majority of neighbors are reliable. RE has many potential applications, particularly in data mining and machine learning. Across 30 real-world datasets, RE attained the best performance on 14 datasets, while the current standard was superior on only eight. On average, RE improves the best existing method from 0.83 to 0.86 AUC.
Title: Regional Ensemble for Improving Unsupervised Outlier Detectors
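The neighbor-based score enhancement can be sketched as blending each object's outlier score with the mean score of its k nearest neighbors. The blending weight `alpha`, the Euclidean neighborhood, and the simple mean are illustrative assumptions; the paper's exact combination rule is not given in the abstract.

```python
import numpy as np

def regional_ensemble(scores, coords, k=2, alpha=0.5):
    """Blend each object's outlier score with the mean score of its
    k nearest neighbors (assumed mostly reliable)."""
    n = len(scores)
    out = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the object itself
        out[i] = alpha * scores[i] + (1 - alpha) * scores[nn].mean()
    return out

scores = np.array([0.1, 0.9, 0.15, 0.2])     # one spuriously high score
coords = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]])
smoothed = regional_ensemble(scores, coords)
# The inlier with an inflated score is pulled back toward its neighbors.
```

This illustrates the stated assumption: if most neighbors are scored reliably, a single detector's spurious score for one object gets corrected by its region.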
Pub Date: 2024-03-25 | DOI: 10.1109/TAI.2024.3380590
Emrah Hancer;Bing Xue;Mengjie Zhang
Multilabel learning is an emergent topic that addresses the challenge of associating multiple labels with a single instance simultaneously. Multilabel datasets often exhibit high dimensionality with noisy, irrelevant, and redundant features. In recent years, multilabel feature selection (MLFS) has gained prominence as a crucial machine learning task due to its ability to handle such data effectively. However, existing MLFS approaches often prioritize top-ranked features based on intrinsic data criteria, disregarding relationships within the selected feature subset. Additionally, compared with conventional feature selection, multiobjective evolutionary algorithms (MOEAs) have not been widely explored in the context of MLFS. This study addresses these gaps by proposing a multimodal multiobjective evolutionary algorithm (MMOEA), called MMDE_SICD, which incorporates a preelimination scheme, an improved initialization scheme, an exploration scheme inspired by genetic operations, and a statistically inspired crowding distance scheme. The results show that MMDE_SICD outperforms a variety of MOEAs and MMOEAs as well as conventional MLFS algorithms. Notably, this study is the first to formulate MLFS as a multimodal multiobjective problem.
Title: A Multimodal Multiobjective Evolutionary Algorithm for Filter Feature Selection in Multilabel Classification
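For context, the classic NSGA-II crowding distance, which the paper's "statistically inspired" variant presumably builds on, measures how isolated a solution is in objective space so that selection favors diverse solutions. The sketch below is the classic version only, not the paper's modification.

```python
import numpy as np

def crowding_distance(objs):
    """Classic NSGA-II crowding distance for a set of objective vectors.

    objs: (n, m) array. Boundary solutions per objective get infinite
    distance so the extremes of the front are always preserved.
    """
    n, m = objs.shape
    dist = np.zeros(n)
    for j in range(m):
        order = np.argsort(objs[:, j])
        dist[order[0]] = dist[order[-1]] = np.inf
        span = objs[order[-1], j] - objs[order[0], j]
        if span == 0:
            continue  # degenerate objective: all values equal
        for k in range(1, n - 1):
            dist[order[k]] += (objs[order[k + 1], j]
                               - objs[order[k - 1], j]) / span
    return dist

objs = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
d = crowding_distance(objs)
```

In a multimodal setting, a statistically inspired variant would additionally account for crowding in decision space, so that distinct feature subsets with similar objective values can coexist.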
Pub Date: 2024-03-25 | DOI: 10.1109/TAI.2023.3342563
Francesco Piccialli;Maizar Raissi;Felipe A. C. Viana;Giancarlo Fortino;Huimin Lu;Amir Hussain
This special issue delves into the tantalizing prospects of machine learning for multiscale modeling, a domain where traditional methodologies often encounter scalability issues. Here, physics-informed machine learning (PIML) promises to bridge scales from the microscopic to the macroscopic, creating models that are not only scalable but also more accurate and less resource-intensive. Furthermore, the contributors have taken on the challenge of machine learning model interpretability, exploring how these models can provide insight into physical systems and thus serve the dual purpose of solving complex problems while contributing to the body of knowledge in physics. The integration of physical laws with machine learning is not just an innovation; it is a renaissance of understanding. The papers in this issue showcase pioneering works that merge the robustness of physics with the flexibility of machine learning. Here, we provide an overview of the significant contributions made by our authors in advancing the field of PIML.
Title: Guest Editorial: Special Issue on Physics-Informed Machine Learning
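The defining move of PIML is to add a physics residual to the ordinary data-fitting loss, so a model is penalized for violating a known governing equation. A minimal illustrative sketch for the ODE du/dt = -u, with the derivative approximated by finite differences (real PINNs use automatic differentiation; the loss weighting `lam` is an assumption):

```python
import numpy as np

def piml_loss(u_pred, u_data, t, lam=1.0):
    """Data-fitting MSE plus a physics residual for the ODE du/dt = -u,
    with the derivative approximated by finite differences."""
    data_loss = np.mean((u_pred - u_data) ** 2)
    du_dt = np.gradient(u_pred, t)               # finite-difference derivative
    physics_residual = np.mean((du_dt + u_pred) ** 2)
    return data_loss + lam * physics_residual

t = np.linspace(0.0, 1.0, 50)
u_exact = np.exp(-t)   # satisfies the ODE exactly, so its loss is near zero
```

A prediction that fits sparse data but violates the ODE between data points is penalized by the residual term, which is what lets PIML models train with far less data.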
Pub Date: 2024-03-24 | DOI: 10.1109/TAI.2024.3404914
Ping Wang;Chengpu Yu;Maolong Lv;Zilong Zhao
Adaptive practical prescribed-time (PPT) neural control is studied for multiinput multioutput (MIMO) nonlinear systems with unknown nonlinear functions and unknown input gain matrices. Unlike existing PPT design schemes based on backstepping, this study proposes a novel PPT control framework using the dynamic surface control (DSC) approach. First, a novel nonlinear filter (NLF) with an adaptive parameter estimator and a piecewise function is constructed to effectively compensate for filter errors and facilitate prescribed-time convergence. Building on this, a unified DSC-based adaptive PPT control algorithm, augmented with a neural network (NN) approximator, is developed, where NNs approximate the unknown nonlinear system functions. This algorithm not only avoids the explosion of computational complexity inherent in traditional backstepping but also relaxes the constraints on filter design parameters relative to DSC algorithms that rely on linear filters. Simulations on a two-degree-of-freedom robot manipulator showcase the effectiveness and superiority of the devised scheme.
Title: Adaptive Prescribed-Time Neural Control of Nonlinear Systems via Dynamic Surface Technique
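For context, the classic DSC design that the abstract contrasts with passes each virtual control $\alpha$ through a first-order linear filter to avoid the repeated analytic differentiation of backstepping (this is the standard textbook form, not the paper's NLF):

```latex
\tau \dot{z}(t) + z(t) = \alpha(t), \qquad z(0) = \alpha(0),
```

with boundary-layer (filter) error $y(t) = z(t) - \alpha(t)$. The filter constant $\tau$ trades tracking accuracy against noise sensitivity, which is the design-parameter constraint the proposed nonlinear filter relaxes while compensating $y$ within the prescribed time.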
Pub Date: 2024-03-24 | DOI: 10.1109/TAI.2024.3404910
Shiv Ram Dubey;Satish Kumar Singh
Generative adversarial networks (GANs) have been very successful at synthesizing images that match a given dataset, and the generated images are highly realistic. GANs have shown potential in several computer vision applications, including image generation, image-to-image translation, and video synthesis. Conventionally, the generator network is the backbone of a GAN and produces the samples, while the discriminator network facilitates the training of the generator; both are usually convolutional neural networks (CNNs). Convolution-based networks exploit local relationships within a layer, requiring deep stacks of layers to extract abstract features. Recently developed transformer networks, by contrast, exploit global relationships and have brought tremendous performance improvements to several computer vision problems. Motivated by the success of both transformer networks and GANs, recent works have exploited transformers within the GAN framework for image and video synthesis. This article presents a comprehensive survey of the developments and advancements in GANs utilizing transformer networks for computer vision applications. Performance comparisons for several applications on benchmark datasets are also provided and analyzed. This survey will be very useful for understanding research trends and gaps related to transformer-based GANs and for developing advanced GAN architectures that exploit global and local relationships for different applications.
Title: Transformer-Based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey
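The local-versus-global contrast the abstract draws can be made concrete with scaled dot-product attention, the transformer primitive: every output position mixes information from every input position, unlike a convolution's fixed local receptive field. A minimal numpy sketch (token count and dimensions are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Global token mixing: each output row attends to all input positions,
    in contrast to the local receptive field of a convolution."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over attention scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, 8-dim queries
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
```

Transformer-based generators and discriminators stack this operation over image patches, which is what gives them the global relationships the survey highlights.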