Drones (unmanned aerial vehicles) have been widely used in modern artificial intelligence systems over the past decade. In this work, we propose a novel deep hashing framework that detects objects in drone-captured pictures extremely fast. Our method intrinsically and flexibly encodes the various topological structures of each target object, based on which multiscale objects can be discovered in a view- and altitude-invariant way. Moreover, by leveraging the $l_{F}$ and $l_{1}$ norms collaboratively, the calculated hash codes are robust to low-quality drone pictures and possibly contaminated semantic labels. More specifically, for each drone picture, we extract the visually/semantically salient object parts inside it. To characterize their topological structure, we construct a graphlet by linking spatially adjacent object patches into a small graph. Subsequently, a binary matrix factorization (MF) is designed to hierarchically exploit the semantics of these graphlets, in which three components are seamlessly incorporated: 1) deep binary hash code learning; 2) denoising of contaminated pictures/labels; and 3) adaptive data graph updating. A manifold-regularized feature selector is then adopted to obtain more discriminative deep hash codes. Finally, the selected hash codes corresponding to the graphlets within each drone photograph are used for ranking-based object discovery. Comprehensive experiments on DAC-SDC, MOHR, and our self-compiled dataset demonstrate the competitive speed and accuracy of our method.
{"title":"Topology-Preserving Deep Hashing for Ultrafast Drone-Dominated Object Detection.","authors":"Luming Zhang, Guifeng Wang, Zhiming Wang, Ling Shao","doi":"10.1109/TNNLS.2026.3686846","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3686846","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147837362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
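The ranking-based discovery step in the abstract above depends on comparing binary hash codes, which is cheap because the comparison reduces to bit operations. The following is a minimal sketch of Hamming-distance ranking, not the authors' implementation. The function names and the toy 8-bit codes are assumptions made for illustration.

```python
def hamming_distance(a, b):
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query, candidates):
    """Return (name, distance) pairs sorted from nearest to farthest.

    `candidates` maps a hypothetical graphlet name to its integer hash code.
    Python's sort is stable, so ties keep their insertion order.
    """
    scored = [(name, hamming_distance(query, code)) for name, code in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy 8-bit codes (hypothetical values, not taken from the paper).
query_code = 0b10110010
graphlet_codes = {
    "graphlet_a": 0b10110011,  # differs from the query in 1 bit
    "graphlet_b": 0b01001101,  # differs from the query in all 8 bits
    "graphlet_c": 0b10100010,  # differs from the query in 1 bit
}
ranking = rank_by_hamming(query_code, graphlet_codes)
```

In this sketch the nearest codes come first, so a detector would only need to inspect the head of `ranking`.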
Text-driven diffusion models have achieved remarkable performance in human motion generation. However, these generative methods still struggle to produce high-quality motion that is consistent with textual descriptions. The primary reasons are: 1) insufficient fine-grained motion modeling, because motion representations are difficult to distinguish in latent diffusion; and 2) inconsistencies between motions and textual descriptions, caused by misalignment in the multimodal space. To overcome these limitations, this work proposes Motion generation with Frequency and Text State Space models (MoFTSS), which comprises two main modules: a frequency state space model (FreqSSM) and a text state space model (TextSSM). Specifically, FreqSSM derives fine-grained representations by decomposing sequences into low-frequency and high-frequency components. This allows it to guide the generation of static poses (e.g., sitting, lying) and fine-grained motions (e.g., transitions, stumbling). For consistency between text and motion, TextSSM treats text features as a semantic modulation term within the SSM, enabling dynamic filtering of motion features consistent with textual semantics. Extensive experiments suggest that MoFTSS achieves superior performance on the text-to-motion generation task. Notably, it attains the lowest FID of 0.181 on the HumanML3D dataset, significantly lower than the 0.421 achieved by MLD.
{"title":"MoFTSS: Motion Generation With Frequency and Text State Space Models.","authors":"Chengjian Li, Xiangbo Shu, Qiongjie Cui, Haifeng Xia, Yazhou Yao, Jinhui Tang","doi":"10.1109/TNNLS.2026.3683909","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3683909","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147814021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
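The abstract above describes splitting a motion sequence into low-frequency and high-frequency components but does not say how. As an illustrative sketch only, the code below uses a centered moving average as the low-pass filter and takes the residual as the high-frequency part. The moving-average choice, the function name, and the toy joint-angle values are all assumptions, and this is not the FreqSSM module itself.

```python
def split_frequencies(sequence, window=3):
    """Split a 1-D sequence into a smooth part and a residual part.

    low  : centered moving average (slowly varying content, e.g. a held pose)
    high : sequence minus low (rapid changes, e.g. a sudden transition)
    The window shrinks at the borders so every sample gets an average.
    """
    half = window // 2
    low = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        neighborhood = sequence[start:end]
        low.append(sum(neighborhood) / len(neighborhood))
    high = [x - smooth for x, smooth in zip(sequence, low)]
    return low, high


# Hypothetical joint-angle track: mostly static with one abrupt spike.
joint_angle = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
low, high = split_frequencies(joint_angle)
```

By construction the two parts sum back to the original sequence, so no information is lost by the split.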
Pub Date: 2026-05-01 DOI: 10.1109/TNNLS.2026.3684954
Ehsan Adibnia
Designing identical dual-band optical filters remains a complex optimization challenge in photonics and optical communication systems. Conventional methods, which rely on iterative electromagnetic simulations or analytical approximations, often suffer from limited generalizability and high computational costs. In this work, we propose a deep reinforcement learning (RL) framework for the autonomous optimization of identical dual-band fiber Bragg grating (FBG) filters. A policy network based on a three-layer fully connected neural architecture is trained using a proximal policy optimization algorithm to minimize the full width at half maximum (FWHM) of both transmission bands while maintaining spectral symmetry and identical channel characteristics. The deep RL-based design achieves a 43% reduction in FWHM and a 49% reduction in grating length compared to baseline designs, without sacrificing reflectivity or channel uniformity. This study demonstrates the feasibility and effectiveness of deep RL as a powerful optimization tool for complex photonic systems, providing a scalable and data-efficient pathway toward next-generation optical device design.
{"title":"Deep Reinforcement Learning-Based Optimization of Identical-Dual-Band Filters.","authors":"Ehsan Adibnia","doi":"10.1109/TNNLS.2026.3684954","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3684954","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147814588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
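The optimization target named in the abstract above is the full width at half maximum (FWHM) of each band. As a minimal sketch of how such a quantity could be measured on a sampled spectrum, the function below locates the two half-maximum crossings by linear interpolation. The function name, the wavelength grid, and the spectrum values are hypothetical and are not taken from the paper.

```python
def fwhm(x, y):
    """Full width at half maximum of a single-peaked sampled curve.

    Crossing points are located by linear interpolation between samples.
    """
    peak = max(y)
    half = peak / 2.0
    left = right = y.index(peak)

    # Walk outward from the peak while samples stay at or above half maximum.
    while left > 0 and y[left - 1] >= half:
        left -= 1
    while right < len(y) - 1 and y[right + 1] >= half:
        right += 1
    if left == 0 or right == len(y) - 1:
        raise ValueError("curve does not fall below half maximum on both sides")

    # Interpolate between the last sample at/above half and the first below it.
    x_left = x[left - 1] + (half - y[left - 1]) * (x[left] - x[left - 1]) / (y[left] - y[left - 1])
    x_right = x[right] + (y[right] - half) * (x[right + 1] - x[right]) / (y[right] - y[right + 1])
    return x_right - x_left


# Hypothetical wavelength grid (nm) and two toy band shapes.
wavelength = [1549.0, 1549.5, 1550.0, 1550.5, 1551.0]
baseline_band = [0.0, 0.5, 1.0, 0.5, 0.0]
narrower_band = [0.0, 0.25, 1.0, 0.25, 0.0]

baseline_width = fwhm(wavelength, baseline_band)
narrower_width = fwhm(wavelength, narrower_band)
reduction_percent = 100.0 * (1.0 - narrower_width / baseline_width)
```

A reinforcement learning reward could be built from a quantity like `reduction_percent`, although the reward actually used in the paper is not described in the abstract.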
Deep neural networks (DNNs) excel across domains but face challenges in resource-constrained and critical settings due to high computational cost and limited transparency. Early exit DNNs reduce overhead via intermediate predictions, yet most approaches neglect interpretability, which is vital for trust in AI systems. This article presents XAI-Exit, an early exit framework that jointly optimizes efficiency and transparency. At its core, ExitDecisionNet (EDN), a lightweight RNN trained with a curriculum strategy on confidence, interpretability, and stability metrics, dynamically predicts the optimal exit, while a skip mechanism minimizes redundant computation. To ensure transparency, exit attribution maps (EAMs) aggregate feature attributions across exits, revealing the decision trajectory, and are complemented by standard XAI methods (integrated gradients (IGs), SmoothGrad, Grad-CAM++, and LRP). Experiments on MobileNetV3, ResNet18, and MSDNet with CIFAR-10, CIFAR-100, and ImageNet show that XAI-Exit improves efficiency without sacrificing accuracy, while uniquely ensuring interpretable exit decisions suitable for real-world deployment.
{"title":"XAI-Exit: Interpretability-Driven Dynamic Early Exits for Efficient and Transparent DNN Inference.","authors":"Haseena Rahmath P, Ajith Abraham, Kuldeep Chaurasia","doi":"10.1109/TNNLS.2026.3685408","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3685408","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147814072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
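For readers unfamiliar with early exits, the sketch below shows the simple confidence-threshold rule that such networks commonly use as a baseline: stop at the first exit whose top softmax probability is high enough. This is not the learned ExitDecisionNet policy from the abstract above, which also weighs interpretability and stability. The function names, the threshold, and the toy logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold=0.9):
    """Return (exit_index, predicted_class) for the first confident exit.

    `exit_logits` holds one logit vector per exit, ordered shallow to deep.
    If no exit reaches `threshold`, the final exit's prediction is used.
    """
    if not exit_logits:
        raise ValueError("at least one exit is required")
    last = len(exit_logits) - 1
    for index, logits in enumerate(exit_logits):
        probabilities = softmax(logits)
        confidence = max(probabilities)
        if confidence >= threshold or index == last:
            return index, probabilities.index(confidence)


# Toy logits for three exits of a hypothetical 3-class model.
toy_exits = [
    [1.0, 1.2, 0.9],  # shallow exit: nearly uniform, not confident
    [0.5, 4.0, 0.2],  # middle exit: clearly favors class 1
    [0.1, 6.0, 0.0],  # final exit
]
chosen_exit, predicted_class = early_exit_predict(toy_exits, threshold=0.9)
```

Raising the threshold trades computation for caution: with a stricter value the same input would travel deeper before a prediction is returned.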
Pub Date: 2026-04-30 DOI: 10.1109/TNNLS.2026.3683544
Chang Nie, Tianchen Deng, Zhe Liu, Hesheng Wang
Denoising is important in many vision, medical, and biological applications, yet real observations are often corrupted by complex nonlinear noise and clean targets are often unavailable. We present MID, a self-supervised iterative denoising framework across data modalities. MID treats an observation as an intermediate state along a controllable corruption process and learns from noisy data only through two networks: a step predictor that estimates the current corruption stage and a residual predictor that estimates the effective residual increment to be removed at that stage. For nonlinear corruption, MID uses a first-order local approximation to enable iterative restoration in a locally linear regime. The same formulation can be instantiated with modality-specific backbones for images, signals, point sets, and sequences. Experiments on diverse tasks in computer vision, biomedicine, and bioinformatics show that MID is robust, broadly applicable, and competitive with recent baselines.
{"title":"MID: A Self-Supervised Multimodal Iterative Denoising Framework.","authors":"Chang Nie, Tianchen Deng, Zhe Liu, Hesheng Wang","doi":"10.1109/TNNLS.2026.3683544","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3683544","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147813843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
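The abstract above describes two cooperating predictors: one estimates how far along the corruption process an observation is, and the other estimates the residual to remove at that stage. The control flow this implies can be sketched as a simple loop. The code below is only a toy under stated assumptions: the two predictors are hand-written stand-ins for trained networks, and the additive 0.5-per-step corruption is invented for the example.

```python
def iterative_denoise(observation, predict_step, predict_residual, max_iters=100):
    """Repeatedly remove the predicted residual until no corruption steps remain."""
    state = list(observation)
    for _ in range(max_iters):
        steps_left = predict_step(state)
        if steps_left <= 0:
            break
        residual = predict_residual(state, steps_left)
        state = [value - r for value, r in zip(state, residual)]
    return state


# Toy setting (assumption): each corruption step adds 0.5 to every sample,
# and clean signals are known to have mean 2.0.
STEP_SIZE = 0.5
CLEAN_MEAN = 2.0


def toy_step_predictor(state):
    """Stand-in for a learned step predictor: infer the stage from the mean."""
    mean = sum(state) / len(state)
    return round((mean - CLEAN_MEAN) / STEP_SIZE)


def toy_residual_predictor(state, steps_left):
    """Stand-in for a learned residual predictor: one step's worth of offset."""
    return [STEP_SIZE] * len(state)


clean = [1.0, 2.0, 3.0]
noisy = [value + 3 * STEP_SIZE for value in clean]  # three corruption steps
restored = iterative_denoise(noisy, toy_step_predictor, toy_residual_predictor)
```

The loop itself is modality-agnostic; in this sketch only the two predictor functions would change between images, signals, or sequences.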
Pub Date: 2026-04-29 DOI: 10.1109/TNNLS.2026.3683436
Aijun Yan, Chunpeng Yang
To address the challenges of industrial process modeling caused by multiscale spatiotemporal coupling, a soft sensor method based on the multiscale convolutional stochastic configuration network (MSC-SCN) is proposed. This method introduces a multiscale convolutional strategy into the SCN framework and designs parallel multiscale feature extractors with incremental learning capability under a supervised learning mechanism. Subsequently, cross-scale feature fusion is employed to integrate the multiscale feature maps generated by the previously constructed feature extractors. The output weights of the SCN are then optimized by combining low-rank matrix approximation and regularization methods to improve the efficiency and stability of the inverse of the hidden layer matrix. Experimental comparisons with state-of-the-art methods on three industrial soft sensor tasks demonstrate that the proposed approach yields the best performance and demonstrates high adaptability to multiscale spatiotemporal coupling.
{"title":"Multiscale Convolutional Stochastic Configuration Network Soft Sensor Modeling Method.","authors":"Aijun Yan, Chunpeng Yang","doi":"10.1109/TNNLS.2026.3683436","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3683436","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147770066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
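The abstract above does not give the exact formulation for the output weights. As a sketch under assumptions, one common way to combine regularization with a low-rank approximation is ridge regression applied to a truncated singular value decomposition of the hidden-layer output matrix. The symbols $H$ (hidden-layer outputs), $Y$ (targets), $\lambda$ (regularization strength), and $r$ (retained rank) are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\beta_{\text{ridge}} &= \left(H^{\top} H + \lambda I\right)^{-1} H^{\top} Y, \qquad \lambda > 0,\\
H &\approx U_r \Sigma_r V_r^{\top} \quad (\text{top } r \text{ singular triplets}),\\
\beta_{r} &= V_r \left(\Sigma_r^{2} + \lambda I_r\right)^{-1} \Sigma_r \, U_r^{\top} Y .
\end{aligned}
```

In this form only an $r \times r$ diagonal matrix is inverted, with entries $\sigma_i / (\sigma_i^{2} + \lambda)$, so small singular values are damped rather than amplified. That is one plausible reading of the efficiency and stability benefits the abstract mentions.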
Pub Date: 2026-04-29 DOI: 10.1109/TNNLS.2026.3685207
Raul Perez-Gonzalo, Andreas Espersen, Soren Forchhammer, Antonio Agudo
Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: 1) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization; 2) a hyperprior-based autoencoder optimized for lossy compression; and 3) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.
{"title":"End-to-End Image Compression With Segmentation Guided Dual Coding for Wind Turbines.","authors":"Raul Perez-Gonzalo, Andreas Espersen, Soren Forchhammer, Antonio Agudo","doi":"10.1109/TNNLS.2026.3685207","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3685207","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147770106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
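The dual-mode idea in the abstract above, exact reconstruction inside the blade region and aggressive compression elsewhere, can be illustrated with a deliberately simple toy. The sketch below keeps masked pixels unchanged and coarsely quantizes the rest. It uses plain uniform quantization rather than the learned hyperprior autoencoder or bits-back coder from the paper, and the function name, quantization step, and pixel values are assumptions.

```python
def roi_dual_mode(image, mask, step=32):
    """Keep region-of-interest pixels exactly and quantize the background.

    image : rows of integer pixel intensities in [0, 255]
    mask  : rows of 0/1 flags, 1 marking the region of interest
    step  : background quantization bin width (larger means coarser)
    Background pixels are replaced by the center of their quantization bin.
    """
    output = []
    for pixel_row, mask_row in zip(image, mask):
        output.append([
            pixel if in_roi else (pixel // step) * step + step // 2
            for pixel, in_roi in zip(pixel_row, mask_row)
        ])
    return output


# Hypothetical 2x3 grayscale patch and blade mask.
patch = [[10, 200, 37],
         [90, 91, 255]]
blade_mask = [[1, 0, 0],
              [0, 1, 0]]
coded = roi_dual_mode(patch, blade_mask)
```

With a bin width of 32 the background error in this sketch is bounded by 16 intensity levels, while masked pixels are reproduced exactly.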
Pub Date: 2026-04-28 DOI: 10.1109/TNNLS.2026.3685832
Yang An, Yaqi Li, Hongwei Wang, Rob Duffield, Steven W Su
This study introduces a novel approach to robot-assisted ankle rehabilitation by proposing a dual-agent multiple model reinforcement learning (DAMMRL) framework, leveraging multiple model adaptive control (MMAC) and co-adaptive control strategies. In robot-assisted rehabilitation, one of the key challenges is modeling human behavior due to the complexity of human cognition and physiological systems. Traditional single-model approaches often fail to capture the dynamics of human-machine interactions. Our research employs a multiple model strategy, using simple submodels to approximate complex human responses during rehabilitation tasks, tailored to varying levels of patient incapacity. The proposed system's versatility is demonstrated in real experiments and simulated environments. Feasibility and potential were evaluated with 13 healthy subjects and nine patients with lower-limb motor disorders, yielding promising results that affirm the anticipated benefits of the approach. This study not only introduces a new paradigm for robot-assisted ankle rehabilitation but also opens the way for future research in adaptive, patient-centered therapeutic interventions.
{"title":"Human-Machine Co-Adaptation for Robot-Assisted Rehabilitation via Dual-Agent Multiple Model Reinforcement Learning (DAMMRL).","authors":"Yang An, Yaqi Li, Hongwei Wang, Rob Duffield, Steven W Su","doi":"10.1109/TNNLS.2026.3685832","DOIUrl":"https://doi.org/10.1109/TNNLS.2026.3685832","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147770054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
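The multiple model strategy in the abstract above approximates a complex human response with several simple submodels. A common ingredient of multiple model adaptive control is a supervisor that picks the submodel whose predictions best match recent observations. The sketch below illustrates that selection step only. The linear-gain submodels, their names, and the data are hypothetical, and the dual-agent reinforcement learning part of the paper is not shown.

```python
def select_submodel(submodels, inputs, observed):
    """Return (name, mean_squared_error) of the best-fitting submodel.

    submodels : dict mapping a name to a function input -> predicted response
    inputs    : recent inputs applied to the human-robot system
    observed  : the responses actually measured for those inputs
    """
    best_name, best_error = None, float("inf")
    for name, model in submodels.items():
        squared_errors = [(model(u) - y) ** 2 for u, y in zip(inputs, observed)]
        error = sum(squared_errors) / len(squared_errors)
        if error < best_error:
            best_name, best_error = name, error
    return best_name, best_error


# Hypothetical submodels: the patient's response as a simple gain on robot input.
patient_models = {
    "low_capability": lambda u: 0.2 * u,
    "medium_capability": lambda u: 0.5 * u,
    "high_capability": lambda u: 0.9 * u,
}
robot_inputs = [1.0, 2.0, 3.0]
measured_response = [0.5, 1.1, 1.4]
active_model, fit_error = select_submodel(patient_models, robot_inputs, measured_response)
```

Re-running the selection on a sliding window of recent data would let the active submodel change as the patient's capability changes during a session.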
Pub Date: 2026-04-28 DOI: 10.1109/TNNLS.2025.3650584
Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King
Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models, referred to as vision-language-action (VLA) models, has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing VLA-based control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges facing VLAs and outline promising future directions in embodied AI. A curated repository associated with this survey is available at: https://github.com/yueen-ma/Awesome-VLA.
{"title":"A Survey on Vision-Language-Action Models for Embodied AI.","authors":"Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King","doi":"10.1109/TNNLS.2025.3650584","DOIUrl":"https://doi.org/10.1109/TNNLS.2025.3650584","url":null,"PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"PP ","pages":""},"PeriodicalIF":8.9,"publicationDate":"2026-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147770125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-04-27 DOI: 10.1109/tnnls.2026.3683414
Wenjie Liao, Like Wu, Shihui Xu, Shigeru Fujimura
{"title":"Dual-Path Conditional Diffusion Model With Attribute Consistency for Zero-Shot Fault Diagnosis","authors":"Wenjie Liao, Like Wu, Shihui Xu, Shigeru Fujimura","doi":"10.1109/tnnls.2026.3683414","DOIUrl":"https://doi.org/10.1109/tnnls.2026.3683414","url":null,"abstract":"","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"152 1","pages":""},"PeriodicalIF":10.4,"publicationDate":"2026-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147753197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}