Early age identification and prompt intervention play a crucial role in mitigating the severity of neurodevelopmental disorders in children. Traditional diagnostic approaches can be lengthy, but there is growing research potential in using electroencephalogram (EEG) signals to detect attention deficit hyperactivity disorder (ADHD) and intellectual developmental disorder (IDD). By recording the electrical activity of the brain, EEG has emerged as a promising technique for the early identification of these disorders. This research proposes a novel integrated method for identifying multiple neurodevelopmental disorders from the EEG signals of children. The approach combines successive multivariate variational mode decomposition (SMVMD) for analyzing multicomponent nonstationary signals with a machine learning (ML)-based classifier, addressing the issue of an inconsistent number of extracted features by introducing an energy-based feature integration approach. By integrating enhanced features from SMVMD with a K-nearest neighbor (KNN) classifier, the unified approach successfully distinguishes subjects with either of two separate neurodevelopmental disorders from normal subjects. The proposed method demonstrates perfect classification scores in detecting IDD under three different scenarios and achieves 99.17% accuracy in classifying ADHD subjects from normal subjects. Evaluation against different ML-based classifiers confirms the effectiveness of the proposed feature extraction algorithm and highlights its superior performance compared to recent methods published on similar datasets.
{"title":"Electroencephalogram-Based Unified Approach for Multiple Neurodevelopmental Disorders Detection in Children Using Successive Multivariate Variational Mode Decomposition","authors":"Ujjawal Chandela;Kazi Newaj Faisal;Rishi Raj Sharma","doi":"10.1109/TCDS.2025.3556888","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3556888","url":null,"abstract":"Early age identification and prompt intervention play a crucial role in mitigating the severity of neurodevelopmental disorders in children. Traditional diagnostic approaches can be lengthy, but there is growing research potential in using electroencephalogram (EEG) signals to detect attention deficit hyperactivity disorder (ADHD) and intellectual developmental disorder (IDD). By recording the electrical activity of the brain, EEG has emerged as a promising technique for the early identification of these disorders. This research proposes a novel integrated method for identifying multiple neurodevelopmental disorders from the EEG signals of children. The approach combines successive multivariate variational mode decomposition (SMVMD) for analyzing multicomponent nonstationary signals and a machine learning (ML)-based classifier, addressing the issue of inconsistent numbers of extracted features by introducing an energy-based feature integration approach. By integrating enhanced features from SMVMD with a K-nearest neighbor (KNN) classifier, the unified approach successfully detects two separate neurodevelopmental disorders from normal subjects. The proposed method demonstrates perfect classification scores in detecting IDD under three different scenarios and achieves 99.17% accuracy in classifying ADHD subjects from normal subjects. Evaluation against different ML-based classifiers confirms the effectiveness of the proposed feature extraction algorithm and highlights its superior performance compared to recent methods published on similar datasets.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1350-1359"},"PeriodicalIF":4.9,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel multimodal dataset for cognitive load assessment in real-time (CLARE). The dataset contains physiological and gaze data from 24 participants with self-reported cognitive load scores as ground-truth labels. The dataset consists of four modalities, namely, electrocardiography (ECG), electrodermal activity (EDA), electroencephalogram (EEG), and gaze tracking. To impose diverse levels of mental load on participants during the experiments, each participant completed four 9-min sessions of a computer-based operator performance and mental workload task (the MATB-II software), with the level of complexity varying in 1-min segments. During the experiment, participants reported their cognitive load every 10 s. For the dataset, we also provide benchmark binary classification results with machine learning and deep learning models on two different evaluation schemes, namely, 10-fold and leave-one-subject-out (LOSO) cross-validation. Benchmark results show that for 10-fold evaluation, the convolutional neural network (CNN)-based deep learning model achieves the best classification performance with ECG, EDA, and gaze. In contrast, for LOSO, the best performance is achieved by the deep learning model with ECG, EDA, and EEG.
{"title":"CLARE: Cognitive Load Assessment in Real-Time With Multimodal Data","authors":"Anubhav Bhatti;Prithila Angkan;Behnam Behinaein;Zunayed Mahmud;Dirk Rodenburg;Heather Braund;P. James Mclellan;Aaron Ruberto;Geoffery Harrison;Daryl Wilson;Adam Szulewski;Dan Howes;Ali Etemad;Paul Hungler","doi":"10.1109/TCDS.2025.3555517","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3555517","url":null,"abstract":"We present a novel multimodal dataset for cognitive load assessment in real-time (CLARE). The dataset contains physiological and gaze data from 24 participants with self-reported cognitive load scores as ground-truth labels. The dataset consists of four modalities, namely, electrocardiography (ECG), electrodermal activity (EDA), electroencephalogram (EEG), and gaze tracking. To map diverse levels of mental load on participants during experiments, each participant completed four 9-min sessions on a computer-based operator performance and mental workload task (the MATB-II software) with varying levels of complexity in 1 min segments. During the experiment, participants reported their cognitive load every 10 s. For the dataset, we also provide benchmark binary classification results with machine learning and deep learning models on two different evaluation schemes, namely, 10-fold and leave-one-subject-out (LOSO) cross-validation. Benchmark results show that for 10-fold evaluation, the convolutional neural network (CNN) based deep learning model achieves the best classification performance with ECG, EDA, and gaze. In contrast, for LOSO, the best performance is achieved by the deep learning model with ECG, EDA, and EEG.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1337-1349"},"PeriodicalIF":4.9,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In cooperative multiagent reinforcement learning (MARL), previous research has predominantly concentrated on augmenting cooperation through the optimization of global behavioral correlations between agents, with mutual information (MI) typically serving as a crucial metric for correlation quantification. Existing approaches leverage MI to enhance the behavioral correlation among agents in order to foster better cooperation and goal alignment. However, it has been demonstrated that the cooperative capabilities among agents cannot be enhanced merely by directly increasing their overall behavioral correlations, particularly in environments with multiple subtasks or scenarios requiring dynamic team structures. To tackle this challenge, a MARL algorithm named group-oriented MI collaboration (GoMIC) is designed, which dynamically partitions agents and employs MI within each partition as an enhanced reward. GoMIC mitigates the excessive reliance of individual policies on team-related information and encourages agents to acquire policies across varying team compositions. Experimental evaluations across various tasks in the multiagent particle environment (MPE), level-based foraging (LBF), and StarCraft II (SC2) demonstrate the superior performance of GoMIC over some existing approaches, indicating its potential to improve collaboration in multiagent systems.
{"title":"GoMIC: Enhancing Efficient Collaboration in Multiagent Reinforcement Learning Through Group-Specific Mutual Information","authors":"Jichao Wang;Yi Li;Yichun Li;Shuai Mao;Zhaoyang Dong;Yang Tang","doi":"10.1109/TCDS.2025.3574031","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3574031","url":null,"abstract":"In cooperative multiagent reinforcement learning (MARL), previous research has predominantly concentrated on augmenting cooperation through the optimization of global behavioral correlations between agents, with mutual information (MI) typically serving as a crucial metric for correlation quantification. The existing approaches aim to enhance the behavioral correlation among agents to foster better cooperation and goal alignment by leveraging MI. However, it has been demonstrated that the cooperative capabilities among agents cannot be enhanced merely by directly increasing their overall behavioral correlations, particularly in environments with multiple subtasks or scenarios requiring dynamic team structures. To tackle this challenge, a MARL algorithm named group-oriented MI collaboration (GoMIC) is designed, which dynamically partitions agents and employs MI within each partition as an enhanced reward. GoMIC mitigates excessive reliance of individual policies on team-related information and fosters agents to acquire policies across varying team compositions. Experimental evaluations across various tasks in multiagent particle environment (MPE), level-based foraging (LBF), and StarCraft II (SC2) demonstrate the superior performance of GoMIC over some existing approaches, indicating its potential to improve collaboration in multiagent systems.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1536-1547"},"PeriodicalIF":4.9,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-24 | DOI: 10.1109/TCDS.2025.3554477
Ammarah Hashmi;Sahibzada Adil Shahzad;Chia Wen Lin;Yu Tsao;Hsin-Min Wang
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos utilize only the visual modality or the audio modality. While some methods exploit both audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal datasets of deepfake videos involving acoustic and visual manipulations, and they are mostly based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information, including audio and visual cues, to perceive and interpret content, and motivated by the success of transformers in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by addressing both acoustic and visual manipulations to enhance the accuracy of video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that the proposed model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset. We also compare AVTENet against humans in detecting video forgery. The results show that AVTENet significantly outperforms humans.
{"title":"AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection","authors":"Ammarah Hashmi;Sahibzada Adil Shahzad;Chia Wen Lin;Yu Tsao;Hsin-Min Wang","doi":"10.1109/TCDS.2025.3554477","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3554477","url":null,"abstract":"The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos only utilize visual modality or audio modality. While some methods exploit audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal datasets of deepfake videos involving acoustic and visual manipulations, and are mostly based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information including audio and visual cues to perceive and interpret content and the success of transformer in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by integrating both acoustic and visual manipulations to enhance the accuracy of video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that the proposed model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset. We also compare AVTENet against humans in detecting video forgery. The results show that AVTENet significantly outperforms humans.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1360-1376"},"PeriodicalIF":4.9,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-21 | DOI: 10.1109/TCDS.2025.3571813
Dinh-Cuong Hoang;Phan Xuan Tan;Ta Huu Anh Duong;Tuan-Minh Huynh;Duc-Manh Nguyen;Anh-Nhat Nguyen;Duc-Long Pham;Van-Duc Vu;Thu-Uyen Nguyen;Ngoc-Anh Hoang;Khanh-Toan Phan;Duc-Thanh Tran;Van-Thiep Nguyen;Ngoc-Trung Ho;Cong-Trinh Tran;Van-Hiep Duong
Object pose estimation using learning-based methods often necessitates vast amounts of meticulously labeled training data. The process of capturing real-world object images under diverse conditions and annotating these images with 6 degrees of freedom (6DOF) object poses is both time-consuming and resource-intensive. In this study, we propose an innovative approach to monocular 6-D pose estimation through self-supervised learning, eliminating the need for labor-intensive manual annotations. Our method starts by training a multitask neural network in a fully supervised manner, leveraging synthetic RGBD data. We leverage semantic segmentation, instance-level depth estimation, and vector-field prediction as auxiliary tasks to enhance the primary task of pose estimation. Subsequently, we harness advancements in multitask learning to further self-supervise the model using unlabeled real-world RGB data. A pivotal element of our self-supervised object pose estimation is a geometry-guided pseudolabel filtering module that relies on estimated depth from instance-level depth estimation. Our extensive experiments conducted on benchmark datasets demonstrate the effectiveness and potential of our approach in achieving accurate monocular 6-D pose estimation. Importantly, our method showcases a promising avenue for overcoming the challenges associated with the labor-intensive annotation process, offering a more efficient and scalable solution for real-world object pose estimation.
{"title":"Self-Supervised Object Pose Estimation With Multitask Learning","authors":"Dinh-Cuong Hoang;Phan Xuan Tan;Ta Huu Anh Duong;Tuan-Minh Huynh;Duc-Manh Nguyen;Anh-Nhat Nguyen;Duc-Long Pham;Van-Duc Vu;Thu-Uyen Nguyen;Ngoc-Anh Hoang;Khanh-Toan Phan;Duc-Thanh Tran;Van-Thiep Nguyen;Ngoc-Trung Ho;Cong-Trinh Tran;Van-Hiep Duong","doi":"10.1109/TCDS.2025.3571813","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3571813","url":null,"abstract":"Object pose estimation using learning-based methods often necessitates vast amounts of meticulously labeled training data. The process of capturing real-world object images under diverse conditions and annotating these images with 6 degrees of freedom (6DOF) object poses is both time-consuming and resource-intensive. In this study, we propose an innovative approach to monocular 6-D pose estimation through self-supervised learning, eliminating the need for labor-intensive manual annotations. Our method initiates by training a multitask neural network in a fully supervised manner, leveraging synthetic RGBD data. We leverage semantic segmentation, instance-level depth estimation, and vector-field prediction as auxiliary tasks to enhance the primary task of pose estimation. Subsequently, we harness advancements in multitask learning to further self-supervise the model using unlabeled real-world RGB data. A pivotal element of our self-supervised object pose estimation is a geometry-guided pseudolabel filtering module that relies on estimated depth from instance-level depth estimation. Our extensive experiments conducted on benchmark datasets demonstrate the effectiveness and potential of our approach in achieving accurate monocular 6-D pose estimation. Importantly, our method showcases a promising avenue for overcoming the challenges associated with the labor-intensive annotation process, offering a more efficient and scalable solution for real-world object pose estimation.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1548-1564"},"PeriodicalIF":4.9,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-18 | DOI: 10.1109/TCDS.2025.3552124
Minxu Liu;Donghai Guan;Chuhang Zheng;Qi Zhu
Multimodal emotion recognition is gaining significant attention for its ability to fuse complementary information from diverse physiological and behavioral signals, which benefits the understanding of emotional disorders. However, challenges arise in multimodal fusion due to uncertainties inherent in different modalities, such as complex signal coupling and modality heterogeneity. Furthermore, feature distribution drift in intersubject emotion recognition hinders the generalization ability of a method and significantly degrades performance on new individuals. To address the above issues, we propose a robust cross-subject multimodal emotion recognition framework that effectively extracts subject-independent intrinsic emotional identification information from heterogeneous multimodal emotion data. First, we develop a multichannel network with self-attention and cross-attention mechanisms to capture modality-specific and complementary features among different modalities, respectively. Second, we incorporate a contrastive loss into the multichannel attention network to enhance feature extraction across different channels, thereby facilitating the disentanglement of emotion-specific information. Moreover, a self-expression learning-based network layer is devised to enhance feature discriminability and subject alignment. It aligns samples in a discriminative space using block diagonal matrices and maps multiple individuals to a shared subspace using a block off-diagonal matrix. Finally, attention is used to merge multichannel features, and a multilayer perceptron is employed for classification. Experimental results on multimodal emotion datasets confirm that our proposed approach surpasses the current state of the art in terms of emotion recognition accuracy, with particularly significant gains observed in challenging cross-subject multimodal recognition scenarios.
{"title":"Multimodal Discriminative Network for Emotion Recognition Across Individuals","authors":"Minxu Liu;Donghai Guan;Chuhang Zheng;Qi Zhu","doi":"10.1109/TCDS.2025.3552124","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3552124","url":null,"abstract":"Multimodal emotion recognition is gaining significant attention for ability to fuse complementary information from diverse physiological and behavioral signals, which benefits the understanding of emotional disorders. However, challenges arise in multimodal fusion due to uncertainties inherent in different modalities, such as complex signal coupling and modality heterogeneity. Furthermore, the feature distribution drift in intersubject emotion recognition hinders the generalization ability of the method and significantly degrades performance on new individuals. To address the above issues, we propose a cross-subject multimodal emotion robust recognition framework that effectively extracts subject-independent intrinsic emotional identification information from heterogeneous multimodal emotion data. First, we develop a multichannel network with self-attention and cross-attention mechanisms to capture modality-specific and complementary features among different modalities, respectively. Second, we incorporate contrastive loss into the multichannel attention network to enhance feature extraction across different channels, thereby facilitating the disentanglement of emotion-specific information. Moreover, a self-expression learning-based network layer is devised to enhance feature discriminability and subject alignment. It aligns samples in a discriminative space using block diagonal matrices and maps multiple individuals to a shared subspace using a block off-diagonal matrix. Finally, attention is used to merge multichannel features, and multilayer perceptron is employed for classification. Experimental results on multimodal emotion datasets confirm that our proposed approach surpasses the current state-of-the-art in terms of emotion recognition accuracy, with particularly significant gains observed in the challenging cross-subject multimodal recognition scenarios.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 5","pages":"1323-1335"},"PeriodicalIF":4.9,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145255983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-16 | DOI: 10.1109/TCDS.2025.3570497
Tianyi Hu;Zhiqiang Pu;Xiaolin Ai;Tenghai Qiu;Yanyan Liang;Jianqiang Yi
This article focuses on cooperative policy learning for physically heterogeneous multiagent system (PHet-MAS), where agents have different observation spaces, action spaces, and local state transitions. Due to the various input–output structures of agents' policies in PHet-MAS, it is difficult to employ parameter-sharing techniques for sample efficiency. Moreover, a totally heterogeneous policy design prevents agents from utilizing the training experience of their companions and increases the risk of environmental nonstationarity. To address the above issues, we propose hybrid heterogeneous actor–critic (HHAC), a method for the policy learning of PHet-MAS. The framework of HHAC consists of a hybrid actor and a hybrid critic, both containing globally shared and locally shared modules. The locally shared modules can be customized according to the actual physical properties of agents, while the globally shared modules help extract and utilize the common information among agents. In the hybrid critic, a behavioral intention module is designed to alleviate the environmental nonstationarity issue caused by evolving heterogeneous policies. Finally, a hybrid network training method is developed to address challenges in sample construction and training stability of hybrid networks. As evidenced by experimental results, HHAC achieves clear performance gains over baseline approaches and facilitates PHet-MAS in learning sophisticated and instructive policies.
{"title":"Hybrid Actor–Critic for Physically Heterogeneous Multiagent Reinforcement Learning","authors":"Tianyi Hu;Zhiqiang Pu;Xiaolin Ai;Tenghai Qiu;Yanyan Liang;Jianqiang Yi","doi":"10.1109/TCDS.2025.3570497","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3570497","url":null,"abstract":"This article focuses on cooperative policy learning for physically heterogeneous multiagent system (PHet-MAS), where agents have different observation spaces, action spaces, and local state transitions. Due to the various input–output structures of agents’ policies in PHet-MAS, it is difficult to employ parameter sharing techniques for sample efficiency. Moreover, a totally heterogeneous policy design impedes agents from utilizing the training experience of their companions and increases the risk of environmental nonstationarity. To address the above issues, we propose <italic>hybrid heterogeneous actor–critic</i> (HHAC), a method for the policy learning of PHet-MAS. The framework of HHAC consists of a hybrid actor and a hybrid critic, both containing globally shared and locally shared modules. The locally shared modules can be customized according to the actual physical properties of agents, while the globally shared modules can help extract and utilize the common information among agents. In the hybrid critic, a behavioral intention module is designed to alleviate the environmental nonstationary issue caused by evolving heterogeneous policies. Finally, a hybrid network training method is developed to address challenges in sample construction and training stability of hybrid networks. As evidenced by experimental results, HHAC exhibits superior performance enhancements over baseline approaches and can facilitate PHet-MAS in learning sophisticated and instructive policies.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1520-1535"},"PeriodicalIF":4.9,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Event camera-based pattern recognition is an emerging research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, the results may still be limited due to the following two issues. First, these methods adopt spatially sparse event streams for recognition only, which may fail to capture color and detailed texture information well. Second, they adopt either spiking neural networks (SNN) for energy-efficient recognition with suboptimal results, or artificial neural networks (ANN) for energy-intensive, high-performance recognition. However, few of them consider achieving a balance between these two aspects. In this article, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously and present a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multimodal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset, which contains 114 classes and 27 102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validate the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and the source code of this work will be released at https://github.com/Event-AHU/SSTFormer.
{"title":"SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event Based-Recognition","authors":"Xiao Wang;Yao Rong;Zongzhen Wu;Lin Zhu;Bo Jiang;Jin Tang;Yonghong Tian","doi":"10.1109/TCDS.2025.3568833","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3568833","url":null,"abstract":"Event camera-based pattern recognition is a newly arising research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, however, their results may still be limited due to the following two issues. First, they adopt spatial sparse event streams for recognition only, which may fail to capture the color and detailed texture information well. Second, they adopt either spiking neural networks (SNN) for energy-efficient recognition with suboptimal results, or artificial neural networks (ANN) for energy-intensive, high-performance recognition. However, few of them consider achieving a balance between these two aspects. In this article, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously and propose a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., memory support Transformer network for RGB frame encoding, spiking neural network for raw event stream encoding, multimodal bottleneck fusion module for RGB-Event feature aggregation, and prediction head. Due to the scarcity of RGB-Event based classification dataset, we also propose a large-scale PokerEvent dataset which contains 114 classes, and 27 102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-event based classification datasets fully validated the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and source code of this work will be released at <uri>https://github.com/Event-AHU/SSTFormer</uri>.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1488-1502"},"PeriodicalIF":4.9,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-12 | DOI: 10.1109/TCDS.2025.3569352
Alejandro Romero;Gianluca Baldassarre;Richard J. Duro;Vieri Giuliano Santucci
This article addresses the challenge of developing artificial agents capable of autonomously discovering interesting environmental states, setting them as goals, and learning the necessary skills and curricula to achieve these goals, an essential requirement for deploying robotic systems in real-world scenarios. In such environments, robots must adapt to unforeseen situations, learn new skills, and manage unexpected changes autonomously, which is central to open-ended learning (OEL). We present the hierarchical goal-discovery robotic architecture for intrinsically-motivated learning (H-GRAIL), an architecture designed to foster autonomous OEL in robotic agents. The novelty of H-GRAIL compared to existing approaches, which often address isolated challenges in OEL, is that it integrates multiple mechanisms that enable robots to autonomously discover new goals, acquire skills, and manage learning processes in dynamic, nonstationary environments. We present tests that demonstrate the advantages of this approach in enabling robots to achieve different goals in nonstationary environments while simultaneously addressing many of the challenges inherent to OEL.
{"title":"H-GRAIL: A Robotic Motivational Architecture to Tackle Open-Ended Learning Challenges","authors":"Alejandro Romero;Gianluca Baldassarre;Richard J. Duro;Vieri Giuliano Santucci","doi":"10.1109/TCDS.2025.3569352","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3569352","url":null,"abstract":"This article addresses the challenge of developing artificial agents capable of autonomously discovering interesting environmental states, setting them as goals, and learning the necessary skills and curricula to achieve these goals—an essential requirement for deploying robotic systems in real-world scenarios. In such environments, robots must adapt to unforeseen situations, learn new skills, and manage unexpected changes autonomously, which is central to open-ended learning (OEL). We present hierarchical goal-discovery robotic architecture for intrinsically-motivated learning (H-GRAIL) an architecture designed to foster autonomous OEL in robotic agents. The novelty of H-GRAIL compared to existing approaches, which often address isolated challenges in OEL, is that it integrates multiple mechanisms that enable robots to autonomously discover new goals, acquire skills, and manage learning processes in dynamic, nonstationary environments. We present tests that demonstrate the advantages of this approach in enabling robots to achieve different goals in nonstationary environments and simultaneously address many of the challenges inherent to OEL.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1503-1519"},"PeriodicalIF":4.9,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11002551","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multisite resting-state functional magnetic resonance imaging (rs-fMRI) data have been increasingly utilized for the diagnostic classification of neuropsychiatric disorders such as schizophrenia (SCZ) and major depressive disorder (MDD). However, the cross-site generalization ability of deep networks is limited due to the significant intersite data heterogeneity caused by different MRI scanners or scanning protocols. To address this issue, we propose a feature and semantic matching multisource domain adaptation method (FSM-MSDA) to learn site-invariant, disorder-related feature representations. In FSM-MSDA, we adopt separate feature extractors for multiple source domains and propose an accurate feature matching module to align the category-level feature distributions across the multiple source domains and the target domain. In addition, we propose a semantic feature alignment module to eliminate the distribution discrepancy in high-level semantic features extracted from target samples by different source classifiers. Extensive experiments based on multisite fMRI data of SCZ and MDD show the superiority and robustness of FSM-MSDA compared with state-of-the-art methods. Moreover, FSM-MSDA achieves an average accuracy of 80.8% in the classification of SCZ, meeting the clinical diagnostic accuracy threshold of 80%. Shared discriminative brain regions, including the middle temporal gyrus and cerebellum regions, are identified in the diagnostic classification of SCZ and MDD.
{"title":"Feature and Semantic Matching Multisource Domain Adaptation for Diagnostic Classification of Neuropsychiatric Disorders","authors":"Minghao Dai;Jianpo Su;Zhipeng Fan;Chenyu Wang;Limin Peng;Dewen Hu;Ling-Li Zeng","doi":"10.1109/TCDS.2025.3567521","DOIUrl":"https://doi.org/10.1109/TCDS.2025.3567521","url":null,"abstract":"Multisite resting-state functional magnetic resonance imaging (rs-fMRI) data have been increasingly utilized for diagnostic classification of neuropsychiatric disorders such as schizophrenia (SCZ) and major depressive disorder (MDD). However, the cross-site generalization ability of deep networks is limited due to the significant intersite data heterogeneity caused by different MRI scanners or scanning protocols. To address this issue, we propose a feature and semantic matching multisource domain adaptation method (FSM-MSDA) to learn site-invariant disorder-related feature representations. In FSM-MSDA, we adopt separate feature extractors for multiple source domains, and propose an accurate feature matching module to align the category-level feature distributions across multiple source domains and target domain. In addition, we also propose a semantic feature alignment module to eliminate the distribution discrepancy in high-level semantic features extracted from target samples by different source classifiers. Extensive experiments based on multisite fMRI data of SCZ and MDD show the superiority and robustness of FSM-MSDA compared with state-of-the-art methods. Besides, FSM-MSDA achieves the average accuracy of 80.8% in the classification of SCZ, meeting the clinically diagnostic accuracy threshold of 80%. Shared discriminative brain regions including the middle temporal gyrus and the cerebellum regions are identified in the diagnostic classification of SCZ and MDD.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"17 6","pages":"1474-1487"},"PeriodicalIF":4.9,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145705906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}