Collective decision-making is vital and widespread in both human and artificial societies. Individuals often choose an option by assessing its intrinsic value through individual learning, but they are also influenced by peer pressure and may select an option through conformity-based social learning. A central question is whether a population can settle on the most beneficial option when social learning is involved. Previous studies of social learning have focused on well-mixed populations in which individuals are equally likely to interact with one another, yet real social interactions are often more subtle and are better modeled by a graph. It is therefore challenging to analyze theoretically the effect of social learning on collective decision-making in structured populations. To address this issue, we use evolutionary game theory to propose an evolutionary model of binary options that jointly integrates individual and social learning in any population structure. We first derive the average fraction of the option with higher merit by means of coalescing random walks and find that introducing conformity-based social learning is detrimental to the collective performance of decision-making. Interestingly, however, our theoretical analysis reveals that the majority of the population always favors the option with higher merit, regardless of the preference for social learning. Importantly, these theoretical predictions are valid for any population structure, and they are verified by extensive numerical simulations on three representative static interaction structures. We further show via computer simulations that they also hold in dynamic networks, and we demonstrate the robustness of our findings to different conformity-based social learning procedures.
{"title":"Collective Performance Induced by Social and Individual Learning in Any Population Structure: An Evolutionary Game Approach","authors":"Zhifang Li;Jingwei Zhang;Xiaojie Chen;Attila Szolnoki","doi":"10.1109/TAI.2025.3592636","DOIUrl":"https://doi.org/10.1109/TAI.2025.3592636","url":null,"abstract":"Collective decision-making is vital and widespread in human and artificial societies. Individuals often choose the option by assessing the intrinsic values of options in decision-making through individual learning. But they are also influenced by peer pressure and select the option by conformity-based social learning. A central question is whether the population can settle on the most beneficial option when social learning is involved. Previous studies concerning social learning focused on well-mixed populations where individuals are equally likely to interact with each other. But real social interactions are often more subtle that are modeled by a graph. Therefore, it is challenging to theoretically analyze the effect of social learning on collective decision-making in structured populations. To address this issue, using evolutionary game theory we propose an evolutionary model of binary options jointly integrating individual and social learning in any population structure. We first derive the average fraction of the option with higher merit by means of coalescing random walks and find that the introduction of conformity-based social learning is detrimental to collective performance of decision-making. Interestingly, however, our theoretical analysis reveals that the majority of the population always favors the option with higher merit regardless of the preference of social learning. Importantly, these theoretical predictions are valid for any population structure and they are verified by intensive numerical simulations made in three representative static interaction structures. We further show that they hold in dynamic networks via computer simulations. We also demonstrate the robustness of our findings to different conformity-based social learning procedures.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1143-1157"},"PeriodicalIF":0.0,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146176027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-25 | DOI: 10.1109/TAI.2025.3591587
Jingpeng Sun;Chen Chen;Weiping Ding;Xiyuan Hu
Sleep disorders affect a significant portion of the global population and contribute to increased overall mortality. Automatic sleep staging through the analysis of physiological signals is pivotal for expanding sleep assessment and diagnostic capabilities. However, owing to the complex nonstationary characteristics of physiological signals and the differences between individual subjects, extracting effective features from these signals remains challenging. To this end, we propose a novel Transformer-based sleep staging method, SleepLog, which combines local and global information for feature extraction. First, a convolutional neural network (CNN)-based module extracts local information to capture the features of sleep characteristic wave events. Then, a self-attention-based patch encoder module extracts global information that reflects the transitions between different characteristic waves. The local and global information is fed to the Transformer encoder module, enabling the class (CLS) token of each branch to extract supplementary information from the associated features. Finally, we propose a simple yet effective cross-attention-based feature fusion module, which uses the single class token of each branch as a query to exchange information with the other branch. The proposed cross-attention requires only linear computational and memory complexity. To validate the performance of the proposed method, we evaluate SleepLog on the publicly available Sleep-EDF dataset. The experimental results show that the proposed model maintains superior performance, indicating its potential for developing and deploying home-environment automatic sleep staging systems.
{"title":"SleepLog: Local-Global Deep Fusion Learning for Sleep Staging Transformer","authors":"Jingpeng Sun;Chen Chen;Weiping Ding;Xiyuan Hu","doi":"10.1109/TAI.2025.3591587","DOIUrl":"https://doi.org/10.1109/TAI.2025.3591587","url":null,"abstract":"Sleep disorders affect a significant portion of the global population and contribute to increased overall mortality. Automatic sleep staging through analyzing physiological signals is pivotal in expanding sleep assessment and diagnostic capabilities. However, due to the complex nonstationary characteristics of physiological signals and the individual differences between subjects, obtaining the effective features of the physiological signals is still challenging. To this end, we propose a novel Transformer-based sleep staging method, SleepLog, to combine local and global information for feature extraction. First, a convolutional neural network (CNN)-based module was used to extract the local information to capture the features of sleep characteristic wave events. Then, we extract the global information that reflects the transformation between different characteristic waves using a self-attention-based patch encoder module. Furthermore, the local and global information was fed to the Transformer encoder module to enable the class (CLS) token of each branch to extract supplementary information from the associated features. Finally, we propose a simple yet effective cross-attention-based feature fusion module, which uses a single class token for each branch as a query to exchange information with other branches. The proposed cross-attention only requires linear time for both computational and memory complexity. To validate the performance of the proposed method, we evaluate SleepLog on a publicly available dataset Sleep-EDF. The experimental results show that the proposed model can maintain superior performance, indicating that it has the potential to develop and apply a home-environment automatic sleep staging system.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1084-1096"},"PeriodicalIF":0.0,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146176014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep reinforcement learning (DRL) methods have recently shown promise in path planning tasks. However, when dealing with global planning tasks in mountainous terrain (2.5D) environments, these methods face serious challenges such as poor convergence and generalization. To this end, we propose learn once plan arbitrarily (LOPA), an enhanced DRL method that learns on a single map yet generalizes to topographically similar terrains. It thereby enables path planning across multiple mountainous terrain maps while balancing path distance and energy consumption. First, we analyze the causes of the convergence and generalization problems from the perspective of the DRL observation, revealing that the conventional design allows irrelevant map information to interfere with learning. Second, we develop LOPA, which utilizes a novel dynamic observation mechanism to better focus on the key information in the observation. This mechanism is realized in two steps: 1) a dynamic observation model transforms the DRL observation into two dynamic views, local and global, guiding LOPA to focus on the key information of the given maps; and 2) a dual-channel network processes these two views and integrates them to improve reasoning capability. Meanwhile, through a Rademacher complexity analysis, we provide theoretical justification for LOPA's improved generalization, demonstrating a lower upper bound on the generalization error. LOPA is validated through multiobjective global path planning experiments conducted on both simulated and real maps. The results show that LOPA achieves improved convergence and generalization performance as well as high planning efficiency.
{"title":"Learn Once Plan Arbitrarily (LOPA): Dynamic Observation-Based Deep Reinforcement Learning Method for Global Path Planning in Mountainous Terrain Environment","authors":"Shuqiao Huang;Mingxin Hou;Xiaofang Yuan;Xiru Wu;Yaonan Wang;Guoming Huang","doi":"10.1109/TAI.2025.3592648","DOIUrl":"https://doi.org/10.1109/TAI.2025.3592648","url":null,"abstract":"Deep reinforcement learning (DRL) methods have recently shown promise in path planning tasks. However, when dealing with global planning tasks in mountainous terrain (2.5D) environment, these methods face serious challenges such as poor convergence and generalization. To this end, we propose learn once plan arbitrarily (LOPA), an enhanced DRL method that learns on a single map yet generalizes to topographically similar terrains. Consequently, it enables path planning across multiple mountainous terrain maps while balancing path distance and energy consumption. First, we analyze the reasons for convergence and generalization problems from the perspective of DRL’s observation, revealing that the conventional design causes DRL to be interfered with irrelevant map information. Second, we develop the LOPA, which utilizes a novel dynamic observation mechanism to attain an improved capability in focusing on key information of the observation. Such a mechanism is realized by two steps: 1) a dynamic observation model is built to transform the DRL’s observation into two dynamic views: local and global, significantly guiding the LOPA to focus on the key information of the given maps; and 2) a dual-channel network is constructed to process these two views and integrate them to attain an improved reasoning capability. Meanwhile, through Rademacher Complexity analysis, we provide theoretical justification for LOPA’s improved generalization capability, demonstrating a lower upper bound on the generalization error. The LOPA is validated through multiobjective global path planning experiments conducted on both simulated and real maps. The results suggest that LOPA has improved convergence and generalization performance, as well as great planning efficiency.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1168-1184"},"PeriodicalIF":0.0,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146176003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-25 | DOI: 10.1109/TAI.2025.3592635
Mayank Kumar Kundalwal;Deepak Mishra
Federated learning (FL) enables collaborative model training across decentralized data sources while preserving privacy. However, FL systems are vulnerable to attacks from malicious clients that can degrade model performance and compromise integrity. In this work, we propose the anomaly-resistant robust framework for federated learning (AR2FL), which enhances FL aggregation by leveraging mean latent representations of client updates. This data-driven approach enables the server to estimate interclient similarity and dynamically scale clients' contributions, reducing the influence of anomalous or adversarial updates. Unlike methods based on fixed distance metrics such as cosine similarity or Euclidean distance, AR2FL captures deeper statistical patterns in the latent space, enabling more accurate and secure model updates. Experiments on several datasets show that AR2FL maintains strong accuracy, fast convergence, and high robustness, making it suitable for secure large-scale FL.
{"title":"AR2FL: Anomaly-Resistant Robust Framework for Federated Learning","authors":"Mayank Kumar Kundalwal;Deepak Mishra","doi":"10.1109/TAI.2025.3592635","DOIUrl":"https://doi.org/10.1109/TAI.2025.3592635","url":null,"abstract":"Federated learning (FL) enables collaborative model training across decentralized data sources while preserving privacy. However, FL systems are vulnerable to attacks from malicious clients that can degrade model performance and compromise integrity. In this work, we propose anomaly-resistant robust framework for federated learning (AR2FL), an anomaly-resistant and robust framework that enhances FL aggregation by leveraging mean latent representations of client updates. This data-driven approach enables the server to estimate interclient similarity and dynamically scale clients contributions, reducing the influence of anomalous or adversarial updates. Unlike methods based on fixed distance metrics such as cosine similarity or Euclidean distance, AR2FL captures deeper statistical patterns in the latent space, enabling more accurate and secure model updates. Experiments on several datasets show AR2FL maintains strong accuracy, fast convergence, and high robustness, making it suitable for secure large-scale FL.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1131-1142"},"PeriodicalIF":0.0,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146176020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep reinforcement learning (DRL) has shown significant success in domains such as computer vision and robot control. However, DRL agents often suffer from low sample efficiency, which limits their practical applicability in industrial settings. Recent advances in DRL, particularly model-based approaches, have sought to address this issue by leveraging imaginary data to improve decision-making and sampling efficiency. Despite their promise, these methods face challenges such as overreliance on early experiences in the replay buffer and underutilization of imaginary data, which can lead to overfitting and suboptimal policy optimization. To overcome these limitations, we propose a novel reinforcement learning framework, balanced sampling and reusing imaginary data (BSRID), which introduces two key innovations: 1) a balanced sampling (BS) mechanism that ensures uniform sampling rates to mitigate bias toward early experiences; and 2) a reusing imaginary data (RID) strategy that enhances policy optimization by increasing the update frequency and maximizing the utility of imaginary data. Experimental results on the Atari 100k benchmark demonstrate that BSRID significantly improves sample efficiency and achieves state-of-the-art (SOTA) performance. This work provides a robust and efficient solution for DRL applications in scenarios requiring high sample efficiency and reliable decision-making.
{"title":"Balanced Sampling and Reusing Imaginary Data for World Models in Reinforcement Learning","authors":"Qianyu Wang;Xuekai Wei;Jielu Yan;Leong Hou U;Huayan Pu;Jun Luo;Weijia Jia;Mingliang Zhou","doi":"10.1109/TAI.2025.3592174","DOIUrl":"https://doi.org/10.1109/TAI.2025.3592174","url":null,"abstract":"Deep reinforcement learning (DRL) has shown significant success in domains such as computer vision and robot control. However, DRL agents often suffer from low sample efficiency, limiting their practical applicability in industrial settings. Recent advances in model-based DRL, particularly model-based approaches, have sought to address this issue by leveraging imaginary data to improve decision-making and sampling efficiency. Despite their promise, these methods face challenges such as overreliance on early experiences in the replay buffer and under-utilization of imaginary data, which can lead to overfitting and suboptimal policy optimization. To overcome these limitations, we propose a novel reinforcement learning framework, balanced sampling and reusing imaginary data (BSRID), which introduces two key innovations: 1) a BS mechanism that ensures uniform sampling rates to mitigate bias toward early experiences; and 2) a RID strategy that enhances policy optimization by increasing update frequency and maximizing the utility of imaginary data. The experimental results on the Atari 100k benchmark demonstrate that BSRID significantly improves sample efficiency and achieves state-of-the-art (SOTA) performance. This work provides a robust and efficient solution for DRL applications in scenarios requiring high sample efficiency and reliable decision making.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1118-1130"},"PeriodicalIF":0.0,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-24 | DOI: 10.1109/TAI.2025.3592157
Christian Nash;Rajesh Nair;Syed Mohsen Naqvi
Attention deficit hyperactivity disorder (ADHD) is commonly identified in children, and its prevalence in adults is believed to be underreported. In this article, we aim to detect adult ADHD symptoms using two autoencoder architectures. We train and test on the novel multimodal ADHD dataset recorded under the Intelligent Sensing ADHD Trial in collaboration with the Cumbria, Northumberland, Tyne and Wear NHS Foundation Trust, U.K. The autoencoder architectures perform an image reconstruction task to optimize the latent bottleneck feature space, which is then used in downstream classification tasks to distinguish ADHD subjects from control participants. The RGB video data are exploited specifically to inform the autoencoders about hyperactivity symptoms. The audio data are used to further support hyperactivity symptoms while also capturing inattentive symptoms. The self-report questionnaire is a subjective measure in which individuals describe the ADHD symptoms they experience; it is a vital data source in the proposed work because it provides the autoencoders with previously unidentifiable symptoms. An ablation study demonstrates the effectiveness of each individual data modality and its associated discriminatory power. Using rigorous validation techniques, we achieve state-of-the-art classification accuracy, sensitivity, and specificity of 98.9%, 99.2%, and 98.5%, respectively. Given that ADHD classification is at present a largely subjective decision, the proposed work demonstrates that an objective system can provide robust support to ADHD clinicians in the future.
{"title":"Optimizing ADHD Detection: An Autoencoder Approach for Multimodal Classification","authors":"Christian Nash;Rajesh Nair;Syed Mohsen Naqvi","doi":"10.1109/TAI.2025.3592157","DOIUrl":"https://doi.org/10.1109/TAI.2025.3592157","url":null,"abstract":"Attention deficit hyperactivity disorder (ADHD) is commonly found in children, with the prevalence in adults said to be under-reported. In this article, we aim to detect adult ADHD symptoms using two autoencoder architectures. We train and test on the novel multimodal ADHD dataset recorded under the Intelligent Sensing ADHD Trial in collaboration with the Cumbria, Northumberland, Tyne and Wear NHS Foundation Trust, U.K. The autoencoder architectures perform an image reconstruction task to optimize the latent bottleneck feature space to perform downstream classification tasks to detect ADHD subjects or control participants. The RGB video data is specifically exploited to inform the autoencoders about the hyperactivity symptoms. The Audio data is used to further support hyperactivity symptoms while also hoping to gain scope on inattentive symptoms. The self report questionnaire is a subjective measure, where the individual can provide details of ADHD symptoms that they experience. It is a vital data source to include in the proposed work for providing the autoencoders with previously unidentifiable symptoms. An ablation study is undertaken to demonstrate the effectiveness of the individual data modality, attempting to distinguish the associated discriminatory power. Using rigorous validation techniques, we achieve a state-of-the-art classification accuracy, sensitivity, and specificity of 98.9%, 99.2%, and 98.5%, respectively. With ADHD classification being a preliminary subjective decision, the proposed work demonstrates that an objective system can provide robust support to ADHD clinicians in the future.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1107-1117"},"PeriodicalIF":0.0,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-22 | DOI: 10.1109/TAI.2025.3591580
Min Liu;Zhao Yao;Mutian Li;Chenqian Zhao;Jiale Xu;Jinhua Yu
Early diagnosis of breast cancer is critical for reducing mortality rates. Dynamic ultrasound videos contain rich tumor-specific features, offering valuable information for clinical diagnosis. In standard clinical practice, sonographers typically first identify keyframes and then scan the surrounding area to gather more information. Previous research on ultrasound videos has been devoted to temporal modeling while neglecting the contribution of keyframes to tumor diagnosis. In this article, we propose a two-stage hybrid network, the hybrid keyframe-guided video transformer (HKVT), to model both static keyframe and dynamic video information in breast ultrasound videos. In the first stage, the model uses a multiinstance learning paradigm to construct an efficient video classification model that automatically identifies keyframes from self-attention scores. In the second stage, the embedding tokens of the keyframe are extracted, and a keyframe-guided transformer block is constructed for ultrasound video classification. Specifically, we design a keyframe-guided temporal attention module and a keyframe-guided spatial coattention module to incorporate static keyframe features alongside dynamic video features. We evaluate the proposed model on an internal dataset of 342 patients and an external test dataset of 119 patients. The HKVT model achieves an area under the curve (AUC) of 0.921 on the internal dataset and 0.901 on the external test dataset, outperforming other state-of-the-art models. Furthermore, our model demonstrates robust performance on 242 multicenter test cases, outperforming other models by at least 2.1% in AUC. These results demonstrate the superiority of our approach for breast ultrasound video classification.
{"title":"A Hybrid Clinical Knowledge-Driven Transformer for Breast Ultrasound Video Classification","authors":"Min Liu;Zhao Yao;Mutian Li;Chenqian Zhao;Jiale Xu;Jinhua Yu","doi":"10.1109/TAI.2025.3591580","DOIUrl":"https://doi.org/10.1109/TAI.2025.3591580","url":null,"abstract":"Early diagnosis of breast cancer is critical for reducing mortality rates. Dynamic ultrasound videos contain rich tumor-specific features, offering valuable information for clinical diagnosis. In standard clinical practice, sonographers typically first identify keyframes before scanning the surrounding area of it to gather more information. Previous research based on ultrasound videos has been devoted to temporal modeling while neglecting the contribution of keyframes to tumor diagnosis. In this article, we propose a two-stage hybrid network, hybrid keyframe-guided video transformer (HKVT), to model both static keyframe and dynamic video information in breast ultrasound videos. In the first stage, the model uses a multiinstance learning paradigm to construct an efficient video classification model that automatically identifies keyframes using self-attention scores. In the second stage, the embedding tokens of the keyframe are extracted, and a keyframe-guided transformer block is constructed for ultrasound video classification. Specifically, we designed a keyframe-guided temporal attention module and a keyframe-guided spatial coattention module to incorporate static keyframe features alongside dynamic video features. We evaluated the proposed model on an internal dataset of 342 patients and an external test dataset of 119 patients. The HKVT model achieved an area under the curve (AUC) of 0.921 on the internal dataset and 0.901 on the external test dataset, outperforming other state-of-the-art models. Furthermore, our model demonstrated robust performance on 242 multicenter test cases, outperforming other models by at least 2.1% in AUC. These results demonstrate the superiority of our approach for breast ultrasound video classification.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1062-1072"},"PeriodicalIF":0.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146176018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-22 | DOI: 10.1109/TAI.2025.3591588
Shilajit Banerjee;Angshuman Paul
Knowledge distillation (KD) can be used to enhance the performance of lightweight student models with the help of knowledge from heavier teacher models. Most KD methods for classification use a one-teacher, one-student architecture in which a single teacher is responsible for transferring knowledge to a student for all classes. However, as the number of classes increases, it may become difficult for a single teacher to learn the salient characteristics of every class, which may in turn harm the performance of the student in a KD approach. In this article, we present a novel KD method in which an ensemble of lightweight students is trained by a pyramid of teachers. At the top level of the pyramid, one teacher learns all the class labels under consideration. As we go down the pyramid, the number of teachers increases at each level, but except for the top level, each teacher learns a smaller subset of classes than the teachers at the levels above it. Hence, different teachers learn different perspectives of the classification problem. Moreover, as we move down the pyramid, the teachers become increasingly specialized, whereas moving upward, they acquire an increasingly broad perspective of the classification problem. We design a novel distillation loss to distill knowledge between the students and the pyramid of teachers. Experimental results on publicly available datasets show the effectiveness of the proposed method. The code can be found at https://github.com/Shilajit77/Pyramid-Distill/tree/main.
{"title":"Knowledge Distillation for an Ensemble of Students From a Pyramid of Teachers With Diverse Perspective","authors":"Shilajit Banerjee;Angshuman Paul","doi":"10.1109/TAI.2025.3591588","DOIUrl":"https://doi.org/10.1109/TAI.2025.3591588","url":null,"abstract":"Knowledge distillation (KD) can be used for enhancing the performance of a lightweight student models with the help of knowledge from heavier teacher models. Most KD methods for classification use a one-teacher one-student architecture where only one teacher is responsible for transferring knowledge to a student for all the classes. However, when the number of classes increases, it may become difficult for a single teacher to learn the salient characteristics of all the classes. This may also adversely affect the performance of a student in a KD approach. In this article, we present a novel KD method where an ensemble of lightweight students is trained by a pyramid of teachers. At the top level of the pyramid, we have one teacher who learns all the class labels under consideration. As we go down the pyramid, the number of teachers increases at each level. However, except for the top level, each teacher learns a smaller subset of classes compared with its upper levels. Hence, different teachers learn different perspectives of the classification problem. In addition, as we move down the pyramid, the teachers become more and more specialized. On the contrary, as we move upward, the teachers learn a broader and broader perspective about the classification problem. We design a novel distillation loss to distill the knowledge between the student and the pyramid of teachers. Experimental results on publicly available datasets show the effectiveness of the proposed method. The code can be found at <uri>https://github.com/Shilajit77/Pyramid-Distill/tree/main</uri>.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1097-1106"},"PeriodicalIF":0.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social media platforms are vital to modern communication, but they also enable the spread of harmful content, such as hate speech and misinformation. Current detection models, while accurate, are often resource-intensive and unsuitable for real-time or resource-constrained environments. Moreover, even models that incorporate multilingual capabilities often fail to generalize effectively across languages. To address this challenge, we propose CLARITY, a novel lightweight cross-modal transformer architecture designed for efficient and scalable harmful content detection. Unlike traditional models, CLARITY achieves faster processing while maintaining accuracy, making it accessible to a wider range of platforms and devices. CLARITY integrates text, image, and audio modalities to capture the complex multimodal interactions that enhance detection across diverse content types. By employing contrastive learning, CLARITY accurately distinguishes between reclaimed language and genuinely harmful content, significantly reducing false positives and promoting inclusivity, particularly for marginalized communities. Additionally, CLARITY incorporates a domain adaptation module with cross-lingual and multilingual capabilities, enabling it to generalize effectively across various platforms and ensuring robust performance even in dynamic online environments. We evaluate CLARITY across multiple benchmark datasets and GPUs, including Kaggle's Tesla P100, Colab Pro's NVIDIA T4, and the NVIDIA A100. The results demonstrate a significant reduction in inference time, with the A100 achieving an average inference time of 0.85 s per instance, over 30% faster than traditional models, while maintaining competitive accuracy.
{"title":"CLARITY: A Lightweight Multimodal Transformer for Harmful Content Detection","authors":"Gautam Siddharth Kashyap;Niharika Jain;Ebad Shabbir;Harsh Joshi;Usman Naseem;Jiechao Gao","doi":"10.1109/TAI.2025.3591585","DOIUrl":"https://doi.org/10.1109/TAI.2025.3591585","url":null,"abstract":"Social media platforms are vital to modern communication, but they also enable the spread of harmful content, such as hate speech and misinformation. Current detection models, while accurate, are often resource-intensive and unsuitable for real-time or resource-constrained environments. Moreover, even models that incorporate multilingual capabilities often fail to generalize effectively across different languages. To address this challenge, we propose CLARITY, a novel lightweight cross-modal transformer architecture designed for efficient and scalable harmful content detection. Unlike traditional models, CLARITY achieves faster processing while maintaining accuracy, making it accessible to a wider range of platforms and devices. CLARITY integrates text, image, and audio modalities to capture complex, multimodal interactions that enhance detection across diverse content types. By employing contrastive learning, CLARITY accurately distinguishes between reclaimed language and genuinely harmful content, significantly reducing false positives and promoting inclusivity, particularly for marginalized communities. Additionally, CLARITY incorporates a domain adaptation module with cross-lingual and multilingual, enabling it to generalize effectively across various platforms and ensuring robust performance even in dynamic online environments. We evaluate CLARITY across multiple benchmark datasets and GPUs, including Kaggle’s Tesla P100, Colab Pro’s NVIDIA T4, and NVIDIA A100. The results demonstrate a significant reduction in inference time, with the A100 achieving an average inference time of 0.85 s per instance—over 30% faster than traditional models—while maintaining competitive accuracy.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 2","pages":"1073-1083"},"PeriodicalIF":0.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}