This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.
{"title":"Enhancing token boundary detection in disfluent speech","authors":"Manu Srivastava , Marcello Ferro , Vito Pirrelli , Gianpaolo Coro","doi":"10.1016/j.iswa.2025.200614","DOIUrl":"10.1016/j.iswa.2025.200614","url":null,"abstract":"<div><div>This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. 
The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200614"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
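The energy-based segmentation step described in the abstract can be illustrated with a minimal sketch: per-frame log energy compared against a relative dB threshold, so that pauses and hesitations survive as gaps between segments. Function name, frame sizes, and threshold below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def energy_segments(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Split a waveform into speech segments using per-frame log energy.

    Frames whose energy falls below `threshold_db` (relative to the loudest
    frame) are treated as pauses, so hesitations are preserved as gaps
    between segments rather than being smoothed away.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([
        np.sum(signal[i * hop:i * hop + frame] ** 2) for i in range(n_frames)
    ])
    db = 10 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    voiced = db > threshold_db
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * hop / sr, (i * hop + frame) / sr))
            start = None
    if start is not None:
        segments.append((start * hop / sr, len(signal) / sr))
    return segments
```

On a signal with a half-second silence between two tones, the segmenter returns two segments with the pause intact, which is the behaviour the pipeline needs for downstream fluency measures.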
Pub Date: 2026-03-01 | Epub Date: 2025-12-18 | DOI: 10.1016/j.iswa.2025.200621
André Artelt, Stelios G. Vrachimis, Demetrios G. Eliades, Ulrike Kuhl, Barbara Hammer, Marios M. Polycarpou
The increasing penetration of information and communication technologies in the design, monitoring, and control of water systems enables the use of algorithms for detecting and identifying unanticipated events (such as leakages or water contamination) using sensor measurements. However, data-driven methodologies do not always give accurate results and are often not trusted by operators, who may prefer to use their engineering judgment and experience to deal with such events.
In this work, we propose a framework for interpretable event diagnosis — an approach that assists the operators in associating the results of algorithmic event diagnosis methodologies with their own intuition and experience. This is achieved by providing contrasting (i.e., counterfactual) explanations of the results provided by fault diagnosis algorithms; their aim is to improve the understanding of the algorithm’s inner workings by the operators, thus enabling them to take a more informed decision by combining the results with their personal experiences. Specifically, we propose counterfactual event fingerprints, a representation of the difference between the current event diagnosis and the closest alternative explanation, which can be presented in a graphical way. The proposed methodology is applied and evaluated on a realistic use case using the L-Town benchmark.
{"title":"Interpretable event diagnosis in water distribution networks","authors":"André Artelt , Stelios G. Vrachimis , Demetrios G. Eliades , Ulrike Kuhl , Barbara Hammer , Marios M. Polycarpou","doi":"10.1016/j.iswa.2025.200621","DOIUrl":"10.1016/j.iswa.2025.200621","url":null,"abstract":"<div><div>The increasing penetration of information and communication technologies in the design, monitoring, and control of water systems enables the use of algorithms for detecting and identifying unanticipated events (such as leakages or water contamination) using sensor measurements. However, data-driven methodologies do not always give accurate results and are often not trusted by operators, who may prefer to use their engineering judgment and experience to deal with such events.</div><div>In this work, we propose a framework for interpretable event diagnosis — an approach that assists the operators in associating the results of algorithmic event diagnosis methodologies with their own intuition and experience. This is achieved by providing contrasting (i.e., counterfactual) explanations of the results provided by fault diagnosis algorithms; their aim is to improve the understanding of the algorithm’s inner workings by the operators, thus enabling them to take a more informed decision by combining the results with their personal experiences. Specifically, we propose <em>counterfactual event fingerprints</em>, a representation of the difference between the current event diagnosis and the closest alternative explanation, which can be presented in a graphical way. 
The proposed methodology is applied and evaluated on a realistic use case using the L-Town benchmark.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200621"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145924620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
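The counterfactual idea behind the fingerprints can be sketched as a nearest-alternative search: find the smallest single-feature change that flips the diagnosis, and report the difference vector as the fingerprint. This is a toy stand-in for the paper's method; `diagnose`, the step size, and the search bounds are all hypothetical:

```python
import numpy as np

def counterfactual_fingerprint(diagnose, x, step=0.05, max_steps=100):
    """Find the closest single-feature change that alters diagnose(x).

    Scans each feature in both directions in increments of `step` and keeps
    the smallest move that flips the diagnosis. Returns (x_cf, x_cf - x);
    the difference vector is the 'fingerprint' separating the current
    diagnosis from its nearest alternative explanation.
    """
    label = diagnose(x)
    best = None
    for j in range(len(x)):
        for sign in (-1.0, 1.0):
            for k in range(1, max_steps + 1):
                cand = x.astype(float).copy()
                cand[j] += sign * k * step
                if diagnose(cand) != label:
                    if best is None or k * step < best[0]:
                        best = (k * step, cand)
                    break  # nearest flip along this direction found
    if best is None:
        return None, None
    x_cf = best[1]
    return x_cf, x_cf - x
```

A nonzero entry in the fingerprint tells the operator which sensor reading would have to move, and by how much, for the algorithm to report a different event, which is the graphical contrast the paper proposes to present.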
The Internet of Things is an enormous network of interrelated devices that makes intelligent interaction and high-level control possible in various environments, such as smart homes, smart cities, and industry, by collecting, processing, and transferring data. Most of the low-power devices in such networks run on limited energy sources, such as batteries, so energy management is a critical factor in system design and operation. Existing methods, such as reinforcement learning and evolutionary approaches, have at times provided some enhancements but have not been widely deployed across large systems because of their computational complexity and inability to adapt to changing environmental conditions. The growing number of IoT devices makes accurate energy-consumption prediction models all the more important. This research addresses the challenge with an energy usage management model based on Long Short-Term Memory (LSTM) networks. The model collects historical energy usage, activity schedules, and environmental factors such as temperature and humidity and, after preprocessing that includes noise removal and normalisation, predicts future energy consumption. The short-term memory handles scheduling data and environmental conditions, while the long-term memory lets the model identify more complex patterns in energy consumption over time, yielding more accurate predictions. Based on these predictions, smart sleep-wake policies put unnecessary devices into sleep mode and wake them only when needed, and adaptive learning algorithms help the system adjust to environmental conditions. Experimental results show that the proposed method reduces energy consumption by up to 58% and extends device lifetime by 30%, while predicting energy consumption with 95% accuracy.
{"title":"Optimisation of energy management in IoT devices using LSTM models: Energy consumption prediction with sleep-wake scheduling control","authors":"Nahideh DerakhshanFard, Asra Rajabi Bavil Olyaei, Fahimeh RashidJafari","doi":"10.1016/j.iswa.2025.200624","DOIUrl":"10.1016/j.iswa.2025.200624","url":null,"abstract":"<div><div>The Internet of Things is an enormous network of interrelated devices that makes intelligent interaction and high-level control possible in various environments, such as smart homes, smart cities, and industry, by collecting, processing, and transferring data. The majority of the low-power devices within the network utilize limited sources of energy, such as batteries, and hence energy management is a critical factor in the design and operation of the systems. Current methods, such as reinforcement and evolutionary approaches, have at times been found to provide some enhancements but lacked extensive implementation over broad systems due to computational complexity as well as their inability to adapt to changing environmental settings. The growing number of IoT devices presents challenges in energy management, making it crucial to develop accurate prediction models. This research aims to address this challenge by proposing a novel solution using Long Short-Term Memory (LSTM) networks for energy consumption forecasting. This work suggests an optimal energy usage management model based on Long Short-Term Memory networks. The model collects historical energy usage, activity scheduling, and environmental factors such as temperature and humidity. Following the preprocessing, which includes noise removal and normalisation, it predicts future energy consumption. Scheduling data and the analysis and processing of environmental conditions are done using the short-term memory, while the long-term memory helps the model identify more complex patterns in the energy consumption over time to make more accurate predictions. 
Based on this prediction, smart policies are made for going to sleep and waking up the devices, so that unnecessary devices are put into sleep mode and only woken up when needed. Adaptive learning algorithms also assist in adjusting to environmental conditions. Results of experiments show that the proposed method can save energy up to 58% and increase device lifetime by 30%, while the prediction of energy consumption has an accuracy of 95%.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200624"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146026200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
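The sleep-wake decision loop reduces to forecasting the next consumption reading and thresholding it. The sketch below uses a moving-average forecaster as a deliberately simple stand-in for the paper's LSTM; the function names, window, and threshold are illustrative assumptions:

```python
import numpy as np

def forecast_next(history, window=4):
    """Stand-in forecaster: mean of the last `window` consumption readings.
    The paper trains an LSTM here; the scheduling logic is the same."""
    h = np.asarray(history, dtype=float)
    return h[-window:].mean()

def schedule_device(history, sleep_threshold, window=4):
    """Thresholding step behind the sleep-wake policy: put the device to
    sleep when predicted consumption falls below `sleep_threshold`,
    otherwise keep it awake."""
    return "sleep" if forecast_next(history, window) < sleep_threshold else "wake"
```

Any forecaster with the same interface (history in, next value out) can be dropped into `schedule_device`, which is what makes the prediction model and the scheduling policy separable concerns.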
Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.iswa.2026.200632
Simone Bianco, Marco Buzzelli, Gianluigi Ciocca, Flavio Piccoli, Raimondo Schettini
Self-supervised learning has recently gained increasing attention in computer vision, enabling the extraction of rich and general-purpose feature representations without requiring large annotated datasets. In this paper, we aim to build a unified approach capable of deploying robust and effective analysis systems, replacing the need for multiple task-specific models trained end-to-end. Rather than introducing new architectures or training strategies, our goal is to systematically assess whether a single frozen self-supervised representation can support heterogeneous food-related tasks under realistic operating conditions. To this end, we performed an extensive analysis of DINOv2 features across multiple benchmark datasets and tasks, including food classification, segmentation, aesthetic assessment, and robustness to image distortions. In addition, we explore their capacity for continual learning by applying them to incremental food classification scenarios. Our findings reveal that DINOv2 features excel in many food-related applications. Their shared representations across tasks reduce the need to train separate models, while their strong generalization, high accuracy, and ability to handle complex multi-task scenarios make them a strong candidate for a unified food recognition approach. Specifically, DINOv2 features match or surpass state-of-the-art supervised methods in several food recognition tasks, while offering a simpler and more unified deployment strategy. Furthermore, they outperform end-to-end models in cross-dataset scenarios by up to +19.4% Top-1 accuracy and exhibit strong resilience to common image distortions, with up to a +48.0% difference in Top-1 accuracy robustness, ensuring reliable performance in real-world applications. On average across all considered tasks, the DINOv2-based unified evaluation outperforms the state of the art by approximately 2.8% and 5.4%, depending on the chosen model size, while using only 6.2% and 23.9% of the total number of model parameters, respectively.
{"title":"A study on the generalization of DINOv2 features for food recognition tasks: A unified evaluation framework","authors":"Simone Bianco, Marco Buzzelli, Gianluigi Ciocca, Flavio Piccoli, Raimondo Schettini","doi":"10.1016/j.iswa.2026.200632","DOIUrl":"10.1016/j.iswa.2026.200632","url":null,"abstract":"<div><div>Self-supervised learning has recently gained increasing attention in computer vision, enabling the extraction of rich and general-purpose feature representations without requiring large annotated datasets. In this paper we aim to build a unified approach capable of deploying robust and effective analysis systems, replacing the need for multiple task-specific models trained end-to-end. Rather than introducing new architectures or training strategies, our goal is to systematically assess whether a single frozen self-supervised representation can support heterogeneous food-related tasks under realistic operating conditions. To this end, we performed an extensive analysis of DINOv2 features across multiple benchmark datasets and tasks, including food classification, segmentation, aesthetic assessment, and robustness to image distortions. In addition, we explore its capacity for continual learning by applying it to incremental food classification scenarios. Our findings reveal that DINOv2 features excel in many food-related applications. Their shared representations across tasks reduce the need for training separate models, while their strong generalization, high accuracy, and ability to handle complex multi-task scenarios make them a strong candidate for a unified food recognition approach. Specifically, DINOv2 features match or surpass state-of-the-art supervised methods in several food recognition tasks, while offering a simpler and more unified deployment strategy. 
Furthermore, they outperform end-to-end models in cross-dataset scenarios by up to +19.4% Top-1 accuracy and exhibits strong resilience to common image distortions by up to +48.0% robustness in Top-1 accuracy percentual difference, ensuring reliable performance in real-world applications. On average across all considered tasks, the DINOv2-based unified evaluation outperforms the state of the art by approximately 2.8% and 5.4%, depending on the chosen model size, while using only 6.2% and 23.9% of the total number of model parameters, respectively.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200632"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
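Probing a frozen representation can be sketched with a nearest-class-mean head: the backbone features are taken as given (below, random Gaussian clusters stand in for DINOv2 embeddings) and only one mean vector per class is "trained", so no backbone weights ever update. All names and data are illustrative:

```python
import numpy as np

class NearestClassMean:
    """Lightweight head over frozen embeddings: store one mean vector per
    class at fit time, predict by cosine similarity to those means. This
    mirrors the frozen-feature setup: the representation is fixed and only
    the tiny head adapts to each food task."""

    def fit(self, feats, labels):
        feats, labels = np.asarray(feats, dtype=float), np.asarray(labels)
        self.classes = sorted(set(labels.tolist()))
        self.means = np.stack([feats[labels == c].mean(axis=0)
                               for c in self.classes])
        self.means /= np.linalg.norm(self.means, axis=1, keepdims=True)
        return self

    def predict(self, feats):
        f = np.asarray(feats, dtype=float)
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        return [self.classes[i] for i in (f @ self.means.T).argmax(axis=1)]
```

Because the head holds only one vector per class, adding a new food class is a single mean computation, which is also why this family of heads suits the incremental classification scenario the paper explores.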
Pub Date: 2026-03-01 | Epub Date: 2026-01-08 | DOI: 10.1016/j.iswa.2026.200628
Thanh Tung Luu, Duy An Huynh
Rolling bearing degradation produces vibration signatures that vary across operating conditions, posing challenges for reliable fault diagnosis. This study proposes an adaptive and lightweight diagnostic framework combining a Depthwise Separable Multi-Scale CNN (DSMSCNN) with Convolutional Block Attention Module (CBAM) and Spatial Pyramid Pooling (SPP) to extract fault-frequency invariant features across different mechanical domains. Wavelet-based time–frequency maps are utilized to suppress noise and preserve multi-resolution spectral characteristics. The multi-scale separable convolutions adaptively capture discriminative frequency patterns, while CBAM highlights informative spectral regions and SPP enhances scale robustness without fixed input sizes. Experiments on the CWRU and HUST bearing datasets demonstrate over 99 % accuracy with significantly fewer parameters than conventional CNNs. The results confirm that the proposed DSMSCNN-CBAM-SPP framework effectively captures invariant fault-frequency features, offering a compact and adaptive solution for intelligent bearing fault diagnosis and real-time predictive maintenance in a noisy environment.
{"title":"An efficient lightweight multi-scale CNN framework with CBAM and SPP for bearing fault diagnosis","authors":"Thanh Tung Luu , Duy An Huynh","doi":"10.1016/j.iswa.2026.200628","DOIUrl":"10.1016/j.iswa.2026.200628","url":null,"abstract":"<div><div>Rolling bearing degradation produces vibration signatures that vary across operating conditions, posing challenges for reliable fault diagnosis. This study proposes an adaptive and lightweight diagnostic framework combining a Depthwise Separable Multi-Scale CNN (DSMSCNN) with Convolutional Block Attention Module (CBAM) and Spatial Pyramid Pooling (SPP) to extract fault-frequency invariant features across different mechanical domains. Wavelet-based time–frequency maps are utilized to suppress noise and preserve multi-resolution spectral characteristics. The multi-scale separable convolutions adaptively capture discriminative frequency patterns, while CBAM highlights informative spectral regions and SPP enhances scale robustness without fixed input sizes. Experiments on the CWRU and HUST bearing datasets demonstrate over 99 % accuracy with significantly fewer parameters than conventional CNNs. 
The results confirm that the proposed DSMSCNN-CBAM-SPP framework effectively captures invariant fault-frequency features, offering a compact and adaptive solution for intelligent bearing fault diagnosis and real-time predictive maintenance in a noisy environment.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200628"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145976714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
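The parameter saving behind the "lightweight" claim is simple arithmetic: a depthwise separable convolution factorises a standard convolution into a per-channel spatial filter plus a 1x1 pointwise mix. A quick count (bias terms ignored; layer sizes below are illustrative, not the DSMSCNN configuration):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution: every output channel
    filters every input channel."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Weight count of the depthwise separable factorisation: one k x k
    filter per input channel, then a 1 x 1 pointwise convolution that
    mixes channels."""
    return c_in * k * k + c_in * c_out
```

For a 3x3 layer mapping 64 to 128 channels this is 73,728 versus 8,768 weights, roughly an 8.4x reduction, which is how the framework stays compact enough for real-time predictive maintenance.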
Pub Date: 2026-03-01 | Epub Date: 2026-01-15 | DOI: 10.1016/j.iswa.2025.200622
Pei Xue, Yuanchun Ye
We propose a deep reinforcement learning framework for dynamic portfolio optimization that combines a Dirichlet policy with cross-sectional attention mechanisms. The Dirichlet distribution enforces feasibility by construction, accommodates tradability masks, and provides a coherent geometry for exploration. Our architecture integrates per-asset temporal encoders with a global attention layer, allowing the policy to adaptively weight sectoral co-movements, factor spillovers, and other cross-asset dependencies. We evaluate the framework on a comprehensive S&P 500 panel from 2000 to 2025 using purged walk-forward backtesting to prevent look-ahead bias. Empirical results show that attention-enhanced Dirichlet policies deliver higher terminal wealth, Sharpe and Sortino ratios than equal-weight and reinforcement learning baselines, while maintaining realistic turnover and drawdown profiles. Our findings highlight that principled action parameterization and attention-based representation learning materially improve both the stability and interpretability of reinforcement learning methods for portfolio allocation.
{"title":"Attention-enhanced reinforcement learning for dynamic portfolio optimization","authors":"Pei Xue, Yuanchun Ye","doi":"10.1016/j.iswa.2025.200622","DOIUrl":"10.1016/j.iswa.2025.200622","url":null,"abstract":"<div><div>We propose a deep reinforcement learning framework for dynamic portfolio optimization that combines a Dirichlet policy with cross-sectional attention mechanisms. The Dirichlet distribution enforces feasibility by construction, accommodates tradability masks, and provides a coherent geometry for exploration. Our architecture integrates per-asset temporal encoders with a global attention layer, allowing the policy to adaptively weight sectoral co-movements, factor spillovers, and other cross-asset dependencies. We evaluate the framework on a comprehensive S&P 500 panel from 2000 to 2025 using purged walk-forward backtesting to prevent look-ahead bias. Empirical results show that attention-enhanced Dirichlet policies deliver higher terminal wealth, Sharpe and Sortino ratios than equal-weight and reinforcement learning baselines, while maintaining realistic turnover and drawdown profiles. Our findings highlight that principled action parameterization and attention-based representation learning materially improve both the stability and interpretability of reinforcement learning methods for portfolio allocation.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200622"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145976713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advancements in Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI) have revolutionised software engineering (SE), augmenting practitioners across the SE lifecycle. In this paper, we focus on the application of GenAI within data analytics (considered a subdomain of SE) to address the growing need for reliable, user-friendly tools that bridge the gap between human expertise and automated analytical processes. In our work, we transform a conventional API-based analytics platform into a set of tools that can be used by AI agents and formulate a process to facilitate communication between the data analyst, the agents, and the platform. The result is a chat-based interface that allows analysts to query and execute analytical workflows using natural language, thereby reducing cognitive overhead and technical barriers. To validate our approach, we instantiated the proposed framework with open-source models and achieved a mean overall score increase of 7.2% compared to other baselines.
{"title":"Generative AI for autonomous data analytics","authors":"Mattheos Fikardos , Katerina Lepenioti , Alexandros Bousdekis , Dimitris Apostolou , Gregoris Mentzas","doi":"10.1016/j.iswa.2026.200626","DOIUrl":"10.1016/j.iswa.2026.200626","url":null,"abstract":"<div><div>Recent advancements in Large Language Models (LLMs) and Generative Artificial Intelligence (GenAI) have revolutionised software engineering (SE), augmenting practitioners across the SE lifecycle. In this paper, we focus on the application of GenAI within data analytics—considered a subdomain of SE—to address the growing need for reliable, user-friendly tools that bridge the gap between human expertise and automated analytical processes. In our work, we transform a conventional API-based analytics platform into a set of tools that can be used by AI agents and formulate a process to facilitate the communication between the data analyst, the agents and the platform. The result is a chat-based interface that allows analysts to query and execute analytical workflows using natural language, thereby reducing cognitive overhead and technical barriers. To validate our approach, we instantiated the proposed framework with open-source models and achieved a mean overall score increase of 7.2 % compared to other baselines. 
Complementary user-study data demonstrate that the chat-based analytics interface yielded superior task efficiency and higher user preference scores compared to the traditional form-based baseline.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200626"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145924622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
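Turning an API-based platform into agent-callable tools can be sketched as schema-plus-dispatch: each analytics function is paired with a machine-readable description the agent sees, and a dispatcher executes the tool calls the agent emits. The schema shape and names below are illustrative, not the actual platform interface:

```python
def make_tool(fn, description):
    """Wrap a plain analytics function as an agent-callable tool: a
    JSON-style schema for the language model plus the callable for the
    runtime. Parameter names are read off the function signature."""
    n_args = fn.__code__.co_argcount
    return {
        "schema": {
            "name": fn.__name__,
            "description": description,
            "parameters": list(fn.__code__.co_varnames[:n_args]),
        },
        "call": fn,
    }

def dispatch(tools, request):
    """Execute one agent tool call of the form
    {'name': ..., 'arguments': {...}} against the registered tools."""
    tool = next(t for t in tools if t["schema"]["name"] == request["name"])
    return tool["call"](**request["arguments"])
```

The analyst's natural-language query goes to the agent along with every tool's `schema`; the agent replies with structured calls, and `dispatch` bridges them back to the platform's functions.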
Skeleton-based gait recognition has significantly improved due to the advent of graph convolutional networks (GCNs). Nevertheless, the classical ST-GCN has a key drawback: limited receptive fields fail to learn the global correlations of joints, restricting its ability to extract global dependencies effectively. To address this, we present the GSCTN method, a GCN and self-attention contemporary network with temporal convolution. This method combines GCN with a self-attention mechanism using a learnable weighted fusion. By combining local joint details from the GCN with the larger context from self-attention, GSCTN creates a strong representation of skeleton movements. Our approach uses decoupled self-attention (DSA) techniques that split the tightly coupled (TiC) SA module into two learnable components, unary and pairwise SA, to model joint relationships separately. The unary SA captures the extensive relationship between a single key joint and all other query joints. The pairwise SA captures local gait features from each pair of body joints. We also present a Depthwise Multi-scale Temporal Convolutional Network (DMS-TCN) that smoothly captures the temporal nature of joint movements and efficiently handles both short-term and long-term motion patterns. To boost the model's ability to fuse spatial and temporal joint features dynamically, we applied Global Aware Attention (GAA) to the GSCTN module. We tested our method on the OUMVLP-Pose, CASIA-B, and GREW datasets. The proposed method exhibits remarkable accuracy on the widely used CASIA-B dataset, with 97.9% for normal walking, 94.8% for carrying a bag, and 91.91% for clothing conditions.
{"title":"A GCN and Graph Self-Attention Contemporary Network with Temporal Depthwise Convolutions for Gait Recognition","authors":"Md. Khaliluzzaman , Kaushik Deb , Pranab Kumar Dhar , Tetsuya Shimamura","doi":"10.1016/j.iswa.2025.200625","DOIUrl":"10.1016/j.iswa.2025.200625","url":null,"abstract":"<div><div>Skeleton-based gait recognition has significantly improved due to the advent of graph convolutional networks (GCNs). Nevertheless, the classical ST-GCN has a key drawback: limited receptive fields fail to learn the global correlations of joints, restricting its ability to extract global dependencies effectively. To address this, we present the GSCTN method, a GCN and self-attention contemporary network with temporal convolution. This method combines GCN with a self-attention mechanism using a learnable weighted fusion. By combining local joint details from GCN with the larger context from self-attention, GSCTN creates a strong representation of skeleton movements. Our approach uses decoupled self-attention (DSA) techniques that fragment the tightly coupled (TiC) SA module into two learnable components, unary and pairwise SA, to model joint relationships separately. The unary SA shows an extensive relationship between the single key joint and all additional query joints. The paired SA captures the local gait features from each pair of body joints. We also present a Depthwise Multi-scale Temporal Convolutional Network (DMS-TCN) that smoothly captures the temporal nature of joint movements. DMS-TCN efficiently handles both short-term and long-term motion patterns. To boost the model’s ability to converge spatial and temporal joints dynamically, we applied Global Aware Attention (GAA) to the GSCTN module. We tested our method on the OUMVLP-Pose, CASIA-B, and GREW datasets. The suggested method exhibits remarkable accuracy on widely used CASIA-B datasets, with 97.9% for normal walking, 94.8% for carrying a bag, and 91.91% for clothing conditions. 
Meanwhile, the OUMVLP-Pose and GREW datasets exhibit a rank-1 accuracy of 93.5% and 75.7%, respectively. Our experimental results demonstrate that the proposed model is a holistic approach for gait recognition by utilizing GCN, DSA, and GAA with DMS-TCN to capture both inter-domain and spatial aspects of human locomotion.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200625"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145924624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
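The learnable weighted fusion of the GCN and self-attention branches can be sketched for a single layer over joint features: the GCN branch aggregates over the normalised skeleton adjacency (local detail), the attention branch looks at all joint pairs (global context), and a scalar weight blends them. Toy shapes, no temporal convolution; all weight matrices and `alpha` are assumed inputs, not the GSCTN parameterisation:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fused_joint_features(X, A, Wg, Wq, Wk, Wv, alpha):
    """Blend a one-layer GCN branch (local neighbourhoods via the
    normalised adjacency A) with a self-attention branch (global joint
    correlations) using a scalar fusion weight alpha in [0, 1].
    X has shape (joints, features)."""
    gcn = A @ X @ Wg                                            # local branch
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])       # joint-to-joint
    attn = softmax(scores) @ (X @ Wv)                           # global branch
    return alpha * gcn + (1.0 - alpha) * attn
```

With `alpha = 1.0` the layer degenerates to plain GCN and with `alpha = 0.0` to plain self-attention; learning `alpha` lets the network pick the mix per layer, which is the point of the weighted fusion.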
Pub Date: 2026-03-01 · Epub Date: 2026-02-16 · DOI: 10.1016/j.iswa.2026.200639
Ngo Van Son , Vo Viet Minh Nhat
Transformer models have achieved significant success in time series forecasting, but relying solely on endogenous information is often insufficient for high accuracy. To improve predictive performance, some systems draw on multiple data sources, adding factors not present in the primary series that influence the forecast. These additional factors, called exogenous variables, can enhance forecasting ability. We propose ExoFormer, a Transformer-based architecture that incorporates exogenous information for long-term time series forecasting. By leveraging relative cross-attention and a decoder-only design, ExoFormer efficiently models dependencies between endogenous and exogenous variables. Experiments on seven benchmark datasets demonstrate that ExoFormer consistently outperforms state-of-the-art models in both accuracy and computational efficiency.
{"title":"Exoformer: An improved transformer architecture for long-term time series forecasting based on multi-source data","authors":"Ngo Van Son , Vo Viet Minh Nhat","doi":"10.1016/j.iswa.2026.200639","DOIUrl":"10.1016/j.iswa.2026.200639","url":null,"abstract":"<div><div>Transformer models have achieved significant success in time series forecasting, but relying solely on endogenous information is often insufficient for high accuracy. To improve predictive performance, some systems draw on multiple data sources, adding factors not present in the primary series that influence the forecast. These additional factors, called exogenous variables, can enhance forecasting ability. We propose ExoFormer, a Transformer-based architecture that incorporates exogenous information for long-term time series forecasting. By leveraging relative cross-attention and a decoder-only design, ExoFormer efficiently models dependencies between endogenous and exogenous variables. Experiments on seven benchmark datasets demonstrate that ExoFormer consistently outperforms state-of-the-art models in both accuracy and computational efficiency.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200639"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147394844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
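Cross-attention between endogenous and exogenous series can be illustrated with a minimal sketch. The relative-position bias shown here is one common formulation and the function names, shapes, and indexing scheme are assumptions; the paper's exact relative cross-attention may differ.

```python
import numpy as np

def softmax(scores):
    # Numerically stable row-wise softmax
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

def relative_cross_attention(endo, exo, Wq, Wk, Wv, rel_bias):
    """Endogenous time steps (queries) attend over exogenous steps
    (keys/values), injecting exogenous context into the forecast stream.

    rel_bias holds one learnable scalar per query-key offset i - j;
    offsets -(Tk-1)..(Tq-1) are shifted to index 0..Tq+Tk-2.
    """
    Q, K, V = endo @ Wq, exo @ Wk, exo @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    offs = np.arange(len(endo))[:, None] - np.arange(len(exo))[None, :]
    scores = scores + rel_bias[offs + len(exo) - 1]
    return softmax(scores) @ V   # one context vector per endogenous step
```

The output has one row per endogenous step, so the enriched representation can be fed straight into the decoder-only stack alongside the endogenous features.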
We evaluate a personalized, two-stage comparison-based FER framework on two datasets of low-to-mid-intensity, near-neutral expressions. The framework consistently outperforms FaceReader and Py-Feat. On the natural-transition younger-adult dataset (Dataset A, n = 9), mean accuracy is 90.22% ± 3.53%, with within-subject median gains of +16.46 percentage points (pp) over FaceReader (95% CI [+11.33, +33.90], p = 0.00195, r = 1.00) and +8.17 pp over Py-Feat (95% CI [+3.39, +21.58], p = 0.00195, r = 1.00). On the older-adult dataset (Dataset B, n = 78), mean accuracy is 75.58% ± 9.04%, exceeding FaceReader by +15.47 pp (95% CI [+13.44, +17.21], p = 2.77 × 10⁻¹⁴, r = 0.980) and Py-Feat by +17.67 pp (95% CI [+15.13, +19.34], p = 3.02 × 10⁻¹⁴, r = 0.985). Component analyses are above chance on both datasets (B-stage medians 92.90% and 99.51%), and polarity-specific asymmetries emerge in the C-stage (A: positive > negative, Δ = +4.23 pp, two-sided p = 0.0273; B: negative > positive, Δ = -7.72 pp, p = 0.00442). On a subset of Dataset A emphasizing subtle transitions, the system maintains [78.61%, 85.38%] accuracy where human annotation accuracy ranges [50.00%, 71.47%]. Grad-CAM highlights the eyebrow, forehead, and mouth regions, consistent with expressive cues. Collectively, these findings demonstrate statistically significant and practically meaningful advantages for low-to-mid-intensity expression recognition and intensity ranking.
{"title":"Personalized two-stage comparison-based framework for low-to-mid-intensity facial expression recognition in real-world scenarios","authors":"Junyao Zhang , Kei Shimonishi , Kazuaki Kondo , Yuichi Nakamura","doi":"10.1016/j.iswa.2026.200627","DOIUrl":"10.1016/j.iswa.2026.200627","url":null,"abstract":"<div><div>We evaluate a personalized, two-stage comparison-based FER framework on two datasets of low-to-mid-intensity, near-neutral expressions. The framework consistently outperforms FaceReader and Py-Feat. On the natural-transition younger-adult dataset (Dataset A, <em>n</em> = 9), mean accuracy is 90.22% ± 3.53%, with within-subject median gains of +16.46 percentage points (pp) over FaceReader (95% CI [+11.33, +33.90], <em>p</em> = 0.00195, <em>r</em> = 1.00) and +8.17 pp over Py-Feat (95% CI [+3.39, +21.58], <em>p</em> = 0.00195, <em>r</em> = 1.00). On the older adults dataset (Dataset B, <em>n</em> = 78), mean accuracy is 75.58% ± 9.04%, exceeding FaceReader by +15.47 pp (95% CI [+13.44, +17.21], <em>p</em> = 2.77 × 10<sup>–14</sup>, <em>r</em> = 0.980) and Py-Feat by +17.67 pp (95% CI [+15.13, +19.34], <em>p</em> = 3.02 × 10<sup>–14</sup>, <em>r</em> = 0.985). Component analyses are above chance on both datasets (B-stage medians 92.90% and 99.51%), and polarity-specific asymmetries emerge in the C-stage (A: positive > negative, Δ = +4.23 pp, two-sided <em>p</em> = 0.0273; B: negative > positive, Δ = -7.72 pp, <em>p</em> = 0.00442). On a subset of Dataset A emphasizing subtle transitions, the system maintains [78.61%, 85.38%] accuracy where human annotation accuracy ranges [50.00%, 71.47%]. Grad-CAM highlights eyebrows, forehead, and mouth regions consistent with expressive cues. 
Collectively, these findings demonstrate statistically significant and practically meaningful advantages for low-to-mid-intensity expression recognition and intensity ranking.</div></div>","PeriodicalId":100684,"journal":{"name":"Intelligent Systems with Applications","volume":"29 ","pages":"Article 200627"},"PeriodicalIF":4.3,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146026199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
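The within-subject gains and effect sizes r reported in the abstract are standard paired statistics. A minimal sketch of the median paired gain and the rank-biserial r follows; the p-value itself would come from a Wilcoxon signed-rank test (e.g. scipy.stats.wilcoxon), which is omitted here, and tie handling is simplified relative to the usual averaged-rank convention.

```python
import numpy as np

def paired_gain_stats(ours, baseline):
    """Median within-subject gain (in pp) and rank-biserial effect size r.

    r = (sum of ranks of positive differences - sum of ranks of negative
    differences) / total rank sum, so r = 1.0 means the proposed system
    wins for every subject. Simplifications: zero differences are dropped
    and tied |d| values are not assigned averaged ranks.
    """
    d = np.asarray(ours, float) - np.asarray(baseline, float)
    d = d[d != 0.0]
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # 1-based ranks of |d|
    r = (ranks[d > 0].sum() - ranks[d < 0].sum()) / ranks.sum()
    return float(np.median(d)), float(r)
```

When the framework beats a baseline for every participant, as in Dataset A, the function returns r = 1.0, matching the reported effect size.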