Pub Date: 2026-01-03 | DOI: 10.1016/j.patrec.2026.01.003
O. Taylan Turan , Marco Loog , David M.J. Tax
Learning curves show the expected performance with respect to training set size. They are often used to evaluate and compare models, tune hyper-parameters, and determine how much data is needed to reach a specific performance. However, the distributional properties of performance along learning curves are frequently overlooked; generally, only an average with a standard error or standard deviation is reported. In this paper, we analyze the distributions of generalization performance along learning curves. We compile a high-fidelity learning curve database, both with respect to training set size and with respect to repetitions of the sampling for a fixed training set size. Our investigation reveals that generalization performance rarely follows a Gaussian distribution for classical classifiers, regardless of dataset balance, loss function, sampling method, or hyper-parameter tuning along the learning curves. Furthermore, we show that the choice of statistical summary (the mean versus measures such as quantiles) affects the top model rankings. Our findings highlight the importance of considering different statistical measures and of using non-parametric approaches when evaluating and selecting machine learning models with learning curves.
{"title":"Generalization performance distributions along learning curves","authors":"O. Taylan Turan , Marco Loog , David M.J. Tax","doi":"10.1016/j.patrec.2026.01.003","DOIUrl":"10.1016/j.patrec.2026.01.003","url":null,"abstract":"<div><div>Learning curves show the expected performance with respect to training set size. This is often used to evaluate and compare models, tune hyper-parameters and determine how much data is needed for a specific performance. However, the distributional properties of performance are frequently overlooked on learning curves. Generally, only an average with standard error or standard deviation is used. In this paper, we analyze the distributions of generalization performance on the learning curves. We compile a high-fidelity learning curve database, both with respect to training set size and repetitions of the sampling for a fixed training set size. Our investigation reveals that generalization performance rarely follows a Gaussian distribution for classical classifiers, regardless of dataset balance, loss function, sampling method, or hyper-parameter tuning along learning curves. Furthermore, we show that the choice of statistical summary, mean versus measures like quantiles affect the top model rankings. Our findings highlight the importance of considering different statistical measures and use of non-parametric approaches when evaluating and selecting machine learning models with learning curves.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"201 ","pages":"Pages 29-36"},"PeriodicalIF":3.3,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145940475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-03 | DOI: 10.1016/j.patrec.2026.01.002
Huali Yang , Junjie Hu , Tao Huang , Shengze Hu , Wang Gao , Zhuoran Xu , Jing Geng
Accurate recognition of students’ knowledge states is critical for personalized education in intelligent education systems. Knowledge tracing (KT) has emerged as an important research domain for tracing students’ knowledge states through the analysis of learning trajectory data. However, existing KT methods tend to overlook the hierarchical nature of memory, resulting in incomplete memory transfer. To address this issue, this study proposes a novel hierarchical memory-enhanced knowledge tracing (HMEKT) method that models the hierarchical structure of memory. HMEKT consists of three modules: shallow memory, deep memory, and performance prediction. Specifically, in the shallow memory module, learning and forgetting mechanisms are used to simulate memory growth and decay, capturing the dynamic changes in knowledge states. In the deep memory module, a dynamic memory matrix stores the student’s core knowledge system, and enhancement and reduction gates control how shallow memory is transferred into deep memory. Finally, to predict student performance, the knowledge states relevant to future questions are aggregated from the knowledge system matrix. Experiments on four datasets demonstrate the effectiveness of the model, with a 1.99% AUC gain on Assistment2017 compared to state-of-the-art methods.
{"title":"Hierarchical memory-enhanced networks for student knowledge tracing","authors":"Huali Yang , Junjie Hu , Tao Huang , Shengze Hu , Wang Gao , Zhuoran Xu , Jing Geng","doi":"10.1016/j.patrec.2026.01.002","DOIUrl":"10.1016/j.patrec.2026.01.002","url":null,"abstract":"<div><div>Accurate recognition of students’ knowledge states is critical for personalized education in the field of intelligent education. Knowledge tracing (KT) has emerged as an important research domain for tracing students’ knowledge states using the analysis of learning trajectory data. However, existing KT methods tend to overlook the hierarchical nature of memory, resulting in incomplete memory transfer. To address this issue, this study proposes a novel hierarchical memory-enhanced knowledge tracing (HMEKT) method that models the hierarchical structure of memory. HMEKT consists of three modules: shallow memory, deep memory, and performance prediction. Specifically, in the shallow memory module, learning and forgetting mechanisms are used to simulate memory growth and decay, capturing the dynamic changes in knowledge states. In the deep memory module, a dynamic memory matrix is used to store the student’s core knowledge system, transferring shallow memory into deep memory through enhancement and reduction gates that control memory transfer. Finally, for predicting student performance, relevant knowledge states are aggregated from the knowledge system matrix for future questions. Experiments on four datasets demonstrate the effectiveness of the model, with a 1.99% AUC gain on Assistment2017 compared to state-of-the-art methods.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"201 ","pages":"Pages 37-44"},"PeriodicalIF":3.3,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145940476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-27 | DOI: 10.1016/j.patrec.2025.12.014
Cheng Qian , Jiwu Cao , Ying Mao , Ruotian Zhang , Fei Long , Jun Sang
Text-guided object counting aims to estimate the number of objects described by natural language within complex visual scenes. However, existing approaches often struggle to align textual intent with diverse visual patterns, especially when target objects vary in scale, appearance, or context.
To address these limitations, we propose Frequency-Selective CountNet (FSCNet), a novel framework that integrates spatial and frequency-domain features for precise text-guided counting. FSCNet introduces a Triple-Stream Attention Fusion Module (TSAFM) that combines textual, global, and local visual features. Additionally, an Adaptive Frequency Selector (AFS) dynamically emphasizes frequency components by separately modulating the magnitude and phase spectra, preserving geometric consistency during decoding.
Extensive experiments on the FSC-147 and CARPK datasets demonstrate that FSCNet achieves state-of-the-art performance, outperforming previous best methods by 18.34% in MAE and 27.41% in RMSE on FSC-147 (Avg.) and by 5.17%/7.58% on CARPK.
{"title":"Frequency-selective countnet: Enhancing text-guided object counting with frequency features","authors":"Cheng Qian , Jiwu Cao , Ying Mao , Ruotian Zhang , Fei Long , Jun Sang","doi":"10.1016/j.patrec.2025.12.014","DOIUrl":"10.1016/j.patrec.2025.12.014","url":null,"abstract":"<div><div>Text-guided object counting aims to estimate the number of objects described by natural language within complex visual scenes. However, existing approaches often struggle to align textual intent with diverse visual patterns, especially when target objects vary in scale, appearance, or context.</div><div>To address these limitations, we propose Frequency-Selective CountNet (FSCNet), a novel framework that integrates spatial and frequency-domain features for precise text-guided counting. FSCNet introduces a Triple-Stream Attention Fusion Module (TSAFM) that combines textual, global, and local visual features. Additionally, an Adaptive Frequency Selector (AFS) dynamically emphasizes frequency components by separately modulating the magnitude and phase spectra, preserving geometric consistency during decoding.</div><div>Extensive experiments on the FSC-147 and CARPK datasets demonstrate that FSCNet achieves state-of-the-art performance, outperforming previous best methods by 18.34% in MAE and 27.41% in RMSE on FSC-147 (Avg.) and by 5.17%/7.58% on CARPK.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"201 ","pages":"Pages 15-21"},"PeriodicalIF":3.3,"publicationDate":"2025-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145940474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-26 | DOI: 10.1016/j.patrec.2025.12.013
Qun Li , Jiru He , Tiancheng Guo , Xinping Gao , Bir Bhanu
Recent advances in Mixture of Experts (MoE) have improved the representational capacity of Vision Transformer (ViT), but most existing methods remain constrained to token-level routing or homogeneous expert scaling, overlooking the diverse representation requirements across different layers and the parameter redundancy within attention modules. To address these problems, we propose PE-ViT, a novel parameter-efficient architecture that integrates the Dimension-adaptive Mixture of Experts (DMoE) and the Selective and Shared Attention (SSA) mechanisms to improve both computational efficiency and model performance. Specifically, DMoE adaptively allocates expert dimensions through layer-wise representation analysis and incorporates shared experts to enhance parameter utilization, while SSA reduces the parameter overhead of attention by dynamically selecting attention heads and sharing query-key projections. Experimental results demonstrate that PE-ViT consistently outperforms existing MoE methods across multiple benchmark datasets.
{"title":"PE-ViT: Parameter-efficient vision transformer with dimension-adaptive experts and economical attention","authors":"Qun Li , Jiru He , Tiancheng Guo , Xinping Gao , Bir Bhanu","doi":"10.1016/j.patrec.2025.12.013","DOIUrl":"10.1016/j.patrec.2025.12.013","url":null,"abstract":"<div><div>Recent advances in Mixture of Experts (MoE) have improved the representational capacity of Vision Transformer (ViT), but most existing methods remain constrained to token-level routing or homogeneous expert scaling, overlooking the diverse representation requirements across different layers and the parameter redundancy within attention modules. To address these problems, we propose PE-ViT, a novel parameter-efficient architecture that integrates the Dimension-adaptive Mixture of Experts (DMoE) and the Selective and Shared Attention (SSA) mechanisms to improve both computational efficiency and model performance. Specifically, DMoE adaptively allocates expert dimensions through layer-wise representation analysis and incorporates shared experts to enhance parameter utilization, while SSA reduces the parameter overhead of attention by dynamically selecting attention heads and sharing query-key projections. Experimental results demonstrate that PE-ViT consistently outperforms existing MoE methods across multiple benchmark datasets.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"200 ","pages":"Pages 135-141"},"PeriodicalIF":3.3,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-25 | DOI: 10.1016/j.patrec.2025.12.012
Yanru Pan, Benchong Li
The Natarajan dimension is a crucial metric for measuring the capacity of a learning model and for analyzing the generalization ability of a classifier in multi-class classification tasks. In this paper, we present a tight upper bound on the Natarajan dimension of linear multi-class predictors based on the class-sensitive feature mapping used in the multi-vector construction, and we provide the exact Natarajan dimension when the feature dimension is 2.
{"title":"Bounds on the Natarajan dimension of a class of linear multi-class predictors","authors":"Yanru Pan, Benchong Li","doi":"10.1016/j.patrec.2025.12.012","DOIUrl":"10.1016/j.patrec.2025.12.012","url":null,"abstract":"<div><div>The Natarajan dimension is a crucial metric for measuring the capacity of a learning model and analyzing generalization ability of a classifier in multi-class classification tasks. In this paper, we present a tight upper bound of Natarajan dimension for linear multi-class predictors based on class sensitive feature mapping for multi-vector construction, and provide the exact Natarajan dimension when the dimension of feature is 2.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"200 ","pages":"Pages 129-134"},"PeriodicalIF":3.3,"publicationDate":"2025-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-25 | DOI: 10.1016/j.patrec.2025.12.010
Jingang Wang , Tong Xiao , Hui Du , Cheng Zhang , Peng Liu
Cross-domain detection of AI-generated text is a crucial task for cybersecurity. In practical scenarios, after being trained on one or multiple known text generation sources (source domain), a detection model must be capable of effectively identifying text generated by unknown and unseen sources (target domain). Current approaches suffer from limited cross-domain generalization due to insufficient structural adaptation to domain discrepancies. To address this critical limitation, we propose RiDis, a classification model that synergizes Linguistic Richness and Lexical Pair Dispersion for cross-domain AI-generated text detection. Through comprehensive statistical analysis, we establish Linguistic Richness and Lexical Pair Dispersion as discriminative indicators for distinguishing human-authored from machine-generated texts. Our architecture features two innovative components: a Semantic Coherence Extraction Module employing long-range receptive fields to capture linguistic richness through global semantic trend analysis, and a Contextual Dependency Extraction Module utilizing localized receptive fields to quantify lexical pair dispersion via fine-grained word association patterns. The framework further incorporates domain adaptation learning to enhance cross-domain detection robustness. Extensive evaluations demonstrate that our method achieves superior detection accuracy compared to state-of-the-art baselines across multiple domains, with significant performance improvements in cross-domain test scenarios.
{"title":"Cross-Domain detection of AI-Generated text: Integrating linguistic richness and lexical pair dispersion via deep learning","authors":"Jingang Wang , Tong Xiao , Hui Du , Cheng Zhang , Peng Liu","doi":"10.1016/j.patrec.2025.12.010","DOIUrl":"10.1016/j.patrec.2025.12.010","url":null,"abstract":"<div><div>Cross-domain detection of AI-generated text is a crucial task for cybersecurity. In practical scenarios, after being trained on one or multiple known text generation sources (source domain), a detection model must be capable of effectively identifying text generated by unknown and unseen sources (target domain). Current approaches suffer from limited cross-domain generalization due to insufficient structural adaptation to domain discrepancies. To address this critical limitation, we propose <strong>RiDis</strong>,a classification model that synergizes Linguistic <strong>Ri</strong>chness and Lexical Pair <strong>Dis</strong>persion for cross-domain AI-generated text detection. Through comprehensive statistical analysis, we establish Linguistic Richness and Lexical Pair Dispersion as discriminative indicators for distinguishing human-authored and machine-generated texts. Our architecture features two innovative components, a Semantic Coherence Extraction Module employing long-range receptive fields to capture linguistic richness through global semantic trend analysis, and a Contextual Dependency Extraction Module utilizing localized receptive fields to quantify lexical pair dispersion via fine-grained word association patterns. The framework further incorporates domain adaptation learning to enhance cross-domain detection robustness. Extensive evaluations demonstrate that our method achieves superior detection accuracy compared to state-of-the-art baselines across multiple domains, with experimental results showing significant performance improvements on cross-domain test scenarios.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"200 ","pages":"Pages 123-128"},"PeriodicalIF":3.3,"publicationDate":"2025-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-25 | DOI: 10.1016/j.patrec.2025.12.011
Qingshuo Sun , Guorui Sheng , Xiangyi Zhu , Jingru Song , Yongqiang Song , Tao Yao , Haiyang Wang , Lili Wang
Food image recognition based on deep learning plays a crucial role in the field of food computing. However, its high demand for computing resources limits deployment on end devices and hinders intelligent diet and nutrition management. To address this issue, we aim to balance computational efficiency with recognition accuracy and propose a compact food image recognition model named Lightweight Inter-Group Food Recognition Net (LIFR-Net) that combines a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). In LIFR-Net, a lightweight ViT module called the Lightweight Inter-group Transformer (LIT) is designed, and a lightweight component named the Feature Grouping Transformer is constructed, which efficiently extracts local and global features of food images while keeping the number of parameters and the computational complexity low. In addition, by shuffling and fusing irregularly grouped feature maps, information exchange among channels is enhanced and recognition accuracy is improved. Extensive experiments on three commonly used public food image recognition datasets, namely ETHZ Food-101, Vireo Food-172, and UEC Food-256, show that LIFR-Net achieves recognition accuracies of 90.49%, 91.04%, and 74.23%, respectively, with fewer parameters and lower computational cost.
{"title":"LIFR-Net: A lightweight hybrid neural network with feature grouping for efficient food image recognition","authors":"Qingshuo Sun , Guorui Sheng , Xiangyi Zhu , Jingru Song , Yongqiang Song , Tao Yao , Haiyang Wang , Lili Wang","doi":"10.1016/j.patrec.2025.12.011","DOIUrl":"10.1016/j.patrec.2025.12.011","url":null,"abstract":"<div><div>Food image recognition based on deep learning plays a crucial role in the field of food computing. However, its high demand for computing resources limits its deployment on end devices and fails to effectively achieve intelligent diet and nutrition management. To address this issue, we aim to balance computational efficiency with recognition accuracy and propose a compact food image recognition model named Lightweight Inter-Group Food Recognition Net (LIFR-Net) that combines Convolutional Neural Network (CNN) and Vision Transformer (ViT). In LIFR-Net, a lightweight ViT module called Lightweight Inter-group Transformer (LIT) is designed, and a lightweight component named Feature Grouping Transformer is constructed, which can efficiently extract local and global features of food images and optimize the number of parameters and computational complexity. In addition, by shuffling and fusing irregularly grouped feature maps, the information exchange among channels is enhanced, and the recognition accuracy of the model is improved. Extensive experiments on three commonly used public food image recognition datasets, namely ETHZ Food–101, Vireo Food–172, and UEC Food–256, show that LIFR-Net achieves recognition accuracies of 90.49%, 91.04%, and 74.23% with lower numbers of parameters and computational amounts.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"201 ","pages":"Pages 22-28"},"PeriodicalIF":3.3,"publicationDate":"2025-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145940477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-24 | DOI: 10.1016/j.patrec.2025.12.008
Edoardo Coppola , Mattia Savardi , Alberto Signoroni
Accurate 3D segmentation of multiple sclerosis lesions is critical for clinical practice, yet existing approaches face key limitations: many models rely on 2D architectures or partial modality combinations, while others struggle to generalise across scanners and protocols. Although large-scale, multi-site training can improve robustness, its data demands are often prohibitive. To address these challenges, we propose a 3D multi-modal network that simultaneously processes T1-weighted, T2-weighted, and FLAIR scans, leveraging full cross-modal interactions and volumetric context to achieve state-of-the-art performance across four diverse public datasets. To tackle data scarcity, we quantify the minimal fine-tuning effort needed to adapt to individual unseen datasets and reformulate the few-shot learning paradigm at an “instance-per-dataset” level (rather than the traditional “instance-per-class” level), enabling the quantification of the minimal fine-tuning effort needed to adapt to multiple unseen sources simultaneously. Finally, we introduce Latent Distance Analysis, a novel label-free reliability estimation technique that anticipates potential distribution shifts and supports any form of test-time adaptation, thereby strengthening robustness at low cost and reinforcing physicians’ trust.
{"title":"Towards robust and reliable multi-modal 3D segmentation of multiple sclerosis lesions","authors":"Edoardo Coppola , Mattia Savardi , Alberto Signoroni","doi":"10.1016/j.patrec.2025.12.008","DOIUrl":"10.1016/j.patrec.2025.12.008","url":null,"abstract":"<div><div>Accurate 3D segmentation of multiple sclerosis lesions is critical for clinical practice, yet existing approaches face key limitations: many models rely on 2D architectures or partial modality combinations, while others struggle to generalise across scanners and protocols. Although large-scale, multi-site training can improve robustness, its data demands are often prohibitive. To address these challenges, we propose a 3D multi-modal network that simultaneously processes T1-weighted, T2-weighted, and FLAIR scans, leveraging full cross-modal interactions and volumetric context to achieve state-of-the-art performance across four diverse public datasets. To tackle data scarcity, we quantify the <em>minimal</em> fine-tuning effort needed to adapt to individual unseen datasets and reformulate the few-shot learning paradigm at an “instance-per-dataset” level (rather than traditional “instance-per-class”), enabling the quantification of the <em>minimal</em> fine-tuning effort to adapt to <em>multiple</em> unseen sources simultaneously. Finally, we introduce <em>Latent Distance Analysis</em>, a novel label-free reliability estimation technique that anticipates potential distribution shifts and supports any form of test-time adaptation, thereby strengthening efficient robustness and physicians’ trust.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"200 ","pages":"Pages 115-122"},"PeriodicalIF":3.3,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-24 | DOI: 10.1016/j.patrec.2025.12.007
Hyunseo Kim, Longbin Jin, Eun Yi Kim
Diagnosing depression is critical due to its profound impact on individuals and associated risks. Although deep learning techniques like convolutional neural networks and transformers have been employed to detect depression, they require large, labeled datasets and substantial computational resources, posing challenges in data-scarce environments. We introduce p-DREAM (Prompt-Driven Reprogramming Exploiting Audio Mapping), a novel and data-efficient model designed to diagnose depression from speech data alone. The p-DREAM combines two main strategies: data augmentation and model reprogramming. First, it utilizes audio-specific data augmentation techniques to generate a richer set of training examples. Next, it employs audio prompts to aid in domain adaptation. These prompts guide a frozen pre-trained transformer, which extracts meaningful features. Finally, these features are fed into a lightweight classifier for prediction. The p-DREAM outperforms traditional fine-tuning and linear probing methods, while requiring only a small number of trainable parameters. Evaluations on three benchmark datasets (DAIC-WoZ, E-DAIC, and AVEC 2014) demonstrate consistent improvements. In particular, p-DREAM achieves a leading macro F1 score of 0.7734 using only acoustic features. We further conducted ablation studies on prompt length, position, and initialization, confirming their importance in effective model adaptation. p-DREAM offers a practical and privacy-conscious approach for speech-based depression assessment in low-resource environments. To promote reproducibility and community adoption, we plan to release our codebase in compliance with the ethical protocols outlined in the AVEC challenges.
{"title":"Audio prompt driven reprogramming for diagnosing major depressive disorder","authors":"Hyunseo Kim, Longbin Jin, Eun Yi Kim","doi":"10.1016/j.patrec.2025.12.007","DOIUrl":"10.1016/j.patrec.2025.12.007","url":null,"abstract":"<div><div>Diagnosing depression is critical due to its profound impact on individuals and associated risks. Although deep learning techniques like convolutional neural networks and transformers have been employed to detect depression, they require large, labeled datasets and substantial computational resources, posing challenges in data-scarce environments. We introduce p-DREAM (Prompt-Driven Reprogramming Exploiting Audio Mapping), a novel and data-efficient model designed to diagnose depression from speech data alone. The p-DREAM combines two main strategies: data augmentation and model reprogramming. First, it utilizes audio-specific data augmentation techniques to generate a richer set of training examples. Next, it employs audio prompts to aid in domain adaptation. These prompts guide a frozen pre-trained transformer, which extracts meaningful features. Finally, these features are fed into a lightweight classifier for prediction. The p-DREAM outperforms traditional fine-tuning and linear probing methods, while requiring only a small number of trainable parameters. Evaluations on three benchmark datasets (DAIC-WoZ, E-DAIC, and AVEC 2014) demonstrate consistent improvements. In particular, p-DREAM achieves a leading macro F1 score of 0.7734 using only acoustic features. We further conducted ablation studies on prompt length, position, and initialization, confirming their importance in effective model adaptation. p-DREAM offers a practical and privacy-conscious approach for speech-based depression assessment in low-resource environments. To promote reproducibility and community adoption, we plan to release our codebase in compliance with the ethical protocols outlined in the AVEC challenges.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"201 ","pages":"Pages 1-8"},"PeriodicalIF":3.3,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-24 | DOI: 10.1016/j.patrec.2025.12.009
Hangfan Liu , Bo Li , Yiran Li , Manuel Taso , Dylan Tisdall , Yulin Chang , John A Detre , Ze Wang
Arterial spin labeling (ASL) perfusion MRI stands as the sole non-invasive method to quantify regional cerebral blood flow (CBF), a crucial physiological parameter. However, ASL MRI typically suffers from a relatively low signal-to-noise ratio. In this study, we introduce a novel ASL denoising approach termed Multi-coil Unified Sparsity regularization using Inter-slice Correlation (MUSIC). While MRI, including ASL data, is routinely captured using multi-channel coils, existing denoising techniques are tailored for coil-combined data, overlooking inherent multi-channel correlations. MUSIC capitalizes on the fact that multi-channel images are primarily distinguished by coil sensitivity weighting and random noise, resulting in an intrinsic low-rank structure within the stacked multi-channel data matrix. This low-rank structure can be further enhanced by grouping highly correlated slices. Our approach adapts the regularization to each slice individually, forming potentially low-rank matrices by stacking vectorized slices selected from different channels based on their Euclidean distance from the current slice under processing. Matrix rank is then approximated using the log-determinant of the covariance matrix. Importantly, MUSIC operates directly on complex data, eliminating the need for separating magnitude and phase or dividing real and imaginary data, thereby minimizing information loss. The degree of low-rank regularization is controlled by the estimated noise level, achieving a balance between noise reduction and texture preservation. Experimental validation on real-world imaging data demonstrates the efficacy of MUSIC in significantly enhancing ASL perfusion quality. By effectively suppressing noise while retaining essential textural information, MUSIC holds promise for improving the utility and accuracy of ASL perfusion MRI, thus advancing neuroimaging research and clinical diagnoses.
{"title":"MUSIC: Multi-coil unified sparsity regularization using inter-slice correlation for arterial spin labeling MRI denoising","authors":"Hangfan Liu , Bo Li , Yiran Li , Manuel Taso , Dylan Tisdall , Yulin Chang , John A Detre , Ze Wang","doi":"10.1016/j.patrec.2025.12.009","DOIUrl":"10.1016/j.patrec.2025.12.009","url":null,"abstract":"<div><div>Arterial spin labeling (ASL) perfusion MRI stands as the sole non-invasive method to quantify regional cerebral blood flow (CBF), a crucial physiological parameter. However, ASL MRI typically suffers from a relatively low signal-to-noise ratio. In this study, we introduce a novel ASL denoising approach termed Multi-coil Unified Sparsity regularization using Inter-slice Correlation (MUSIC). While MRI, including ASL data, is routinely captured using multi-channel coils, existing denoising techniques are tailored for coil-combined data, overlooking inherent multi-channel correlations. MUSIC capitalizes on the fact that multi-channel images are primarily distinguished by coil sensitivity weighting and random noise, resulting in an intrinsic low-rank structure within the stacked multi-channel data matrix. This low rankness can be further enhanced by grouping highly correlated slices. Our approach involves adapting regularization to each slice individually, forming potentially low-rank matrices by stacking vectorized slices selected from different channels based on their Euclidean distance from the current slice under processing. Matrix rank is then approximated using the logarithm-determinant of the covariance matrix. Importantly, MUSIC operates directly on complex data, eliminating the need for separating magnitude and phase or dividing real and imaginary data, thereby minimizing information loss. The degree of low-rank regularization is controlled by the estimated noise level, achieving a balance between noise reduction and texture preservation. Experimental validation on real-world imaging data demonstrates the efficacy of MUSIC in significantly enhancing ASL perfusion quality. By effectively suppressing noise while retaining essential textural information, MUSIC holds promise for improving the utility and accuracy of ASL perfusion MRI, thus advancing neuroimaging research and clinical diagnoses.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"200 ","pages":"Pages 142-148"},"PeriodicalIF":3.3,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145938874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}