SAFE AI metrics: An integrated approach
Pub Date: 2025-12-15 | DOI: 10.1016/j.mlwa.2025.100821
Paolo Giudici, Vasily Kolesnikov
We contribute to the field of AI governance with the development of a unified compliance metric that integrates three key dimensions of SAFE Artificial Intelligence: Security, Accuracy, and Explainability. While these aspects are typically assessed in isolation, the proposed approach integrates them into a single, interpretable metric grounded in a consistent mathematical structure. To develop an integrated framework, the outputs of machine learning models are evaluated under three risk dimensions that correspond to different input data perturbations: data removal (for accuracy), data poisoning (for security), and feature removal (for explainability). Experiments on both real and simulated datasets show that the integrated metric improves compliance monitoring and enables a consistent evaluation of AI risks.
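The abstract does not spell out the aggregation, so the sketch below only illustrates the general pattern: score each SAFE dimension as performance retention under its perturbation, then combine the three scores (here with an unweighted mean, an assumption). The toy data, model, and AUC choice are likewise illustrative, not the authors' formulation.

```python
# Minimal sketch (NOT the paper's exact metric): each SAFE dimension is
# scored as AUC retention under its perturbation; the unweighted mean as
# the final compliance number is an assumption for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0)

def retention(perturb):
    """AUC after a perturbation, relative to the unperturbed baseline AUC."""
    base = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    Xp_tr, yp_tr, Xp_te = perturb(X_tr, y_tr, X_te)
    pert = roc_auc_score(y_te, model.fit(Xp_tr, yp_tr).predict_proba(Xp_te)[:, 1])
    return float(np.clip(pert / base, 0.0, 1.0))

def data_removal(Xt, yt, Xe):      # accuracy: train on a random 60% subset
    idx = rng.choice(len(yt), size=int(0.6 * len(yt)), replace=False)
    return Xt[idx], yt[idx], Xe

def data_poisoning(Xt, yt, Xe):    # security: flip 10% of training labels
    flip = rng.random(len(yt)) < 0.10
    return Xt, np.where(flip, 1 - yt, yt), Xe

def feature_removal(Xt, yt, Xe):   # explainability: drop one feature everywhere
    return Xt[:, 1:], yt, Xe[:, 1:]

scores = {"accuracy": retention(data_removal),
          "security": retention(data_poisoning),
          "explainability": retention(feature_removal)}
safe_metric = float(np.mean(list(scores.values())))  # assumed aggregation
print(scores, round(safe_metric, 3))
```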
{"title":"SAFE AI metrics: An integrated approach","authors":"Paolo Giudici, Vasily Kolesnikov","doi":"10.1016/j.mlwa.2025.100821","DOIUrl":"10.1016/j.mlwa.2025.100821","url":null,"abstract":"<div><div>We contribute to the field of AI governance with the development of a unified compliance metric that integrates three key dimensions of SAFE Artificial Intelligence: Security, Accuracy, and Explainability. While these aspects are typically assessed in isolation, the proposed approach integrates them into a single and interpretable metric, grounded in a consistent mathematical structure. To develop an integrated framework, the outputs of machine learning models are evaluated under three risk dimensions, that correspond to different input data perturbations: data removal (for accuracy); data poisoning (for security); and feature removal (for explainability). The experimentation of the methodology on both real and simulated datasets shows that the integrated metric improves compliance monitoring and enables a consistent evaluation of AI risks.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100821"},"PeriodicalIF":4.9,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive multi-domain uncertainty quantification for digital twin water forecasting
Pub Date: 2025-12-13 | DOI: 10.1016/j.mlwa.2025.100812
Mohammadhossein Homaei , Mehran Tarif , Pablo García Rodríguez , Mar Ávila , Andrés Caro
Machine learning (ML) models are often used to predict demand in digital twins (DTs) of water distribution systems (WDS). However, most models do not provide uncertainty estimates, which limits risk evaluation. In this work, we introduce the first systematic framework for hierarchical uncertainty transfer in regional water networks, a setting for which no prior DT method existed. We propose Adaptive Multi-Village Conformal Prediction (AMV-CP), a method that preserves theoretical guarantees while allowing uncertainty information to transfer between villages that are structurally similar but operationally different. The main ideas are: (i) village-adaptive conformity scores that capture local patterns, (ii) a meta-learning algorithm that reduces calibration cost by 88.6%, and (iii) regime-aware calibration that maintains 94.2% coverage across seasonal changes. We use eight years of data from six villages with 6174 users in one regional network. The results establish a theoretical basis for cross-village transfer and show 95.1% empirical coverage (target: 95%), with real-time throughput of 120 predictions per second. Early multi-step tests also show 93.7% coverage for 24-hour horizons, with controlled trade-offs. This framework is the first systematic method for controlled uncertainty transfer in infrastructure DTs, with theoretical guarantees under ϕ-mixing and practical deployment. Our multi-village tests demonstrate the value of meta-learning for uncertainty estimation and provide a base method that can be reused in other hierarchical infrastructure systems. The system is validated in a Mediterranean rural network, but generalization to other climates, urban settings, and cascading systems needs further empirical study.
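For orientation, a plain split-conformal baseline with a separate calibration quantile per village (the kind of starting point AMV-CP adapts) can be sketched as follows; the demand values, α = 0.05, and two-village setup are toy assumptions, not the paper's algorithm.

```python
# Per-village split conformal prediction sketch (a plain baseline, not
# AMV-CP itself): each village gets its own calibration quantile, giving
# ~(1 - alpha) marginal coverage per village on exchangeable data.
import numpy as np

def calibrate(abs_residuals: np.ndarray, alpha: float = 0.05) -> float:
    """Finite-sample conformal quantile of absolute calibration residuals."""
    n = len(abs_residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # rank giving the validity guarantee
    return float(np.sort(abs_residuals)[min(k, n) - 1])

def predict_interval(point_forecast: float, q: float):
    return point_forecast - q, point_forecast + q

rng = np.random.default_rng(0)
village_q = {}
for name, noise in [("village_a", 1.0), ("village_b", 3.0)]:
    y_cal = rng.normal(50, noise, size=500)           # held-out demand (toy)
    y_hat = y_cal + rng.normal(0, noise, size=500)    # imperfect forecasts
    village_q[name] = calibrate(np.abs(y_cal - y_hat), alpha=0.05)

lo, hi = predict_interval(52.0, village_q["village_b"])
print(f"95% interval for village_b: [{lo:.1f}, {hi:.1f}]")
```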
{"title":"Adaptive multi-domain uncertainty quantification for digital twin water forecasting","authors":"Mohammadhossein Homaei , Mehran Tarif , Pablo García Rodríguez , Mar Ávila , Andrés Caro","doi":"10.1016/j.mlwa.2025.100812","DOIUrl":"10.1016/j.mlwa.2025.100812","url":null,"abstract":"<div><div>Machine learning (ML) models are often used to predict demand in digital twins (DTs) of water distribution systems (WDS). However, most models do not provide uncertainty estimation, and this makes risk evaluation limited. In this work, we introduce the first systematic framework for hierarchical uncertainty transfer in regional water networks, because until now no method existed for DT of regional water systems. We propose Adaptive Multi-Village Conformal Prediction (AMV-CP), a method that keeps theoretical guarantees and also allows transfer of uncertainty information between villages that are similar in structure but different in operation. The main ideas are: (i) village-adaptive conformity scores that capture local patterns, (ii) a meta-learning algorithm that reduces calibration cost by 88.6%, and (iii) regime-aware calibration that keeps 94.2% coverage when seasons change. We use eight years of data from six villages with 6174 users in one regional network. The results show a theoretical basis for cross-village transfer and 95.1% empirical coverage (target was 95%), with real-time speed of 120 predictions per second. Early multi-step tests also show 93.7% coverage for 24-hour horizons, with controlled trade-offs. This framework is the first systematic method for controlled uncertainty transfer in infrastructure DTs, with theoretical guarantees under <span><math><mi>ϕ</mi></math></span>-mixing and practical deployment. Our multi-village tests demonstrate the value of meta-learning for uncertainty estimation and make a base method that can be used in other hierarchical infrastructure systems. The system is validated in a Mediterranean rural network, but generalization to other climates, urban settings, and cascading systems needs further empirical study.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100812"},"PeriodicalIF":4.9,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DefMoN: A reproducible framework for theory-grounded synthetic data generation in affective AI
Pub Date: 2025-12-13 | DOI: 10.1016/j.mlwa.2025.100817
Ryan SangBaek Kim
<div><div>Modern NLP systems excel when labels are abundant but struggle with <em>high-inference</em> constructs that are costly to annotate and risky to synthesize without constraints. We present the <em>Defensive Motivational Node</em> framework, henceforth DefMoN (formerly DMN), which <em>operationalizes</em> Vaillant’s hierarchy of ego defenses and Plutchik’s psychoevolutionary emotions into a controllable generative process for text. We release <strong>DMN-Syn v1.0</strong> — a quadri-lingual (EN/KO/FR/KA) corpus of 300 theory-constrained utterances — together with a complete, versioned research compendium (data, code, seeds, QC manifests, and evaluation scripts) archived on Zenodo (Kim, 2025). The full package is permanently available at <span><span>https://doi.org/10.5281/zenodo.17101927</span><svg><path></path></svg></span>.</div><div>On the modeling side, we treat defense recognition as 10-way sentence classification and fine-tune a multilingual Transformer (XLM-R) <em>only</em> on DMN-Syn v1. In-domain performance is high (EN macro-<span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>97</mn></mrow></math></span>, MCC <span><math><mrow><mo>=</mo><mn>0</mn><mo>.</mo><mn>96</mn></mrow></math></span>; KO macro-<span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>96</mn></mrow></math></span>), and zero-shot transfer is strong (EN<span><math><mo>→</mo></math></span>KO macro-<span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>81</mn></mrow></math></span>). When evaluated on a small, anonymized real-world benchmark, the model reaches Macro <span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>62</mn></mrow></math></span> with <em>zero</em> real training data, then rises to 0.76 with only <span><math><mrow><mi>k</mi><mo>=</mo><mn>64</mn></mrow></math></span> supervised examples per class. Human annotators on that same benchmark agree with each other at <span><math><mrow><mi>κ</mi><mo>=</mo><mn>0</mn><mo>.</mo><mn>68</mn></mrow></math></span>, <span><math><mrow><mi>α</mi><mo>=</mo><mn>0</mn><mo>.</mo><mn>66</mn></mrow></math></span>. This shows that DefMoN is not a turnkey classifier, but a <em>theory-grounded primer</em> that enables data-efficient adaptation toward human-level ambiguity <em>theory-grounded primer</em> that enables data-efficient alignment to <em>schema-coded human benchmark</em> without large-scale annotation. We additionally quantify <em>reliability</em>—reporting ECE/MCE and coverage–performance curves for selective prediction—and show robustness under group-aware splits (template/scenario disjoint) and cue ablations, establishing structural coherence in the absence of large-scale human trials.</div><div>Beyond raw scores, we foreground <em>auditability</em>. Each instance in <strong>DMN-Syn v1</st
现代NLP系统在标签丰富时表现出色,但在注释成本高且不受约束的合成风险大的高推理结构中挣扎。我们提出了防御性动机节点框架,即DefMoN(以前的DMN),它将Vaillant的自我防御层次和Plutchik的心理进化情感运作为一个可控的文本生成过程。我们发布了DMN-Syn v1.0——一个四语(EN/KO/FR/KA)语料库,包含300个理论约束的话语——以及一个完整的、版本化的研究纲要(数据、代码、种子、QC清单和评估脚本),存档在Zenodo (Kim, 2025)上。完整的软件包可以在https://doi.org/10.5281/zenodo.17101927.On建模端永久获得,我们将防御识别视为10路句子分类,并仅在DMN-Syn v1上微调多语言转换器(XLM-R)。域内性能高(EN macro-F1=0.97, MCC =0.96; KO macro-F1=0.96),零射转移强(EN→KO macro-F1=0.81)。当在一个小的、匿名的真实世界基准上进行评估时,模型在零真实训练数据的情况下达到Macro F1=0.62,然后在每个类只有k=64个监督样本的情况下上升到0.76。在相同的基准上,人类注释者在κ=0.68, α=0.66时彼此一致。这表明DefMoN不是一个交钥匙分类器,而是一个基于理论的引物,它可以实现对人类级别歧义的数据高效适应;基于理论的引物可以实现对模式编码的人类基准的数据高效校准,而无需大规模注释。我们还量化了可靠性报告ECE/MCE和覆盖率-性能曲线,以进行选择性预测,并显示了在群体意识分裂(模板/场景脱节)和线索消融下的稳健性,在没有大规模人体试验的情况下建立了结构一致性。除了原始分数,我们还强调可审计性。DMN-Syn v1中的每个实例都具有固定的种子,分组分裂和护栏,以防止标签泄漏和构造漂移;发布验证器、清单和代码以进行字节精确复制。结果支持理论约束合成作为昂贵的专家标记和无约束的LLM生成之间的实用中间路径,特别是对于低资源和跨语言设置。通过使用心理学理论作为明确的生成约束,而不是事后解释,DefMoN将合成数据工作重新定义为机器学习结构的操作化。该框架(i)标准化护栏,以最大限度地减少偏差放大和漂移,(ii)提供小但理论密集的语料库,训练具有不确定性意识的可靠分类器,以及(iii)提供可审计的工件(种子、清单、验证器),使其能够重现并扩展到新的防御、语言和对话级别设置。术语和品牌。在之前的版本和存储库中,我们使用首字母缩略词DMN来表示防御性动机节点。为了避免与神经科学的“默认模式网络”混淆,本文采用首字母缩略词DefMoN来表示整个框架。保留了遗留数据集和存储库名称(例如,DMN-Syn v1),以保持先前工作的连续性和可重复性。因此,在整篇论文中,我们使用:DefMoN作为整体框架和方法,DMN- syn v1用于发布的数据集及其工件,“DMN节点”用于指该数据集中的单个(防御、情感、场景)元组。这种分裂是有意的。
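For readers unfamiliar with the modeling setup, fine-tuning XLM-R for 10-way sentence classification follows the standard Hugging Face pattern sketched below; the example utterances, label ids, and hyperparameters are illustrative assumptions, not the paper's recipe (which is archived in the Zenodo compendium).

```python
# Standard XLM-R fine-tuning sketch for 10-way sentence classification
# (toy data and illustrative hyperparameters; not the paper's exact recipe).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["I never wanted that promotion anyway.",
         "Let's focus on the facts and make a plan."]
labels = [3, 7]   # hypothetical ids for two of the ten defense classes

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=10)

ds = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda b: tok(b["text"], truncation=True, padding="max_length",
                  max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="defmon-xlmr", num_train_epochs=5,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5, seed=42),
    train_dataset=ds,
)
trainer.train()
```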
{"title":"DefMoN: A reproducible framework for theory-grounded synthetic data generation in affective AI","authors":"Ryan SangBaek Kim","doi":"10.1016/j.mlwa.2025.100817","DOIUrl":"10.1016/j.mlwa.2025.100817","url":null,"abstract":"<div><div>Modern NLP systems excel when labels are abundant but struggle with <em>high-inference</em> constructs that are costly to annotate and risky to synthesize without constraints. We present the <em>Defensive Motivational Node</em> framework, henceforth DefMoN (formerly DMN), which <em>operationalizes</em> Vaillant’s hierarchy of ego defenses and Plutchik’s psychoevolutionary emotions into a controllable generative process for text. We release <strong>DMN-Syn v1.0</strong> — a quadri-lingual (EN/KO/FR/KA) corpus of 300 theory-constrained utterances — together with a complete, versioned research compendium (data, code, seeds, QC manifests, and evaluation scripts) archived on Zenodo (Kim, 2025). The full package is permanently available at <span><span>https://doi.org/10.5281/zenodo.17101927</span><svg><path></path></svg></span>.</div><div>On the modeling side, we treat defense recognition as 10-way sentence classification and fine-tune a multilingual Transformer (XLM-R) <em>only</em> on DMN-Syn v1. In-domain performance is high (EN macro-<span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>97</mn></mrow></math></span>, MCC <span><math><mrow><mo>=</mo><mn>0</mn><mo>.</mo><mn>96</mn></mrow></math></span>; KO macro-<span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>96</mn></mrow></math></span>), and zero-shot transfer is strong (EN<span><math><mo>→</mo></math></span>KO macro-<span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>81</mn></mrow></math></span>). When evaluated on a small, anonymized real-world benchmark, the model reaches Macro <span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>62</mn></mrow></math></span> with <em>zero</em> real training data, then rises to 0.76 with only <span><math><mrow><mi>k</mi><mo>=</mo><mn>64</mn></mrow></math></span> supervised examples per class. Human annotators on that same benchmark agree with each other at <span><math><mrow><mi>κ</mi><mo>=</mo><mn>0</mn><mo>.</mo><mn>68</mn></mrow></math></span>, <span><math><mrow><mi>α</mi><mo>=</mo><mn>0</mn><mo>.</mo><mn>66</mn></mrow></math></span>. This shows that DefMoN is not a turnkey classifier, but a <em>theory-grounded primer</em> that enables data-efficient adaptation toward human-level ambiguity <em>theory-grounded primer</em> that enables data-efficient alignment to <em>schema-coded human benchmark</em> without large-scale annotation. We additionally quantify <em>reliability</em>—reporting ECE/MCE and coverage–performance curves for selective prediction—and show robustness under group-aware splits (template/scenario disjoint) and cue ablations, establishing structural coherence in the absence of large-scale human trials.</div><div>Beyond raw scores, we foreground <em>auditability</em>. 
Each instance in <strong>DMN-Syn v1</st","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100817"},"PeriodicalIF":4.9,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning and the geometry of compactness in stability and generalization
Pub Date: 2025-12-13 | DOI: 10.1016/j.mlwa.2025.100820
Mohammad Meysami , Ali Lotfi , Sehar Saleem
Deep learning models often continue to generalize well even when they have far more parameters than available training examples. This observation naturally leads to two questions: why does training remain stable, and why do the resulting predictors generalize at all? To address these questions, we return to the classical Extreme Value Theorem and interpret modern training as optimization over compact sets in parameter space or function space. Our main results show that continuity, together with coercive or Lipschitz-based regularization, gives existence of minimizers and uniform control of the excess risk by bounding rare high-loss events. We apply this framework to weight decay, gradient penalties, and spectral normalization, and we introduce simple diagnostics that monitor compactness in parameter space, representation space, and function space. Experiments on synthetic examples, standard image datasets (MNIST, CIFAR-10, Tiny ImageNet), and the UCI Adult tabular task are consistent with the theory: mild regularization leads to smoother optimization, reduced variation across random seeds, and better robustness and calibration while preserving accuracy. Taken together, these results highlight compactness as a practical geometric guideline for training stable and reliable deep networks.
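As a concrete illustration of the regularizers discussed, the PyTorch sketch below combines weight decay with spectral normalization and logs a simple parameter-space compactness diagnostic; the architecture, toy data, and hyperparameters are assumptions, not the paper's experimental setup.

```python
# Sketch: coercivity via weight decay, Lipschitz control via spectral
# normalization, plus a parameter-space compactness diagnostic.
# Model size and hyperparameters are illustrative only.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

model = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),  # caps the layer's spectral norm at ~1
    nn.ReLU(),
    spectral_norm(nn.Linear(256, 10)),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

def param_norm(m: nn.Module) -> float:
    """Diagnostic: if this stays bounded, iterates remain in a compact ball."""
    return torch.sqrt(sum((p ** 2).sum() for p in m.parameters())).item()

x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))  # toy batch
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 20 == 0:
        print(step, f"loss={loss.item():.3f}", f"||theta||={param_norm(model):.1f}")
```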
{"title":"Deep learning and the geometry of compactness in stability and generalization","authors":"Mohammad Meysami , Ali Lotfi , Sehar Saleem","doi":"10.1016/j.mlwa.2025.100820","DOIUrl":"10.1016/j.mlwa.2025.100820","url":null,"abstract":"<div><div>Deep learning models often continue to generalize well even when they have far more parameters than available training examples. This observation naturally leads to two questions: why does training remain stable, and why do the resulting predictors generalize at all? To address these questions, we return to the classical Extreme Value Theorem and interpret modern training as optimization over compact sets in parameter space or function space. Our main results show that continuity together with coercive or Lipschitz based regularization gives existence of minimizers and uniform control of the excess risk, by bounding rare high loss events. We apply this framework to weight decay, gradient penalties, and spectral normalization, and we introduce simple diagnostics that monitor compactness in parameter space, representation space, and function space. Experiments on synthetic examples, standard image data sets (MNIST, CIFAR ten, Tiny ImageNet), and the UCI Adult tabular task are consistent with the theory: mild regularization leads to smoother optimization, reduced variation across random seeds, and better robustness and calibration while preserving accuracy. Taken together, these results highlight compactness as a practical geometric guideline for training stable and reliable deep networks.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100820"},"PeriodicalIF":4.9,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spectrogram-Based Deep Learning Models for Acoustic Identification of Honey Bees in Complex Environmental Noises
Pub Date: 2025-12-12 | DOI: 10.1016/j.mlwa.2025.100807
Muhammad Anus Khan , Bilal Hassan Khan , Shafiq ur Rehman Khan , Ali Raza , Asif Raza , Shehzad Ashraf Chaudhry
The rapid decline of honey bee populations presents an urgent ecological and agricultural concern, demanding innovative and scalable monitoring solutions. This study proposes a deep learning-based system for non-invasive classification of honey bee buzzing sounds to distinguish bee activity from complex environmental noise—a fundamental challenge for real-world acoustic monitoring. Traditional machine learning models using features like Mel Frequency Cepstral Coefficients (MFCCs) and spectral statistics performed well on curated datasets but failed under natural conditions due to overlapping acoustic signatures and inconsistent recordings.
To address this gap, we built a diverse dataset combining public bee audio with recordings from the Honeybee Research Center at the National Agricultural Research Centre (NARC), Pakistan, captured with a variety of devices in natural environments. Audio signals were converted into mel spectrograms and chromagrams, enabling pattern learning via pre-trained convolutional neural networks. Among the tested architectures (EfficientNetB0, ResNet50, and MobileNetV2), MobileNetV2 achieved the best generalization, with 95.29% accuracy on spectrograms and over 90% on chromagrams under an 80% confidence threshold.
Data augmentation improved robustness to noise, while transfer learning enhanced adaptability. This work forms part of a broader project to develop a mobile application for real-time hive health monitoring in natural environments, where distinguishing bee buzzing from other sounds is the crucial first step. Beyond binary classification, the proposed approach offers potential for detecting hive health issues through acoustic patterns, supporting early interventions and contributing to global bee conservation efforts.
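For reference, converting a clip to a mel spectrogram and classifying it with a pre-trained MobileNetV2 follows the common transfer-learning pattern below; the librosa parameters, input size, two-class head, and file name are illustrative assumptions, not the study's pipeline.

```python
# Sketch: mel-spectrogram front end + MobileNetV2 transfer learning for
# bee / non-bee classification. Parameters are illustrative, not the paper's.
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

def to_mel_image(wav_path: str, sr: int = 22050) -> torch.Tensor:
    y, sr = librosa.load(wav_path, sr=sr, duration=5.0)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)              # log scale
    mel_db = (mel_db - mel_db.min()) / (np.ptp(mel_db) + 1e-8)  # 0-1 normalize
    img = torch.tensor(mel_db, dtype=torch.float32)
    img = img.unsqueeze(0).repeat(3, 1, 1)                     # 3 channels for CNN
    return torch.nn.functional.interpolate(
        img.unsqueeze(0), size=(224, 224), mode="bilinear").squeeze(0)

model = mobilenet_v2(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Linear(model.last_channel, 2)         # bee vs. noise head
model.eval()

with torch.no_grad():
    logits = model(to_mel_image("clip.wav").unsqueeze(0))      # placeholder file
    prob_bee = torch.softmax(logits, dim=1)[0, 1].item()
print("P(bee) =", round(prob_bee, 3))   # accept only above e.g. the 0.80 threshold
```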
{"title":"Spectrogram-Based Deep Learning Models for Acoustic Identification of Honey Bees in Complex Environmental Noises","authors":"Muhammad Anus Khan , Bilal Hassan Khan , Shafiq ur Rehman Khan , Ali Raza , Asif Raza , Shehzad Ashraf Chaudhry","doi":"10.1016/j.mlwa.2025.100807","DOIUrl":"10.1016/j.mlwa.2025.100807","url":null,"abstract":"<div><div>The rapid decline of honey bee populations presents an urgent ecological and agricultural concern, demanding innovative and scalable monitoring solutions. This study proposes a deep learning-based system for non-invasive classification of honey bee buzzing sounds to distinguish bee activity from complex environmental noise—a fundamental challenge for real-world acoustic monitoring. Traditional machine learning models using features like Mel Frequency Cepstral Coefficients (MFCCs) and spectral statistics performed well on curated datasets but failed under natural conditions due to overlapping acoustic signatures and inconsistent recordings.</div><div>To address this gap, we built a diverse dataset combining public bee audio with recordings from the Honeybee Research Center at the National Agricultural Research Centre (NARC), Pakistan, capturing various devices and natural environments. Audio signals were converted into mel spectrograms and chromograms, enabling pattern learning via pre-trained convolutional neural networks. Among tested architectures—EfficientNetB0, ResNet50, and MobileNetV2—MobileNetV2 achieved the highest generalization, with 95.29% accuracy on spectrograms and over 90% on chromograms under an 80% confidence threshold.</div><div>Data augmentation improved robustness to noise, while transfer learning enhanced adaptability. This work forms part of a broader project to develop a mobile application for real-time hive health monitoring in natural environments, where distinguishing bee buzzing from other sounds is the crucial first step. Beyond binary classification, the proposed approach offers potential for detecting hive health issues through acoustic patterns, supporting early interventions and contributing to global bee conservation efforts.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100807"},"PeriodicalIF":4.9,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A hybrid DEA–fuzzy clustering approach for accurate reference set identification
Pub Date: 2025-12-09 | DOI: 10.1016/j.mlwa.2025.100818
Sara Fanati Rashidi , Maryam Olfati , Seyedali Mirjalili , Crina Grosan , Jan Platoš , Vaclav Snášel
This study integrates Data Envelopment Analysis (DEA) with Machine Learning (ML) to address key limitations of traditional DEA in identifying reference sets for inefficient Decision-Making Units (DMUs). In DEA, inefficient units are evaluated against benchmark units; however, some benchmarks may be inappropriate or even outliers, which can distort the efficiency frontier. Moreover, when a new DMU is added, the entire model must be recalculated, resulting in high computational costs for large datasets. To overcome these issues, we propose a hybrid approach that combines Fuzzy C-Means (FCM) and Possibilistic Fuzzy C-Means (PFCM) clustering. By leveraging Euclidean distance and membership degrees, the method identifies closer and more relevant reference units, while a sensitivity threshold is introduced to control the number of benchmarks according to practical requirements. The effectiveness of the proposed method is validated on two datasets: a banking dataset and a banknote authentication dataset with 1,372 samples. Results show that the reference sets derived from this ML-based framework achieve 71.6%–98.3% agreement with DEA, while overcoming two major drawbacks: (1) sensitivity to dataset size and (2) inclusion of inappropriate reference units. Furthermore, statistical analyses, including confidence intervals and McNemar’s test, confirm the robustness and practical significance of the findings.
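As background for the clustering step, the sketch below implements plain fuzzy C-means (Bezdek's alternating updates); the PFCM extension and the DEA reference-set logic of the paper are not reproduced here.

```python
# Plain fuzzy C-means sketch (standard alternating updates). The PFCM
# variant and the DEA coupling from the paper are not included.
import numpy as np

def fcm(X, c=3, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))     # memberships, rows sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        inv = d ** (-2.0 / (m - 1.0))              # u_ik proportional to d_ik^(-2/(m-1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 0.3, size=(50, 2)) for mu in (0.0, 2.0, 4.0)])
U, centers = fcm(X)
print(centers.round(2))    # cluster prototypes
print(U[0].round(2))       # first unit's membership degrees across clusters
```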
{"title":"A hybrid DEA–fuzzy clustering approach for accurate reference set identification","authors":"Sara Fanati Rashidi , Maryam Olfati , Seyedali Mirjalili , Crina Grosan , Jan Platoš , Vaclav Snášel","doi":"10.1016/j.mlwa.2025.100818","DOIUrl":"10.1016/j.mlwa.2025.100818","url":null,"abstract":"<div><div>This study integrates Data Envelopment Analysis (DEA) with Machine Learning (ML) to address key limitations of traditional DEA in identifying reference sets for inefficient Decision-Making Units (DMUs). In DEA, inefficient units are evaluated against benchmark units; however, some benchmarks may be inappropriate or even outliers, which can distort the efficiency frontier. Moreover, when a new DMU is added, the entire model must be recalculated, resulting in high computational costs for large datasets. To overcome these issues, we propose a hybrid approach that combines Fuzzy C-Means (FCM) and Possibilistic Fuzzy C-Means (PFCM) clustering. By leveraging Euclidean distance and membership degrees, the method identifies closer and more relevant reference units, while a sensitivity threshold is introduced to control the number of benchmarks according to practical requirements. The effectiveness of the proposed method is validated on two datasets: a banking dataset and a banknote authentication dataset with 1,372 samples. Results show that the reference sets derived from this ML-based framework achieve 71.6%–98.3% agreement with DEA, while overcoming two major drawbacks: (1) sensitivity to dataset size and (2) inclusion of inappropriate reference units. Furthermore, statistical analyses, including confidence intervals and McNemar’s test, confirm the robustness and practical significance of the findings.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100818"},"PeriodicalIF":4.9,"publicationDate":"2025-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic discovery of robust risk groups from limited survival data across biomedical modalities
Pub Date: 2025-12-08 | DOI: 10.1016/j.mlwa.2025.100814
Ethar Alzaid , George Wright , Mark Eastwood , Piotr Keller , Fayyaz Minhas
Survival prediction from medical data is often constrained by scarce labels, limiting the effectiveness of fully supervised models. In addition, most existing approaches produce deterministic risk scores without conveying reliability, which hinders interpretability and clinical trustworthiness. To address these challenges, we introduce T-SURE, a transductive survival ranking and risk-stratification framework that learns jointly from labeled and unlabeled patients to reduce dependence on large annotated cohorts. It also estimates a rejection score that identifies high-uncertainty cases, enabling selective abstention when confidence is low. T-SURE generates a single risk score that enables (1) patient ranking based on survival risk, (2) automatic assignment to risk groups, and (3) optional rejection of uncertain predictions. We extensively evaluated the model on pan-cancer datasets from The Cancer Genome Atlas (TCGA), using gene expression profiles, whole slide images, pathology reports, and clinical information. The model outperformed existing approaches in both ranking and risk stratification, especially in the limited labeled-data regime. It also showed consistent performance improvements as uncertain samples were rejected, while maintaining statistically significant stratification across datasets. T-SURE integrates as a reliable component within computational pathology pipelines by guiding risk-specific therapeutic and monitoring decisions and flagging ambiguous or rare cases, via a high rejection score, for further investigation. To support reproducibility, the full implementation of T-SURE is publicly available at: (Anonymized).
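To make the abstention mechanism concrete, the sketch below shows the generic reject-then-evaluate pattern: drop the highest-uncertainty cases, then compute a ranking metric (lifelines' concordance index) on the rest. The risk and rejection scores are simulated; this is not T-SURE's transductive procedure.

```python
# Generic selective-abstention sketch: reject the most uncertain patients,
# then evaluate ranking quality on the remainder. All scores are simulated.
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 300
true_risk = rng.normal(size=n)
times = np.exp(-true_risk + rng.normal(0, 0.5, size=n))   # higher risk -> earlier event
events = rng.random(n) < 0.7                              # ~30% censoring
pred_risk = true_risk + rng.normal(0, 0.8, size=n)        # noisy model scores
rejection = np.abs(pred_risk - true_risk) + rng.normal(0, 0.1, n)  # toy uncertainty proxy

for frac in (0.0, 0.1, 0.3):
    keep = rejection <= np.quantile(rejection, 1 - frac)  # abstain on top `frac`
    ci = concordance_index(times[keep], -pred_risk[keep], events[keep])
    print(f"reject {frac:.0%}: C-index = {ci:.3f} on {keep.sum()} patients")
```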
{"title":"Automatic discovery of robust risk groups from limited survival data across biomedical modalities","authors":"Ethar Alzaid , George Wright , Mark Eastwood , Piotr Keller , Fayyaz Minhas","doi":"10.1016/j.mlwa.2025.100814","DOIUrl":"10.1016/j.mlwa.2025.100814","url":null,"abstract":"<div><div>Survival prediction from medical data is often constrained by scarce labels, limiting the effectiveness of fully supervised models. In addition, most existing approaches produce deterministic risk scores without conveying reliability, which hinders interpretability and clinical trustworthiness. To address these challenges, we introduce T-SURE, a transductive survival ranking and risk-stratification framework that learns jointly from labeled and unlabeled patients to reduce dependence on large annotated cohorts. It also estimates a rejection score that identifies high-uncertainty cases, enabling selective abstention when confidence is low. T-SURE generates a single risk score that enables (1) patient ranking based on survival risk, (2) automatic assignment to risk groups, and (3) optional rejection of uncertain predictions. We extensively evaluated the model on pan-cancer datasets from The Cancer Genome Atlas (TCGA), using gene expression profiles, whole slide images, pathology reports, and clinical information. The model outperformed existing approaches in both ranking and risk stratification, especially in the limited labeled data regimen. It also showed consistent improvements in performance as uncertain samples were rejected, while maintaining statistically significant stratification across datasets. T-SURE integrates as a reliable component within computational pathology pipelines by guiding risk-specific therapeutic and monitoring decisions and flagging ambiguous or rare cases via a high rejection score for further investigation. To support reproducibility, the full implementation of T-SURE is publicly available at: (Anonymized).</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100814"},"PeriodicalIF":4.9,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145748776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging multimodal large language models to extract mechanistic insights from biomedical visuals: A case study on COVID-19 and neurodegenerative diseases
Pub Date: 2025-12-06 | DOI: 10.1016/j.mlwa.2025.100816
Elizaveta Popova , Marc Jacobs , Martin Hofmann-Apitius , Negin Sadat Babaiha
Background
The COVID-19 pandemic has intensified concerns about its long-term neurological impact, with evidence linking SARS-CoV-2 infection to neurodegenerative diseases (NDDs) such as Alzheimer’s (AD) and Parkinson’s (PD). Patients with these conditions face higher risk of severe COVID-19 outcomes and may experience accelerated cognitive or motor decline following infection. Proposed mechanisms—including neuroinflammation, blood–brain barrier (BBB) disruption, and abnormal protein aggregation—closely mirror core features of neurodegenerative pathology. However, current knowledge remains fragmented across text, figures, and pathway diagrams, limiting integration into computational models that could reveal systemic patterns.
Results
To address this gap, we applied GPT-4 Omni (GPT-4o), a multimodal large language model (LLM), to extract mechanistic insights from biomedical figures. Over 10,000 images were retrieved through targeted searches on COVID-19 and neurodegeneration; after automated and manual filtering, a curated subset was analyzed. GPT-4o extracted biological relationships as semantic triples, grouped into six mechanistic categories—including microglial activation and barrier disruption—using ontology-guided similarity and assembled into a Neo4j knowledge graph (KG). Accuracy was evaluated against a gold-standard dataset of expert-annotated images using Biomedical Bidirectional Encoder Representations from Transformers (BioBERT)–based semantic matching. This evaluation enabled prompt tuning, threshold optimization, and hyperparameter assessment. Results show that GPT-4o successfully recovers both established and novel mechanisms, yielding interpretable outputs that illuminate complex biological links between SARS-CoV-2 and neurodegeneration.
Conclusions
This study demonstrates the potential of multimodal LLMs to mine biomedical visual data at scale. By complementing text mining and integrating figure-derived knowledge, our framework advances understanding of COVID-19–related neurodegeneration and supports future translational research.
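As a sketch of the graph-assembly step, loading extracted (subject, predicate, object) triples into Neo4j with the official Python driver typically looks as follows; the node label, generic relationship type, and connection details are assumptions rather than the paper's exact schema.

```python
# Sketch: persist extracted semantic triples into a Neo4j knowledge graph.
# Connection details, labels, and schema are illustrative assumptions.
from neo4j import GraphDatabase

triples = [
    ("SARS-CoV-2", "INDUCES", "neuroinflammation"),
    ("neuroinflammation", "DISRUPTS", "blood-brain barrier"),
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_triple(tx, subj, pred, obj):
    # MERGE keeps entities unique; the predicate is stored as a property
    # so one generic relationship type covers all extracted verbs.
    tx.run(
        "MERGE (s:Entity {name: $subj}) "
        "MERGE (o:Entity {name: $obj}) "
        "MERGE (s)-[:RELATES {predicate: $pred}]->(o)",
        subj=subj, pred=pred, obj=obj,
    )

with driver.session() as session:
    for s, p, o in triples:
        session.execute_write(add_triple, s, p, o)
driver.close()
```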
{"title":"Leveraging multimodal large language models to extract mechanistic insights from biomedical visuals: A case study on COVID-19 and neurodegenerative diseases","authors":"Elizaveta Popova , Marc Jacobs , Martin Hofmann-Apitius , Negin Sadat Babaiha","doi":"10.1016/j.mlwa.2025.100816","DOIUrl":"10.1016/j.mlwa.2025.100816","url":null,"abstract":"<div><h3>Background</h3><div>The COVID-19 pandemic has intensified concerns about its long-term neurological impact, with evidence linking SARS-CoV-2 infection to neurodegenerative diseases (NDDs) such as Alzheimer’s (AD) and Parkinson’s (PD). Patients with these conditions face higher risk of severe COVID-19 outcomes and may experience accelerated cognitive or motor decline following infection. Proposed mechanisms—including neuroinflammation, blood–brain barrier (BBB) disruption, and abnormal protein aggregation—closely mirror core features of neurodegenerative pathology. However, current knowledge remains fragmented across text, figures, and pathway diagrams, limiting integration into computational models that could reveal systemic patterns.</div></div><div><h3>Results</h3><div>To address this gap, we applied GPT-4 Omni (GPT-4o), a multimodal large language model (LLM), to extract mechanistic insights from biomedical figures. Over 10,000 images were retrieved through targeted searches on COVID-19 and neurodegeneration; after automated and manual filtering, a curated subset was analyzed. GPT-4o extracted biological relationships as semantic triples, grouped into six mechanistic categories—including microglial activation and barrier disruption—using ontology-guided similarity and assembled into a Neo4j knowledge graph (KG). Accuracy was evaluated against a gold-standard dataset of expert-annotated images using Biomedical Bidirectional Encoder Representations from Transformers (BioBERT)–based semantic matching. This evaluation enabled prompt tuning, threshold optimization, and hyperparameter assessment. Results show that GPT-4o successfully recovers both established and novel mechanisms, yielding interpretable outputs that illuminate complex biological links between SARS-CoV-2 and neurodegeneration.</div></div><div><h3>Conclusions</h3><div>This study demonstrates the potential of multimodal LLMs to mine biomedical visual data at scale. By complementing text mining and integrating figure-derived knowledge, our framework advances understanding of COVID-19–related neurodegeneration and supports future translational research.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100816"},"PeriodicalIF":4.9,"publicationDate":"2025-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145748781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimation of the remaining charge retention time of an electric vehicle battery
Pub Date: 2025-12-05 | DOI: 10.1016/j.mlwa.2025.100813
Chourik Fousseni , Martin Otis , Khaled Ziane
Accurately estimating the remaining driving time (RDT) of an electric vehicle (EV) battery is essential for optimizing energy management and enhancing user experience. However, traditional estimation methods do not adequately account for the influence of temperature, driving characteristics, and vehicle driving time, leading to less accurate predictions and suboptimal range management. To address these limitations, this study presents a method for estimating the remaining charge retention time that integrates temperature and driving characteristics, refining predictions and improving model reliability. Data from the National Big Data Alliance for New Energy Vehicles (NDANEV) were employed to develop a predictive model based on machine learning (ML). The ML models compared in this study are Linear Regression, LSTM, Random Forest (RF), Prophet, LightGBM, and XGBoost. Model performance was evaluated using mean absolute error (MAE), root mean square error (RMSE), the coefficient of determination (R²), and the prediction runtime. The R² values for Prophet, Random Forest, LSTM, XGBoost, and LightGBM are 0.91, 0.94, 0.95, 0.94, and 0.94, respectively. Taken together with the error metrics and prediction runtime, these results indicate that XGBoost provides the best overall estimate of the remaining driving time. In addition, the results confirm that considering driving characteristics and ambient temperature improves the reliability and robustness of the estimates. These advancements contribute to more efficient energy management and optimized charging strategies.
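For reference, the reported evaluation metrics can be computed as in the sketch below (scikit-learn), with a toy linear model standing in for any of the compared predictors; the feature and target arrays are placeholders for actual RDT data.

```python
# Sketch: the evaluation metrics used to compare the RDT models
# (MAE, RMSE, R^2, and wall-clock prediction runtime). Data are toy values.
import time
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X_test, y_test):
    t0 = time.perf_counter()
    y_pred = model.predict(X_test)
    runtime = time.perf_counter() - t0
    return {
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2": r2_score(y_test, y_pred),
        "runtime_s": runtime,
    }

# toy usage with a linear model standing in for any of the compared models
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                            # e.g. temperature, speed, SOC
y = 30 * X[:, 2] - 2 * X[:, 0] + rng.normal(0, 1, 500)   # remaining minutes (toy)
print(evaluate(LinearRegression().fit(X[:400], y[:400]), X[400:], y[400:]))
```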
{"title":"Estimation of the remaining charge retention time of an electric vehicle battery","authors":"Chourik Fousseni , Martin Otis , Khaled Ziane","doi":"10.1016/j.mlwa.2025.100813","DOIUrl":"10.1016/j.mlwa.2025.100813","url":null,"abstract":"<div><div>Accurately estimating the remaining driving time (RDT) of an electric vehicle (EV) battery is essential for optimizing energy management and enhancing user experience. However, traditional estimation methods do not adequately account for the influence of temperature, driving characteristics and vehicle driving time, leading to less accurate predictions and suboptimal range management. To address these limitations, this study presents a method for estimating the remaining charge retention time by integrating temperature and driving characteristics, which refines predictions and improves model reliability. Furthermore, data from the National Big Data Alliance for New Energy Vehicles (NDANEV) were employed to develop a predictive model based on machine learning (ML) models. The different ML models compared in this study are Linear Regression, LSTM, RF, Prophet, LightGBM, and XGBoost. The model performance was evaluated using mean absolute error (MAE), root mean square error (RMSE), the coefficient of determination (<span><math><msup><mrow><mi>R</mi></mrow><mn>2</mn></msup></math></span>) and the prediction runtime to assess the prediction accuracy. The results show that the <span><math><msup><mrow><mi>R</mi></mrow><mn>2</mn></msup></math></span> values for Prophet, Random Forest, LSTM, XGBoost, and LightGBM are 0.91, 0.94, 0.95, 0.94, and 0.94 respectively. This suggests that XGBoost outperforms the other models, providing the most accurate estimate of the remaining driving time. In addition, the result confirms that considering driving characteristics and ambient temperature improves the reliability and robustness of estimations. These advancements contribute to more efficient energy management and optimized charging strategies.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100813"},"PeriodicalIF":4.9,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145748777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advancing marine mammal monitoring: Large-scale UAV delphinidae datasets and robust motion tracking for group size estimation
Pub Date: 2025-12-04 | DOI: 10.1016/j.mlwa.2025.100808
Leonardo Viegas Filipe , João Canelas , Mário Vieira , Francisco Correia da Fonseca , André Cid , Joana Castro , Inês Machado
Reliable estimates of dolphin abundance are essential for conservation and impact assessment, yet manual analysis of aerial surveys is time-consuming and difficult to scale. This paper presents an end-to-end pipeline for automatic dolphin counting from unmanned aerial vehicle (UAV) video that combines modern object detection and multi-object tracking. We construct a large detection dataset of 64,705 images with 225,305 dolphin bounding boxes and a tracking dataset of 54,274 frames with 207,850 boxes and 603 unique tracks, derived from UAV line-transect surveys. Using these data, we train a YOLO11-based detector that achieves a precision of approximately 0.93 across a range of sea states. For tracking, we adopt BoT-SORT and tune its parameters with a genetic algorithm using a multi-metric objective, reducing ID fragmentation by about 29% relative to default settings. Recent YOLO-based cetacean detectors trained on UAV imagery of beluga whales report precision/recall around 0.92/0.92 for adults and 0.94/0.89 for calves, but rely on DeepSORT tracking whose MOTA remains below 0.5 and must be boosted to roughly 0.7 with post-hoc trajectory post-processing. In this context, our pipeline offers competitive detection performance, substantially larger and fully documented detection and tracking benchmarks, and GA-optimized tracking without manual post-processing. Applied to dolphin group counting, the full pipeline attains a mean absolute error of 1.24 on a held-out validation set, demonstrating that UAV-based automated counting can support robust, scalable monitoring of coastal dolphin populations.
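To illustrate the detect-track-count pattern, the Ultralytics API runs a YOLO model with the BoT-SORT tracker and exposes per-frame track IDs, from which a naive group-size estimate is the number of unique IDs; the weights file and video path below are placeholders, and the paper's GA-tuned tracker parameters would live in the BoT-SORT YAML config rather than in this sketch.

```python
# Sketch: detect-track-count with Ultralytics YOLO + BoT-SORT. The weights
# file and video path are placeholders; GA tuning of tracker parameters
# (as in the paper) happens separately via the tracker's YAML config.
from ultralytics import YOLO

model = YOLO("dolphin_yolo11.pt")        # hypothetical fine-tuned weights
track_ids = set()

results = model.track(
    source="transect_clip.mp4",          # UAV survey video (placeholder)
    tracker="botsort.yaml",              # BoT-SORT; parameters editable in YAML
    conf=0.5,
    stream=True,                         # iterate frame by frame
)
for frame in results:
    if frame.boxes.id is not None:       # id tensor is None before tracks init
        track_ids.update(int(i) for i in frame.boxes.id.tolist())

print("estimated group size:", len(track_ids))
```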
{"title":"Advancing marine mammal monitoring: Large-scale UAV delphinidae datasets and robust motion tracking for group size estimation","authors":"Leonardo Viegas Filipe , João Canelas , Mário Vieira , Francisco Correia da Fonseca , André Cid , Joana Castro , Inês Machado","doi":"10.1016/j.mlwa.2025.100808","DOIUrl":"10.1016/j.mlwa.2025.100808","url":null,"abstract":"<div><div>Reliable estimates of dolphin abundance are essential for conservation and impact assessment, yet manual analysis of aerial surveys is time-consuming and difficult to scale. This paper presents an end-to-end pipeline for automatic dolphin counting from unmanned aerial vehicle (UAV) video that combines modern object detection and multi-object tracking. We construct a large detection dataset of 64,705 images with 225,305 dolphin bounding boxes and a tracking dataset of 54,274 frames with 207,850 boxes and 603 unique tracks, derived from UAV line-transect surveys. Using these data, we train a YOLO11-based detector that achieves a precision of approximately 0.93 across a range of sea states. For tracking, we adopt BoT-SORT and tune its parameters with a genetic algorithm using a multi-metric objective, reducing ID fragmentation by about 29% relative to default settings. Recent YOLO-based cetacean detectors trained on UAV imagery of beluga whales report precision/recall around 0.92/0.92 for adults and 0.94/0.89 for calves, but rely on DeepSORT tracking whose MOTA remains below 0.5 and must be boosted to roughly 0.7 with post-hoc trajectory post-processing. In this context, our pipeline offers competitive detection performance, substantially larger and fully documented detection and tracking benchmarks, and GA-optimized tracking without manual post-processing. Applied to dolphin group counting, the full pipeline attains a mean absolute error of 1.24 on a held-out validation set, demonstrating that UAV-based automated counting can support robust, scalable monitoring of coastal dolphin populations.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100808"},"PeriodicalIF":4.9,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145694244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}