Pub Date: 2026-01-10 DOI: 10.1186/s13040-025-00512-2
Debora Garza-Hernandez, Emmanuel Martinez-Ledesma, Victor Trevino
{"title":"Genotype subtyping approach to identify unnoticed variants in diseases from GWAS data.","authors":"Debora Garza-Hernandez, Emmanuel Martinez-Ledesma, Victor Trevino","doi":"10.1186/s13040-025-00512-2","DOIUrl":"10.1186/s13040-025-00512-2","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":" ","pages":"8"},"PeriodicalIF":6.1,"publicationDate":"2026-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12875041/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145949375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-08 DOI: 10.1186/s13040-025-00516-y
Geletaw Sahle Tegenaw, Hailin Song, Tomas Ward
{"title":"Generalization or mirage? Data leakage and reported performance in neonatal EEG seizure detection models: a systematic review.","authors":"Geletaw Sahle Tegenaw, Hailin Song, Tomas Ward","doi":"10.1186/s13040-025-00516-y","DOIUrl":"https://doi.org/10.1186/s13040-025-00516-y","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":" ","pages":""},"PeriodicalIF":6.1,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145935756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-22 DOI: 10.1186/s13040-025-00506-0
Isibor Kennedy Ihianle, Wathsala Samarasekara, Keeley Brookes, Pedro Machado
{"title":"Cross-cohort genetic risk prediction for Alzheimer's disease: a transfer learning approach using GWAS and deep learning models.","authors":"Isibor Kennedy Ihianle, Wathsala Samarasekara, Keeley Brookes, Pedro Machado","doi":"10.1186/s13040-025-00506-0","DOIUrl":"10.1186/s13040-025-00506-0","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":" ","pages":"89"},"PeriodicalIF":6.1,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12752400/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145811982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-22 DOI: 10.1186/s13040-025-00498-x
Yujie Huo, Weng Howe Chan, Ahmad Najmi Bin Amerhaider Nuar, Hongyu Gao
Early detection of depression is vital for public health, yet current multimodal methods often struggle with challenges such as modality incompleteness, semantic inconsistency, and emotional temporal fluctuation. To address these issues, this paper proposes SBT-Net, a novel Semantic-Bias-Trend guided framework for robust depression detection from audio and text data. The model integrates three innovative modules: a semantically guided cross-modal gating (SGCMG) mechanism that dynamically filters effective modality features based on global semantic cues, a bias-guided tensor product attention (BG-TPA) mechanism that enhances fine-grained fusion and alignment between modalities, and an emotion trend modeling (ETM) module that captures the temporal evolution of depressive emotional states. We evaluate SBT-Net using two widely adopted benchmark datasets: DAIC-WOZ, which contains 189 interview sessions, and EATD-Corpus, comprising 162 conversational samples. Experimental results show that SBT-Net achieves excellent performance in multiple indicators, including 93.0% accuracy, 0.93 F1 score, and 0.92 recall, all of which surpass the competitive baselines. Ablation studies further validate the individual and synergistic contributions of each proposed module. These findings highlight the potential of integrating semantic guidance, bias-aware fusion, and emotional trend modeling to advance multimodal depression detection solutions. The source code can be found at https://github.com/ghy-yhg/SBT-Net .
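The reported F1 score and recall together pin down a precision value, which the abstract does not state. The sketch below solves F1 = 2PR / (P + R) for P. It assumes that the 0.93 F1 and 0.92 recall refer to the same class and the same evaluation split, which the abstract does not confirm; the function names are illustrative only.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given F1 and recall R."""
    denom = 2.0 * recall - f1
    if denom <= 0:
        raise ValueError("inconsistent inputs: F1 must be below 2 * recall")
    return f1 * recall / denom


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Values taken from the abstract; treating them as one class on one split
# is an assumption.
precision = implied_precision(f1=0.93, recall=0.92)
print(round(precision, 4))  # 0.9402
```

Plugging the result back into `f1_score(precision, 0.92)` recovers 0.93, which is a quick consistency check on the algebra rather than on the paper's results.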
{"title":"SBT-Net: a tri-cue guided multimodal fusion framework for depression recognition.","authors":"Yujie Huo, Weng Howe Chan, Ahmad Najmi Bin Amerhaider Nuar, Hongyu Gao","doi":"10.1186/s13040-025-00498-x","DOIUrl":"10.1186/s13040-025-00498-x","url":null,"abstract":"<p><p>Early detection of depression is vital for public health, yet current multimodal methods often struggle with challenges such as modality incompleteness, semantic inconsistency, and emotional temporal fluctuation. To address these issues, this paper proposes SBT-Net, a novel Semantic-Bias-Trend guided framework for robust depression detection from audio and text data. The model integrates three innovative modules: a semantically guided cross-modal gating (SGCMG) mechanism that dynamically filters effective modality features based on global semantic cues, a bias-guided tensor product attention (BG-TPA) mechanism that enhances fine-grained fusion and alignment between modalities, and an emotion trend modeling (ETM) module that captures the temporal evolution of depressive emotional states. We evaluate SBT-Net using two widely adopted benchmark datasets: DAIC-WOZ, which contains 189 interview sessions, and EATD-Corpus, comprising 162 conversational samples. Experimental results show that SBT-Net achieves excellent performance in multiple indicators, including 93.0% accuracy, 0.93 F1 score, and 0.92 recall, all of which surpass the competitive baselines. Ablation studies further validate the individual and synergistic contributions of each proposed module. These findings highlight the potential of integrating semantic guidance, bias-aware fusion, and emotional trend modeling to advance multimodal depression detection solutions. 
The source code can be found at https://github.com/ghy-yhg/SBT-Net .</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"86"},"PeriodicalIF":6.1,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12723886/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145811969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-22 DOI: 10.1186/s13040-025-00500-6
Roberta Coletti, J Orestes Cerdeira, Marcos Raydan, Marta B Lopes
Background: High-dimensional omics data often contain more variables than observations, which can lead to overfitting and negatively impact the results of classical data analysis methods. To address the issue, supervised variable selection methods are often used, incorporating penalty terms into the model. While effective for selecting task-specific variables, this approach may not preserve the overall dataset structure for multiple downstream analyses. This study aims to evaluate unsupervised variable selection approaches and introduce a novel tool that improves data interpretability while maintaining biological information.
Results: We assessed multiple unsupervised variable selection techniques to identify a representative subset of the original dataset. Based on this evaluation, we developed TRIM-IT, a computational tool that integrates unsupervised variable selection, clustering, survival analysis, and differential gene expression analysis. TRIM-IT was applied to glioblastoma transcriptomics data, uncovering three distinct patient clusters. These clusters correlated with tumor histology, exhibited significantly different survival outcomes, and revealed molecular profiles that suggest potential biomarker candidates.
Conclusion: TRIM-IT provides a novel approach for analyzing high-dimensional omics data while preserving key biological insights. Its ability to identify meaningful patient subgroups and molecular signatures highlights its applicability across various biomedical research contexts. The tool is implemented in R and the code is publicly available for reproduction and adaptation to other studies.
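As a rough illustration of what unsupervised variable selection means in this setting, the sketch below ranks variables by variance without looking at any outcome label and keeps the top k. The abstract does not name the techniques that were evaluated, and TRIM-IT itself is implemented in R, so this variance filter is a common stand-in chosen for illustration and not the tool's method; the function name and toy matrix are hypothetical.

```python
from statistics import pvariance


def top_variance_columns(rows, k):
    """Return the sorted indices of the k columns with the largest variance.

    rows is a list of samples, each a list of numeric variable values.
    No outcome label is used, so the selection is unsupervised.
    """
    n_cols = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy matrix: 3 samples (rows) by 4 variables (columns).
expression = [
    [5.0, 1.0, 0.2, 9.0],
    [5.1, 3.0, 0.2, 2.0],
    [4.9, 5.0, 0.2, 7.0],
]

keep = top_variance_columns(expression, k=2)
reduced = [[row[j] for j in keep] for row in expression]
print(keep)     # [1, 3]
print(reduced)  # [[1.0, 9.0], [3.0, 2.0], [5.0, 7.0]]
```

The reduced matrix can then be passed to any downstream step such as clustering, which is the property the abstract emphasizes: the selection is not tied to a single prediction task.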
{"title":"An unsupervised tool for biomarker discovery and cancer subtyping applied to glioblastoma.","authors":"Roberta Coletti, J Orestes Cerdeira, Marcos Raydan, Marta B Lopes","doi":"10.1186/s13040-025-00500-6","DOIUrl":"10.1186/s13040-025-00500-6","url":null,"abstract":"<p><strong>Background: </strong>High-dimensional omics data often contain more variables than observations, which can lead to overfitting and negatively impact the results of classical data analysis methods. To address the issue, supervised variable selection methods are often used, incorporating penalty terms into the model. While effective for selecting task-specific variables, this approach may not preserve the overall dataset structure for multiple downstream analyses. This study aims to evaluate unsupervised variable selection approaches and introduce a novel tool that improves data interpretability while maintaining biological information.</p><p><strong>Results: </strong>We assessed multiple unsupervised variable selection techniques to identify a representative subset of the original dataset. Based on this evaluation, we developed TRIM-IT, a computational tool that integrates unsupervised variable selection, clustering, survival analysis, and differential gene expression analysis. TRIM-IT was applied to glioblastoma transcriptomics data, uncovering three distinct patient clusters. These clusters correlated with tumor histology, exhibited significantly different survival outcomes, and revealed molecular profiles that suggest potential biomarker candidates.</p><p><strong>Conclusion: </strong>TRIM-IT provides a novel approach for analyzing high-dimensional omics data while preserving key biological insights. Its ability to identify meaningful patient subgroups and molecular signatures highlights its applicability across various biomedical research contexts. 
The tool is implemented in R and the code is publicly available for reproduction and adaptation to other studies.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"85"},"PeriodicalIF":6.1,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12720448/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145811993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-20 DOI: 10.1186/s13040-025-00509-x
Su-Yin Hsu, Jhe-Yi Jhu, Jun-Wan Gao, Chien-Hua Huang, Chu-Lin Tsai, Li-Chen Fu
Background: Emergency Department (ED) revisits represent a critical issue in emergency medicine. Identifying high-risk revisit cases (revisits with intensive care unit admissions, cardiac arrest, or requiring emergency surgery) is particularly important. While prior studies have explored machine learning models for ED revisit prediction, few deep learning approaches exist, and dynamic features remain underutilized.
Methods: We used data from National Taiwan University Hospital (NTUH), incorporating both static (e.g., age, sex, triage) and dynamic (vital signs) features. A preprocessing strategy was developed to handle temporal irregularities. We proposed a hybrid deep learning model combining Temporal Convolutional Network (TCN) and feature tokenizer (FT)-Transformer to integrate static and short-term dynamic information.
Results: We evaluated our model on NTUH 2016-2019 data, achieving the area under the receiver operating characteristic curve (AUROC) of 0.8453 and the area under precision recall curve (AUPRC) of 0.0935 for high-risk revisits (base rate = 0.01), and AUROC of 0.7250 and AUPRC of 0.2005 for general revisits (base rate = 0.042). The model maintained robust performance when validated on 2020-2022 data. Compared to the static-only logistic regression baseline, our model improved AUPRC from 0.0288 to 0.0935 and precision from 0.0281 to 0.0428.
Conclusion: Our model significantly outperformed the static-only baseline. It demonstrates the effectiveness of multimodal clinical data fusion in improving ED revisit prediction and supporting clinical decision-making.
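The AUPRC values above look small in absolute terms because the outcomes are rare. A common way to read them is relative to the base rate, since an uninformative classifier is expected to score an AUPRC close to the positive-class prevalence. The sketch below computes that ratio for the figures in the abstract. It assumes that the baseline AUPRC of 0.0288 refers to the high-risk task, which the abstract implies but does not state outright; the function name is illustrative.

```python
def auprc_lift(auprc: float, base_rate: float) -> float:
    """Ratio of an AUPRC to the prevalence of the positive class."""
    if not 0.0 < base_rate <= 1.0:
        raise ValueError("base_rate must be in (0, 1]")
    return auprc / base_rate


# Figures from the abstract (NTUH 2016-2019 evaluation).
high_risk = auprc_lift(0.0935, 0.01)
general = auprc_lift(0.2005, 0.042)
# Assumption: the logistic regression baseline was scored on the high-risk task.
baseline_high_risk = auprc_lift(0.0288, 0.01)

print(round(high_risk, 2))           # 9.35
print(round(general, 2))             # 4.77
print(round(baseline_high_risk, 2))  # 2.88
```

Read this way, the high-risk model sits roughly nine times above chance level, even though its raw AUPRC is lower than the one for general revisits.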
{"title":"Deep learning to predict emergency department revisit using static and dynamic features (Deep Revisit): development and validation study.","authors":"Su-Yin Hsu, Jhe-Yi Jhu, Jun-Wan Gao, Chien-Hua Huang, Chu-Lin Tsai, Li-Chen Fu","doi":"10.1186/s13040-025-00509-x","DOIUrl":"10.1186/s13040-025-00509-x","url":null,"abstract":"<p><strong>Background: </strong>Emergency Department (ED) revisits represent a critical issue in emergency medicine. Identifying high-risk revisit cases (revisits with intensive care unit admissions, cardiac arrest, or requiring emergency surgery) is particularly important. While prior studies have explored machine learning models for ED revisit prediction, few deep learning approaches exist, and dynamic features remain underutilized.</p><p><strong>Methods: </strong>We used data from National Taiwan University Hospital (NTUH), incorporating both static (e.g., age, sex, triage) and dynamic (vital signs) features. A preprocessing strategy was developed to handle temporal irregularities. We proposed a hybrid deep learning model combining Temporal Convolutional Network (TCN) and feature tokenizer (FT)-Transformer to integrate static and short-term dynamic information.</p><p><strong>Results: </strong>We evaluated our model on NTUH 2016-2019 data, achieving the area under the receiver operating characteristic curve (AUROC) of 0.8453 and the area under precision recall curve (AUPRC) of 0.0935 for high-risk revisits (base rate = 0.01), and AUROC of 0.7250 and AUPRC of 0.2005 for general revisits (base rate = 0.042). The model maintained robust performance when validated on 2020-2022 data. Compared to the static-only logistic regression baseline, our model improved AUPRC from 0.0288 to 0.0935 and precision from 0.0281 to 0.0428.</p><p><strong>Conclusion: </strong>Our model significantly outperformed the static-only baseline. 
It demonstrates the effectiveness of multimodal clinical data fusion in improving ED revisit prediction and supporting clinical decision-making.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":" ","pages":"88"},"PeriodicalIF":6.1,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12750547/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145800645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-17 DOI: 10.1186/s13040-025-00510-4
Platon Lukyanenko, Joshua Mayourian, Mingxuan Liu, John K Triedman, Sunil J Ghelani, William G La Cava
Background: Artificial intelligence applied to electrocardiography (AI-ECG) has recently shown potential for mortality prediction, but heterogeneous approaches and private datasets have limited generalizable insights into AI methodologies fit for this purpose. To address this, we systematically evaluated model design choices across three large medical center cohorts: Beth Israel Deaconess (MIMIC-IV; n = 795,546 ECGs, United States), Telehealth Network of Minas Gerais (Code-15; n = 345,779, Brazil), and Boston Children's Hospital (BCH; n = 255,379, United States).
Results: We comprehensively evaluated models to predict all-cause mortality, comparing horizon-based classification and deep survival methods across various neural architectures, including convolutional neural networks and transformers. We also benchmarked against demographic-only and gradient boosting baselines. Top models yielded good performance (median concordance, Code-15: 0.83; MIMIC-IV: 0.78; BCH: 0.81). Incorporating age and sex improved performance across all datasets. Classifier-Cox models exhibited site-dependent sensitivity to horizon choice (median Pearson's R, Code-15: 0.35; MIMIC-IV: -0.71; BCH: 0.37). External validation reduced concordance, and in some cases, demographic-only models outperformed externally trained AI-ECG models on Code-15. However, models trained on multi-site data outperformed site-specific models by margins ranging from 5% to 22%.
Conclusions: These findings highlight several key factors for robust AI-ECG deployment. Deep survival methods consistently provided advantages over horizon-based classifiers, while inclusion of demographic covariates such as age and sex improved predictive performance across sites. The sensitivity of classifier-based models to horizon selection underscores the need for site-specific calibration. The multi-site experiment reveals that cross-cohort training, even between adult and pediatric cohorts, can substantially improve performance on those cohorts compared to cohort-specific training. Together, these results emphasize the importance of model type, demographic features, and training data diversity in developing AI-ECG models that can be reliably applied across populations.
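The concordance figures quoted above measure how often a model ranks a patient who died earlier as higher risk than a patient who survived longer. The sketch below is a minimal pure-Python version of Harrell's concordance index with right censoring. The abstract does not say which concordance estimator was used, so Harrell's C is an assumption here, and the toy follow-up data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked in the right order.

    times  -- follow-up time per subject
    events -- 1 if death was observed, 0 if the subject was censored
    risks  -- predicted risk score (higher means earlier expected death)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risks)
print(c_index)  # 0.8
```

In the toy cohort five pairs are comparable and four are ordered correctly, giving 0.8. A value of 0.5 corresponds to random ranking, which gives some context for the 0.78 to 0.83 range reported for the top models.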
{"title":"Deep survival analysis from adult and pediatric electrocardiograms: a multi-center benchmark study.","authors":"Platon Lukyanenko, Joshua Mayourian, Mingxuan Liu, John K Triedman, Sunil J Ghelani, William G La Cava","doi":"10.1186/s13040-025-00510-4","DOIUrl":"10.1186/s13040-025-00510-4","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence applied to electrocardiography (AI-ECG) has recently shown potential for mortality prediction, but heterogeneous approaches and private datasets have limited generalizable insights into AI methodologies fit for this purpose. To address this, we systematically evaluated model design choices across three large medical center cohorts: Beth Israel Deaconess (MIMIC-IV; n = 795,546 ECGs, United States), Telehealth Network of Minas Gerais (Code-15; n = 345,779, Brazil), and Boston Children's Hospital (BCH; n = 255,379, United States).</p><p><strong>Results: </strong>We comprehensively evaluated models to predict all-cause mortality, comparing horizon-based classification and deep survival methods across various neural architectures, including convolutional neural networks and transformers. We also benchmarked against demographic-only and gradient boosting baselines. Top models yielded good performance (median concordance, Code-15: 0.83; MIMIC-IV: 0.78; BCH: 0.81). Incorporating age and sex improved performance across all datasets. Classifier-Cox models exhibited site-dependent sensitivity to horizon choice (median Pearson's R, Code-15: 0.35; MIMIC-IV: -0.71; BCH: 0.37). External validation reduced concordance, and in some cases, demographic-only models outperformed externally trained AI-ECG models on Code-15. However, models trained on multi-site data outperformed site-specific models by margins ranging from 5% to 22%.</p><p><strong>Conclusions: </strong>These findings highlight several key factors for robust AI-ECG deployment. 
Deep survival methods consistently provided advantages over horizon-based classifiers, while inclusion of demographic covariates such as age and sex improved predictive performance across sites. The sensitivity of classifier-based models to horizon selection underscores the need for site-specific calibration. The multi-site experiment reveals that cross-cohort training, even between adult and pediatric cohorts, can substantially improve performance on those cohorts compared to cohort-specific training. Together, these results emphasize the importance of model type, demographic features, and training data diversity in developing AI-ECG models that can be reliably applied across populations.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":" ","pages":"6"},"PeriodicalIF":6.1,"publicationDate":"2025-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12821881/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145776020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}