
Biodata Mining: Latest Articles

A graph-theoretic framework for quantitative analysis of angiogenic networks.
IF 6.1 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-10-02 | DOI: 10.1186/s13040-025-00478-1
Goodluck Okoro, Pawel Wityk, Michael B Nelappana, Karl A Jackiewicz, Veronica Z Kucharczyk, Annie Tigranyan, Catherine C Applegate, Iwona T Dobrucki, Lawrence W Dobrucki

The endothelial tube formation assay is an established in vitro model for evaluating angiogenesis. Although widely used, quantification of angiogenic behavior in such assays remains semi-empirical and often lacks spatial, topological, and structural context. Here, we present a graph-theoretic framework to quantify network morphology, temporal dynamics, and spatial heterogeneity in tube formation assays. We simulated two distinct angiogenic network morphologies using human umbilical vein endothelial cells (HUVECs) seeded at two densities and imaged at 2, 4, and 18 h post-seeding. Skeletonized images were converted to mathematical graphs from which 11 graph-based metrics were extracted. This framework captured both morphological differences and temporal progression. Sparse networks exhibited significantly higher average node degree (p = 0.00079), clustering coefficient (p = 0.00109), and tortuosity (p = 0.0171), whereas dense networks showed greater node and edge counts (p = 0.00109). Over time, networks evolved from fragmented forms at 2 h to integrated structures at 18 h, as reflected by increased largest component size (p = 0.00216), connectivity index (p = 0.00216), and efficiency (p = 0.0152). ROC AUC analysis showed that metrics such as average degree (AUC = 0.98) and clustering coefficient (AUC = 0.96) effectively distinguished sparse from dense morphologies, while component-based metrics perfectly separated 2- and 18-hour networks (AUC = 1.00). Radial zone analysis revealed that vascular distribution becomes more compartmentalized over time, with increasing standard deviation and coefficient of variation. This approach provides a sensitive and scalable method for quantifying angiogenic dynamics, offering insight into both therapeutic efficacy and disease-related vascular remodeling.
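To make three of the graph metrics named in the abstract concrete (average node degree, largest component size, clustering coefficient), here is a minimal sketch on a toy graph. The adjacency dictionary `toy` and all function names are hypothetical; the abstract does not describe how the authors implemented their pipeline, so this illustrates the standard definitions only.

```python
from collections import deque

def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def largest_component_size(adj):
    """Number of nodes in the largest connected component (breadth-first search)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, size)
    return best

def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        ordered = list(nbrs)
        # Count edges that exist between pairs of neighbours of this node.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if ordered[j] in adj[ordered[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Hypothetical skeleton graph: a triangle (A, B, C) with a tail (D)
# plus a disconnected fragment (E-F), as an early time point might show.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}

print(average_degree(toy), largest_component_size(toy), average_clustering(toy))
```

In this toy case the disconnected E-F fragment lowers the largest component size relative to the node count, which is the kind of fragmentation signal the component-based metrics are described as capturing.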

Biodata Mining, vol. 18, no. 1, art. 69 (2025-10-02). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12492523/pdf/
Citations: 0
Cross-regional radiomics: a novel framework for relationship-based feature extraction with validation in Parkinson's disease motor subtyping.
IF 6.1 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-09-29 | DOI: 10.1186/s13040-025-00483-4
Mahboube Sadat Hosseini, Seyyed Mahmoud Reza Aghamiri, Mehdi Panahi

Traditional radiomics approaches focus on single-region feature extraction, limiting their ability to capture complex inter-regional relationships crucial for understanding pathophysiological mechanisms in complex diseases. This study introduces a novel cross-regional radiomics framework that systematically extracts relationship-based features between anatomically and functionally connected brain regions. We analyzed T1-weighted magnetic resonance imaging (MRI) data from 140 early-stage Parkinson's disease patients (70 tremor-dominant, 70 postural instability gait difficulty) from the Parkinson's Progression Markers Initiative (PPMI) database across multiple imaging centers. Eight bilateral motor circuit regions (putamen, caudate nucleus, globus pallidus, substantia nigra) were segmented using standardized atlases. Two feature sets were developed: 48 traditional single-region of interest (ROI) features and 60 novel motor-circuit features capturing cross-regional ratios, asymmetry indices, volumetric relationships, and shape distributions. Six feature engineering scenarios were evaluated using center-based 5-fold cross-validation with six machine learning classifiers to ensure robust generalization across different imaging centers. Motor-circuit features demonstrated superior performance compared to single-ROI features across enhanced preprocessing scenarios. Peak performance was achieved with area under the curve (AUC) of 0.821 ± 0.117 versus 0.650 ± 0.220 for single-ROI features (p = 0.0012, Cohen's d = 0.665). Cross-regional ratios, particularly putamen-substantia nigra relationships, dominated the most discriminative features. Motor-circuit features showed superior generalization across multi-center data and better clinical utility through decision curve analysis and calibration curves. The proposed cross-regional radiomics framework significantly outperforms traditional single-region approaches for Parkinson's disease motor subtype classification. This methodology provides a foundation for advancing radiomics applications in complex diseases where inter-regional connectivity patterns are fundamental to pathophysiology.
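The abstract names cross-regional ratios and asymmetry indices without giving formulas. A minimal sketch follows, assuming the commonly used normalized-difference form for the asymmetry index, (L - R) / (L + R), and a plain quotient for the ratio. Both definitions, the function names, and the volumes are assumptions for illustration, not the study's feature definitions or data.

```python
def asymmetry_index(left, right):
    """Normalized left-right difference; 0 means perfectly symmetric (assumed definition)."""
    return (left - right) / (left + right)

def cross_regional_ratio(region_a, region_b):
    """Quotient of the same feature measured in two different regions (assumed definition)."""
    return region_a / region_b

# Hypothetical regional volumes in mm^3; illustrative numbers only.
volumes = {
    "putamen_L": 4200.0,
    "putamen_R": 4000.0,
    "substantia_nigra_L": 500.0,
    "substantia_nigra_R": 500.0,
}

putamen_ai = asymmetry_index(volumes["putamen_L"], volumes["putamen_R"])
putamen_total = volumes["putamen_L"] + volumes["putamen_R"]
nigra_total = volumes["substantia_nigra_L"] + volumes["substantia_nigra_R"]
putamen_nigra_ratio = cross_regional_ratio(putamen_total, nigra_total)

print(round(putamen_ai, 4), putamen_nigra_ratio)
```

Features of this kind describe a relationship between regions rather than a single region, which is the distinction the abstract draws against single-ROI features.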

Biodata Mining, vol. 18, no. 1, art. 67 (2025-09-29). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482597/pdf/
Citations: 0
Proteome mining of Yersinia Enterocolitica for drug targets and computational inhibitor identification with ADMET, anti-inflammation potential and formulation characteristics.
IF 6.1 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-09-29 | DOI: 10.1186/s13040-025-00482-5
Zarrin Basharat, Youssef Saeed Alghamdi, Mutaib M Mashraqi, Hanan A Ogaly, Fatimah A M Al-Zahrani, Calvin R Wei, Ibrar Ahmed, Seil Kim

Yersinia enterocolitica infection can manifest as self-limiting gastroenteritis and may lead to more severe conditions, such as mesenteric lymphadenitis, reactive arthritis, or rare systemic infections. Fluoroquinolones and third-generation cephalosporins are the most effective treatment options, but the effectiveness of tetracyclines and co-trimoxazole may vary with resistance patterns. To explore new therapeutic options in case of antibiotic resistance, we first mined drug targets from the Yersinia enterocolitica proteome using a subtractive proteomics approach. Subsequently, we repurposed FDA-approved and Traditional Chinese Medicine (TCM) compounds against its cell wall synthesis mechanism by targeting DD-transpeptidase. DrugRep screening prioritized FDA-approved hits (Digitoxin, Irinotecan, Acetyldigitoxin; ≤ -9.4 kcal/mol) and TCM hits (Vaccarin, Narirutin, Hinokiflavone; ≤ -9.5 kcal/mol). Machine learning-based validation identified Hinokiflavone and Acetyldigitoxin as the most potent binders. Molecular dynamics simulations (100 ns) revealed RMSD values < 1 nm for all complexes, indicating stable binding. ADMET profiling predicted all compounds to be non-allergenic and the TCM compounds to have poor absorption. SBE-β-cyclodextrin coupling with FormulationAI showed improved compound solubility and oral bioavailability. InflamNat predicted strong anti-inflammatory potential for Hinokiflavone, highlighting its dual role in antibacterial and host-directed immunomodulatory activity. These computational insights mark an initial step in drug discovery, prompting comprehensive testing of prioritized compounds against Yersinia enterocolitica.
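The stability criterion above rests on root-mean-square deviation (RMSD). As a rough illustration of the quantity itself, the sketch below computes RMSD between two coordinate sets. It assumes the frames are already superimposed and skips the alignment step that molecular dynamics tools normally perform; the coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    No superposition is performed here; the inputs are assumed to be pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and the same length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical two-atom reference and a later frame, in nanometres.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]

print(rmsd(reference, frame))
```

A trajectory-level analysis would repeat this per frame against the starting structure and inspect how the value evolves over the simulation time.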

Biodata Mining, vol. 18, no. 1, art. 68 (2025-09-29). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482588/pdf/
Citations: 0
Decoding ancestry-specific genetic risk: interpretable deep feature selection reveals prostate cancer SNP disparities in diverse populations.
IF 6.1 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-09-29 | DOI: 10.1186/s13040-025-00470-9
Zhong Chen, Zichen Lao, You Lu, Wensheng Zhang, Andrea Edwards, Kun Zhang

Background: The clinical potential of single nucleotide polymorphisms (SNPs) in prostate cancer (PCa) diagnosis has been extensively explored using conventional statistical and machine learning approaches. However, the predictive power and interpretability of these methods remain inadequate for clinical translation, primarily due to limited generalization across high-dimensional SNP datasets. This study addresses the contested diagnostic utility of SNPs by integrating interpretable feature selection with deep learning to enhance both classification performance and biological relevance.

Methods: We propose an interpretable deep feature selection framework designed to enhance both the classification performance and biological relevance of SNP markers in distinguishing between benign and malignant prostate cancer samples. This study specifically investigates the debated diagnostic value of SNPs in PCa classification by integrating feature selection with deep learning to uncover actionable insights. Our framework comprises four key components: (1) heuristic feature reduction, which eliminates irrelevant SNPs during gradient computation for training deep neural networks (DNNs); (2) iterative SNP subset optimization, which aims to maximize classification AUC during model training; (3) gradient variance minimization, which mitigates instability caused by limited sample sizes; and (4) nonlinear interaction modeling, which extracts high-level SNP interactions through hierarchical representations.

Results: Evaluated on the PLCO, BPC3, and MEC-AA datasets, our method achieved mean AUC scores of 0.747, 0.751, and 0.559, respectively, demonstrating statistically significant improvements (p < 0.05, paired t-test) over existing approaches. Notably, the lower AUC for MEC-AA may reflect inherent population-specific complexities, as this dataset focuses on African American men, a group historically underrepresented in genomic studies. For interpretability, our framework identified 345, 373, and 437 consensus SNP markers across the PLCO, BPC3, and MEC-AA cohorts, respectively. Key SNPs were further validated against prior research on PCa racial disparities. In particular, rs10086908 and rs2273669 (PLCO); rs12284087, rs902774, rs9364554, and rs7611694 (BPC3); and rs3123078 and rs1447295 (MEC-AA) exhibited strong concordance with established loci linked to ethnic-specific risk profiles. For instance, rs1447295 on chromosome 8q24, recurrently associated with African ancestry, underscores the method's ability to recover population-relevant variants.

Conclusion: By synergizing interpretable feature selection with deep learning, this work advances the translation of SNP-based biomarkers into clinically actionable tools while clarifying their contested diagnostic role in PCa.
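Because the results above are reported as AUC, a short sketch of what that number measures may help. It uses the rank interpretation of AUC: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The labels and scores are hypothetical and unrelated to the PLCO, BPC3, or MEC-AA data.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier outputs: 1 = malignant, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

print(roc_auc(labels, scores))
```

On this reading, an AUC near 0.5 means the ranking of cases is close to chance, which is why the MEC-AA score of 0.559 is discussed separately from the other two cohorts.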

Biodata Mining, vol. 18, no. 1, art. 66 (2025-09-29). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12481780/pdf/
Citations: 0
MoRFs_TransFuse: a MoRFs predictor based on multimodal feature fusion and the lightweight Transformer network.
IF 6.1 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-09-29 | DOI: 10.1186/s13040-025-00481-6
Lele Zhang, Hao He, Xuesen Shi

Molecular recognition features (MoRFs) can facilitate specific protein-protein interactions by undergoing disorder-to-order transitions when binding to their protein partners. Accurate prediction of MoRFs is therefore essential. In this paper, we propose a MoRFs prediction method, named MoRFs_TransFuse, based on multimodal feature fusion and a lightweight Transformer network. To construct high-quality biological features, MoRFs_TransFuse integrates physicochemical properties, evolutionary features, and pre-trained model embeddings, while retaining optimal feature combinations through multi-window extraction and Random Forest secondary screening. In terms of architecture, MoRFs_TransFuse overcomes the limitations of modeling long-range dependencies by using a self-attention mechanism to accurately capture long-range residue associations in protein sequences. Comparative experiments on benchmark datasets show that MoRFs_TransFuse significantly outperforms existing single-component and combined-component predictors. Additionally, the lightweight design greatly improves computational efficiency while preserving prediction accuracy.
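The self-attention mechanism mentioned above lets every residue position weigh every other position directly, regardless of sequence distance. A minimal single-head sketch follows, assuming identity projections (queries, keys, and values all equal the input). A real Transformer layer, presumably including MoRFs_TransFuse, would use learned projections, multiple heads, and positional information; the feature vectors are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (illustrative simplification).

    Returns the attended outputs and the attention weight matrix.
    """
    dim = len(x[0])
    outputs, weights = [], []
    for query in x:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in x
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[c] for w, value in zip(row, x)) for c in range(dim)]
        )
    return outputs, weights

# Hypothetical per-residue feature vectors: positions 0 and 2 are similar.
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended, attention = self_attention(residues)

print(attention[0])
```

Position 0 attends equally to itself and to position 2, and less to the dissimilar position 1, even though position 2 is not adjacent. That direct pairwise weighting is what the abstract refers to as capturing long-range residue associations.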

Biodata Mining, vol. 18, no. 1, art. 65 (2025-09-29). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482271/pdf/
Citations: 0
Temporal phenotyping and prognostic stratification of patients with sepsis through longitudinal clustering.
IF 6.1 | Zone 3 (Biology) | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2025-09-26 | DOI: 10.1186/s13040-025-00480-7
Patrizia Ribino, Maria Mannone, Claudia Di Napoli, Giovanni Paragliola, Davide Chicco, Francesca Gasparini

Sepsis is a critical medical condition characterized by a highly variable and rapidly evolving clinical course, often necessitating early intervention and tailored treatment plans to improve patient outcomes. Due to its complexity and heterogeneity, understanding the progression of sepsis across different patient populations remains a significant challenge. In this study, we use an analytical framework based on k-means multivariate longitudinal clustering to capture the diverse trajectories of sepsis. We do so by analyzing multiple clinical parameters tracked over time, providing a nuanced view of disease progression. By incorporating Dynamic Time Warping (DTW) as the distance metric, the proposed method accounts for temporal misalignments and variability in the rate of disease progression, an essential capability given the unpredictable and heterogeneous nature of sepsis. This integration enhances the model's ability to detect distinct temporal patterns and phenotypic subgroups that may remain undetected using conventional analytical approaches. By leveraging sepsis-related electronic health records (EHRs), which provide rich time-series data on laboratory results along with patient demographics and underlying health conditions, the proposed method reveals distinct sepsis phenotypes that reflect variations in disease progression. We perform several experiments varying the number of clusters and clinical variable combinations, evaluating clustering performance using the Silhouette score, Calinski-Harabasz Index, and Davies-Bouldin Index as reference quality metrics. Our results confirm the prognostic role of the Thrombin-Antigen complex and the Prothrombin Time-International Normalized Ratio for septic patients. Furthermore, to evaluate the relevance of subjects' stratification, the Adjusted Rand Index is used to quantify the survival prediction capability of our longitudinal clustering method, with 28-day death as the target variable. The same metric shows that our proposal outperforms other longitudinal clustering algorithms available in the literature.
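To illustrate why DTW suits trajectories that progress at different rates, here is a minimal sketch of the classic dynamic-programming DTW distance for univariate series. The study works with multivariate series and k-means clustering, which this sketch does not cover; the function name and the example trajectories are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Uses absolute difference as the local cost and the standard three-way recurrence.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]

# Hypothetical lab-value trajectories: same shape, different speed.
slow_progression = [1, 1, 2, 3, 3, 4]
fast_progression = [1, 2, 3, 4]

print(dtw_distance(slow_progression, fast_progression))
```

The two trajectories have different lengths and pacing, yet their DTW distance is zero because warping aligns repeated values. A point-by-point Euclidean comparison could not even be computed without resampling, which is the temporal-misalignment problem the abstract describes.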

Biodata Mining, vol. 18, no. 1, art. 64 (2025-09-26). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12465323/pdf/
Citations: 0
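The sepsis entry above uses Dynamic Time Warping (DTW) as the distance inside k-means so that trajectories progressing at different rates can still be matched. The sketch below is only a minimal illustration of the DTW recurrence on univariate series. The function name `dtw_distance` and the absolute-difference local cost are assumptions made here for illustration; the study itself clusters multivariate series and its exact implementation is not given in the abstract.

```python
from math import inf


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance between two numeric sequences.

    Hypothetical helper for illustration; local cost is |a_i - b_j|.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# A lab value that rises one step later still aligns at zero cost,
# whereas a point-by-point comparison would penalise the delay.
fast = [1, 2, 3]
slow = [1, 2, 2, 3]
print(dtw_distance(fast, slow))
```

In a k-means setting, a distance like this would replace the Euclidean distance when assigning each patient trajectory to its nearest cluster centre.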
Construction and validation of a machine learning-based model predicting early readmission in patients with decompensated cirrhosis: a prospective two-center cohort study.
IF 6.1 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-09-24 DOI: 10.1186/s13040-025-00479-0
Fang Yang, Jia Li, Ziyi Yang, Liping Wu, Han Wang, Chao Sun

Background: Early 30-day readmission remains a significant burden on socioeconomic and healthcare systems in the context of decompensated cirrhosis. Early recognition and accurate identification are crucial. However, current evidence is elusive, and traditional liver disease severity scores lack specificity and sensitivity. We sought to construct and validate an explainable machine learning (ML)-based prediction model and to evaluate its prognostic implementation in patients readmitted due to acute episodes. The prediction model for discovery and validation was based on a two-center prospective investigation. Our discovery sample, comprising 636 patients with cirrhosis, was divided into a training set and a test set, with an additional cohort of 150 patients serving as external validation. Eleven ML methods were applied to build an indicative model from a variety of readily available variables in the electronic health record. The area under the ROC curve (AUC), alongside several other evaluation parameters, was used to compare predictive performance. For feature importance and explanation of the final model, we adopted the SHapley Additive exPlanations (SHAP) method for ranking. Furthermore, prognostic implementation was verified by subgrouping according to the final model and clinical outcomes during follow-up.

Results: Among all 11 ML algorithms, the random forest (RF) algorithm showed the best discriminatory capability. Feature reduction produced a final 7-feature RF model, made explainable through the importance ranking. The model predicted with moderate accuracy in internal and external validation (AUCs of 0.853 and 0.838, respectively) and was further turned into an online tool to facilitate daily practice. Patients flagged as positive by the prediction model had more severe underlying disease and poor psychophysiologic reserve.

Conclusions: The final explainable ML model was capable of predicting early readmission and was closely connected with adverse outcomes in individual patients experiencing decompensated cirrhosis. Notably, its indirect interpretation allayed the "black-box" concerns inherent to ML techniques.
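The abstract above compares models by the area under the ROC curve. As a rough illustration of what that number measures, the following sketch computes AUC as the probability that a randomly chosen readmitted patient receives a higher predicted score than a randomly chosen non-readmitted one (ties count one half). The function name `roc_auc` and the toy labels and scores are hypothetical; they are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 outcomes (1 = readmitted, in this toy setting).
    scores: predicted risk for each case, same order as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive correctly ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical example: four patients, two of whom were readmitted.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form is quadratic in the number of cases, which is acceptable for a sketch; rank-based formulas give the same value more efficiently on larger cohorts.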

Citations: 0
Investigating causal effects of HDL-C on cognitive function through cross-sectional and Mendelian randomization analyses: concentration-response patterns and clues for Alzheimer's disease prevention.
IF 6.1 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-09-22 DOI: 10.1186/s13040-025-00484-3
Longmin Fan, Haitao Jiang, Zheyu Zhang

Background: Disrupted cholesterol homeostasis may accelerate cognitive aging. This study investigated the relationship between serum HDL-C levels and cognitive function, utilizing cross-sectional data and Mendelian randomization (MR) analysis.

Methods: A cross-sectional study was conducted using data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014, including 19,931 participants. Among them, 2,777 individuals aged 60 years and older with complete HDL-C levels and cognitive function data were included. Cognitive function was assessed using tests such as the Consortium to Establish a Registry for Alzheimer's Disease Immediate and Delayed Recall, the Animal Fluency Test, and the Digit Symbol Substitution Test. Additionally, MR analysis was employed to assess the causal relationship between genetically predicted HDL-C and dementia.

Results: Gender-stratified analyses revealed sex-specific patterns in the relationship between HDL-C and cognitive function. In fully adjusted linear models, men showed consistently positive associations across all cognitive domains, including delayed recall (β = 0.10, 95% CI 0.04-0.17, p < 0.001), immediate recall (β = 0.06, 95% CI 0.00-0.12, p = 0.047), verbal fluency (β = 0.20, 95% CI 0.14-0.26, p < 0.001), processing speed (β = 0.09, 95% CI 0.05-0.14, p < 0.001), and overall composite score (β = 0.45, 95% CI 0.29-0.62, p < 0.001). In women, these associations were attenuated or non-significant for immediate recall, delayed recall, and composite cognition, suggesting non-linearity. Further concentration-response analyses revealed a linear positive association in men and an inverted U-shaped relationship in women. MR analyses indicated a protective association between genetically predicted HDL-C and Alzheimer's disease risk (OR = 0.51, 95% CI 0.29-0.89, p = 0.019). However, sensitivity analyses revealed attenuation after MR-PRESSO outlier correction (β = -0.013, p = 0.756) and inconsistent estimates across methods, with significant heterogeneity (Q-test p < 0.001) and evidence of pleiotropy. In multivariable analysis adjusting for LDL-C and triglycerides (TG), IVW (β = 0.290, p = 0.048) and Lasso regression (β = 0.752, p = 0.008) indicated weak positive correlations. However, MR-Egger (β = 0.752, p = 0.008) revealed potential pleiotropic interference (intercept p = 0.050).

Conclusions: Our findings suggest that maintaining optimal serum HDL-C levels may help preserve cognitive function in older adults. Notably, sex-specific associations were observed, warranting further investigation into underlying mechanisms.
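The MR results above report an inverse-variance weighted (IVW) estimate. As a hedged sketch of how such an estimate is commonly formed from summary statistics, the code below combines per-variant SNP-exposure and SNP-outcome effects using first-order weights (uncertainty in the exposure effects is ignored). The function name `ivw_estimate` and all numbers are hypothetical and are not taken from the study.

```python
from math import exp, sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    beta = sum(bx * by / se_y^2) / sum(bx^2 / se_y^2)
    se   = sqrt(1 / sum(bx^2 / se_y^2))
    """
    num = 0.0
    den = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / se ** 2
        num += bx * by * weight
        den += bx * bx * weight
    beta = num / den
    return beta, sqrt(1.0 / den)


# Hypothetical instruments: each variant's ratio by/bx equals -0.5,
# so the pooled log-odds estimate is -0.5 as well.
bx = [0.1, 0.2]        # SNP -> HDL-C effects (illustrative)
by = [-0.05, -0.1]     # SNP -> outcome log-odds (illustrative)
se = [0.01, 0.02]      # standard errors of the outcome effects
beta, beta_se = ivw_estimate(bx, by, se)
print(beta, beta_se, exp(beta))  # log-odds, its SE, and the odds ratio
```

An odds ratio below 1, as in the abstract, corresponds to a negative log-odds estimate; sensitivity methods such as MR-Egger and MR-PRESSO then probe whether pleiotropy could explain it.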

Citations: 0
Identification of severity related mutation hotspots in SARS-CoV-2 using a density-based clustering approach.
IF 6.1 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-09-01 DOI: 10.1186/s13040-025-00476-3
Sohyun Youn, Dabin Jeong, Hwijun Kwon, Eonyong Han, Sun Kim, Inuk Jung

Background: The immune response to SARS-CoV-2 varies greatly among individuals, yielding widely differing severity levels among patients. While there are various methods to spot severity-associated biomarkers in COVID-19 patients, we investigated highly mutated regions, or mutation hotspots, within the SARS-CoV-2 genome that correlate with patient severity levels. SARS-CoV-2 mutation hotspots were searched for in the GISAID database using a density-based clustering algorithm, Mutclust, which searches for loci with high mutation density and diversity.

Results: Using Mutclust, 477 mutation hotspots were identified in the SARS-CoV-2 genome, of which 28 showed significant association with severity levels in a multi-omics COVID-19 cohort comprising 387 infected patients. The patients were further stratified into moderate and severe groups based on the 28 severity-related mutation hotspots, which showed distinctive cytokine and gene expression levels in both cytokine profile and single-cell RNA-seq samples. The effect of the SARS-CoV-2 mutation hotspots on human genes was further investigated by network propagation analysis, in which two mutation hotspots specific to the severe group showed association with NK cell activity. One of them was shown to decrease the affinity between the viral epitope of the hotspot region and its binding HLA when compared with the non-mutated epitope.

Conclusion: Genes related to the immunological function of NK cells, especially the NK cell receptor and co-activating receptor genes, were significantly dysregulated in the severe patient group at both the cytokine and single-cell levels. Collectively, mutation hotspots associated with severity and their related NK cell-associated gene expression regulation were identified.
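The abstract describes hotspots as loci with high mutation density. The internals of Mutclust are not given there, so the sketch below is not that algorithm; it is a simple gap-based grouping, used as a stand-in, that chains mutated genome positions lying close together and keeps only groups with enough sites. The function name `find_hotspots` and the parameters `max_gap` and `min_sites` are assumptions made for illustration.

```python
def find_hotspots(positions, max_gap=10, min_sites=3):
    """Group mutated positions into dense runs (illustrative stand-in only).

    Consecutive positions at most `max_gap` bases apart join the same run;
    runs with fewer than `min_sites` distinct positions are discarded.
    Returns a list of (start, end, n_sites) tuples.
    """
    hotspots = []
    current = []
    for pos in sorted(set(positions)):
        # A large gap closes the current run and starts a new one
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                hotspots.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the final run
    if len(current) >= min_sites:
        hotspots.append((current[0], current[-1], len(current)))
    return hotspots


# Hypothetical mutated positions along a genome.
sites = [100, 103, 108, 500, 900, 905, 907, 912]
print(find_hotspots(sites))  # [(100, 108, 3), (900, 912, 4)]
```

A fuller density-based method would also weigh how often and how diversely each position mutates, which the abstract names as part of the hotspot definition.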

Citations: 0
Improving classification on imbalanced genomic data via KDE-based synthetic sampling.
IF 6.1 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2025-08-29 DOI: 10.1186/s13040-025-00474-5
Edoardo Taccaliti, Jesus S Aguilar-Ruiz

Class imbalance poses a serious challenge in biomedical machine learning, particularly in genomics, where datasets are characterized by extremely high dimensionality and very limited sample sizes. In such settings, standard classifiers tend to favor the majority class, leading to biased predictions, an especially problematic issue in clinical diagnostics where rare conditions must not be overlooked. In this study, we introduce a Kernel Density Estimation (KDE)-based oversampling approach to rebalance imbalanced genomic datasets by generating synthetic minority class samples. Unlike conventional methods such as SMOTE, KDE estimates the global probability distribution of the minority class and resamples accordingly, avoiding local interpolation pitfalls. We evaluate our method on 15 real-world genomic datasets using three classifiers (Naïve Bayes, Decision Trees, and Random Forests) and compare it to SMOTE and baseline training. Experimental results demonstrate that KDE oversampling consistently improves classification performance, especially in metrics robust to imbalance, such as AUC of the IMCP curve. Notably, KDE achieves superior results in tree-based models while dramatically simplifying the sampling process. This approach offers a statistically grounded and effective solution for balancing genomic datasets, with strong potential for improving fairness and accuracy in high-stakes medical decision-making.
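To make the idea of KDE-based oversampling concrete, here is a minimal sketch under stated assumptions: a Gaussian product kernel with one bandwidth per feature, set by Scott's rule. Drawing from such a KDE amounts to picking a random minority sample and adding Gaussian noise scaled by the bandwidth. The function name `kde_oversample`, the diagonal-bandwidth simplification, and the toy data are assumptions for illustration; the abstract does not specify the authors' kernel or bandwidth choice.

```python
import random
from statistics import stdev


def kde_oversample(minority, n_new, seed=0):
    """Draw n_new synthetic rows from a Gaussian KDE fitted to `minority`.

    minority: list of equal-length numeric rows (at least two rows).
    Bandwidth per feature j: n ** (-1 / (d + 4)) * stdev(feature j),
    i.e. Scott's rule applied with a diagonal (product) kernel.
    """
    rng = random.Random(seed)
    n, d = len(minority), len(minority[0])
    factor = n ** (-1.0 / (d + 4))
    bandwidths = [factor * stdev([row[j] for row in minority]) for j in range(d)]
    synthetic = []
    for _ in range(n_new):
        # Sampling a KDE = choose a kernel centre, then sample that kernel
        centre = rng.choice(minority)
        synthetic.append([rng.gauss(centre[j], bandwidths[j]) for j in range(d)])
    return synthetic


# Hypothetical minority class with two features (e.g. two expression values).
minority = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.3], [1.1, 5.1]]
new_rows = kde_oversample(minority, n_new=6, seed=42)
print(len(new_rows), len(new_rows[0]))  # 6 2
```

Unlike SMOTE, which interpolates between a sample and one of its nearest neighbours, this draws from a density estimated over the whole minority class, which is the contrast the abstract emphasises.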

Citations: 0