
Computational and Structural Biotechnology Journal: Latest Articles

Explainable machine learning for prognostic modeling of waitlist mortality in cirrhotic liver transplantation.
IF 4.1 | Zone 2 (Biology) | Q2 Biochemistry & Molecular Biology | Pub Date: 2025-11-26 | eCollection Date: 2025-01-01 | DOI: 10.1016/j.csbj.2025.11.057
Abdelghani Halimi, Nesma Houmani, Sonia Garcia-Salicetti, Ilias Kounis, Audrey Coilly, Eric Vibert

Accurate mortality prediction in liver transplant (LT) candidates is essential for improving organ allocation and prioritization. Scores such as the Model for End-Stage Liver Disease (MELD) are widely used but may overlook complex nonlinear interactions between risk factors. Machine learning (ML) offers improved predictive accuracy, but often at the expense of interpretability. In this study, we conduct a comprehensive comparison of three MELD-based scores against advanced ML models, including LDA, TabNet, RF, and LightGBM, to predict 3-, 6-, and 12-month waitlist mortality using retrospective data from the UNOS/OPTN registry. SHapley Additive exPlanations (SHAP) were used to provide deeper insight into the best model's decision-making process, offering both global and local explanations while pinpointing key risk factors. LightGBM emerged as the best-performing model, achieving AUROCs of 0.921, 0.892, and 0.872 for 3-, 6-, and 12-month mortality predictions, respectively. Moreover, our proposed Ensemble Learning Transplant Mortality (ELTM) score, derived from LightGBM, not only enhanced overall risk assessment but also improved equity and patient prioritization. The explanation component highlighted key predictors beyond traditional MELD components, such as the patient's functional state, age at registration, degree of ascites, and bilirubin changes over time. By introducing an explainable ML framework for prognostic modeling, this study provides a transparent, data-driven approach that could enhance the efficiency and fairness of organ allocation, potentially saving lives by prioritizing patients more accurately.
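
To make the reported AUROC values concrete, the following minimal sketch computes AUROC as the probability that a randomly chosen patient who died receives a higher risk score than a randomly chosen survivor, with ties counting one half. The function name `auroc` and the toy scores are illustrative assumptions; they are not data or code from the study.

```python
def auroc(scores, labels):
    """AUROC by pairwise comparison: P(score_pos > score_neg), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores; label 1 = died on the waitlist, 0 = survived.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
print(f"toy AUROC = {toy_auroc:.3f}")
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value and scales better to registry-sized data.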

Computational and Structural Biotechnology Journal, vol. 27, pp. 5590-5603 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12743420/pdf/
Citations: 0
In silico analysis of the functional implications of drug resistance associated mutations in Mycobacterium tuberculosis.
IF 4.1 | Zone 2 (Biology) | Q2 Biochemistry & Molecular Biology | Pub Date: 2025-11-24 | eCollection Date: 2025-01-01 | DOI: 10.1016/j.csbj.2025.11.054
Ankita Pal, Sapna Pal, Shweta Mahapatra, Apratim Pandey, Debasisa Mohanty

The tuberculosis mutation catalogue published by the World Health Organization (WHO) lists a large number of mutations based on the statistical significance of their association with resistance or susceptibility to various drugs. However, the mechanism by which they confer drug resistance is often not understood. To address this gap, we combined known resistance-associated mutations from the WHO catalogue with mutations newly discovered by explainable artificial intelligence (XAI). To decipher the mechanistic basis of drug resistance, we examined where these mutations occur in the three-dimensional (3D) structures of key drug targets, measured their proximity to drug binding sites, and compared their abundance in drug-resistant and drug-susceptible Mycobacterium tuberculosis (M.tb) strains. In parallel, we analyzed the functions of 112 newly identified drug-resistance-associated genes and compared them to known resistance genes, finding that most novel genes fall into different functional categories, though six share families with known resistance genes. We mapped coding mutations in all 112 novel genes to their functional domains, predicted 3D structures using AlphaFold3, and evaluated their effects on protein stability. Notably, our study highlights that mutations in ribosomal proteins (RpsN1, RpsN2) and the transporter PstB may introduce new resistance mechanisms, such as altered drug interactions or increased drug efflux. In contrast, analysis of non-coding mutations revealed that most are located at transcription factor binding sites, potentially affecting gene regulation. The current analysis provides valuable insights for designing experiments to decode the mechanistic basis of drug-resistant tuberculosis.
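
One common way to quantify the proximity of a mutated residue to a drug binding site is the minimum Euclidean distance between the residue's atoms and the bound ligand's atoms. The sketch below illustrates that idea only. The coordinates, the function names, and the 5 Å cutoff are assumptions made for illustration; the abstract does not state which distance definition or threshold the authors used.

```python
import math


def min_distance(residue_atoms, ligand_atoms):
    """Smallest atom-to-atom distance (in the units of the coordinates)."""
    return min(math.dist(a, b) for a in residue_atoms for b in ligand_atoms)


def near_binding_site(residue_atoms, ligand_atoms, cutoff=5.0):
    """True if any residue atom lies within `cutoff` of any ligand atom.

    The 5.0 default is an illustrative assumption, not a value from the study.
    """
    return min_distance(residue_atoms, ligand_atoms) <= cutoff


# Hypothetical coordinates in angstroms (x, y, z).
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
residue_close = [(4.5, 0.0, 0.0), (6.0, 1.0, 0.0)]
residue_far = [(20.0, 0.0, 0.0)]

print(min_distance(residue_close, ligand), near_binding_site(residue_close, ligand))
print(min_distance(residue_far, ligand), near_binding_site(residue_far, ligand))
```

In practice the coordinates would come from an experimental or predicted structure file rather than hand-written tuples.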

Computational and Structural Biotechnology Journal, vol. 27, pp. 5425-5440 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12703862/pdf/
Citations: 0
PathoEye: A deep learning framework for the whole-slide image analysis of skin tissue.
IF 4.1 | Zone 2 (Biology) | Q2 Biochemistry & Molecular Biology | Pub Date: 2025-11-23 | eCollection Date: 2025-01-01 | DOI: 10.1016/j.csbj.2025.11.052
Yusen Lin, Feiyan Lin, Yongjun Zhang, Jiayu Wen, Guomin Li, Xinquan Zeng, Hang Sun, Hang Jiang, Jingxia Lin, Teng Yan, Ruzheng Xue, Hao Sun, Bin Yang, Jiajian Zhou

Objective: To provide an interpretable computational framework for examining whole-slide images (WSI) in skin biopsies, PathoEye focuses on the dermis-epidermis junctional (DEJ) areas, also known as the basement membrane zone (BMZ), to enrich the pathological features of various skin conditions.

Method: We present PathoEye, a tool for WSI analysis in dermatology that integrates epidermis-guided sampling, deep learning, and radiomics. It automatically performs semantic segmentation of the BMZ and extracts distinct features associated with various skin conditions.

Results: By leveraging the BMZ-centric segmentation approach, PathoEye outperforms existing methods in multi-class classification tasks involving various skin conditions. It enables the investigation of histopathological aberrations in aged skin compared with young skin, and it highlights texture differences in the BMZ between young and aged skin. Further experimental analyses revealed that senescent cells were enriched in the BMZ, and that the turnover of basement membrane (BM) components, including COL17A1, COL4A2, and ITGA6, was increased in aged skin.

Conclusion: PathoEye is a WSI analysis tool that focuses on the features of the BMZ related to various skin conditions. The BMZ-centric patch sampling method improves the performance of the classification model for skin diseases.
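
The idea of sampling patches along the dermis-epidermis junction can be illustrated with a toy mask. The sketch below assumes a binary segmentation grid (1 = epidermis, 0 = dermis) and returns epidermis pixels that touch a dermis pixel as candidate patch centers. The mask, the function name `boundary_centers`, the 4-neighbourhood rule, and the `stride` parameter are all hypothetical; the abstract does not describe PathoEye's actual sampling procedure at this level of detail.

```python
def boundary_centers(mask, stride=1):
    """Return (row, col) of epidermis pixels adjacent to a dermis pixel.

    mask: list of equal-length rows, 1 = epidermis, 0 = dermis (assumed labels).
    stride: keep every `stride`-th boundary pixel to thin out patch centers.
    """
    rows, cols = len(mask), len(mask[0])
    centers = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1:
                continue
            # 4-neighbourhood: down, up, right, left
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] == 0:
                    centers.append((r, c))
                    break
    return centers[::stride]


# Toy mask: two rows of epidermis above two rows of dermis.
toy_mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(boundary_centers(toy_mask))
print(boundary_centers(toy_mask, stride=2))
```

Patches of a fixed size would then be cropped around each returned center so that every patch contains the junction.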

Computational and Structural Biotechnology Journal, vol. 27, pp. 5391-5400 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699262/pdf/
Citations: 0
In silico and bioassay-guided identification of potential anti-SARS-CoV-2 tentative candidate compounds from Andrographis paniculata extract.
IF 4.1 | Zone 2 (Biology) | Q2 Biochemistry & Molecular Biology | Pub Date: 2025-11-23 | eCollection Date: 2025-01-01 | DOI: 10.1016/j.csbj.2025.11.050
Jeerakit Kerdsiri, Kowit Hengphasatporn, Tasana Pitaksuteepong, Nitra Nuengchamnong, Aphinya Suroengrit, Phumbodin Chupinidsakulwong, Yasuteru Shigeta, Parvapan Bhattarakosol, Siwaporn Boonyasuppayakorn, Neti Waranuch

COVID-19, caused by the SARS-CoV-2 virus, is an infectious disease that targets the upper and lower respiratory tracts of humans and animals. SARS-CoV-2 infection is mediated by specific viral proteins, such as the spike protein, by various enzymes, including proteases and transferases, and by host proteins, including the ACE2 receptor. These proteins facilitate viral attachment and viral replication. This study aimed to identify candidate compounds from Andrographis paniculata that may account for its anti-SARS-CoV-2 activity, using both in vitro and in silico methods. The most effective fractions of the extract yielded 15 candidate compounds, identified based on their binding affinity to papain-like protease (PLpro), 3-chymotrypsin-like protease (3CLpro), RNA-dependent RNA polymerase (RdRp), and methyltransferase (MTase). Notably, these compounds showed no satisfactory binding to the spike protein or the ACE2 receptor, which points to a potential mechanism of action that targets viral replication rather than initial attachment. Nine flavonoid candidates (CP14, CP15, CP16, CP26, CP30, CP31, CP32, CP33, and CP39), together with 2-tert-butyl-6-[(3-tert-butyl-2-hydroxy-5-methylphenyl)methyl]-4-methylphenol (CP13), 6-butyryl-5,7-dihydroxy-8-isopentenyl-4-propylcoumarin (CP37), 7-demethyltangeretin (CP38), afromosin (CP43), asperglaucide (CP45), and galanolactone (CP53), have strong binding affinity for the viral proteins PLpro, 3CLpro, RdRp, and MTase. The LB-PaCS-MD/FMO framework revealed detailed ligand binding pathways and induced-fit adaptation of CP14 and CP45 within the flexible 3CLpro pocket, providing quantum-level insight into their stabilization mechanisms. These results suggest significant potential to disrupt viral replication and may guide the future development of anti-SARS-CoV-2 products.

Computational and Structural Biotechnology Journal, vol. 27, pp. 5479-5492 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12720350/pdf/
Citations: 0
Investigating DNA words and their distributions across the tree of life.
IF 4.1 | Zone 2 (Biology) | Q2 Biochemistry & Molecular Biology | Pub Date: 2025-11-22 | eCollection Date: 2025-01-01 | DOI: 10.1016/j.csbj.2025.11.040
Charalampos Koilakos, Kimonas Provatas, Michail Patsakis, Aris Karatzikos, Alexandros Tzanakakis, Ilias Georgakopoulos-Soares

The frequency distributions of DNA k-mers are shaped by fundamental biological processes and offer a window into genome structure and evolution. Inspired by analogies to natural language, prior studies have attempted to model genomic k-mer usage using Zipf's law, a rank-frequency law originally formulated for words in human language. However, the extent to which this law accurately captures the distribution of k-mers across diverse species remains unclear. Here, we systematically analyze k-mer frequency spectra across more than 225,000 genome assemblies spanning all three domains of life and viruses. We demonstrate that Zipf's law consistently underperforms in modeling k-mer distributions. As alternatives, we propose the truncated power law and Zipf-Mandelbrot distributions, which provide substantially improved fits across taxonomic groups. We show that genome size and GC content influence model performance, with larger genomes and genomes with imbalanced GC content yielding better fits. Additionally, we perform an extensive analysis of vocabulary expansion and exhaustion across the same organisms using Heaps' law. We apply our modeling framework to evaluate simulated genomes generated by k-let preserving shuffling and deep generative language models. Our results reveal substantial differences between organismal genomes and their synthetic or shuffled counterparts, offering a novel approach to benchmarking the biological plausibility of artificial genomes. Collectively, this work establishes new standards for modeling genomic k-mer distributions and provides insights relevant to synthetic biology and evolutionary sequence analysis.
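
The quantities discussed above can be sketched in a few lines: a rank-frequency spectrum of k-mers, the Zipf-Mandelbrot form f(r) = c / (r + q)^s (which reduces to Zipf's law when q = 0), and a Heaps-style vocabulary growth curve counting distinct k-mers seen so far. The toy sequence, the function names, and the parameter names `c`, `q`, `s` are assumptions for illustration. Model fitting is not shown, and the abstract does not give the authors' exact parametrization.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rank_frequency(counts):
    """Frequencies sorted from most to least common (rank 1 first)."""
    return sorted(counts.values(), reverse=True)


def zipf_mandelbrot(rank, c, q, s):
    """Expected frequency at a given rank: c / (rank + q)**s; q = 0 is Zipf's law."""
    return c / (rank + q) ** s


def vocabulary_growth(seq, k):
    """Number of distinct k-mers observed after each successive window."""
    seen, growth = set(), []
    for i in range(len(seq) - k + 1):
        seen.add(seq[i:i + k])
        growth.append(len(seen))
    return growth


toy_seq = "ACGTACGTAC"  # hypothetical toy sequence
counts = kmer_counts(toy_seq, 2)
print(rank_frequency(counts))
print(vocabulary_growth(toy_seq, 2))
```

On real genomes the growth curve flattens as the 4^k possible k-mers are exhausted, which is the "vocabulary exhaustion" effect that a Heaps-type analysis examines.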

Computational and Structural Biotechnology Journal, vol. 27, pp. 5335-5347 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12686733/pdf/
Citations: 0
CirRFKB: A knowledgebase of circadian-related risk factors for cancer pathogenesis and personalized medicine.
IF 4.1 | Zone 2 (Biology) | Q2 Biochemistry & Molecular Biology | Pub Date: 2025-11-22 | eCollection Date: 2025-01-01 | DOI: 10.1016/j.csbj.2025.11.051
Jiao Wang, Hui Zong, Yingbo Zhang, Xingyun Liu, Ke Shen, Xiaoyu Li, Rongrong Wu, Min Jiang, Daniel Rivero Cebrián, Juan Ramón Rabuñal Dopico, Bairong Shen

Circadian rhythms regulate numerous physiological and biochemical processes in humans, and their disruption is linked to elevated cancer risk and progression. Although substantial research has elucidated interactions between circadian mechanisms and cancer pathways, these findings remain fragmented and poorly integrated, impeding a holistic understanding. To address this gap, we developed the Circadian-Related Risk Factor Knowledgebase for Cancer (CirRFKB), a manually curated repository documenting validated associations between the circadian clock and cancer. CirRFKB curates data from 471 articles, encompassing 46 cancer types and 4052 records, and categorizes risk factors into 1449 single factors and 340 combinations. Single factors were categorized into 681 genetic factors, 106 environmental factors, 244 physiological factors, and 418 behavioral factors. These factors were further classified as 254 protective factors, 323 risk factors, 291 non-influencing factors, and 921 unclear factors. The user-friendly interface enables researchers to explore, visualize, and retrieve data through comprehensive browsing and query tools. CirRFKB provides a foundational resource that structures circadian-cancer interactions, offering systematic evidence to advance clinical applications in deep phenotyping for precision oncology and the optimization of chronotherapy. CirRFKB is publicly accessible at: http://bioinf.org.cn:9876/.

Computational and Structural Biotechnology Journal, vol. 27, pp. 5326-5334 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12686628/pdf/
Citations: 0
A formal approach to the hierarchical structures of microbial communities with negative interactions.
IF 4.1 | Zone 2 (Biology) | Q2 Biochemistry & Molecular Biology | Pub Date: 2025-11-21 | eCollection Date: 2025-01-01 | DOI: 10.1016/j.csbj.2025.11.036
Beatrice Ruth, Bashar Ibrahim, Peter Dittrich

Microbial communities typically consist of numerous species that coexist through intricate mutual dependencies. Understanding the structure of these communities and the interactions among their species is essential for explaining their functions and predicting their behavior. In this study, we follow the idea that a community organizes itself into a hierarchy of potentially persistent sub-communities. Previously, this hierarchy was described using Chemical Organization Theory (COT). However, that approach did not account for negative interactions. Here, we enhance the theory by incorporating negative interactions through an inhibitory resource called a toxin. For simplicity, we assume that a taxon sensitive to a toxin cannot coexist with a taxon that produces that toxin. Our results demonstrate that introducing a toxin reduces the number of organizations, with the extent of this reduction depending on various modeling parameters. Further, we show that the usage of essential resources leads to a computationally NP-hard transformation problem into direct taxa interactions. Additionally, we demonstrate that the number of measurements required to infer all persistent subspaces increases. We determine which groups of species are mutually excluded due to toxin interactions. Besides toxic interactions, it is also possible to infer cross-feeding aspects of the microbial community, for which a potential algorithm is outlined and illustrated by an example.
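
The coexistence rule stated in the abstract (a toxin-sensitive taxon cannot share a community with a producer of that toxin) can be illustrated with a small sketch. The taxa `A`, `B`, `C` and the toxin `t1` are hypothetical, and the code only filters candidate sub-communities by the exclusion rule; it is not the authors' algorithm and does not check the closure or self-maintenance conditions of Chemical Organization Theory.

```python
from itertools import combinations

# Hypothetical toy community: toxins each taxon produces or is sensitive to.
produces = {"A": {"t1"}, "B": set(), "C": set()}
sensitive = {"A": set(), "B": {"t1"}, "C": set()}

def is_compatible(subset):
    """True if no member is sensitive to a toxin produced by any member."""
    toxins_present = set().union(*(produces[t] for t in subset))
    return all(not (sensitive[t] & toxins_present) for t in subset)

def compatible_subsets(taxa):
    """Enumerate every subset of taxa that satisfies the exclusion rule."""
    taxa = sorted(taxa)
    return [set(c)
            for r in range(len(taxa) + 1)
            for c in combinations(taxa, r)
            if is_compatible(c)]

allowed = compatible_subsets(produces)
print(len(allowed), "of", 2 ** len(produces), "subsets satisfy the toxin rule")
```

In this toy input, the two subsets containing both `A` and `B` are excluded, so 6 of the 8 candidate subsets remain, which mirrors the abstract's claim that introducing a toxin reduces the number of admissible sub-communities.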

Computational and structural biotechnology journal, vol. 27, pp. 5561-5574. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12731272/pdf/
Citations: 0
Challenges in predicting protein-protein interactions of understudied viruses: Arenavirus-human interactions.
IF 4.1 Zone 2 (Biology) Q2 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2025-11-21 eCollection Date: 2025-01-01 DOI: 10.1016/j.csbj.2025.11.037
Harshita Sahni, Sarah Michelle Crotzer, Juston Moore, Steven S Branda, Trilce Estrada, S Gnanakaran

Understanding protein-protein interactions (PPIs) between viruses and host organisms is crucial for uncovering infection mechanisms and identifying potential therapeutic targets. The ability to generalize PPI predictive models across understudied viruses presents a significant challenge. In this work, we use arenavirus-human PPIs to illustrate the difficulties associated with model generalization, which are compounded by a lack of both positive and negative data. We employ a Transfer Learning approach to investigate arenavirus-human PPIs by utilizing models trained on better-studied virus-human and human-human PPIs. Additionally, we curate and assess four types of negative sampling datasets to evaluate their impact on model performance. Despite the overall high accuracies (93-99 %) and AUPRC scores (0.8-0.9) appearing promising, further analysis indicates that these performance metrics can be misleading due to data leakage, data bias, and overfitting, especially concerning under-represented viral proteins. We reveal these gaps and assess the impact of data imbalance using standard k-fold cross-validation and Independent Blind Testing with a Balanced Dataset, resulting in a drop in accuracy below 50 %. We propose a viral protein-specific evaluation framework that categorizes viral proteins into majority and minority classes based on their representation in the dataset, enabling comparison of model performance across these groups using balanced accuracies. This framework offers a more robust evaluation of model generalizability, addressing biases inherent in standard evaluation techniques and paving the way for more reliable PPI prediction models for understudied viruses.
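
The viral protein-specific evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `min_count` threshold that separates majority from minority proteins, the protein names `NP` and `Z` as used here, and all labels and predictions are assumptions made up for the example. Balanced accuracy is computed as the mean of sensitivity and specificity.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = interacting)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return None  # undefined when one class is absent
    return 0.5 * (tp / pos + tn / neg)

def per_group_balanced_accuracy(records, min_count):
    """records: (viral_protein, y_true, y_pred) triples.

    Proteins with at least min_count records form the majority group,
    all others the minority group (the threshold is an assumption).
    """
    counts = Counter(vp for vp, _, _ in records)
    groups = {"majority": ([], []), "minority": ([], [])}
    for vp, t, p in records:
        name = "majority" if counts[vp] >= min_count else "minority"
        groups[name][0].append(t)
        groups[name][1].append(p)
    return {name: balanced_accuracy(t, p) for name, (t, p) in groups.items()}

# Invented example: a well-represented protein and a rarely seen one.
records = [
    ("NP", 1, 1), ("NP", 1, 1), ("NP", 0, 0), ("NP", 0, 0),
    ("Z", 1, 0), ("Z", 0, 0),
]
scores = per_group_balanced_accuracy(records, min_count=3)
print(scores)
```

With these invented records, overall accuracy is 5/6, yet the minority group scores only 0.5 in balanced accuracy while the majority group scores 1.0. This is the kind of gap that a single aggregate metric can hide.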

Computational and structural biotechnology journal, vol. 27, pp. 5401-5412. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12703866/pdf/
Citations: 0
Self-supervised domain adaptation of protein language model based solely on positive enzyme-reaction pairs.
IF 4.1 Zone 2 (Biology) Q2 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2025-11-21 eCollection Date: 2025-01-01 DOI: 10.1016/j.csbj.2025.11.045
Tomoya Okuno, Naoaki Ono, Md Altaf-Ul-Amin, Shigehiko Kanaya

There is growing interest in developing predictive models of enzyme catalytic properties that leverage activity data spanning diverse enzyme families. A fundamental challenge lies in the inherent biases of public biochemical databases. These databases predominantly catalog valid enzyme activities, rarely include negative instances, and report quantitative catalytic parameters for only a relatively small subset of enzymes. Such limitations pose a major obstacle to supervised learning of enzyme catalytic properties. One existing approach for model training involves generating synthetic negative enzyme-activity pairs by recombining existing enzymes and their activity information, particularly substrates or chemical reactions, that were not originally associated within datasets. However, it remains unclear whether the generated negative examples are truly inactive or merely unobserved active instances. To build a model that captures functional properties across diverse enzyme families while avoiding reliance on negative examples, this paper introduces a self-supervised domain adaptation methodology for pre-trained protein language models, solely based on positive enzyme-reaction pairs. The enzyme representations obtained from the adapted protein language model achieved superior or at least competitive performance compared to those from an existing method that relies on synthetic negatives, in both the turnover number prediction task for natural reactions of wild-type enzymes and the activity prediction task for family-wide enzyme-substrate specificity screening datasets. Overall, our approach represents a methodological advancement that eliminates the need for synthetic negatives and provides a scalable framework for leveraging the growing enzyme activity data in biochemical databases.

Computational and structural biotechnology journal, vol. 27, pp. 5441-5449. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12712682/pdf/
Citations: 0
PanARGMiner (Pan-Genomic Antimicrobial Resistance Gene Miner): An advanced feature selection framework for extracting key resistance genes from pan-genomic datasets.
IF 4.1 Zone 2 (Biology) Q2 BIOCHEMISTRY & MOLECULAR BIOLOGY Pub Date: 2025-11-21 eCollection Date: 2025-01-01 DOI: 10.1016/j.csbj.2025.11.046
Yu-Cheng Chen, Ming-Ren Yang, Yu-Wei Wu

Identifying antimicrobial resistance (AMR)-related biomarkers from large-scale genomic datasets is often akin to finding a needle in a haystack. With pan-genomic data containing more than 100,000 gene sequences, isolating features that truly drive resistance remains a major challenge in computational biology. Here we present PanARGMiner, a machine learning-based feature selection framework designed to robustly extract highly relevant and informative biomarkers from high-dimensional biological data. PanARGMiner uses an ensemble-based feature selection strategy to select highly informative and compact feature subsets. It then utilizes repeated iterations to ensure the stability and reliability of the proposed framework, enabling PanARGMiner to generate significantly reduced features with comparable prediction performance compared to those obtained with other feature selection algorithms. Applying PanARGMiner to bacterial pan-genomic antimicrobial resistance datasets successfully extracted as few as one to ten candidate AMR biomarkers from datasets with more than 100,000 genes for five common pathogens. Although many of the extracted candidate AMR biomarkers are well-known resistance genes, proteins not known to be associated with AMR mechanisms, including functionally uncharacterized hypothetical proteins, were also extracted. This indicates the potential of PanARGMiner in revealing both established and novel mechanisms of antibiotic resistance, thus providing actionable insights for biomarker discovery, functional genomics, and precision medicine based on complex data. Its ability to uncover both known and uncharacterized resistance-related features offers new opportunities for research and clinical applications in combating AMR.
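
The general idea of ensemble-based feature selection with repeated iterations can be sketched in a few lines. The abstract does not specify PanARGMiner's scorers, voting rule, or resampling scheme, so everything below is an assumption chosen for illustration: two simple scorers, a union vote over each scorer's top-ranked genes, leave-one-out repetition, and a hypothetical gene presence/absence matrix.

```python
def presence_rate_gap(column, labels):
    """Absolute difference in gene presence rate between resistant (1) and susceptible (0) isolates."""
    res = [x for x, y in zip(column, labels) if y == 1]
    sus = [x for x, y in zip(column, labels) if y == 0]
    return abs(sum(res) / len(res) - sum(sus) / len(sus))

def single_gene_accuracy(column, labels):
    """Accuracy of predicting the label from presence alone, in either direction."""
    acc = sum(1 for x, y in zip(column, labels) if x == y) / len(labels)
    return max(acc, 1 - acc)

SCORERS = [presence_rate_gap, single_gene_accuracy]

def ensemble_select(matrix, labels, top_k):
    """Union of each scorer's top_k genes; matrix maps gene name -> presence list."""
    chosen = set()
    for scorer in SCORERS:
        ranked = sorted(matrix, key=lambda g: scorer(matrix[g], labels), reverse=True)
        chosen.update(ranked[:top_k])
    return chosen

def stable_genes(matrix, labels, top_k=1, min_freq=0.9):
    """Repeat the selection on leave-one-out subsets and keep genes chosen in most runs."""
    n = len(labels)
    counts = {g: 0 for g in matrix}
    for left_out in range(n):
        keep = [i for i in range(n) if i != left_out]
        sub = {g: [col[i] for i in keep] for g, col in matrix.items()}
        sub_labels = [labels[i] for i in keep]
        for g in ensemble_select(sub, sub_labels, top_k):
            counts[g] += 1
    return sorted(g for g, c in counts.items() if c / n >= min_freq)

# Hypothetical presence/absence data for 8 isolates (4 resistant, 4 susceptible).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
matrix = {
    "gene_a": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks resistance exactly
    "gene_b": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the label
    "gene_c": [1, 1, 1, 0, 0, 0, 0, 1],  # partially informative
    "gene_d": [1, 1, 1, 1, 1, 1, 1, 1],  # core gene present in every isolate
}
selected = stable_genes(matrix, labels)
print(selected)
```

Requiring a gene to be selected in most repetitions is one simple way to favour compact, stable feature sets; here only `gene_a` survives. Real pan-genomic data would call for stronger scorers and a resampling scheme that guards against class imbalance.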

Computational and structural biotechnology journal, vol. 27, pp. 5363-5374. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699266/pdf/
Citations: 0