Accurate mortality prediction in liver transplant (LT) candidates is essential for improving organ allocation and prioritization. Scores such as the Model for End-Stage Liver Disease (MELD) are widely used but may overlook complex nonlinear interactions between risk factors. Machine learning (ML) offers improved predictive accuracy, but often at the expense of interpretability. In this study, we conduct a comprehensive comparison of three MELD-based scores against advanced ML models, including LDA, TabNet, RF, and LightGBM, to predict 3-, 6-, and 12-month waitlist mortality using retrospective data from the UNOS/OPTN registry. SHapley Additive exPlanations (SHAP) were used to provide deeper insight into the best model's decision-making process, offering both global and local explanations while pinpointing key risk factors. LightGBM emerged as the best-performing model, achieving AUROCs of 0.921, 0.892, and 0.872 for 3-, 6-, and 12-month mortality predictions, respectively. Moreover, our proposed Ensemble Learning Transplant Mortality (ELTM) score, derived from LightGBM, not only enhanced overall risk assessment but also improved equity and patient prioritization. The explanation component highlighted key predictors beyond traditional MELD components, such as the patient's functional state, age at registration, degree of ascites, and bilirubin changes over time. By introducing an explainable ML framework for prognostic modeling, this study provides a transparent, data-driven approach that could enhance the efficiency and fairness of organ allocation, potentially saving lives by prioritizing patients more accurately.
Title: Explainable machine learning for prognostic modeling of waitlist mortality in cirrhotic liver transplantation. Authors: Abdelghani Halimi, Nesma Houmani, Sonia Garcia-Salicetti, Ilias Kounis, Audrey Coilly, Eric Vibert. Journal: Computational and Structural Biotechnology Journal, vol. 27, pp. 5590-5603. Pub Date: 2025-11-26. DOI: 10.1016/j.csbj.2025.11.057. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12743420/pdf/
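The AUROC values quoted in the liver-transplant abstract above can be read as the probability that a randomly chosen patient who died on the waitlist receives a higher risk score than a randomly chosen survivor. The following is a minimal sketch of that pairwise definition in plain Python; the function name `auroc` and the toy labels and scores are illustrative assumptions, not the authors' evaluation code or data.

```python
def auroc(labels, scores):
    """AUROC via the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 outcomes (1 = event, e.g. waitlist death).
    scores: iterable of predicted risk scores, higher = riskier.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 3 of the 4 positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic loop is kept for readability; a rank-based implementation would be preferable for registry-scale data.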
The tuberculosis mutation catalogue published by the World Health Organization (WHO) lists a large number of mutations based on the statistical significance of their association with resistance or susceptibility to various drugs. However, the mechanism by which they confer drug resistance is often not understood. To address this gap, we combined known resistance-associated mutations from the WHO catalogue with mutations newly discovered by explainable artificial intelligence (XAI). To decipher the mechanistic basis of drug resistance, we examined where these mutations occur in the three-dimensional (3D) structures of key drug targets, measured their proximity to drug binding sites, and compared their abundance in drug-resistant and drug-susceptible Mycobacterium tuberculosis (M.tb) strains. In parallel, we analyzed the functions of 112 newly identified drug resistance-associated genes and compared them to known resistance genes, finding that most novel genes fall into different functional categories, though six share families with known resistance genes. We mapped coding mutations in all 112 novel genes to their functional domains, predicted 3D structures using AlphaFold3, and evaluated their effects on protein stability. Notably, our study highlights that mutations in ribosomal proteins (RpsN1, RpsN2) and the transporter PstB may introduce new resistance mechanisms, such as altered drug interactions or increased drug efflux. By contrast, analysis of non-coding mutations revealed that most are located at transcription factor binding sites, potentially affecting gene regulation. The current analysis provides valuable insights for designing experiments to decode the mechanistic basis of drug-resistant tuberculosis.
Title: In silico analysis of the functional implications of drug resistance associated mutations in Mycobacterium tuberculosis. Authors: Ankita Pal, Sapna Pal, Shweta Mahapatra, Apratim Pandey, Debasisa Mohanty. Journal: Computational and Structural Biotechnology Journal, vol. 27, pp. 5425-5440. Pub Date: 2025-11-24. DOI: 10.1016/j.csbj.2025.11.054. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12703862/pdf/
Pub Date: 2025-11-23; eCollection Date: 2025-01-01; DOI: 10.1016/j.csbj.2025.11.052
Yusen Lin, Feiyan Lin, Yongjun Zhang, Jiayu Wen, Guomin Li, Xinquan Zeng, Hang Sun, Hang Jiang, Jingxia Lin, Teng Yan, Ruzheng Xue, Hao Sun, Bin Yang, Jiajian Zhou
Objective: To provide an interpretable computational framework for examining whole-slide images (WSIs) of skin biopsies. PathoEye focuses on the dermis-epidermis junctional (DEJ) areas, also known as the basement membrane zone (BMZ), to enrich the pathological features of various skin conditions.
Method: We present PathoEye for WSI analysis in dermatology, which integrates epidermis-guided sampling, deep learning, and radiomics. It automatically performs semantic segmentation of the BMZ and extracts distinct features associated with various skin conditions.
Results: By leveraging the BMZ-centric segmentation approach, PathoEye outperforms existing methods in multi-class classification tasks involving various skin conditions. It enables the investigation of histopathological aberrations in aged skin compared with young skin. Additionally, it highlights the texture changes in the BMZ of young skin compared with aged skin. Further experimental analyses revealed that senescent cells were enriched in the BMZ, and that the turnover of basement membrane (BM) components, including COL17A1, COL4A2, and ITGA6, was increased in aged skin.
Conclusion: PathoEye is a WSI analysis tool that focuses on the features of the BMZ related to various skin conditions. The BMZ-centric patch sampling method improves the performance of the classification model for skin diseases.
Title: PathoEye: A deep learning framework for the whole-slide image analysis of skin tissue. Journal: Computational and Structural Biotechnology Journal, vol. 27, pp. 5391-5400. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699262/pdf/
COVID-19, caused by the SARS-CoV-2 virus, is an infectious disease that targets the upper and lower respiratory tracts of humans and animals. SARS-CoV-2 infection is mediated by specific viral proteins, such as the spike protein, by various enzymes, including proteases and transferases, and by host proteins, including the ACE2 receptor. These proteins facilitate viral attachment and viral replication. This study aimed to identify potential candidate compounds underlying the anti-SARS-CoV-2 property of Andrographis paniculata, using both in vitro and in silico methodologies. The most effective fractions of the extract yielded 15 candidate compounds, identified based on their binding affinity to papain-like protease (PLpro), 3-chymotrypsin-like protease (3CLpro), RNA-dependent RNA polymerase (RdRp), and methyltransferase (MTase). Notably, these compounds showed no satisfactory binding to the spike protein or the ACE2 receptor. This points to a potential mechanism of action that targets viral replication rather than initial attachment. Nine flavonoid candidate compounds (CP14, CP15, CP16, CP26, CP30, CP31, CP32, CP33, and CP39), together with 2-tert-butyl-6-[(3-tert-butyl-2-hydroxy-5-methylphenyl)methyl]-4-methylphenol (CP13), 6-butyryl-5,7-dihydroxy-8-isopentenyl-4-propylcoumarin (CP37), 7-demethyltangeretin (CP38), afromosin (CP43), asperglaucide (CP45), and galanolactone (CP53), have strong binding affinity for the viral proteins PLpro, 3CLpro, RdRp, and MTase. The LB-PaCS-MD/FMO framework revealed detailed ligand binding pathways and induced-fit adaptation of CP14 and CP45 within the flexible 3CLpro pocket, providing quantum-level insight into their stabilization mechanisms. These results suggest significant potential to disrupt viral replication and may guide the development of future anti-SARS-CoV-2 products.
Title: In silico and bioassay-guided identification of potential anti-SARS-CoV-2 tentative candidate compounds from Andrographis paniculata extract. Authors: Jeerakit Kerdsiri, Kowit Hengphasatporn, Tasana Pitaksuteepong, Nitra Nuengchamnong, Aphinya Suroengrit, Phumbodin Chupinidsakulwong, Yasuteru Shigeta, Parvapan Bhattarakosol, Siwaporn Boonyasuppayakorn, Neti Waranuch. Journal: Computational and Structural Biotechnology Journal, vol. 27, pp. 5479-5492. Pub Date: 2025-11-23. DOI: 10.1016/j.csbj.2025.11.050. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12720350/pdf/
The frequency distributions of DNA k-mers are shaped by fundamental biological processes and offer a window into genome structure and evolution. Inspired by analogies to natural language, prior studies have attempted to model genomic k-mer usage using Zipf's law, a rank-frequency law originally formulated for words in human language. However, the extent to which this law accurately captures the distribution of k-mers across diverse species remains unclear. Here, we systematically analyze k-mer frequency spectra across more than 225,000 genome assemblies spanning all three domains of life and viruses. We demonstrate that Zipf's law consistently underperforms in modeling k-mer distributions. As alternatives, we propose the truncated power law and Zipf-Mandelbrot distributions, which provide substantially improved fits across taxonomic groups. We show that genome size and GC content influence model performance, with larger genomes and genomes with imbalanced GC content yielding better fits. Additionally, we perform an extensive analysis of vocabulary expansion and exhaustion across the same organisms using Heaps' law. We apply our modeling framework to evaluate simulated genomes generated by k-let-preserving shuffling and deep generative language models. Our results reveal substantial differences between organismal genomes and their synthetic or shuffled counterparts, offering a novel approach to benchmark the biological plausibility of artificial genomes. Collectively, this work establishes new standards for modeling genomic k-mer distributions and provides insights relevant to synthetic biology and evolutionary sequence analysis.
Title: Investigating DNA words and their distributions across the tree of life. Authors: Charalampos Koilakos, Kimonas Provatas, Michail Patsakis, Aris Karatzikos, Alexandros Tzanakakis, Ilias Georgakopoulos-Soares. Journal: Computational and Structural Biotechnology Journal, vol. 27, pp. 5335-5347. Pub Date: 2025-11-22. DOI: 10.1016/j.csbj.2025.11.040. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12686733/pdf/
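To make the k-mer rank-frequency idea from the abstract above concrete, here is a small sketch in plain Python. It counts k-mers in a sequence, sorts the counts into a rank-frequency spectrum, tracks vocabulary growth (the quantity Heaps' law describes), and defines the textbook forms of Zipf's law, f(r) = c · r^(-alpha), and the Zipf-Mandelbrot law, f(r) = c · (r + beta)^(-alpha). The function names, the toy sequence, and the parametrization are assumptions for illustration; the abstract does not state the exact parametrization or fitting procedure the authors used.

```python
from collections import Counter


def kmers(seq, k):
    """Yield overlapping k-mers, skipping any window with non-ACGT symbols."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            yield word


def rank_frequency(seq, k):
    """Return k-mer counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(kmers(seq, k)).values(), reverse=True)


def vocabulary_growth(seq, k):
    """Number of distinct k-mers seen after each successive k-mer."""
    seen, growth = set(), []
    for word in kmers(seq, k):
        seen.add(word)
        growth.append(len(seen))
    return growth


def zipf(rank, c, alpha):
    """Zipf's law: expected frequency at a given rank."""
    return c * rank ** (-alpha)


def zipf_mandelbrot(rank, c, alpha, beta):
    """Zipf-Mandelbrot law: Zipf's law with a rank offset beta."""
    return c * (rank + beta) ** (-alpha)


# Toy sequence (hypothetical): dinucleotides AC, CG, GT occur twice and TA once.
print(rank_frequency("ACGTACGT", 2))     # [2, 2, 2, 1]
print(vocabulary_growth("ACGTACGT", 2))  # [1, 2, 3, 4, 4, 4, 4]
```

With beta = 0 the Zipf-Mandelbrot form reduces to Zipf's law, which is why the offset can only help the fit at low ranks.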
Pub Date: 2025-11-22; eCollection Date: 2025-01-01; DOI: 10.1016/j.csbj.2025.11.051
Jiao Wang, Hui Zong, Yingbo Zhang, Xingyun Liu, Ke Shen, Xiaoyu Li, Rongrong Wu, Min Jiang, Daniel Rivero Cebrián, Juan Ramón Rabuñal Dopico, Bairong Shen
Circadian rhythms regulate numerous physiological and biochemical processes in humans, and their disruption is linked to elevated cancer risk and progression. Although substantial research has elucidated interactions between circadian mechanisms and cancer pathways, these findings remain fragmented and poorly integrated, impeding a holistic understanding. To address this gap, we developed the Circadian-Related Risk Factor Knowledgebase for Cancer (CirRFKB), a manually curated repository documenting validated associations between the circadian clock and cancer. CirRFKB curates data from 471 articles, encompassing 46 cancer types and 4052 records, categorizing risk factors into 1449 single factors and 340 combinations. Single factors were categorized into 681 genetic factors, 106 environmental factors, 244 physiological factors, and 418 behavioral factors. These factors were further classified as 254 protective factors, 323 risk factors, 291 no-influencing factors, and 921 unclear factors. The user-friendly interface enables researchers to explore, visualize, and retrieve data through comprehensive browsing and query tools. CirRFKB provides a foundational resource that structures circadian-cancer interactions, offering systematic evidence to advance clinical applications in deep phenotyping for precision oncology and the optimization of chronotherapy. CirRFKB is publicly accessible at: http://bioinf.org.cn:9876/.
Title: CirRFKB: A knowledgebase of circadian-related risk factors for cancer pathogenesis and personalized medicine. Journal: Computational and Structural Biotechnology Journal, vol. 27, pp. 5326-5334. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12686628/pdf/
Pub Date: 2025-11-21; eCollection Date: 2025-01-01; DOI: 10.1016/j.csbj.2025.11.036
Beatrice Ruth, Bashar Ibrahim, Peter Dittrich
Microbial communities typically consist of numerous species that coexist through intricate mutual dependencies. Understanding the structure of these communities and the interactions among their species is essential for explaining their functions and predicting their behavior. In this study, we follow the idea that a community organizes itself into a hierarchy of potentially persistent sub-communities. Previously, this hierarchy was described using Chemical Organization Theory (COT). However, that approach did not account for negative interactions. Here, we extend the theory by incorporating negative interactions through an inhibitory resource called a toxin. For simplicity, we assume that a taxon sensitive to a toxin cannot coexist with a taxon that produces that toxin. Our results demonstrate that introducing a toxin reduces the number of organizations, with the extent of this reduction depending on various modeling parameters. Further, we show that when essential resources are used, transforming the model into direct taxon-taxon interactions becomes an NP-hard problem. Additionally, we demonstrate that the number of measurements required to infer all persistent subspaces increases. We determine which groups of species are mutually excluded due to toxin interactions. Besides toxic interactions, it is also possible to infer cross-feeding aspects of the microbial community, for which a potential algorithm is outlined and illustrated by an example.
Title: A formal approach to the hierarchical structures of microbial communities with negative interactions. Journal: Computational and Structural Biotechnology Journal, vol. 27, pp. 5561-5574. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12731272/pdf/
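The coexistence assumption stated in the microbial-community abstract above (a taxon sensitive to a toxin cannot coexist with a taxon that produces it) can be illustrated with a brute-force sketch. The code below only filters candidate sub-communities by that exclusion rule; it does not implement Chemical Organization Theory itself (closure and self-maintenance are not checked), and the taxon names, toxin names, and function names are hypothetical.

```python
from itertools import combinations


def is_compatible(community, produces, sensitive_to):
    """True if no member is sensitive to a toxin produced inside the community."""
    toxins = set()
    for taxon in community:
        toxins |= produces.get(taxon, set())
    return all(not (sensitive_to.get(taxon, set()) & toxins) for taxon in community)


def compatible_communities(taxa, produces, sensitive_to):
    """Enumerate all subsets of taxa that satisfy the exclusion rule (exponential in len(taxa))."""
    return [
        set(subset)
        for size in range(len(taxa) + 1)
        for subset in combinations(taxa, size)
        if is_compatible(subset, produces, sensitive_to)
    ]


def excluded_pairs(taxa, produces, sensitive_to):
    """Pairs of taxa that can never share a community under the rule."""
    return [
        pair for pair in combinations(taxa, 2)
        if not is_compatible(pair, produces, sensitive_to)
    ]


# Hypothetical toy system: A produces tox1, B is sensitive to it, C is neutral.
taxa = ["A", "B", "C"]
produces = {"A": {"tox1"}}
sensitive_to = {"B": {"tox1"}}
print(len(compatible_communities(taxa, produces, sensitive_to)))  # 6 of the 8 subsets
print(excluded_pairs(taxa, produces, sensitive_to))               # [('A', 'B')]
```

In this toy case the toxin removes the two subsets containing both A and B, a small-scale view of the reduction in candidate communities that the abstract describes.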
Pub Date: 2025-11-21; eCollection Date: 2025-01-01; DOI: 10.1016/j.csbj.2025.11.037
Harshita Sahni, Sarah Michelle Crotzer, Juston Moore, Steven S Branda, Trilce Estrada, S Gnanakaran
Understanding protein-protein interactions (PPIs) between viruses and host organisms is crucial for uncovering infection mechanisms and identifying potential therapeutic targets. Generalizing PPI predictive models to understudied viruses presents a significant challenge. In this work, we use arenavirus-human PPIs to illustrate the difficulties associated with model generalization, which are compounded by a lack of both positive and negative data. We employ a transfer learning approach to investigate arenavirus-human PPIs by utilizing models trained on better-studied virus-human and human-human PPIs. Additionally, we curate and assess four types of negative sampling datasets to evaluate their impact on model performance. Although the overall high accuracies (93-99%) and AUPRC scores (0.8-0.9) appear promising, further analysis indicates that these performance metrics can be misleading due to data leakage, data bias, and overfitting, especially for under-represented viral proteins. We reveal these gaps and assess the impact of data imbalance using standard k-fold cross-validation and independent blind testing with a balanced dataset, under which accuracy drops below 50%. We propose a viral protein-specific evaluation framework that categorizes viral proteins into majority and minority classes based on their representation in the dataset, enabling comparison of model performance across these groups using balanced accuracies. This framework offers a more robust evaluation of model generalizability, addressing biases inherent in standard evaluation techniques and paving the way for more reliable PPI prediction models for understudied viruses.
{"title":"Challenges in predicting protein-protein interactions of understudied viruses: Arenavirus-human interactions.","authors":"Harshita Sahni, Sarah Michelle Crotzer, Juston Moore, Steven S Branda, Trilce Estrada, S Gnanakaran","doi":"10.1016/j.csbj.2025.11.037","DOIUrl":"10.1016/j.csbj.2025.11.037","url":null,"PeriodicalId":10715,"journal":{"name":"Computational and structural biotechnology journal","volume":"27 ","pages":"5401-5412"},"PeriodicalIF":4.1,"publicationDate":"2025-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12703866/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145767348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is growing interest in developing predictive models of enzyme catalytic properties that leverage activity data spanning diverse enzyme families. A fundamental challenge lies in the inherent biases of public biochemical databases. These databases predominantly catalog valid enzyme activities, rarely include negative instances, and report quantitative catalytic parameters for only a relatively small subset of enzymes. Such limitations pose a major obstacle to supervised learning of enzyme catalytic properties. One existing approach for model training involves generating synthetic negative enzyme-activity pairs by recombining existing enzymes and their activity information, particularly substrates or chemical reactions, that were not originally associated within datasets. However, it remains unclear whether the generated negative examples are truly inactive or merely unobserved active instances. To build a model that captures functional properties across diverse enzyme families while avoiding reliance on negative examples, this paper introduces a self-supervised domain adaptation methodology for pre-trained protein language models, solely based on positive enzyme-reaction pairs. The enzyme representations obtained from the adapted protein language model achieved superior or at least competitive performance compared to those from an existing method that relies on synthetic negatives, in both the turnover number prediction task for natural reactions of wild-type enzymes and the activity prediction task for family-wide enzyme-substrate specificity screening datasets. Overall, our approach represents a methodological advancement that eliminates the need for synthetic negatives and provides a scalable framework for leveraging the growing enzyme activity data in biochemical databases.
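The synthetic-negative approach that this abstract argues against can be sketched as below, to make the ambiguity concrete. This is an illustrative sketch only: the function name and the enzyme and substrate identifiers are hypothetical, and the exhaustive enumeration is only practical for small datasets.

```python
import random


def sample_synthetic_negatives(positive_pairs, n_negatives, seed=0):
    """Recombine known enzymes and substrates into pairs that were never
    reported together, and label those pairs as negatives.

    Note: absence from the database does not prove inactivity, so some of
    these "negatives" may be unobserved positives.
    """
    positives = set(positive_pairs)
    enzymes = sorted({enzyme for enzyme, _ in positives})
    substrates = sorted({substrate for _, substrate in positives})
    candidates = [
        (enzyme, substrate)
        for enzyme in enzymes
        for substrate in substrates
        if (enzyme, substrate) not in positives
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_negatives, len(candidates)))


# Hypothetical identifiers, for illustration only.
toy_positives = [("enzA", "sub1"), ("enzA", "sub2"), ("enzB", "sub2"), ("enzC", "sub3")]
toy_negatives = sample_synthetic_negatives(toy_positives, n_negatives=3)
print(toy_negatives)
```

Every sampled pair is simply one that is missing from the positive set, which is why such examples cannot distinguish truly inactive pairs from untested ones.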
{"title":"Self-supervised domain adaptation of protein language model based solely on positive enzyme-reaction pairs.","authors":"Tomoya Okuno, Naoaki Ono, Md Altaf-Ul-Amin, Shigehiko Kanaya","doi":"10.1016/j.csbj.2025.11.045","DOIUrl":"10.1016/j.csbj.2025.11.045","url":null,"PeriodicalId":10715,"journal":{"name":"Computational and structural biotechnology journal","volume":"27 ","pages":"5441-5449"},"PeriodicalIF":4.1,"publicationDate":"2025-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12712682/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145803408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-21 eCollection Date: 2025-01-01 DOI: 10.1016/j.csbj.2025.11.046
Yu-Cheng Chen, Ming-Ren Yang, Yu-Wei Wu
Identifying antimicrobial resistance (AMR)-related biomarkers from large-scale genomic datasets is often akin to finding a needle in a haystack. With pan-genomic data containing more than 100,000 gene sequences, isolating the features that truly drive resistance remains a major challenge in computational biology. Here we present PanARGMiner, a machine learning-based feature selection framework designed to robustly extract highly relevant and informative biomarkers from high-dimensional biological data. PanARGMiner uses an ensemble-based feature selection strategy to select compact, highly informative feature subsets. It then repeats the selection over multiple iterations to ensure stable and reliable results, yielding substantially smaller feature sets whose prediction performance is comparable to that obtained with other feature selection algorithms. Applied to bacterial pan-genomic AMR datasets for five common pathogens, PanARGMiner extracted as few as one to ten candidate AMR biomarkers from datasets with more than 100,000 genes. Although many of the extracted candidates are well-known resistance genes, proteins not known to be associated with AMR mechanisms, including functionally uncharacterized hypothetical proteins, were also extracted. This indicates the potential of PanARGMiner to reveal both established and novel mechanisms of antibiotic resistance, providing actionable insights for biomarker discovery, functional genomics, and precision medicine based on complex data. Its ability to uncover both known and uncharacterized resistance-related features offers new opportunities for research and clinical applications in combating AMR.
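The general idea of ensemble-based selection with repeated iterations can be sketched as below. This is not PanARGMiner's algorithm, whose details the abstract does not give. The two scoring functions, the bootstrap resampling, the unanimous vote across scorers, and the `min_freq` stability cutoff are all assumptions chosen for illustration.

```python
import random


def mean_diff_score(column, labels):
    """Absolute difference in gene presence rate between the two classes."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # undefined when a resample contains only one class
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def agreement_score(column, labels):
    """Fraction of isolates whose gene presence equals the resistance label."""
    return sum(1 for x, y in zip(column, labels) if x == y) / len(labels)


def top_k(scores, k):
    # Stable sort: tied features keep the lower index first.
    return set(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])


def stable_ensemble_selection(X, y, k=1, n_repeats=20, min_freq=0.8, seed=0):
    """Keep features that every scorer ranks in its top k in at least
    min_freq of the bootstrap repeats."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    scorers = (mean_diff_score, agreement_score)
    freq = [0] * n_features
    for _ in range(n_repeats):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        labels = [y[i] for i in rows]
        picks = []
        for scorer in scorers:
            scores = [scorer([X[i][j] for i in rows], labels) for j in range(n_features)]
            picks.append(top_k(scores, k))
        for j in set.intersection(*picks):
            freq[j] += 1
    return [j for j in range(n_features) if freq[j] / n_repeats >= min_freq]


# Toy presence/absence matrix (invented): gene 0 tracks resistance exactly,
# gene 1 is always present, gene 2 is always absent.
toy_y = [1, 0] * 10
toy_X = [[label, 1, 0] for label in toy_y]
print(stable_ensemble_selection(toy_X, toy_y))
```

Requiring agreement across scorers and across repeats trades recall for stability: a feature that ranks highly only under one criterion or in a few resamples is dropped, which is one way to arrive at very small candidate sets.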
{"title":"PanARGMiner (Pan-Genomic Antimicrobial Resistance Gene Miner): An advanced feature selection framework for extracting key resistance genes from pan-genomic datasets.","authors":"Yu-Cheng Chen, Ming-Ren Yang, Yu-Wei Wu","doi":"10.1016/j.csbj.2025.11.046","DOIUrl":"10.1016/j.csbj.2025.11.046","url":null,"PeriodicalId":10715,"journal":{"name":"Computational and structural biotechnology journal","volume":"27 ","pages":"5363-5374"},"PeriodicalIF":4.1,"publicationDate":"2025-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699266/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145755479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}