Pub Date: 2025-09-08; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1628800
Emma L Flynn, Riya Shah, Ian Dunn, Rishal Aggarwal, David Ryan Koes
Structure-based drug design (SBDD) is enhanced by machine learning (ML) to improve both virtual screening and de novo design. Despite advances in ML tools for both strategies, screening remains bounded by time and computational cost, while generative models frequently produce invalid and synthetically inaccessible molecules. Screening time can be improved with pharmacophore search, which quickly identifies ligands in a database that match a pharmacophore query. In this work, we introduce PharmacoForge, a diffusion model for generating 3D pharmacophores conditioned on a protein pocket. Generated pharmacophore queries identify ligands that are guaranteed to be valid, commercially available molecules. We evaluate PharmacoForge against automated pharmacophore generation methods using the LIT-PCBA benchmark and ligand generative models through a docking-based evaluation framework. We further assess pharmacophore quality through a retrospective screening of the DUD-E dataset. PharmacoForge surpasses other pharmacophore generation methods in the LIT-PCBA benchmark, and resulting ligands from pharmacophore queries performed similarly to de novo generated ligands when docking to DUD-E targets and had lower strain energies compared to de novo generated ligands.
Title: PharmacoForge: pharmacophore generation with diffusion models. Journal: Frontiers in Bioinformatics, vol. 5, article 1628800. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12451294/pdf/
Pub Date: 2025-09-05; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1619790
Antoine A Ruzette, Nina Kozlova, Kayla A Cruz, Taru Muranen, Simon F Nørrelykke
Aggressive cancers, such as pancreatic ductal adenocarcinoma (PDAC), are often characterized by a complex and desmoplastic tumor microenvironment, a stroma rich supportive connective tissue composed primarily of extracellular matrix (ECM) and non-cancerous cells. Desmoplasia, a dense deposition of stroma, is a major reason for therapy resistance, acting both as a physical barrier that interferes with drug penetration and as a supportive niche that protects cancer cells through diverse mechanisms. Precise understanding of spatial cell interactions in stroma-rich tumors is essential for optimizing therapeutic responses. It enables detailed mapping of stromal-tumor interfaces, comprehensive cell phenotyping, and insights into changes in tissue architecture, improving assessment of drug responses. Recent advances in multiplexed immunofluorescence imaging have enabled the acquisition of large batches of whole-slide tumor images, but scalable and reproducible methods to analyze the spatial distribution of cell states relative to stromal regions remain limited. To address this gap, we developed an open-source computational pipeline that integrates QuPath, StarDist, and custom Python scripts to quantify biomarker expression at a single- and sub-cellular resolution across entire tumor sections. Our workflow includes: (i) automated nuclei segmentation using StarDist, (ii) machine learning-based cell classification using multiplexed marker expression, (iii) modeling of stromal regions based on fibronectin staining, (iv) sensitivity analyses on classification thresholds to ensure robustness across heterogeneous datasets, and (v) distance-based quantification of the proximity of each cell to the stromal border. To improve consistency across slides with variable staining intensities, we introduce a statistical strategy that translates classification thresholds by propagating a chosen reference percentile across the distribution of marker-related cell measurement in each image. 
We apply this approach to quantify spatial patterns of distribution of the phosphorylated form of the N-Myc downregulated gene 1 (NDRG1), a novel DNA repair protein that conveys signals from the ECM to the nucleus to maintain replication fork homeostasis, and a known cell proliferation marker Ki67 in fibronectin-defined stromal regions in PDAC xenografts. The pipeline is applicable for the analysis of markers of interest in stroma-rich tissues and is publicly available.
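The percentile-based threshold translation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline code: the function names (`percentile_of`, `value_at_percentile`, `translate_threshold`) are hypothetical, and the rank and interpolation conventions are one reasonable choice among several. The idea is to find which percentile a hand-picked threshold occupies in a reference image, then read off the value at that same percentile in every other image.

```python
from bisect import bisect_left


def percentile_of(values, threshold):
    """Percentile (0-100) that `threshold` occupies in `values` (rank / (n - 1) convention)."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two measurements")
    rank = bisect_left(ordered, threshold)  # number of values strictly below threshold
    return min(100.0, 100.0 * rank / (len(ordered) - 1))


def value_at_percentile(values, pct):
    """Linearly interpolated value at percentile `pct` (0-100)."""
    ordered = sorted(values)
    pos = (pct / 100.0) * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def translate_threshold(reference, ref_threshold, target):
    """Carry a threshold chosen on `reference` over to `target` via its percentile."""
    pct = percentile_of(reference, ref_threshold)
    return value_at_percentile(target, pct)


# Hypothetical per-cell marker intensities; the second slide is stained twice as brightly.
reference_slide = list(range(101))
brighter_slide = [2 * x for x in range(101)]
new_threshold = translate_threshold(reference_slide, 50, brighter_slide)
```

A threshold that sits at the median of the reference slide is mapped to the median of the brighter slide, so the fraction of cells classified as positive stays comparable across slides even when absolute intensities differ.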
Title: An image analysis pipeline to quantify the spatial distribution of cell markers in stroma-rich tumors. Journal: Frontiers in Bioinformatics, vol. 5, article 1619790. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12446346/pdf/
Pub Date: 2025-09-05; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1641491
Roman Perik-Zavodskii, Olga Perik-Zavodskaia, Marina Volynets, Saleh Alrhmoun, Sergey Sennikov
Introduction: Single-cell multi-omics has transformed T-cell biology by enabling the simultaneous analysis of T-cell receptor (TCR) sequences, transcriptomes, and surface proteins at the resolution of individual cells. These capabilities are critical for identifying antigen-specific T-cells and accelerating the development of TCR-based immunotherapies.
Methods: Here, we introduce TCRscape, an open-source Python 3 tool designed for high-resolution T-cell receptor clonotype discovery and quantification, optimized for BD Rhapsody™ single-cell multi-omics data.
Results: TCRscape integrates full-length TCR sequence data with gene expression profiles and surface protein expression to enable multimodal clustering of αβ and γδ T-cell populations. It also outputs Seurat-compatible matrices, facilitating downstream visualization and analysis in standard single-cell analysis environments.
Discussion: By bridging clonotype detection with immune cell transcriptome, proteome, and antigen specificity profiling, TCRscape supports rapid identification of dominant T-cell clones and their functional phenotypes, offering a powerful resource for immune monitoring and TCR-engineered therapeutic development. TCRscape can be found at https://github.com/Perik-Zavodskii/TCRscape/.
Title: TCRscape: a single-cell multi-omic TCR profiling toolkit. Journal: Frontiers in Bioinformatics, vol. 5, article 1641491. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12446293/pdf/
Pub Date: 2025-09-04; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1576317
Grigorios Koulouras, Yingrong Xu
Proteolytic digestion is an essential process in mass spectrometry-based proteomics for converting proteins into peptides, and is hence crucial for protein identification and quantification. In a typical proteomics experiment, digestion reagents are selected without prior evaluation of their optimality for detecting proteins or peptides of interest, partly due to the lack of comprehensive and user-friendly predictive tools. In this work, we introduce Protein Cleaver, a web-based application that systematically assesses regions of proteins that are likely or unlikely to be identified, along with extensive sequence and structure annotation and visualization features. We showcase practical examples of Protein Cleaver's usability in drug discovery and highlight proteins that are typically difficult to detect using the most common proteolytic enzymes. We evaluate trypsin and chymotrypsin for identifying G-protein-coupled receptors and discover that chymotrypsin produces significantly more identifiable peptides than trypsin. We perform a bulk digestion analysis and assess 36 proteolytic enzymes for their ability to detect most cysteine-containing peptides in the human proteome. We anticipate Protein Cleaver to be a valuable auxiliary tool for proteomics scientists.
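To make the idea of in silico digestion concrete, here is a minimal sketch using the commonly cited textbook cleavage rules: trypsin cuts C-terminal to K or R, chymotrypsin (high specificity) cuts C-terminal to F, Y, or W, and neither cuts when the next residue is P. This is not Protein Cleaver's implementation. The function names, the example sequence, and the 7 to 30 residue "identifiable" length window are illustrative assumptions, not values taken from the paper.

```python
def digest(sequence, cleave_after, blocked_by="P"):
    """Split `sequence` after every residue in `cleave_after`, unless the next residue blocks it."""
    peptides = []
    start = 0
    for i in range(len(sequence) - 1):  # never cut after the final residue
        if sequence[i] in cleave_after and sequence[i + 1] not in blocked_by:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if sequence[start:]:
        peptides.append(sequence[start:])
    return peptides


def identifiable(peptides, min_len=7, max_len=30):
    """Keep peptides inside an assumed detectable length window (illustrative values)."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


TRYPSIN = "KR"          # cuts after K or R
CHYMOTRYPSIN = "FYW"    # high-specificity rule: cuts after F, Y, or W

toy_protein = "MKRPAKFY"  # hypothetical sequence, chosen to show the proline block
tryptic = digest(toy_protein, TRYPSIN)
chymotryptic = digest(toy_protein, CHYMOTRYPSIN)
```

Here the R at position 3 is followed by P, so trypsin does not cut there. Comparing `identifiable(tryptic)` with `identifiable(chymotryptic)` for a real protein gives a rough feel for how enzyme choice changes the set of peptides likely to be observed.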
Title: Protein cleaver: an interactive web interface for in silico prediction and systematic annotation of protein digestion-derived peptides. Journal: Frontiers in Bioinformatics, vol. 5, article 1576317. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12445168/pdf/
Pub Date: 2025-09-04; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1528515
Tim Breitenbach, Thomas Dandekar
How can we be sure that there is sufficient data for our model, such that the predictions remain reliable on unseen data and the conclusions drawn from the fitted model would not vary significantly when using a different sample of the same size? We answer these and related questions through a systematic approach that examines the data size and the corresponding gains in accuracy. Assuming the sample data are drawn from a data pool with no data drift, the law of large numbers ensures that a model converges to its ground truth accuracy. Our approach provides a heuristic method for investigating the speed of convergence with respect to the size of the data sample. This relationship is estimated using sampling methods, which introduces variation in the convergence speed results across different runs. To stabilize results (so that conclusions do not depend on the run) and extract the most reliable information encoded in the available data regarding convergence speed, the presented method automatically determines a sufficient number of repetitions to reduce sampling deviations below a predefined threshold, thereby ensuring the reliability of conclusions about the required amount of data.
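The adaptive-repetition idea can be sketched as a simple stopping rule. This is a hedged illustration, not the authors' method: the abstract does not say how "sampling deviation" is measured, so the sketch assumes the standard error of the mean score, and `repeat_until_stable`, `score_fn`, `tol`, `min_reps`, and `max_reps` are hypothetical names. For a fixed sample size, it keeps drawing random subsamples and scoring them until the estimate is stable enough or a repetition budget is exhausted.

```python
import random
import statistics


def repeat_until_stable(pool, sample_size, score_fn, tol,
                        min_reps=5, max_reps=1000, seed=0):
    """Resample `pool` until the standard error of the mean score drops below `tol`.

    Returns (mean score, number of repetitions used).
    """
    rng = random.Random(seed)
    scores = []
    while len(scores) < max_reps:
        sample = rng.sample(pool, sample_size)  # subsample without replacement
        scores.append(score_fn(sample))
        if len(scores) >= min_reps:
            sem = statistics.stdev(scores) / len(scores) ** 0.5
            if sem < tol:
                break
    return statistics.mean(scores), len(scores)


# Toy stand-in for "fit a model and measure accuracy": the mean of the sample.
data_pool = list(range(100))
full_mean, full_reps = repeat_until_stable(data_pool, 100, statistics.mean, tol=0.01)
small_mean, small_reps = repeat_until_stable(data_pool, 10, statistics.mean,
                                             tol=1e-12, max_reps=20)
```

Running this for a grid of sample sizes and plotting the stabilized score against size gives an estimate of the convergence curve. In practice `score_fn` would train and evaluate the model of interest rather than compute a mean.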
Title: Adaptive sampling methods facilitate the determination of reliable dataset sizes for evidence-based modeling. Journal: Frontiers in Bioinformatics, vol. 5, article 1528515. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12444090/pdf/
Pub Date: 2025-09-04; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1577324
Anas Al-Okaily, Abdelghani Tbakhi
Suffix trees are fundamental data structures in stringology and have wide applications across various domains. In this work, we propose two linear-time algorithms for indexing strings under each internal node in a suffix tree while preserving the ability to track similarities and redundancies across different internal nodes. This is achieved through a novel tree structure derived from the suffix tree, along with new indexing concepts. The resulting indexes offer practical solutions in several areas, including DNA sequence analysis and approximate pattern matching.
Title: A novel linear indexing method for strings under all internal nodes in a suffix tree. Journal: Frontiers in Bioinformatics, vol. 5, article 1577324. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12443692/pdf/
Pub Date: 2025-09-02; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1685992
Derek L Thompson, Hsiang-Yun Wu, Christopher W Bartlett, William C Ray
Title: Editorial: Networks and graphs in biological data: current methods, opportunities and challenges. Journal: Frontiers in Bioinformatics, vol. 5, article 1685992. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12437696/pdf/
Pub Date: 2025-09-02; DOI: 10.3389/fbinf.2025.1620025
Sonar Soni Panigoro, Rafika Indah Paramita, Fadilah Fadilah, Septelia Inawati Wanandi, Aisyah Fitriannisa Prawiningrum, Linda Erlina, Wahyu Dian Utari, Ajeng Megawati Fajrin
Title: Germline mutation profiling of breast cancer patients using a non-BRCA sequencing panel. Journal: Frontiers in Bioinformatics, vol. 5, article 1620025. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12436446/pdf/
Pub Date: 2025-09-01; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1630078
Rafael Pereira Lemos, Diego Mariano, Sabrina De Azevedo Silveira, Raquel C de Melo-Minardi
Protein interatomic contacts, defined by spatial proximity and physicochemical complementarity at atomic resolution, are fundamental to characterizing molecular interactions and bonding. Methods for calculating contacts are generally categorized as cutoff-dependent, which rely on Euclidean distances, or cutoff-independent, which utilize Delaunay and Voronoi tessellations. While cutoff-dependent methods are recognized for their simplicity, completeness, and reliability, traditional implementations remain computationally expensive, posing significant scalability challenges in the current Big Data era of bioinformatics. Here, we introduce COCαDA (COntact search pruning by Cα Distance Analysis), a Python-based command-line tool for improving search pruning in large-scale interatomic protein contact analysis using alpha-carbon (Cα) distance matrices. COCαDA detects intra- and inter-chain contacts, and classifies them into seven different types: hydrogen and disulfide bonds; hydrophobic effects; attractive, repulsive, and salt-bridge interactions; and aromatic stackings. To evaluate our tool, we compared it with three traditional approaches in the literature: all-against-all atom distance calculation ("brute-force"), static Cα distance cutoff (SC), and Biopython's NeighborSearch class (NS). COCαDA demonstrated superior performance compared to the other methods, achieving on average 6x faster computation times than advanced data structures like k-d trees from NS, in addition to being simpler to implement and fully customizable. The presented tool facilitates exploratory and large-scale analyses of interatomic contacts in proteins in a simple and efficient manner, also enabling the integration of results with other tools and pipelines. The COCαDA tool is freely available at https://github.com/LBS-UFMG/COCaDA.
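The general idea of pruning an atom-level contact search with Cα distances can be sketched as below. Note the limits of this sketch: the abstract does not spell out COCαDA's own pruning criterion, so what follows is closer to the static Cα cutoff (SC) baseline it mentions than to the published algorithm. The data layout, the function name `find_contacts`, and both cutoff values are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def find_contacts(residues, atom_cutoff=4.5, ca_cutoff=15.0):
    """Return (res_i, atom_i, res_j, atom_j, distance) for atom pairs within `atom_cutoff`.

    Residue pairs whose alpha carbons are farther apart than `ca_cutoff`
    are skipped before any atom-atom distance is computed.
    """
    contacts = []
    for (i, res_a), (j, res_b) in combinations(enumerate(residues), 2):
        if dist(res_a["CA"], res_b["CA"]) > ca_cutoff:
            continue  # pruned: no atom pair in these residues is evaluated
        for name_a, xyz_a in res_a["atoms"].items():
            for name_b, xyz_b in res_b["atoms"].items():
                d = dist(xyz_a, xyz_b)
                if d <= atom_cutoff:
                    contacts.append((i, name_a, j, name_b, d))
    return contacts


# Hypothetical three-residue toy structure (coordinates in angstroms).
toy_residues = [
    {"CA": (0.0, 0.0, 0.0), "atoms": {"CA": (0.0, 0.0, 0.0), "O": (1.0, 0.0, 0.0)}},
    {"CA": (4.0, 0.0, 0.0), "atoms": {"CA": (4.0, 0.0, 0.0), "N": (3.0, 0.0, 0.0)}},
    {"CA": (100.0, 0.0, 0.0), "atoms": {"CA": (100.0, 0.0, 0.0)}},
]
toy_contacts = find_contacts(toy_residues)
```

The distant third residue is rejected by a single Cα comparison, which is where the saving over a brute-force all-against-all atom search comes from. Classifying each contact into the seven types listed above would require atom-type rules that are outside the scope of this sketch.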
Title: COCαDA - a fast and scalable algorithm for interatomic contact detection in proteins using Cα distance matrices. Journal: Frontiers in Bioinformatics, vol. 5, article 1630078. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12433948/pdf/
Pub Date: 2025-08-29; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1610015
Patricia Agudelo-Romero, Talya Conradie, Jose Antonio Caparros-Martin, David Jimmy Martino, Anthony Kicic, Stephen Michael Stick, Christopher Hakkaart, Abhinav Sharma
The increasing adoption of high-throughput "omics" technologies has heightened the demand for standardized, scalable, and reproducible bioinformatics workflows. Nextflow and nf-core provide a robust framework for researchers, particularly early- and mid-career researchers (EMCRs), to navigate complex data analysis. At The Kids Research Institute Australia, we implemented a structured approach to bioinformatics capacity building using these tools. This perspective presents nine practical rules derived from lessons learnt, which facilitated the successful adoption of Nextflow and nf-core, addressing implementation challenges, knowledge gaps, resource allocation, and community support. Our experience serves as a guide for institutions aiming to establish sustainable bioinformatics capabilities and empower EMCRs.
Title: Advancing bioinformatics capacity through Nextflow and nf-core: lessons from an early- to mid-career researchers-focused program at The Kids Research Institute Australia. Journal: Frontiers in Bioinformatics, vol. 5, article 1610015. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12425987/pdf/