
Frontiers in Bioinformatics: latest articles

PharmacoForge: pharmacophore generation with diffusion models.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-08 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1628800
Emma L Flynn, Riya Shah, Ian Dunn, Rishal Aggarwal, David Ryan Koes

Structure-based drug design (SBDD) is enhanced by machine learning (ML) to improve both virtual screening and de novo design. Despite advances in ML tools for both strategies, screening remains bounded by time and computational cost, while generative models frequently produce invalid and synthetically inaccessible molecules. Screening time can be improved with pharmacophore search, which quickly identifies ligands in a database that match a pharmacophore query. In this work, we introduce PharmacoForge, a diffusion model for generating 3D pharmacophores conditioned on a protein pocket. Generated pharmacophore queries identify ligands that are guaranteed to be valid, commercially available molecules. We evaluate PharmacoForge against automated pharmacophore generation methods using the LIT-PCBA benchmark and ligand generative models through a docking-based evaluation framework. We further assess pharmacophore quality through a retrospective screening of the DUD-E dataset. PharmacoForge surpasses other pharmacophore generation methods in the LIT-PCBA benchmark, and resulting ligands from pharmacophore queries performed similarly to de novo generated ligands when docking to DUD-E targets and had lower strain energies compared to de novo generated ligands.

Frontiers in Bioinformatics, vol. 5, article 1628800. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12451294/pdf/
Citations: 0
An image analysis pipeline to quantify the spatial distribution of cell markers in stroma-rich tumors.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-05 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1619790
Antoine A Ruzette, Nina Kozlova, Kayla A Cruz, Taru Muranen, Simon F Nørrelykke

Aggressive cancers, such as pancreatic ductal adenocarcinoma (PDAC), are often characterized by a complex and desmoplastic tumor microenvironment, a stroma-rich supportive connective tissue composed primarily of extracellular matrix (ECM) and non-cancerous cells. Desmoplasia, a dense deposition of stroma, is a major reason for therapy resistance, acting both as a physical barrier that interferes with drug penetration and as a supportive niche that protects cancer cells through diverse mechanisms. Precise understanding of spatial cell interactions in stroma-rich tumors is essential for optimizing therapeutic responses. It enables detailed mapping of stromal-tumor interfaces, comprehensive cell phenotyping, and insights into changes in tissue architecture, improving assessment of drug responses. Recent advances in multiplexed immunofluorescence imaging have enabled the acquisition of large batches of whole-slide tumor images, but scalable and reproducible methods to analyze the spatial distribution of cell states relative to stromal regions remain limited. To address this gap, we developed an open-source computational pipeline that integrates QuPath, StarDist, and custom Python scripts to quantify biomarker expression at single- and sub-cellular resolution across entire tumor sections. Our workflow includes: (i) automated nuclei segmentation using StarDist, (ii) machine learning-based cell classification using multiplexed marker expression, (iii) modeling of stromal regions based on fibronectin staining, (iv) sensitivity analyses on classification thresholds to ensure robustness across heterogeneous datasets, and (v) distance-based quantification of the proximity of each cell to the stromal border. To improve consistency across slides with variable staining intensities, we introduce a statistical strategy that translates classification thresholds by propagating a chosen reference percentile across the distribution of marker-related cell measurements in each image. We apply this approach to quantify spatial patterns of distribution of the phosphorylated form of the N-Myc downregulated gene 1 (NDRG1), a novel DNA repair protein that conveys signals from the ECM to the nucleus to maintain replication fork homeostasis, and a known cell proliferation marker Ki67 in fibronectin-defined stromal regions in PDAC xenografts. The pipeline is applicable to the analysis of markers of interest in stroma-rich tissues and is publicly available.
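The percentile-based threshold translation mentioned in this abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the function names `percentile_of_threshold` and `translate_threshold`, the use of NumPy, and the toy intensity values are all hypothetical.

```python
import numpy as np


def percentile_of_threshold(reference_values, threshold):
    """Percentile (0-100) of the reference measurements at or below `threshold`."""
    reference_values = np.asarray(reference_values, dtype=float)
    return 100.0 * float(np.mean(reference_values <= threshold))


def translate_threshold(reference_values, threshold, target_values):
    """Carry a threshold from a reference image to a target image by
    keeping its percentile rank in the per-cell measurement distribution."""
    p = percentile_of_threshold(reference_values, threshold)
    return float(np.percentile(np.asarray(target_values, dtype=float), p))


# Toy per-cell marker intensities (made-up numbers, for illustration only).
reference = np.arange(10.0, 101.0, 10.0)  # 10, 20, ..., 100
target = 2.0 * reference                  # same cells, stained twice as brightly
translated = translate_threshold(reference, 50.0, target)

# Fraction of cells called positive in each image after translation.
frac_reference = float(np.mean(reference > 50.0))
frac_target = float(np.mean(target > translated))
```

In this toy case the translated threshold falls between the fifth and sixth brightest-ranked cells of the target image, so the same fraction of cells is classified as positive in both images even though the raw intensities differ.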

Frontiers in Bioinformatics, vol. 5, article 1619790. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12446346/pdf/
Citations: 0
TCRscape: a single-cell multi-omic TCR profiling toolkit.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-05 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1641491
Roman Perik-Zavodskii, Olga Perik-Zavodskaia, Marina Volynets, Saleh Alrhmoun, Sergey Sennikov

Introduction: Single-cell multi-omics has transformed T-cell biology by enabling the simultaneous analysis of T-cell receptor (TCR) sequences, transcriptomes, and surface proteins at the resolution of individual cells. These capabilities are critical for identifying antigen-specific T-cells and accelerating the development of TCR-based immunotherapies.

Methods: Here, we introduce TCRscape, an open-source Python 3 tool designed for high-resolution T-cell receptor clonotype discovery and quantification, optimized for BD Rhapsody™ single-cell multi-omics data.

Results: TCRscape integrates full-length TCR sequence data with gene expression profiles and surface protein expression to enable multimodal clustering of αβ and γδ T-cell populations. It also outputs Seurat-compatible matrices, facilitating downstream visualization and analysis in standard single-cell analysis environments.

Discussion: By bridging clonotype detection with immune cell transcriptome, proteome, and antigen specificity profiling, TCRscape supports rapid identification of dominant T-cell clones and their functional phenotypes, offering a powerful resource for immune monitoring and TCR-engineered therapeutic development. TCRscape can be found at https://github.com/Perik-Zavodskii/TCRscape/.
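As a rough illustration of what clonotype quantification involves, the sketch below counts cells per clonotype in a toy table. It does not use TCRscape and makes no claim about its API: the record layout, the function name `clonotype_frequencies`, the CDR3 strings, and the choice to define a clonotype as a paired alpha/beta CDR3 amino-acid sequence are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical per-cell records: (cell_id, alpha-chain CDR3, beta-chain CDR3).
# The sequences are invented placeholders, not real data.
cells = [
    ("cell1", "CAVRDSNYQLIW", "CASSLGQAYEQYF"),
    ("cell2", "CAVRDSNYQLIW", "CASSLGQAYEQYF"),
    ("cell3", "CAVRDSNYQLIW", "CASSLGQAYEQYF"),
    ("cell4", "CALSEAGTALIF", "CASSPDRGNTEAFF"),
    ("cell5", "CAASNTGKLIF", "CASSQETQYF"),
]


def clonotype_frequencies(records):
    """Return (clonotype, cell_count, fraction) tuples, most abundant first.

    A clonotype is taken here to be the pair of alpha and beta CDR3 sequences.
    """
    counts = Counter((alpha, beta) for _, alpha, beta in records)
    total = sum(counts.values())
    return [(clone, n, n / total) for clone, n in counts.most_common()]


table = clonotype_frequencies(cells)
dominant_clone, dominant_count, dominant_fraction = table[0]
```

The first row of `table` is the dominant clone; in a real analysis that row would then be joined to the gene expression and surface protein measurements of the same cells.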

Frontiers in Bioinformatics, vol. 5, article 1641491. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12446293/pdf/
Citations: 0
Protein cleaver: an interactive web interface for in silico prediction and systematic annotation of protein digestion-derived peptides.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-04 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1576317
Grigorios Koulouras, Yingrong Xu

Proteolytic digestion is an essential process in mass spectrometry-based proteomics for converting proteins into peptides, hence crucial for protein identification and quantification. In a typical proteomics experiment, digestion reagents are selected without prior evaluation of their optimality for detecting proteins or peptides of interest, partly due to the lack of comprehensive and user-friendly predictive tools. In this work, we introduce Protein Cleaver, a web-based application that systematically assesses regions of proteins that are likely or unlikely to be identified, along with extensive sequence and structure annotation and visualization features. We showcase practical examples of Protein Cleaver's usability in drug discovery and highlight proteins that are typically difficult to detect using the most common proteolytic enzymes. We evaluate trypsin and chymotrypsin for identifying G-protein-coupled receptors and discover that chymotrypsin produces significantly more identifiable peptides than trypsin. We perform a bulk digestion analysis and assess 36 proteolytic enzymes for their ability to detect most of the cysteine-containing peptides in the human proteome. We anticipate Protein Cleaver to be a valuable auxiliary tool for proteomics scientists.
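A minimal sketch of in silico digestion, the kind of prediction this abstract describes, is given below. It is not Protein Cleaver's implementation. The cleavage rules are the common textbook simplifications (trypsin cuts C-terminal to K or R, chymotrypsin C-terminal to F, W or Y, neither before P), and the names `CLEAVAGE_RULES` and `digest` as well as the example sequence are hypothetical.

```python
import re

# Simplified cleavage rules, expressed as zero-width regex positions:
# "after one of these residues, unless the next residue is proline".
CLEAVAGE_RULES = {
    "trypsin": re.compile(r"(?<=[KR])(?!P)"),
    "chymotrypsin": re.compile(r"(?<=[FWY])(?!P)"),
}


def digest(sequence, enzyme, min_length=1):
    """Split a protein sequence at every predicted cleavage site.

    No missed cleavages are modelled; peptides shorter than
    `min_length` are discarded.
    """
    pattern = CLEAVAGE_RULES[enzyme]
    # Splitting on empty matches requires Python 3.7 or later.
    peptides = [p for p in pattern.split(sequence) if p]
    return [p for p in peptides if len(p) >= min_length]


# Arbitrary example sequence, not taken from the paper.
sequence = "MKWVTFISLLR"
tryptic = digest(sequence, "trypsin")
chymotryptic = digest(sequence, "chymotrypsin")
```

Comparing `tryptic` and `chymotryptic` for the same sequence shows how the choice of enzyme changes which peptides exist at all, which is the question the tool addresses at proteome scale with a much richer rule set.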

Frontiers in Bioinformatics, vol. 5, article 1576317. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12445168/pdf/
Citations: 0
Adaptive sampling methods facilitate the determination of reliable dataset sizes for evidence-based modeling.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-04 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1528515
Tim Breitenbach, Thomas Dandekar

How can we be sure that there is sufficient data for our model, such that the predictions remain reliable on unseen data and the conclusions drawn from the fitted model would not vary significantly when using a different sample of the same size? We answer these and related questions through a systematic approach that examines the data size and the corresponding gains in accuracy. Assuming the sample data are drawn from a data pool with no data drift, the law of large numbers ensures that a model converges to its ground truth accuracy. Our approach provides a heuristic method for investigating the speed of convergence with respect to the size of the data sample. This relationship is estimated using sampling methods, which introduces a variation in the convergence speed results across different runs. To stabilize results, so that conclusions do not depend on the run, and to extract the most reliable information encoded in the available data regarding convergence speed, the presented method automatically determines a sufficient number of repetitions to reduce sampling deviations below a predefined threshold, thereby ensuring the reliability of conclusions about the required amount of data.
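The idea of repeating a sampling experiment until its run-to-run deviation falls below a threshold can be sketched as follows. This is one plausible reading of the abstract, not the authors' algorithm: the stopping criterion (standard error of the mean score), the function names, the tolerance, and the toy pool of binary prediction outcomes are all assumptions.

```python
import math
import random
import statistics


def repeat_until_stable(score_once, tolerance, min_reps=5, max_reps=10_000):
    """Repeat a randomized evaluation until the standard error of the mean
    score drops below `tolerance` (or `max_reps` is reached).

    Returns (mean_score, standard_error, repetitions). `min_reps` must be >= 2.
    """
    scores = [score_once() for _ in range(min_reps)]
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    while sem >= tolerance and len(scores) < max_reps:
        scores.append(score_once())
        sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return statistics.mean(scores), sem, len(scores)


# Toy data pool: 1 = correct prediction, 0 = wrong one (true accuracy 0.8).
rng = random.Random(0)
pool = [1] * 800 + [0] * 200


def accuracy_on_subsample(n):
    """Accuracy measured on a random subsample of size n from the pool."""
    return sum(rng.sample(pool, n)) / n


# For each candidate data size, let the number of repetitions adapt itself.
results = {}
for n in (25, 100, 400):
    results[n] = repeat_until_stable(lambda: accuracy_on_subsample(n), tolerance=0.01)
```

Each entry of `results` holds the stabilized mean accuracy, its standard error, and how many repetitions were needed for that sample size. Plotting mean accuracy against sample size would give the convergence curve the abstract refers to.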

Frontiers in Bioinformatics, vol. 5, article 1528515. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12444090/pdf/
Citations: 0
A novel linear indexing method for strings under all internal nodes in a suffix tree.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-04 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1577324
Anas Al-Okaily, Abdelghani Tbakhi

Suffix trees are fundamental data structures in stringology and have wide applications across various domains. In this work, we propose two linear-time algorithms for indexing strings under each internal node in a suffix tree while preserving the ability to track similarities and redundancies across different internal nodes. This is achieved through a novel tree structure derived from the suffix tree, along with new indexing concepts. The resulting indexes offer practical solutions in several areas, including DNA sequence analysis and approximate pattern matching.
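For readers unfamiliar with what "strings under an internal node" means, the background sketch below builds a naive, uncompressed suffix trie and counts the suffixes beneath a node. It is deliberately not the linear-time method proposed in the paper, whose details the abstract does not give. Construction here is quadratic, and the function names and the nested-dict representation are choices made for this example.

```python
def build_suffix_trie(text):
    """Naive O(n^2) suffix trie as nested dicts.

    A terminal '$' is appended so every suffix ends at a leaf;
    `text` is assumed not to contain '$' itself.
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Number of suffixes (leaves) in the subtree rooted at `node`."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def occurrences(root, pattern):
    """How often `pattern` occurs in the text: walk down the trie along the
    pattern, then count the suffixes under the node that is reached."""
    node = root
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)


trie = build_suffix_trie("banana")
ana_count = occurrences(trie, "ana")
```

Every node reached by a pattern gathers all suffixes that start with that pattern, so information stored per internal node answers questions about every occurrence at once. A real suffix tree compresses unary paths and can be built in linear time, which this sketch does not attempt.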

Frontiers in Bioinformatics, vol. 5, article 1577324. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12443692/pdf/
Citations: 0
Editorial: Networks and graphs in biological data: current methods, opportunities and challenges.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-02 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1685992
Derek L Thompson, Hsiang-Yun Wu, Christopher W Bartlett, William C Ray
Frontiers in Bioinformatics, vol. 5, article 1685992. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12437696/pdf/
Citations: 0
Germline mutation profiling of breast cancer patients using a non-BRCA sequencing panel.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-02 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1620025
Sonar Soni Panigoro, Rafika Indah Paramita, Fadilah Fadilah, Septelia Inawati Wanandi, Aisyah Fitriannisa Prawiningrum, Linda Erlina, Wahyu Dian Utari, Ajeng Megawati Fajrin
Frontiers in Bioinformatics, vol. 5, article 1620025. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12436446/pdf/
Citations: 0
COCαDA - a fast and scalable algorithm for interatomic contact detection in proteins using Cα distance matrices.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-09-01 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1630078
Rafael Pereira Lemos, Diego Mariano, Sabrina De Azevedo Silveira, Raquel C de Melo-Minardi

Protein interatomic contacts, defined by spatial proximity and physicochemical complementarity at atomic resolution, are fundamental to characterizing molecular interactions and bonding. Methods for calculating contacts are generally categorized as cutoff-dependent, which rely on Euclidean distances, or cutoff-independent, which utilize Delaunay and Voronoi tessellations. While cutoff-dependent methods are recognized for their simplicity, completeness, and reliability, traditional implementations remain computationally expensive, posing significant scalability challenges in the current Big Data era of bioinformatics. Here, we introduce COCαDA (COntact search pruning by Cα Distance Analysis), a Python-based command-line tool for improving search pruning in large-scale interatomic protein contact analysis using alpha-carbon (Cα) distance matrices. COCαDA detects intra- and inter-chain contacts, and classifies them into seven different types: hydrogen and disulfide bonds; hydrophobic effects; attractive, repulsive, and salt-bridge interactions; and aromatic stackings. To evaluate our tool, we compared it with three traditional approaches in the literature: all-against-all atom distance calculation ("brute-force"), static Cα distance cutoff (SC), and Biopython's NeighborSearch class (NS). COCαDA demonstrated superior performance compared to the other methods, achieving on average 6x faster computation times than advanced data structures like k-d trees from NS, in addition to being simpler to implement and fully customizable. The presented tool facilitates exploratory and large-scale analyses of interatomic contacts in proteins in a simple and efficient manner, also enabling the integration of results with other tools and pipelines. The COCαDA tool is freely available at https://github.com/LBS-UFMG/COCaDA.
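The general idea of pruning a contact search with a Cα distance matrix can be sketched as below. This example uses a single static Cα cutoff, so it resembles the SC baseline named in the abstract more than COCαDA itself, whose cutoff strategy the abstract does not spell out. The function name, the residue data layout, and both cutoff values are placeholders chosen for this example, and no contact-type classification is attempted.

```python
import numpy as np


def contacts_with_ca_pruning(residues, ca_cutoff=20.0, atom_cutoff=4.5):
    """Find residue pairs with at least one atom pair within `atom_cutoff`.

    `residues` is a list of dicts with keys 'ca' (C-alpha coordinate, shape (3,))
    and 'atoms' (all atom coordinates, shape (n_atoms, 3)). Residue pairs whose
    C-alpha atoms are farther apart than `ca_cutoff` are skipped without any
    atom-level distance calculation.
    """
    ca = np.array([r["ca"] for r in residues], dtype=float)
    # Pairwise C-alpha distance matrix via broadcasting.
    ca_dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)

    contacts = []
    for i, j in zip(*np.triu_indices(len(residues), k=1)):
        if ca_dist[i, j] > ca_cutoff:
            continue  # pruned: too far apart to be in contact
        a = np.asarray(residues[i]["atoms"], dtype=float)
        b = np.asarray(residues[j]["atoms"], dtype=float)
        atom_dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        if atom_dist.min() <= atom_cutoff:
            contacts.append((int(i), int(j), float(atom_dist.min())))
    return contacts


# Toy structure with three residues on a line (coordinates in angstroms, invented).
toy_residues = [
    {"ca": [0.0, 0.0, 0.0], "atoms": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]},
    {"ca": [5.0, 0.0, 0.0], "atoms": [[5.0, 0.0, 0.0], [4.0, 0.0, 0.0]]},
    {"ca": [50.0, 0.0, 0.0], "atoms": [[50.0, 0.0, 0.0]]},
]
toy_contacts = contacts_with_ca_pruning(toy_residues)
```

The third residue is discarded from both of its pairings by the Cα check alone, which is where the savings over an all-against-all atom comparison come from on real structures.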

Frontiers in Bioinformatics, vol. 5, article 1630078. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12433948/pdf/
Citations: 0
Advancing bioinformatics capacity through Nextflow and nf-core: lessons from an early-to mid-career researchers-focused program at The Kids Research Institute Australia.
IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date : 2025-08-29 eCollection Date: 2025-01-01 DOI: 10.3389/fbinf.2025.1610015
Patricia Agudelo-Romero, Talya Conradie, Jose Antonio Caparros-Martin, David Jimmy Martino, Anthony Kicic, Stephen Michael Stick, Christopher Hakkaart, Abhinav Sharma

The increasing adoption of high-throughput "omics" technologies has heightened the demand for standardized, scalable, and reproducible bioinformatics workflows. Nextflow and nf-core provide a robust framework for researchers, particularly early- and mid-career researchers (EMCRs), to navigate complex data analysis. At The Kids Research Institute Australia, we implemented a structured approach to bioinformatics capacity building using these tools. This perspective presents nine practical rules derived from lessons learnt, which facilitated the successful adoption of Nextflow and nf-core, addressing implementation challenges, knowledge gaps, resource allocation, and community support. Our experience serves as a guide for institutions aiming to establish sustainable bioinformatics capabilities and empower EMCRs.

Citations: 0