BMC Bioinformatics: Latest Articles

ProTrack3D: a comprehensive tool for segmentation and tracking of proteins with split and fusion.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-28 DOI: 10.1186/s12859-025-06307-w
Ramu Gautam, Yang Jiao, Yasong Pang, Mo Weng, Mei Yang
Citations: 0
Impact of U2-type introns on splice site prediction in A. thaliana species using deep learning.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-28 DOI: 10.1186/s12859-025-06315-w
Espoir Kabanga, Seonil Jee, Soeun Yun, Stephen Depuydt, Arnout Van Messem, Wesley De Neve

Background: Splice site prediction in plant genomes poses substantial challenges that can be addressed using deep learning models. U2-type introns are especially useful for such studies given their ubiquity in plant genomes and the availability of rich datasets. We formulated two hypotheses: one proposing that short introns may enhance prediction effectiveness due to reduced spatial complexity, and another suggesting that sequences with multiple introns provide a richer context for splicing events.

Results: Our findings demonstrate that (1) models trained on datasets containing shorter introns achieve improved effectiveness for acceptor splice sites, but not for donor splice sites, indicating a more nuanced relationship between intron length and splice site prediction than initially hypothesized, and (2) models trained on datasets with multiple introns per sequence show higher effectiveness compared to those trained on datasets with a single intron per sequence. Notably, among the 402 bp sequences analyzed, 72% contained single introns while 28% contained multiple introns for donor sites (36,399 versus 13,987 sequences), with similar proportions observed for acceptor sites (37,236 versus 14,112 sequences). These computational insights align with biological observations, particularly regarding the conserved spatial relationship between branch points and acceptor splice sites, as well as the synergistic effects of multiple introns on splicing efficiency.

Conclusions: The obtained results contribute to a deeper understanding of how intronic features influence splice site prediction and suggest that future prediction models should consider factors such as intron length, multiplicity, and the spatial arrangement of splice-related signals.
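
The single- versus multiple-intron proportions quoted in the Results can be checked directly from the reported sequence counts. The sketch below only redoes that arithmetic; the function name `intron_shares` is illustrative and does not come from the paper.

```python
def intron_shares(single: int, multiple: int) -> tuple[float, float]:
    """Return (single %, multiple %) of all sequences, rounded to one decimal."""
    total = single + multiple
    return round(100 * single / total, 1), round(100 * multiple / total, 1)

# Sequence counts as reported in the abstract.
donor = intron_shares(36_399, 13_987)
acceptor = intron_shares(37_236, 14_112)

print("donor (single %, multiple %):", donor)
print("acceptor (single %, multiple %):", acceptor)
```

Both splits come out close to the 72% / 28% figure stated for donor sites, which is consistent with the "similar proportions" reported for acceptor sites.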

Citations: 0
TaxaPLN: a taxonomy-aware augmentation strategy for microbiome-trait classification including metadata.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-28 DOI: 10.1186/s12859-025-06312-z
Alexandre Chaussard, Anna Bonnet, Sylvain Le Corff, Harry Sokol

Background: The gut microbiome plays a crucial role in human health, making it a cornerstone of modern biomedical research. To study its structure and dynamics, machine learning models are increasingly used to identify key microbial patterns associated with disease and environmental factors, but their performance is often limited by the intrinsic complexity of microbiome data and the small size of available cohorts. In this context, data augmentation has emerged as a promising strategy to overcome these challenges by generating artificial microbiome profiles.

Results: We introduce TaxaPLN, a data augmentation method based on PLN-Tree generative models, which leverages the taxonomy and a data-driven sampler to generate realistic synthetic microbiome compositions. Additionally, we propose a conditional extension based on feature-wise linear modulation, enabling covariate-aware generation. Experiments on diverse curated microbiome datasets show that TaxaPLN preserves ecological properties and generally improves or maintains predictive performance, outperforming state-of-the-art baselines on most tasks. Furthermore, the conditional variant of TaxaPLN establishes a new benchmark for metadata-aware microbiome augmentation.

Conclusion: TaxaPLN provides a model-based framework for augmenting microbiome datasets while preserving their ecological and clinical relevance. By integrating taxonomic structure and host metadata, it enhances predictive modeling across diverse real-world settings. To facilitate reproducible and scalable microbiome analysis using our method, TaxaPLN is released as an open-source Python package available on PyPI (plntree), with MIT-licensed source code hosted at https://github.com/AlexandreChaussard/PLNTree-package .
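
The conditional extension mentioned in the Results relies on feature-wise linear modulation (FiLM), in which covariates produce a per-feature scale and shift that are applied to hidden features. The following is a minimal sketch of that general idea in plain Python. It is not the `plntree` API: the function names, the toy weights, and the use of simple affine maps to produce the scale and shift are all assumptions made for illustration.

```python
def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(features, covariates, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(c) * h + beta(c), element by element."""
    gamma = linear(w_gamma, b_gamma, covariates)  # per-feature scale from covariates
    beta = linear(w_beta, b_beta, covariates)     # per-feature shift from covariates
    return [g * h + b for g, h, b in zip(gamma, features, beta)]

# Toy values (all hypothetical): 3 hidden features, 2 covariates.
h = [1.0, 2.0, 3.0]
c = [1.0, 0.0]
w_gamma = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 0.0]
w_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
b_beta = [0.0, 0.0, -1.0]

modulated = film(h, c, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)
```

Changing the covariate vector `c` changes the scale and shift, and therefore the modulated features, which is what makes generation covariate-aware in this kind of scheme.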

Citations: 0
Contrastive learning-based multi-mechanism disentangled assessment for drug-drug interaction.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-27 DOI: 10.1186/s12859-025-06304-z
Jinxiong Zhang, Yunjv Zeng, Chunyan Tang, Cheng Zhong, Hao Wen, Yang Liu

Background: Polypharmacy's ability to circumvent acquired resistance to a single drug makes it a critical strategy for treating complex diseases. However, it inevitably carries risks of drug-drug interactions (DDIs) that may alter pharmacological activities and potentially lead to severe adverse events or mortality. Computational assessment of drug combinations has emerged as an effective approach to support clinical decision-making. Current risk identification methods focus on mining historical interaction patterns to uncover underlying mechanisms, yet face challenges from data sparsity. While data augmentation strategies can mitigate this problem, conventional approaches often introduce noise that obscures core pharmacological mechanisms, undermining safety evaluation.

Results: This study proposes MMDDI, a Multi-Mechanism Disentangled Drug-drug Interaction assessment framework that integrates contrastive learning. It includes two key components: (1) biologically informed multi-view generation that creates high-quality augmented views, effectively addressing semantic distortion during data augmentation; and (2) mechanism-aware disentanglement that incorporates mutual information constraints to isolate interaction mechanisms from the coupling of multi-modal and heterogeneous data, eliminating quantification bias. Contrastive learning integrates labeled and unlabeled data to enhance robustness against sparse observations.

Conclusions: Comprehensive evaluations demonstrate that MMDDI, with a hit@4 of 0.86, outperforms the compared baselines, and ablation studies validate the critical contributions of multi-view contrastive learning and mechanism disentanglement. MMDDI continues to demonstrate excellent performance in cold-start scenarios, achieving an accuracy of 0.94 and a recall of 0.95. Clinically, MMDDI enables interpretable causal analysis of drug interaction pathways through its mechanism-aware representations, providing operability for optimizing therapeutic regimens.
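
For readers unfamiliar with the hit@k metric cited above, the sketch below shows one common definition: the fraction of samples whose true label appears among the top k ranked predictions. The abstract does not spell out the exact definition used for MMDDI, so this is an assumption, and the interaction-type labels and rankings are invented toy data.

```python
def hit_at_k(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Hypothetical rankings of five interaction types (A to E) for four drug pairs.
ranked = [
    ["A", "B", "C", "D", "E"],
    ["C", "A", "E", "B", "D"],
    ["E", "D", "C", "B", "A"],
    ["B", "E", "A", "D", "C"],
]
truth = ["D", "D", "A", "B"]

for k in (1, 4, 5):
    print(f"hit@{k} =", hit_at_k(ranked, truth, k))
```

Under this definition hit@k can only stay the same or grow as k increases, so a hit@4 value is best read alongside the total number of candidate labels.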

Citations: 0
CountASAP: a lightweight, easy to use python package for processing ASAPseq data.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-27 DOI: 10.1186/s12859-025-06311-0
Christopher T Boughter, Budha Chatterjee, Yuko Ohta, Katrina Gorga, Carly Blair, Elizabeth M Hill, Zachary Fasana, Adedola O Adebamowo, Farah Ammar, Ivan Kosik, Vel Murugan, Wilbur H Chen, Nevil J Singh, Martin Meier-Schellersheim

Background: Declining sequencing costs coupled with the increasing availability of easy-to-use kits for the isolation of DNA and RNA transcripts from single cells have driven a rapid proliferation of studies centered around genomic and transcriptomic data. Simultaneously, a wealth of new techniques have been developed that utilize single cell technologies to interrogate a broad range of cell-biological processes. One recently developed technique, transposase-accessible chromatin with sequencing (ATAC) with select antigen profiling by sequencing (ASAPseq), provides a combination of chromatin accessibility assessments with measurements of cell-surface marker expression levels. While software exists for the characterization of these datasets, there currently exists no tool explicitly designed to reformat ASAP surface marker FASTQ data into a count matrix which can then be used for these downstream analyses.

Results: To address this lack of a dedicated tool for ASAPseq data processing, we created CountASAP, an easy-to-use Python package purposefully designed to transform FASTQ files from ASAP experiments into count matrices compatible with commonly-used downstream bioinformatic analysis packages. CountASAP takes advantage of the independence of the relevant data structures to perform fully parallelized matches of each sequenced read to user-supplied input ASAP oligos and unique cell-identifier sequences. We directly compare the performance and user-friendliness of CountASAP to existing tools using similarly-structured data from a more common sequencing experiment: cellular indexing of transcriptomes and epitopes by sequencing (CITEseq). Further benchmarking against existing tools helps to identify proper defaults for CountASAP and assess the agreement of outputs from all tested software. A final test using a novel ASAPseq dataset provides evidence that CountASAP can generate biologically meaningful results that correlate well with paired chromatin accessibility data.

Conclusions: CountASAP shows good agreement with existing, well-tested data processing tools in the analysis of similarly-structured benchmarking data. CountASAP runs efficiently on a standard laptop, has user-friendly documentation, a one-step installation, and represents the first and only tool designed specifically for the processing of ASAPseq data.
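
The core task described in the Results, turning reads into a cells-by-markers count matrix by matching each read to a cell identifier and a marker oligo, can be sketched in a few lines. This is not CountASAP's interface: the function name, the paired-read input layout, the exact-substring matching rule, and all sequences below are assumptions chosen to keep the example small.

```python
from collections import Counter

def count_matrix(read_pairs, barcodes, oligos):
    """Tally (cell, marker) hits by exact substring match; unmatched reads are skipped."""
    counts = Counter()
    for barcode_read, oligo_read in read_pairs:
        # Each read pair is handled independently of all others.
        cell = next((b for b in barcodes if b in barcode_read), None)
        marker = next((name for name, seq in oligos.items() if seq in oligo_read), None)
        if cell is not None and marker is not None:
            counts[(cell, marker)] += 1
    # Dense table: rows follow `barcodes`, columns follow the order of `oligos`.
    return [[counts[(b, m)] for m in oligos] for b in barcodes]

# Hypothetical cell barcodes, marker oligos, and (barcode read, oligo read) pairs.
barcodes = ["AAAC", "GGGT"]
oligos = {"CD3": "TTAGC", "CD19": "CCGTA"}
reads = [
    ("AAACTT", "GGTTAGCA"),
    ("AAACGG", "TTAGCCC"),
    ("GGGTAA", "ACCGTAG"),
    ("CCCCAA", "TTAGCAA"),  # no known barcode, so it is skipped
    ("GGGTCC", "AAAAAAA"),  # no known oligo, so it is skipped
]

matrix = count_matrix(reads, barcodes, oligos)
print(matrix)
```

Because no read pair depends on any other, the loop body could be distributed across worker processes and the partial tallies summed afterwards, which is the kind of independence the abstract says CountASAP exploits for parallel matching.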

Citations: 0
MOV&RSim: computational modelling of cancer-specific variants and sequencing reads characteristics for realistic tumoral sample simulation.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-27 DOI: 10.1186/s12859-025-06292-0
Francesca Longhin, Giacomo Baruzzo, Enidia Hazizaj, Diego Boscarino, Dino Paladin, Barbara Di Camillo
Citations: 0
Revisiting motif finding: do bi-objective metaheuristics surpass single-objective metaheuristics?
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-27 DOI: 10.1186/s12859-025-06327-6
Muhammad Ali Nayeem, Shehab Sarar Ahmed, Suliman Aladhadh, M Sohel Rahman

Background: The discovery of DNA motifs is essential for studying gene expression and function in many biological systems. Most existing algorithms for motif detection rely on a single optimization criterion or objective function. This study formulates motif finding as a bi-objective optimization problem and investigates whether multi-objective metaheuristics offer potential advantages over single-objective approaches.

Results: We developed four variants of the Non-dominated Sorting Genetic Algorithm II (NSGA-II) incorporating simple, problem-specific genetic operators. Experiments on six benchmark datasets from three organisms demonstrate that our bi-objective approach significantly outperforms the state-of-the-art Artificial Bee Colony (ABC) metaheuristic. Remarkably, NSGA-II-PMC achieved superior performance over ABC using 6 times fewer fitness evaluations, highlighting its computational efficiency. The synergistic combination of problem-specific operators proved essential, with individual operators showing limited effectiveness compared to their joint application.

Conclusions: Our findings question the common belief that single-objective metaheuristics are better suited for combinatorial problems like motif finding. The bi-objective formulation helps maintain diversity and avoid premature convergence, even with partially correlated objectives, resulting in better solutions than those obtained through dedicated single-objective optimization. Simple, interpretable problem-specific adaptations can yield substantial performance gains over sophisticated alternatives. These results suggest that bi-objective approaches may provide more robust and computationally efficient solutions for DNA motif discovery, opening new research directions in bioinformatics.
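
The bi-objective formulation rests on Pareto dominance, the comparison that NSGA-II uses to rank candidate solutions into fronts. The sketch below extracts only the first non-dominated front for two maximized objectives. The abstract does not name the two objectives or describe the problem-specific operators, so the score pairs are invented and nothing here reproduces the authors' NSGA-II variants.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly better on one.

    Both objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def first_front(points):
    """Indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

# Hypothetical (objective 1, objective 2) scores for five candidate motifs.
scores = [(0.9, 0.2), (0.7, 0.7), (0.4, 0.9), (0.6, 0.6), (0.3, 0.3)]

front = first_front(scores)
print(front)
```

Keeping every candidate on the front, rather than a single best score, is one way a bi-objective search retains diverse trade-offs, which matches the diversity argument made in the Conclusions.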

Citations: 0
GrafGen: distance-based inference of population ancestry for Helicobacter pylori genomes.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-26 DOI: 10.1186/s12859-025-06294-y
William Wheeler, Difei Wang, Isaac Zhao, Filipa F Vale, Yumi Jin, Charles S Rabkin

Background: Helicobacter pylori is a highly diverse gastric bacterium whose genomic variation both reflects human migration and complicates genome-wide association studies (GWAS). Its 1.67 Mb genome contains ~143,000 biallelic SNPs with minor allele frequency > 1%, making population stratification a major confounder. Existing model- and distance-based methods for bacterial ancestry classification often yield inconsistent results depending on dataset composition. A robust and generalizable framework is needed to improve downstream analyses.

Results: We developed GrafGen, an open-source R package adapted from the human ancestry tool GrafPop, for the classification of H. pylori and prophage populations. Using reference data from the H. pylori Genome Project (1,011 genomes from 50 countries), GrafGen identified nine distinct bacterial populations and four prophage groups by genetic distance clustering. Validation with 255 GenBank sequences showed consistent mapping to GrafGen-defined populations. Classifications based on subsets of 14,300 and 1,430 SNPs achieved >97% and >90% concordance, respectively, with those using the full 143,000 SNPs, demonstrating robustness to down-sampling. The package integrates visualization tools for geometric interpretation of ancestry structure and is distributed via Bioconductor (v1.4.0, nine-population reference) and GitHub (v2.0_beta, general framework for haploid species and prophages).
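
The concordance figures quoted above compare population labels from a down-sampled SNP set with labels from the full set. As a rough illustration only, the sketch below computes concordance as the fraction of genomes that receive the same label in both runs. This is an assumed definition, not GrafGen's code (GrafGen is an R package), and the function name and the five example labels are hypothetical.

```python
from typing import Sequence


def concordance(full_calls: Sequence[str], subset_calls: Sequence[str]) -> float:
    """Fraction of genomes given the same population label by both runs.

    Assumes both sequences list the same genomes in the same order.
    """
    if len(full_calls) != len(subset_calls):
        raise ValueError("both runs must label the same genomes in the same order")
    if not full_calls:
        raise ValueError("no genomes to compare")
    matches = sum(a == b for a, b in zip(full_calls, subset_calls))
    return matches / len(full_calls)


# Hypothetical labels for five genomes (illustrative, not data from the paper).
full = ["popA", "popB", "popA", "popC", "popB"]
subset = ["popA", "popB", "popA", "popB", "popB"]
print(f"{concordance(full, subset):.0%}")  # prints "80%"
```

Four of the five hypothetical genomes keep their label, so the sketch reports 80%.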

Conclusions: GrafGen provides a reliable approach for classifying H. pylori ancestry and correcting for bacterial population stratification in GWAS. By enabling more accurate inference of genotype-phenotype associations, the method enhances studies of bacterial genetics and host-pathogen interactions. The underlying algorithm is extensible to other haploid organisms with adequate reference data, broadening its relevance beyond H. pylori.

Citations: 0
Protein language models uncover carbohydrate-active enzyme function in metagenomics.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-26 DOI: 10.1186/s12859-025-06286-y
Kumar Thurimella, Ahmed M T Mohamed, Chenhao Li, Tommi Vatanen, Daniel B Graham, Róisín M Owens, Sabina Leanti La Rosa, Damian R Plichta, Sergio Bacallado, Ramnik J Xavier

Background: The functional annotation of uncharacterized microbial enzymes from metagenomic data remains a significant challenge, limiting our understanding of microbial metabolic dynamics. Traditional annotation methods often rely on sequence homology, which can fail to identify remote homologs or enzymes with structural rather than sequence conservation. To address this gap, we developed CAZyLingua, the first annotation tool to use protein language models (pLMs) for the accurate classification of carbohydrate-active enzyme (CAZyme) families and subfamilies.

Results: CAZyLingua demonstrated high performance, maintaining precision and recall comparable to state-of-the-art hidden Markov model-based methods while outperforming purely sequence-based approaches. When applied to a metagenomic gene catalog from mother/infant pairs, CAZyLingua identified over 27,000 putative CAZymes missed by other tools, including horizontally transferred enzymes implicated in infant microbiome development. In datasets from patients with Crohn's disease and IgG4-related disease, CAZyLingua uncovered disease-associated CAZymes, highlighting an expansion of carbohydrate esterases (CEs) in IgG4-related disease. A CE17 enzyme predicted to be overabundant in Crohn's disease was functionally validated, confirming its catalytic activity on acetylated manno-oligosaccharides.

Conclusions: CAZyLingua is a powerful tool that effectively augments existing functional annotation pipelines for CAZymes. By leveraging the deep contextual information captured by pLMs, our method can uncover novel CAZyme diversity and reveal enzymatic functions relevant to health and disease, contributing to a further understanding of biological processes related to host health and nutrition.

Citations: 0
AlphaFold Database Structure Extractor: a web server and API to download AlphaFold structures using common protein accessions.
IF 3.3 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-11-25 DOI: 10.1186/s12859-025-06303-0
Niharika Saraf, Vishvesh Karthik, Gaurav Sharma

Background: The AlphaFoldDB Structure Extractor (https://project.iith.ac.in/sharmaglab/alphafoldextractor/) is an open-access web server and API toolkit designed to facilitate the bulk download of predicted protein structures from the AlphaFold Database using well-known accession formats. Addressing the current limitations in extracting structures beyond a restricted list of model organisms and a threshold number, this tool accepts diverse sequence and structure input identifiers, such as NCBI Taxonomy ID, RefSeq accessions, locus tags (old and new), and UniProt or AlphaFold accessions for structure retrieval.

Results: Users can download structure files in PDB, mmCIF, bCIF, and/or PAE JSON formats using any of the above-mentioned accession types as input. The tool also generates an accompanying ID mapping file to trace input identifiers back to standard accession numbers and reports unmapped IDs separately. Users can also perform only the ID mapping if they do not require the structure coordinate files. An API is also provided for programmatic access, enabling integration into bioinformatics pipelines. We have tested the tool using several randomly selected accessions (individual inputs and up to 5000 input accessions) of each type from the NCBI RefSeq and Taxonomy Databases, the UniProt Database, and the AlphaFold Database.
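
The ID-mapping step described above (trace each input identifier to a standard accession and report unmapped IDs separately) can be illustrated with a minimal sketch. It is not the tool's API, which the abstract does not specify. The function name, the in-memory lookup table, and every identifier below are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Tuple


def map_ids(
    inputs: Iterable[str], lookup: Dict[str, str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split input identifiers into (input, accession) pairs and unmapped IDs."""
    mapped: List[Tuple[str, str]] = []
    unmapped: List[str] = []
    for raw in inputs:
        key = raw.strip()
        if not key:
            continue  # ignore blank lines in the input list
        if key in lookup:
            mapped.append((key, lookup[key]))
        else:
            unmapped.append(key)
    return mapped, unmapped


# Hypothetical lookup table and query list (placeholder identifiers only).
lookup = {"LOCUS_0001": "ACC00001", "LOCUS_0002": "ACC00002"}
mapped, unmapped = map_ids(["LOCUS_0001", " LOCUS_0002 ", "", "LOCUS_9999"], lookup)
print(mapped)    # [('LOCUS_0001', 'ACC00001'), ('LOCUS_0002', 'ACC00002')]
print(unmapped)  # ['LOCUS_9999']
```

Keeping the unmapped identifiers in their own list mirrors the separate report mentioned in the abstract, so failed lookups are visible rather than silently dropped.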

Conclusions: Overall, AlphaFoldDB Structure Extractor streamlines the structure procurement process from the AlphaFold Database, empowering researchers in structural and functional genomics who have minimal computational expertise.

Citations: 0