BMC Bioinformatics: latest articles

Solu: a cloud platform for real-time genomic pathogen surveillance.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-13 DOI: 10.1186/s12859-024-06005-z
Timo Saratto, Kerkko Visuri, Jonatan Lehtinen, Irene Ortega-Sanz, Jacob L Steenwyk, Samuel Sihvonen

Background: Genomic surveillance is extensively used for tracking public health outbreaks and healthcare-associated pathogens. Despite advancements in bioinformatics pipelines, there are still significant challenges in terms of infrastructure, expertise, and security when it comes to continuous surveillance. The existing pipelines often require the user to set up and manage their own infrastructure and are not designed for continuous surveillance that demands integration of new and regularly generated sequencing data with previous analyses. Additionally, academic projects often do not meet the privacy requirements of healthcare providers.

Results: We present Solu, a cloud-based platform that integrates genomic data into a real-time, privacy-focused surveillance system.

Evaluation: Solu's accuracy for taxonomy assignment, antimicrobial resistance genes, and phylogenetics was comparable to established pathogen surveillance pipelines. In some cases, Solu identified antimicrobial resistance genes that were previously undetected. Together, these findings demonstrate the efficacy of our platform.

Conclusions: By enabling reliable, user-friendly, and privacy-focused genomic surveillance, Solu has the potential to bridge the gap between cutting-edge research and practical, widespread application in healthcare settings. The platform is available for free academic use at https://platform.solugenomics.com .

BMC Bioinformatics 26(1):12. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11731562/pdf/
Citations: 0
MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-13 DOI: 10.1186/s12859-025-06040-4
Jianwei Li, Xukun Zhang, Bing Li, Ziyu Li, Zhenzhen Chen

Background: MicroRNAs (miRNAs) are pivotal in the initiation and progression of complex human diseases and have been identified as targets for small molecule (SM) drugs. However, the expensive and time-intensive characteristics of conventional experimental techniques for identifying SM-miRNA associations highlight the necessity for efficient computational methodologies in this field.

Results: In this study, we proposed a deep learning method called Multi-source Data Fusion and Graph Neural Networks for Small Molecule-MiRNA Association (MDFGNN-SMMA) to predict potential SM-miRNA associations. First, MDFGNN-SMMA extracted Atom Pairs fingerprints and Molecular ACCess System (MACCS) fingerprints and fused them into feature vectors for small molecules (SMs), while K-mer features were used to generate the initial feature vectors for miRNAs. Second, cosine similarity was computed to construct separate adjacency matrices for SMs and for miRNAs. Third, these feature vectors and adjacency matrices were fed into a model comprising a graph attention network (GAT) and GraphSAGE, which generated the final feature vectors for SMs and miRNAs. Finally, the averaged final feature vectors were used as input to a multilayer perceptron that predicts the associations between SMs and miRNAs.

Conclusions: The performance of MDFGNN-SMMA was assessed using 10-fold cross-validation, in which it outperformed four state-of-the-art models in terms of both AUC and AUPR. Moreover, results on an independent test set confirmed the model's generalization capability. Additionally, the efficacy of MDFGNN-SMMA was substantiated through three case studies: among the top 50 predicted miRNAs associated with Cisplatin, 5-Fluorouracil, and Doxorubicin, 42, 36, and 36 miRNAs, respectively, were corroborated by existing literature and the RNAInter database.
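
The miRNA side of the pipeline described in this abstract (K-mer features followed by a cosine-similarity adjacency matrix) can be illustrated with a minimal sketch. The abstract does not give the value of k, the similarity threshold, or how similarities are turned into edges, so `k=2`, `threshold=0.5`, the binary adjacency, the function names and the toy sequences below are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import product
from math import sqrt

def kmer_vector(seq, k=2, alphabet="ACGU"):
    """Return normalised counts of every overlapping k-mer, in a fixed order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:  # skip k-mers containing unexpected symbols
            counts[sub] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_adjacency(features, threshold=0.5):
    """Assumed rule: link i and j when their cosine similarity reaches the threshold."""
    n = len(features)
    return [[1 if i != j and cosine(features[i], features[j]) >= threshold else 0
             for j in range(n)] for i in range(n)]

# Toy RNA sequences (illustrative only)
mirnas = ["UGAGGUAGUAGGUUGUAUAGUU", "UGAGGUAGUAGGUUGUGUGGUU", "CCCCCCCCGGGGGGGG"]
features = [kmer_vector(s) for s in mirnas]
adj = similarity_adjacency(features)
for row in adj:
    print(row)
```

With these toy inputs the first two sequences share most of their 2-mers and end up connected, while the third stays isolated. A graph layer such as GAT or GraphSAGE would then take `features` and `adj` as its node features and graph structure.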

BMC Bioinformatics 26(1):13. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11730471/pdf/
Citations: 0
Not seeing the trees for the forest. The impact of neighbours on graph-based configurations in histopathology.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-11 DOI: 10.1186/s12859-024-06007-x
Olga Fourkioti, Matt De Vries, Reed Naidoo, Chris Bakal

Background: Deep learning (DL) has set new standards in cancer diagnosis, significantly enhancing the accuracy of automated classification of whole slide images (WSIs) derived from biopsied tissue samples. To enable DL models to process these large images, WSIs are typically divided into thousands of smaller tiles, each containing 10-50 cells. Multiple Instance Learning (MIL) is a commonly used approach, where WSIs are treated as bags comprising numerous tiles (instances) and only bag-level labels are provided during training. The model learns from these broad labels to extract more detailed, instance-level insights. However, biopsied sections often exhibit high intra- and inter-phenotypic heterogeneity, presenting a significant challenge for classification. To address this, many graph-based methods have been proposed, where each WSI is represented as a graph with tiles as nodes and edges defined by specific spatial relationships.

Results: In this study, we investigate how different graph configurations, varying in connectivity and neighborhood structure, affect the performance of MIL models. We developed a novel pipeline, K-MIL, to evaluate the impact of contextual information on cell classification performance. By incorporating neighboring tiles into the analysis, we examined whether contextual information improves or impairs the network's ability to identify patterns and features critical for accurate classification. Our experiments were conducted on two datasets: the COLON cancer dataset and the UCSB dataset.

Conclusions: Our results indicate that while incorporating more spatial context information generally improves model accuracy at both the bag and tile levels, the improvement at the tile level is not linear. In some instances, increasing spatial context leads to misclassification, suggesting that more context is not always beneficial. This finding highlights the need for careful consideration when incorporating spatial context information in digital pathology classification tasks.
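
To make the idea of "graph configurations varying in connectivity and neighborhood structure" concrete, here is a minimal sketch that builds a tile graph by linking each tile to its k nearest neighbours. How K-MIL actually defines edges is not stated in the abstract, so the k-nearest-neighbour rule, the Euclidean distance on tile centres, the function name `knn_edges` and the 3x3 toy grid are all assumptions.

```python
from math import dist

def knn_edges(coords, k):
    """Return undirected edges linking each tile to its k nearest tiles."""
    edges = set()
    for i, p in enumerate(coords):
        # Rank the other tiles by distance between tile centres; ties broken by index.
        nearest = sorted((j for j in range(len(coords)) if j != i),
                         key=lambda j: (dist(p, coords[j]), j))[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges

# Tile centres on a 3x3 grid (illustrative only); the centre tile has index 4
tiles = [(x, y) for y in range(3) for x in range(3)]
sparse = knn_edges(tiles, k=1)  # little spatial context per tile
dense = knn_edges(tiles, k=4)   # more spatial context per tile
print(len(sparse), len(dense))
```

Raising k gives every tile a wider neighbourhood to aggregate over, which is the knob whose effect on bag-level and tile-level accuracy the abstract discusses.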

BMC Bioinformatics 26(1):9. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11724494/pdf/
Citations: 0
UniAMP: enhancing AMP prediction using deep neural networks with inferred information of peptides.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-11 DOI: 10.1186/s12859-025-06033-3
Zixin Chen, Chengming Ji, Wenwen Xu, Jianfeng Gao, Ji Huang, Huanliang Xu, Guoliang Qian, Junxian Huang

Antimicrobial peptides (AMPs) are widely recognized as a promising answer to the antimicrobial resistance that the growing abuse of antibiotics in medicine and agriculture has produced around the globe. In this study, we propose UniAMP, a systematic prediction framework for discovering AMPs. We observe that the feature vectors used in existing studies, which are constructed from peptide information such as sequence, composition, and structure, can be augmented and even replaced by information inferred by deep learning models. Specifically, we use a feature vector of 2924 values inferred by two deep learning models, UniRep and ProtT5, to demonstrate that such inferred information suffices for the task when combined with our proposed deep neural network, which is composed of fully connected layers and transformer encoders and predicts the antibacterial activity of peptides. Evaluation results show that the proposed model outperforms existing studies on both balanced benchmark datasets and imbalanced test datasets. We then analyze the relations among peptide sequences, manually extracted features, and information automatically inferred by deep learning models, and observe that the inferred information is more comprehensive and non-redundant for the task of predicting AMPs. Moreover, this approach alleviates the impact of the scarcity of positive data and demonstrates great potential for future research and applications.
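
The 2924-value feature vector is consistent with simply concatenating one embedding from each model. The abstract does not state the split, so the 1900 + 1024 breakdown below is an assumption based on the embedding widths usually reported for UniRep and ProtT5. The `placeholder_embedding` function is a hypothetical stand-in that returns deterministic pseudo-random numbers instead of calling either model.

```python
import random

UNIREP_DIM = 1900  # assumption: width of an averaged UniRep hidden state
PROTT5_DIM = 1024  # assumption: width of a mean-pooled ProtT5 embedding

def placeholder_embedding(sequence, dim, tag):
    """Hypothetical stand-in for a real model call: a deterministic pseudo-random vector."""
    rng = random.Random(f"{tag}:{sequence}")
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def peptide_features(sequence):
    """Concatenate the two per-peptide embeddings into one feature vector."""
    unirep = placeholder_embedding(sequence, UNIREP_DIM, "unirep")
    prott5 = placeholder_embedding(sequence, PROTT5_DIM, "prott5")
    return unirep + prott5

features = peptide_features("GIGKFLHSAKKFGKAFVGEIMNS")  # example peptide string
print(len(features))  # 2924
```

In a real pipeline the two placeholder calls would be replaced by the models' own inference code, and the concatenated vector would feed the classifier.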

BMC Bioinformatics 26(1):10. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11725221/pdf/
Citations: 0
BiomiX, a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-10 DOI: 10.1186/s12859-024-06022-y
Cristian Iperi, Álvaro Fernández-Ochoa, Guillermo Barturen, Jacques-Olivier Pers, Nathan Foulquier, Eleonore Bettacchioli, Marta Alarcón-Riquelme, Divi Cornec, Anne Bordron, Christophe Jamin

Background: Understanding changes in biological systems requires interpreting vast amounts of multi-omics data. While user-friendly tools exist for single-omics analysis, integrating multiple omics still requires bioinformatics expertise, limiting accessibility for the broader scientific community.

Results: BiomiX tackles the bottleneck in high-throughput omics data analysis, enabling efficient and integrated analysis of multi-omics data obtained from two cohorts. BiomiX incorporates diverse omics data: it uses the DESeq2/Limma packages for transcriptomics and quantifies metabolomics peak differences, which are evaluated via the Wilcoxon test with False Discovery Rate correction. Annotation of untargeted Liquid Chromatography-Mass Spectrometry metabolomics is additionally supported using the mass-to-charge ratio in the CEU Mass Mediator database and fragmentation spectra in the TidyMass package. Methylomics analysis is performed using the ChAMP R package. Finally, Multi-Omics Factor Analysis (MOFA) integration identifies shared sources of variation across omics data. BiomiX also generates statistics and report figures, and integrates EnrichR and GSEA for biological process exploration and for subgroup analysis based on user-defined gene panels, which enhances condition subtyping. BiomiX fine-tunes MOFA models to optimize the selection of the number of factors, distinguishing between cohorts and providing tools to interpret discriminative MOFA factors. The interpretation relies on an innovative literature search on PubMed, which retrieves the articles most related to the contributors of the discriminant factors. Furthermore, discriminant MOFA factors are correlated with clinical data, and the top contributing pathways are explored, all with the aim of guiding the user in factor interpretation.

Conclusions: The analysis of single-omics and multi-omics integration in a standalone tool, along with MOFA implementation and its interpretability via literature, represents significant progress in the multi-omics field in line with the "Findable, Accessible, Interoperable, and Reusable" data principles. BiomiX offers a wide range of parameters and interactive data visualization, allowing for personalized analysis tailored to user needs. This R-based, user-friendly tool is compatible with multiple operating systems and aims to make multi-omics analysis accessible to non-experts in bioinformatics.
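
The "False Discovery Rate correction" applied to the per-peak Wilcoxon p-values can be sketched as follows. The abstract does not name the procedure, so the Benjamini-Hochberg step-up method is assumed here. BiomiX itself is R-based, and this Python version with made-up p-values only illustrates the arithmetic.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values, one per metabolomics peak (illustrative only)
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
print([round(q, 4) for q in qvals])  # [0.04, 0.0533, 0.0533, 0.2]
```

Peaks whose adjusted value falls below the chosen FDR level (0.05 is a common choice) would be reported as differing between the two cohorts. Here only the first peak would pass.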

BMC Bioinformatics 26(1):8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11721463/pdf/
Citations: 0
Hybrid natural language processing tool for semantic annotation of medical texts in Spanish.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-08 DOI: 10.1186/s12859-024-05949-6
Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión

Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a named entity recognition (NER) tool for texts in Spanish that combines deep learning-based and lexicon-based components. It performs medical NER and normalization, medication information extraction, and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and distributed the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development.

Results: In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019).

Conclusions: The tool is available at https://claramed.csic.es/medspaner . We also release the code ( https://github.com/lcampillos/medspaner ) and the annotated corpus to train the models.
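
The F1 values quoted in this abstract, both for inter-annotator agreement and for model evaluation, are harmonic means of precision and recall over annotated entities. The sketch below assumes exact-match scoring, where an entity counts as correct only if its start offset, end offset and label all agree. The abstract does not say which matching criterion was used, and the spans and labels below are made up.

```python
def span_f1(gold, predicted):
    """Exact-match F1 between two collections of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up character spans and labels (illustrative only)
gold = [(0, 11, "DISO"), (20, 31, "CHEM"), (40, 47, "PROC")]
pred = [(0, 11, "DISO"), (20, 31, "CHEM"), (50, 55, "ANAT")]
score = span_f1(gold, pred)
print(round(score, 3))  # 0.667
```

The same function covers inter-annotator agreement if one annotator's spans are passed as `gold` and the other's as `predicted`, since exact-match F1 is symmetric in its two arguments.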

BMC Bioinformatics 26(1):7 (2025-01-08). DOI: 10.1186/s12859-024-05949-6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11708069/pdf/
Citations: 0
CDPMF-DDA: contrastive deep probabilistic matrix factorization for drug-disease association prediction.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06032-w
Xianfang Tang, Yawen Hou, Yajie Meng, Zhaojing Wang, Changcheng Lu, Juan Lv, Xinrong Hu, Junlin Xu, Jialiang Yang

Developing new drugs is a complex process; drug-disease association (DDA) prediction instead aims to identify new therapeutic uses for existing medications. However, existing graph contrastive learning approaches typically rely on single-view contrastive learning, which struggles to fully capture drug-disease relationships. To address this, we introduce a novel multi-view contrastive learning framework, named CDPMF-DDA, which enhances the model's ability to capture drug-disease associations by incorporating diverse information representations from different views. First, we decompose the original drug-disease association matrix into drug and disease feature matrices, which are then used to reconstruct the drug-disease association network, as well as the drug-drug and disease-disease similarity networks. This process effectively reduces noise in the data, establishing a reliable foundation for the networks produced. Next, we generate multiple contrastive views from both the original and generated networks. These views effectively capture hidden feature associations, significantly enhancing the model's ability to represent complex relationships. Extensive cross-validation experiments on three standard datasets show that CDPMF-DDA achieves an average AUC of 0.9475 and an AUPR of 0.5009, outperforming existing models. Additionally, case studies on Alzheimer's disease and epilepsy further validate the model's effectiveness, demonstrating its high accuracy and robustness in drug-disease association prediction. Built on a multi-view contrastive learning framework, CDPMF-DDA can integrate multi-source information and effectively capture complex drug-disease associations, making it a powerful tool for drug repositioning and the discovery of new therapeutic strategies.
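The decomposition step can be illustrated with a minimal sketch. It uses ordinary low-rank matrix factorization trained by stochastic gradient descent, which is a stand-in chosen for illustration and not the deep probabilistic model the abstract describes. The toy matrix `assoc`, the function names, and all hyperparameters are assumptions.

```python
import random

def factorize(assoc, rank=2, epochs=2000, lr=0.05, reg=0.01, seed=0):
    """Factor a drugs x diseases 0/1 matrix into drug and disease feature matrices."""
    rng = random.Random(seed)
    n_drugs, n_dis = len(assoc), len(assoc[0])
    # Small random initial features (hypothetical choice of scale).
    drug_f = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_drugs)]
    dis_f = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_dis)]
    for _ in range(epochs):
        for i in range(n_drugs):
            for j in range(n_dis):
                pred = sum(drug_f[i][k] * dis_f[j][k] for k in range(rank))
                err = assoc[i][j] - pred
                for k in range(rank):
                    d, s = drug_f[i][k], dis_f[j][k]
                    # Gradient step on squared error with L2 regularization.
                    drug_f[i][k] += lr * (err * s - reg * d)
                    dis_f[j][k] += lr * (err * d - reg * s)
    return drug_f, dis_f

def reconstruct(drug_f, dis_f):
    """Rebuild the association scores as inner products of the feature vectors."""
    return [[sum(a * b for a, b in zip(d, s)) for s in dis_f] for d in drug_f]

def mse(a, b):
    """Mean squared difference between two equally shaped matrices."""
    cells = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(cells) / len(cells)

# Toy association matrix (made-up data): 4 drugs x 3 diseases.
assoc = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
drug_f, dis_f = factorize(assoc)
recon = reconstruct(drug_f, dis_f)
print(round(mse(assoc, recon), 4))
```

In a real setting, unobserved pairs with high reconstructed scores would be treated as candidate associations; the toy matrix here has exact rank 2, so a rank-2 factorization can reproduce it closely.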

BMC Bioinformatics 26(1):5 (2025-01-07). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11708303/pdf/
Citations: 0
Causal models and prediction in cell line perturbation experiments.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06027-7
James P Long, Yumeng Yang, Shohei Shimizu, Thong Pham, Kim-Anh Do

In cell line perturbation experiments, a collection of cells is perturbed with external agents and responses such as protein expression are measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational models that can predict cellular responses to perturbations in silico. A central challenge for these models is to predict the effect of new, previously untested perturbations that were not used in the training data. Here we propose causal structural equations for modeling how perturbations affect cells. From this model, we derive two estimators for predicting responses: a Linear Regression (LR) estimator and a causal structure learning estimator that we term Causal Structure Regression (CSR). The CSR estimator requires more assumptions than LR, but can predict the effects of drugs that were not applied in the training data. Next, we present Cellbox, a recently proposed model based on a system of ordinary differential equations (ODEs) that obtained the best prediction performance on a melanoma cell line perturbation data set (Yuan et al. in Cell Syst 12:128-140, 2021). We derive analytic results that show a close connection between CSR and Cellbox, providing a new causal interpretation for the Cellbox model. We compare LR and CSR/Cellbox in simulations, highlighting the strengths and weaknesses of the two approaches. Finally, we compare the performance of LR and CSR/Cellbox on the benchmark melanoma data set. We find that the LR model has comparable or slightly better performance than Cellbox.
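As a rough illustration of the LR idea, the sketch below fits ordinary least squares from drug doses to a single measured response and predicts an untested dose combination. It is generic regression on made-up numbers, not the estimator derived in the paper; the dose encoding, the data, and the function names are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(doses, responses):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(row) for row in doses]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, responses)) for i in range(p)]
    return solve(xtx, xty)

def predict(coef, dose):
    """Predicted response for one dose vector."""
    return coef[0] + sum(c * d for c, d in zip(coef[1:], dose))

# Made-up training data: doses of two drugs and one protein response,
# generated from response = 1 + 2 * dose_a - 3 * dose_b.
doses = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
responses = [1.0, 3.0, -2.0, 0.0, 2.0]

coef = fit_ols(doses, responses)
new_prediction = predict(coef, (2, 2))  # a dose combination not in the training set
print([round(c, 6) for c in coef], round(new_prediction, 6))
```

A drug that never appears in the training doses would contribute an all-zero column, leaving its coefficient unidentifiable (the normal equations become singular). That is consistent with the abstract's remark that predicting unseen drugs requires the additional assumptions of CSR.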

BMC Bioinformatics 26(1):4 (2025-01-07). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707890/pdf/
Citations: 0
A metric and its derived protein network for evaluation of ortholog database inconsistency.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06023-x
Weijie Yang, Jingsi Ji, Gang Fang

Background: Ortholog prediction, essential for various genomic research areas, faces growing inconsistencies amidst the expanding array of ortholog databases. The common strategy of computing consensus orthologs introduces additional arbitrariness, emphasizing the need to examine the causes of such inconsistencies and identify proteins susceptible to prediction errors.

Results: We introduce the Signal Jaccard Index (SJI), a novel metric rooted in unsupervised genome context clustering, designed to assess protein similarity. Leveraging SJI, we construct a protein network and reveal that peripheral proteins within the network are the primary contributors to inconsistencies in orthology predictions. Furthermore, we show that a protein's degree centrality in the network serves as a strong predictor of its reliability in consensus sets.

Conclusions: We present an objective, unsupervised SJI-based network encompassing all proteins, whose topological features elucidate ortholog prediction inconsistencies. The degree centrality (DC) effectively identifies error-prone orthology assignments without relying on arbitrary parameters. Notably, DC is stable, unaffected by species selection, and well-suited for ortholog benchmarking. This approach transcends the limitations of universal thresholds, offering a robust and quantitative framework to explore protein evolution and functional relationships.
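To make the network idea concrete, here is a small sketch of a plain Jaccard index over per-protein label sets and the degree centrality of the resulting graph. The abstract does not define SJI, so this uses the standard Jaccard index as a stand-in; the cluster labels, the rule that any shared label creates an edge, and the function names are all assumptions.

```python
def jaccard(a, b):
    """Standard Jaccard index: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def degree_centrality(signals):
    """Degree of each protein divided by (n - 1), the usual normalization."""
    names = sorted(signals)
    degree = {name: 0 for name in names}
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            # Hypothetical edge rule: connect proteins sharing at least one label.
            if jaccard(signals[p], signals[q]) > 0:
                degree[p] += 1
                degree[q] += 1
    denom = max(len(names) - 1, 1)
    return {name: d / denom for name, d in degree.items()}

# Made-up proteins with hypothetical genome-context cluster labels.
signals = {
    "P1": {"c1", "c2"},
    "P2": {"c1", "c2", "c3"},
    "P3": {"c1", "c4"},
    "P4": {"c4"},
}

dc = degree_centrality(signals)
most_peripheral = min(dc, key=dc.get)
print(dc, most_peripheral)
```

Under these toy assumptions, the protein with the lowest degree centrality sits at the periphery of the network, which is the kind of protein the abstract reports as the main source of inconsistent orthology predictions.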

BMC Bioinformatics 26(1):6 (2025-01-07). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707888/pdf/
Citations: 0
DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic Genomes.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06030-y
Mohamed Elmanzalawi, Takatomo Fujisawa, Hiroshi Mori, Yasukazu Nakamura, Yasuhiro Tanizawa

Background: Accurate taxonomic classification in genome databases is essential for reliable biological research and effective data sharing. Mislabeling or inaccuracies in genome annotations can lead to incorrect scientific conclusions and hinder the reproducibility of research findings. Despite advances in genome analysis techniques, challenges persist in ensuring precise and reliable taxonomic assignments. Existing tools for genome verification often involve extensive computational resources or lengthy processing times, which can limit their accessibility and scalability for large-scale projects. There is a need for more efficient, user-friendly solutions that can handle diverse datasets and provide accurate results with minimal computational demands. This work aimed to address these challenges by introducing a novel tool that enhances taxonomic accuracy, offers a user-friendly interface, and supports large-scale analyses.

Results: We introduce DFAST_QC, a novel tool for quality control and taxonomic classification of prokaryotic genomes, which is available as both a command-line tool and a web service. DFAST_QC can quickly identify species based on NCBI and GTDB taxonomies by combining genome-distance calculations using MASH with ANI calculations using Skani. We evaluated DFAST_QC's performance in species identification and found it to be highly consistent with existing taxonomic standards, successfully identifying species across diverse datasets. In several cases, DFAST_QC identified potential mislabeling of species names in public databases and highlighted discrepancies in current classifications, demonstrating its capability to uncover errors and enhance taxonomic accuracy. Additionally, the tool's efficient design allows it to operate smoothly on local machines with minimal computational requirements, making it a practical choice for large-scale genome projects.

Conclusions: DFAST_QC is a reliable and efficient tool for accurate taxonomic identification and genome quality control, well-suited for large-scale genomic studies. Its compatibility with limited-resource environments, combined with its user-friendly design, ensures seamless integration into existing workflows. DFAST_QC's ability to refine species assignments in public databases highlights its value as a complementary tool for maintaining and enhancing the accuracy of taxonomic data in genomic research. The web version is available at https://dfast.ddbj.nig.ac.jp/dqc/submit/ , and the source code for local use can be found at https://github.com/nigyta/dfast_qc .
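The genome-distance prefilter mentioned in the Results can be sketched as follows. The distance uses the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sets. Everything else is a simplification for illustration: real Mash estimates j from MinHash sketches of canonical k-mers with a much larger k, whereas this sketch uses exact k-mer sets, a tiny k, and made-up sequences. The function names and the `shortlist` step are assumptions and do not reproduce DFAST_QC's actual code; the follow-up ANI calculation is not shown.

```python
import math

def kmers(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(seq_a, seq_b, k=4):
    """Mash-style distance from the exact k-mer Jaccard index (no MinHash sketching)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    j = len(a & b) / len(union) if union else 0.0
    if j == 0.0:
        return 1.0  # no shared k-mers: treat as maximally distant
    return -math.log(2 * j / (1 + j)) / k

def shortlist(query, references, k=4, top=2):
    """Rank reference genomes by distance and keep the closest few for a slower ANI step."""
    ranked = sorted(references, key=lambda name: mash_distance(query, references[name], k))
    return ranked[:top]

# Made-up sequences standing in for a query genome and reference genomes.
query = "ACGTACGTTGCA"
references = {
    "same_species": "ACGTACGTTGCC",
    "other": "ACGTTTTTTGCA",
    "distant": "GGGGCCCCAAAA",
}

candidates = shortlist(query, references)
print(candidates)
```

The design point is that a cheap distance narrows the reference set so the more expensive ANI comparison only runs on a few candidates.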

BMC Bioinformatics 26(1):3 (2025-01-07). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11705978/pdf/
Citations: 0