BMC Bioinformatics: Latest Articles

Not seeing the trees for the forest. The impact of neighbours on graph-based configurations in histopathology.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-01-11 | DOI: 10.1186/s12859-024-06007-x
Olga Fourkioti, Matt De Vries, Reed Naidoo, Chris Bakal

Background: Deep learning (DL) has set new standards in cancer diagnosis, significantly enhancing the accuracy of automated classification of whole slide images (WSIs) derived from biopsied tissue samples. To enable DL models to process these large images, WSIs are typically divided into thousands of smaller tiles, each containing 10-50 cells. Multiple Instance Learning (MIL) is a commonly used approach, where WSIs are treated as bags comprising numerous tiles (instances) and only bag-level labels are provided during training. The model learns from these broad labels to extract more detailed, instance-level insights. However, biopsied sections often exhibit high intra- and inter-phenotypic heterogeneity, presenting a significant challenge for classification. To address this, many graph-based methods have been proposed, where each WSI is represented as a graph with tiles as nodes and edges defined by specific spatial relationships.

Results: In this study, we investigate how different graph configurations, varying in connectivity and neighborhood structure, affect the performance of MIL models. We developed a novel pipeline, K-MIL, to evaluate the impact of contextual information on cell classification performance. By incorporating neighboring tiles into the analysis, we examined whether contextual information improves or impairs the network's ability to identify patterns and features critical for accurate classification. Our experiments were conducted on two datasets: the COLON cancer dataset and the UCSB dataset.

Conclusions: Our results indicate that while incorporating more spatial context information generally improves model accuracy at both the bag and tile levels, the improvement at the tile level is not linear. In some instances, increasing spatial context leads to misclassification, suggesting that more context is not always beneficial. This finding highlights the need for careful consideration when incorporating spatial context information in digital pathology classification tasks.
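
The graph representation described in the Background (tiles as nodes, edges defined by spatial relationships) can be sketched as a k-nearest-neighbour graph over tile positions. This is only an illustration: the abstract does not say how K-MIL builds its graphs, so the function name `knn_tile_graph`, the grid coordinates, the Euclidean distance, and the value of `k` are all assumptions.

```python
from math import dist

def knn_tile_graph(coords, k):
    """Return {tile_index: [indices of its k nearest tiles]} (Euclidean distance)."""
    graph = {}
    for i, p in enumerate(coords):
        # Rank every other tile by distance to tile i; ties are broken by index.
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: (dist(p, coords[j]), j))
        graph[i] = others[:k]
    return graph

# Hypothetical 3x3 grid of tile positions (row, col); index = 3 * row + col.
tiles = [(r, c) for r in range(3) for c in range(3)]
graph = knn_tile_graph(tiles, k=4)
print(graph[4])  # centre tile -> [1, 3, 5, 7], its four edge-adjacent tiles
```

Raising `k` widens the spatial context each tile receives, which is the kind of configuration change whose effect the study examines.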

BMC Bioinformatics 26(1): 9. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11724494/pdf/ Citations: 0
UniAMP: enhancing AMP prediction using deep neural networks with inferred information of peptides.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-01-11 | DOI: 10.1186/s12859-025-06033-3
Zixin Chen, Chengming Ji, Wenwen Xu, Jianfeng Gao, Ji Huang, Huanliang Xu, Guoliang Qian, Junxian Huang

Antimicrobial peptides (AMPs) have been widely recognized as a promising solution to the antimicrobial resistance that microorganisms have developed through the increasing abuse of antibiotics in medicine and agriculture around the globe. In this study, we propose UniAMP, a systematic prediction framework for discovering AMPs. We observe that the feature vectors used in various existing studies, which are constructed from peptide information such as sequence, composition, and structure, can be augmented and even replaced by information inferred by deep learning models. Specifically, we use a feature vector of 2924 values inferred by two deep learning models, UniRep and ProtT5, to demonstrate that such inferred peptide information suffices for the task when combined with our proposed deep neural network, which is composed of fully connected layers and transformer encoders and predicts the antibacterial activity of peptides. Evaluation results demonstrate superior performance of our proposed model on both balanced benchmark datasets and imbalanced test datasets compared with existing studies. Subsequently, we analyze the relations among peptide sequences, manually extracted features, and information automatically inferred by deep learning models, leading to the observation that the inferred information is more comprehensive and non-redundant for the task of predicting AMPs. Moreover, this approach alleviates the impact of the scarcity of positive data and demonstrates great potential in future research and applications.
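
A minimal sketch of how a 2924-value feature vector could be assembled by concatenating two embeddings. The abstract does not state how the 2924 values are split; the sketch assumes 1900 values from UniRep and 1024 from ProtT5, which are the commonly used embedding sizes for those models and happen to sum to 2924. `fake_embedding` is a hypothetical stand-in for the real model calls, and the peptide string is just an example input.

```python
UNIREP_DIM = 1900   # assumed size of a UniRep embedding
PROTT5_DIM = 1024   # assumed size of a pooled ProtT5 embedding

def fake_embedding(sequence, dim):
    """Hypothetical placeholder for a real model call: deterministic pseudo-embedding."""
    return [((len(sequence) * (i + 1)) % 7) / 7.0 for i in range(dim)]

def build_feature_vector(sequence):
    """Concatenate the two per-peptide embeddings into one network input."""
    return fake_embedding(sequence, UNIREP_DIM) + fake_embedding(sequence, PROTT5_DIM)

features = build_feature_vector("GIGKFLHSAKKFGKAFVGEIMNS")  # example peptide sequence
print(len(features))  # 2924
```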

BMC Bioinformatics 26(1): 10. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11725221/pdf/ Citations: 0
BiomiX, a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-01-10 | DOI: 10.1186/s12859-024-06022-y
Cristian Iperi, Álvaro Fernández-Ochoa, Guillermo Barturen, Jacques-Olivier Pers, Nathan Foulquier, Eleonore Bettacchioli, Marta Alarcón-Riquelme, Divi Cornec, Anne Bordron, Christophe Jamin

Background: Understanding changes in biological systems requires interpreting vast amounts of multi-omics data. While user-friendly tools exist for single-omics analysis, integrating multiple omics still requires bioinformatics expertise, limiting accessibility for the broader scientific community.

Results: BiomiX tackles the bottleneck in high-throughput omics data analysis, enabling efficient and integrated analysis of multiomics data obtained from two cohorts. BiomiX incorporates diverse omics data: it uses the DESeq2/Limma packages for transcriptomics and quantifies metabolomics peak differences, evaluated via the Wilcoxon test with False Discovery Rate correction. Annotation of untargeted Liquid Chromatography-Mass Spectrometry metabolomics is additionally supported using the mass-to-charge ratio in the CEU Mass Mediator database and fragmentation spectra in the TidyMass package. Methylomics analysis is performed using the ChAMP R package. Finally, Multi-Omics Factor Analysis (MOFA) integration identifies shared sources of variation across omics data. BiomiX also generates statistics and report figures, and integrates EnrichR and GSEA for biological process exploration and for subgroup analysis based on user-defined gene panels, enhancing condition subtyping. BiomiX fine-tunes MOFA models to optimize the selection of the number of factors, distinguishing between cohorts and providing tools to interpret discriminative MOFA factors. The interpretation relies on an innovative bibliography search on PubMed, which retrieves the articles most related to the contributors to each discriminant factor. Furthermore, discriminant MOFA factors are correlated with clinical data, and the top contributing pathways are explored, all with the aim of guiding the user in factor interpretation.

Conclusions: The analysis of single-omics and multi-omics integration in a standalone tool, along with MOFA implementation and its interpretability via literature, represents significant progress in the multi-omics field in line with the "Findable, Accessible, Interoperable, and Reusable" data principles. BiomiX offers a wide range of parameters and interactive data visualization, allowing for personalized analysis tailored to user needs. This R-based, user-friendly tool is compatible with multiple operating systems and aims to make multi-omics analysis accessible to non-experts in bioinformatics.
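
The False Discovery Rate correction mentioned in the Results can be illustrated with a short sketch. BiomiX itself is R-based and the abstract does not name the FDR procedure, so this Python version assumes the widely used Benjamini-Hochberg method; the function name and the raw p-values are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]   # hypothetical per-peak Wilcoxon p-values
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])  # [0.04, 0.0533, 0.0533, 0.2]
```

A peak would then be reported as differential when its adjusted value falls below the chosen FDR level.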

BMC Bioinformatics 26(1): 8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11721463/pdf/ Citations: 0
Hybrid natural language processing tool for semantic annotation of medical texts in Spanish.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-01-08 | DOI: 10.1186/s12859-024-05949-6
Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión

Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning-based and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction, and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and distribute the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development.

Results: In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019).

Conclusions: The tool is available at https://claramed.csic.es/medspaner . We also release the code ( https://github.com/lcampillos/medspaner ) and the annotated corpus to train the models.
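
The F1 figures reported above, both for inter-annotator agreement and for model validation, compare one set of annotated entities against another. A minimal sketch of an exact-match entity F1 follows; the abstract does not say which matching criterion was used, so exact span-and-label matching is an assumption, and the spans and labels below are hypothetical.

```python
def span_f1(reference, candidate):
    """Exact-match precision, recall and F1 over sets of (start, end, label) spans."""
    reference, candidate = set(reference), set(candidate)
    tp = len(reference & candidate)                      # spans both sides agree on
    precision = tp / len(candidate) if candidate else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical annotations of one text by two annotators (or gold vs. model).
annotator_a = {(0, 9, "DISO"), (15, 24, "CHEM"), (30, 38, "PROC")}
annotator_b = {(0, 9, "DISO"), (15, 24, "CHEM"), (40, 45, "ANAT")}
precision, recall, f1 = span_f1(annotator_a, annotator_b)
print(round(f1, 3))  # 0.667
```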

BMC Bioinformatics 26(1): 7. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11708069/pdf/ Citations: 0
CDPMF-DDA: contrastive deep probabilistic matrix factorization for drug-disease association prediction.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-01-07 | DOI: 10.1186/s12859-024-06032-w
Xianfang Tang, Yawen Hou, Yajie Meng, Zhaojing Wang, Changcheng Lu, Juan Lv, Xinrong Hu, Junlin Xu, Jialiang Yang

The process of new drug development is complex, so drug-disease association (DDA) prediction aims to identify new therapeutic uses for existing medications. However, existing graph contrastive learning approaches typically rely on single-view contrastive learning, which struggles to fully capture drug-disease relationships. To address this, we introduce a novel multi-view contrastive learning framework, named CDPMF-DDA, which enhances the model's ability to capture drug-disease associations by incorporating diverse information representations from different views. First, we decompose the original drug-disease association matrix into drug and disease feature matrices, which are then used to reconstruct the drug-disease association network, as well as the drug-drug and disease-disease similarity networks. This process effectively reduces noise in the data, establishing a reliable foundation for the networks produced. Next, we generate multiple contrastive views from both the original and generated networks. These views effectively capture hidden feature associations, significantly enhancing the model's ability to represent complex relationships. Extensive cross-validation experiments on three standard datasets show that CDPMF-DDA achieves an average AUC of 0.9475 and an AUPR of 0.5009, outperforming existing models. Additionally, case studies on Alzheimer's disease and epilepsy further validate the model's effectiveness, demonstrating its high accuracy and robustness in drug-disease association prediction. Based on a multi-view contrastive learning framework, CDPMF-DDA is capable of integrating multi-source information and effectively capturing complex drug-disease associations, making it a powerful tool for drug repositioning and the discovery of new therapeutic strategies.
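
The first step described above, decomposing the association matrix into drug and disease feature matrices and reconstructing it, can be sketched with a plain gradient-descent factorization. This is not the deep probabilistic or contrastive model of the paper; it only illustrates the decompose-then-reconstruct idea. The toy matrix, rank, learning rate, and function name are assumptions.

```python
import random

def factorize(A, rank=2, steps=500, lr=0.05, seed=0):
    """Fit A ~ U V^T by full-batch gradient descent on squared error."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    U = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n)]  # drug features
    V = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(m)]  # disease features

    def predict(i, j):
        return sum(U[i][k] * V[j][k] for k in range(rank))

    def loss():
        return sum((predict(i, j) - A[i][j]) ** 2 for i in range(n) for j in range(m))

    history = [loss()]  # loss before training
    for _ in range(steps):
        E = [[predict(i, j) - A[i][j] for j in range(m)] for i in range(n)]
        gU = [[sum(E[i][j] * V[j][k] for j in range(m)) for k in range(rank)] for i in range(n)]
        gV = [[sum(E[i][j] * U[i][k] for i in range(n)) for k in range(rank)] for j in range(m)]
        for i in range(n):
            for k in range(rank):
                U[i][k] -= lr * gU[i][k]
        for j in range(m):
            for k in range(rank):
                V[j][k] -= lr * gV[j][k]
    history.append(loss())  # loss after training
    return U, V, history

# Hypothetical 4 drugs x 3 diseases association matrix (1 = known association).
A = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
U, V, history = factorize(A)
print(history[-1] < history[0])  # True: the reconstruction error has decreased
```

The reconstructed scores `U V^T` are what would be ranked to propose new drug-disease pairs; the low rank is what smooths out noise in the original matrix.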

BMC Bioinformatics 26(1): 5. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11708303/pdf/ Citations: 0
Causal models and prediction in cell line perturbation experiments.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2025-01-07 | DOI: 10.1186/s12859-024-06027-7
James P Long, Yumeng Yang, Shohei Shimizu, Thong Pham, Kim-Anh Do

In cell line perturbation experiments, a collection of cells is perturbed with external agents and responses such as protein expression are measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational models that can predict cellular responses to perturbations in silico. A central challenge for these models is to predict the effect of new, previously untested perturbations that were not used in the training data. Here we propose causal structural equations for modeling how perturbations affect cells. From this model, we derive two estimators for predicting responses: a Linear Regression (LR) estimator and a causal structure learning estimator that we term Causal Structure Regression (CSR). The CSR estimator requires more assumptions than LR, but can predict the effects of drugs that were not applied in the training data. Next we present Cellbox, a recently proposed model based on a system of ordinary differential equations (ODEs) that obtained the best prediction performance on a melanoma cell line perturbation data set (Yuan et al. in Cell Syst 12:128-140, 2021). We derive analytic results that show a close connection between CSR and Cellbox, providing a new causal interpretation for the Cellbox model. We compare LR and CSR/Cellbox in simulations, highlighting the strengths and weaknesses of the two approaches. Finally, we compare the performance of LR and CSR/Cellbox on the benchmark melanoma data set. We find that the LR model has comparable or slightly better performance than Cellbox.

{"title":"Causal models and prediction in cell line perturbation experiments.","authors":"James P Long, Yumeng Yang, Shohei Shimizu, Thong Pham, Kim-Anh Do","doi":"10.1186/s12859-024-06027-7","DOIUrl":"https://doi.org/10.1186/s12859-024-06027-7","url":null,"abstract":"<p><p>In cell line perturbation experiments, a collection of cells is perturbed with external agents and responses such as protein expression measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational models that can predict cellular responses to perturbations in silico. A central challenge for these models is to predict the effect of new, previously untested perturbations that were not used in the training data. Here we propose causal structural equations for modeling how perturbations effect cells. From this model, we derive two estimators for predicting responses: a Linear Regression (LR) estimator and a causal structure learning estimator that we term Causal Structure Regression (CSR). The CSR estimator requires more assumptions than LR, but can predict the effects of drugs that were not applied in the training data. Next we present Cellbox, a recently proposed system of ordinary differential equations (ODEs) based model that obtained the best prediction performance on a Melanoma cell line perturbation data set (Yuan et al. in Cell Syst 12:128-140, 2021). We derive analytic results that show a close connection between CSR and Cellbox, providing a new causal interpretation for the Cellbox model. We compare LR and CSR/Cellbox in simulations, highlighting the strengths and weaknesses of the two approaches. Finally we compare the performance of LR and CSR/Cellbox on the benchmark Melanoma data set. 
We find that the LR model has comparable or slightly better performance than Cellbox.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"4"},"PeriodicalIF":2.9,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707890/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142944048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A metric and its derived protein network for evaluation of ortholog database inconsistency.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06023-x
Weijie Yang, Jingsi Ji, Gang Fang

Background: Ortholog prediction, essential for various genomic research areas, faces growing inconsistencies amidst the expanding array of ortholog databases. The common strategy of computing consensus orthologs introduces additional arbitrariness, emphasizing the need to examine the causes of such inconsistencies and identify proteins susceptible to prediction errors.

Results: We introduce the Signal Jaccard Index (SJI), a novel metric rooted in unsupervised genome context clustering, designed to assess protein similarity. Leveraging SJI, we construct a protein network and reveal that peripheral proteins within the network are the primary contributors to inconsistencies in orthology predictions. Furthermore, we show that a protein's degree centrality in the network serves as a strong predictor of its reliability in consensus sets.

Conclusions: We present an objective, unsupervised SJI-based network encompassing all proteins, whose topological features elucidate ortholog prediction inconsistencies. Degree centrality (DC) effectively identifies error-prone orthology assignments without relying on arbitrary parameters. Notably, DC is stable, unaffected by species selection, and well-suited for ortholog benchmarking. This approach transcends the limitations of universal thresholds, offering a robust and quantitative framework to explore protein evolution and functional relationships.
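
The abstract does not define SJI itself, so the following is only a minimal sketch of the two generic ingredients it names: a Jaccard index between sets and normalized degree centrality in a network. The toy `contexts` mapping, the protein and cluster labels, and the rule "link two proteins if their plain Jaccard index is non-zero" are all assumptions made for illustration, not the authors' method.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Plain Jaccard index |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def degree_centrality(edges, nodes):
    """Normalized degree centrality: degree / (n - 1) for each node."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(nodes)
    return {k: (d / (n - 1) if n > 1 else 0.0) for k, d in degree.items()}


# Hypothetical toy data: protein -> set of genome-context clusters it falls in.
contexts = {
    "P1": {"c1", "c2", "c3"},
    "P2": {"c1", "c2"},
    "P3": {"c2", "c3"},
    "P4": {"c9"},
}

# Assumed linking rule (illustration only): connect proteins that share any cluster.
edges = [
    (u, v)
    for u, v in combinations(sorted(contexts), 2)
    if jaccard(contexts[u], contexts[v]) > 0
]
dc = degree_centrality(edges, list(contexts))
```

In this toy network `P4` shares no context with any other protein and ends up with degree centrality 0, which is the kind of peripheral node the abstract associates with inconsistent orthology predictions.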

Citations: 0
DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic Genomes.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-07 DOI: 10.1186/s12859-024-06030-y
Mohamed Elmanzalawi, Takatomo Fujisawa, Hiroshi Mori, Yasukazu Nakamura, Yasuhiro Tanizawa

Background: Accurate taxonomic classification in genome databases is essential for reliable biological research and effective data sharing. Mislabeling or inaccuracies in genome annotations can lead to incorrect scientific conclusions and hinder the reproducibility of research findings. Despite advances in genome analysis techniques, challenges persist in ensuring precise and reliable taxonomic assignments. Existing tools for genome verification often involve extensive computational resources or lengthy processing times, which can limit their accessibility and scalability for large-scale projects. There is a need for more efficient, user-friendly solutions that can handle diverse datasets and provide accurate results with minimal computational demands. This work aimed to address these challenges by introducing a novel tool that enhances taxonomic accuracy, offers a user-friendly interface, and supports large-scale analyses.

Results: We introduce DFAST_QC, a novel tool for the quality control and taxonomic classification of prokaryotic genomes, which is available as both a command-line tool and a web service. DFAST_QC can quickly identify species based on NCBI and GTDB taxonomies by combining genome-distance calculations using MASH with ANI calculations using Skani. We evaluated DFAST_QC's performance in species identification and found it to be highly consistent with existing taxonomic standards, successfully identifying species across diverse datasets. In several cases, DFAST_QC identified potential mislabeling of species names in public databases and highlighted discrepancies in current classifications, demonstrating its capability to uncover errors and enhance taxonomic accuracy. Additionally, the tool's efficient design allows it to operate smoothly on local machines with minimal computational requirements, making it a practical choice for large-scale genome projects.

Conclusions: DFAST_QC is a reliable and efficient tool for accurate taxonomic identification and genome quality control, well-suited for large-scale genomic studies. Its compatibility with limited-resource environments, combined with its user-friendly design, ensures seamless integration into existing workflows. DFAST_QC's ability to refine species assignments in public databases highlights its value as a complementary tool for maintaining and enhancing the accuracy of taxonomic data in genomic research. The web version is available at https://dfast.ddbj.nig.ac.jp/dqc/submit/ , and the source code for local use can be found at https://github.com/nigyta/dfast_qc .
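
The two-stage idea described in the Results (a fast genome-distance screen followed by an ANI check) can be sketched roughly as below. This is not DFAST_QC's code and calls neither MASH nor Skani: the function name `identify_species`, the `top_k` shortlist size, the reference names, and all numbers are hypothetical, and the 95% ANI cutoff is only the commonly used species-boundary convention, assumed here rather than taken from the abstract.

```python
def identify_species(distances, ani_values, top_k=2, ani_cutoff=95.0):
    """Two-stage lookup on precomputed values.

    distances  : {reference_name: sketch-based genome distance}, smaller is closer
    ani_values : {reference_name: ANI in percent}, needed only for shortlisted refs
    Returns the best reference passing the ANI cutoff, or None.
    """
    # Stage 1: cheap screen, keep the top_k closest references by distance.
    shortlist = sorted(distances, key=distances.get)[:top_k]
    # Stage 2: keep shortlisted references whose ANI reaches the cutoff.
    hits = [(ani_values[ref], ref) for ref in shortlist if ani_values[ref] >= ani_cutoff]
    return max(hits)[1] if hits else None


# Hypothetical toy values for three reference genomes.
toy_distances = {"ref_A": 0.01, "ref_B": 0.06, "ref_C": 0.15}
toy_ani = {"ref_A": 98.7, "ref_B": 92.1}  # ANI computed only for the shortlist

best = identify_species(toy_distances, toy_ani)
```

The point of the split is cost: the distance screen touches every reference, while the more expensive ANI comparison runs only on the shortlist.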

Citations: 0
SequenceCraft: machine learning-based resource for exploratory analysis of RNA-cleaving deoxyribozymes.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-06 DOI: 10.1186/s12859-024-06019-7
M Eremeyeva, Y Din, N Shirokii, N Serov

Background: Deoxyribozymes, or DNAzymes, are short artificial DNA sequences with many catalytic properties. In particular, DNAzymes able to cleave RNA sequences have great potential in gene therapy and in sequence-specific analytical detection of disease markers. This activity is provided by catalytic cores that perform site-specific hydrolysis of the phosphodiester bond of an RNA substrate. However, the vast majority of existing DNAzyme catalytic cores have low efficacy in in vivo experiments, whereas SELEX, which is based on in vitro screening, involves a long and expensive selection cycle with an average success rate of ~30%. Moreover, SELEX does not allow direct selection of chemically modified DNAzymes, which were previously shown to have higher activity in vivo. There is therefore a strong need for an in silico approach to exploratory analysis of RNA-cleaving DNAzyme cores that would substantially ease the discovery of novel catalytic cores with superior activity.

Results: In this work, we develop SequenceCraft, a machine learning-based open-source platform that allows experimental scientists to perform exploratory analysis of DNAzymes via quantitative estimation of the observed rate constant (k_obs), as well as statistical and clustering data analysis. This became possible through the development of a unique curated database of >350 RNA-cleaving catalytic cores, property-based sequence representations that accommodate both conventional and chemically modified nucleotides, and an optimized k_obs prediction algorithm achieving Q² > 0.9 on experimental data published to date.

Conclusions: This work represents a significant advancement in DNAzyme research, providing a tool for more efficient discovery of RNA-cleaving DNAzymes. The SequenceCraft platform offers an in silico alternative to traditional experimental approaches, potentially accelerating the development of DNAzymes.
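
The abstract reports Q² > 0.9 without saying how it was computed. A common convention, assumed here, is the cross-validated coefficient of determination, Q² = 1 − PRESS/TSS, where the predictions come from models that did not see the corresponding sample during fitting. The sketch below uses hypothetical numbers and is not SequenceCraft's evaluation code.

```python
def q_squared(y_true, y_pred_cv):
    """Q^2 = 1 - PRESS / TSS.

    y_true    : observed values (for example, log k_obs)
    y_pred_cv : out-of-fold predictions for the same samples
    """
    mean_y = sum(y_true) / len(y_true)
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_pred_cv))  # prediction error
    tss = sum((y - mean_y) ** 2 for y in y_true)                  # total variance
    return 1.0 - press / tss


# Hypothetical toy values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
q2 = q_squared(observed, predicted)
```

Unlike an in-sample R², this quantity can be negative when out-of-fold predictions are worse than simply predicting the mean.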

Citations: 0
GLiDe: a web-based genome-scale CRISPRi sgRNA design tool for prokaryotes.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-03 DOI: 10.1186/s12859-024-06012-0
Tongjun Xiang, Huibao Feng, Xin-Hui Xing, Chong Zhang

Background: CRISPRi screening has become a powerful approach for functional genomic research. However, the off-target effects resulting from the mismatch tolerance between sgRNAs and their intended targets are a primary concern in CRISPRi applications.

Results: We introduce Guide Library Designer (GLiDe), a web-based tool specifically created for the genome-scale design of sgRNA libraries tailored for CRISPRi screening in prokaryotic organisms. GLiDe incorporates a robust quality control framework, rooted in prior experimental knowledge, ensuring the accurate identification of off-target hits. It includes an extensive built-in database encompassing 1,397 common prokaryotic species as a comprehensive design resource. It can also design sgRNAs for newly discovered organisms by accepting uploaded design resources. We further demonstrated that GLiDe exhibits enhanced precision in identifying off-target binding sites for the CRISPRi system.

Conclusions: We present a web server that allows the construction of genome-scale CRISPRi sgRNA libraries for prokaryotes. It mitigates off-target effects through a robust quality control framework, leveraging prior experimental knowledge within an end-to-end, user-friendly pipeline.
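
To make "mismatch tolerance" concrete, the sketch below scans a sequence for sites that differ from a guide by at most a given number of bases (Hamming distance). It is a deliberately naive illustration: GLiDe's actual quality-control rules are not given in the abstract, and the function name, the toy sequences, the single-strand scan, and the absence of any PAM or position weighting are all assumptions.

```python
def find_near_matches(guide, genome, max_mismatches=2):
    """Return (position, mismatch_count) for every window of len(guide)
    in genome that differs from guide by at most max_mismatches bases."""
    k = len(guide)
    hits = []
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        mismatches = sum(a != b for a, b in zip(guide, window))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


# Hypothetical toy sequences (real guides are longer; only one strand is scanned).
toy_guide = "ACGTAC"
toy_genome = "TTACGTACGGACGAACTT"

sites = find_near_matches(toy_guide, toy_genome, max_mismatches=1)
intended = 2  # assumed position of the intended target in the toy genome
off_targets = [(pos, mm) for pos, mm in sites if pos != intended]
```

Here the perfect match at position 2 is the intended target, and the two one-mismatch sites are the candidates a design tool would need to weigh as potential off-target hits.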

Citations: 0