
Journal of Cheminformatics: Latest Articles

Hamiltonian diversity: effectively measuring molecular diversity by shortest Hamiltonian circuits
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-07 | DOI: 10.1186/s13321-024-00883-4
Xiuyuan Hu, Guoqing Liu, Quanming Yao, Yang Zhao, Hao Zhang

In recent years, significant advancements have been made in molecular generation algorithms aimed at facilitating drug development, and molecular diversity holds paramount importance within the realm of molecular generation. Nonetheless, the effective quantification of molecular diversity remains an elusive challenge, as extant metrics exemplified by Richness and Internal Diversity fall short in concurrently encapsulating the two main aspects of such diversity: quantity and dissimilarity. To address this quandary, we propose Hamiltonian diversity, a novel molecular diversity metric predicated upon the shortest Hamiltonian circuit. This metric embodies both aspects of molecular diversity in principle, and we implement its calculation with high efficiency and accuracy. Furthermore, through empirical experiments we demonstrate the high consistency of Hamiltonian diversity with real-world chemical diversity, and substantiate its effects in promoting diversity of molecular generation algorithms. Our implementation of Hamiltonian diversity in Python is available at: https://github.com/HXYfighter/HamDiv.

Scientific contribution

We propose a more rational molecular diversity metric for the community of cheminformatics and drug development. This metric can be applied to evaluation of existing molecular generation methods and enhancing drug design algorithms.
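
As a rough illustration of the idea, the sketch below computes the length of the shortest Hamiltonian circuit over a small set of molecules by brute force. It is not the authors' implementation (that is in the HamDiv repository linked above): the feature sets, the Tanimoto distance on Python sets, the function names, and the convention that a two-molecule circuit goes there and back are all assumptions made here for illustration, and exhaustive search is only practical for a handful of molecules.

```python
from itertools import combinations, permutations


def tanimoto_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two feature sets (hypothetical fingerprints)."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def shortest_circuit_length(items, dist=tanimoto_distance):
    """Length of the shortest closed tour visiting every item once (exact, O(n!))."""
    n = len(items)
    if n < 2:
        return 0.0
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    best = float("inf")
    # Fix item 0 as the start so rotations of the same tour are not re-counted.
    for perm in permutations(range(1, n)):
        order = (0,) + perm
        length = sum(d[order[k]][order[(k + 1) % n]] for k in range(n))
        best = min(best, length)
    return best


def internal_diversity(items, dist=tanimoto_distance):
    """Mean pairwise distance, included only for comparison."""
    pairs = list(combinations(items, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    distinct = [{1, 2}, {3, 4}]
    with_duplicate = [{1, 2}, {1, 2}, {3, 4}]
    for name, mols in [("distinct", distinct), ("with duplicate", with_duplicate)]:
        print(name, shortest_circuit_length(mols), internal_diversity(mols))
```

In this toy case, adding an exact duplicate leaves the circuit length unchanged while lowering the mean pairwise distance, which hints at why a circuit-based measure can behave differently from Internal Diversity.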

Journal of Cheminformatics, vol. 16, no. 1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11308660/pdf/
Citations: 0
Advancements in biotransformation pathway prediction: enhancements, datasets, and novel functionalities in enviPath
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-06 | DOI: 10.1186/s13321-024-00881-6
Jasmin Hafner, Tim Lorsbach, Sebastian Schmidt, Liam Brydon, Katharina Dost, Kunyang Zhang, Kathrin Fenner, Jörg Wicker

enviPath is a widely used database and prediction system for microbial biotransformation pathways of primarily xenobiotic compounds. Data and prediction system are freely available both via a web interface and a public REST API. Since its initial release in 2016, we extended the data available in enviPath and improved the performance of the prediction system and usability of the overall system. We now provide three diverse data sets, covering microbial biotransformation in different environments and under different experimental conditions. This also enabled developing a pathway prediction model that is applicable to a more diverse set of chemicals. In the prediction engine, we implemented a new evaluation tailored towards pathway prediction, which returns a more honest and holistic view on the performance. We also implemented a novel applicability domain algorithm, which allows the user to estimate how well the model will perform on their data. Finally, we improved the implementation to speed up the overall system and provide new functionality via a plugin system.
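
To make the notion of an applicability domain concrete, here is a minimal sketch of a generic similarity-based check. It is explicitly not enviPath's novel algorithm, whose details are not given in this abstract: the feature sets, the `k` and `threshold` parameters, and the function names are hypothetical choices for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


def applicability_score(query, training_set, k=3):
    """Mean similarity of the query to its k most similar training compounds."""
    sims = sorted((tanimoto(query, t) for t in training_set), reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


def in_domain(query, training_set, k=3, threshold=0.5):
    """Treat the query as inside the domain when its score reaches the threshold."""
    return applicability_score(query, training_set, k) >= threshold


if __name__ == "__main__":
    training = [{1, 2, 3}, {1, 2, 4}, {2, 3, 4}]  # hypothetical fingerprints
    print(in_domain({1, 2, 3}, training))  # close to the training data
    print(in_domain({7, 8, 9}, training))  # shares no features with it
```

A compound that resembles the training data gets a high score, so predictions for it can be trusted more than predictions for a compound that shares no features with anything the model has seen.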

Scientific contribution

The main scientific contributions are a pathway prediction model applicable to a diverse set of chemicals, a specialized evaluation method for holistic performance assessment, and a novel applicability domain algorithm for user-specific performance estimation. The introduction of two new datasets and the creation of links to EC classes make enviPath a unique resource in microbial biotransformation research.
Journal of Cheminformatics, vol. 16, no. 1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11304562/pdf/
Citations: 0
A novel multitask learning algorithm for tasks with distinct chemical space: zebrafish toxicity prediction as an example
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-02 | DOI: 10.1186/s13321-024-00891-4
Run-Hsin Lin, Pinpin Lin, Chia-Chi Wang, Chun-Wei Tung

Data scarcity is one of the most critical issues impeding the development of prediction models for chemical effects. Multitask learning algorithms that leverage knowledge from relevant tasks have shown potential for dealing with tasks with limited data. However, current multitask methods mainly focus on learning from datasets whose task labels are available for most of the training samples. Because datasets are often generated for different purposes and cover distinct chemical spaces, conventional multitask learning methods may not be suitable. This study presents a novel multitask learning method, MTForestNet, that can deal with data scarcity and learn from tasks with distinct chemical spaces. MTForestNet consists of random forest classifier nodes organized as a progressive network, where each node represents a random forest model learned from a specific task. To demonstrate the effectiveness of MTForestNet, 48 zebrafish toxicity datasets were collected and used as an example. Among them, two tasks differ markedly from the rest, sharing only 1.3% of their chemicals with other tasks. In an independent test, MTForestNet achieved an area under the receiver operating characteristic curve (AUC) of 0.911, outperforming the single-task and multitask methods it was compared against. The overall toxicity derived from the developed zebrafish toxicity models correlates well with the experimentally determined overall toxicity. In addition, the outputs of these models can be used as features to improve the prediction of developmental toxicity. The developed models are effective for predicting zebrafish toxicity, and MTForestNet is expected to be useful for other tasks with distinct chemical spaces.

Scientific contribution

A novel multitask learning algorithm, MTForestNet, is proposed to address the challenge of developing models from datasets with distinct chemical spaces, a common issue in cheminformatics tasks. As an example, zebrafish toxicity prediction models were developed using MTForestNet, which provides superior performance over conventional single-task and multitask learning methods. In addition, the developed zebrafish toxicity prediction models can reduce animal testing.
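
The abstract describes random forest nodes arranged as a progressive network but gives no further architectural detail. The sketch below shows one plausible reading, a two-layer stacking scheme in which second-layer forests see the raw features plus the first-layer outputs of every task. The layer count, the use of positive-class probabilities as the passed-on features, the toy data, and all function names are assumptions made here; this is not the published MTForestNet code. It uses NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_layer(tasks, featurize):
    """Train one random forest per task on featurize(X)."""
    return [
        RandomForestClassifier(n_estimators=50, random_state=0).fit(featurize(X), y)
        for X, y in tasks
    ]


def layer_outputs(models, X_in):
    """One column per task: that task model's positive-class probability."""
    return np.column_stack([m.predict_proba(X_in)[:, 1] for m in models])


def fit_two_layer_network(tasks):
    """Layer 0 uses raw features; layer 1 adds layer-0 outputs from all tasks."""
    layer0 = fit_layer(tasks, lambda X: X)

    def augment(X):
        return np.hstack([X, layer_outputs(layer0, X)])

    layer1 = fit_layer(tasks, augment)
    return layer0, layer1, augment


def predict(network, task_index, X):
    """Positive-class probability for one task from its layer-1 forest."""
    _, layer1, augment = network
    return layer1[task_index].predict_proba(augment(X))[:, 1]


def make_toy_tasks(n_tasks=3, n_samples=40, n_features=8, seed=0):
    """Hypothetical data: each task has its own chemicals (rows) and a balanced label."""
    rng = np.random.default_rng(seed)
    tasks = []
    for t in range(n_tasks):
        X = rng.random((n_samples, n_features))
        y = (X[:, t] > np.median(X[:, t])).astype(int)
        tasks.append((X, y))
    return tasks
```

Because each first-layer forest can score any chemical, tasks do not need to share training compounds: knowledge moves between tasks through the predicted probabilities rather than through overlapping labels.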

Journal of Cheminformatics, vol. 16, no. 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00891-4
Citations: 0
PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-02 | DOI: 10.1186/s13321-024-00884-3
Yang Tan, Mingchen Li, Ziyi Zhou, Pan Tan, Huiqun Yu, Guisheng Fan, Liang Hong

Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining.
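
The byte-pair-encoding step can be illustrated on protein sequences with a few lines of plain Python. This is a textbook BPE sketch under simplifying assumptions (a toy three-sequence corpus, lexicographic tie-breaking, no special tokens), not the training code from the ProteinPretraining repository; the function names are chosen here for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Sorting first makes ties resolve to the lexicographically smallest pair.
        best = max(sorted(pairs), key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Apply the learned merges, in order, to a new sequence."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_corpus = ["MKVLAAGK", "MKVLSAGK", "MKTLAAGK"]  # hypothetical sequences
    merges = learn_bpe(toy_corpus, num_merges=3)
    print(merges)
    print(tokenize("MKVLAAGK", merges))
```

Each merge adds one token to the vocabulary, so starting from the 20 amino acids, a vocabulary of 50 elements corresponds to roughly 30 merges if no special tokens are counted.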

Scientific contribution

This study introduces an advanced analysis of protein sequence tokenization using the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, the proposed approach improves the performance of PLMs on downstream tasks. In addition, we present PETA, a new comprehensive benchmark for systematically evaluating PLMs, and show that vocabularies of 50 and 200 elements offer optimal performance.
Journal of Cheminformatics, vol. 16, no. 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00884-3
Citations: 0
Implementation of a soft grading system for chemistry in a Moodle plugin: reaction handling
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-08-01 | DOI: 10.1186/s13321-024-00889-y
Louis Plyer, Gilles Marcou, Céline Perves, Fanny Bonachera, Alexander Varnek

Here, we present a new method for evaluating questions on chemical reactions in the context of remote education. This method can be used when binary grading is not sufficient as some tolerance may be acceptable. In order to determine a grade, the developed workflow uses the pairwise similarity assessment of two considered reactions, each encoded by a single molecular graph with the help of the Condensed Graph of Reaction (CGR) approach. This workflow is part of the ChemMoodle project and is implemented as a Moodle Plugin. It uses the Chemdoodle engine for reaction drawing and visualization and communicates with a REST server calculating the similarity score using ISIDA fragment descriptors. The plugin is open-source, accessible in GitHub (https://github.com/Laboratoire-de-Chemoinformatique/moodle-qtype_reacsimilarity) and on the Moodle plugin store (https://moodle.org/plugins/qtype_reacsimilarity?lang=en). Both similarity measures and fragmentation can be configured.

Scientific contribution

This work introduces an open-source method for evaluating chemical reaction questions within Moodle using the CGR approach. Our contribution provides a nuanced grading mechanism that accommodates acceptable tolerances in reaction assessments, enhancing the accuracy and flexibility of the grading process.
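
The grading idea can be sketched as a similarity score turned into points. The example below assumes each reaction has already been encoded as a dictionary of fragment counts; the fragment labels, the count-based Tanimoto coefficient, the threshold, and the linear point scale are hypothetical choices for illustration and are not taken from the plugin, which computes ISIDA descriptors from the CGR on a REST server.

```python
from collections import Counter


def tanimoto_counts(a, b):
    """Tanimoto similarity for fragment-count vectors: sum of mins over sum of maxes."""
    a, b = Counter(a), Counter(b)
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return 1.0 if total == 0 else shared / total


def soft_grade(similarity, max_points=10.0, threshold=0.6):
    """No credit below the threshold; otherwise points scale with similarity."""
    if similarity < threshold:
        return 0.0
    return round(max_points * similarity, 2)


if __name__ == "__main__":
    expected = {"C-C": 3, "C=O": 1, "C-O": 1}  # hypothetical fragment counts
    answer = {"C-C": 3, "C=O": 1, "C-N": 1}
    sim = tanimoto_counts(expected, answer)
    print(sim, soft_grade(sim))
```

A student answer that differs from the expected reaction in one fragment still earns partial credit here, whereas binary grading would mark it simply wrong.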

Journal of Cheminformatics, vol. 16, no. 1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11295431/pdf/
Citations: 0
Transfer learning across different chemical domains: virtual screening of organic materials with deep learning models pretrained on small molecule and chemical reaction data
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-30 | DOI: 10.1186/s13321-024-00886-1
Chengwei Zhang, Yushuang Zhai, Ziyang Gong, Hongliang Duan, Yuan-Bin She, Yun-Fang Yang, An Su

Machine learning is becoming a preferred method for the virtual screening of organic materials due to its cost-effectiveness over traditional computationally demanding techniques. However, the scarcity of labeled data for organic materials poses a significant challenge for training advanced machine learning models. This study showcases the potential of utilizing databases of drug-like small molecules and chemical reactions to pretrain the BERT model, enhancing its performance in the virtual screening of organic materials. By fine-tuning the BERT models with data from five virtual screening tasks, the version pretrained with the USPTO–SMILES dataset achieved R² scores exceeding 0.94 for three tasks and over 0.81 for two others. This performance surpasses that of models pretrained on the small molecule or organic materials databases and outperforms three traditional machine learning models trained directly on virtual screening data. The success of the USPTO–SMILES pretrained BERT model can be attributed to the diverse array of organic building blocks in the USPTO database, offering a broader exploration of the chemical space. The study further suggests that accessing a reaction database with a wider range of reactions than the USPTO could further enhance model performance. Overall, this research validates the feasibility of applying transfer learning across different chemical domains for the efficient virtual screening of organic materials.
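
For readers less familiar with the reported metric, the coefficient of determination can be computed as below. This is the standard definition, written as a small dependency-free helper; the function name and the example values are chosen here for illustration and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    measured = [1.0, 2.0, 3.0]   # hypothetical property values
    predicted = [1.1, 1.9, 3.2]  # hypothetical model outputs
    print(r2_score(measured, predicted))
```

A score of 1 means the predictions reproduce the measured values exactly, and a score of 0 means the model does no better than always predicting the mean.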

Scientific contribution

This study verifies the feasibility of applying transfer learning with large language models across different chemical domains to support the virtual screening of organic materials. Comparing transfer learning from different chemical domains across a variety of organic material molecules shows that high-precision virtual screening of organic materials can be achieved.

Journal of Cheminformatics, vol. 16, no. 1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11290278/pdf/
Citations: 0
Hilbert-curve assisted structure embedding method
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-29 | DOI: 10.1186/s13321-024-00850-z
Gergely Zahoránszky-Kőhalmi, Kanny K. Wan, Alexander G. Godfrey

Motivation

Chemical space embedding methods are widely utilized in various research settings for dimensional reduction, clustering and effective visualization. The maps generated by the embedding process can provide valuable insight to medicinal chemists in terms of the relationships between structural, physicochemical and biological properties of compounds. However, these maps are known to be difficult to interpret, and the “landscape” on the map is prone to “rearrangement” when embedding different sets of compounds.

Results

In this study we present the Hilbert-Curve Assisted Space Embedding (HCASE) method which was designed to create maps by organizing structures according to a logic familiar to medicinal chemists. First, a chemical space is created with the help of a set of “reference scaffolds”. These scaffolds are sorted according to the medicinal chemistry inspired Scaffold-Key algorithm found in prior art. Next, the ordered scaffolds are mapped to a line which is folded into a higher dimensional (here: 2D) space. The intricately folded line is referred to as a pseudo-Hilbert-Curve. The embedding of a compound happens by locating its most similar reference scaffold in the pseudo-Hilbert-Curve and assuming the respective position. Through a series of experiments, we demonstrate the properties of the maps generated by the HCASE method. Subjects of embeddings were compounds of the DrugBank and CANVASS libraries, and the chemical spaces were defined by scaffolds extracted from the ChEMBL database.
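
The folding step can be sketched with the standard index-to-coordinate conversion for a Hilbert curve. The code below is an illustration under assumptions, not the HCASE implementation from the ncats/hcase repository: the scaffolds are taken to be already sorted, ranks are spread evenly along the curve, similarity is a Tanimoto coefficient on hypothetical feature sets, and all function names are invented here.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2**order square grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before swapping axes.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def scaffold_positions(n_scaffolds, order):
    """Spread sorted scaffold ranks evenly along the curve; return their cells."""
    cells = 4 ** order
    if n_scaffolds > cells:
        raise ValueError("curve is too short for this many scaffolds")
    return [hilbert_d2xy(order, rank * cells // n_scaffolds)
            for rank in range(n_scaffolds)]


def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return 1.0 if union == 0 else len(a & b) / union


def embed(compound, scaffolds, positions, similarity=tanimoto):
    """Place a compound at the cell of its most similar reference scaffold."""
    best = max(range(len(scaffolds)), key=lambda i: similarity(compound, scaffolds[i]))
    return positions[best]


if __name__ == "__main__":
    scaffolds = [{1}, {2}, {3}, {4}]  # hypothetical, assumed already sorted
    positions = scaffold_positions(len(scaffolds), order=1)
    print(positions)
    print(embed({3, 5}, scaffolds, positions))
```

Because neighbouring indices on the curve always land in neighbouring grid cells, scaffolds that are close in the sorted order stay close on the map, and the reference positions do not move when a different compound set is embedded.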

Scientific contribution

The novelty of the HCASE method lies in generating robust and intuitive chemical space embeddings that reflect a medicinal chemist’s reasoning, and in the precedential use of a space-filling (Hilbert) curve in the process.

Availability

https://github.com/ncats/hcase

{"title":"Hilbert-curve assisted structure embedding method","authors":"Gergely Zahoránszky-Kőhalmi,&nbsp;Kanny K. Wan,&nbsp;Alexander G. Godfrey","doi":"10.1186/s13321-024-00850-z","DOIUrl":"10.1186/s13321-024-00850-z","url":null,"abstract":"<div><h3>Motivation</h3><p>Chemical space embedding methods are widely utilized in various research settings for dimensional reduction, clustering and effective visualization. The maps generated by the embedding process can provide valuable insight to medicinal chemists in terms of the relationships between structural, physicochemical and biological properties of compounds. However, these maps are known to be difficult to interpret, and the ‘‘landscape’’ on the map is prone to ‘‘rearrangement’’ when embedding different sets of compounds.</p><h3>Results</h3><p>In this study we present the Hilbert-Curve Assisted Space Embedding (HCASE) method which was designed to create maps by organizing structures according to a logic familiar to medicinal chemists. First, a chemical space is created with the help of a set of ‘‘reference scaffolds’’. These scaffolds are sorted according to the medicinal chemistry inspired Scaffold-Key algorithm found in prior art. Next, the ordered scaffolds are mapped to a line which is folded into a higher dimensional (here: 2D) space. The intricately folded line is referred to as a pseudo-Hilbert-Curve. The embedding of a compound happens by locating its most similar reference scaffold in the pseudo-Hilbert-Curve and assuming the respective position. Through a series of experiments, we demonstrate the properties of the maps generated by the HCASE method. 
Subjects of embeddings were compounds of the DrugBank and CANVASS libraries, and the chemical spaces were defined by scaffolds extracted from the ChEMBL database.</p><h3>Scientific contribution</h3><p>The novelty of HCASE method lies in generating robust and intuitive chemical space embeddings that are reflective of a medicinal chemist’s reasoning, and the precedential use of space filling (Hilbert) curve in the process.</p><h3>Availability</h3><p>https://github.com/ncats/hcase</p><h3>Graphical Abstract</h3><div><figure><div><div><picture><source><img></source></picture></div></div></figure></div></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00850-z","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141791021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Reproducible MS/MS library cleaning pipeline in matchms
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-07-29 DOI: 10.1186/s13321-024-00878-1
Niek F. de Jonge, Helge Hecht, Michael Strobel, Mingxun Wang, Justin J. J. van der Hooft, Florian Huber

Mass spectral libraries have proven to be essential for mass spectrum annotation, both for library matching and for training new machine learning algorithms. A key requirement for training machine learning models is the availability of high-quality training data. Public libraries of mass spectrometry data that are open to user submission often suffer from limited metadata curation and harmonization. The resulting variability in data quality makes training machine learning models challenging. Here we present a pipeline for cleaning tandem mass spectrometry library data. The pipeline is designed with ease of use, flexibility, and reproducibility as leading principles.
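As a rough illustration of what a reproducible, logged cleaning pipeline can look like, the sketch below applies a fixed, ordered list of filters to each record and logs every change or removal. It is a generic pattern written with the Python standard library only and does not use the matchms API; the record layout, the two filters and the `run_pipeline` helper are all hypothetical.

```python
import logging
from typing import Callable, Optional

Record = dict  # hypothetical stand-in for one library spectrum's metadata
Filter = Callable[[Record], Optional[Record]]


def harmonize_keys(rec: Record) -> Record:
    """Lower-case metadata keys and strip whitespace from string values."""
    return {
        k.strip().lower(): v.strip() if isinstance(v, str) else v
        for k, v in rec.items()
    }


def require_precursor_mz(rec: Record) -> Optional[Record]:
    """Drop records without a usable precursor m/z; store it as a float."""
    try:
        mz = float(rec["precursor_mz"])
    except (KeyError, TypeError, ValueError):
        return None
    return {**rec, "precursor_mz": mz} if mz > 0 else None


def run_pipeline(records: list, filters: list, log: logging.Logger) -> list:
    """Apply filters in a fixed order; log each change and each removal."""
    cleaned = []
    for idx, rec in enumerate(records):
        for f in filters:
            out = f(rec)
            if out is None:
                log.info("record %d removed by %s", idx, f.__name__)
                break
            if out != rec:
                log.info("record %d changed by %s", idx, f.__name__)
            rec = out
        else:
            # Reached only if no filter removed the record.
            cleaned.append(rec)
    return cleaned
```

Keeping the filter list explicit and ordered means the same input always yields the same cleaned library, and the log records which step touched which record.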

Scientific contribution

This pipeline will result in cleaner public mass spectral libraries that will improve library searching and the quality of machine-learning training datasets in mass spectrometry. It builds on previous work by adding new functionality for curating and correcting annotated libraries through validation of structure annotations. Due to the high quality of our software, its reproducibility, and its improved logging, we think our new pipeline has the potential to become the standard in the field for cleaning tandem mass spectrometry libraries.


Citations: 0
A computational workflow for analysis of missense mutations in precision oncology
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-07-29 DOI: 10.1186/s13321-024-00876-3
Rayyan Tariq Khan, Petra Pokorna, Jan Stourac, Simeon Borko, Ihor Arefiev, Joan Planas-Iglesias, Adam Dobias, Gaspar Pinto, Veronika Szotkowska, Jaroslav Sterba, Ondrej Slaby, Jiri Damborsky, Stanislav Mazurenko, David Bednar

Every year, more than 19 million cancer cases are diagnosed, and this number continues to increase annually. Since standard treatment options have varying success rates for different types of cancer, understanding the biology of an individual's tumour becomes crucial, especially for cases that are difficult to treat. Personalised high-throughput profiling, using next-generation sequencing, allows for a comprehensive examination of biopsy specimens. Furthermore, the widespread use of this technology has generated a wealth of information on cancer-specific gene alterations. However, there exists a significant gap between identified alterations and their proven impact on protein function. Here, we present a bioinformatics pipeline that enables fast analysis of a missense mutation’s effect on stability and function in known oncogenic proteins. This pipeline is coupled with a predictor that summarises the outputs of different tools used throughout the pipeline, providing a single probability score, achieving a balanced accuracy above 86%. The pipeline incorporates a virtual screening method to suggest potential FDA/EMA-approved drugs to be considered for treatment. We showcase three case studies to demonstrate the timely utility of this pipeline. To facilitate access and analysis of cancer-related mutations, we have packaged the pipeline as a web server, which is freely available at https://loschmidt.chemi.muni.cz/predictonco/.
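For readers unfamiliar with the metric quoted above, balanced accuracy is the mean of sensitivity and specificity, which makes it less flattering than plain accuracy when one class dominates. The short sketch below shows the standard definition; the function names and the confusion-matrix counts used in the note that follows are illustrative and are not taken from the paper.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (true positive rate) and specificity
    (true negative rate) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Plain accuracy, for comparison on imbalanced data."""
    return (tp + tn) / (tp + fn + tn + fp)
```

With made-up counts of 90 true positives, 10 false negatives, 5 true negatives and 5 false positives, plain accuracy is about 0.86 while balanced accuracy is 0.70, because only half of the negative class is classified correctly.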

Scientific contribution

This work presents a novel bioinformatics pipeline that integrates multiple computational tools to predict the effects of missense mutations on proteins of oncological interest. The pipeline uniquely combines fast protein modelling, stability prediction, and evolutionary analysis with virtual drug screening, while offering actionable insights for precision oncology. This comprehensive approach surpasses existing tools by automating the interpretation of mutations and suggesting potential treatments, thereby striving to bridge the gap between sequencing data and clinical application.

Citations: 0
CACTI: an in silico chemical analysis tool through the integration of chemogenomic data and clustering analysis
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-07-24 DOI: 10.1186/s13321-024-00885-2
Karla P. Godinez-Macias, Elizabeth A. Winzeler

It is well accepted that knowledge of a small molecule’s target can accelerate optimization. Although chemogenomic databases are helpful resources for predicting or finding compound interaction partners, they tend to be limited and poorly annotated. Furthermore, unlike genes, compound identifiers are often not standardized, and many synonyms may exist, especially in the biological literature, making batch analysis of compounds difficult. Here, we constructed an open-source annotation and target hypothesis prediction tool that explores some of the largest chemical and biological databases, mining these for common names, synonyms, and structurally similar molecules. We used this Chemical Analysis and Clustering for Target Identification (CACTI) tool to analyze the Pathogen Box collection, an open-source set of 400 drug-like compounds active against a variety of microbial pathogens. Our analysis resulted in 4,315 new synonyms, 35,963 pieces of new information and target prediction hints for 58 members.

Scientific contributions

Using this tool, a comprehensive report with known evidence, close analogs and drug-target predictions can be obtained for large-scale chemical libraries, which will facilitate their evaluation and future target validation and optimization efforts.

Citations: 0