
Journal of Cheminformatics: latest articles

AutoTemplate: enhancing chemical reaction datasets for machine learning applications in organic chemistry
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-27 DOI: 10.1186/s13321-024-00869-2
Lung-Yi Chen, Yi-Pei Li

This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction curation, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions. 
AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis.

Scientific contribution: The proposed automated preprocessing tool for chemical reaction data is designed to identify errors in chemical databases. Specifically, when an error involves atom mapping or missing reactant types, it can be corrected systematically using reaction templates, ultimately improving the overall quality of the database.
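The premise that most reactions in a dataset are accurate can be illustrated with a small frequency count. The sketch below is not AutoTemplate's implementation: it assumes each reaction has already been reduced to a template label (the labels, the function name `split_by_template_support`, and the `min_support` cutoff are all hypothetical), and it only shows how well-supported templates could separate trusted entries from ones that need curation.

```python
from collections import Counter


def split_by_template_support(reactions, min_support=2):
    """Partition reaction ids by how often their template occurs.

    `reactions` is a list of (reaction_id, template_label) pairs.
    Templates seen at least `min_support` times are treated as trusted;
    reactions with rarer templates are flagged for curation.
    Both the labels and the cutoff are illustrative assumptions.
    """
    counts = Counter(template for _, template in reactions)
    trusted = [rid for rid, t in reactions if counts[t] >= min_support]
    flagged = [rid for rid, t in reactions if counts[t] < min_support]
    return trusted, flagged


# Toy data: template labels stand in for simplified SMARTS templates.
toy = [
    ("rxn1", "amide_coupling"),
    ("rxn2", "amide_coupling"),
    ("rxn3", "amide_coupling"),
    ("rxn4", "suzuki_coupling"),
    ("rxn5", "suzuki_coupling"),
    ("rxn6", "unmatched_pattern"),
]
trusted, flagged = split_by_template_support(toy)
print("trusted:", trusted)
print("flagged:", flagged)
```

In a real pipeline the flagged entries would then be re-examined against the trusted templates rather than discarded outright, which is the curation step the abstract describes.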
Citations: 0
Physicochemical modelling of the retention mechanism of temperature-responsive polymeric columns for HPLC through machine learning algorithms
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-21 DOI: 10.1186/s13321-024-00873-6
Elena Bandini, Rodrigo Castellano Ontiveros, Ardiana Kajtazi, Hamed Eghbali, Frédéric Lynen

Temperature-responsive liquid chromatography (TRLC) offers a promising alternative to reversed-phase liquid chromatography (RPLC) for environmentally friendly analytical techniques by utilizing pure water as a mobile phase, eliminating the need for harmful organic solvents. TRLC columns, packed with temperature-responsive polymers coupled to silica particles, exhibit a unique retention mechanism influenced by temperature-induced polymer hydration. An investigation of the physicochemical parameters driving separation at high and low temperatures is crucial for better column manufacturing and selectivity control. Assessment of predictability using a dataset of 139 molecules analyzed at different temperatures elucidated the molecular descriptors (MDs) relevant to retention mechanisms. Linear regression, support vector regression (SVR), and tree-based ensemble models were evaluated, with no standout performer. The precision, accuracy, and robustness of models were validated through metrics, such as r and mean absolute error (MAE), and statistical analysis. At $45\,^{\circ}\mathrm{C}$, logP predominantly influenced retention, akin to reversed-phase columns, while at $5\,^{\circ}\mathrm{C}$, complex interactions with lipophilic and negative MDs, along with specific functional groups, dictated retention. These findings provide deeper insights into TRLC mechanisms, facilitating method development and maximizing column potential.
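The two validation metrics named above measure different things, and a small worked example makes the distinction concrete. The sketch assumes that r denotes the Pearson correlation coefficient (the abstract does not say which correlation is meant), and the numbers are invented for illustration rather than taken from the study.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented retention values: every prediction is offset by the same amount.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

r_value = pearson_r(observed, predicted)
mae_value = mean_absolute_error(observed, predicted)
print(f"r = {r_value:.3f}, MAE = {mae_value:.3f}")
```

Here the predictions track the observations perfectly in trend, so r is 1, yet each one is off by 0.5, so the MAE is 0.5. Reporting both guards against a model that ranks compounds well but is systematically biased.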

Citations: 0
Llamol: a dynamic multi-conditional generative transformer for de novo molecular design
IF 7.1 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-21 DOI: 10.1186/s13321-024-00863-8
Niklas Dobberstein, Astrid Maass, Jan Hamaekers

Generative models have demonstrated substantial promise in Natural Language Processing (NLP) and have found application in designing molecules, as seen in Generative Pretrained Transformer (GPT) models. In our efforts to develop such a tool for exploring the organic chemical space in search of potentially electro-active compounds, we present Llamol, a single novel generative transformer model based on the Llama 2 architecture, which was trained on a 12.5M superset of organic compounds drawn from diverse public sources. To allow for maximum flexibility in usage and robustness in view of potentially incomplete data, we introduce Stochastic Context Learning (SCL) as a new training procedure. We demonstrate that the resulting model adeptly handles single- and multi-conditional organic molecule generation with up to four conditions, though more are possible. The model generates valid molecular structures in SMILES notation while flexibly incorporating three numerical conditions and/or one token sequence into the generative process, as requested. The generated compounds are very satisfactory in all scenarios tested. In detail, we showcase the model’s capability to utilize token sequences for conditioning, either individually or in combination with numerical properties, making Llamol a potent tool for de novo molecule design, easily expandable with new properties.

Scientific contribution: We developed Llamol, a novel generative transformer model based on the Llama 2 architecture and trained on a diverse collection of 12.5M organic compounds. The model introduces Stochastic Context Learning (SCL) as a new training procedure, which allows flexible and robust generation of valid organic molecules under multiple conditions that can be combined in various ways, making it a potent tool for de novo molecular design.
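The abstract does not spell out how Stochastic Context Learning works, so the following is only one plausible reading, stated as an assumption: during training, each sample is shown a random subset of its available conditions, which would let the model cope with missing properties and with any combination at inference time. The condition names (`prop_a`, `prop_b`, `prop_c`, `token_sequence`), the function name, and `keep_prob` are hypothetical.

```python
import random


def sample_context(conditions, rng, keep_prob=0.5):
    """Pick a random subset of the conditions available for one sample.

    `conditions` maps a condition name to its value, with None marking a
    property that is missing for this molecule. Missing properties are
    never selected; available ones are kept with probability `keep_prob`.
    This is an illustrative guess at the idea, not the published method.
    """
    return {
        name: value
        for name, value in conditions.items()
        if value is not None and rng.random() < keep_prob
    }


rng = random.Random(0)
sample = {
    "prop_a": 2.1,            # hypothetical numerical property
    "prop_b": None,           # missing for this molecule
    "prop_c": 310.4,          # hypothetical numerical property
    "token_sequence": "c1ccccc1",  # hypothetical fragment condition
}
for step in range(3):
    print(step, sorted(sample_context(sample, rng)))
```

Because the subset changes from step to step, a model trained this way sees single-condition, multi-condition, and unconditional examples of the same molecule.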
Citations: 0
A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence
IF 8.6 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-19 DOI: 10.1186/s13321-024-00848-7
Xiaofan Zheng, Yoichi Tomiura

Obtaining desired molecular properties, or combinations of them, through theory or experiment is costly. Using machine learning to analyze molecular structure features and to predict molecular properties is a potentially efficient alternative for accelerating the prediction of molecular properties. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network for extracting molecular structural features and predicting molecular properties. A SMILES sequence comprises symbols representing molecular structures. To address the problem that a SMILES sequence is different from actual molecular structural data, we propose a pretraining model for a SMILES sequence based on the BERT model, which is widely used in natural language processing, such that the model learns to extract the molecular structural information contained in the SMILES sequence. In an experiment, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets and the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction.

Scientific contribution: The 2-encoder pretraining is motivated by two characteristics of SMILES: its symbols depend less on their context than symbols in natural-language sentences do, and one compound corresponds to multiple SMILES sequences. Compared with BERT, which excels at natural language, the model with 2-encoder pretraining shows higher robustness in molecular property prediction tasks.
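To make the idea of BERT-style pretraining on SMILES concrete, the sketch below splits a SMILES string into symbols and hides a random fraction of them, which is the input a masked-token objective would train on. It is a generic illustration, not the paper's 2-encoder scheme: the regular expression is a simplified tokenizer chosen for this example, and the 15% masking rate is BERT's conventional default rather than a value reported in the abstract.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for this sketch): bracket atoms,
# two-letter halogens, two-digit ring closures, single letters, digits, and
# common bond or branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of symbol tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Replace a random subset of tokens with a mask symbol.

    Returns the masked sequence and a parallel list of labels holding the
    original token at masked positions and None elsewhere.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize_smiles(aspirin)
masked, labels = mask_tokens(tokens, random.Random(0))
print(tokens)
print(masked)
```

The model would be trained to recover the hidden symbols from the rest of the string, which forces it to learn how ring closures, branches, and bond symbols constrain one another.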
Citations: 0
Stereochemically-aware bioactivity descriptors for uncharacterized chemical compounds
IF 8.6 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-18 DOI: 10.1186/s13321-024-00867-4
Arnau Comajuncosa-Creus, Aksel Lenes, Miguel Sánchez-Palomino, Dylan Dalton, Patrick Aloy

Stereochemistry plays a fundamental role in pharmacology. Here, we systematically investigate the relationship between stereoisomerism and bioactivity on over 1 M compounds, finding that a very significant fraction (~ 40%) of spatial isomer pairs show, to some extent, distinct bioactivities. We then use the 3D representation of these molecules to train a collection of deep neural networks (Signaturizers3D) to generate bioactivity descriptors associated to small molecules, that capture their effects at increasing levels of biological complexity (i.e. from protein targets to clinical outcomes). Further, we assess the ability of the descriptors to distinguish between stereoisomers and to recapitulate their different target binding profiles. Overall, we show how these new stereochemically-aware descriptors provide an even more faithful description of complex small molecule bioactivity properties, capturing key differences in the activity of stereoisomers.

Scientific contribution

We systematically assess the relationship between stereoisomerism and bioactivity on a large scale, focusing on compound-target binding events, and use our findings to train novel deep learning models to generate stereochemically-aware bioactivity signatures for any compound of interest.

Citations: 0
PubChem synonym filtering process using crowdsourcing
IF 8.6 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-16 DOI: 10.1186/s13321-024-00868-3
Sunghwan Kim, Bo Yu, Qingliang Li, Evan E. Bolton

PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem’s crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs 70%) and how to resolve intra-depositor discrepancies (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which considers varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated at the six chemical equivalency contexts. 
Based on the results of this study, Strategy I was implemented in PubChem’s filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.
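The voting rule behind Strategy I can be sketched in a few lines. This is a simplified illustration under stated assumptions, not PubChem's production code: the abstract does not say how a depositor's single vote is chosen when that depositor links one synonym to several structures, so the sketch takes the depositor's most frequent structure. The depositor names and the identifiers `CID1` and `CID2` are hypothetical.

```python
from collections import Counter


def consensus_structure(votes, threshold=0.6):
    """Return the structure a synonym maps to, or None without consensus.

    `votes` is a list of (depositor, structure_id) pairs for one synonym.
    Each depositor casts a single ballot for the structure it reported
    most often (an assumption of this sketch; ties go to the structure
    seen first). The winning structure is accepted only if its share of
    the ballots reaches `threshold`.
    """
    if not votes:
        return None
    per_depositor = {}
    for depositor, structure in votes:
        per_depositor.setdefault(depositor, Counter())[structure] += 1
    ballots = Counter(c.most_common(1)[0][0] for c in per_depositor.values())
    structure, count = ballots.most_common(1)[0]
    share = count / len(per_depositor)
    return structure if share >= threshold else None


votes = [
    ("A", "CID1"), ("A", "CID1"), ("A", "CID2"),  # depositor A is inconsistent
    ("B", "CID1"),
    ("C", "CID1"),
    ("D", "CID2"),
    ("E", "CID2"),
]
print(consensus_structure(votes, threshold=0.6))
print(consensus_structure(votes, threshold=0.7))
```

With these toy votes three of five depositors back `CID1`, which meets a 60% threshold but not a 70% one, so the stricter setting leaves the synonym unassigned. It also shows the limitation noted above: a synonym supplied by a single depositor reaches 100% agreement whether or not it is correct.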

Citations: 0
Correction: QuanDB: a quantum chemical property database towards enhancing 3D molecular representation learning
IF 8.6 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-11 DOI: 10.1186/s13321-024-00864-7
Zhijiang Yang, Tengxin Huang, Li Pan, Jingjing Wang, Liangliang Wang, Junjie Ding, Junhua Xiao
Citations: 0
An end-to-end method for predicting compound-protein interactions based on simplified homogeneous graph convolutional network and pre-trained language model
IF 8.6 Zone 2, Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-07 DOI: 10.1186/s13321-024-00862-9
Yufang Zhang, Jiayi Li, Shenggeng Lin, Jianwei Zhao, Yi Xiong, Dong-Qing Wei

Identification of interactions between chemical compounds and proteins is crucial for various applications, including drug discovery, target identification, network pharmacology, and elucidation of protein functions. Deep neural network-based approaches are becoming increasingly popular for efficiently identifying compound-protein interactions with high-throughput capabilities, narrowing down the scope of candidates for traditional labor-intensive, time-consuming and expensive experimental techniques. In this study, we proposed an end-to-end approach termed SPVec-SGCN-CPI, which uses a simplified graph convolutional network (SGCN) model, together with low-dimensional, continuous features generated by our previously developed SPVec model and graph topology information, to predict compound-protein interactions. The SGCN technique, which separates the local neighborhood aggregation step from the layer-wise nonlinear propagation step, effectively aggregates K-order neighbor information while avoiding neighbor explosion and expediting training. The performance of the SPVec-SGCN-CPI method was assessed across three datasets and compared against four machine learning- and deep learning-based methods, as well as six state-of-the-art methods. Experimental results revealed that SPVec-SGCN-CPI outperformed all these competing methods, particularly excelling in unbalanced data scenarios. By propagating node features and topological information to the feature space, SPVec-SGCN-CPI effectively incorporates interactions between compounds and proteins, enabling the fusion of heterogeneity. Furthermore, our method scored all unlabeled data in ChEMBL, confirming the top five ranked compound-protein interactions through molecular docking and existing evidence. These findings suggest that our model can reliably uncover compound-protein interactions within unlabeled compound-protein pairs, carrying substantial implications for drug re-profiling and discovery. In summary, SPVec-SGCN demonstrates its efficacy in accurately predicting compound-protein interactions, showcasing potential to enhance target identification and streamline drug discovery processes.

Scientific contributions

The methodology presented in this work not only enables the comparatively accurate prediction of compound-protein interactions but also, for the first time, takes both sample imbalance, which is very common in the real world, and computational efficiency into consideration, accelerating the target identification and drug discovery process.
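
The separation of neighborhood aggregation from nonlinearity described in the abstract can be illustrated with a minimal sketch. It assumes the common simplified-GCN formulation, in which node features are propagated K times through a symmetrically normalized adjacency matrix with self-loops before any classifier is trained. The abstract does not give the exact formulation used in SPVec-SGCN-CPI, and the function names and toy graph below are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return S = D^{-1/2} (A + I) D^{-1/2} for a dense symmetric adjacency matrix."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def propagate_features(adj, features, k):
    """Compute S^k X: k rounds of neighbor averaging with no nonlinearity in between."""
    s = normalized_adjacency(adj)
    out = [[float(v) for v in row] for row in features]
    for _ in range(k):
        out = matmul(s, out)
    return out


# Hypothetical toy graph: a path 0 - 1 - 2 with one-hot node features.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_features = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
two_hop = propagate_features(toy_adj, toy_features, k=2)
```

Because the propagation step has no trainable weights, it can be computed once before training a linear classifier on the result, which is consistent with the faster training the abstract reports.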

Citations: 0
PUResNetV2.0: a deep learning model leveraging sparse representation for improved ligand binding site prediction
IF 8.6 Zone 2, Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-07 DOI: 10.1186/s13321-024-00865-6
Kandel Jeevan, Shrestha Palistha, Hilal Tayara, Kil T. Chong

Accurate ligand binding site prediction (LBSP) within proteins is essential for drug discovery. We developed ProteinUNetResNetV2.0 (PUResNetV2.0), leveraging sparse representation of protein structures to improve LBSP accuracy. Our training dataset included protein complexes from 4729 protein families. Evaluations on benchmark datasets showed that PUResNetV2.0 achieved an 85.4% Distance Center Atom (DCA) success rate and a 74.7% F1 Score on the Holo801 dataset, outperforming existing methods. However, its performance in specific cases, such as RNA, DNA, peptide-like ligand, and ion binding site prediction, was limited due to constraints in our training data. Our findings underscore the potential of sparse representation in LBSP, especially for oligomeric structures, suggesting PUResNetV2.0 as a promising tool for computational drug discovery.
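
As a rough illustration of how a DCA success rate like the one reported above is typically computed, the sketch below measures the distance from each predicted site center to the closest ligand atom and counts a prediction as a hit when that distance falls within a cutoff. The 4 Å default cutoff, the function names, and the coordinates are assumptions for illustration; the abstract does not state the evaluation settings used for PUResNetV2.0.

```python
import math


def dca(predicted_center, ligand_atoms):
    """Distance from a predicted site center to the closest ligand atom (same units as input)."""
    return min(math.dist(predicted_center, atom) for atom in ligand_atoms)


def dca_success_rate(predictions, threshold=4.0):
    """Fraction of (center, ligand_atoms) pairs whose DCA is within the threshold.

    The 4.0 default is an assumed cutoff in angstroms, not a value taken from the paper.
    """
    if not predictions:
        return 0.0
    hits = sum(1 for center, atoms in predictions if dca(center, atoms) <= threshold)
    return hits / len(predictions)


# Hypothetical coordinates: the first prediction is near its ligand, the second is far away.
example = [
    ((0.0, 0.0, 0.0), [(1.0, 2.0, 2.0), (10.0, 0.0, 0.0)]),
    ((0.0, 0.0, 0.0), [(6.0, 0.0, 0.0), (0.0, 8.0, 0.0)]),
]
rate = dca_success_rate(example)
```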

Citations: 0
Protein target similarity is positive predictor of in vitro antipathogenic activity: a drug repurposing strategy for Plasmodium falciparum
IF 8.6 Zone 2, Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-05-30 DOI: 10.1186/s13321-024-00856-7
Reagan M. Mogire, Silviane A. Miruka, Dennis W. Juma, Case W. McNamara, Ben Andagalu, Jeremy N. Burrows, Elodie Chenu, James Duffy, Bernhards R. Ogutu, Hoseah M. Akala

Drug discovery is an intricate and costly process. Repurposing existing drugs and active compounds offers a viable pathway to develop new therapies for various diseases. By leveraging publicly available biomedical information, it is possible to predict compounds’ activity and identify their potential targets across diverse organisms. In this study, we aimed to assess the antiplasmodial activity of compounds from the Repurposing, Focused Rescue, and Accelerated Medchem (ReFRAME) library using in vitro and bioinformatics approaches. We assessed the in vitro antiplasmodial activity of the compounds using blood-stage and liver-stage drug susceptibility assays. We used protein sequences of known targets of the ReFRAME compounds with high antiplasmodial activity (EC50 < 10 µM) to conduct a protein-pairwise search to identify similar Plasmodium falciparum 3D7 proteins (from PlasmoDB) using NCBI protein BLAST. We further assessed the association between the compounds' in vitro antiplasmodial activity and level of similarity between their known and predicted P. falciparum target proteins using simple linear regression analyses. BLAST analyses revealed 735 P. falciparum proteins that were similar to the 226 known protein targets associated with the ReFRAME compounds. Antiplasmodial activity of the compounds was positively associated with the degree of similarity between the compounds’ known targets and predicted P. falciparum protein targets (percentage identity, E value, and bit score), the number of the predicted P. falciparum targets, and their respective mutagenesis index and fitness scores (R² between 0.066 and 0.92, P < 0.05). Compounds predicted to target essential P. falciparum proteins or those with a druggability index of 1 showed the highest antiplasmodial activity.

This is the first study to demonstrate a correlation between compounds' in vitro antipathogenic activity and the similarity of their targets across species. Our findings suggest that leveraging protein-target similarity, by predicting compounds' activity and their potential targets in different organisms, may accelerate the drug repurposing process for many diseases.
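
The simple linear regression mentioned in the abstract can be sketched as an ordinary least-squares fit of an activity measure against a similarity measure such as BLAST percentage identity, with R² reported as the squared correlation. This is a generic illustration rather than the authors' analysis script: the function name is hypothetical and the numbers in the usage example are made up.

```python
def simple_linear_regression(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, returning (slope, intercept, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Made-up illustrative values: BLAST percentage identity vs. an activity score.
percent_identity = [25.0, 40.0, 55.0, 70.0, 85.0]
activity_score = [5.1, 5.4, 6.0, 6.2, 6.9]
slope, intercept, r_squared = simple_linear_regression(percent_identity, activity_score)
```

A positive slope in such a fit corresponds to the kind of positive association the abstract describes; a significance test for the slope would be needed to reproduce the reported P values.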
Citations: 0