Increasing antimicrobial resistance (AMR) represents a global healthcare threat. To decrease the spread of AMR and the associated mortality, methods for rapid selection of optimal antibiotic treatment are urgently needed. Machine learning (ML) models that predict resistant phenotypes from genomic data can serve as a fast screening tool prior to phenotypic testing. Nonetheless, many existing ML methods lack interpretability. Therefore, we present a methodology for visualization of sequence space and AMR prediction based on a non-linear dimensionality reduction method, generative topographic mapping (GTM). This approach, applied to AMR data of >5000 S. aureus isolates retrieved from the PATRIC database, yielded GTM models with reasonable accuracy for all drugs (balanced accuracy values ≥0.75). The generative topographic maps (GTMs) represent data in the form of illustrative maps of the genomic space and allow for antibiotic-wise comparison of resistant phenotypes. The maps were also found to be useful for the analysis of genetic determinants responsible for drug resistance. Overall, the GTM-based methodology is a useful tool for both the illustrative exploration of the genomic sequence space and AMR prediction.
Predicting S. aureus antimicrobial resistance with interpretable genomic space maps. Karina Pikalyova, Alexey Orlov, Dragos Horvath, Gilles Marcou, Alexandre Varnek. Molecular Informatics, 2024-05-01. DOI: 10.1002/minf.202300263
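The balanced accuracy quoted in the abstract above (≥0.75) is the mean of sensitivity and specificity, which keeps the score meaningful when resistant and susceptible isolates are unevenly represented. A minimal sketch of that standard definition follows; the function name and the toy labels are hypothetical illustrations, not code or data from the study.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of resistant isolates recovered
    specificity = tn / (tn + fp)  # fraction of susceptible isolates recovered
    return (sensitivity + specificity) / 2


# Hypothetical labels: 1 = resistant, 0 = susceptible (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.792
```

In this toy case sensitivity is 3/4 and specificity is 5/6, so the balanced accuracy is about 0.79 even though plain accuracy would be 0.80.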
Mirjana Antonijevic, Jana Sopkova‐de Oliveira Santos, Patrick Dallemagne, Christophe Rochais
The important role of the neurotrophin tyrosine kinase receptor TrkB in the pathogenesis of several neurodegenerative conditions, such as Alzheimer's disease, Parkinson's disease, and Huntington's disease, has been well described. This is not surprising, since under physiological conditions, once activated by brain‐derived neurotrophic factor (BDNF) and neurotrophin‐4/5 (NT‐4/5), the TrkB receptor promotes neuronal survival, differentiation and synaptic function. Because the natural ligands of the TrkB receptor are large proteins, discovering small molecules capable of mimicking their effects is a challenge. Although the receptor surface that interacts with BDNF or NT‐4/5 is known, it has remained unclear which pocket and which interactions are responsible for its activation. To answer this question, we used molecular dynamics (MD) simulations and the Pocketron algorithm, which enabled us to detect, for the first time, a pocket network in the interacting domain (d5) of the receptor, to describe its pockets, and to see how they communicate with each other. This discovery reveals potential new areas on the receptor that can be targeted in structure‐based drug design of new ligands.
Discovery of a pocket network on the domain 5 of the TrkB receptor – A potential new target in the quest for the new ligands. Molecular Informatics, 2024-04-15. DOI: 10.1002/minf.202400043
Peptides are potentially useful drug modalities; however, cell membrane permeability is an obstacle in peptide drug discovery. The identification of bioactive peptides for a therapeutic target is also challenging because of the huge number of possible amino acid sequences. In this study, we propose a novel computational method, PEptide generation system using Neural network Trained on Amino acid sequence data and Gaussian process-based optimizatiON (PENTAGON), to automatically generate new peptides with desired bioactivity and cell membrane permeability. In the algorithm, we mapped peptide amino acid sequences onto a latent space constructed using a variational autoencoder and searched for peptides with desired bioactivity and cell membrane permeability using Bayesian optimization. We used our proposed method to generate peptides with cell membrane permeability and bioactivity for each of nine therapeutic targets, such as the estrogen receptor (ER). Our proposed method outperformed a previously developed peptide generator in terms of similarity to known active peptide sequences and the length of generated peptide sequences.
Automatic generation of functional peptides with desired bioactivity and membrane permeability using Bayesian optimization. Itsuki Fukunaga, Yuki Matsukiyo, Kazuma Kaitoh, Yoshihiro Yamanishi. Molecular Informatics, 2024-04-01. DOI: 10.1002/minf.202300148
Pub Date: 2024-04-01; Epub Date: 2024-02-06; DOI: 10.1002/minf.202300183
Gian Marco Ghiandoni, Stuart R Flanagan, Michael J Bodkin, Maria Giulia Nizi, Albert Galera-Prat, Annalaura Brai, Beining Chen, James E A Wallace, Dimitar Hristozov, James Webster, Giuseppe Manfroni, Lari Lehtiö, Oriana Tabarrini, Valerie J Gillet
De novo design has been a hotly pursued topic for many years. Most recent developments have involved the use of deep learning methods for generative molecular design. Despite increasing levels of algorithmic sophistication, the design of molecules that are synthetically accessible remains a major challenge. Reaction-based de novo design takes a conceptually simpler approach and aims to address synthesisability directly by mimicking synthetic chemistry and driving structural transformations by known reactions that are applied in a stepwise manner. However, the use of a small number of hand-coded transformations restricts the chemical space that can be accessed and there are few examples in the literature where molecules and their synthetic routes have been designed and executed successfully. Here we describe the application of reaction-based de novo design to the design of synthetically accessible and biologically active compounds as proof-of-concept of our reaction vector-based software. Reaction vectors are derived automatically from known reactions and allow access to a wide region of synthetically accessible chemical space. The design was aimed at producing molecules that are active against PARP1 and which have improved brain penetration properties compared to existing PARP1 inhibitors. We synthesised a selection of the designed molecules according to the provided synthetic routes and tested them experimentally. The results demonstrate that reaction vectors can be applied to the design of novel molecules of biological relevance that are also synthetically accessible.
Synthetically accessible de novo design using reaction vectors: Application to PARP1 inhibitors. Molecular Informatics.
Pub Date: 2024-04-01; Epub Date: 2024-02-15; DOI: 10.1002/minf.202300292
Milad Rayka, Morteza Mirzaei, Ali Mohammad Latifi
When designing a machine learning-based scoring function, we have access to only a limited number of protein-ligand complexes with experimentally determined binding affinity values, representing a small fraction of all possible protein-ligand complexes. Consequently, it is crucial to report a measure of confidence and to quantify the uncertainty in the model's predictions at test time. Here, we adopt the conformal prediction technique to evaluate the confidence of a prediction for each member of the core set of the CASF 2016 benchmark. The conformal prediction technique requires a diverse ensemble of predictors for uncertainty estimation. To this end, we introduce ENS-Score, an ensemble predictor that includes 30 models with different protein-ligand representation approaches and achieves a Pearson's correlation of 0.842 on the core set of the CASF 2016 benchmark. We also comprehensively investigate the residual error of each data point to assess the normality of the distribution of the residual errors and their correlation with the structural features of the ligands, such as hydrophobic interactions and halogen bonding. Finally, we provide a locally hosted web application to facilitate the use of ENS-Score. All code needed to reproduce the results is provided at https://github.com/miladrayka/ENS_Score.
An ensemble-based approach to estimate confidence of predicted protein-ligand binding affinity values. Molecular Informatics.
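The abstract above pairs conformal prediction with an ensemble of predictors. One common way to combine the two is split conformal regression with the ensemble spread as a normalizer, sketched below. This is a generic recipe under stated assumptions, not the ENS-Score implementation: the function names, the choice of nonconformity score (absolute error divided by ensemble standard deviation), and the toy numbers are all hypothetical.

```python
import math
from statistics import mean, pstdev


def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this confidence level
    return sorted(scores)[k - 1]


def calibrate(cal_member_preds, cal_targets, alpha=0.1, eps=1e-8):
    """Score each calibration complex by |error| / ensemble spread (assumed score)."""
    scores = [
        abs(y - mean(preds)) / (pstdev(preds) + eps)
        for preds, y in zip(cal_member_preds, cal_targets)
    ]
    return conformal_quantile(scores, alpha)


def predict_interval(member_preds, q, eps=1e-8):
    """Prediction interval centred on the ensemble mean, scaled by its spread."""
    centre = mean(member_preds)
    half_width = q * (pstdev(member_preds) + eps)
    return centre - half_width, centre + half_width


# Hypothetical calibration data: per-complex predictions from three ensemble
# members and the measured affinity (made-up numbers, not CASF 2016 values).
cal_member_preds = [[6.1, 6.4, 5.9], [7.8, 8.3, 8.0], [4.9, 5.6, 5.2], [9.0, 8.4, 8.8]]
cal_targets = [6.5, 7.6, 5.0, 9.3]
q = calibrate(cal_member_preds, cal_targets, alpha=0.25)
low, high = predict_interval([6.0, 7.0, 8.0], q)
print(f"interval: [{low:.2f}, {high:.2f}]")
```

Complexes on which the ensemble members disagree get wider intervals, which is the sense in which a diverse ensemble supplies the uncertainty estimate.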
Pub Date: 2024-04-01; Epub Date: 2024-02-19; DOI: 10.1002/minf.202300210
Souvik Pore, Arkaprava Banerjee, Kunal Roy
In silico approaches for predicting the properties of materials have been an effective alternative to experimental methods. Recently, the concepts of quantitative structure-property relationship (QSPR) and read-across (RA) methods were merged to develop an emerging chemoinformatic tool: read-across structure-property relationship (RASPR). The RASPR method is applicable to both large and small datasets, as it uses various similarity and error-based measures. It has also been observed that RASPR models tend to have higher external predictivity than the corresponding QSPR models. In this study, we have modeled the power conversion efficiency (PCE) of organic dyes used in dye-sensitized solar cells (DSSCs) by using the quantitative RASPR (q-RASPR) method. We used three relatively large classes of organic dyes for modelling: phenothiazines (n=207), porphyrins (n=281), and triphenylamines (n=229). We divided each dataset into training and test sets in 3 different combinations; with the training sets we developed three different QSPR models using structural and physicochemical descriptors and validated them with the corresponding test sets. The modeled descriptors were used to calculate the RASPR descriptors with a Java-based tool, RASAR Descriptor Calculator v2.0 (https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home), and data fusion was then performed by pooling the previously selected structural and physicochemical descriptors with the calculated RASPR descriptors. A further feature selection algorithm was employed to develop the final RASPR PLS models. We also developed different machine learning (ML) models with the descriptors selected in the QSPR PLS and RASPR PLS models, and found that models with RASPR descriptors surpassed the models with only structural and physicochemical descriptors in external predictivity: RMSEP was reduced for phenothiazines from 1.16-1.25 to 1.07-1.18, for porphyrins from 1.60-1.79 to 1.45-1.53, and for triphenylamines from 1.27-1.54 to 1.20-1.47.
Application of machine learning-based read-across structure-property relationship (RASPR) as a new tool for predictive modelling: Prediction of power conversion efficiency (PCE) for selected classes of organic dyes in dye-sensitized solar cells (DSSCs). Molecular Informatics.
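The external predictivity figures in the abstract above are reported as RMSEP, the root mean squared error of prediction on the test set. The sketch below shows that metric together with a generic similarity-weighted read-across estimate, which is the basic idea behind read-across descriptors. The function names, the choice of k nearest neighbours, and the toy values are assumptions for illustration; this is not the algorithm of the RASAR Descriptor Calculator.

```python
import math


def rmsep(observed, predicted):
    """Root mean squared error of prediction over a test set."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def read_across_prediction(similarities, train_values, k=3):
    """Similarity-weighted mean of the k most similar training compounds."""
    ranked = sorted(zip(similarities, train_values), reverse=True)[:k]
    total_weight = sum(sim for sim, _ in ranked)
    if total_weight == 0:
        raise ValueError("at least one neighbour must have non-zero similarity")
    return sum(sim * value for sim, value in ranked) / total_weight


# Hypothetical PCE values (%) and similarities, not data from the study.
observed = [5.2, 6.8, 4.1]
predicted = [4.9, 7.5, 5.0]
print(round(rmsep(observed, predicted), 3))
print(round(read_across_prediction([0.9, 0.6, 0.3], [5.0, 4.0, 2.0], k=2), 3))
```

A lower RMSEP means smaller typical prediction errors on compounds the model has not seen, which is how the abstract compares the QSPR and RASPR models.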
Pub Date: 2024-03-01; Epub Date: 2024-01-23; DOI: 10.1002/minf.202300249
Asu Busra Temizer, Gökçe Uludoğan, Rıza Özçelik, Taha Koulani, Elif Ozkirimli, Kutlu O Ulgen, Nilgun Karali, Arzucan Özgür
Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. The experiments on multiple protein-ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
Exploring data-driven chemical SMILES tokenization approaches to identify key protein-ligand binding moieties. Molecular Informatics.
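The abstract above describes treating the high-affinity ligands of each protein target as a document and ranking chemical words by tf-idf. A minimal sketch of that idea is given below, assuming the ligands have already been tokenized into chemical words. The specific tf-idf variant (length-normalized term frequency times log(N/df)), the function names, the target names, and the toy words are all assumptions for illustration, not the authors' pipeline.

```python
import math
from collections import Counter


def tfidf_by_document(documents):
    """Return {document: {word: tf-idf}} for a mapping of document -> word list."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each word once per document
    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def key_words(documents, top_n=1):
    """Highest-scoring chemical words per document."""
    ranked = {}
    for name, word_scores in tfidf_by_document(documents).items():
        ordered = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
        ranked[name] = [word for word, _ in ordered[:top_n]]
    return ranked


# Hypothetical targets and pre-tokenized chemical words (illustrative only).
documents = {
    "target_A": ["c1ccccc1", "S(=O)(=O)N", "c1ccccc1", "S(=O)(=O)N", "C(=O)N"],
    "target_B": ["c1ccccc1", "C(=O)O", "C(=O)O", "CC"],
}
print(key_words(documents))
```

In this toy example the benzene ring appears in both documents and scores zero, while the sulfonamide-like word is specific to the first target and ranks highest there, which mirrors the abstract's point that key chemical words are target-specific.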
Fragment-based drug design (FBDD) has emerged as an active area of computer-aided drug design, enabling the generation of novel molecules through the rearrangement of ring systems within known compounds. The construction of a focused fragment library plays a pivotal role in FBDD, as it requires compiling all potential bioactive ring systems capable of interacting with a specific target. In our study, we propose a workflow for the development of a focused fragment library and a combinatorial compound library. The fragment library comprises seed fragments and collected fragments. The extraction of seed fragments is guided by receptor information, which is a prerequisite for establishing a focused library. Collected fragments, in contrast, are obtained using the feature graph method, which offers a simplified representation of fragments and strikes a balance between diversity and similarity when categorizing different fragments. The use of the feature graph facilitates the rational partitioning of chemical space at the fragment level, enabling the exploration of the desired chemical space and enhancing the efficiency of compound library screening. Analysis demonstrates that our workflow enables the enumeration of a greater number of entirely new potential compounds, thereby aiding the rational design of drugs.
In silico construction of a focused fragment library facilitating exploration of chemical space. Weijie Han, Xiaohe Xu, Qing Fan, Yingchao Yan, YanMin Zhang, Yadong Chen, Haichun Liu. Molecular Informatics, 2024-03-01. DOI: 10.1002/minf.202300256