Exploring data-driven chemical SMILES tokenization approaches to identify key protein-ligand binding moieties.

IF 2.8 | Zone 4 (Medicine) | Q3 CHEMISTRY, MEDICINAL | Molecular Informatics | Pub Date: 2024-03-01 | Epub Date: 2024-01-23 | DOI: 10.1002/minf.202300249

Asu Busra Temizer, Gökçe Uludoğan, Rıza Özçelik, Taha Koulani, Elif Ozkirimli, Kutlu O Ulgen, Nilgun Karali, Arzucan Özgür



Abstract

Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences, since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, the chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats high-affinity ligands of protein targets as documents and selects the key chemical words making up those ligands based on tf-idf weighting. Experiments on multiple protein-ligand affinity datasets show that, despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
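The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the tokenizer settings, the vocabulary sizes, or the exact tf-idf variant, so everything below is an assumption made for illustration. The sketch uses a toy character-level Byte Pair Encoding learner (so two-letter atoms such as Cl are split into characters before merging), treats the pooled ligands of each target as one document, and scores each chemical word as relative term frequency times log(N / document frequency). The function names and the toy SMILES strings are hypothetical.

```python
import math
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe_merges(smiles_list, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair."""
    corpus = [list(s) for s in smiles_list]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(smiles, merges):
    """Segment a SMILES string into chemical words using learned merges."""
    tokens = list(smiles)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def key_chemical_words(target_to_ligands, merges, top_k=3):
    """Rank chemical words per target by tf-idf (one document per target)."""
    docs = {
        target: Counter(w for s in ligands for w in tokenize(s, merges))
        for target, ligands in target_to_ligands.items()
    }
    n_docs = len(docs)
    # Document frequency: number of targets whose ligands contain the word.
    doc_freq = Counter(w for counts in docs.values() for w in counts)
    ranked = {}
    for target, counts in docs.items():
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        ranked[target] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return ranked


# Hypothetical toy data: made-up SMILES, not taken from the paper's datasets.
toy_ligands = {
    "target_A": ["CC(=O)N", "CC(=O)NC"],
    "target_B": ["c1ccccc1O", "c1ccccc1CO"],
}
all_smiles = [s for ligands in toy_ligands.values() for s in ligands]
toy_merges = learn_bpe_merges(all_smiles, num_merges=5)
print(key_chemical_words(toy_ligands, toy_merges))
```

Under this scoring, a chemical word that occurs in the ligands of every target receives a weight of zero, which is one way to read the abstract's observation that the selected key words are specific to protein targets. In practice a library tokenizer trained on a large SMILES corpus would replace the toy learner.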




Source journal

Molecular Informatics (CHEMISTRY, MEDICINAL; MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore

7.30

Self-citation rate

2.80%

Articles published

Review time

3 months

About the journal: Molecular Informatics is a peer-reviewed, international forum for publication of high-quality, interdisciplinary research on all molecular aspects of bio/cheminformatics and computer-assisted molecular design. Molecular Informatics succeeded QSAR & Combinatorial Science in 2010. Molecular Informatics presents methodological innovations that will lead to a deeper understanding of ligand-receptor interactions, macromolecular complexes, molecular networks, design concepts and processes that demonstrate how ideas and design concepts lead to molecules with a desired structure or function, preferably including experimental validation. The journal's scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature, Molecular Informatics publishes so-called "Methods Corner" review-type articles, which feature important technological concepts and advances within the scope of the journal.