Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model

IF 7.1; Zone 2 (Chemistry); JCR Q1, CHEMISTRY, MULTIDISCIPLINARY; Journal of Cheminformatics; Pub Date: 2024-05-22; DOI: 10.1186/s13321-024-00852-x

Hengwei Chen, Jürgen Bajorath

Journal of Cheminformatics, volume 16. Full text: https://link.springer.com/article/10.1186/s13321-024-00852-x

Abstract

Deep learning models adapted from natural language processing offer new opportunities for predicting active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, because language models are in principle versatile enough to translate between different types of textual representations, off-the-beaten-path design tasks can be explored. In this work, we investigated the generative design of active compounds with desired potency from target sequence embeddings, a rather provocative prediction task. For this purpose, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. This “biochemical” language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from the training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds, as well as more potent compounds, than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds that departed from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, making the approach attractive for further exploration and practical applications.
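The mapping described in the abstract, from a combined protein sequence and potency value embedding to a compound string, can be illustrated with a small sketch. It is not the authors' implementation. The amino acid composition vector is only a stand-in for a real protein language model embedding, the sinusoidal potency encoding and the regular-expression SMILES tokenizer are assumptions, and all function names are hypothetical.

```python
import math
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed tokenizer: bracket atoms, two-letter halogens, ring closures, single symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\(\)/\\@\.:~\*\$]"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens and add start/end markers."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return ["<bos>"] + tokens + ["<eos>"]


def toy_protein_embedding(sequence):
    """Stand-in for a protein language model: amino acid composition (20 values)."""
    if not sequence:
        raise ValueError("empty protein sequence")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def potency_embedding(potency, dim=8):
    """Encode a scalar potency value as sine/cosine pairs (an assumed scheme)."""
    vector = []
    for i in range(dim // 2):
        frequency = 1.0 / (100.0 ** (2 * i / dim))
        vector.append(math.sin(potency * frequency))
        vector.append(math.cos(potency * frequency))
    return vector


def build_condition(sequence, potency):
    """Concatenate target and potency embeddings into one conditioning vector."""
    return toy_protein_embedding(sequence) + potency_embedding(potency)


def make_training_pair(sequence, potency, smiles):
    """Source (conditioning vector) and target (compound tokens) for a seq2seq model."""
    return build_condition(sequence, potency), tokenize_smiles(smiles)


if __name__ == "__main__":
    # Hypothetical example: a short peptide fragment, a potency of 7.5, and ethanol.
    condition, target = make_training_pair("MKTAYIAK", 7.5, "CCO")
    print(len(condition), target)
```

In a real setting, the conditioning vector would be passed to a conditional transformer that decodes the compound tokens. An unconditional baseline in this sketch would simply omit the potency part of the vector.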

Source journal

Journal of Cheminformatics (CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore: 14.10

Self-citation rate: 7.00%

Review time: 3 months

Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.