Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery.

BMC Bioinformatics · Pub Date: 2024-08-01 · DOI: 10.1186/s12859-024-05861-z · IF 2.9 · Zone 3 (Biology) · JCR Q2 (Biochemical Research Methods)

Nicholas Aksamit, Alain Tchagang, Yifeng Li, Beatrice Ombuki-Berman


Citations: 0

Abstract

Background: Drug discovery and development is an extremely costly and time-consuming process of identifying new molecules that can interact with a biomarker target to interrupt the disease pathway of interest. In addition to binding the target, a drug candidate needs to satisfy multiple properties affecting absorption, distribution, metabolism, excretion, and toxicity (ADMET). Artificial intelligence approaches provide an opportunity to improve each step of this process, and the first question they raise is how to represent a molecule informatively so that in-silico solutions are optimized.

Results: This study introduces a novel hybrid SMILES-fragment tokenization method, coupled with two pre-training strategies, utilizing a Transformer-based model. We investigate the efficacy of hybrid tokenization in improving the performance of ADMET prediction tasks. Our approach leverages MTL-BERT, an encoder-only Transformer model that achieves state-of-the-art ADMET predictions, and contrasts the standard SMILES tokenization with our hybrid method across a spectrum of fragment library cutoffs.
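The idea of mixing fragment-level and character-level tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how fragments are generated or matched, so this example assumes fragments are given as SMILES substrings and applies a greedy longest-match over atom-level tokens. The names `smiles_tokens`, `hybrid_tokens`, and `fragment_vocab` are hypothetical, and the regular expression is a commonly used atom-level SMILES pattern, not one taken from the paper.

```python
import re

# Atom-level SMILES pattern (assumption: a commonly used regex, not from the paper).
# Bracket atoms, two-letter halogens, and %NN ring closures are kept as single tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@?>*$])"
)


def smiles_tokens(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def hybrid_tokens(smiles, fragment_vocab):
    """Emit a fragment token where a known fragment matches, else atom-level tokens.

    Matching is done on token sequences (not raw characters) so that a
    fragment can never split a multi-character token such as 'Cl'.
    """
    base = smiles_tokens(smiles)
    # Longest fragments first; single-token fragments add nothing, so skip them.
    frag_seqs = sorted(
        (seq for seq in (tuple(smiles_tokens(f)) for f in fragment_vocab) if len(seq) > 1),
        key=len,
        reverse=True,
    )
    out, i = [], 0
    while i < len(base):
        for seq in frag_seqs:
            if tuple(base[i:i + len(seq)]) == seq:
                out.append("".join(seq))  # one token for the whole fragment
                i += len(seq)
                break
        else:
            out.append(base[i])  # fall back to the atom-level token
            i += 1
    return out


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(hybrid_tokens(aspirin, {"C(=O)O"}))
```

With an empty `fragment_vocab` the function reduces to plain atom-level SMILES tokenization, which gives a baseline to compare against. A chemistry-aware pipeline would likely derive fragments from the molecular graph rather than from substrings; substring matching is used here only to keep the sketch self-contained.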

Conclusion: The findings reveal that while an excess of fragments can impede performance, using hybrid tokenization with high-frequency fragments improves results beyond the base SMILES tokenization. This underscores the potential of integrating fragment- and character-level molecular features when training Transformer models for ADMET property prediction.
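One way to picture a fragment library cutoff is as a minimum occurrence count applied to fragments collected from a training corpus. The sketch below makes that assumption explicit; the abstract does not define the cutoff or the fragmentation procedure, so `build_fragment_vocab`, `min_count`, and the toy corpus are hypothetical illustrations only.

```python
from collections import Counter


def build_fragment_vocab(fragments_per_molecule, min_count):
    """Keep fragments that occur at least `min_count` times across the corpus.

    `fragments_per_molecule` is an iterable of fragment lists, one per molecule.
    Assumption: the cutoff is a raw occurrence count; a real pipeline might
    count molecules containing the fragment, or use a relative frequency.
    """
    counts = Counter()
    for frags in fragments_per_molecule:
        counts.update(frags)
    return {frag for frag, n in counts.items() if n >= min_count}


# Toy corpus: each inner list holds the fragments found in one molecule.
corpus = [
    ["C(=O)O", "c1ccccc1"],
    ["C(=O)O", "CN"],
    ["C(=O)O", "c1ccccc1"],
]

for cutoff in (1, 2, 3):
    print(cutoff, sorted(build_fragment_vocab(corpus, cutoff)))
```

Raising the cutoff shrinks the library toward the most frequent fragments, which is the regime the conclusion reports as helpful; lowering it admits rare fragments and enlarges the vocabulary.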




Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70
Self-citation rate: 3.30%
Articles published: 506
Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.