Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning
Authors: Jue Wang, Yufan Liu, Boxue Tian
Journal of Cheminformatics, volume 16, issue 1. Published 2024-11-06. DOI: 10.1186/s13321-024-00920-2
Article URL: https://link.springer.com/article/10.1186/s13321-024-00920-2
Predicting protein-small molecule binding sites, the initial step in structure-guided drug design, remains challenging for proteins lacking experimentally derived ligand-bound structures. Here, we propose CLAPE-SMB, which integrates a pre-trained protein language model with contrastive learning to provide high-accuracy predictions of small molecule binding sites, including for proteins without a published crystal structure. We trained and tested CLAPE-SMB on the SJC dataset, a non-redundant dataset based on sc-PDB, JOINED, and COACH420, and achieved a Matthews correlation coefficient (MCC) of 0.529. We also compiled the UniProtSMB dataset, which merges sites from similar proteins based on raw data from the UniProtKB database, and achieved an MCC of 0.699 on the test set. In addition, CLAPE-SMB achieved an MCC of 0.815 on our intrinsically disordered protein (IDP) dataset, which contains 336 non-redundant sequences. Case studies of DAPK1, RebH, and Nep1 support the potential of this binding site prediction tool to aid in drug design. The code and datasets are freely available at https://github.com/JueWangTHU/CLAPE-SMB.
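The MCC values reported above summarize per-residue binary predictions (binding vs. non-binding) in a single score between -1 and 1. The following is a minimal sketch of the standard MCC formula, not code from the CLAPE-SMB repository; the function name `matthews_corrcoef_binary` and the example labels are hypothetical and chosen only for illustration.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Labels are assumed to be 1 (binding residue) or 0 (non-binding residue).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical per-residue labels for a 10-residue sequence (illustration only).
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Here TP=3, TN=5, FP=1, FN=1, so MCC = (15 - 1) / sqrt(4 * 4 * 6 * 6) = 14 / 24.
print(round(matthews_corrcoef_binary(y_true, y_pred), 3))
```

Because binding residues are typically a small minority of a sequence, MCC is often preferred over plain accuracy: it uses all four confusion-matrix counts, so a predictor that labels every residue as non-binding scores 0 rather than a misleadingly high value.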
CLAPE-SMB combines a pre-trained protein language model with contrastive learning to accurately predict protein-small molecule binding sites, especially for proteins without experimental structures, such as IDPs. Trained across various datasets, this model shows strong adaptability, making it a valuable tool for advancing drug design and understanding protein-small molecule interactions.
About the journal:
Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling.
Coverage includes, but is not limited to:
chemical information systems, software and databases, and molecular modelling,
chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases,
computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.