BERT and LLMs-Based avGFP Brightness Prediction and Mutation Design
X. Guo, W. Che
arXiv - QuanBio - Other Quantitative Biology, published 2024-07-30
DOI: arxiv-2407.20534
Citations: 0
Abstract
This study uses Transformer models and large language models (such as GPT and Claude) to predict the brightness of Aequorea victoria green fluorescent protein (avGFP) and to design mutants with higher brightness. Given the time and cost of traditional experimental screening, we apply machine learning to improve research efficiency. We first preprocess a proprietary dataset of approximately 140,000 protein sequences, including about 30,000 avGFP sequences. We then construct and train a Transformer-based prediction model to screen and design new avGFP mutants expected to exhibit higher brightness. Our methodology consists of two stages: first, building a scoring model with BERT; second, screening and generating mutants using mutation-site statistics and large language models. From the analysis of the predictions, we designed and screened 10 new high-brightness avGFP sequences. This study not only demonstrates the potential of deep learning in protein design but also offers new perspectives and methodologies for future research by integrating prior knowledge from large language models.
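The second stage sketched in the abstract, tallying which substitutions recur among high-scoring mutants and recombining the most frequent ones into new candidates, can be illustrated with a small, self-contained helper. This is a minimal sketch under stated assumptions: the function names, the toy five-residue sequences, and the brightness threshold are illustrative only, not the authors' actual pipeline, which additionally draws on BERT scores and LLM prior knowledge.

```python
from collections import Counter


def mutation_site_stats(wild_type, scored_mutants, threshold):
    """Count substitutions (position, new residue) observed in mutants
    whose (predicted) brightness meets or exceeds `threshold`."""
    counts = Counter()
    for seq, score in scored_mutants:
        if score < threshold:
            continue
        for pos, (wt, mu) in enumerate(zip(wild_type, seq)):
            if wt != mu:
                counts[(pos, mu)] += 1
    return counts


def propose_mutants(wild_type, counts, top_k):
    """Apply each of the `top_k` most frequent substitutions to the
    wild type individually, yielding one candidate per substitution."""
    proposals = []
    for (pos, res), _ in counts.most_common(top_k):
        proposals.append(wild_type[:pos] + res + wild_type[pos + 1:])
    return proposals


if __name__ == "__main__":
    wild = "MSKGE"  # toy stand-in for the 238-residue avGFP sequence
    scored = [
        ("MAKGE", 0.90),  # S2A, bright
        ("MSKGV", 0.80),  # E5V, bright
        ("MAKGV", 0.95),  # S2A + E5V, bright
        ("MSAGE", 0.20),  # K3A, dim: excluded by the threshold
    ]
    stats = mutation_site_stats(wild, scored, threshold=0.5)
    print(propose_mutants(wild, stats, top_k=2))
```

In a real screen, `scored_mutants` would hold the BERT scoring model's predictions over the ~30,000 avGFP sequences, and the per-site statistics would then be handed to the LLM stage as prior knowledge for generating further candidates.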