Simple and effective embedding model for single-cell biology built from ChatGPT

Nature Biomedical Engineering (IF 26.8; Zone 1, Medicine; JCR Q1, Engineering, Biomedical) Pub Date: 2024-12-06 DOI: 10.1038/s41551-024-01284-6
Yiqun Chen, James Zou

Abstract

Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes and to then generate single-cell embeddings by averaging the gene embeddings weighted by each gene’s expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models—particularly, tasks of gene-property and cell-type classifications—our model, which we named GenePT, achieved comparable or better performance than models pretrained from gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
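
The two cell-embedding constructions described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy three-dimensional vectors stand in for GPT-3.5 embeddings of gene text descriptions, the names `GENE_EMBEDDINGS`, `weighted_cell_embedding`, and `cell_sentence` are hypothetical, and both the normalization by total expression and the dropping of unexpressed genes from the sentence are assumptions.

```python
from typing import Dict, List

# Toy stand-ins (assumption): real vectors would be GPT-3.5 embeddings
# of each gene's text description.
GENE_EMBEDDINGS: Dict[str, List[float]] = {
    "CD3E": [1.0, 0.0, 0.0],
    "MS4A1": [0.0, 1.0, 0.0],
    "LYZ": [0.0, 0.0, 1.0],
}


def weighted_cell_embedding(
    expression: Dict[str, float],
    gene_embeddings: Dict[str, List[float]],
) -> List[float]:
    """Average gene embeddings, weighting each gene by its expression level."""
    total = sum(expression.get(gene, 0.0) for gene in gene_embeddings)
    if total == 0:
        raise ValueError("cell expresses none of the embedded genes")
    dim = len(next(iter(gene_embeddings.values())))
    cell = [0.0] * dim
    for gene, vector in gene_embeddings.items():
        # Normalizing by total expression is an assumption of this sketch.
        weight = expression.get(gene, 0.0) / total
        for i, value in enumerate(vector):
            cell[i] += weight * value
    return cell


def cell_sentence(expression: Dict[str, float]) -> str:
    """Join gene names in order of decreasing expression.

    Unexpressed genes are dropped and ties are broken alphabetically
    (both are assumptions of this sketch).
    """
    expressed = [(gene, level) for gene, level in expression.items() if level > 0]
    expressed.sort(key=lambda pair: (-pair[1], pair[0]))
    return " ".join(gene for gene, _ in expressed)


# Hypothetical expression profile for one cell.
cell = {"CD3E": 6.0, "MS4A1": 0.0, "LYZ": 2.0}
embedding = weighted_cell_embedding(cell, GENE_EMBEDDINGS)
sentence = cell_sentence(cell)
```

In the second construction, the resulting sentence of gene names would itself be passed to a text-embedding model to obtain the cell embedding; that call is omitted here so the sketch stays self-contained.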

Source journal
Nature Biomedical Engineering
Subject area: Medicine, Medicine (miscellaneous)
CiteScore: 45.30
Self-citation rate: 1.10%
Articles published: 138
About the journal: Nature Biomedical Engineering is an online-only monthly journal launched in January 2017. It aims to publish original research, reviews, and commentary on applied biomedicine and health technology. Its target audience includes life scientists who develop experimental or computational systems and methods to improve our understanding of human physiology; biomedical researchers and engineers who design or optimize therapies, assays, devices, or procedures for diagnosing or treating diseases; and clinicians who use research outputs to evaluate patient health or administer therapy in various clinical settings and healthcare contexts.