Harnessing the Foundation Model for Exploration of Single-cell Expression Atlases in Plants.

Guangshuo Cao, Haoyu Chao, Wenqi Zheng, Yangming Lan, Kaiyan Lu, Yueyi Wang, Ming Chen, He Zhang, Dijun Chen
{"title":"Harnessing the Foundation Model for Exploration of Single-cell Expression Atlases in Plants.","authors":"Guangshuo Cao, Haoyu Chao, Wenqi Zheng, Yangming Lan, Kaiyan Lu, Yueyi Wang, Ming Chen, He Zhang, Dijun Chen","doi":"10.1093/gpbjnl/qzaf024","DOIUrl":null,"url":null,"abstract":"<p><p>Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into plant cellular diversity by enabling high-resolution analyses of gene expression at the single-cell level. However, the complexity of scRNA-seq data, including challenges in batch integration, cell type annotation, and gene regulatory network (GRN) inference, demands advanced computational approaches. To address these challenges, we developed scPlantLLM, a Transformer model trained on millions of plant single-cell data points. Using a sequential pretraining strategy incorporating masked language modeling and cell type annotation tasks, scPlantLLM generates robust and interpretable single-cell data embeddings. When applied to Arabidopsis thaliana datasets, scPlantLLM excels in clustering, cell type annotation, and batch integration, achieving an accuracy of up to 0.91 in zero-shot learning scenarios. Furthermore, the model demonstrates an ability to identify biologically meaningful GRNs and subtle cellular subtypes, showcasing its potential to advance plant biology research. Compared to traditional methods, scPlantLLM outperforms in key metrics such as adjusted rand index (ARI), normalized mutual information (NMI) and silhouette score (SIL), highlighting its superior clustering accuracy and biological relevance. scPlantLLM represents a foundational model for exploring plant single-cell expression atlases, offering unprecedented capabilities to resolve cellular heterogeneity and regulatory dynamics across diverse plant systems. The code used in this study is available at https://github.com/compbioNJU/scPlantLLM.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genomics, proteomics & bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/gpbjnl/qzaf024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into plant cellular diversity by enabling high-resolution analyses of gene expression at the single-cell level. However, the complexity of scRNA-seq data, including challenges in batch integration, cell type annotation, and gene regulatory network (GRN) inference, demands advanced computational approaches. To address these challenges, we developed scPlantLLM, a Transformer model trained on millions of plant single-cell data points. Using a sequential pretraining strategy incorporating masked language modeling and cell type annotation tasks, scPlantLLM generates robust and interpretable single-cell data embeddings. When applied to Arabidopsis thaliana datasets, scPlantLLM excels in clustering, cell type annotation, and batch integration, achieving an accuracy of up to 0.91 in zero-shot learning scenarios. Furthermore, the model demonstrates an ability to identify biologically meaningful GRNs and subtle cellular subtypes, showcasing its potential to advance plant biology research. Compared to traditional methods, scPlantLLM outperforms in key metrics such as adjusted rand index (ARI), normalized mutual information (NMI) and silhouette score (SIL), highlighting its superior clustering accuracy and biological relevance. scPlantLLM represents a foundational model for exploring plant single-cell expression atlases, offering unprecedented capabilities to resolve cellular heterogeneity and regulatory dynamics across diverse plant systems. The code used in this study is available at https://github.com/compbioNJU/scPlantLLM.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
单细胞 RNA 测序(scRNA-seq)可在单细胞水平上对基因表达进行高分辨率分析,从而为植物细胞多样性提供前所未有的见解。然而,scRNA-seq 数据的复杂性,包括批量整合、细胞类型注释和基因调控网络(GRN)推断方面的挑战,需要先进的计算方法。为了应对这些挑战,我们开发了 scPlantLLM,这是一种在数百万植物单细胞数据点上训练的 Transformer 模型。scPlantLLM 采用顺序预训练策略,结合屏蔽语言建模和细胞类型注释任务,生成了稳健且可解释的单细胞数据嵌入。当应用于拟南芥数据集时,scPlantLLM 在聚类、细胞类型注释和批量整合方面表现出色,在零点学习情况下准确率高达 0.91。此外,该模型还展示了识别具有生物学意义的 GRN 和微妙的细胞亚型的能力,展示了其推进植物生物学研究的潜力。与传统方法相比,scPlantLLM 在调整后兰德指数(ARI)、归一化互信息(NMI)和剪影得分(SIL)等关键指标上表现优异,突出了其卓越的聚类准确性和生物学相关性。scPlantLLM 代表了探索植物单细胞表达图谱的基础模型,为解析不同植物系统的细胞异质性和调控动态提供了前所未有的能力。本研究使用的代码见 https://github.com/compbioNJU/scPlantLLM。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
HemAtlas: A Multi-omics Hematopoiesis Database. Harnessing the Foundation Model for Exploration of Single-cell Expression Atlases in Plants. LCORL and STC2 Variants Increase Body Size and Growth Rate in Cattle and Other Animals. The High Expression of PD-1 Defines A Subpopulation of Tfh Cells Responding to COVID-19 Vaccine in Humans. Haplotype-based Pangenomics: A Blueprint for Climate Adaptation in Plants.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1