GO2Sum:根据 GO 术语生成人类可读的蛋白质功能摘要。

IF 3.5 2区 生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY NPJ Systems Biology and Applications Pub Date : 2024-03-15 DOI:10.1038/s41540-024-00358-0
Swagarika Jaharlal Giri, Nabil Ibtehaz, Daisuke Kihara
{"title":"GO2Sum:根据 GO 术语生成人类可读的蛋白质功能摘要。","authors":"Swagarika Jaharlal Giri, Nabil Ibtehaz, Daisuke Kihara","doi":"10.1038/s41540-024-00358-0","DOIUrl":null,"url":null,"abstract":"<p><p>Understanding the biological functions of proteins is of fundamental importance in modern biology. To represent a function of proteins, Gene Ontology (GO), a controlled vocabulary, is frequently used, because it is easy to handle by computer programs avoiding open-ended text interpretation. Particularly, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.</p>","PeriodicalId":19345,"journal":{"name":"NPJ Systems Biology and Applications","volume":null,"pages":null},"PeriodicalIF":3.5000,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10943200/pdf/","citationCount":"0","resultStr":"{\"title\":\"GO2Sum: generating human-readable functional summary of proteins from GO terms.\",\"authors\":\"Swagarika Jaharlal Giri, Nabil Ibtehaz, Daisuke Kihara\",\"doi\":\"10.1038/s41540-024-00358-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Understanding the biological functions of proteins is of fundamental importance in modern biology. To represent a function of proteins, Gene Ontology (GO), a controlled vocabulary, is frequently used, because it is easy to handle by computer programs avoiding open-ended text interpretation. Particularly, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.</p>\",\"PeriodicalId\":19345,\"journal\":{\"name\":\"NPJ Systems Biology and Applications\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-03-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10943200/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"NPJ Systems Biology and Applications\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1038/s41540-024-00358-0\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"NPJ Systems Biology and Applications","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1038/s41540-024-00358-0","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0

摘要

了解蛋白质的生物学功能对现代生物学至关重要。为了表示蛋白质的功能,基因本体(Gene Ontology,GO)这一受控词汇经常被使用,因为它易于计算机程序处理,避免了开放式文本解释。特别是,目前大多数蛋白质功能预测方法都依赖于 GO 术语。然而,描述蛋白质功能的大量 GO 术语在解释时会给生物学家带来挑战。为了解决这个问题,我们开发了 GO2Sum(基因本体术语总结器),这是一个将一组 GO 术语作为输入,并使用 T5 大语言模型生成人类可读总结的模型。GO2Sum 是通过微调 T5 的 GO 术语分配和 UniProt 条目的自由文本功能描述而开发的,使其能够通过连接 GO 术语描述来重新创建功能描述。我们的研究结果表明,在为 UniProt 条目生成功能、亚基结构和途径段落方面,GO2Sum 明显优于在整个网络语料库中训练的原始 T5 模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。

摘要图片

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
GO2Sum: generating human-readable functional summary of proteins from GO terms.

Understanding the biological functions of proteins is of fundamental importance in modern biology. To represent a function of proteins, Gene Ontology (GO), a controlled vocabulary, is frequently used, because it is easy to handle by computer programs avoiding open-ended text interpretation. Particularly, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
NPJ Systems Biology and Applications
NPJ Systems Biology and Applications Mathematics-Applied Mathematics
CiteScore
5.80
自引率
0.00%
发文量
46
审稿时长
8 weeks
期刊介绍: npj Systems Biology and Applications is an online Open Access journal dedicated to publishing the premier research that takes a systems-oriented approach. The journal aims to provide a forum for the presentation of articles that help define this nascent field, as well as those that apply the advances to wider fields. We encourage studies that integrate, or aid the integration of, data, analyses and insight from molecules to organisms and broader systems. Important areas of interest include not only fundamental biological systems and drug discovery, but also applications to health, medical practice and implementation, big data, biotechnology, food science, human behaviour, broader biological systems and industrial applications of systems biology. We encourage all approaches, including network biology, application of control theory to biological systems, computational modelling and analysis, comprehensive and/or high-content measurements, theoretical, analytical and computational studies of system-level properties of biological systems and computational/software/data platforms enabling such studies.
期刊最新文献
Understanding flux switching in metabolic networks through an analysis of synthetic lethals Optimal performance objectives in the highly conserved bone morphogenetic protein signaling pathway Tipping-point transition from transient to persistent inflammation in pancreatic islets EpiScan: accurate high-throughput mapping of antibody-specific epitopes using sequence information Codon usage and expression-based features significantly improve prediction of CRISPR efficiency.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1