A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language

Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun
arXiv - QuanBio - Genomics. Published 2024-07-21. DOI: arxiv-2407.15888 (https://doi.org/arxiv-2407.15888)

Abstract

Predicting gene function from its DNA sequence is a fundamental challenge in biology. Many deep learning models have been proposed to embed DNA sequences and predict their enzymatic function, leveraging information in public databases that link DNA sequences to an enzymatic function label. However, much of the scientific community's knowledge of biological function is not represented in these categorical labels, and is instead captured in unstructured text descriptions of mechanisms, reactions, and enzyme behavior. These descriptions are often stored alongside DNA sequences in biological databases, albeit in an unstructured manner. Deep learning models that predict enzymatic function are likely to benefit from incorporating this multi-modal data, which encodes scientific knowledge of biological function. There is, however, no dataset designed for machine learning algorithms to leverage this multi-modal information. Here we propose a novel dataset and benchmark suite that enables the exploration and development of large multi-modal neural network models on gene DNA sequences and natural language descriptions of gene function. We present baseline performance on benchmarks for both unsupervised and supervised tasks. These baselines demonstrate the difficulty of this modeling objective, and also show the potential benefit of incorporating multi-modal data types in function prediction compared to DNA sequences alone. The dataset is available at https://hoarfrost-lab.github.io/BioTalk/.