Multimodal learning of noncoding variant effects using genome sequence and chromatin structure.

Bioinformatics | IF 4.4 | Zone 3 (Biology) | JCR Q1 (Biochemical Research Methods) | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad541

Wuwei Tan, Yang Shen

Citations: 0



Multimodal learning of noncoding variant effects using genome sequence and chromatin structure.

Motivation: A growing number of noncoding genetic variants, including single-nucleotide polymorphisms, are found to be associated with complex human traits and diseases. Their mechanistic interpretation remains relatively limited and can benefit from computational prediction of their effects on epigenetic profiles. However, current models often focus on local, 1D genome sequence determinants and disregard the global, 3D chromatin structure that critically affects epigenetic events.

Results: We find that the unexpectedly high similarity in epigenetic profiles among some noncoding variants, relative to their low similarity in local sequences, can be largely attributed to their proximity in chromatin structure. Accordingly, we have developed a multimodal deep learning scheme that incorporates both 1D genome sequence and 3D chromatin structure data for predicting noncoding variant effects. Specifically, we have integrated convolutional and recurrent neural networks for sequence embedding and graph neural networks for structure embedding, despite the resolution gap between the two types of data, while utilizing recent DNA language models. Numerical results show that our models outperform competing sequence-only models in predicting epigenetic profiles and that their use of long-range interactions complements sequence-only models in extracting regulatory motifs. They prove to be excellent predictors of noncoding variant effects on gene expression and pathogenicity, whether in unsupervised "zero-shot" learning or supervised "few-shot" learning.
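To make the idea of combining sequence and structure concrete, here is a minimal toy sketch. It is not the authors' model. It assumes the genome is split into bins, uses base composition as a stand-in for a learned sequence embedding, and uses one round of mean-neighbor pooling over a contact graph as a stand-in for a graph neural network. All function names, the bin strings, and the contact map are hypothetical and chosen only for illustration.

```python
from math import sqrt

def sequence_embedding(seq):
    """Toy stand-in for a learned sequence encoder: A/C/G/T fractions."""
    return [seq.count(base) / len(seq) for base in "ACGT"]

def fuse_with_structure(embeddings, contacts):
    """Concatenate each bin's embedding with the mean embedding of the
    bins it contacts in 3D (zeros if it has no contacts)."""
    fused = []
    for i, own in enumerate(embeddings):
        neighbors = contacts.get(i, [])
        if neighbors:
            pooled = [sum(embeddings[j][k] for j in neighbors) / len(neighbors)
                      for k in range(len(own))]
        else:
            pooled = [0.0] * len(own)
        fused.append(own + pooled)
    return fused

def variant_effect_score(ref_bins, alt_bins, contacts):
    """Unsupervised score: total Euclidean shift of the fused embeddings
    between the reference and alternate alleles, summed over all bins."""
    ref = fuse_with_structure([sequence_embedding(s) for s in ref_bins], contacts)
    alt = fuse_with_structure([sequence_embedding(s) for s in alt_bins], contacts)
    return sum(sqrt(sum((r - a) ** 2 for r, a in zip(rv, av)))
               for rv, av in zip(ref, alt))

# Hypothetical example: a T>A variant in bin 0; bins 0 and 2 are in 3D contact.
ref_bins = ["ACGT", "AAAA", "CCCC"]
alt_bins = ["ACGA", "AAAA", "CCCC"]
contacts = {0: [2], 2: [0]}

print(variant_effect_score(ref_bins, alt_bins, {}))        # sequence only
print(variant_effect_score(ref_bins, alt_bins, contacts))  # sequence + structure
```

In this toy setting the variant changes only bin 0 when contacts are ignored, but it also shifts the representation of bin 2 once the contact graph is used. This is the kind of long-range effect a sequence-only model with a local window cannot see.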

Availability and implementation: Code and data can be accessed at https://github.com/Shen-Lab/ncVarPred-1D3D and https://zenodo.org/record/7975777.


Source journal: Bioinformatics (Biology, Biochemical Research Methods)

CiteScore: 11.20

Self-citation rate: 5.20%

Articles published: 753

Review time: 2.1 months

About the journal: The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers; the former reports biologically interesting discoveries made using computational methods, and the latter explores the applications used for experiments.