GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning
Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang
arXiv - QuanBio - Genomics, published 2024-07-24, DOI: arxiv-2407.16940
Abstract
Genetic variants (GVs) are defined as differences in the DNA sequences among
individuals and play a crucial role in diagnosing and treating genetic
diseases. The rapid decrease in next-generation sequencing cost has led to an
exponential increase in patient-level GV data. This growth poses a challenge
for clinicians who must efficiently prioritize patient-specific GVs and
integrate them with existing genomic databases to inform patient management. To
address the interpretation of GVs, genomic foundation models (GFMs) have
emerged. However, these models lack standardized performance assessments,
leading to considerable variability in model evaluations. This raises the
question: How effectively do deep learning methods classify unknown GVs and
align them with clinically verified GVs? We argue that representation learning,
which transforms raw data into meaningful feature spaces, is an effective
approach for addressing both indexing and classification challenges. We
introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring
variable-length contexts and detailed annotations, designed for deep learning
models to learn GV representations across various traits, diseases, tissue
types, and experimental contexts. Our contributions are three-fold: (i)
Construction of a comprehensive dataset with 7 million records, each labeled
with characteristics of the corresponding variants, alongside additional data
from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant
combinations, and 156 unique clinically verified GVs from real-world patients.
(ii) Analysis of the structure and properties of the dataset. (iii)
Experiments on the dataset with pre-trained GFMs. The results show a
significant gap between GFMs' current capabilities and accurate GV
representation. We hope this dataset will help advance genomic deep learning to
bridge this gap.
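
The indexing task mentioned in the abstract, aligning an unknown GV with clinically verified GVs, can be illustrated with a nearest-neighbor lookup in an embedding space. The sketch below is only an illustration under stated assumptions: the three-dimensional vectors, the variant IDs, the labels, and the function names are all made up for this example, and cosine similarity is one common choice of metric, not necessarily the one used by the authors. In practice the embeddings would come from a GFM rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def nearest_verified_variant(query, reference):
    """Return (variant_id, label, similarity) for the reference embedding
    closest to the query, or None if the reference set is empty."""
    best = None
    for variant_id, (embedding, label) in reference.items():
        sim = cosine_similarity(query, embedding)
        if best is None or sim > best[2]:
            best = (variant_id, label, sim)
    return best


# Hypothetical embeddings of clinically verified variants (toy values).
reference = {
    "variant_A": ([0.9, 0.1, 0.0], "pathogenic"),
    "variant_B": ([0.0, 0.2, 0.9], "benign"),
}

# Hypothetical embedding of an unknown variant.
query = [0.8, 0.2, 0.1]

match = nearest_verified_variant(query, reference)
print(match[0], match[1])
```

The same lookup doubles as a simple classifier: the label of the closest verified variant serves as the prediction for the unknown one, which is why a good representation can address both indexing and classification.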