Low-dimensional genotype embeddings for predictive models

Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Pub Date: 2022-08-07. DOI: 10.1145/3535508.3545507

Syed Fahad Sultan, Xingzhi Guo, S. Skiena

Abstract

We develop methods for constructing low-dimensional vector representations (embeddings) of large-scale genotyping data, capable of reducing genotypes of hundreds of thousands of SNPs to 100-dimensional embeddings that retain substantial predictive power for inferring medical phenotypes. We demonstrate that embedding-based models yield an average F-score of 0.605 on a test of ten phenotypes (including BMI prediction, genetic relatedness, and depression) versus 0.339 for baseline models. Genotype embeddings also hold promise for sharing data while preserving subject anonymity: we show that they retain substantial predictive power even after being anonymized by adding Gaussian noise to each dimension.
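The anonymization step mentioned in the abstract, adding Gaussian noise to each embedding dimension, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `add_gaussian_noise`, the noise scale `sigma`, and the toy embeddings are all assumptions made for illustration, since the abstract does not state how the noise level was chosen.

```python
import random
from typing import List, Optional


def add_gaussian_noise(
    embeddings: List[List[float]],
    sigma: float,
    seed: Optional[int] = None,
) -> List[List[float]]:
    """Return a copy of `embeddings` with independent N(0, sigma^2) noise
    added to every dimension of every row.

    `sigma` is a hypothetical parameter; the abstract gives no value.
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    rng = random.Random(seed)  # local generator, so results are reproducible
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in embeddings]


# Toy stand-in for genotype embeddings: 3 subjects, 100 dimensions each.
# Real embeddings would come from a trained model, not from random draws.
toy_rng = random.Random(0)
toy_embeddings = [
    [toy_rng.uniform(-1.0, 1.0) for _ in range(100)] for _ in range(3)
]

# Perturb every dimension; the original list is left unmodified.
noisy = add_gaussian_noise(toy_embeddings, sigma=0.1, seed=42)
```

A larger `sigma` would presumably give stronger anonymity at the cost of predictive power. Measuring that trade-off would require retraining or re-evaluating the downstream phenotype classifiers on the noisy embeddings.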
