Machine Learning Facilitates Imputation of Gene Expression Levels across Multiple Environments

2021 11th International Conference on Bioscience, Biochemistry and Bioinformatics. Pub Date: 2021-01-09. DOI: 10.1145/3448340.3448342

Ziang Xu, H. Qi



Abstract


Gene expression levels reflect the active biological processes in a living cell, so it is important to quantify them across multiple environments. However, the expression level in some environments or strains of a species may not be measured correctly because of sequence diversity or technical limitations in mRNA-seq, qPCR, or microarray. It would therefore be highly beneficial to infer the missing expression levels from existing data; this process of filling in missing values is called imputation. Imputation is a very active field in machine learning, and many tech companies use it to infer customer preferences for products, movies, and so on. Here we apply multiple state-of-the-art imputation methods and compare their performance in predicting gene expression levels across multiple environments. Using an expression dataset of Saccharomyces cerevisiae across 13 environments, we randomly removed 5%, 20%, 50%, and 75% of the expression levels from the dataset, applied various imputation methods to predict the missing values, and used root mean squared error to compare model performance. We found that SVD works best among the five methods, followed by KNN with five nearest neighbors and KNN with two nearest neighbors. In contrast, univariate mean and univariate median perform the worst and perform similarly to each other. Although the latter two univariate methods are very commonly used in practice, our results highlight the benefit of using machine learning methods for imputation to better predict expression levels across environments.
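The evaluation protocol described in the abstract (hide a fraction of the values, impute them, score with root mean squared error) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. The synthetic low-rank matrix stands in for the yeast dataset. The function names, the rank of 2, and the iteration count are all hypothetical. The abstract does not say which SVD or KNN implementation was used, so an iterative truncated-SVD fill and a simple row-wise KNN are swapped in here.

```python
import numpy as np


def make_low_rank(rng, n_genes=300, n_envs=13, rank=2, noise=0.1):
    """Synthetic genes-by-environments matrix (a stand-in, not the yeast data)."""
    scores = rng.normal(size=(n_genes, rank))
    loadings = rng.normal(size=(rank, n_envs))
    return scores @ loadings + noise * rng.normal(size=(n_genes, n_envs))


def mask_random(X, frac, rng):
    """Hide roughly `frac` of the entries; return the masked copy and the mask."""
    mask = rng.random(X.shape) < frac
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def impute_univariate(X, stat):
    """Fill each column's gaps with that column's observed mean or median."""
    return np.where(np.isnan(X), stat(X, axis=0), X)


def impute_knn(X, k):
    """Fill a gap with the average of the k most similar rows that observed it."""
    obs = ~np.isnan(X)
    O = obs.astype(float)
    X0 = np.where(obs, X, 0.0)
    # Squared distance between rows, summed over columns both rows observed.
    sq = (X0 ** 2) @ O.T + O @ (X0 ** 2).T - 2.0 * (X0 @ X0.T)
    shared = O @ O.T
    with np.errstate(divide="ignore", invalid="ignore"):
        dist = np.where(shared > 0, np.maximum(sq, 0.0) / shared, np.inf)
    np.fill_diagonal(dist, np.inf)
    filled = X.copy()
    for j in range(X.shape[1]):
        rows = np.flatnonzero(~obs[:, j])   # rows missing column j
        cand = np.flatnonzero(obs[:, j])    # rows that can donate a value
        if rows.size == 0 or cand.size == 0:
            continue  # nothing to fill, or no donor (gap is left as NaN)
        nearest = np.argsort(dist[np.ix_(rows, cand)], axis=1)[:, :k]
        filled[rows, j] = X[cand[nearest], j].mean(axis=1)
    return filled


def impute_svd(X, rank, n_iter=50):
    """Iteratively refit a rank-`rank` SVD and overwrite only the missing cells."""
    missing = np.isnan(X)
    filled = impute_univariate(X, np.nanmean)  # starting guess
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled[missing] = low_rank[missing]
    return filled


def rmse(X_true, X_hat, mask):
    """Root mean squared error over the hidden entries only."""
    return float(np.sqrt(np.mean((X_true[mask] - X_hat[mask]) ** 2)))


def evaluate(X, frac, rng, rank=2):
    """Mask `frac` of X, run the five imputers, and return RMSE per method."""
    X_missing, mask = mask_random(X, frac, rng)
    imputed = {
        "mean": impute_univariate(X_missing, np.nanmean),
        "median": impute_univariate(X_missing, np.nanmedian),
        "knn2": impute_knn(X_missing, 2),
        "knn5": impute_knn(X_missing, 5),
        "svd": impute_svd(X_missing, rank),
    }
    return {name: rmse(X, X_hat, mask) for name, X_hat in imputed.items()}


rng = np.random.default_rng(0)
X = make_low_rank(rng)
for frac in (0.05, 0.20, 0.50, 0.75):
    print(frac, evaluate(X, frac, rng))
```

Scoring only the hidden entries keeps the comparison fair, since every method leaves the observed values unchanged. On synthetic data the ranking of methods need not match the one reported in the abstract, especially at high missing fractions where an unregularized low-rank fit can overfit.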
