大规模遗传数据的统计学习:如何使用1000基因组计划数据进行基因表达数据的全基因组关联研究

Pub Date : 2023-07-01 DOI:10.1007/s12561-023-09375-9

Anton Sugolov, Eric Emmenegger, Andrew D. Paterson, Lei Sun

{"title":"大规模遗传数据的统计学习:如何使用1000基因组计划数据进行基因表达数据的全基因组关联研究","authors":"Anton Sugolov, Eric Emmenegger, Andrew D. Paterson, Lei Sun","doi":"10.1007/s12561-023-09375-9","DOIUrl":null,"url":null,"abstract":"Abstract Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain $$\\sim$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:mo>∼</mml:mo> </mml:math> 1.4 million p-values from a real scientific study, within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on Github and Zenodo to further enhance reproducible research and training.","PeriodicalId":0,"journal":{"name":"","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data\",\"authors\":\"Anton Sugolov, Eric Emmenegger, Andrew D. Paterson, Lei Sun\",\"doi\":\"10.1007/s12561-023-09375-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain $$\\\\sim$$ <mml:math xmlns:mml=\\\"http://www.w3.org/1998/Math/MathML\\\"> <mml:mo>∼</mml:mo> </mml:math> 1.4 million p-values from a real scientific study, within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on Github and Zenodo to further enhance reproducible research and training.\",\"PeriodicalId\":0,\"journal\":{\"name\":\"\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s12561-023-09375-9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s12561-023-09375-9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

通过对当代大规模数据集的应用来教授统计学对于吸引学生进入该领域至关重要。为此，我们为没有统计遗传学知识但具有数据科学基础知识的高中生或初中生开发了一个为期一周的实践研讨会，以进行他们自己的全基因组关联研究(GWAS)。GWAS使用公开可用的人类遗传学数据，对开源基因表达数据进行分析。在详细的指导手册的帮助下，学生们能够在几天内从真正的科学研究中获得$$\sim$$ ～ 140万个p值。这种早期的动机使学生们致力于学习支持他们结果的理论，包括回归、数据可视化、结果解释和大规模的多重假设检验。为了通过强调与这类数据分析的个人联系来进一步提高他们的学习动机，学生们被鼓励就GWAS如何为他们的朋友或家人中存在的疾病的遗传基础提供见解进行简短的介绍。所附的开源分步指导手册包括对所使用的数据集、所需软件和研讨会结果的描述。此外，研讨会中使用的脚本存档在Github和Zenodo上，以进一步增强可重复的研究和培训。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data

Abstract Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain $$\sim$$ ∼ 1.4 million p-values from a real scientific study, within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on Github and Zenodo to further enhance reproducible research and training.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助