Abdul Muntakim Rafi, Daria Nogina, Dmitry Penzar, Dohoon Lee, Danyeong Lee, Nayeon Kim, Sangyeup Kim, Dohyeon Kim, Yeojin Shin, Il-Youp Kwak, Georgy Meshcheryakov, Andrey Lando, Arsenii Zinkevich, Byeong-Chan Kim, Juhyun Lee, Taein Kang, Eeshit Dhaval Vaishnav, Payman Yadollahpour, Sun Kim, Jake Albrecht, Aviv Regev, Wuming Gong, Ivan V Kulakovskiy, Pablo Meyer, Carl de Boer
DOI: 10.1101/2023.04.26.538471
Publication date: 2024-02-17
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10888977/pdf/
Citations: 0
Abstract
Evaluation and optimization of sequence-based gene regulatory deep learning models.
Neural networks have emerged as immensely powerful tools in predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to best capture the relationship between regulatory DNA and gene expression. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. While some benchmarks produced similar results across the top-performing models, others differed substantially. All top-performing models used neural networks, but diverged in architectures and novel training strategies, tailored to genomics sequence data. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide any given model into logically equivalent building blocks. We tested all possible combinations for the top three models and observed performance improvements for each. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets. Overall, we demonstrate that high-quality gold-standard genomics datasets can drive significant progress in model development.
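The idea of testing "all possible combinations" of building blocks from the top three models can be illustrated with a minimal sketch. The stage names and team labels below are hypothetical placeholders (the abstract does not list the actual blocks), and this is not the Prix Fixe implementation itself, only an enumeration of the kind of search space such a framework implies.

```python
from itertools import product

# Hypothetical stage names: the abstract does not enumerate the real blocks.
STAGES = ("data_and_training", "first_layers", "core_layers", "final_layers")

# Placeholder labels standing in for the three top-ranked submissions.
TEAMS = ("team_A", "team_B", "team_C")


def enumerate_combinations(stages=STAGES, teams=TEAMS):
    """Yield every way of picking one team's block for each stage."""
    for choice in product(teams, repeat=len(stages)):
        # Map each stage to the team whose block fills it.
        yield dict(zip(stages, choice))


combos = list(enumerate_combinations())
print(len(combos))  # 3 teams ** 4 stages = 81
```

Under these assumptions each combination is a candidate model to train and score on the benchmark suite. In practice some pairings of blocks may be incompatible and would need to be filtered out before training.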
About the journal:
The Elementary School Journal has served researchers, teacher educators, and practitioners in elementary and middle school education for over one hundred years. ESJ publishes peer-reviewed articles dealing with both education theory and research and their implications for teaching practice. In addition, ESJ presents articles that relate the latest research in child development, cognitive psychology, and sociology to school learning and teaching. ESJ prefers to publish original studies that contain data about school and classroom processes in elementary or middle schools, while occasionally publishing integrative research reviews and in-depth conceptual analyses of schooling.