Breaking down data silos across companies to train genome-wide predictions: A feasibility study in wheat

Plant Biotechnology Journal (IF 10.5; Zone 1, Biology; JCR Q1, Biotechnology & Applied Microbiology) | Pub Date: 2025-04-20 | DOI: 10.1111/pbi.70095
Moritz Lell, Abhishek Gogna, Vincent Kloesgen, Ulrike Avenhaus, Jost Dörnte, Wera Maria Eckhoff, Tobias Eschholz, Mario Gils, Martin Kirchhoff, Michael Koch, Sonja Kollers, Nina Pfeiffer, Matthias Rapp, Valentin Wimmer, Markus Wolf, Jochen Reif, Yusheng Zhao
Plant Biotechnology Journal 23(7): 2704-2719. Full text: https://onlinelibrary.wiley.com/doi/10.1111/pbi.70095

Abstract

Big data, combined with artificial intelligence (AI) techniques, holds the potential to significantly enhance the accuracy of genome-wide predictions. Motivated by the success reported for wheat hybrids, we extended the scope to inbred lines by integrating phenotypic and genotypic data from four commercial wheat breeding programs. Acting as an academic data trustee, we merged these data with historical experimental series from previous public–private partnerships. The integrated data spanned 12 years, 168 environments, and provided a genomic prediction training set of up to ~9500 genotypes for grain yield, plant height and heading date. Despite the heterogeneous phenotypic and genotypic data, we were able to obtain high-quality data by implementing rigorous data curation, including SNP imputation. We utilized the data to compare genomic best linear unbiased predictions with convolutional neural network-based genomic prediction. Our analysis revealed that we could flexibly combine experimental series for genomic prediction, with prediction ability steadily improving as the training set sizes increased, peaking at around 4000 genotypes. As training set sizes were further increased, the gains in prediction ability decreased, approaching a plateau well below the theoretical limit defined by the square root of the heritability. Potential avenues, such as designed training sets or novel non-linear prediction approaches, could overcome this plateau and help to more fully exploit the high-value big data generated by breaking down data silos across companies.
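The comparison between prediction ability and the square root of the heritability can be illustrated with a minimal sketch. It is not the authors' pipeline: the markers and phenotypes are simulated, the heritability is assumed known, and every function name and parameter value (`vanraden_g`, `gblup_predict`, `ability_by_training_size`, 300 markers, h2 = 0.5) is hypothetical. It fits a basic genomic best linear unbiased prediction (GBLUP) model on growing training sets and reports the correlation between predicted and observed phenotypes in a held-out set.

```python
import numpy as np


def vanraden_g(markers):
    """Genomic relationship matrix from a 0/1/2 marker matrix (VanRaden, method 1)."""
    p = markers.mean(axis=0) / 2.0          # allele frequency per marker
    centered = markers - 2.0 * p            # centre each marker column
    return centered @ centered.T / (2.0 * np.sum(p * (1.0 - p)))


def gblup_predict(G, y_train, train_idx, test_idx, h2):
    """Predict test phenotypes with GBLUP, assuming the heritability h2 is known."""
    lam = (1.0 - h2) / h2                   # ratio of residual to genetic variance
    mu = y_train.mean()                     # simple estimate of the overall mean
    G_tt = G[np.ix_(train_idx, train_idx)]  # train x train relationships
    G_st = G[np.ix_(test_idx, train_idx)]   # test x train relationships
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + G_st @ alpha


def simulate(n, m, h2, seed):
    """Simulate unrelated genotypes with m additive markers (an assumption)."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=m)
    markers = rng.binomial(2, freqs, size=(n, m)).astype(float)
    g = markers @ rng.normal(size=m)
    g = (g - g.mean()) / g.std()            # genetic variance scaled to 1
    e = rng.normal(scale=np.sqrt((1.0 - h2) / h2), size=n)
    return markers, g + e


def ability_by_training_size(train_sizes, n_test=400, m=300, h2=0.5, seed=0):
    """Prediction ability (correlation of predicted and observed) per training size."""
    n = max(train_sizes) + n_test
    markers, y = simulate(n, m, h2, seed)
    G = vanraden_g(markers)
    test_idx = np.arange(n - n_test, n)     # last n_test genotypes are held out
    abilities = []
    for size in train_sizes:
        train_idx = np.arange(size)
        y_hat = gblup_predict(G, y[train_idx], train_idx, test_idx, h2)
        abilities.append(float(np.corrcoef(y_hat, y[test_idx])[0, 1]))
    return abilities


if __name__ == "__main__":
    sizes = [50, 100, 200, 400]
    h2 = 0.5
    for size, r in zip(sizes, ability_by_training_size(sizes, h2=h2)):
        print(f"n_train={size:4d}  prediction ability={r:.2f}")
    print(f"upper bound sqrt(h2)={np.sqrt(h2):.2f}")
```

In this toy setting the reference value sqrt(h2) is the correlation a perfect predictor of the genetic values would reach with the observed phenotypes, which is why prediction ability is compared against it. Real breeding data differ from the simulation in relatedness, linkage disequilibrium and genotype-by-environment interaction, so the numbers here say nothing about the study's results.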
