Combining phenotypic and genomic data to improve prediction of binary traits

IF 1.2 | Region 4, Mathematics | Q2 Statistics & Probability | Journal of Applied Statistics | Pub Date: 2023-05-16 | DOI: 10.1080/02664763.2023.2208773
D. Jarquin, A. Roy, B. Clarke, S. Ghosal

Abstract

Plant breeders want to develop cultivars that outperform existing genotypes. Some characteristics (here ‘main traits’) of these cultivars are categorical and difficult to measure directly. It is important to predict the main trait of newly developed genotypes accurately. In addition to marker data, breeding programs often have information on secondary traits (or ‘phenotypes’) that are easy to measure. Our goal is to improve prediction of main traits with interpretable relations by combining the two data types using variable selection techniques. However, the genomic characteristics can overwhelm the set of secondary traits, so a standard technique may fail to select any phenotypic variables. We develop a new statistical technique that ensures appropriate representation from both the secondary traits and the genotypic variables for optimal prediction. When two data types (markers and secondary traits) are available, we achieve improved prediction of a binary trait by two steps that are designed to ensure that a significant intrinsic effect of a phenotype is incorporated in the relation before accounting for extra effects of genotypes. First, we sparsely regress the secondary traits on the markers and replace the secondary traits by their residuals to obtain the effects of phenotypic variables as adjusted by the genotypic variables. Then, we develop a sparse logistic classifier using the markers and residuals so that the adjusted phenotypes may be selected first to avoid being overwhelmed by the genotypic variables due to their numerical advantage. This classifier uses forward selection aided by a penalty term and can be computed effectively by a technique called the one-pass method. It compares favorably with other classifiers on simulated and real data.
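The two steps described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify the sparse regression, the penalty, or the one-pass method, so the code swaps in a coordinate-descent lasso for step 1 and a plain greedy forward selection with a BIC-like penalty for step 2. All function names (`lasso_cd`, `fit_logistic`, `forward_select`, `two_step_classifier`), the tuning values, and the simulated data are hypothetical.

```python
import numpy as np

def soft(z, t):
    # Soft-thresholding operator used by the lasso update.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Lasso by cyclic coordinate descent:
    # minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 (X and y assumed centered).
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                      # running residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            r += X[:, j] * b[j]       # remove column j's contribution
            b[j] = soft(X[:, j] @ r / n, lam) / col_sq[j]
            r -= X[:, j] * b[j]       # add the updated contribution back
    return b

def fit_logistic(X, idx, y, ridge=1.0, n_iter=25):
    # Newton steps for logistic regression on the columns in idx.
    # The small ridge term is an assumption added for numerical stability.
    A = np.column_stack([np.ones(len(y)), X[:, np.asarray(idx, dtype=int)]])
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-np.clip(A @ w, -30, 30)))
        H = A.T @ (A * (prob * (1 - prob))[:, None]) + ridge * np.eye(A.shape[1])
        w = w + np.linalg.solve(H, A.T @ (y - prob) - ridge * w)
    prob = np.clip(1 / (1 + np.exp(-np.clip(A @ w, -30, 30))), 1e-9, 1 - 1e-9)
    nll = -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    return w, nll

def forward_select(X, y, candidates, selected, penalty):
    # Greedy forward selection on: negative log-likelihood + penalty * model size.
    selected = list(selected)
    best = fit_logistic(X, selected, y)[1] + penalty * len(selected)
    while True:
        scores = {j: fit_logistic(X, selected + [j], y)[1]
                     + penalty * (len(selected) + 1)
                  for j in candidates if j not in selected}
        if not scores:
            break
        j = min(scores, key=scores.get)
        if scores[j] >= best:         # no candidate improves the criterion
            break
        selected.append(j)
        best = scores[j]
    return selected

def two_step_classifier(markers, traits, y, lam=0.1, penalty=None):
    n = len(y)
    penalty = 0.5 * np.log(n) if penalty is None else penalty  # BIC-like (assumption)
    Mc = markers - markers.mean(axis=0)
    # Step 1: sparsely regress each secondary trait on the markers, keep residuals.
    resid = np.column_stack([(t - t.mean()) - Mc @ lasso_cd(Mc, t - t.mean(), lam)
                             for t in traits.T])
    X = np.column_stack([resid, Mc])  # columns 0..q-1: residuals, then markers
    q = resid.shape[1]
    # Step 2: offer the adjusted phenotypes first, then the markers.
    sel = forward_select(X, y, range(q), [], penalty)
    sel = forward_select(X, y, range(q, X.shape[1]), sel, penalty)
    w, _ = fit_logistic(X, sel, y)
    return sel, w

# Simulated example (hypothetical data, fixed seed).
rng = np.random.default_rng(0)
n, p = 300, 20
markers = rng.binomial(2, 0.5, size=(n, p)).astype(float)
noise = rng.normal(size=(n, 2))
traits = np.column_stack([0.8 * markers[:, 0] + noise[:, 0], noise[:, 1]])
eta = 2.0 * noise[:, 0] + 1.5 * (markers[:, 1] - 1.0)
y = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(float)
sel, w = two_step_classifier(markers, traits, y)
print("selected columns:", sel)
```

In this sketch, selected indices below the number of secondary traits refer to residualized phenotypes and the remaining indices refer to markers. Running the phenotype stage before the marker stage is one simple way to keep the far more numerous markers from crowding out the phenotypes; the paper's actual ordering and penalty may differ.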
Source journal
Journal of Applied Statistics (Mathematics: Statistics & Probability)
CiteScore: 3.40
Self-citation rate: 0.00%
Articles published: 126
Review time: 6 months
Journal description: Journal of Applied Statistics provides a forum for communication between applied statisticians and users of applied statistical techniques across a wide range of disciplines. These areas include business, computing, economics, ecology, education, management, medicine, operational research and sociology, but papers from other areas are also considered. The editorial policy is to publish rigorous but clear and accessible papers on applied techniques. Purely theoretical papers are avoided, but those on theoretical developments that clearly demonstrate significant applied potential are welcomed. Each paper is submitted to at least two independent referees.