XGBoost as a reliable machine learning tool for predicting ancestry using autosomal STR profiles - Proof of method.

Dejan Šorgić, Aleksandra Stefanović, Dušan Keckarević, Mladen Popović
Forensic Science International: Genetics, vol. 76, article 103183. Journal article, published 2024-11-29. DOI: 10.1016/j.fsigen.2024.103183 (https://doi.org/10.1016/j.fsigen.2024.103183)

Abstract

The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Frequencies of 29 genetic markers from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, Hispanic Americans) were used to generate 360,000 profiles (90,000 per group), which were then used to train and test a range of machine learning algorithms with the goal of identifying the best model for accurate ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, XGBoost, among others) were implemented in Python, and their performance was compared. The XGBoost model outperformed the others, with an accuracy of 94.24 % across all four classes, 99.06 % when differentiating among Asian, African American, and Caucasian subsamples, and 98.57 % when differentiating among African American, Asian, and a mixed group combining Caucasians and Hispanics. Evaluating the impact of training set size revealed that model accuracy peaked at 94 % with 90,000 profiles per category but decreased to 83 % when the number of profiles per category was reduced to 500, particularly affecting precision when distinguishing between Caucasian and Hispanic subgroups. The study further investigated the impact of marker quantity on model accuracy, finding that the use of 21 markers, commonly available in commercial amplification kits, resulted in an accuracy of 96.3 % for African Americans, Asians, and Caucasians, and 88.28 % for all four groups combined. These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.
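
The abstract does not spell out how the synthetic profiles were drawn from the published frequencies. One plausible reading, offered here only as a minimal sketch, is that two alleles are sampled independently at each marker according to the group's allele frequencies (that is, assuming Hardy-Weinberg and linkage equilibrium). The frequency table `TOY_FREQS`, the group and locus names, and the helper functions below are all hypothetical placeholders and are not the Promega data or the authors' code.

```python
import random

# Hypothetical toy allele frequencies (NOT the Promega data):
# two groups, two loci, allele (repeat count) -> frequency.
TOY_FREQS = {
    "GroupA": {
        "LocusX": {10: 0.6, 11: 0.3, 12: 0.1},
        "LocusY": {7: 0.2, 8: 0.5, 9: 0.3},
    },
    "GroupB": {
        "LocusX": {10: 0.1, 11: 0.3, 12: 0.6},
        "LocusY": {7: 0.5, 8: 0.3, 9: 0.2},
    },
}


def simulate_profile(locus_freqs, rng):
    """Draw two alleles per locus independently of each other and of other loci.

    This is an assumption (Hardy-Weinberg and linkage equilibrium), not a
    procedure confirmed by the source. Each allele pair is sorted so that
    the same genotype always yields the same feature order.
    """
    profile = []
    for locus in sorted(locus_freqs):
        alleles = list(locus_freqs[locus].keys())
        weights = list(locus_freqs[locus].values())
        pair = sorted(rng.choices(alleles, weights=weights, k=2))
        profile.extend(pair)
    return profile


def build_dataset(freqs, n_per_group, seed=0):
    """Return (X, y): one feature row per simulated profile, integer class labels."""
    rng = random.Random(seed)
    X, y = [], []
    for label, group in enumerate(sorted(freqs)):
        for _ in range(n_per_group):
            X.append(simulate_profile(freqs[group], rng))
            y.append(label)
    return X, y


# 500 profiles per group mirrors the smallest training size mentioned in the abstract.
X, y = build_dataset(TOY_FREQS, n_per_group=500)
```

Under these assumptions, `X` and `y` form a balanced labeled set that could be split into training and test portions and passed to any multiclass classifier, for example `xgboost.XGBClassifier`. The abstract gives no feature encoding or hyperparameters, so those choices would be the reader's own.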
