XGBoost as a reliable machine learning tool for predicting ancestry using autosomal STR profiles - Proof of method.

Dejan Šorgić, Aleksandra Stefanović, Dušan Keckarević, Mladen Popović
{"title":"XGBoost as a reliable machine learning tool for predicting ancestry using autosomal STR profiles - Proof of method.","authors":"Dejan Šorgić, Aleksandra Stefanović, Dušan Keckarević, Mladen Popović","doi":"10.1016/j.fsigen.2024.103183","DOIUrl":null,"url":null,"abstract":"<p><p>The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Frequencies of 29 genetic markers from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, Hispanic Americans) were used to generate 360,000 profiles (90000 profiles per group), which were later used to train and test a range of machine learning algorithms with the goal of establishing the most optimal model for accurate ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, XGBoost, among others) were deployed in Python, and their performance was compared. The XGBoost model outperformed others, displaying significant predictive power with an accuracy rating of 94.24 % for all four classes, and an accuracy rating of 99.06 % on a differentiation task involving Asian, African American, and Caucasian subsamples and an accuracy rating of 98.57 % when differentiating between the African-American, Asian, and the mixed group combining Caucasians and Hispanics. Evaluating the impact of training set size revealed that model accuracy peaked at 94 % with 90,000 profiles per category, but decreased to 83 % as the number of profiles per category was reduced to 500, particularly affecting precision when distinguishing between Caucasian and Hispanic subgroups. The study further investigated the impact of marker quantity on model accuracy, finding that the use of 21 markers, commonly available in commercial amplification kits, resulted in an accuracy of 96.3 % for African Americans, Asians, and Caucasians, and 88.28 % for all four groups combined. These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.</p>","PeriodicalId":94012,"journal":{"name":"Forensic science international. Genetics","volume":"76 ","pages":"103183"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Forensic science international. Genetics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.fsigen.2024.103183","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Frequencies of 29 genetic markers from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, Hispanic Americans) were used to generate 360,000 profiles (90000 profiles per group), which were later used to train and test a range of machine learning algorithms with the goal of establishing the most optimal model for accurate ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, XGBoost, among others) were deployed in Python, and their performance was compared. The XGBoost model outperformed others, displaying significant predictive power with an accuracy rating of 94.24 % for all four classes, and an accuracy rating of 99.06 % on a differentiation task involving Asian, African American, and Caucasian subsamples and an accuracy rating of 98.57 % when differentiating between the African-American, Asian, and the mixed group combining Caucasians and Hispanics. Evaluating the impact of training set size revealed that model accuracy peaked at 94 % with 90,000 profiles per category, but decreased to 83 % as the number of profiles per category was reduced to 500, particularly affecting precision when distinguishing between Caucasian and Hispanic subgroups. The study further investigated the impact of marker quantity on model accuracy, finding that the use of 21 markers, commonly available in commercial amplification kits, resulted in an accuracy of 96.3 % for African Americans, Asians, and Caucasians, and 88.28 % for all four groups combined. These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
The IPEFA model: An initiative for online training and education as applied by the International Society for Forensic Genetics. Expression of Concern "Population data of 17 Y-STR loci in Nanyang Han population from Henan Province, Central China" [Forensic Sci. Int. Gene. 13 (2014) 145-146]. Expression of Concern "Population genetics of 17 Y-STR loci in a large Chinese Han population from Zhejiang Province, Eastern China" [Forensic Sci. Int. Genet. 5 (2011) e11-e13]. Expression of Concern: "Genetic population data of Yfiler Plus kit from 1434 unrelated Hans in Henan Province (Central China)" [Forensic Sci. Int. Genet. 22 (2016) e25-e27]. Expression of Concern: "Genetic profile of 17 Y chromosome STRs in the Guizhou Han population of southwestern China" [Forensic Sci. Int. Genet. 25 (2016) e6-e7].
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1