XGBoost as a reliable machine learning tool for predicting ancestry using autosomal STR profiles - Proof of method

IF 3.2 · Zone 2 (Medicine) · Q2 Genetics & Heredity · Forensic Science International-Genetics · Pub Date: 2024-11-29 · DOI: 10.1016/j.fsigen.2024.103183

Dejan Šorgić, Aleksandra Stefanović, Dušan Keckarević, Mladen Popović

Forensic Science International-Genetics, volume 76, Article 103183. Source: https://www.sciencedirect.com/science/article/pii/S1872497324001790


Abstract

The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Frequencies of 29 genetic markers, taken from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, Hispanic Americans), were used to generate 360,000 profiles (90,000 per group). These profiles were then used to train and test a range of machine learning algorithms, with the goal of identifying the best-performing model for ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, XGBoost, among others) were implemented in Python, and their performance was compared. The XGBoost model outperformed the others, showing strong predictive power with an accuracy of 94.24% across all four classes, 99.06% on a differentiation task involving Asian, African American, and Caucasian subsamples, and 98.57% when differentiating between African Americans, Asians, and a mixed group combining Caucasians and Hispanics. Evaluating the impact of training set size showed that model accuracy peaked at 94% with 90,000 profiles per category but decreased to 83% when the number of profiles per category was reduced to 500, with precision particularly affected when distinguishing between the Caucasian and Hispanic subgroups. The study further investigated the impact of the number of markers on model accuracy, finding that using 21 markers, commonly available in commercial amplification kits, resulted in an accuracy of 96.3% for African Americans, Asians, and Caucasians, and 88.28% for all four groups combined.
These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.
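The profile-generation step described in the abstract can be illustrated with a minimal sketch. It assumes that each genotype is formed by drawing two alleles independently from a population's frequency table (Hardy-Weinberg proportions) and that markers are independent; the abstract does not state how the authors sampled, so both are assumptions. The frequency values, the population labels `PopA`/`PopB`, and the helper names `simulate_profile` and `build_dataset` are hypothetical and are not taken from the Promega data or the paper.

```python
import random

# Hypothetical allele frequencies: {population: {marker: {allele: frequency}}}.
# The numbers are invented for illustration; they are NOT the Promega values.
TOY_FREQS = {
    "PopA": {
        "D3S1358": {15: 0.3, 16: 0.3, 17: 0.4},
        "TH01": {6: 0.2, 7: 0.5, 9.3: 0.3},
    },
    "PopB": {
        "D3S1358": {15: 0.6, 16: 0.3, 17: 0.1},
        "TH01": {6: 0.5, 7: 0.2, 9.3: 0.3},
    },
}


def simulate_profile(marker_freqs, rng):
    """Draw one genotype per marker.

    Assumption: the two alleles are sampled independently from the
    population frequencies, and markers are independent of each other.
    """
    profile = {}
    for marker, freqs in marker_freqs.items():
        alleles = list(freqs)
        weights = list(freqs.values())
        a, b = rng.choices(alleles, weights=weights, k=2)
        # Store the genotype in sorted order so (15, 16) and (16, 15) match.
        profile[marker] = tuple(sorted((a, b)))
    return profile


def build_dataset(freqs_by_pop, n_per_pop, seed=0):
    """Return (X, y, markers): one row of allele values per simulated profile."""
    rng = random.Random(seed)
    markers = sorted(next(iter(freqs_by_pop.values())))
    X, y = [], []
    for label, pop in enumerate(sorted(freqs_by_pop)):
        for _ in range(n_per_pop):
            profile = simulate_profile(freqs_by_pop[pop], rng)
            # Two numeric features per marker: the smaller and larger allele.
            X.append([allele for m in markers for allele in profile[m]])
            y.append(label)
    return X, y, markers


X, y, markers = build_dataset(TOY_FREQS, n_per_pop=1000)
```

Rows in this form could then be split into training and test sets and passed to a classifier such as `xgboost.XGBClassifier`. The feature encoding and hyperparameters the authors actually used are not given in the abstract, so the two-columns-per-marker layout here is only one plausible choice.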


Source journal

Forensic Science International-Genetics (Biology / Medicine: Legal)

CiteScore: 7.50

Self-citation rate: 32.30%

Articles published: 132

Review time: 11.3 weeks

About the journal: Forensic Science International: Genetics is the premier journal in the field of Forensic Genetics. This branch of Forensic Science can be defined as the application of genetics to human and non-human material (in the sense of a science with the purpose of studying inherited characteristics for the analysis of inter- and intra-specific variations in populations) for the resolution of legal conflicts. The scope of the journal includes: Forensic applications of human polymorphism. Testing of paternity and other family relationships, immigration cases, typing of biological stains and tissues from criminal casework, identification of human remains by DNA testing methodologies. Description of human polymorphisms of forensic interest, with special interest in DNA polymorphisms. Autosomal DNA polymorphisms, mini- and microsatellites (or short tandem repeats, STRs), single nucleotide polymorphisms (SNPs), X and Y chromosome polymorphisms, mtDNA polymorphisms, and any other type of DNA variation with potential forensic applications. Non-human DNA polymorphisms for crime scene investigation. Population genetics of human polymorphisms of forensic interest. Population data, especially from DNA polymorphisms of interest for the solution of forensic problems. DNA typing methodologies and strategies. Biostatistical methods in forensic genetics. Evaluation of DNA evidence in forensic problems (such as paternity or immigration cases, criminal casework, identification), classical and new statistical approaches. Standards in forensic genetics. Recommendations of regulatory bodies concerning methods, markers, interpretation or strategies or proposals for procedural or technical standards. Quality control. Quality control and quality assurance strategies, proficiency testing for DNA typing methodologies. Criminal DNA databases. Technical, legal and statistical issues. General ethical and legal issues related to forensic genetics.