Pub Date : 2024-12-13 DOI: 10.1016/j.fsigen.2024.103211
Amber C W Vandepoele, Natalie Novotna, Dan Myers, Michael A Marciano
Short tandem repeat (STR) analysis is a robust and reliable DNA analysis technique that aids in source identification of a biological sample. However, interpretation, particularly when DNA mixtures are present at low levels, can be complicated by PCR artifacts, most commonly stutter. Stutter products increase the difficulty of interpreting DNA mixtures as well as low-level DNA samples down to a single cell. Stutter product formation is stochastic in nature and, although methods exist to estimate its magnitude, it is still not well understood. With the increased sensitivity of forensic DNA analyses, it has become possible to obtain interpretable DNA profiles from as little as 6.6 pg of DNA, or a single human diploid cell. However, this presents an interpretational challenge because the stutter in these low-level DNA samples might stray from the expected patterns observed in high-level DNA samples. Therefore, this project focuses on characterizing stutter in single-cell samples to generate a deeper understanding of stutter and provide a guide for detecting and evaluating stutter in low-level samples. Stutter analysis was performed using data generated from 180 single cells isolated with the DEPArray™ NxT and amplified using the PowerPlex Fusion 6C amplification kit at 29 or 30 cycles. Stutter was successfully characterized in single cells, and stutter percentages were highly elevated compared to high-level samples; the variance increased as the number of cells analyzed decreased, leading to potentially high stutter at low DNA levels. Using empirical and simulated (resampled) data, this study also reinforces historically relevant patterns in stutter product formation and demonstrates the relative differences among n-1, n-2 and n+1 stutter product formation in simple, complex and compound repeats.
Title : Characterizing stutter in single cells and the impact on multi-cell analysis.
Journal : Forensic science international. Genetics, volume 76, article 103211
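The abstract reports stutter percentages without restating how they are computed. A minimal sketch, assuming the conventional definition (stutter peak height divided by the parent allele peak height, times 100); the peak heights and the `stutter_percent` helper are hypothetical illustrations, not data or code from the study.

```python
def stutter_percent(stutter_height: float, parent_height: float) -> float:
    """Stutter peak height as a percentage of its parent allele peak height."""
    if parent_height <= 0:
        raise ValueError("parent peak height must be positive")
    return 100.0 * stutter_height / parent_height


# Hypothetical peak heights (RFU) for one allele and its stutter positions.
peaks = {"parent": 1200.0, "n-1": 96.0, "n-2": 12.0, "n+1": 18.0}

# One stutter percentage per stutter position, relative to the parent peak.
ratios = {
    position: stutter_percent(height, peaks["parent"])
    for position, height in peaks.items()
    if position != "parent"
}

for position, value in ratios.items():
    print(f"{position}: {value:.1f} %")
```

With few template molecules, both peak heights vary more from one amplification to the next, so a ratio of this kind would be expected to vary more as well, which is in line with the increased variance the abstract describes.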
Pub Date : 2024-12-10 DOI: 10.1016/j.fsigen.2024.103208
Séverine Nozownik, Tacha Hicks, Patrick Basset, Vincent Castella
In most National DNA databases (NDNADB), only single source DNA profiles, and sometimes two-person DNA mixtures, can be searched provided a minimum number of loci (or alleles) is available. DNA profiles that do not meet these criteria (about 14 % of the traces analyzed in Western Switzerland) can be compared locally with candidates upon request from police services, used for one-off search, or remain unused. With the advent of probabilistic genotyping (PG), such complex DNA profiles can be compared to those stored in NDNADB based on likelihood ratios (LRs). In this pilot study, traces of known contributors and casework DNA profiles were used to evaluate the performance of the DBLR™ "Search database" tool in conjunction with the Swiss NDNADB. First, 40 DNA mixtures (2-5 contributors) from 15 volunteers were prepared in the wet laboratory. They were deconvoluted with STRmix™ and compared to a database containing the DNA profiles of these 15 volunteers, along with 174,493 person DNA profiles from the Swiss NDNADB (ground-truth experiments). Using LR thresholds of 10^3 and 10^6, sensitivity and specificity were respectively 90.0 %/57.1 % and 99.9 %/100.0 %. For the lower LR threshold, this resulted in 52 adventitious associations out of more than 24 million pairwise comparisons. Second, 160 DNA mixture profiles from casework (2-4 contributors) that had previously been locally compared were searched with DBLR™ using the same conditions as for phase 1. With the 10^3 LR threshold, 380 associations were retrieved: 194 of these corresponded to expected associations, as they were previously made through the local comparisons with known persons, and 186 were new. With the 10^6 LR threshold, 199 associations were recovered of which 180 were expected and 19 new. This demonstrates that even with complex DNA profiles (up to 4 contributors) all expected associations were retrieved with a limited number of candidates per trace. 
Database searches of complex DNA mixtures allow for the generation of leads early in an investigation for DNA profiles that might otherwise remain underutilized. Next steps for the possible integration of DBLR™ or similar software within an operational context will require discussions on legal, financial, and technical aspects among stakeholders.
Title : Searching national DNA databases with complex DNA profiles: An empirical study using probabilistic genotyping.
Journal : Forensic science international. Genetics, volume 76, article 103208
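The trade-off between the two LR thresholds can be sketched as follows. This is an illustration only: the log10(LR) values and ground-truth labels are invented, and `sensitivity_specificity` is a hypothetical helper, not part of DBLR™ or STRmix™. How the study itself counted comparisons when computing these rates is not stated in the abstract.

```python
def sensitivity_specificity(log10_lrs, is_true_contributor, log10_threshold):
    """Treat a comparison as an association when log10(LR) >= threshold."""
    tp = fp = tn = fn = 0
    for log10_lr, truth in zip(log10_lrs, is_true_contributor):
        reported = log10_lr >= log10_threshold
        if reported and truth:
            tp += 1          # true contributor retrieved
        elif reported and not truth:
            fp += 1          # adventitious association
        elif truth:
            fn += 1          # true contributor missed
        else:
            tn += 1          # non-contributor correctly excluded
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical comparisons: four true contributors, four non-contributors.
log10_lrs = [8.2, 4.5, 2.1, 6.7, 3.4, -1.0, 0.5, 3.2]
truth = [True, True, True, True, False, False, False, False]

for threshold in (3.0, 6.0):
    sens, spec = sensitivity_specificity(log10_lrs, truth, threshold)
    print(f"log10(LR) >= {threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the threshold removes adventitious associations at the cost of missed true contributors, which is the pattern the ground-truth experiments report. The casework counts are also internally consistent: 194 + 186 = 380 at the 10^3 threshold and 180 + 19 = 199 at the 10^6 threshold.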
Pub Date : 2024-12-09 DOI: 10.1016/j.fsigen.2024.103206
Peter Resutik, Joëlle Schneider, Simon Aeschbacher, Magnus Dehli Vigeland, Mario Gysi, Corinne Moser, Chiara Barbieri, Paul Widmer, Mathias Currat, Adelgunde Kratzer, Michael Krützen, Cordula Haas, Natasha Arora
Since leaving Africa, human populations have gone through a series of range expansions. While the genomic signatures of these expansions are well detectable on a continental scale, the genomic consequences of small-scale expansions over shorter time spans are more challenging to disentangle. The medieval migration of the Walser people from their homeland in southern Switzerland (Upper Valais) into other regions of the Alps is a good example of such a comparatively recent geographic and demographic expansion in humans. While several studies from the 1980s, based on allozyme markers, assessed levels of isolation and inbreeding in individual Walser communities, they mostly did so by focusing on a single community at a time. Here, we provide a comprehensive overview of genetic diversity and differentiation based on samples from multiple Walser, Walser-homeland, and non-Walser Alpine communities, along with an idealized (simulated) Swiss reference population (Ref-Pop). To explore genetic signals of the Walser migration in the genomes of their descendants, we use a set of forensic autosomal STRs as well as uniparental markers. Estimates of pairwise F_ST based on autosomal STRs reveal that the Walser-homeland and Walser communities show low to moderate genetic differentiation from the non-Walser Alpine communities and the idealized Ref-Pop. The geographically more remote and likely more isolated Walser-homeland community of Lötschental and the Walser communities of Vals and Gressoney appear genetically more strongly differentiated than other communities. Analyses of mitochondrial DNA revealed the presence of haplogroup W6 among the Walser communities, a haplogroup that is otherwise rare in central Europe. 
Our study contributes to the understanding of genetic diversity in the Walser-homeland and Walser people, but also highlights the need for a more comprehensive study of the population genetic structure and evolutionary history of European Alpine populations using genome-wide data.
Title : Uncovering genetic signatures of the Walser migration in the Alps: Patterns of diversity and differentiation.
Journal : Forensic science international. Genetics, volume 76, article 103206
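The abstract does not state which F_ST estimator was used. As a rough illustration of what a pairwise F_ST measures, the sketch below computes a simple heterozygosity-based value, (H_T - H_S) / H_T, at one locus for two equally weighted populations. The allele frequencies are hypothetical, and the estimators used in practice add sample-size corrections that are omitted here.

```python
def expected_heterozygosity(freqs):
    """1 - sum(p_i^2) for one locus, given allele -> frequency."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_fst(freqs_a, freqs_b):
    """Single-locus (H_T - H_S) / H_T for two equally weighted populations."""
    alleles = set(freqs_a) | set(freqs_b)
    # Pooled frequencies: simple average of the two populations.
    pooled = {a: (freqs_a.get(a, 0.0) + freqs_b.get(a, 0.0)) / 2 for a in alleles}
    h_s = (expected_heterozygosity(freqs_a) + expected_heterozygosity(freqs_b)) / 2
    h_t = expected_heterozygosity(pooled)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical allele frequencies at one STR locus in two communities.
pop_a = {12: 0.5, 13: 0.5}
pop_b = {12: 0.9, 13: 0.1}

print(round(pairwise_fst(pop_a, pop_b), 4))
print(round(pairwise_fst(pop_a, pop_a), 4))
```

Identical frequency tables give a value of zero, and the value grows as the two populations' allele frequencies diverge, which is the sense in which the more isolated communities appear "more strongly differentiated".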
Pub Date : 2024-12-07 DOI: 10.1016/j.fsigen.2024.103207
Alexandre Gouy, Martin Zieger
Population data in forensic genetics must be checked for a variety of statistical parameters before they can be employed for casework. Several tools exist to perform such tasks; however, it can become challenging to obtain the right results due to the number of software tools involved and the broad range of input formats. Furthermore, a substantial amount of experience is required to use some of these programs. To overcome these difficulties, we have developed STRAF (STR Analysis for Forensics), a convenient online tool to analyse STR data in forensic genetics. Since its first release in 2017, it has been used in many studies to report allele frequencies, forensic and population genetics parameters, and to explore genetic datasets interactively through a user-friendly interface. Herewith, we introduce the latest version of the STRAF software and the improvements we have implemented over recent years. STRAF 2 includes several new features, such as new statistical methods (multidimensional scaling, comparison to a reference population, haplotype diversities and frequencies) and file conversion utilities. Performance and user experience have also been improved and documentation has been extended. This new version is freely available as an R package (https://github.com/agouy/straf) and a web application (https://straf.fr).
Title : STRAF 2: New features and improvements of the STR population data analysis software.
Journal : Forensic science international. Genetics, volume 76, article 103207
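To make the reported quantities concrete, here is a minimal Python sketch of three statistics of the kind listed above (allele frequencies, heterozygosity and power of discrimination) for a single STR locus. It does not use or reproduce STRAF's code or interface; the genotypes and function names are hypothetical, and the estimators are the simple uncorrected forms.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Relative frequency of each allele across all diploid genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(freqs):
    """Gene diversity without sample-size correction: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs.values())


def observed_heterozygosity(genotypes):
    """Share of individuals carrying two different alleles."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def power_of_discrimination(genotypes):
    """1 - sum of squared genotype frequencies (allele order ignored)."""
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    n = len(genotypes)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical genotypes of five individuals at one locus.
genotypes = [(12, 13), (12, 12), (13, 14), (12, 14), (13, 13)]

freqs = allele_frequencies(genotypes)
print(freqs)
print(expected_heterozygosity(freqs), observed_heterozygosity(genotypes))
print(power_of_discrimination(genotypes))
```

A real dataset would repeat these calculations per locus and per population; how STRAF organises its input files and which corrections it applies should be taken from its own documentation.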
Pub Date : 2024-12-02 DOI: 10.1016/j.fsigen.2024.103205
Duncan Taylor, Amy Cahill, Roland A H van Oorschot, Luke Volgin, Mariya Goray
A major factor that influences DNA transfer is the propensity of individuals to 'shed' DNA, commonly referred to as their 'shedder status'. In this work we provide a novel method to analyse and interrogate DNA transfer data from a largely uncontrolled study that tracks the movements and actions of a group of individuals over the course of an hour. By setting up a model that provides a simplistic description of the world, parameters within the model that represent properties of interest can be iteratively refined until the model can sufficiently describe a set of final DNA observations. Because the model describing reality can be constructed and parametrised in any desired configuration, aspects that may be difficult to traditionally test together can be investigated. To that end, we use a 60-min timeline of activity between four individuals and use DNA profiling results from objects taken at the conclusion of the hour to investigate factors that may affect shedder status. We simultaneously consider factors of: the amount of DNA transferred per contact, the rate of self-DNA regeneration, the capacity of hands to hold DNA, and the rate of non-self-DNA removal, all of which may ultimately contribute to someone's shedder status.
Title : Using an interaction timeline to investigate factors related to shedder status.
Journal : Forensic science international. Genetics, volume 76, article 103205
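The idea of refining model parameters until a simulated timeline reproduces the final DNA observations can be sketched with a toy model. Everything here is an assumption made for illustration: the abstract does not give the authors' model or fitting procedure, the units are arbitrary, and a plain grid search stands in for whatever refinement scheme was actually used. The sketch covers three of the four factors named above (transfer per contact, self-DNA regeneration, hand capacity) and leaves out non-self-DNA removal.

```python
def simulate_deposit(contact_minutes, transfer_fraction, regen_per_min,
                     capacity, total_minutes=60):
    """Toy timeline: self-DNA on one hand and the amount left on one object."""
    hand = capacity          # assumption: the hand starts fully loaded
    deposited = 0.0
    for minute in range(total_minutes):
        if minute in contact_minutes:
            moved = hand * transfer_fraction   # DNA transferred in this contact
            hand -= moved
            deposited += moved
        # Self-DNA regenerates each minute, limited by the hand's capacity.
        hand = min(capacity, hand + regen_per_min)
    return deposited


# Hypothetical timeline and fixed parameters (arbitrary units).
contacts = {5, 8, 40}
regen, capacity = 0.5, 10.0

# "Observed" amount, generated here from a known transfer fraction of 0.2.
observed = simulate_deposit(contacts, 0.2, regen, capacity)

# Grid search: keep the candidate whose simulated deposit is closest.
candidates = [i / 20 for i in range(1, 11)]
best = min(
    candidates,
    key=lambda f: abs(simulate_deposit(contacts, f, regen, capacity) - observed),
)
print(best)
```

Because the model is explicit, several parameters could be varied together in the same way, which is the advantage the abstract points to over testing each factor in a separate controlled experiment.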
Pub Date : 2024-11-29 DOI: 10.1016/j.fsigen.2024.103183
Dejan Šorgić, Aleksandra Stefanović, Dušan Keckarević, Mladen Popović
The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Frequencies of 29 genetic markers from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, Hispanic Americans) were used to generate 360,000 profiles (90,000 profiles per group), which were later used to train and test a range of machine learning algorithms with the goal of establishing the optimal model for accurate ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, XGBoost, among others) were deployed in Python, and their performance was compared. The XGBoost model outperformed the others, displaying significant predictive power with an accuracy of 94.24 % for all four classes, 99.06 % on a differentiation task involving Asian, African American, and Caucasian subsamples, and 98.57 % when differentiating between the African American, Asian, and mixed (Caucasian and Hispanic combined) groups. Evaluating the impact of training set size revealed that model accuracy peaked at 94 % with 90,000 profiles per category, but decreased to 83 % as the number of profiles per category was reduced to 500, particularly affecting precision when distinguishing between Caucasian and Hispanic subgroups. The study further investigated the impact of marker quantity on model accuracy, finding that the use of 21 markers, commonly available in commercial amplification kits, resulted in an accuracy of 96.3 % for African Americans, Asians, and Caucasians, and 88.28 % for all four groups combined. 
These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.
Title : XGBoost as a reliable machine learning tool for predicting ancestry using autosomal STR profiles - Proof of method.
Journal : Forensic science international. Genetics, volume 76, article 103183
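The data-generation step described above (building synthetic profiles from published allele frequencies) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two loci, the allele frequencies and the group labels are invented, alleles are drawn independently within and across loci, and the classifier-training step is not shown. The abstract does not describe the authors' sampling code.

```python
import random

# Hypothetical allele frequencies: group -> locus -> allele -> frequency.
FREQ_TABLES = {
    "group_A": {"locus1": {10: 0.2, 11: 0.5, 12: 0.3},
                "locus2": {7: 0.6, 8: 0.4}},
    "group_B": {"locus1": {10: 0.6, 11: 0.1, 12: 0.3},
                "locus2": {7: 0.1, 8: 0.9}},
}


def simulate_profile(freq_table, rng):
    """Draw two alleles per locus, weighted by the group's allele frequencies."""
    profile = {}
    for locus, freqs in freq_table.items():
        alleles = list(freqs)
        weights = list(freqs.values())
        first, second = rng.choices(alleles, weights=weights, k=2)
        profile[locus] = tuple(sorted((first, second)))
    return profile


def simulate_dataset(freq_tables, profiles_per_group, seed=0):
    """Return (profile, group label) pairs for every group."""
    rng = random.Random(seed)   # seeded so the sketch is reproducible
    return [
        (simulate_profile(table, rng), group)
        for group, table in freq_tables.items()
        for _ in range(profiles_per_group)
    ]


dataset = simulate_dataset(FREQ_TABLES, profiles_per_group=3)
for profile, label in dataset:
    print(label, profile)
```

Labeled profiles of this kind would then be encoded as numeric features and split into training and test sets before fitting a classifier; with 90,000 profiles per group and 29 markers the same loop simply runs at a larger scale.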