Short Tandem Repeat Analysis as a Novel Method for Biogeographic Ancestry Prediction

Clarissa R. Jolley, Hannah J. Lee, Kristen A. Lucas, William P. McDevitt
Published in: 2022 Systems and Information Engineering Design Symposium (SIEDS), 2022-04-28
DOI: 10.1109/sieds55548.2022.9799365 (https://doi.org/10.1109/sieds55548.2022.9799365)

Abstract

Determining an individual's biogeographic ancestry from DNA remains a major task in forensic laboratories worldwide. Because full-scale genomic data acquisition and processing is costly, many forensic laboratories cannot conduct comprehensive genetic testing that analyzes ancestry-informative single nucleotide polymorphisms (aiSNPs), which creates a need for more cost-effective sources of information. In the present study, we assessed machine learning (ML) approaches to the analysis of short tandem repeats (STRs), non-coding repeats of a short DNA sequence, for determining biogeographic ancestry. STRs are repeat sequences in which a unit of 1 to 25 nucleotides occurs at various locations across the genome. Because STRs are highly variable, they are widely used to create unique genetic profiles of individuals. We analyzed the performance of selected loci in random forest classification models using anonymized STR data provided by the US Department of Defense (DoD) and collected from $\mathrm{N}=1747$ subjects across $\mathrm{K}=5$ continents, with the goal of predicting each individual's continental origin from their STR data. Supervised classification test accuracy varied from $\sim45\%$ to $>60\%$, while 10-fold training accuracy varied from $60\%$ to $\sim80\%$ across the profiles surveyed. Unsupervised clustering test accuracy was $\sim35\%$. Our findings indicate that STR data have substantial potential as a novel basis for continental ancestry prediction and that, with further research, high accuracy may be reached. We conclude with comments on future strategies for parameter optimization to maximize the utility of STR analysis, which may benefit smaller laboratories and expedite biogeographic ancestry determination for forensic professionals and law enforcement officials.
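To make the notion of an STR concrete, the following Python fragment is a minimal sketch, not the authors' pipeline. It counts the longest run of back-to-back copies of a repeat unit in a sequence, then flattens a per-locus genotype into a fixed-order numeric vector of the kind a classifier could take as input. The function names `longest_tandem_run` and `profile_to_features`, the example sequence, and the use of the loci D8S1179 and TH01 are illustrative assumptions; the abstract does not name the loci or the feature encoding used in the study.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    # A lookahead tries every start position, so runs of self-overlapping
    # motifs (e.g. "ATA") are not missed by non-overlapping matching.
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    runs = [len(m.group(1)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


def profile_to_features(profile, loci):
    """Flatten {locus: (allele_a, allele_b)} into one vector.

    Alleles are repeat counts; each pair is sorted so the encoding does not
    depend on the order in which the two alleles were recorded.
    (Hypothetical encoding, assumed for illustration only.)
    """
    features = []
    for locus in loci:
        low, high = sorted(profile[locus])
        features.extend([low, high])
    return features


# Illustrative sequence: three tandem copies of GATA, then an isolated copy.
example_sequence = "TTGATAGATAGATACCGATA"
repeat_count = longest_tandem_run(example_sequence, "GATA")

# Illustrative genotype at two loci (values are made up, not study data).
example_profile = {"D8S1179": (13, 12), "TH01": (9.3, 6)}
feature_vector = profile_to_features(example_profile, ["D8S1179", "TH01"])
```

Under this assumed encoding, each subject becomes one row of sorted allele pairs in a fixed locus order, which is one plausible input layout for a random forest or a clustering method.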