Development of a two-layer machine learning model for the forensic application of legal and illegal poppy classification based on sequence data

Forensic Science International-Genetics (IF 3.2; Zone 2, Medicine; JCR Q2, Genetics & Heredity). Pub Date: 2024-05-22. DOI: 10.1016/j.fsigen.2024.103061

Hyung-Eun An, Min-Ho Mun, Adeel Malik, Chang-Bae Kim


Abstract

Poppies are useful plants with medicinal, edible, ornamental, and industrial applications. Some Papaver species are forensically significant because they contain opium, a narcotic substance. Because of this forensic value, internationally trafficked illegal poppy species are being identified by DNA barcoding with multiple markers. However, which markers best support precise species identification of legal and illegal poppies is still under discussion: research on illegal poppies has focused on Papaver somniferum L., while species identification studies of Papaver bracteatum and Papaver setigerum DC. are still lacking. Therefore, to evaluate the performance of genetic markers and to classify DNA sequences in the genus Papaver, this study developed the first machine learning-based two-layer model, in which the first layer classifies a given sequence as a legal or illegal poppy and the second layer identifies the species of illegal poppies from their sequences. We constructed the dataset and investigated biological features from four markers, internal transcribed spacer 1 (ITS1), internal transcribed spacer 2 (ITS2), transfer RNA leucine (trnL), and the transfer RNA leucine - transfer RNA phenylalanine intergenic spacer (trnL–trnF intergenic spacer), and from their combination, using four machine learning algorithms: K-nearest neighbor (KNN), Naïve Bayes (NB), extreme gradient boosting (XGBoost), and Random Forest (RF). For Layer 1, which classifies legal and illegal poppies, KNN-based models using the combined ITS region performed best, with accuracies of 0.846 and 0.889 on the training and test sets, respectively. For Layer 2, which identifies illegal poppy species, KNN-based models using the combined ITS region also performed best, scoring 0.833 and 1.000 on the training and test sets, respectively. To validate the model, the combined ITS region (ITS1 and ITS2 sequences) from blind poppy samples was used as a case study, in which Layer 1 classified legal and illegal poppies with over 0.830 accuracy. Layer 2 correctly identified P. setigerum DC.; however, only one of the three P. somniferum L. samples was accurately identified. Nevertheless, our research shows that machine learning can classify and identify legal and illegal poppy species from DNA barcodes, and can therefore serve as an efficient and effective forensic tool for improved law enforcement and a safer society.
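
The two-layer arrangement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how sequences were encoded or which KNN settings were used, so the k-mer frequency encoding, the Euclidean distance, `k=1`, the `TwoLayerClassifier` class, the toy sequences, and the `legal_species_*` labels below are all assumptions made purely for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k=3):
    """Encode a DNA sequence as normalised k-mer frequencies (assumed encoding)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def knn_predict(train, query, k=1):
    """Majority vote among the k training vectors closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


class TwoLayerClassifier:
    """Hypothetical wrapper: Layer 1 = legal/illegal, Layer 2 = illegal species."""

    def __init__(self, k=1, kmer=3):
        self.k = k
        self.kmer = kmer
        self.layer1 = []  # (vector, "legal" | "illegal")
        self.layer2 = []  # (vector, species), illegal records only

    def fit(self, records):
        """records: iterable of (sequence, species, is_illegal) tuples."""
        self.layer1, self.layer2 = [], []
        for seq, species, is_illegal in records:
            vec = kmer_vector(seq, self.kmer)
            self.layer1.append((vec, "illegal" if is_illegal else "legal"))
            if is_illegal:
                self.layer2.append((vec, species))
        return self

    def predict(self, seq):
        """Return (status, species); species is None for legal sequences."""
        vec = kmer_vector(seq, self.kmer)
        status = knn_predict(self.layer1, vec, self.k)
        if status == "legal":
            return status, None
        return status, knn_predict(self.layer2, vec, self.k)


# Made-up toy sequences for illustration only; these are NOT real ITS data.
TOY_RECORDS = [
    ("ACGTACGTACGTACGTACGT", "P. somniferum", True),
    ("ACGTACGAACGTACGTACGA", "P. setigerum", True),
    ("GGGCCCGGGCCCGGGCCCGG", "P. bracteatum", True),
    ("TTTTAAAATTTTAAAATTTT", "legal_species_A", False),
    ("TATATATATATATATATATA", "legal_species_B", False),
]

clf = TwoLayerClassifier(k=1).fit(TOY_RECORDS)
print(clf.predict("GGGCCCGGGCCCGGGCCCGG"))
print(clf.predict("TATATATATATATATATATA"))
```

Routing only the sequences that Layer 1 flags as illegal into Layer 2 keeps the species model focused on the few forensically relevant species. It also means a Layer 1 error propagates, so in practice each layer would need to be evaluated separately on held-out sequences.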

Source journal

Forensic Science International-Genetics (Biology - Medicine, Legal)

CiteScore

7.50

Self-citation rate

32.30%

Articles published

132

Review time

11.3 weeks

Journal description: Forensic Science International: Genetics is the premier journal in the field of Forensic Genetics. This branch of Forensic Science can be defined as the application of genetics to human and non-human material (in the sense of a science with the purpose of studying inherited characteristics for the analysis of inter- and intra-specific variations in populations) for the resolution of legal conflicts. The scope of the journal includes the following. Forensic applications of human polymorphism: testing of paternity and other family relationships, immigration cases, typing of biological stains and tissues from criminal casework, and identification of human remains by DNA testing methodologies. Description of human polymorphisms of forensic interest, with special interest in DNA polymorphisms: autosomal DNA polymorphisms, mini- and microsatellites (or short tandem repeats, STRs), single nucleotide polymorphisms (SNPs), X and Y chromosome polymorphisms, mtDNA polymorphisms, and any other type of DNA variation with potential forensic applications. Non-human DNA polymorphisms for crime scene investigation. Population genetics of human polymorphisms of forensic interest: population data, especially from DNA polymorphisms of interest for the solution of forensic problems. DNA typing methodologies and strategies. Biostatistical methods in forensic genetics: evaluation of DNA evidence in forensic problems (such as paternity or immigration cases, criminal casework, identification), using classical and new statistical approaches. Standards in forensic genetics: recommendations of regulatory bodies concerning methods, markers, interpretation or strategies, or proposals for procedural or technical standards. Quality control: quality control and quality assurance strategies, and proficiency testing for DNA typing methodologies. Criminal DNA databases: technical, legal and statistical issues. General ethical and legal issues related to forensic genetics.