Ya Li, Xiaozhao Liu, Maomin Chen, Shaohua Yi, Ximiao He, Chao Xiao, Daixin Huang
Forensic Science International: Genetics, Volume 76, Article 103215. Published 2024-12-25. DOI: 10.1016/j.fsigen.2024.103215. URL: https://www.sciencedirect.com/science/article/pii/S1872497324002114
DNA methylation-based age estimation from semen: Genome-wide marker identification and model development
DNA methylation at age-related CpG (AR-CpG) sites holds significant promise for forensic age estimation. However, somatic models perform poorly in semen due to unique methylation dynamics during spermatogenesis, and current studies are constrained by the limited coverage of methylation microarrays. This study aimed to identify novel semen-specific AR-CpG sites using double-enzyme reduced representation bisulfite sequencing (dRRBS) and validate these markers, alongside previously reported sites and neighboring CpGs, using bisulfite amplicon sequencing (BSAS) to develop robust age estimation models. A methylome-wide association study was conducted on semen samples from 21 healthy Chinese men across three age groups, generating over 4 million CpG sites per sample at ≥5× depth. Analysis of 721,840 shared CpG sites revealed that more than 95% were not covered by conventional methylation microarrays. Differential methylation and correlation analyses identified 139 AR-CpG sites. A two-stage validation process using multiplex PCR-based BSAS was performed. In the first stage, 47 top dRRBS-identified AR-CpG sites, 26 literature-reported sites, and 242 neighboring CpGs were assessed in 129 semen samples (22–64 years), validating 31 dRRBS, 26 literature-reported, and 152 neighboring CpGs as age-related. The second stage examined 154 CpG sites in 247 samples (22–67 years), confirming 71 AR-CpG sites with |rho| > 0.50. Among these, chr2:129071885 (cg19998819) emerged as the strongest age-associated marker (rho = 0.81). Using the second BSAS dataset, age estimation models were developed with multiple linear regression and random forest (RF) algorithms within a repeated nested cross-validation (CV) framework (10-fold outer CV with 10-fold inner CV, repeated 10 times). The RF models demonstrated superior accuracy across feature subsets of 5–25 CpGs. The optimized 9-CpG RF model achieved an average root mean square error of 4.73 years (4.62–4.96, SD = 0.10) and an average mean absolute error of 3.30 years (3.23–3.43, SD = 0.06). This study demonstrates the utility of dRRBS for large-scale AR-CpG discovery and provides a robust age estimation model and a comprehensive reference database of semen-specific AR-CpG sites for forensic applications.
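The repeated nested cross-validation scheme described in the abstract (10-fold outer CV, 10-fold inner CV, 10 repeats, summarized as RMSE and MAE) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, a small k-nearest-neighbour regressor stands in for the random forest so that the example needs only the Python standard library, and all names (`k_folds`, `knn_predict`, `nested_cv`, the hyperparameter grid) are hypothetical.

```python
import math
import random
import statistics

def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def knn_predict(train, x, n_neighbors):
    """Stand-in model (not a random forest): mean age of the nearest training samples."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:n_neighbors]
    return statistics.fmean(age for _, age in nearest)

def rmse(pairs):
    return math.sqrt(statistics.fmean((p - t) ** 2 for p, t in pairs))

def mae(pairs):
    return statistics.fmean(abs(p - t) for p, t in pairs)

def nested_cv(data, rng, outer_k=10, inner_k=10, grid=(1, 5, 9)):
    """One repeat of nested CV; returns (RMSE, MAE) over all outer-fold predictions."""
    pairs = []
    for held_out in k_folds(list(range(len(data))), outer_k, rng):
        held = set(held_out)
        train_idx = [i for i in range(len(data)) if i not in held]
        # Inner CV: pick the hyperparameter using training samples only.
        inner = k_folds(train_idx, inner_k, rng)

        def inner_error(n_neighbors):
            inner_pairs = []
            for val in inner:
                val_set = set(val)
                fit = [data[i] for i in train_idx if i not in val_set]
                inner_pairs += [(knn_predict(fit, data[i][0], n_neighbors), data[i][1])
                                for i in val]
            return rmse(inner_pairs)

        best = min(grid, key=inner_error)
        # Outer evaluation: fit on all training samples, predict the held-out fold.
        fit = [data[i] for i in train_idx]
        pairs += [(knn_predict(fit, data[i][0], best), data[i][1]) for i in held_out]
    return rmse(pairs), mae(pairs)

# Synthetic data (an assumption): three methylation fractions that drift with age.
rng = random.Random(0)

def make_sample():
    age = rng.uniform(22, 67)
    meth = tuple(0.5 + slope * (age - 22) + rng.gauss(0, 0.03)
                 for slope in (0.008, -0.003, 0.005))
    return meth, age

data = [make_sample() for _ in range(100)]
results = [nested_cv(data, rng) for _ in range(10)]  # 10 repeats with fresh splits

rmse_values = [r for r, _ in results]
mae_values = [m for _, m in results]
print(f"RMSE mean={statistics.fmean(rmse_values):.2f} SD={statistics.stdev(rmse_values):.2f}")
print(f"MAE  mean={statistics.fmean(mae_values):.2f} SD={statistics.stdev(mae_values):.2f}")
```

The point of the nesting is that the hyperparameter is chosen without ever seeing the held-out outer fold, so the outer error is not biased by tuning. Repeating the whole procedure with new random splits gives the spread (range and SD) that the abstract reports alongside the mean RMSE and MAE.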
About the journal:
Forensic Science International: Genetics is the premier journal in the field of Forensic Genetics. This branch of Forensic Science can be defined as the application of genetics to human and non-human material (in the sense of a science with the purpose of studying inherited characteristics for the analysis of inter- and intra-specific variations in populations) for the resolution of legal conflicts.
The scope of the journal includes:
Forensic applications of human polymorphism: testing of paternity and other family relationships, immigration cases, typing of biological stains and tissues from criminal casework, identification of human remains by DNA testing methodologies.
Description of human polymorphisms of forensic interest, with special interest in DNA polymorphisms: autosomal DNA polymorphisms, mini- and microsatellites (or short tandem repeats, STRs), single nucleotide polymorphisms (SNPs), X and Y chromosome polymorphisms, mtDNA polymorphisms, and any other type of DNA variation with potential forensic applications.
Non-human DNA polymorphisms for crime scene investigation.
Population genetics of human polymorphisms of forensic interest: population data, especially from DNA polymorphisms of interest for the solution of forensic problems.
DNA typing methodologies and strategies.
Biostatistical methods in forensic genetics: evaluation of DNA evidence in forensic problems (such as paternity or immigration cases, criminal casework, identification), classical and new statistical approaches.
Standards in forensic genetics: recommendations of regulatory bodies concerning methods, markers, interpretation or strategies, or proposals for procedural or technical standards.
Quality control: quality control and quality assurance strategies, proficiency testing for DNA typing methodologies.
Criminal DNA databases: technical, legal and statistical issues.
General ethical and legal issues related to forensic genetics.