利用卷积神经网络优化序列数据分析，预测CNV诱饵位置。

IF 3.3 3区生物学 Q2 BIOCHEMICAL RESEARCH METHODS BMC Bioinformatics Pub Date : 2024-12-24 DOI:10.1186/s12859-024-06006-y

Zoltán Maróti, Peter Juma Ochieng, József Dombi, Miklós Krész, Tibor Kalmár

{"title":"利用卷积神经网络优化序列数据分析，预测CNV诱饵位置。","authors":"Zoltán Maróti, Peter Juma Ochieng, József Dombi, Miklós Krész, Tibor Kalmár","doi":"10.1186/s12859-024-06006-y","DOIUrl":null,"url":null,"abstract":"Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, the kit manifests provide only the genome coordinates of the targeted regions, and the exact bait design of the oligo capture baits is not available. Although the on-target regions significantly overlap with the bait design, a lack of adequate information allows less accurate normalization of the coverage data. In this study, we propose a novel approach that utilizes a 1D convolution neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying the exact positions of bait coordinates, our model enables precise normalization of GC bias across target regions, thereby allowing better CNV data normalization.Results: We evaluated the optimal hyperparameters, model architecture, and complexity to predict the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the Dense NN for bait predictions. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in the prediction performance. We have shown that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, comparison with the on-target information indicated that the CNN models performed better in predicting bait positions that exhibited a high degree of overlap (>90%) with the true bait positions.Results: This study highlights the potential of utilizing CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systemic bias and improve the sensitivity and specificity of CNV detection in genomic studies.","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"389"},"PeriodicalIF":3.3000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669243/pdf/","citationCount":"0","resultStr":"{\"title\":\"Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions.\",\"authors\":\"Zoltán Maróti, Peter Juma Ochieng, József Dombi, Miklós Krész, Tibor Kalmár\",\"doi\":\"10.1186/s12859-024-06006-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, the kit manifests provide only the genome coordinates of the targeted regions, and the exact bait design of the oligo capture baits is not available. Although the on-target regions significantly overlap with the bait design, a lack of adequate information allows less accurate normalization of the coverage data. In this study, we propose a novel approach that utilizes a 1D convolution neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying the exact positions of bait coordinates, our model enables precise normalization of GC bias across target regions, thereby allowing better CNV data normalization.Results: We evaluated the optimal hyperparameters, model architecture, and complexity to predict the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the Dense NN for bait predictions. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in the prediction performance. We have shown that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, comparison with the on-target information indicated that the CNN models performed better in predicting bait positions that exhibited a high degree of overlap (>90%) with the true bait positions.Results: This study highlights the potential of utilizing CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systemic bias and improve the sensitivity and specificity of CNV detection in genomic studies.\",\"PeriodicalId\":8958,\"journal\":{\"name\":\"BMC Bioinformatics\",\"volume\":\"25 1\",\"pages\":\"389\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2024-12-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669243/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Bioinformatics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s12859-024-06006-y\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12859-024-06006-y","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

摘要

背景：从目标捕获下一代测序（NGS）数据中准确预测拷贝数变化（CNVs）依赖于有效的读覆盖谱归一化。由于GC偏差等潜在的系统性偏差，标准化过程尤其具有挑战性，这些偏差会显著影响CNV检测的灵敏度和特异性。在许多情况下，试剂盒清单只提供了目标区域的基因组坐标，并且寡核苷酸捕获诱饵的确切设计是不可用的。尽管目标区域与诱饵设计有很大的重叠，但缺乏足够的信息使得覆盖数据的归一化不太准确。在这项研究中，我们提出了一种利用1D卷积神经网络（CNN）模型来预测复杂全外显子组测序（WES）试剂盒中捕获诱饵位置的新方法。通过准确地识别诱饵坐标的确切位置，我们的模型可以精确地规范化目标区域的GC偏差，从而实现更好的CNV数据规范化。结果：我们评估了最优超参数、模型架构和复杂性，以预测寡核苷酸捕获诱饵的可能位置。我们的分析表明，CNN模型在诱饵预测方面优于Dense NN。批归一化是CNN模型稳定训练最重要的参数。我们的研究结果表明，数据的空间性对预测性能起着重要作用。我们已经表明，组合输入数据，包括实验覆盖，目标信息和序列数据，对诱饵预测至关重要。此外，与目标信息的比较表明，CNN模型在预测与真实诱饵位置高度重叠（>90%）的诱饵位置方面表现更好。结果：本研究强调了利用基于cnn的方法优化覆盖率数据分析和改善拷贝数数据规范化的潜力。随后基于这些预测坐标的CNV检测有助于更准确地测量覆盖曲线和更好地归一化GC偏差。因此，该方法可以减少系统偏差，提高基因组研究中CNV检测的敏感性和特异性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions.

Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, the kit manifests provide only the genome coordinates of the targeted regions, and the exact bait design of the oligo capture baits is not available. Although the on-target regions significantly overlap with the bait design, a lack of adequate information allows less accurate normalization of the coverage data. In this study, we propose a novel approach that utilizes a 1D convolution neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying the exact positions of bait coordinates, our model enables precise normalization of GC bias across target regions, thereby allowing better CNV data normalization.

Results: We evaluated the optimal hyperparameters, model architecture, and complexity to predict the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the Dense NN for bait predictions. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in the prediction performance. We have shown that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, comparison with the on-target information indicated that the CNN models performed better in predicting bait positions that exhibited a high degree of overlap (>90%) with the true bait positions.

Results: This study highlights the potential of utilizing CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systemic bias and improve the sensitivity and specificity of CNV detection in genomic studies.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMC Bioinformatics 生物-生化研究方法

CiteScore

5.70

自引率

3.30%

发文量

506

审稿时长

4.3 months

期刊介绍： BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.