High-Risk Sequence Prediction Model in DNA Storage: The LQSF Method
Authors: Yitong Ma, Shuai Chen, Xu Qi, Zuhong Lu, Kun Bi
Journal: IEEE Transactions on NanoBioscience (impact factor 3.7; JCR Q1, Biochemical Research Methods)
Published: 2024-07-08
DOI: 10.1109/TNB.2024.3424576 (https://doi.org/10.1109/TNB.2024.3424576)
Citations: 0
Abstract
Traditional DNA storage technologies rely on passive filtering for error correction during synthesis and sequencing, which adds redundancy yet still corrects errors inadequately. To address this, the Low Quality Sequence Filter (LQSF), a method that uses deep learning models to predict high-risk sequences, was introduced. LQSF relies on a classification model trained on error-prone sequences, so that low-quality sequences can be filtered out before sequencing, saving time and resources in subsequent stages. Analysis showed a clear distinction between high-quality and low-quality sequences, supporting the efficacy of the method. Training and testing were carried out across several neural networks and test sets. All models achieved an area under the curve (AUC) above 0.91 on ROC curves and above 0.95 on precision-recall (PR) curves across the different datasets. Notably, AlexNet, VGG16, and VGG19 achieved a perfect AUC of 1.0 on the Original dataset. Further validation using Illumina sequencing data showed a strong correlation between model scores and sequence error-proneness, supporting the model's applicability. By introducing active sequence filtering at the encoding stage, LQSF marks a significant advance in DNA storage technology and holds substantial promise for future DNA storage research and applications.
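The filtering and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `placeholder_risk_score` is a hypothetical heuristic (GC imbalance plus homopolymer length) standing in for the trained deep learning classifier, and the 0.5 threshold and all function names are assumptions made for illustration. The sketch shows only the general pattern: score each candidate sequence, reject those above a threshold before they reach later stages, and measure how well the scores separate error-prone from reliable sequences using ROC AUC.

```python
from typing import Callable, List, Sequence, Tuple


def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    longest = run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def placeholder_risk_score(seq: str) -> float:
    """HYPOTHETICAL stand-in for a trained classifier's score in [0, 1].

    Assumption for illustration only: risk grows with GC imbalance and
    with homopolymer length. The paper's models are neural networks.
    """
    if not seq:
        return 0.0
    gc_fraction = sum(base in "GC" for base in seq) / len(seq)
    gc_penalty = abs(gc_fraction - 0.5) * 2  # 0 at 50% GC, 1 at 0% or 100%
    run_penalty = min(max_homopolymer_run(seq) / len(seq), 1.0)
    return 0.5 * gc_penalty + 0.5 * run_penalty


def filter_sequences(
    seqs: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Split candidates into (kept, rejected) by comparing scores to threshold."""
    kept: List[str] = []
    rejected: List[str] = []
    for seq in seqs:
        (rejected if score(seq) >= threshold else kept).append(seq)
    return kept, rejected


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outscores a negative.

    Ties count as half a win. Labels are 1 (error-prone) or 0 (reliable).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy candidates; real inputs would be encoded DNA storage sequences.
candidates = ["ACGTACGTAC", "AAAAAAAAAA", "GGGGGCCCCC", "ATGCATGCAT"]
kept, rejected = filter_sequences(candidates, placeholder_risk_score)
```

In a real pipeline the scoring function would be replaced by the trained network's predicted probability, and the threshold would be chosen from the ROC or PR curve rather than fixed in advance.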
About the journal:
The IEEE Transactions on NanoBioscience reports on original, innovative, and interdisciplinary work on all aspects of molecular systems, cellular systems, and tissues (including molecular electronics). Topics covered in the journal span a broad spectrum, covering both foundations and applications. Specifically, the journal covers methods and techniques, experimental aspects, design and implementation, instrumentation and laboratory equipment, clinical aspects, hardware and software data acquisition and analysis, and computer-based modelling (on traditional or high-performance computing platforms, such as parallel computers or computer networks).