A mapping-free natural language processing-based technique for sequence search in nanopore long-reads.

BMC Bioinformatics. IF 2.9; Zone 3 (Biology); JCR Q2 (Biochemical Research Methods). Pub Date: 2024-11-13. DOI: 10.1186/s12859-024-05980-7

Tomasz Strzoda, Lourdes Cruz-Garcia, Mustafa Najim, Christophe Badie, Joanna Polanska



Abstract

Background: In unforeseen situations, such as nuclear power plant or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. There is demand for a new-generation in-situ mapper that is fast, consumes little energy, and works at the level of single nanopore output. We aim to create a sequence identification tool that utilizes natural language processing techniques and ensures a high negative predictive value (NPV) relative to the classical approach.

Results: The training dataset consisted of RNA sequencing data from 6 samples. Multiple natural language processing models were examined, differing in the type of dictionary components (word length, step, context) as well as in the encoding length and the number of sequences required for algorithm training. The best configuration analyses the entire sequence and uses a word length of 3 base pairs with a one-word neighbor on each side. For the considered FDXR gene, the mean balanced accuracy (BACC) was 98.29% and the NPV was 99.25%, relative to minimap2's results in a cross-validation scenario. The next stage explored the dictionary components and attempted to optimize the dictionary, employing statistical techniques as well as techniques relying on the explainability of the decisions made. Reducing the dictionary from 1024 to 145 entries changed the BACC to 96.49% and the NPV to 98.15%. The obtained model, validated on an external independent genome sequencing dataset, gave an NPV of 99.64% for the complete dictionary and 95.87% for the reduced one. The Salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one.
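To make the terms in the Results concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how a read can be cut into fixed-length "words" with a given step, how each word can be paired with one neighbor on each side, and how BACC and NPV follow from a confusion matrix. The function names, the default step of 1, and the choice to treat the classical mapper's assignment as the reference label are assumptions made for illustration; the abstract does not state how the 1024-entry dictionary is built, so the sketch does not try to reproduce it.

```python
from typing import List, Tuple


def split_into_words(seq: str, word_len: int = 3, step: int = 1) -> List[str]:
    """Slide a window of `word_len` bases over `seq`, advancing by `step`.

    word_len=3 follows the abstract; step=1 is an assumed default.
    """
    seq = seq.upper()
    return [seq[i:i + word_len] for i in range(0, len(seq) - word_len + 1, step)]


def context_pairs(words: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Pair every word with its neighbors up to `window` positions away.

    window=1 corresponds to one neighboring word on each side.
    """
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


def bacc_and_npv(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float]:
    """Balanced accuracy and negative predictive value from confusion counts.

    Assumption: "positive" means the reference mapper assigned the read to
    the target gene; the counts compare the model's calls with that label.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    bacc = (sensitivity + specificity) / 2
    npv = tn / (tn + fn)          # share of negative calls that are correct
    return bacc, npv


if __name__ == "__main__":
    words = split_into_words("ACGTAC")
    print(words)
    print(context_pairs(words))
    # Hypothetical counts, chosen only to exercise the formulas.
    print(bacc_and_npv(tp=90, tn=95, fp=5, fn=10))
```

A high NPV means that reads the model rejects are rarely reads of the target gene, which is the property the Background singles out as the design goal.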

Conclusions: For long Oxford Nanopore reads, a natural language processing-based approach can reliably replace classical mapping when fast, reliable, energy-efficient and computationally efficient targeted mapping of a pre-defined subset of transcripts is needed. The developed model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. The results of the study clearly demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences.


Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.