SAKit: An all-in-one analysis pipeline for identifying novel proteins resulting from variant events at both large and small scales.

IF 0.9; Zone 4 (Biology); JCR Q4, Mathematical & Computational Biology; Journal of Bioinformatics and Computational Biology; Pub Date: 2024-10-01; DOI: 10.1142/S0219720024500227

Yan Li, Boran Wang, Zengding Wu, Shiliang Ji, Shi Xu, Caiyi Fei

Journal of Bioinformatics and Computational Biology, Vol. 22, No. 5, Article 2450022. Publication type: Journal Article. Published 2024-10-01.

Citations: 0

Abstract

Background: Genetic mutations that inactivate or aberrantly activate essential proteins may alter or disrupt cellular signaling pathways, culminating in precancerous lesions and cancer. Such mutations and dysfunctions can generate "novel proteins" that are not part of the conventional human proteome. Identifying these proteins has considerable potential for revealing promising drug targets and designing innovative therapeutic models. Although diverse tools for detecting DNA or RNA variants have emerged with the widespread adoption of nucleotide sequencing technology, they primarily target point mutations and perform suboptimally on large-scale and combinatorial mutations. In addition, their output is confined to the genome and transcriptome levels and does not provide the protein information that results from the genetic alterations.

Results: We present the Sequencing Analysis Kit (SAKit), a bioinformatics pipeline for hybrid sequencing analysis that integrates long-read and short-read RNA sequencing data. Long reads are used to detect large-scale variations such as gene fusions, exon skipping, intron retention, and aberrant expression in non-coding regions, owing to their excellent coverage; short reads validate these findings at breakpoints and splice junctions. Conversely, short reads are used to identify small-scale variations, including single nucleotide variants, deletions, and insertions, owing to their superior sequencing depth, with long reads providing additional validation. SAKit runs its analyses from inter-species configuration files comprising genome references and annotation data, making it applicable to both human and mouse studies. Furthermore, SAKit applies a hierarchical filtering approach to eliminate low-confidence variants and uses open reading frame (ORF) analysis to translate identified variants into protein sequences.

Conclusion: SAKit is a robust and versatile bioinformatics tool for the comprehensive identification of both large-scale and small-scale variants from RNA-seq data, facilitating the discovery of novel proteins. The pipeline integrates the analysis of long-read and short-read sequencing data, offering a powerful solution for researchers in genomics and transcriptomics. SAKit is freely accessible and open-source, available through GitHub (https://github.com/therarna/SAKit) and as a Docker image (https://hub.docker.com/repository/docker/therarna). Implemented primarily within a Snakemake framework using Python, SAKit ensures reproducibility, scalability, and ease of use for the scientific community.
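The abstract does not describe how SAKit performs its ORF analysis, so the following is only a minimal illustrative sketch of the general idea: apply a variant to a transcript, find the longest ATG-initiated open reading frame, and translate it with the standard genetic code. The function names (`translate`, `longest_orf`, `apply_snv`) and the toy transcript are hypothetical and are not taken from the SAKit code base; the sketch scans the forward strand only and treats an ORF that runs off the end of the sequence as valid.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate seq from its first base until a stop codon or the end of seq."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def longest_orf(transcript):
    """Return the protein of the longest ATG-initiated ORF on the forward strand."""
    seq = transcript.upper()
    best = ""
    start = seq.find("ATG")
    while start != -1:
        protein = translate(seq[start:])
        if len(protein) > len(best):
            best = protein
        start = seq.find("ATG", start + 1)
    return best


def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


# Hypothetical toy transcript: 5' UTR, ATG GCT GAA TTT, stop codon TGA, 3' UTR.
reference = "GGATGGCTGAATTTTGACC"
missense = apply_snv(reference, 9, "T")   # GAA -> GTA changes one residue
nonsense = apply_snv(reference, 8, "T")   # GAA -> TAA introduces an early stop

for label, transcript in [("reference", reference),
                          ("missense", missense),
                          ("nonsense", nonsense)]:
    print(label, longest_orf(transcript))
```

A protein that differs from every entry of a reference proteome would then be a candidate "novel protein" in the sense used above; how SAKit makes that comparison is not stated in the abstract.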




Source journal

Journal of Bioinformatics and Computational Biology (Mathematical & Computational Biology)

CiteScore: 2.10

Self-citation rate: 0.00%

About the journal: The Journal of Bioinformatics and Computational Biology aims to publish high-quality, original research articles, expository tutorial papers, and review papers, as well as short, critical comments on technical issues associated with the analysis of cellular information. The research papers will be technical presentations of new assertions, discoveries, and tools, intended for a narrower specialist community. The tutorials, reviews, and critical commentary will be targeted at a broader readership of biologists who are interested in using computers but are not knowledgeable about scientific computing and, equally, computer scientists who have an interest in biology but are familiar with neither its current thrusts nor its language. Such carefully chosen tutorials and articles should greatly accelerate the rate of entry of these new creative scientists into the field.