A transformer-based semi-autoregressive framework for high-speed and accurate de novo peptide sequencing.
Yang Zhao, Shuo Wang, Jinze Huang, Bo Meng, Dong An, Xiang Fang, Yaoguang Wei, Xinhua Dai
Communications Biology, volume 8(1), article 234. Published 2025-02-14.
DOI: https://doi.org/10.1038/s42003-025-07584-0
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11825679/pdf/
Citations: 0
Abstract
De novo peptide sequencing directly identifies peptides from mass spectrometry data, playing a critical role in discovering novel proteins and analyzing complex biological samples without reliance on existing databases. To address challenges in both speed and accuracy, a transformer-based model, TSARseqNovo, incorporates two key innovations: a Semi-Autoregressive decoder for parallel prediction of multiple amino acids and a Masking Refinement decoder for refining low-confidence predictions. These features significantly enhance sequencing efficiency and accuracy. Evaluations on the Nine-Species, Aggregated, and Glycoproteomic datasets demonstrate that TSARseqNovo outperforms state-of-the-art models, including CasaNovo, NovoB, InstaNovo+, and π-HelixNovo. Specifically, TSARseqNovo achieves up to a 2-fold speed increase over CasaNovo and π-HelixNovo, and approximately 10-fold over NovoB and InstaNovo+, while also showing substantial improvements in peptide prediction precision, especially for long peptides. These advancements position TSARseqNovo as a powerful tool for accelerating high-throughput proteomics research and addressing increasingly complex biological questions.
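The abstract names two decoding ideas without giving details: emitting several amino acids per decoding step, and re-predicting positions whose confidence is low. The toy sketch below illustrates only that general control flow and is not the TSARseqNovo implementation. Every name in it (`semi_autoregressive_decode`, `mask_and_refine`, the block size `k`, the confidence `threshold`, and the stand-in predictors that read from a fixed target peptide) is an assumption made for illustration; a real model would compute predictions from the spectrum.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical placeholder for a position to re-predict
EOS = "<eos>"    # hypothetical end-of-sequence token

# Hypothetical interfaces (not from the paper):
# a block predictor maps (prefix, k) to k (token, confidence) pairs;
# a refiner maps a partially masked sequence to a fully filled one.
BlockPredictor = Callable[[List[str], int], List[Tuple[str, float]]]
Refiner = Callable[[List[str]], List[str]]


def semi_autoregressive_decode(
    predict_block: BlockPredictor, k: int, max_len: int
) -> Tuple[List[str], List[float], int]:
    """Decode k tokens per step instead of one; return tokens, confidences, steps."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tokens: List[str] = []
    confs: List[float] = []
    steps = 0
    while len(tokens) < max_len:
        steps += 1
        for tok, conf in predict_block(tokens, k):
            if tok == EOS or len(tokens) >= max_len:
                return tokens, confs, steps
            tokens.append(tok)
            confs.append(conf)
    return tokens, confs, steps


def mask_and_refine(
    tokens: List[str], confs: List[float], threshold: float, refine: Refiner
) -> List[str]:
    """Mask positions whose confidence is below threshold, then re-predict them."""
    masked = [t if c >= threshold else MASK for t, c in zip(tokens, confs)]
    return refine(masked)


# --- Toy stand-ins for a trained model (assumptions for illustration) ---
TARGET = list("PEPTIDEK")


def toy_predict_block(prefix: List[str], k: int) -> List[Tuple[str, float]]:
    out: List[Tuple[str, float]] = []
    for i in range(len(prefix), len(prefix) + k):
        if i >= len(TARGET):
            out.append((EOS, 1.0))
        elif i == 4:
            # Deliberate low-confidence error: L instead of I at position 4.
            out.append(("L", 0.30))
        else:
            out.append((TARGET[i], 0.95))
    return out


def toy_refine(masked: List[str]) -> List[str]:
    # Fill each masked position with the toy target residue.
    return [TARGET[i] if t == MASK else t for i, t in enumerate(masked)]


if __name__ == "__main__":
    draft, confs, steps = semi_autoregressive_decode(toy_predict_block, k=2, max_len=20)
    refined = mask_and_refine(draft, confs, threshold=0.5, refine=toy_refine)
    print("".join(draft), "".join(refined), steps)
```

In this toy setting, a larger `k` reduces the number of decoder calls needed to cover the same peptide, and the refinement pass touches only the positions that fell below the threshold. How TSARseqNovo actually chooses block sizes, estimates confidence, and conditions the refinement decoder is not stated in the abstract.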
About the journal:
Communications Biology is an open access journal from Nature Research publishing high-quality research, reviews and commentary in all areas of the biological sciences. Research papers published by the journal represent significant advances bringing new biological insight to a specialized area of research.