PS4: a next-generation dataset for protein single-sequence secondary structure prediction
Author: Omar Peracha
Journal: BioTechniques, pp. 63-70 (published 2024-02-01; Epub 2023-11-24)
DOI: 10.2144/btn-2023-0024 (https://doi.org/10.2144/btn-2023-0024)
Protein secondary structure prediction is a subproblem of protein folding. A lightweight algorithm capable of accurately predicting secondary structure from only the protein residue sequence could provide useful input for tertiary structure prediction, alleviating the reliance on multiple sequence alignments typically seen in today's best-performing models. Unfortunately, existing datasets for secondary structure prediction are small, creating a bottleneck. We present PS4, a dataset of 18,731 nonredundant protein chains and their respective secondary structure labels. Each chain is identified, and the dataset is nonredundant against other secondary structure datasets commonly seen in the literature. We perform ablation studies by training secondary structure prediction algorithms on the PS4 training set and obtain state-of-the-art accuracy on the CB513 test set in zero shots.
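To make the evaluation setting concrete, here is a minimal sketch of how per-residue accuracy could be computed for a single-sequence predictor. The abstract does not say which metric or label alphabet was used, so the function name `per_residue_accuracy`, the three-letter label alphabet (H, E, C) and the toy chains are all assumptions for illustration, not the paper's method or data.

```python
from typing import Sequence


def per_residue_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Return the fraction of residues whose predicted label matches the reference.

    Each element is one chain's label string, with one character per residue.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must contain the same number of chains")
    total = 0
    correct = 0
    for pred, ref in zip(predicted, reference):
        if len(pred) != len(ref):
            raise ValueError("each prediction must be as long as its reference chain")
        total += len(ref)
        correct += sum(p == r for p, r in zip(pred, ref))
    if total == 0:
        raise ValueError("no residues to score")
    return correct / total


# Hypothetical toy labels (H = helix, E = strand, C = coil); not taken from PS4 or CB513.
reference = ["HHHHCCEE", "CCEEEECC"]
predicted = ["HHHCCCEE", "CCEEEECH"]
accuracy = per_residue_accuracy(predicted, reference)
print(f"per-residue accuracy: {accuracy:.3f}")
```

Pooling residues across all chains, as done here, weights long chains more heavily than averaging per-chain scores would; which convention applies to the reported CB513 result is not stated in the abstract.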
About the journal:
BioTechniques is a peer-reviewed, open-access journal dedicated to publishing original laboratory methods, related technical and software tools, and methods-oriented review articles that are of broad interest to professional life scientists, as well as to scientists from other disciplines (e.g., chemistry, physics, computer science, plant and agricultural science and climate science) interested in life science applications for their technologies.
Since 1983, BioTechniques has been a leading peer-reviewed journal for methods-related research. The journal considers:
Reports describing innovative new methods, platforms and software, substantive modifications to existing methods, or innovative applications of existing methods, techniques & tools to new models or scientific questions
Descriptions of technical tools that facilitate the design or performance of experiments or data analysis, such as software and simple laboratory devices
Surveys of technical approaches related to broad fields of research
Reviews discussing advancements in techniques and methods related to broad fields of research
Letters to the Editor and Expert Opinions highlighting interesting observations or cautionary tales concerning experimental design, methodology or analysis.