Hitesh Kore, Satomi Okano, Keshava K Datta, Jackson Thorp, Parthiban Periasamy, Mayur Divate, Upekha Liyanage, Gunter Hartel, Shivashankar H Nagaraj, Harsha Gowda
Identification of Small Open Reading Frame Encoded Proteins from the Human Genome.
Genomics, proteomics & bioinformatics. Journal Article. Published 2025-02-07.
DOI: 10.1093/gpbjnl/qzaf004 (https://doi.org/10.1093/gpbjnl/qzaf004)
Abstract
One of the main goals of the Human Genome Project was to identify all protein-coding genes. Approximately 20,500 protein-coding genes are annotated in human reference databases. However, in the last few years, proteogenomics studies have predicted thousands of novel protein-coding regions, including low-molecular-weight proteins encoded by small open reading frames (ORFs) in untranslated regions of messenger RNAs and in non-coding RNAs. Most of these predictions are based on bioinformatics analysis and ribosome footprints. The validity of some of these small ORF (sORF)-encoded proteins (SEPs) has been established through functional characterization. With the growing number of predicted novel proteins, a strategy is needed to identify reliable candidates that warrant further study. We developed an integrated proteogenomics workflow to identify a reliable set of novel protein-coding regions in the human genome based on their recurrent observation across multiple samples. Publicly available ribosome profiling and global proteomics datasets were used to establish protein-coding evidence. We predicted protein translation from 4008 ORFs based on recurrent ribosome occupancy signals across samples. In addition, we identified 825 SEPs based on proteomics data. Some of the novel protein-coding regions identified lie in genome-wide association study (GWAS) loci associated with various traits and disease phenotypes. Peptides from SEPs are also presented by major histocompatibility complex class I (MHC-I), as are peptides from canonical proteins. The novel protein-coding regions reported in this study expand the current catalog of protein-coding genes and warrant experimental studies to elucidate the cellular functions regulated by these proteins and their roles in human diseases.
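The recurrence idea behind the workflow (retaining only ORFs observed repeatedly across samples) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the input format, the function name `recurrent_orfs`, the `min_samples` threshold, and the toy identifiers are all assumptions made for illustration, and the abstract does not state the actual recurrence cutoff.

```python
from collections import defaultdict


def recurrent_orfs(calls_by_sample, min_samples=2):
    """Return ORF IDs called in at least `min_samples` distinct samples.

    calls_by_sample: mapping of sample name -> iterable of ORF IDs called as
        translated in that sample (hypothetical input format).
    min_samples: hypothetical recurrence threshold, not taken from the paper.
    """
    # Collect the set of samples supporting each ORF, so that repeated
    # calls within one sample count only once.
    support = defaultdict(set)
    for sample, orf_ids in calls_by_sample.items():
        for orf_id in orf_ids:
            support[orf_id].add(sample)
    return sorted(
        orf_id for orf_id, samples in support.items()
        if len(samples) >= min_samples
    )


# Toy, made-up identifiers purely for illustration (not data from the study).
toy_calls = {
    "sample_A": ["orf_1", "orf_2"],
    "sample_B": ["orf_2", "orf_3"],
    "sample_C": ["orf_2", "orf_3", "orf_3"],
}

print(recurrent_orfs(toy_calls, min_samples=2))  # ['orf_2', 'orf_3']
```

Counting distinct supporting samples, rather than total calls, keeps a single noisy sample from promoting an ORF on its own; the same filter could in principle be applied separately to ribosome profiling calls and to peptide identifications.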