The main way of analyzing genetic sequences is by finding sequence regions that are related to each other. There are many methods to do that, usually based on this idea: find an alignment of two sequence regions that would be unlikely to exist between unrelated sequences. Unfortunately, it is hard to tell if an alignment is likely to exist by chance. Also, the precise alignment of related regions is uncertain. A single alignment does not hold all the evidence that the regions are related, so alternative alignments should be considered too. This is rarely done, because we lack a simple and fast method that fits easily into practical sequence-search software. This paper describes the simplest conceivable change to standard sequence alignment: summing the probabilities of alternative alignments. This makes it easier to tell if a similarity is likely to occur by chance. This approach is better than standard alignment at finding distant relationships, at least in a few tests. It can be used in practical sequence-search software, with minimal increase in implementation difficulty or run time. It generalizes to different kinds of alignment, e.g. DNA-versus-protein with frameshifts. Thus, it can contribute widely to finding subtle relationships between sequences.
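The contrast between scoring one best alignment and summing over alternatives can be sketched in a few lines. The code below is only an illustration under stated assumptions, not the recurrence used in the paper: it assumes scores are natural-log likelihood ratios, uses a hypothetical match/mismatch/linear-gap scheme, and requires every alignment to contain at least one aligned pair. The function names and parameters are invented for this sketch. The one point it tries to show is that the same dynamic program gives either result, depending on whether alternatives are combined with `max` or with log-sum-exp.

```python
import math

def logsumexp(values):
    """Log of the sum of exponentials, computed stably."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def local_alignment_score(a, b, combine, match=1.0, mismatch=-1.0, gap=2.0):
    """Score local alignments of a and b (both assumed non-empty).

    combine=max       -> score of the single best alignment
    combine=logsumexp -> log of the summed likelihood ratios of all alignments
    The scoring parameters are illustrative assumptions.
    """
    neg_inf = -math.inf
    prev = [neg_inf] * (len(b) + 1)
    cell_scores = []
    for i in range(1, len(a) + 1):
        cur = [neg_inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = combine([
                s,                  # start a new alignment with this pair
                prev[j - 1] + s,    # extend an alignment with this pair
                prev[j] - gap,      # extend with a gap in b
                cur[j - 1] - gap,   # extend with a gap in a
            ])
            cell_scores.append(cur[j])
        prev = cur
    # An alignment may end at any cell, so combine over all end points.
    return combine(cell_scores)

best = local_alignment_score("ACGT", "ACGT", max)
total = local_alignment_score("ACGT", "ACGT", logsumexp)
```

For two identical four-letter sequences the best-alignment score is 4.0, and the summed score is never smaller than the best one because it includes that alignment plus all others. With this simple gap model, alignments that differ only in the order of adjacent gaps are counted separately; a more careful model would avoid that.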
A simple method for finding related sequences by adding probabilities of alternative alignments. Martin C Frith. Genome Research. Published 2024-08-16. https://doi.org/10.1101/gr.279464.124
During embryonic development, cells undergo dynamic changes in gene expression that are required for appropriate cell fate specification. Although both transcription and mRNA degradation contribute to gene expression dynamics, patterns of mRNA decay are less well understood. Here we directly measured spatiotemporally resolved mRNA decay rates transcriptome-wide throughout C. elegans embryogenesis by transcription inhibition followed by bulk and single-cell RNA sequencing. This allowed us to calculate mRNA half-lives within specific cell types and developmental stages and to identify differentially regulated mRNA decay throughout embryonic development. We identified transcript features that are correlated with mRNA stability and found that mRNA decay rates are associated with distinct peaks in gene expression over time. Moreover, we provide evidence that, on average, mRNA is more stable in the germline than in the soma and in later embryonic stages than in earlier stages. This work suggests that differential mRNA decay across cell states and time helps to shape developmental gene expression, and it provides a valuable resource for studies of mRNA turnover regulatory mechanisms.
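As a rough illustration of how a half-life can be derived from a transcription-inhibition time course, the sketch below assumes simple first-order decay, N(t) = N0 · exp(-k·t), so that the half-life is ln(2)/k. The abstract does not state which decay model or fitting procedure the authors used, so the model, the function name and the example numbers are all assumptions.

```python
import math

def half_life(times, abundances):
    """Estimate an mRNA half-life by least-squares fitting of
    log(abundance) against time, assuming first-order decay."""
    logs = [math.log(x) for x in abundances]
    n = len(times)
    mean_t = sum(times) / n
    mean_log = sum(logs) / n
    slope = (
        sum((t - mean_t) * (l - mean_log) for t, l in zip(times, logs))
        / sum((t - mean_t) ** 2 for t in times)
    )
    decay_rate = -slope  # k in N(t) = N0 * exp(-k * t)
    return math.log(2) / decay_rate

# Hypothetical time course (minutes after inhibition) for one transcript
# whose abundance halves every 15 minutes.
times = [0, 10, 20, 30]
abundances = [100 * 0.5 ** (t / 15) for t in times]
estimate = half_life(times, abundances)
```

Real measurements are noisy and transcription inhibition may not be instantaneous, so a practical analysis would need more than this two-parameter fit.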
A spatiotemporally resolved atlas of mRNA decay in the C. elegans embryo reveals differential regulation of mRNA stability across stages and cell types. Felicia Peng, C Erik Nordgren, John Isaac Murray. Genome Research. Published 2024-08-14. https://doi.org/10.1101/gr.278980.124
Yulong Liu, Gang Zhai, Jingzhi Su, Yulong Gong, Binyuan Yang, Qisheng Lu, Longwei Xi, Yutong Zheng, Jingyue Cao, Haokun Liu, Junyan Jin, Zhimin Zhang, Yunxia Yang, Xiaoming Zhu, Zhongwei Wang, Gaorui Gong, Jie Mei, Zhan Yin, Rodolphe E. Gozlan, Shouqi Xie, Dong Han
Fish show variation in feeding habits to adapt to complex environments. However, the genetic basis of feeding preference and the corresponding metabolic strategies that differentiate feeding habits remain elusive. Here, by comparing the whole genome of a typical carnivorous fish (Leiocassis longirostris Günther) with that of herbivorous fish, we identify 250 genes through analyses of both positive selection and rapid evolution, including taste receptor type 1 member 3 (tas1r3) and trypsin. We demonstrate that tas1r3 is required for carnivore preference in tas1r3-deficient zebrafish and in a diet-shifted grass carp model. We confirm that trypsin correlates with the metabolic strategies of fish with distinct feeding habits. Furthermore, marked alterations in trypsin activity and metabolic profiles are accompanied by a transition of feeding preference in tas1r3-deficient zebrafish and diet-shifted grass carp. Our results reveal a conserved adaptation between feeding preference and corresponding metabolic strategies in fish, and provide novel insights into the adaptation of feeding habits over the course of evolution.
The Chinese longsnout catfish genome provides novel insights into the feeding preference and corresponding metabolic strategy of carnivores. Genome Research. Published 2024-08-09. https://doi.org/10.1101/gr.278476.123
Madalina Giurgiu, Nadine Wittstruck, Elias Rodriguez-Fos, Rocio Chamorro Gonzalez, Lotte Brueckner, Annabell Krienelke-Szymansky, Konstantin Helmsauer, Anne Hartebrodt, Philipp Euskirchen, Richard P. Koche, Kerstin Haase, Knut Reinert, Anton G. Henssen
Circular extrachromosomal DNA (ecDNA) is a form of oncogene amplification found across cancer types and associated with poor outcome in patients. ecDNA can be structurally complex and contain rearranged DNA sequences derived from multiple chromosome locations. As the structure of ecDNA can impact oncogene regulation and may indicate mechanisms of its formation, disentangling it at high resolution from sequencing data is essential. Even though methods have been developed to identify and reconstruct ecDNA in cancer genome sequencing, it remains challenging to resolve complex ecDNA structures, in particular amplicons with shared genomic footprints. Here we introduce Decoil, a computational method that combines a breakpoint-graph approach with regression to reconstruct complex ecDNA and deconvolve co-occurring ecDNA elements with overlapping genomic footprints from long-read nanopore sequencing. Decoil outperforms de novo assembly and alignment-based methods in simulated long-read sequencing data for both simple and complex ecDNAs. Applying Decoil to whole-genome sequencing data uncovered different ecDNA topologies and explored ecDNA structure heterogeneity in neuroblastoma tumors and cell lines, indicating that this method may improve ecDNA structural analyses in cancer.
Reconstructing extrachromosomal DNA structural heterogeneity from long-read sequencing data using Decoil. Genome Research. Published 2024-08-07. https://doi.org/10.1101/gr.279123.124
Matthew Man-Hou Hong, David Froelicher, Ricky Magner, Victoria Popic, Bonnie Berger, Hyunghoon Cho
Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging due to the burden of estimating kinship between all pairs of individuals across datasets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enables accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us datasets. On a dataset of 200K individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 hours of runtime. Our work enables secure identification of relatives across large-scale genomic datasets.
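The bucketing idea described above can be illustrated with a toy, unencrypted sketch. This is not SF-Relate's hash function or protocol: the windowed-substring key, the function names and the example haplotypes are all hypothetical, and in the real system the comparisons run under multiparty homomorphic encryption rather than on plain text. The sketch only shows how hashing shared segments into buckets limits comparisons to pairs that land in a matching bucket.

```python
from collections import defaultdict

def bucket_keys(haplotype, window=8):
    """Toy hash: each non-overlapping window of the haplotype string,
    tagged with its start position, is one bucket key."""
    return {
        (start, haplotype[start:start + window])
        for start in range(0, len(haplotype) - window + 1, window)
    }

def candidate_pairs(party_a, party_b, window=8):
    """Return only the cross-party pairs that share at least one bucket."""
    buckets = defaultdict(list)
    for name_a, hap in party_a.items():
        for key in bucket_keys(hap, window):
            buckets[key].append(name_a)
    pairs = set()
    for name_b, hap in party_b.items():
        for key in bucket_keys(hap, window):
            for name_a in buckets.get(key, []):
                pairs.add((name_a, name_b))
    return pairs

# Hypothetical data: a1 and b1 share an identical first segment.
party_a = {"a1": "0101100111010010", "a2": "1110001011001101"}
party_b = {"b1": "0101100100001111", "b2": "0000000000000000"}
pairs = candidate_pairs(party_a, party_b)
```

Here only one of the four possible cross-party pairs is kept for the more expensive relatedness test. An exact-match window is brittle to genotyping errors, which is one reason a practical hash needs to be more tolerant than this one.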
Secure discovery of genetic relatives across large-scale and distributed genomic datasets. Genome Research. Published 2024-08-07. https://doi.org/10.1101/gr.279057.124
Xavi Guitart, David Porubsky, DongAhn Yoo, Max L Dougherty, Philip Dishuck, Katherine M. Munson, Alexandra P. Lewis, Kendra Hoekzema, Jordan Knuth, Stephen Chang, Tomi Pastinen, Evan E. Eichler
TBC1D3 is a primate-specific gene family that has expanded in the human lineage and has been implicated in neuronal progenitor proliferation and expansion of the frontal cortex. The gene family and its expression have been challenging to investigate because it is embedded in high-identity and highly variable segmental duplications. We sequenced and assembled the gene family using long-read sequencing data from 34 humans and 11 non-human primate species. Our analysis shows that this gene family has independently duplicated in at least five primate lineages, and the duplicated loci are enriched at sites of large-scale chromosomal rearrangements on Chromosome 17. We find that all human copy number variation maps to two distinct clusters located at Chromosome 17q12 and that humans are highly structurally variable at this locus, differing by as many as 20 copies and ~1 Mbp in length depending on haplotype. We also show evidence of positive selection, as well as a significant change in the predicted human TBC1D3 protein sequence. Lastly, we find that, despite multiple duplications, human TBC1D3 expression is limited to a subset of copies, most notably those from a single paralog group: TBC1D3-CDKL. These observations may help explain why a gene potentially important in cortical development can be so variable in the human population.
Independent expansion, selection and hypervariability of the TBC1D3 gene family in humans. Genome Research. Published 2024-08-06. https://doi.org/10.1101/gr.279299.124
Chencheng Xu, Suying Bao, Ye Wang, Wenxing Li, Hao Chen, Yufeng Shen, Tao Jiang, Chaolin Zhang
Alternative splicing plays a crucial role in protein diversity and gene expression regulation in higher eukaryotes, and mutations causing dysregulated splicing underlie a range of genetic diseases. Computational prediction of alternative splicing from genomic sequences not only provides insight into gene-regulatory mechanisms but also helps identify disease-causing mutations and drug targets. However, the current methods for the quantitative prediction of splice site usage still have limited accuracy. Here, we present DeltaSplice, a deep neural network model optimized to learn the impact of mutations on quantitative changes in alternative splicing from the comparative analysis of homologous genes. The model architecture enables DeltaSplice to perform "reference-informed prediction" by incorporating the known splice site usage of a reference gene sequence to improve its prediction on splicing-altering mutations. We benchmarked DeltaSplice and several other state-of-the-art methods on various prediction tasks, including evolutionary sequence divergence on lineage-specific splicing and splicing-altering mutations in human populations and neurodevelopmental disorders, and demonstrated that DeltaSplice consistently outperformed the other methods. DeltaSplice predicted ~15% of splicing quantitative trait loci (sQTLs) in the human brain as causal splicing-altering variants. It also predicted splicing-altering de novo mutations outside the splice sites in a subset of patients affected by autism and other neurodevelopmental disorders (NDD), including 19 genes with recurrent splicing-altering mutations. Integration of splicing-altering mutations with other types of de novo mutation burdens allowed prediction of eight novel NDD-risk genes. Our work expanded the capacity of in silico splicing models with potential applications in genetic diagnosis and the development of splicing-based precision medicine.
Reference-informed prediction of alternative splicing and splicing-altering mutations from sequences. Genome Research. Published 2024-07-26. https://doi.org/10.1101/gr.279044.124
Signal peptides (SP) play a crucial role in protein translocation in cells. The development of large protein language models (PLMs) and prompt-based learning provides a new opportunity for SP prediction, especially for the categories with limited annotated data. We present a parameter-efficient fine-tuning (PEFT) framework for SP prediction, PEFT-SP, to effectively utilize pretrained PLMs. We integrated low-rank adaptation (LoRA) into ESM-2 models to better leverage the protein sequence evolutionary knowledge of PLMs. Experiments show that PEFT-SP using LoRA enhances state-of-the-art results, leading to a maximum Matthews correlation coefficient (MCC) gain of 87.3% for SPs with small training samples and an overall MCC gain of 6.1%. Furthermore, we also employed two other PEFT methods, prompt tuning and adapter tuning, in ESM-2 for SP prediction. More elaborate experiments show that PEFT-SP using adapter tuning can also improve the state-of-the-art results by up to 28.1% MCC gain for SPs with small training samples and an overall MCC gain of 3.8%. LoRA requires fewer computing resources and less memory than the adapter during the training stage, making it possible to adapt larger and more powerful protein models for SP prediction.
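To make the low-rank adaptation idea concrete, the sketch below shows the general LoRA computation for a single linear layer in plain Python: the frozen weight matrix W is left unchanged, and a trainable low-rank product B·A, scaled by alpha/r, is added to its output. It is not the PEFT-SP implementation. The helper names, the tiny matrices and the 1024 × 1024 layer size are assumptions chosen for illustration and are not taken from ESM-2.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Output of a LoRA-adapted linear layer: W x + (alpha / r) * B (A x).

    W: frozen d_out x d_in weights; A: r x d_in; B: d_out x r.
    Only A and B would be trained.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Number of trainable values in the low-rank factors A and B."""
    return r * (d_in + d_out)

# Tiny example with rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward(W, A, B, [1.0, 2.0])

# Hypothetical 1024 x 1024 layer adapted with rank 8.
full = 1024 * 1024
low_rank = trainable_params(1024, 1024, 8)
```

In the tiny example the adapted output is [4.0, 8.0], compared with [1.0, 2.0] from the frozen layer alone. For the hypothetical 1024 × 1024 layer, rank 8 trains 16,384 values instead of 1,048,576, which is the kind of saving that makes adapting larger models practical.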
Parameter-efficient fine-tuning on large protein language models improves signal peptide prediction. Shuai Zeng, Duolin Wang, Lei Jiang, Dong Xu. Genome Research. Published 2024-07-26. https://doi.org/10.1101/gr.279132.124
Tasfia Zahin, Qian Shi, Xiaofei Carl Zang, Mingfu Shao
Circular RNA (circRNA) is a class of RNA molecules that forms a closed loop with its 5' and 3' ends covalently bonded. circRNAs are known to be more stable than linear RNAs, exhibit distinct properties and functions, and have been proven to be promising biomarkers. Existing methods for assembling circRNAs rely heavily on annotated transcriptomes and hence show unsatisfactory accuracy without a high-quality transcriptome. We present TERRACE, a new algorithm for full-length assembly of circRNAs from paired-end total RNA-seq data. TERRACE uses the splice graph as the underlying data structure that organizes the splicing and coverage information. We transform the problem of assembling circRNAs into finding paths that "bridge" the three fragments in the splice graph induced by back-spliced reads. We adopt a definition for optimal bridging paths and a dynamic programming algorithm to calculate such optimal paths. TERRACE features an efficient algorithm to detect back-spliced reads missed by RNA-seq aligners, contributing to its much improved sensitivity. It also incorporates a new machine-learning approach trained to assign a confidence score to each assembled circRNA, which is shown to be superior to using abundance for scoring. On both simulated and biological datasets, TERRACE consistently outperforms existing methods by a large margin in sensitivity while maintaining better or comparable precision. In particular, when the annotations are not provided, TERRACE assembles 123%-413% more correct circRNAs than state-of-the-art methods. TERRACE represents a major advance in assembling full-length circRNAs from RNA-seq data, and we expect it to be widely used in downstream research on circRNAs.
{"title":"Accurate assembly of circular RNAs with TERRACE","authors":"Tasfia Zahin, Qian Shi, Xiaofei Carl Zang, Mingfu Shao","doi":"10.1101/gr.279106.124","DOIUrl":"https://doi.org/10.1101/gr.279106.124","url":null,"abstract":"Circular RNA (circRNA) is a class of RNA molecules that forms a closed loop with its 5' and 3' ends covalently bonded. circRNAs are known to be more stable than linear RNAs, admit distinct properties and functions, and have been proven to be promising biomarkers. Existing methods for assembling circRNAs heavily rely on the annotated transcriptomes, hence exhibiting unsatisfactory accuracy without a high-quality transcriptome. We present TERRACE, a new algorithm for full-length assembly of circRNAs from paired-end total RNA-seq data. TERRACE uses the splice graph as the underlying data structure that organizes the splicing and coverage information. We transform the problem of assembling circRNAs into finding paths that \"bridge\" the three fragments in the splice graph induced by back-spliced reads. We adopt a definition for optimal bridging paths and a dynamic programming algorithm to calculate such optimal paths. TERRACE features an efficient algorithm to detect back-spliced reads missed by RNA-seq aligners, contributing to its much improved sensitivity. It also incorporates a new machine-learning approach trained to assign a confidence score to each assembled circRNA, which is shown superior to using abundance for scoring. On both simulations and biological datasets TERRACE consistently outperforms existing methods by a large margin in sensitivity while maintaining better or comparable precision. In particular, when the annotations are not provided, TERRACE assembles 123%-413% more correct circRNAs than state-of-the-art methods. 
TERRACE presents a major leap on assembling full-length circRNAs from RNA-seq data, and we expect it to be widely used in the downstream research on circRNAs.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"58 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141764244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lee H Wong, Kate H Brettingham-Moore, Lyn Chan, Julie M Quach, Melissa A Anderson, Emma L Northrop, Ross Hannan, Richard Saffery, Margaret L Shaw, Evan Williams, K H Andy Choo
{"title":"Corrigendum: Centromere RNA is a key component for the assembly of nucleoproteins at the nucleolus and centromere.","authors":"Lee H Wong, Kate H Brettingham-Moore, Lyn Chan, Julie M Quach, Melissa A Anderson, Emma L Northrop, Ross Hannan, Richard Saffery, Margaret L Shaw, Evan Williams, K H Andy Choo","doi":"10.1101/gr.279693.124","DOIUrl":"10.1101/gr.279693.124","url":null,"abstract":"","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"34 6","pages":"979-980"},"PeriodicalIF":6.2,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11293534/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141751459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}