Pub Date : 2024-07-08 DOI: 10.1186/s13059-024-03287-7
April Rich, Omer Acar, Anne-Ruxandra Carvunis
Recent studies uncovered pervasive transcription and translation of thousands of noncanonical open reading frames (nORFs) outside of annotated genes. The contribution of nORFs to cellular phenotypes is difficult to infer using conventional approaches because nORFs tend to be short, of recent de novo origin, and expressed at low levels. Here we develop a dedicated coexpression analysis framework that accounts for low expression to investigate the transcriptional regulation, evolution, and potential cellular roles of nORFs in Saccharomyces cerevisiae. Our results reveal that nORFs tend to be preferentially coexpressed with genes involved in cellular transport or homeostasis but rarely with genes involved in RNA processing. Mechanistically, we discover that young de novo nORFs located downstream of conserved genes tend to leverage their neighbors’ promoters through transcription readthrough, resulting in high coexpression and high expression levels. Transcriptional piggybacking also influences the coexpression profiles of young de novo nORFs located upstream of genes, but to a lesser extent and without detectable impact on expression levels. Transcriptional piggybacking influences, but does not determine, the transcription profiles of de novo nORFs emerging near genes. About 40% of nORFs are not strongly coexpressed with any gene but are transcriptionally regulated nonetheless and tend to form entirely new transcription modules. We offer a web browser interface (https://carvunislab.csb.pitt.edu/shiny/coexpression/) to efficiently query, visualize, and download our coexpression inferences. Our results suggest that nORF transcription is highly regulated. Our coexpression dataset serves as an unprecedented resource for unraveling how nORFs integrate into cellular networks, contribute to cellular phenotypes, and evolve.
Title: Massively integrated coexpression analysis reveals transcriptional regulation, evolution and cellular implications of the yeast noncanonical translatome (Genome Biology)
Pub Date : 2024-07-08 DOI: 10.1186/s13059-024-03322-7
Fabiola Curion, Charlotte Rich-Griffin, Devika Agarwal, Sarah Ouologuem, Kevin Rue-Albrecht, Lilly May, Giulia E. L. Garcia, Lukas Heumos, Tom Thomas, Wojciech Lason, David Sims, Fabian J. Theis, Calliope A. Dendrou
Single-cell multiomic analysis of the epigenome, transcriptome, and proteome allows for comprehensive characterization of the molecular circuitry that underpins cell identity and state. However, the holistic interpretation of such datasets presents a challenge given a paucity of approaches for systematic, joint evaluation of different modalities. Here, we present Panpipes, a set of computational workflows designed to automate multimodal single-cell and spatial transcriptomic analyses by incorporating widely used Python-based tools to perform quality control, preprocessing, integration, clustering, and reference mapping at scale. Panpipes allows reliable and customizable analysis and evaluation of individual and integrated modalities, thereby empowering decision-making before downstream investigations.
Title: Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis (Genome Biology)
Pub Date : 2024-07-05 DOI: 10.1186/s13059-024-03329-0
Helena L Crowell, Sarah X Morillo Leonardo, Charlotte Soneson, Mark D Robinson
Title: Author Correction: The shaky foundations of simulating single-cell RNA sequencing data. (Genome Biology)
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11227192/pdf/
Pub Date : 2024-07-04 DOI: 10.1186/s13059-024-03319-2
Helyaneh Ziaei Jam, Justin M. Zook, Sara Javadzadeh, Jonghun Park, Aarushi Sehgal, Melissa Gymrek
Tandem repeats are frequent across the human genome, and variation in repeat length has been linked to a variety of traits. Recent improvements in long read sequencing technologies have the potential to greatly improve tandem repeat analysis, especially for long or complex repeats. Here, we introduce LongTR, which accurately genotypes tandem repeats from high-fidelity long reads available from both PacBio and Oxford Nanopore Technologies. LongTR is freely available at https://github.com/gymrek-lab/longtr and https://zenodo.org/doi/10.5281/zenodo.11403979.
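To make the idea of repeat-length variation concrete, here is a minimal Python sketch that counts the longest uninterrupted run of a motif in a read. It is an illustration only and does not reproduce LongTR's genotyping method; the function name `longest_repeat_run` and the example read are hypothetical.

```python
def longest_repeat_run(sequence, motif):
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    m = len(motif)
    best = 0
    # Try every start position so that runs in any phase are found.
    for start in range(len(sequence) - m + 1):
        run = 0
        pos = start
        # Extend the run while the next window still matches the motif.
        while sequence[pos:pos + m] == motif:
            run += 1
            pos += m
        best = max(best, run)
    return best


# Hypothetical read carrying three consecutive CAG units.
copies = longest_repeat_run("TTCAGCAGCAGTT", "CAG")
print(copies)
```

Two reads from different alleles would give different counts for the same motif, which is the kind of length difference a tandem repeat genotyper has to estimate from noisy data.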
Title: LongTR: genome-wide profiling of genetic variation at tandem repeats from long reads (Genome Biology)
Pub Date : 2024-07-04 DOI: 10.1186/s13059-024-03320-9
Yanqi Dong, Wei-Hua Chen, Xing-Ming Zhao
Identifying viruses from metagenomes is a common step to explore the virus composition in the human gut. Here, we introduce VirRep, a hybrid language representation learning framework, for identifying viruses from human gut metagenomes. VirRep combines a context-aware encoder and an evolution-aware encoder to improve sequence representation by incorporating k-mer patterns and sequence homologies. Benchmarking on both simulated and real datasets with varying viral proportions demonstrates that VirRep outperforms state-of-the-art methods. When applied to fecal metagenomes from a colorectal cancer cohort, VirRep identifies 39 high-quality viral species associated with the disease, many of which cannot be detected by existing methods.
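As a rough illustration of what a k-mer pattern is, the sketch below builds a k-mer count profile for a DNA sequence in Python. It is not VirRep's encoder, which the abstract describes only at a high level; the function name `count_kmers` and the toy sequence are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence; real inputs would be metagenomic contigs.
profile = count_kmers("ACGTAC", 2)
print(profile.most_common())
```

Profiles like this summarize local composition, whereas homology-based signals compare a sequence against known relatives; the abstract states that VirRep combines both kinds of information.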
Title: VirRep: a hybrid language representation learning framework for identifying viruses from human gut metagenomes (Genome Biology)
Pub Date : 2024-07-03 DOI: 10.1186/s13059-024-03325-4
Marta Olivares, Paula Hernández-Calderón, Sonia Cárdenas-Brito, Rebeca Liébana-García, Yolanda Sanz, Alfonso Benítez-Páez
The gut microbiota controls broad aspects of human metabolism and feeding behavior, but the basis for this control remains largely unclear. Given the key role of human dipeptidyl peptidase 4 (DPP4) in host metabolism, we investigate whether microbiota DPP4-like counterparts perform the same function. We identify novel functional homologs of human DPP4 in several bacterial species inhabiting the human gut, and specific associations between Parabacteroides and Porphyromonas DPP4-like genes and type 2 diabetes (T2D). We also find that the DPP4-like enzyme from the gut symbiont Parabacteroides merdae mimics the proteolytic activity of the human enzyme on peptide YY, neuropeptide Y, gastric inhibitory polypeptide (GIP), and glucagon-like peptide 1 (GLP-1) hormones in vitro. Importantly, administration of E. coli overexpressing the P. merdae DPP4-like enzyme to lipopolysaccharide-treated mice with impaired gut barrier function reduces active GIP and GLP-1 levels, which is attributed to increased DPP4 activity in the portal circulation and the cecal content. Finally, we observe that linagliptin, saxagliptin, sitagliptin, and vildagliptin, antidiabetic drugs with DPP4 inhibitory activity, differentially inhibit the activity of the DPP4-like enzyme from P. merdae. Our findings confirm that proteolytic enzymes produced by the gut microbiota are likely to contribute to the glucose metabolic dysfunction that underlies T2D by inactivating incretins, which might inspire the development of improved antidiabetic therapies.
Title: Gut microbiota DPP4-like enzymes are increased in type-2 diabetes and contribute to incretin inactivation (Genome Biology)
Pub Date : 2024-07-03 DOI: 10.1186/s13059-024-03327-2
Daijing Sun, Yueyan Zhu, Wenzhu Peng, Shenghui Zheng, Jie Weng, Shulong Dong, Jiaqi Li, Qi Chen, Chuanhui Ge, Liyong Liao, Yuhao Dong, Yun Liu, Weida Meng, Yan Jiang
Transposable elements play a critical role in maintaining genome architecture during neurodevelopment. Short Interspersed Nuclear Elements (SINEs), a major subtype of transposable elements, are known to harbor binding sites for the CCCTC-binding factor (CTCF) and are pivotal in orchestrating chromatin organization. However, the regulatory mechanisms controlling the activity of SINEs in the developing brain remain elusive. In our study, we conduct a comprehensive genome-wide epigenetic analysis in mouse neural precursor cells using ATAC-seq, ChIP-seq, whole genome bisulfite sequencing, in situ Hi-C, and RNA-seq. Our findings reveal that SET domain bifurcated histone lysine methyltransferase 1 (SETDB1)-mediated H3K9me3, in conjunction with DNA methylation, restricts chromatin accessibility on a selective subset of SINEs in neural precursor cells. Mechanistically, loss of Setdb1 increases CTCF access to these SINE elements and contributes to chromatin loop reorganization. Moreover, de novo loop formation contributes to differential gene expression, including the dysregulation of genes enriched in mitotic pathways. This leads to the disruption of cell proliferation in the embryonic brain after genetic ablation of Setdb1 both in vitro and in vivo. In summary, our study sheds light on the epigenetic regulation of SINEs in mouse neural precursor cells, suggesting their role in maintaining chromatin organization and cell proliferation during neurodevelopment.
Title: SETDB1 regulates short interspersed nuclear elements and chromatin loop organization in mouse neural precursor cells (Genome Biology)
Pub Date : 2024-07-02 DOI: 10.1186/s13059-024-03301-y
Alison D. Tang, Colette Felton, Eva Hrabeta-Robinson, Roger Volden, Christopher Vollmers, Angela N. Brooks
RNA-seq has brought forth significant discoveries regarding aberrations in RNA processing, implicating these RNA variants in a variety of diseases. Aberrant splicing and single nucleotide variants (SNVs) in RNA have been demonstrated to alter transcript stability, localization, and function. In particular, the upregulation of ADAR, an enzyme that mediates adenosine-to-inosine editing, has been previously linked to an increase in the invasiveness of lung adenocarcinoma cells and associated with splicing regulation. Despite the functional importance of studying splicing and SNVs, the use of short-read RNA-seq has limited the community’s ability to interrogate both forms of RNA variation simultaneously. We employ long-read sequencing technology to obtain full-length transcript sequences, elucidating cis-effects of variants on splicing changes at a single molecule level. We develop a computational workflow that augments FLAIR, a tool that calls isoform models expressed in long-read data, to integrate RNA variant calls with the associated isoforms that bear them. We generate nanopore data with high sequence accuracy from H1975 lung adenocarcinoma cells with and without knockdown of ADAR. We apply our workflow to identify key inosine isoform associations to help clarify the prominence of ADAR in tumorigenesis. Ultimately, we find that a long-read approach provides valuable insight toward characterizing the relationship between RNA variants and splicing patterns.
Title: Detecting haplotype-specific transcript variation in long reads with FLAIR2 (Genome Biology)
Pub Date : 2024-07-01 DOI: 10.1186/s13059-024-03292-w
Mengying Hu, Maria Chikina
Computational cell type deconvolution enables the estimation of cell type abundance from bulk tissues and is important for understanding the tissue microenvironment, especially in tumor tissues. With the rapid development of deconvolution methods, many benchmarking studies have been published aiming for a comprehensive evaluation of these methods. Benchmarking studies rely on cell-type resolved single-cell RNA-seq data to create simulated pseudobulk datasets by adding individual cell types in controlled proportions. In our work, we show that the standard application of this approach, which uses randomly selected single cells regardless of the intrinsic differences between them, generates synthetic bulk expression values that lack appropriate biological variance. We demonstrate why and how the current bulk simulation pipeline with random cells is unrealistic and propose a heterogeneous simulation strategy as a solution. The heterogeneously simulated bulk samples match the variance observed in real bulk datasets and therefore provide concrete benefits for benchmarking in several ways. We demonstrate that conceptual classes of deconvolution methods differ dramatically in their robustness to heterogeneity, with reference-free methods performing particularly poorly. For regression-based methods, the heterogeneous simulation provides an explicit framework to disentangle the contributions of reference construction and regression methods to performance. Finally, we perform an extensive benchmark of diverse methods across eight different datasets and find BayesPrism and a hybrid MuSiC/CIBERSORTx approach to be the top performers. Our heterogeneous bulk simulation method and the entire benchmarking framework are implemented in a user-friendly package available at https://github.com/humengying0907/deconvBenchmarking and https://doi.org/10.5281/zenodo.8206516, enabling further developments in deconvolution methods.
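The standard pseudobulk construction that the abstract critiques can be sketched in a few lines of Python: cells are drawn at random within each cell type, in numbers set by the target proportions, and their expression vectors are summed. This is a minimal sketch under stated assumptions, not the deconvBenchmarking implementation; the function name `simulate_pseudobulk`, the toy expression values, and the use of rounding to convert proportions into cell counts are all hypothetical. The heterogeneous strategy proposed by the authors is not shown here.

```python
import random


def simulate_pseudobulk(cells_by_type, proportions, n_cells, rng):
    """Sum the expression of randomly drawn cells, per type, in given proportions.

    cells_by_type: dict mapping cell type -> list of per-cell expression vectors
    proportions:   dict mapping cell type -> target fraction of the mixture
    n_cells:       total number of cells to pool into one pseudobulk sample
    rng:           random.Random instance, passed in for reproducibility
    """
    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0.0] * n_genes
    for cell_type, fraction in proportions.items():
        # Assumption: cell counts are obtained by simple rounding.
        n_draw = round(fraction * n_cells)
        for _ in range(n_draw):
            # Random selection with replacement, ignoring which sample
            # or state the cell came from (the "random cells" approach).
            cell = rng.choice(cells_by_type[cell_type])
            for gene_index, value in enumerate(cell):
                bulk[gene_index] += value
    return bulk


# Toy data: two cell types, two genes, identical cells within each type.
toy_cells = {"T": [[1.0, 0.0], [1.0, 0.0]], "B": [[0.0, 2.0], [0.0, 2.0]]}
toy_bulk = simulate_pseudobulk(toy_cells, {"T": 0.5, "B": 0.5}, 4, random.Random(0))
print(toy_bulk)
```

Because every draw ignores where a cell came from, repeated simulations with the same proportions average over the same pool of cells, which is consistent with the abstract's point that such samples lack appropriate biological variance.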
Title: Heterogeneous pseudobulk simulation enables realistic benchmarking of cell-type deconvolution methods (Genome Biology)
Pub Date : 2024-06-26 DOI: 10.1186/s13059-024-03316-5
Silvia Di Maio, Peter Zöscher, Hansi Weissensteiner, Lukas Forer, Johanna F. Schachtl-Riess, Stephan Amstler, Gertraud Streiter, Cathrin Pfurtscheller, Bernhard Paulweber, Florian Kronenberg, Stefan Coassin, Sebastian Schönherr
Variable number tandem repeats (VNTRs) are highly polymorphic DNA regions harboring many potentially disease-causing variants. However, VNTRs often appear unresolved (“dark”) in variation databases due to their repetitive nature. One particularly complex and medically relevant VNTR is the KIV-2 VNTR located in the cardiovascular disease gene LPA, which encompasses up to 70% of the coding sequence. Using the highly complex LPA gene as a model, we develop a computational approach to resolve intra-repeat variation in VNTRs from widely available short-read sequencing data. We apply the approach to six protein-coding VNTRs in 2504 samples from the 1000 Genomes Project and develop an optimized method for the LPA KIV-2 VNTR that discriminates the confounding KIV-2 subtypes upfront. This results in an F1-score improvement of up to 2.1-fold compared to previously published strategies. Finally, we analyze the LPA VNTR in > 199,000 UK Biobank samples, detecting > 700 KIV-2 mutations. This approach successfully reveals new strong Lp(a)-lowering effects for KIV-2 variants, with a protective effect against coronary artery disease, and also validates previous findings based on tagging SNPs. Our approach paves the way for reliable variant detection in VNTRs at scale, and we show that it is transferable to other dark regions, which will help unlock medical information hidden in VNTRs.
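For readers unfamiliar with the metric quoted above, the F1-score is the harmonic mean of precision and recall, which reduces to 2·TP / (2·TP + FP + FN). The Python sketch below computes it from variant-call counts; the function name `f1_score` and the example counts are hypothetical and are not taken from the study.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        # No calls and no true variants: define the score as 0.0 here.
        return 0.0
    return 2 * true_positives / denominator


# Hypothetical counts: 8 variants called correctly, 2 spurious, 2 missed.
example_f1 = f1_score(8, 2, 2)
print(example_f1)
```

A fold improvement in F1, as reported in the abstract, is the ratio of two such scores computed for different calling strategies on the same truth set.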
Title: Resolving intra-repeat variation in medically relevant VNTRs from short-read sequencing data using the cardiovascular risk gene LPA as a model (Genome Biology)