Pengyao Ping, Shuquan Su, Xinhui Cai, Tian Lan, Xuan Zhang, Hui Peng, Yi Pan, Wei Liu, Jinyan Li
Although the per-base error rate of short-read sequencing data is very low at 0.1%-0.5%, the proportion of erroneous reads in a dataset can be as high as 10%-15%, or millions of reads in absolute terms. As current methods correct only some errors while introducing many new errors, we solve this problem by turning erroneous reads into their original states without introducing any non-existent reads, thereby preserving data integrity. The novelty originates from a computable rule translated from the polymerase chain reaction (PCR) erring mechanism: a rare read is erroneous if it has a neighbouring read of high abundance. With this principle, we construct a graph that links each pair of reads separated by a tiny edit distance to detect a solid part of the erroneous reads; we then use these pairs as training data to learn the erring mechanisms and identify the possibly remaining hard-case errors between pairs of high-abundance reads. The proposed approach, noise2read, can rectify erroneous reads from short-read sequencing data whenever PCR is involved. Compared with state-of-the-art methods on tens of evaluation datasets of unique molecular identifier (UMI) based ground truth, noise2read performs significantly better on 19 metrics. Case studies found that noise2read can greatly improve short-read quality and have a substantial impact on genome abundance quantification, isoform identification, single nucleotide polymorphism (SNP) profiling, and genome editing efficiency estimation. Noise2read is publicly available at https://github.com/JappyPing/noise2read and https://ngdc.cncb.ac.cn/biocode/tool/7951.
{"title":"Noise2read: Accurately Rectify Millions of Erroneous Short Reads Through Graph Learning on Edit Distances.","authors":"Pengyao Ping, Shuquan Su, Xinhui Cai, Tian Lan, Xuan Zhang, Hui Peng, Yi Pan, Wei Liu, Jinyan Li","doi":"10.1093/gpbjnl/qzaf120","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf120","url":null,"abstract":"<p><p>Although the per-base error rate of short-read sequencing data is very low at 0.1%-0.5%, the percentage/probability of erroneous reads in a dataset can be as high as 10%-15% or in the number millions. As current methods correct only some errors while introducing many new errors, we solve this problem by turning erroneous reads into their original states, without bringing up any non-existing reads to keep the data integrity. The novelty is originated in a computable rule translated from polymerase chain reaction (PCR) erring mechanism that: a rare read is erroneous if it has a neighbouring read of high abundance. With this principle, we construct a graph to link each pair of reads of tiny edit distances to detect a solid part of erroneous reads; then we consider these pairs of reads of tiny edit distances as training data to learn the erring mechanisms to identify possibly remaining hard-case errors between pairs of high-abundance reads. The proposed approach, noise2read, is competent to handle the rectification of erroneous reads from short-read sequencing data whenever PCR is involved. Compared with state-of-the-art methods on tens of evaluation datasets of unique molecular identifier (UMI) based ground truth, noise2read performs significantly better on 19 metrics. Case studies found that noise2read can greatly improve short-reads quality and make substantial impact on genome abundance quantification, isoform identification, single nucleotide polymorphisms (SNP) profiling, and genome editing efficiency estimation. 
Noise2read is publicly available at https://github.com/JappyPing/noise2read and https://ngdc.cncb.ac.cn/biocode/tool/7951.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145644031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
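The rule stated in the noise2read abstract (a rare read is erroneous if it has a neighbouring read of high abundance) can be illustrated with a minimal Python sketch. This is not the noise2read implementation: the function names, the thresholds `max_dist`, `rare_max`, and `abundant_min`, and the decision to correct only when exactly one abundant neighbour exists are all assumptions made for illustration, and the learning step for hard cases is omitted.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def rectify_reads(reads, max_dist=1, rare_max=1, abundant_min=5):
    """Map rare reads onto an abundant neighbour (hypothetical thresholds).

    Only reads that already exist in the input are ever emitted, so no
    new sequence is introduced.
    """
    counts = Counter(reads)

    # Graph edges: pairs of distinct reads within max_dist edits.
    # The all-pairs scan is quadratic and only suitable for toy inputs.
    neighbours = {read: [] for read in counts}
    for a, b in combinations(counts, 2):
        if edit_distance(a, b) <= max_dist:
            neighbours[a].append(b)
            neighbours[b].append(a)

    corrections = {}
    for read, n in counts.items():
        if n > rare_max:
            continue  # not rare, leave untouched
        abundant = [m for m in neighbours[read] if counts[m] >= abundant_min]
        if len(abundant) == 1:  # assumption: skip ambiguous cases
            corrections[read] = abundant[0]

    return [corrections.get(read, read) for read in reads]
```

For example, in a toy input with six copies of `ACGT`, one copy of `ACGA`, and two copies of `TTTT`, the single `ACGA` read is replaced by `ACGT`, while `TTTT` is kept because it has no abundant neighbour within one edit.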
Bitao Zhong, Shaoqi Wang, Xiaoxi Jing, Aniruddh P Patel, Yajie Zhao, Minxian Wang
{"title":"Harnessing Large Cohorts and AI to Bridge Genomic Discovery and Clinical Practice.","authors":"Bitao Zhong, Shaoqi Wang, Xiaoxi Jing, Aniruddh P Patel, Yajie Zhao, Minxian Wang","doi":"10.1093/gpbjnl/qzaf104","DOIUrl":"10.1093/gpbjnl/qzaf104","url":null,"abstract":"","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12674694/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145644039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Wang, Peng Jia, Stephen J Bush, Xia Wang, Yi Yang, Yu Zhang, Shijie Wan, Xiaofei Yang, Pengyu Zhang, Yuanting Zheng, Leming Shi, Lianhua Dong, Kai Ye
Recent advances in sequencing technologies have enabled the complete assembly of human genomes from telomere to telomere (T2T), resolving previously inaccessible regions such as centromeres and segmental duplications. Here, we present an updated, higher-quality, haplotype-phased T2T assembly of the Chinese Quartet (T2T-CQ), a family cohort comprising monozygotic twins and their parents, generated using high-coverage ONT ultralong and PacBio HiFi sequencing. The T2T-CQ assembly serves as a crucial reference genome for integrating publicly available multi-omics data and advances the utility of the Quartet reference materials. The T2T-CQ assembly scores highly on multiple metrics of continuity and completeness, with Genome Continuity Inspector (GCI) scores of 77.76 (maternal) and 76.41 (paternal), 21-mer quality values (QV) > 66, and Clipping Reveals Assembly Quality (CRAQ) scores > 99.6 for both haplotypes, enabling complete annotation of centromeric regions. Within these regions, we identified novel 13-mer higher-order repeat patterns on chromosome 17 which exhibited a monophyletic origin and emerged approximately 230 thousand years ago. Overall, this work establishes an essential genomic resource for the Han Chinese population and advances the development of a T2T pan-Chinese reference genome, which will significantly enable future investigations both into population-specific structural variants and the evolutionary dynamics of centromeres.
{"title":"A Telomere-to-Telomere Diploid Reference Genome and Centromere Structure of the Chinese Quartet.","authors":"Bo Wang, Peng Jia, Stephen J Bush, Xia Wang, Yi Yang, Yu Zhang, Shijie Wan, Xiaofei Yang, Pengyu Zhang, Yuanting Zheng, Leming Shi, Lianhua Dong, Kai Ye","doi":"10.1093/gpbjnl/qzaf118","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf118","url":null,"abstract":"<p><p>Recent advances in sequencing technologies have enabled the complete assembly of human genomes from telomere to telomere (T2T), resolving previously inaccessible regions such as centromeres and segmental duplications. Here, we present an updated, higher-quality, haplotype-phased T2T assembly of the Chinese Quartet (T2T-CQ), a family cohort comprising monozygotic twins and their parents, generated using high-coverage ONT ultralong and PacBio HiFi sequencing. The T2T-CQ assembly serves as a crucial reference genome for integrating publicly available multi-omics data and advances the utility of the Quartet reference materials. The T2T-CQ assembly scores highly on multiple metrics of continuity and completeness, with Genome Continuity Inspector (GCI) scores of 77.76 (maternal) and 76.41 (paternal), 21-mer quality values (QV) > 66, and Clipping Reveals Assembly Quality (CRAQ) scores > 99.6 for both haplotypes, enabling complete annotation of centromeric regions. Within these regions, we identified novel 13-mer higher-order repeat patterns on chromosome 17 which exhibited a monophyletic origin and emerged approximately 230 thousand years ago. 
Overall, this work establishes an essential genomic resource for the Han Chinese population and advances the development of a T2T pan-Chinese reference genome, which will significantly enable future investigations both into population-specific structural variants and the evolutionary dynamics of centromeres.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145643940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Factor analysis is a method that condenses multiple variables into a few latent factors. It can be used to extract the underlying sources of biological variation in high-dimensional data and distill them into interpretable gene programs. However, existing factorization methods lack adaptability in selecting the optimal number of factors and interpretability in capturing biological variation. To address these concerns, we propose Deep Beta Process (DBP), a deep probabilistic framework for adaptive and interpretable factor analysis of single-cell transcriptomic data. DBP achieves adaptive selection of factors through a stick-breaking Beta process and performs batch correction using an adversarial learning strategy. We validate the flexible factor extraction and robust batch correction capabilities of DBP on simulated datasets. We also demonstrate its superior performance in dimensionality reduction and biological interpretability while explaining biological variation from both cell and gene perspectives using factor and loading matrices. The application of DBP to a gastric adenocarcinoma dataset reveals malignant epithelial cell heterogeneity, providing valuable insights for investigating the molecular mechanisms of disease onset and progression. DBP is available at https://github.com/labomics/DBP and https://ngdc.cncb.ac.cn/biocode/tool/BT007954.
{"title":"DBP: Adaptive and Interpretable Factor Analysis for Single-cell RNA-seq Data with Deep Beta Processes.","authors":"Runyan Liu, Shuofeng Hu, Guohua Dong, Tongtong Kan, Jinhui Shi, Jing Wang, Jiahao Zhou, Zhen He, Xiaomin Ying","doi":"10.1093/gpbjnl/qzaf117","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf117","url":null,"abstract":"<p><p>Factor analysis is a method that condenses multiple variables into a few latent factors. It can be used to extract the underlying sources of biological variation in high-dimensional data and distill them into interpretable gene programs. However, existing factorization methods lack adaptability in selecting the optimal number of factors and interpretability in capturing biological variation. To address these concerns, we propose Deep Beta Process (DBP), a deep probabilistic framework for adaptive and interpretable factor analysis of single-cell transcriptomic data. DBP achieves adaptive selection of factors through a stick-breaking Beta process and performs batch correction using an adversarial learning strategy. We validate the flexible factor extraction and robust batch correction capabilities of DBP on simulated datasets. We also demonstrate its superior performance in dimensionality reduction and biological interpretability while explaining biological variation from both cell and gene perspectives using factor and loading matrices. The application of DBP to a gastric adenocarcinoma dataset reveals malignant epithelial cell heterogeneity, providing valuable insights for investigating the molecular mechanisms of disease onset and progression. 
DBP is available at https://github.com/labomics/DBP and https://ngdc.cncb.ac.cn/biocode/tool/BT007954.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145643932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
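The DBP abstract attributes adaptive factor selection to a stick-breaking Beta process. As a rough illustration of that idea only, the sketch below uses the standard truncated stick-breaking construction in which factor activation probabilities are cumulative products of Beta(alpha, 1) draws, so later factors become progressively less likely to be active. The function names, the truncation level, and the pruning threshold are hypothetical, and nothing here reflects DBP's actual network or inference procedure.

```python
import random


def stick_breaking_activations(alpha, max_factors, seed=0):
    """Return non-increasing activation probabilities pi_1 >= pi_2 >= ...

    pi_k is the product of the first k draws v_j ~ Beta(alpha, 1).
    """
    rng = random.Random(seed)
    probabilities = []
    remaining = 1.0
    for _ in range(max_factors):
        remaining *= rng.betavariate(alpha, 1.0)
        probabilities.append(remaining)
    return probabilities


def count_active_factors(probabilities, threshold=0.05):
    """Count factors whose activation probability exceeds the threshold."""
    return sum(1 for p in probabilities if p > threshold)


if __name__ == "__main__":
    pi = stick_breaking_activations(alpha=2.0, max_factors=20)
    print(count_active_factors(pi), "factors kept out of", len(pi))
```

Because the probabilities decay multiplicatively, the number of factors that survive the threshold adapts to `alpha` rather than being fixed in advance, which is the property the abstract refers to as adaptive selection.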
Hao Chen, Mengna Li, Zhaoshan Zhong, Inge Seim, Minxiao Wang, Chao Lian, Lianhong Zhuo, Xinjiang Wan, Hao Wang, Guanghui Han, Li Zhou, Huan Zhang, Lei Cao, Chaolun Li
Deep-sea chemosynthetic ecosystems are among the most unusual ecosystems on Earth, where most megafauna form close symbiotic associations with chemosynthetic microbes to obtain nutrition and shelter from the toxic environment. Despite the diverse forms of symbiotic organs in these deep-sea holobionts, the function and development of bacteriocytes, the host cells harboring symbionts, are still largely uncharacterized. Here, we conducted an in situ decolonization assay together with state-of-the-art single-nucleus and spatial transcriptomics to reveal the function and development of deep-sea mussel bacteriocytes. The bacteriocytes appear to optimize immune processes to facilitate recognition, engulfment, and elimination of endosymbionts. They also interact directly with the endosymbionts in carbohydrate and ammonia metabolism by exchanging metabolic intermediates via transporters such as SLC37A2 and RHBG-A. Bacteriocytes arise from three different proliferating cell types, and their successive developmental trajectory was delineated by multi-omics data and 3D reconstruction analyses. The molecular functions and the developmental processes of bacteriocytes were found to be guided by the same set of molluscan-conserved transcription factors and may be influenced by endosymbionts through sterol metabolism. The coordination of the functions and development of bacteriocytes, and between the host and symbionts, highlights the phenotypic plasticity of symbiotic cells and underpins host-symbiont interdependence in adaptation to the deep sea.
{"title":"Function and Development of Deep-sea Mussel Bacteriocytes Revealed by SnRNA-seq and Spatial Transcriptomics.","authors":"Hao Chen, Mengna Li, Zhaoshan Zhong, Inge Seim, Minxiao Wang, Chao Lian, Lianhong Zhuo, Xinjiang Wan, Hao Wang, Guanghui Han, Li Zhou, Huan Zhang, Lei Cao, Chaolun Li","doi":"10.1093/gpbjnl/qzaf109","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf109","url":null,"abstract":"<p><p>The deep-sea chemosynthetic ecosystems are among one of the most unusual ecosystems on Earth, where most megafauna form close symbiotic associations with chemosynthetic microbes to obtain nutrition and shelter from the toxic environment. Despite the diverse forms of symbiotic organs in these deep-sea holobionts, the function and development of bacteriocytes, the host cells harboring symbionts, are still largely uncharacterized. Here, we have conducted the in situ decolonization assay and state-of-the-art single-nucleus and spatial transcriptomics to reveal the function and development of deep-sea mussel bacteriocytes. The bacteriocytes appear to optimize immune processes to facilitate recognition, engulfment, and elimination of endosymbionts. They also interact directly with them in carbohydrate and ammonia metabolism by exchanging metabolic intermediates via transporters such as SLC37A2 and RHBG-A. Bacteriocytes arise from three different proliferating cell types, and their successive development trajectory was delineated by multi-omics data and 3D reconstruction analyses. The molecular functions and the developmental processes of bacteriocytes were found to be guided by the same set of molluscan-conserved transcription factors and may be influenced by endosymbionts through sterol metabolism. 
The coordination in the functions and development of bacteriocytes and between the host and symbionts highlights the phenotypic plasticity of symbiotic cells, and underpins host-symbiont interdependence in adaptation to the deep sea.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145607452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hong Huang, Mengya Gao, Francesca Vinchi, Xiuli An, Wei Li, Yaomei Wang
Emerging evidence indicates that macrophages play important roles in hematopoiesis in addition to their immune functions. Their well-known non-immune functions include roles in hematopoiesis, especially quality control of hematopoietic stem/progenitor cells (HSCs/HSPCs) and support of erythropoiesis and megakaryopoiesis. Several studies, most using mouse models, have explored the roles of macrophages in hematopoiesis in different organs such as the yolk sac (YS), fetal liver (FL), bone marrow (BM), and spleen (SP). We have recently documented the potential roles and underlying mechanisms of macrophages in myeloproliferative neoplasm (MPN), aplastic anemia (AA), and idiopathic thrombocytopenic purpura (ITP). In this article, we review the origin of macrophages, introduce the roles of macrophages in HSCs/HSPCs, erythropoiesis, and megakaryopoiesis in four hematopoietic organs, and summarize recent advances concerning macrophages in MPN, AA, and ITP. Finally, we outline the unresolved questions that future studies should address to explore in greater depth the role of macrophages in both normal and disordered hematopoiesis.
{"title":"Macrophages in Hematopoiesis and Related Blood Diseases.","authors":"Hong Huang, Mengya Gao, Francesca Vinchi, Xiuli An, Wei Li, Yaomei Wang","doi":"10.1093/gpbjnl/qzaf112","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf112","url":null,"abstract":"<p><p>Emerging evidence indicates that macrophages play important roles in hematopoiesis in addition to their immune functions. The well-known immune-unrelated functions of macrophages include their roles in hematopoiesis, especially quality control of hematopoietic stem/progenitor cells (HSCs/HSPCs), supporting erythropoiesis, and megakaryopoiesis. Several studies, most using mouse models, have explored the roles of macrophages in hematopoiesis in different organs such as the yolk sac (YS), fetal liver (FL), bone marrow (BM) and spleen (SP). We have recently documented the potential roles and underlying mechanisms of macrophages in myeloproliferative neoplasm (MPN), aplastic anemia (AA), and idiopathic thrombocytopenic purpura (ITP). In this article, we review origin of macrophages, introduce the roles of macrophages in HSCs/HSPCs, erythropoiesis, and megakaryopoiesis in four hematopoietic organs, summarize the recent advances of macrophages in MPN, AA and ITP. Finally, we outline the unresolved questions that future studies should address to explore in greater depth of macrophages' role in both normal and disordered hematopoiesis.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145607417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yanan Li, Yanbo Yang, Bin Hu, Zi Wang, Wei Wang, Xiaofeng He, Xusheng Wu, Sheng Lin, Narla Mohandas, Hong Liu, Jing Gong, Long Liang, Jing Liu
Erythropoiesis is precisely regulated by multilayered networks. It is crucial for maintaining steady-state hemoglobin levels and ensuring effective oxygen transport. Alternative polyadenylation (APA) is a post-transcriptional regulatory mechanism generating multiple mRNA isoforms from a single gene based on specific 3'-untranslated region sequences. While APA plays a vital role in various cellular processes, the underlying mechanism in erythropoiesis remains largely unexplored. In this study, we employed an integrative approach, combining bioinformatics analyses and experimental validations, to systematically investigate the role of APA in erythropoiesis. We mapped the APA landscape during erythroid differentiation and identified significant APA shifts essential for the differentiation of erythroid cells from burst-forming unit erythroid (BFU-E) to colony-forming unit erythroid (CFU-E). Notably, our findings highlighted polyadenylate-binding protein cytoplasmic 1 (PABPC1) as the primary regulator of APA during these stages. Functional analyses have revealed that knockdown of PABPC1 disrupts erythroid progenitor cell proliferation and differentiation. These results implicate an essential role of PABPC1 in modulating cell fate through APA regulation. Furthermore, we found that decreased PABPC1 levels increased the usage of the proximal polyadenylation sites in the TSC22D1 gene. This shift led to elevated expression of TSC22D1, uncovering a novel mechanism by which APA influences erythroid progenitor expansion and differentiation. Our findings provide novel insights into APA regulation in early erythropoiesis and suggest potential therapeutic strategies for diseases associated with erythropoietic disorders.
{"title":"Regulation of Alternative Polyadenylation Events by PABPC1 Affects Erythroid Progenitor Cell Expansion.","authors":"Yanan Li, Yanbo Yang, Bin Hu, Zi Wang, Wei Wang, Xiaofeng He, Xusheng Wu, Sheng Lin, Narla Mohandas, Hong Liu, Jing Gong, Long Liang, Jing Liu","doi":"10.1093/gpbjnl/qzaf116","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf116","url":null,"abstract":"<p><p>Erythropoiesis is precisely regulated by multilayered networks. It is crucial for maintaining steady-state hemoglobin levels and ensuring effective oxygen transport. Alternative polyadenylation (APA) is a post-transcriptional regulatory mechanism generating multiple mRNA isoforms from a single gene based on specific 3'-untranslated region sequences. While APA plays a vital role in various cellular processes, the underlying mechanism in erythropoiesis remains largely unexplored. In this study, we employed an integrative approach, combining bioinformatics analyses and experimental validations, to systematically investigate the role of APA in erythropoiesis. We mapped the APA landscape during erythroid differentiation and identified significant APA shifts essential for the differentiation of erythroid cells from burst-forming unit erythroid (BFU-E) to colony-forming unit erythroid (CFU-E). Notably, our findings highlighted polyadenylate-binding protein cytoplasmic 1 (PABPC1) as the primary regulator of APA during these stages. Functional analyses have revealed that knockdown of PABPC1 disrupts erythroid progenitor cell proliferation and differentiation. These results implicate an essential role of PABPC1 in modulating cell fate through APA regulation. Furthermore, we found that decreased PABPC1 levels increased the usage of the proximal polyadenylation sites in the TSC22D1 gene. This shift led to elevated expression of TSC22D1, uncovering a novel mechanism by which APA influences erythroid progenitor expansion and differentiation. 
Our findings provide novel insights into APA regulation in early erythropoiesis and suggest potential therapeutic strategies for diseases associated with erythropoietic disorders.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145607755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yijia Jiang, Zhirui Hu, Feng Lu, Allen W Lynch, Junchen Jiang, Alexander Zhu, Ziqi Zeng, Yi Zhang, Gongwei Wu, Yingtian Xie, Rong Li, Ningxuan Zhou, Cliff Meyer, Paloma Cejas, Myles Brown, Henry W Long, Xintao Qiu
Recent advances in single-cell epigenomic techniques have created a growing demand for scATAC-seq analysis. One key analysis task is to determine cell type identity based on the epigenetic data. We introduce scATAnno, a Python package designed to automatically annotate scATAC-seq data using large-scale scATAC-seq reference atlases. This workflow generates the reference atlases from publicly available datasets, enabling accurate cell type annotation by integrating query data with reference atlases without the use of scRNA-seq data. To enhance annotation accuracy, we have incorporated KNN-based and weighted distance-based uncertainty scores to effectively detect cell populations within the query data that are distinct from all cell types in the reference data. We benchmark scATAnno against five other published approaches for cell annotation, demonstrating superior performance across multiple datasets and metrics. We showcase the utility of scATAnno across multiple datasets, including peripheral blood mononuclear cell (PBMC), triple negative breast cancer (TNBC), and basal cell carcinoma (BCC), and demonstrate that scATAnno accurately annotates cell types across conditions. Overall, scATAnno is a useful tool for scATAC-seq reference building and cell type annotation in scATAC-seq data and can aid in the interpretation of new scATAC-seq datasets in complex biological systems. scATAnno is available online at https://scatanno-main.readthedocs.io/.
{"title":"scATAnno: Automated Cell Type Annotation for Single-cell ATAC Sequencing Data.","authors":"Yijia Jiang, Zhirui Hu, Feng Lu, Allen W Lynch, Junchen Jiang, Alexander Zhu, Ziqi Zeng, Yi Zhang, Gongwei Wu, Yingtian Xie, Rong Li, Ningxuan Zhou, Cliff Meyer, Paloma Cejas, Myles Brown, Henry W Long, Xintao Qiu","doi":"10.1093/gpbjnl/qzaf108","DOIUrl":"10.1093/gpbjnl/qzaf108","url":null,"abstract":"<p><p>Recent advances in single-cell epigenomic techniques have created a growing demand for scATAC-seq analysis. One key analysis task is to determine cell type identity based on the epigenetic data. We introduce scATAnno, a python package designed to automatically annotate scATAC-seq data using large-scale scATAC-seq reference atlases. This workflow generates the reference atlases from publicly available datasets enabling accurate cell type annotation by integrating query data with reference atlases, without the use of scRNA-seq data. To enhance annotation accuracy, we have incorporated KNN-based and weighted distance-based uncertainty scores to effectively detect cell populations within the query data that are distinct from all cell types in the reference data. We compare and benchmark scATAnno against 5 other published approaches for cell annotation, demonstrating superior performance in multiple data sets and metrics. We showcase the utility of scATAnno across multiple datasets, including peripheral blood mononuclear cell (PBMC), triple negative breast cancer (TNBC), and basal cell carcinoma (BCC), and demonstrate that scATAnno accurately annotates cell types across conditions. Overall, scATAnno is a useful tool for scATAC-seq reference building and cell type annotation in scATAC-seq data and can aid in the interpretation of new scATAC-seq datasets in complex biological systems. 
scATAnno is available online at https://scatanno-main.readthedocs.io/.</p>","PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145598403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ze-Hao Zhang, Bo-Han Li, Yin-Wei Wang, Sheng Hu Qian, Lu Chen, Meng-Wei Shi, Hao Zuo, Zhen-Xia Chen
The functional significance of long non-coding RNAs (lncRNAs) remains a subject of debate, largely due to the complexity and cost associated with their validation experiments. However, emerging evidence suggests that pseudogenes, once viewed as genomic relics, may contribute to the origin of functional lncRNA genes. In this study spanning eight species, we systematically identified pseudogene-associated lncRNA genes using our PacBio long-read sequencing data and published RNA-seq data. Our investigation revealed that pseudogene-associated lncRNA genes exhibit heightened functional attributes compared to their non-pseudogene-associated counterparts. Notably, these pseudogene-associated lncRNAs show protein-binding proficiency, positioning them as potent regulators of gene expression. In particular, pseudogene-associated sense lncRNAs retain protein-binding capabilities inherited from parent genes of pseudogenes, thereby demonstrating greater protein-binding proficiency. Through detailed functional characterization, we elucidated the unique advantages and conserved roles of pseudogene-associated lncRNA genes, particularly in the context of gene expression regulation and DNA repair. Leveraging cross-species expression profiling, we demonstrated the prominent contribution of pseudogene-associated lncRNA genes to aging-related transcriptome changes across nine human tissues and eight mouse tissues. Overall, our findings demonstrate enhanced functional attributes of pseudogene-associated lncRNA genes and shed light on their conserved and close association with aging.
{"title":"Enhanced Functional Potential of Pseudogene-associated lncRNA Genes in Mammals.","authors":"Ze-Hao Zhang, Bo-Han Li, Yin-Wei Wang, Sheng Hu Qian, Lu Chen, Meng-Wei Shi, Hao Zuo, Zhen-Xia Chen","doi":"10.1093/gpbjnl/qzaf113","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf113","url":null,"PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145598362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depicting gene expression in a spatial context through spatial transcriptomics is beneficial for inferring cellular mechanisms. Identifying spatially variable genes is a crucial step in leveraging spatial transcriptome data to understand intricate spatial dynamics. In this study, we developed Spanve, a nonparametric statistical method that detects spatially variable genes in large-scale spatial transcriptomics datasets by quantifying expression differences between each spot or cell and its local neighbors. Because the method is nonparametric, it identifies spatial dependencies in gene expression without distributional assumptions. Compared with existing methods, Spanve yields fewer false positives, leading to more accurate identification of spatially variable genes. Furthermore, Spanve improves the performance of downstream spatial transcriptomics analyses, including spatial domain detection and cell type deconvolution. These results show the broad application potential of Spanve in advancing our understanding of spatial gene expression patterns within complex tissue microenvironments. Spanve is publicly available at https://github.com/zjupgx/Spanve and https://ngdc.cncb.ac.cn/biocode/tool/BT7724.
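The idea of scoring a gene by the expression differences between each spot and its local neighbors can be illustrated with a small sketch. This is not Spanve's implementation: it swaps in a plain permutation test on the mean absolute neighbor difference, and the function names (`knn_indices`, `neighbor_diff`, `permutation_pvalue`), the choice of `k`, and the toy grid are all assumptions made for illustration only.

```python
import numpy as np

def knn_indices(coords, k):
    """Indices of the k nearest neighbours of every spot (brute force)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a spot is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

def neighbor_diff(expr, nbrs):
    """Mean absolute expression difference between spots and their neighbours."""
    return np.abs(expr[:, None] - expr[nbrs]).mean()

def permutation_pvalue(expr, coords, k=4, n_perm=200, seed=0):
    """Permutation p-value: is the neighbour difference smaller than chance?

    Shuffling expression across spots destroys any spatial pattern, so the
    shuffled statistics form a null distribution with no distributional
    assumption on the expression values (hypothetical stand-in statistic).
    """
    rng = np.random.default_rng(seed)
    nbrs = knn_indices(coords, k)
    observed = neighbor_diff(expr, nbrs)
    null = np.array(
        [neighbor_diff(rng.permutation(expr), nbrs) for _ in range(n_perm)]
    )
    return (1 + np.sum(null <= observed)) / (1 + n_perm)

# Toy data: a 10 x 10 grid of spots, one gene with a left-to-right gradient
# and one gene of pure noise.
xs, ys = np.meshgrid(np.arange(10), np.arange(10))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
gradient_gene = coords[:, 0].copy()
noise_gene = np.random.default_rng(1).normal(size=len(coords))

print("gradient gene p-value:", permutation_pvalue(gradient_gene, coords))
print("noise gene p-value:", permutation_pvalue(noise_gene, coords))
```

The brute-force distance matrix keeps the sketch short but scales quadratically with the number of spots; a large-scale dataset would need a spatial index or a precomputed neighbor graph instead.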
{"title":"Spanve: A Statistical Method for Downstream-friendly Spatially Variable Genes in Large-scale Data.","authors":"Guoxin Cai, Yichang Chen, Shuqing Chen, Xun Gu, Zhan Zhou","doi":"10.1093/gpbjnl/qzaf111","DOIUrl":"https://doi.org/10.1093/gpbjnl/qzaf111","url":null,"PeriodicalId":94020,"journal":{"name":"Genomics, proteomics & bioinformatics","volume":" ","pages":""},"PeriodicalIF":7.9,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145598375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}