Ziting Feng, Xuyan Liu, Yahui Liu, Kailing Tu, Lin Xia, Dan Xie
Somatic structural variations (somatic SVs) are hallmarks of tumors, but their comprehensive detection remains technically challenging. Long-read sequencing (LRS) technology, which generates reads spanning large-scale SVs and their flanking sequences, opens up broad prospects for somatic SV detection. However, existing LRS-based somatic SV detection algorithms and pipelines exhibit variable performance that has not been systematically characterized. In this study, we conducted a rigorous evaluation of 51 LRS-based somatic SV detection strategies, integrating 3 reference genomes, 2 aligners, 5 SV callers, and 5 processing methods tailored for SV callers. We used both simulated datasets and empirical data from HCC1395/HCC1395BL cell lines sequenced on Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) platforms for technical assessment. Our findings highlight the need for further refinement of specialized somatic SV detection tools, as no single strategy consistently outperforms the others across all scenarios. Workflows based on germline SV callers exhibit a high false-positive rate, which cannot be mitigated by increasing sequencing depth or tumor purity. Furthermore, challenges persist in detecting insertions, SVs in genomic tandem repeat regions, and ultra-long SVs. We delineate technical bottlenecks in current somatic SV detection approaches and provide recommendations for their further advancement. Additionally, we offer suggestions for selecting specific tools in different application scenarios. This work offers a comprehensive benchmark for somatic SV detection and valuable insights for future LRS-based tool development and methodological improvements.
Benchmark and Evaluation for Somatic Structural Variants Detection with Long-read Sequencing Data. Genomics, Proteomics & Bioinformatics, published 2025-12-31. https://doi.org/10.1093/gpbjnl/qzaf139
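The false-positive and detection-rate comparisons described in the benchmark entry above rest on matching each called SV against a truth set. The following is a minimal sketch of such scoring. The matching rule (same chromosome and SV type, breakpoints within 500 bp, size ratio of at least 0.7) and every name in the code are illustrative assumptions, not the criteria or code used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int      # start breakpoint
    svtype: str   # e.g. "DEL" or "INS"
    length: int   # SV size in bp


def is_match(call, truth, max_dist=500, min_size_ratio=0.7):
    """Assumed matching rule: same chromosome and type, close breakpoints, similar size."""
    if call.chrom != truth.chrom or call.svtype != truth.svtype:
        return False
    if abs(call.pos - truth.pos) > max_dist:
        return False
    small, large = sorted((call.length, truth.length))
    return large > 0 and small / large >= min_size_ratio


def score(calls, truths):
    """Greedily match calls to truth SVs one-to-one and report precision, recall, and F1."""
    unmatched = list(truths)
    tp = 0
    for call in calls:
        hit = next((t for t in unmatched if is_match(call, t)), None)
        if hit is not None:
            unmatched.remove(hit)  # each truth SV can be claimed only once
            tp += 1
    fp = len(calls) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Calling `score(calls, truths)` once per strategy and per SV type would give the kind of per-scenario comparison the abstract describes. The greedy one-to-one matching is a simplification chosen to keep the sketch short.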
Dan Xie, Longfei Ma, Jing Sun, Hengyu Nie, Lin Yan, Yalin Xue, Jian Chen, Shuguang Duo, Chunsheng Han
Cohesin plays critical roles in chromatin organization and transcription regulation. REC8 is a meiosis-specific cohesin subunit and is essential for homologous chromosome synapsis, recombination, and segregation. However, little is known about the relationship between the dynamic genome-wide distribution of cohesin and transcription regulation during meiotic initiation. In this study, we report that REC8-cohesin is preferentially localized to open promoter regions of genes involved in spermatogonial differentiation and meiosis during early meiosis, from preleptonema to zygonema. The genomic localization of REC8-cohesin is altered by knockout of the transcriptional suppressor BEND2. We also found that REC8 is able to interact with the mitotic cyclin CCNA2, that CCNA2 expression is extended to leptonema in Bend2 knockout mice, and that the meiotic cells of Bend2 knockout mice do not exit the mitotic cell cycle completely. We further found that a large number of genes are commonly bound by BEND2, STRA8, MEIOSIN, and REC8-cohesin. Our study has therefore revealed that genes with open promoters are bound coordinately by meiotic cohesin and transcription factors to facilitate chromatin reorganization and transcription regulation, leading to the switch from a mitotic cell cycle to a meiotic one at the initiation stage of meiosis.
REC8-Cohesin Preferentially Localizes to Promoters of Genes that are Regulated by Transcription Suppressor BEND2 During Early Meiosis. Genomics, Proteomics & Bioinformatics, published 2025-12-31. https://doi.org/10.1093/gpbjnl/qzaf138
Hui Jiang, Bin Tang, Kun Li, Liubin Zhang, Junhao Liang, Clara Sze-Man Tang, Paul Kwong-Hang Tam, Binbin Wang, Youqiang Song, Qiang Wang, Mulin Jun Li, Hailiang Huang, Miaoxin Li
Genetic interactions play a crucial role in elucidating the susceptibility and etiology of complex multifactorial diseases. Despite significant efforts to identify disease-associated nonlinear effects in genome-wide association studies, efficient methods for detecting the epistatic impact of rare variants remain lacking. In this study, we proposed iRUNNER, a novel and powerful mutation burden test focused on analyzing the interaction effects of rare variants on a binary trait. Different from conventional association tests comparing cases with controls, iRUNNER evaluates the relative enrichment of rare variant interaction burden of pairwise genes in patients against its baseline, estimated by a recursive truncated negative-binomial regression model that leverages multiple genomic features from public databases. Extensive simulations demonstrated that iRUNNER outperforms existing epistasis tests in statistical power and maintains reasonable type I error rates even when population stratification exists in control samples. Applied to real datasets of five complex diseases, iRUNNER yielded substantial gains in gene-gene interaction detections. Notably, the majority of these signals were missed by alternative methods, especially in small to medium-sized samples. Furthermore, we found that these identified gene pairs of each trait can form interconnected networks, which may provide valuable insights into the underlying molecular mechanisms. We have implemented iRUNNER as a module in our integrative platform KGGSeq (http://pmglab.top/kggseq/) that enables rapid testing of pairwise interactions among all possible non-synonymous rare coding variants within hours.
iRUNNER: A Baseline Mutation Burden Regression for Identifying Gene Interaction Between Rare Variants for Diseases. Genomics, Proteomics & Bioinformatics, published 2025-12-30. https://doi.org/10.1093/gpbjnl/qzaf135
Li Hu, Jiaxun Zhang, Zhuoyuan Zhang, Ranlei Wei, Jie Fu, Xiaoxue Tang, Xuyan Liu, Lanfang Yuan, Ziting Feng, Sibo Wu, Lin Xia, Dan Xie
Previous genomic studies have predominantly analyzed oral squamous cell carcinoma (OSCC) in conjunction with other head and neck squamous cell carcinomas (HNSCC), constraining our comprehension of OSCC-specific structural variants (SVs). Here, we performed long-read whole-genome sequencing on 16 paired OSCC tumor and blood samples to elucidate the biological functions of somatic SVs. We identified a total of 5775 high-confidence somatic SVs, including five recurrent simple repeat expansions (SREs). Notably, one SRE located within the promoter region of the OBI1 gene is present in 45% of OSCC samples. Knocking out this SRE in the HSC4 cell line significantly reduces the expression of OBI1, resulting in decreased proliferative and migratory capacities compared to wild-type cells. Furthermore, we found that the frequently amplified region 11q13 in HNSCC is prone to large-scale somatic SVs, affecting the expression of ANO1, FADD, and CTTN, thereby confirming the association of SVs in this region with OSCC development. Our study provides novel insights into the role of somatic SVs in OSCC, especially with respect to SREs and large-scale SVs in critical genomic regions, thereby enhancing our comprehension of the molecular pathogenesis of OSCC.
Long-read Sequencing Reveals Repeat Expansions and Large Structural Variants in Oral Squamous Cell Carcinoma. Genomics, Proteomics & Bioinformatics, published 2025-12-27. https://doi.org/10.1093/gpbjnl/qzaf133
Yang Mi, Die Dai, Xia Xue, Haiming Qin, Feifei Ren, Barry J Marshall, Alfred Tay, Ihtisham Bukhari, Xiaojie Li, Shaogong Zhu, Yong Yu, Wanqing Wu, Yan Tan, Youcai Tang, Xin Xie, Haiqing Bai, Xiaochen Yin, Pengyuan Zheng
The occurrence and progression of gastric cancer (GC) are closely associated with dysbiosis of the gastric microbiota and alteration in host microenvironments. However, the interaction between intratumoral bacteria and gastric microenvironments remains incompletely understood. In this study, we characterized the biological profiles of intratumoral bacteria, metabolome, and proteome in 20 GC tumors and paired non-tumor tissues, in combination with six independent datasets (comprising 497 gastric tissue biopsies and 554 normal tissues), as well as mucosal tissues from 10 individuals without GC. We found that the diversity and richness of gastric microbiota were significantly higher in tumor tissues than in non-tumor tissues. In contrast, the lowest biodiversity, at both the genus and species levels, was found in the microbiota of individuals without GC. Specifically, tumors were enriched with Bacteroides thetaiotaomicron, Lactobacillus parabrevis, Brevundimonas nasdae, and Brevundimonas vesicularis. We also identified 39 human immunity-related proteins, particularly in the tryptophan metabolic pathway, which were differentially expressed across various microenvironments (tumor and non-tumor). Furthermore, we found that several pathways involved in the human immune system and associated with the gastric microbiota, such as thiazole biosynthesis II, pyrimidine deoxyribonucleoside salvage, superpathway of pyrimidine deoxyribonucleoside salvage, and superpathway of heme biosynthesis from uroporphyrinogen-III, hold potential as biomarkers for early detection of GC. Our results provide a comprehensive framework for investigating the complex interactions between the tumor immune microenvironment and intratumoral bacterial community.
Multiomics Analysis Reveals How Intratumoral Bacteria Shape the Immune Microenvironment in Gastric Cancer. Genomics, Proteomics & Bioinformatics, published 2025-12-27. https://doi.org/10.1093/gpbjnl/qzaf132
Yi Liao, Chong Zhang, Zhikang Wang, Fei Qi, Weitian Huang, Shangyan Cai, Junyu Li, Jiazhou Chen, Robin B Gasser, Zhiyuan Yuan, Jiangning Song, Hongmin Cai
Spatial omics technologies have revolutionized life sciences by enabling the simultaneous acquisition of biomolecular and spatial information. Identifying spatial patterns is crucial for understanding organ development and tumor microenvironments. However, the emergence of diverse spatial omics resolutions in these technologies has made it challenging to accurately characterize spatial domains at finer resolutions. To address this, we propose HyperSTAR, a hypergraph-based method designed to precisely identify spatial domains across varying resolutions by leveraging higher-order relationships among spatially adjacent tissue programs. Specifically, a gene expression-guided hyperedge decomposition module is introduced to refine the hypergraph structure and accurately delineate spatial domain boundaries. Additionally, a hypergraph attention convolutional neural network is designed to adaptively learn the importance of each hyperedge, enhancing the model's ability to capture complex higher-order relationships within spatially neighboring multi-spots and/or single cells. HyperSTAR outperforms existing graph neural network models in tasks such as uncovering tissue substructures, inferring spatiotemporal patterns, and denoising spatially resolved gene expression. It effectively handles diverse spatial omics data types and scales seamlessly to large datasets. The method successfully reveals spatial heterogeneity in breast cancer sections, with findings validated through functional and survival analyses of independent clinical data. HyperSTAR represents a significant advancement in spatial omics analysis, providing a robust tool for exploring complex spatial patterns across varying resolutions and data types. Its ability to capture intricate higher-order relationships among spatially neighboring spots/cells makes it valuable for advancing research in life sciences, particularly in cancer and developmental biology. 
The toolbox is available at https://github.com/Ringoio/HyperSTAR.
Unveiling Tissue Structure and Tumor Microenvironment from Spatial Omics by Hypergraph Learning. Genomics, Proteomics & Bioinformatics, published 2025-12-26. https://doi.org/10.1093/gpbjnl/qzaf128
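The "higher-order relationships" in the HyperSTAR entry above refer to hyperedges, each of which groups several spots at once rather than linking just two. The sketch below shows one generic round of hypergraph message passing (spots to hyperedges, then hyperedges back to spots) on a toy example. It is not the HyperSTAR implementation: the fixed `edge_weights` are an assumed stand-in for learned attention scores, and all names are hypothetical.

```python
def propagate(features, hyperedges, edge_weights=None):
    """One round of spot -> hyperedge -> spot averaging.

    features:     list of per-spot feature vectors (lists of floats)
    hyperedges:   list of lists of spot indices; each inner list is one hyperedge
    edge_weights: optional per-hyperedge importance (assumed stand-in for attention)
    """
    if edge_weights is None:
        edge_weights = [1.0] * len(hyperedges)
    dim = len(features[0])

    # Step 1: each hyperedge summarizes its member spots by their mean.
    edge_feats = [
        [sum(features[i][d] for i in members) / len(members) for d in range(dim)]
        for members in hyperedges
    ]

    # Step 2: each spot takes the weighted mean of the hyperedges that contain it.
    out = []
    for node in range(len(features)):
        incident = [e for e, members in enumerate(hyperedges) if node in members]
        total = sum(edge_weights[e] for e in incident)
        if total == 0:
            out.append(list(features[node]))  # isolated spot keeps its own features
            continue
        out.append([
            sum(edge_weights[e] * edge_feats[e][d] for e in incident) / total
            for d in range(dim)
        ])
    return out
```

Raising the weight of one hyperedge pulls its member spots toward that hyperedge's summary, which is the intuition behind learning per-hyperedge importance.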
Proteases can cleave peptide bonds of target substrate proteins. Their controlled proteolysis is vital for protein degradation, recycling, and physiological processes. Understanding the hydrolytic mechanisms of proteases is crucial, particularly for identifying their specific substrates and cleavage sites. Bioinformatics approaches can predict novel protease-substrate cleavage events with high accuracy using sequence and structural information. However, existing tools for cleavage site prediction face several limitations, including restricted accuracy due to limited data and cumbersome training processes that impede timely updates. To address these challenges, we developed MPCutter, which was created by fine-tuning a general-purpose protein sequence language model. This method combined the extensive knowledge of the general model with the targeted optimization of fine-tuning, providing a powerful tool for protease-substrate cleavage prediction. MPCutter offers optimized cleavage site prediction models with enhanced performance and broader coverage across proteases, encompassing four major protease families including 62 distinct proteases. Benchmarking experiments using independent test datasets demonstrated that MPCutter outperformed existing generic tools. In our case study and experiments, MPCutter precisely recognized the majority of cleavage sites and validated five caspase-3 cleavage sites crucial for cellular physiology. Notably, its application to the 10,260-protein human proteome and specific cancer pathways revealed potential new target substrates and provided insights into key biochemical behaviors of proteases. MPCutter is expected to serve as a powerful tool for high-throughput prediction of protease-specific substrates and to facilitate hypothesis-driven exploration of protease proteolytic events. The MPCutter code and associated data are freely available at https://github.com/2053798680wang/MPCutter.git.
Zhe Wang, Tuoyu Liu, Guoshun Xu, Han Gao, Ruohan Zhang, Honglian Zhang, Guijie Zhang, Ningfeng Wu, Bin Yao, Huiying Luo, Feifei Guan, Jian Tian. MPCutter: Predicting Protease-specific Substrate Cleavage Sites Using a Protein Language Model. Genomics, Proteomics & Bioinformatics, published 2025-12-23. https://doi.org/10.1093/gpbjnl/qzaf130
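Cleavage site prediction, as described in the MPCutter entry above, is commonly framed as scoring every peptide bond in a substrate using the residues that flank it. The sketch below shows only that framing: it enumerates P4 to P4' windows and scores them with a toy rule. The window size, the simplified DxxD rule (aspartate at P4 and P1, loosely modeled on the canonical caspase-3 motif DEVD), and all function names are assumptions for illustration. The toy scorer is a placeholder for a fine-tuned language model and is not MPCutter's method or API.

```python
FLANK = 4  # residues kept on each side of the scissile bond (assumed window size)


def candidate_windows(sequence, pad="-"):
    """Yield (p1_position, window) for every peptide bond in the sequence.

    p1_position is the 1-based index of the P1 residue (cleavage occurs after it).
    window holds P4..P1 followed by P1'..P4', padded at the protein termini.
    """
    padded = pad * FLANK + sequence + pad * FLANK
    for p1 in range(1, len(sequence)):
        # In padded coordinates, the P4 residue of bond p1 sits at index p1.
        yield p1, padded[p1:p1 + 2 * FLANK]


def toy_score(window):
    """Toy stand-in for a learned model: 1.0 if Asp occupies both P4 and P1."""
    p4, p1 = window[0], window[FLANK - 1]
    return 1.0 if p4 == "D" and p1 == "D" else 0.0


def predict_sites(sequence, threshold=0.5, scorer=toy_score):
    """Return the P1 positions whose window score reaches the threshold."""
    return [p1 for p1, window in candidate_windows(sequence)
            if scorer(window) >= threshold]
```

Passing a different `scorer` callable is where a trained model would plug in, while the window enumeration stays the same.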
Haoxin Wang, Xiaoting Xia, Lulan Zeng, Jing Yang, Jing Han, David E MacHugh, Johannes A Lenstra, Yanliang Song, Ajiao Fan, Yifan Zhu, Zhenliang Zhu, Xinyan Zhang, Yingyu Chen, Jianlin Han, Chuzhao Lei, Ningbo Chen, Yong Zhang, Yuanpeng Gao
Indicine cattle exhibit superior resistance to Mycobacterium bovis infection compared to taurine breeds, revealing divergent genetic mechanisms underlying bovine tuberculosis (bTB) resilience. Previous research has demonstrated that Cytochrome b-245 (CYBB) gene variants are associated with Mendelian susceptibility to Mycobacterium tuberculosis complex (MTBC) infections. In this study, we analyzed the X-chromosomal sequences from 258 female cattle and identified a divergent missense variant (L237M) in the CYBB gene. This variant occurs at high frequencies in indicine populations. Functional studies using murine macrophages revealed that CYBB L237M mitigates M. tuberculosis-induced ferroptosis by elevating glutathione synthesis and glutathione peroxidase 4 expression. Mechanistically, the L237M substitution enhances the stability of the nicotinamide adenine dinucleotide phosphate (NADPH) oxidase 2 (NOX2) and p22phox complex (NOX2-p22), which is critical for the generation of phagosomal reactive oxygen species and bacterial clearance. Our findings demonstrate that CYBB L237M promotes intracellular MTBC elimination through ferroptosis suppression, partially explaining the superior bTB resistance of indicine cattle. This study highlights X-chromosomal genetic variation as an evolutionary driver of innate immunity against mycobacterial infections, with implications for breeding strategies and host-directed tuberculosis therapies. The CYBB variant exemplifies how cattle subspecies divergence can illuminate conserved antimicrobial defense mechanisms in mammals.
The Indicine X-linked CYBBL237M Mutation Can Suppress Intracellular Infection with Tubercle Bacilli. Genomics, Proteomics & Bioinformatics. Published 2025-12-23. DOI: 10.1093/gpbjnl/qzaf131. https://doi.org/10.1093/gpbjnl/qzaf131
Chenfei Wang, Jiaojiao Zhou, Hong Zhang, Zihan Zhuang, Gali Bai, Ming Tang, Song Liu, Tao Liu
Single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) has emerged as a powerful technique to study cell-specific epigenetic landscapes and to provide a multidimensional portrait of gene regulation. However, low genomic coverage per cell results in intrinsic data sparsity and missing-data issues, presenting unique methodological challenges. Consequently, numerous computational methods have been developed to address these challenges. This review provides a concise overview of published workflows for scATAC-seq analysis, from preprocessing through downstream analysis, including quality control, alignment, peak calling, dimensionality reduction, clustering, gene regulation score calculation, cell type annotation, and multiomics integration. Additionally, we survey key scATAC-seq databases that offer curated, accessible resources; discuss emerging deep-learning methods and artificial intelligence (AI) foundation models tailored to scATAC-seq data; and highlight recent advances in spatial ATAC-seq technologies and associated computational approaches. Our objective is to equip readers with a clear understanding of current scATAC-seq methodologies so they can select appropriate tools and construct customized workflows for exploring gene regulation and cellular diversity.
Computational Analyses and Challenges of Single-cell ATAC-seq. Genomics, Proteomics & Bioinformatics. Published 2025-12-22. DOI: 10.1093/gpbjnl/qzaf115. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12753137/pdf/
Zerui Wang 王则锐, Xin Cheng 程欣, Yibin Xu 徐义斌, Zhiyi Wang 汪之一, Liyan Ma 马立艳, Caiming Li 李埰明, Shize Jiang 姜世泽, Yuchen Li 黎雨尘, Shuilong Guo 郭水龙, Wenbin Du 杜文斌
Neonatal pneumonia is a leading cause of infant mortality worldwide; however, a lack of microbial profiling, especially of low-abundance species, makes accurate diagnosis challenging. Traditional methods can fail to capture the complexity of the neonatal respiratory microbiota, thereby obscuring its role in disease progression. Here, we describe a novel approach that combines high-throughput sequencing with droplet-based microfluidic cultivation to investigate microbiome shifts in neonates with pneumonia. Using 16S ribosomal RNA (rRNA) gene sequencing of 71 pneumonia cases and 49 controls, we identified 1009 genera, including 930 low-abundance taxa, which showed significant compositional differences between groups. Linear discriminant analysis effect size identified key pneumonia-associated genera, such as Streptococcus, Rothia, and Corynebacterium. Droplet-based cultivation recovered 299 strains from 94 taxa, including rare species and ESKAPE pathogens, thereby supporting targeted antimicrobial management. Host-pathogen interaction assays showed that Rothia and Corynebacterium induced inflammation in lung epithelial cells, likely via dysregulation of the PI3K-Akt pathway. Integrating these marker taxa with clinical factors, such as gestational age and delivery type, offers the potential for precise diagnosis and treatment. The recovery of diverse species can support the construction of a biobank of neonatal respiratory microbiota to advance mechanistic studies and therapeutic strategies.
Unveiling Neonatal Pneumonia Microbiome by High-throughput Sequencing and Droplet Culturomics. Genomics, Proteomics & Bioinformatics. Published 2025-12-22. DOI: 10.1093/gpbjnl/qzaf047. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12721867/pdf/