NAR Genomics and Bioinformatics: Latest Articles

Development of a vaccine construct against Pneumocystis jirovecii pneumonia using computational tools.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf199

Ragini Mishra, Nahid Akhtar, Jorge Samuel Leon Magdeleno, Abdul Rajjak Shaikh, Manik Prabhu Narsing Rao, Neeta Raj Sharma, Luigi Cavallo, Mohit Chawla

Pneumocystis jirovecii poses a significant threat to immunocompromised individuals, necessitating the development of an effective vaccine. This study employs an immunoinformatics approach to design a promising vaccine candidate against P. jirovecii. Using various computational tools, the study identified potential antigenic epitopes within the P. jirovecii major surface glycoprotein C that are capable of eliciting immune responses. The chosen epitopes were evaluated computationally for their allergenicity, interferon-γ and interleukin activation ability, and toxicity, ensuring the selection of immunogenic and safe candidates. These analyses led to the selection of 10 epitopes, which were then linked with adjuvants to model a potential vaccine candidate. Molecular docking and molecular dynamics simulations were performed in a solvent environment to investigate the binding interactions between the vaccine candidate and toll-like receptors, along with calculations of thermodynamic properties. Finally, in silico immune simulations were performed to analyze the immunogenic potential of the vaccine candidate. Future prospects include in vitro and in vivo validation of the vaccine candidate and the exploration of novel adjuvants to enhance its immunogenicity. This study contributes to the ongoing efforts to develop a preventive solution against P. jirovecii infections, addressing a critical gap in the protection of immunocompromised individuals.

Volume 7, Issue 4, lqaf199. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754782/pdf/

Citations: 0

Improving accuracy in genome-wide association studies: a two-step approach for handling below limit of detection biomarker measurements.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf201

Yaqi A Deng, Torgny Karlsson, Åsa Johansson

Advances in high-throughput technologies enable large-scale studies on genomics and molecular phenotypes. However, the trade-off between quality and quantity reduces assay sensitivity, and several measurements in large-scale proteomics and metabolomics analytes fall below the limit of detection (LOD). If not properly addressed, this may introduce bias in effect estimates. To address this, we conducted a simulation study to evaluate the performance of linear, Tobit, Cox, and logistic modeling in the presence of below-LOD measurements in genome-wide association studies. We identified the optimal strategy as a two-step Linear-Tobit scheme: rapid screening with linear regression, followed by refinement with Tobit regression to retrieve accurate effect estimates. This higher accuracy helps mitigate the 1.3-fold and 2.7-fold inflation in causal estimates in a Mendelian randomization (MR) study that would otherwise be present when 50% and 90% of values, respectively, fall below the LOD. Validation through case studies on estradiol and testosterone levels in the UK Biobank confirmed the simulation results across subgroups with varying proportions of below-LOD measurements. The Linear-Tobit scheme offers optimal detection power and efficiency, with a focus on its applicability to biobank-scale datasets and accuracy in effect estimates to mitigate bias in downstream applications such as MR and polygenic risk scores.
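
The two-step idea in this abstract can be illustrated with a minimal sketch: each variant is screened with ordinary linear regression (values below the LOD set to the LOD), and only the screened hits are refitted with a left-censored Tobit likelihood. Everything below is an assumption made for illustration and is not the authors' implementation: the function names, the z-score screening threshold of 5, the substitution of the LOD for censored values in step one, the toy simulator, and the simple coordinate search used in place of a proper optimizer.

```python
import math
import random

def ols(x, y):
    """Simple linear regression; returns (intercept, slope, z-score of the slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)  # assumes the variant is not monomorphic
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, slope / se

def tobit_loglik(params, x, y, lod):
    """Log-likelihood of a left-censored normal model; y <= lod marks a censored value."""
    b0, b1, log_sigma = params
    sigma = math.exp(log_sigma)
    ll = 0.0
    for xi, yi in zip(x, y):
        mu = b0 + b1 * xi
        if yi <= lod:  # censored: probability that the latent value is below the LOD
            cdf = 0.5 * math.erfc(-(lod - mu) / (sigma * math.sqrt(2.0)))
            ll += math.log(max(cdf, 1e-300))
        else:  # observed: normal log-density
            z = (yi - mu) / sigma
            ll += -0.5 * z * z - log_sigma - 0.5 * math.log(2.0 * math.pi)
    return ll

def fit_tobit(x, y, lod, tol=1e-4):
    """Maximize the Tobit likelihood by coordinate search, starting from the OLS fit."""
    b0, b1, _ = ols(x, y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    params = [b0, b1, 0.5 * math.log(rss / len(x))]
    best, step = tobit_loglik(params, x, y, lod), 0.5
    while step > tol:
        improved = False
        for i in range(3):
            for delta in (step, -step):
                trial = params[:]
                trial[i] += delta
                ll = tobit_loglik(trial, x, y, lod)
                if ll > best:
                    params, best, improved = trial, ll, True
        if not improved:
            step /= 2  # shrink the step once no coordinate move helps
    return params  # [intercept, slope, log(sigma)]

def two_step(genotypes, y, lod, z_threshold=5.0):
    """genotypes: dict of variant id -> allele counts; y: phenotype with below-LOD values set to lod."""
    results = {}
    for variant, x in genotypes.items():
        _, linear_slope, z = ols(x, y)  # step 1: fast linear screen
        if abs(z) >= z_threshold:
            _, tobit_slope, _ = fit_tobit(x, y, lod)  # step 2: Tobit refinement
            results[variant] = (linear_slope, tobit_slope)
    return results

def simulate(n, beta, maf, lod, seed=0):
    """Toy data: additive genotype effect, unit-variance noise, values below lod set to lod."""
    rng = random.Random(seed)
    x = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
    y = [max(beta * xi + rng.gauss(0.0, 1.0), lod) for xi in x]
    return x, y
```

Calling `two_step({"rs_example": x}, y, lod)` on data from `simulate` returns the linear and Tobit slopes for variants that pass the screen; with heavy censoring the linear slope is typically attenuated relative to the Tobit slope.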

Volume 7, Issue 4, lqaf201. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754788/pdf/

Citations: 0

A supervised Bayesian method for time (re)annotation of transcriptomics data.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf203

Elio Nushi, François P Douillard, Katja Selby, Benjamin A Blount, Oliver J Pennington, Nigel P Minton, Miia Lindström, Antti Honkela

Transcriptomics experiments are often conducted to capture changes in gene expression over time. However, time annotations may be missing, imprecise, or may not reflect the same physiological state of the bacterial culture between different experiments. Assigning accurate time points to these experiments using a reference time course is therefore crucial for identifying differentially expressed genes and understanding gene regulatory networks for elucidating the studied organism's physiology and life cycle. This important task, which could enhance the biological interpretation of the transcriptomics experiments, has not been previously addressed. In this work, we propose a novel method to solve the challenge of realigning transcriptomics experiments based on a reference time course. Our method is based on a Bayesian approach that uses Gaussian process regression modeling. We show a use case of applying our method for assigning time annotations in legacy microarray samples of the bacterium Clostridium botulinum, which were solely annotated based on the growth phase at the time when the culture aliquots were sampled, utilizing recently collected RNA-Seq time series data comprising multiple replicates as a reference. The method significantly improved the description of the growth phases of the microarray data compared to the original annotations by clearly delineating the microarray samples belonging to different growth phases, as demonstrated by principal component analysis. Consequently, a larger number of differentially expressed genes was detected when comparing experiments belonging to successive growth phases. We compare this innovative approach with a baseline method that uses a k-nearest neighbor algorithm and show that our method offers a higher resolution in the description of the data by exposing smaller time changes between samples. We also test the performance of the method on sparse RNA-Seq time series (i.e. sampled every second hour). All the predictions for the samples were within a 30-min margin of their true time.
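
To make the Gaussian process idea concrete, here is a minimal sketch of how a sample could be assigned a time from a reference time course: one zero-mean GP per gene is fitted to the reference, and the posterior over candidate times (flat prior on a grid) is obtained from the GP predictive densities of the sample's expression values. This is an illustration under stated assumptions, not the authors' model: the RBF kernel, the fixed hyperparameters (`length`, `signal_var`, `noise_var`), the independence across genes, the assumption that expression is centered per gene, and all function names are hypothetical.

```python
import math

def rbf(t1, t2, length, signal_var):
    """Squared-exponential kernel between two time points."""
    return signal_var * math.exp(-0.5 * ((t1 - t2) / length) ** 2)

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def time_posterior(ref_times, ref_expr, sample, grid,
                   length=2.0, signal_var=1.0, noise_var=0.01):
    """ref_expr[g][j]: gene g at ref_times[j]; sample[g]: new sample.
    Returns the normalized posterior over the candidate times in grid."""
    m = len(ref_times)
    k_mat = [[rbf(ref_times[i], ref_times[j], length, signal_var)
              + (noise_var if i == j else 0.0) for j in range(m)] for i in range(m)]
    k_inv = invert(k_mat)  # shared by all genes (same times, same hyperparameters)
    alphas = [[sum(k_inv[i][j] * y[j] for j in range(m)) for i in range(m)]
              for y in ref_expr]
    log_post = []
    for t in grid:
        k_star = [rbf(t, tj, length, signal_var) for tj in ref_times]
        v = [sum(k_inv[i][j] * k_star[j] for j in range(m)) for i in range(m)]
        var = signal_var - sum(ks * vi for ks, vi in zip(k_star, v)) + noise_var
        ll = 0.0
        for alpha, e in zip(alphas, sample):
            mean = sum(ks * a for ks, a in zip(k_star, alpha))  # GP predictive mean
            ll += -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (e - mean) ** 2 / var
        log_post.append(ll)
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]  # subtract max for stability
    total = sum(weights)
    return [w / total for w in weights]
```

The most probable time is the grid point with the largest posterior weight, and the spread of the weights gives a rough sense of how confidently the sample is placed on the reference time course.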

Volume 7, Issue 4, lqaf203. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754789/pdf/

Citations: 0

DEDUCE: statistical inference on disease-associated genes uncovers tissue-disease associations.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf205

Boqi Wang, Jiayi Wang, Ammar Aleem Rashied, Bo Meng, Jesse Zhang, Jun S Liu, Jie Jiang, Zhaohui S Qin

Accurate identification of affected tissues of human diseases is important for the derivation of disease etiology and the development of new treatment strategies. In this study, we develop a logistic regression-based method named DEDUCE (disease tissue detection using logistic regression) that combines genomics big data and machine learning to address this important problem. The central hypothesis is that most disease-associated genes are expressed specifically in affected tissues. DEDUCE takes advantage of newly emerged data on disease-related genes as well as tissue-specific gene expression data. The unique feature of DEDUCE is that it takes into account the strength of gene-disease associations. When we applied DEDUCE to a total of 3,261,324 gene-disease associations collected from DisGeNET covering 30,170 diseases and 21,666 genes, we identified 216 significant tissue-disease pairs composed of 120 unique diseases and 37 unique tissues. Many of them shed light on potential explanations for disease pathogenesis. The results showed great consistency with previous findings and were proven effective by empirical plots and gene set enrichment analysis. Overall, DEDUCE has shown great potential in uncovering novel pathogenesis mechanisms of complex diseases. In-depth analysis and experimental validation are required to fully understand these discovered tissue-trait associations and their enriched genes.

Volume 7, Issue 4, lqaf205. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754781/pdf/

Citations: 0

IRESeek: structure-informed deep learning method for accurate identification of internal ribosome entry sites in circular RNAs.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf210

Feng Zhang, Heqin Zhu, Jiayin Gao, Jie Hu, Ke Chen, Shaohua Kevin Zhou, Peng Xiong

The internal ribosome entry site (IRES) is a special type of RNA cis-acting element that can initiate translation independently of the 5' cap structure and is widely found in viral RNAs and eukaryotic messenger RNAs. In recent years, an increasing number of studies have revealed that IRES elements also exist in circular RNAs (circRNAs) and mediate their translation. CircRNAs exhibit high stability and tissue specificity, playing critical roles in various physiological and pathological processes. Their coding potential provides important clues for the discovery of novel functional proteins. However, due to the nonlinear structure of circRNAs and the complexity of IRES-mediated regulatory mechanisms, accurately identifying IRES elements within circRNAs remains a significant challenge. Here, we propose IRESeek, a dual-branch deep learning framework for highly accurate detection of IRES elements in circRNAs, which uses a transformer for RNA sequence modeling and a graph convolutional network for RNA structural guidance. To capture the structural patterns of circRNAs, IRESeek uses two physics-based structural features, the base pair motif energy (a thermodynamic energy of RNA secondary structure) and the base pair probability, and combines them with the RNA sequence, enabling comprehensive joint learning of RNA sequence and base pair interactions.
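
The abstract does not describe the network in detail, so the following is only a rough sketch of how base pair probabilities could define a graph over a circular RNA and how one graph convolution step propagates nucleotide features along it. The function names, the edge weighting, the extra backbone edge that closes the circle between the last and first nucleotide, and the symmetric normalization are all assumptions for illustration, not IRESeek's actual architecture.

```python
import math

def one_hot(seq):
    """Encode an RNA sequence as one-hot rows over A, C, G, U."""
    alphabet = "ACGU"
    return [[1.0 if base == a else 0.0 for a in alphabet] for base in seq]

def circ_graph(n, pair_probs):
    """Weighted adjacency for a circular RNA of length n.
    pair_probs: iterable of (i, j, p) with 0-based positions and pairing probability p."""
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n  # backbone edge; the modulo closes the circle
        adj[i][j] = adj[j][i] = 1.0
    for i, j, p in pair_probs:
        adj[i][j] = adj[j][i] = max(adj[i][j], p)  # base pair edge weighted by probability
    return adj

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(weights), len(weights[0])
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_in)))
             for o in range(n_out)] for i in range(n)]
```

In a dual-branch design of the kind the abstract describes, node embeddings from such a structure branch would be combined with sequence embeddings from the transformer branch; how IRESeek fuses the two is not stated here.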

Volume 7, Issue 4, lqaf210. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754787/pdf/

Citations: 0

G-quadruplex structures as modulators of alternative promoter usage.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf208

Rongxin Zhang, Jean-Louis Mergny

The precise regulation of gene transcription relies on promoters, and the selection of specific promoters for a particular gene is a key determinant of transcript diversity. However, the regulatory mechanisms governing promoter selection are not fully understood. G-quadruplexes (G4s) are unique noncanonical DNA secondary structures that have emerged as important regulators of gene expression. In this study, we systematically analyzed the relationship between G4 structures and alternative promoters (APs) in two cancer cell lines, K562 and HepG2, by integrating native elongating transcript-cap analysis of gene expression and G4 ChIP-seq datasets. We identified 573 differentially utilized APs (|fold change| > 2, false discovery rate < 0.05), 26% of which were associated with G4 structures within 100 base pairs. Notably, G4-associated promoters predominantly exhibited increased activity, suggesting that G4s generally promote AP selection. Furthermore, treatment with G4 ligands induced the generation of APs, suggesting that the stabilization of G4 structures may modulate AP usage. Collectively, these findings provide new insights into the G4-based mechanisms that regulate transcript isoform diversity.
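
The selection criteria quoted in this abstract (|fold change| > 2, false discovery rate < 0.05, G4 within 100 base pairs) can be sketched as a simple filter. The record layout, the function names, the use of the Benjamini-Hochberg procedure for the false discovery rate, and the reading of "fold change" as a signed ratio are assumptions for illustration; the abstract does not specify how the authors computed these quantities.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues

def distance_to_interval(pos, start, end):
    """Distance in base pairs from a position to a closed interval (0 if inside)."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0

def differential_aps(promoters, g4_peaks, fc_cutoff=2.0, fdr_cutoff=0.05, max_dist=100):
    """promoters: dicts with id, chrom, pos, fold_change, pvalue.
    g4_peaks: (chrom, start, end) tuples.
    Returns (promoter id, has G4 within max_dist) for each differential promoter."""
    qvalues = benjamini_hochberg([p["pvalue"] for p in promoters])
    hits = []
    for promoter, q in zip(promoters, qvalues):
        if abs(promoter["fold_change"]) > fc_cutoff and q < fdr_cutoff:
            near_g4 = any(
                chrom == promoter["chrom"]
                and distance_to_interval(promoter["pos"], start, end) <= max_dist
                for chrom, start, end in g4_peaks
            )
            hits.append((promoter["id"], near_g4))
    return hits
```

The fraction of hits flagged `True` corresponds to the kind of G4-associated proportion the abstract reports, under the assumptions above.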

Volume 7, Issue 4, lqaf208. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754776/pdf/

Citations: 0

Designing genetically stable multicopy gene constructs with the ChimeraUGEM web server.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-29 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf191

Moritz Burghardt, Alon Diament, Tamir Tuller

High expression of heterologous proteins is often achieved by integrating multiple copies of a gene into a host. However, such multicopy systems are prone to genetic instability due to homologous recombination between identical sequences. We present the multisequence ChimeraMap (MScMap), an algorithm for designing multiple synonymous coding sequences that minimizes recombination risk while maintaining high expression. MScMap extends the ChimeraMap framework by selecting diverse nucleotide blocks from a host genome to encode the target protein, balancing host adaptation and sequence dissimilarity. We introduce heuristics for block selection and concatenation to reduce long common substrings, a known driver of recombination. Our method outperforms a multi-objective evolutionary algorithm in both genetic stability and predicted expression across a wide range of human proteins while being significantly faster. We also show that MScMap can be used to reduce sequence repeats within a single coding sequence. A web tool for single and multicopy coding sequence optimization is available online.
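
Since the abstract names long common substrings as a driver of recombination, a small sketch of how that quantity can be measured between candidate gene copies may help. This is not the MScMap algorithm itself, only a standard dynamic-programming check one might run on its output; the function names are hypothetical.

```python
from itertools import combinations

def longest_common_substring(a, b):
    """Longest contiguous stretch shared by two sequences (O(len(a) * len(b)) time)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the match ending at (i-1, j-1)
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

def worst_pair_repeat(sequences):
    """Length of the longest substring shared by any pair of gene copies."""
    return max(len(longest_common_substring(a, b))
               for a, b in combinations(sequences, 2))
```

For example, `ATGGCCAAA` and `ATGGCTAAA` both encode Met-Ala-Lys but share the 5-nucleotide stretch `ATGGC`; a design procedure that lowers `worst_pair_repeat` across all copies is reducing exactly the kind of shared substring the abstract refers to.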

Volume 7, Issue 4, lqaf191. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12746100/pdf/

Citations: 0

Integrating natural language processing and genome analysis enables accurate bacterial phenotype prediction.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-29 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf174

Daniel Gómez-Pérez, Alexander Keller

Understanding microbial phenotypes from genomic data is crucial for studying co-evolution, ecology, and pathology. This study presents a scalable approach that integrates literature-extracted information with genomic data, combining natural language processing and functional genome analysis. We applied this method to publicly available data, providing novel insights into predicting microbial phenotypes. We fine-tuned transformer-based language models to analyze 3.83 million open-access scientific articles, extracting a phenotypic network of bacterial strains. This network maps relationships between strains and traits such as pathogenicity, metabolism, and biome preference. By annotating the strains' reference genomes, we predicted key genes influencing these traits. Our findings align with known phenotypes, reveal novel correlations, and uncover genes involved in disease and host associations. The network's interconnectivity provides a deeper understanding of microbial communities and allows identification of hub species through inferred trophic connections that are difficult to establish experimentally. This work demonstrates the potential of machine learning for uncovering cross-species gene-phenotype patterns. As microbial genomic data and literature expand, such methods will be essential for extracting meaningful insights and advancing microbiology research. In summary, this integrative approach can accelerate discovery and understanding in microbial genomics. Ultimately, such techniques will facilitate the study of microbial ecology, co-evolutionary processes, and disease pathogenesis to an unprecedented depth.
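
The abstract does not say how hub species are ranked, so the following is only a minimal sketch of one common option: counting how many inferred connections each strain has (degree centrality). The function name `hub_strains`, the edge-list input, and the strain labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def hub_strains(edges, top_n=1):
    """Rank strains by how many inferred connections they take part in.

    `edges` is a list of (strain_a, strain_b) pairs; this input format is
    an assumption made for illustration, not the paper's data model.
    """
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Sort by descending degree, then by name so ties are deterministic.
    ranked = sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical toy network of four strains.
toy_edges = [
    ("strain_A", "strain_B"),
    ("strain_A", "strain_C"),
    ("strain_B", "strain_C"),
    ("strain_A", "strain_D"),
]
top_hubs = hub_strains(toy_edges, top_n=2)
```

Degree is the simplest notion of a hub. A real analysis might prefer a weighted or directed centrality measure, which this sketch does not attempt.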

Citations: 0

A computational framework to dissect imputation strategies for single-cell histone modification data.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-29 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf192

Marta Moreno-González, Jeroen de Ridder, Jop Kind, Robin H van der Weide

Single-cell profiling of histone post-translational modifications (scHPTMs) offers a powerful lens for dissecting epigenetic regulation and cellular identity, yet low read depth and inherent noise in these datasets pose significant analytical challenges. Here, we introduce the first comprehensive computational framework that systematically evaluates imputation strategies on scHPTM data, including methods originally developed for scRNA-seq and scATAC-seq. Leveraging both synthetic and published datasets, we apply novel performance metrics, implemented in a modular R package, to assess signal recovery, enrichment at biologically relevant genomic sites, and preservation of cell-to-cell similarities. Our extensive benchmarking reveals that performance varies markedly by analytical task (e.g. signal denoising, peak detection, and clustering), highlighting that no one-size-fits-all solution exists for these data. By delineating the strengths and limitations of current imputation approaches, this work lays the foundation for the targeted development of next-generation, task-aware algorithms, while providing critical guidance for researchers and developers on the current capabilities and unmet needs in single-cell epigenomics.
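
To make two of the evaluation ideas named above more concrete (signal recovery and preservation of cell-to-cell similarities), here is a rough sketch using plain Pearson correlation. These are not the metrics from the authors' R package, which the abstract does not define. The function names and the cells-by-bins list-of-lists layout are assumptions chosen for illustration, and Python is used here only for convenience.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def signal_recovery(truth, imputed):
    """Mean per-cell correlation between reference and imputed profiles."""
    return sum(pearson(t, i) for t, i in zip(truth, imputed)) / len(truth)


def pairwise_similarity(matrix):
    """Cell-to-cell correlations for every unordered pair of cells, flattened."""
    n = len(matrix)
    return [pearson(matrix[a], matrix[b]) for a in range(n) for b in range(a + 1, n)]


def similarity_preservation(truth, imputed):
    """Correlation between the two sets of pairwise cell similarities."""
    return pearson(pairwise_similarity(truth), pairwise_similarity(imputed))


# Hypothetical toy data: 3 cells x 4 genomic bins.
truth = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
imputed = [[2 * v for v in row] for row in truth]  # a perfectly rescaled "imputation"
recovery = signal_recovery(truth, imputed)
preservation = similarity_preservation(truth, imputed)
```

Because Pearson correlation ignores scale, the rescaled toy matrix scores close to 1 on both measures. A benchmark that cares about absolute signal would need a scale-sensitive metric instead.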

Citations: 0

Life at the extremes: maximally divergent microbes with similar genomic signatures linked to extreme environments.

IF 2.8 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-12-23 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf189

Monireh Safari, Joseph Butler, Gurjit S Randhawa, Kathleen A Hill, Lila Kari

Extreme environments impose strong mutation and selection pressures that drive distinctive, yet understudied, genomic adaptations in extremophiles. In this study, we identify 15 bacterium-archaeon pairs that exhibit highly similar k-mer-based genomic signatures despite maximal taxonomic divergence, suggesting that shared environmental conditions can produce convergent, genome-wide sequence patterns that transcend evolutionary distance. To uncover these patterns, we developed a computational pipeline to select a composite genome proxy assembled from noncontiguous subsequences of the genome. Using supervised machine learning on a curated dataset of 693 extremophile microbial genomes, we found that 6-mers and 100 kbp genome proxy lengths provide the best balance between classification accuracy and computational efficiency. Our results provide conclusive evidence of the pervasive nature of k-mer-based patterns across the genome, and uncover the presence of taxonomic and environmental components that persist across all regions of the genome. The 15 bacterium-archaeon pairs identified by our method as having similar genomic signatures were validated through multiple independent analyses, including 3-mer frequency profile comparisons, phenotypic trait similarity, and geographic co-occurrence data. These complementary validations confirmed that extreme environmental pressures can override traditionally recognized taxonomic components at the whole-genome level. Together, these findings reveal that adaptation to extreme conditions can carry robust, taxonomic domain-spanning imprints on microbial genomes, offering new insight into the relationship between environmental impacts and genome sequence composition convergence.
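
As a rough illustration of what a k-mer-based genomic signature is, the sketch below counts every length-k substring of a DNA sequence, normalizes the counts to frequencies, and compares two such vectors. The abstract does not specify how the authors encode or compare signatures, so the cosine distance, the function names, and the toy sequences are all assumptions.

```python
import math
from collections import Counter
from itertools import product


def kmer_signature(seq, k=6):
    """Normalized k-mer frequency vector over the alphabet ACGT.

    Vector entries follow the lexicographic order of all 4**k k-mers.
    Windows containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_distance(u, v):
    """1 minus cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


# Hypothetical toy sequences; real inputs would be ~100 kbp genome proxies.
sig_a = kmer_signature("ACGTACGTACGT", k=3)
sig_b = kmer_signature("ACGTACGTTTTT", k=3)
distance = cosine_distance(sig_a, sig_b)
```

With k=6, as reported in the abstract, each signature has 4^6 = 4096 entries, which remains cheap to compute and compare across hundreds of genomes.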

Citations: 0