NAR Genomics and Bioinformatics: latest articles, page 8

Why AGG is associated with high transgene output: passenger effects and their implications for transgene design.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-19 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf086

Kate G Daniels, Sofia Radrizzani, Laurence D Hurst

In bacteria, high A and low G content of the 5' end of the coding sequence (CDS) promotes low RNA stability, facilitating ribosomal initiation and subsequently a high protein to transcript ratio. Additionally, 5' NGG codons are suppressive owing to peptidyl-tRNA drop off. It was, therefore, surprising that the first large-scale transgene experiment to interrogate the 5' effect by codon randomization found the NGG, G-rich codon AGG to be the most associated with high transgene output. Why is this? We show that this is not replicated in other large transgene datasets, where AGG and NGG are associated with low efficiency. More generally, there is limited agreement between the first experiment and others. This we find to be a consequence of non-random construct design. In constructs of the first experiment, AGG disproportionately occurs with non-AGG codons associated with low stability and high protein output, making AGG's association with high output an artefact. While translationally non-optimal codons like AGG are conjectured to slow ribosomes for orderly initiation, we find that in the less biased constructs high, not low, translational adaptation in the first 10 codons is (weakly) predictive of higher translational efficiency. These results have implications for both transgene and experimental design.
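As a rough illustration of the sequence features discussed in this abstract, the sketch below tallies AGG and NGG codons and the A and G fractions over the first codons of a CDS. The function names, the decision to skip the start codon, and the default window of 10 codons are assumptions made for illustration; this is not the authors' analysis pipeline.

```python
from collections import Counter


def five_prime_codons(cds, n_codons=10):
    """Return up to n_codons codons following the start codon (assumed at position 0)."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(3, 3 * (n_codons + 1), 3)]
    # Drop empty or truncated trailing slices when the CDS is short.
    return [c for c in codons if len(c) == 3]


def ngg_summary(cds, n_codons=10):
    """Count AGG and NGG codons and compute A/G base fractions in the 5' window."""
    codons = five_prime_codons(cds, n_codons)
    counts = Counter(codons)
    bases = "".join(codons)
    total = len(bases)
    return {
        "AGG": counts["AGG"],
        "NGG": sum(v for c, v in counts.items() if c.endswith("GG")),
        "A_frac": bases.count("A") / total if total else 0.0,
        "G_frac": bases.count("G") / total if total else 0.0,
    }


if __name__ == "__main__":
    # Toy CDS: ATG AGG AAA CGG TTT
    print(ngg_summary("ATGAGGAAACGGTTT"))
```

Counting codons this way only describes a construct; the abstract's point is that such counts can be confounded by the other codons a construct carries, so any association with output needs to control for co-occurring codons.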

NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf086. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204400/pdf/

Citations: 0

Novel treatment-specific causal biomarkers for colorectal cancer by omics integration.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-19 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf053

Akram Yazdani, Azam Yazdani, Raul Mendez-Giraldez, Gianluigi Pillonetto, Esmat Samiei, Reza Hadi, Heinz-Josef Lenz, Alan P Venook, Ahmad Samiei, Andrew B Nixon, Joseph A Lucci, Scott Kopetz, Monica M Bertagnolli, Federico Innocenti

While monoclonal antibody-based targeted therapies have substantially improved progression-free survival in cancer patients, the variability in individual responses poses a significant challenge in patient care. Therefore, identifying cancer subtypes and their associated biomarkers is required for assigning effective treatment. In this study, we integrated genotype and pre-treatment tissue RNA-seq data and identified biomarkers causally associated with the overall survival (OS) of colorectal cancer (CRC) patients treated with either cetuximab or bevacizumab. We performed enrichment analysis for specific consensus molecular subtypes (CMS) of CRC and evaluated differential expression of identified genes using paired tumor and normal tissue from an external cohort. In addition, we replicated the causal effect of these genes on OS using a validation cohort and assessed their association with The Cancer Genome Atlas Program data as an external cohort. One of the replicated findings was WDR62, whose overexpression shortened OS of patients treated with cetuximab. Enrichment of its overexpression in CMS1 and low expression in CMS4 suggests that patients with the CMS4 subtype may derive greater benefit from cetuximab. In summary, this study highlights the importance of integrating different omics data for identifying promising biomarkers specific to a treatment or a cancer subtype.

NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf053. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204401/pdf/

Citations: 0

Neur-Ally: a deep learning model for regulatory variant prediction based on genomic and epigenomic features in brain and its validation in certain neurological disorders.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-13 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf080

Anil Prakash, Moinak Banerjee

Large-scale quantitative studies have identified significant genetic associations for various neurological disorders. Expression quantitative trait locus (eQTL) studies have shown the effect of single-nucleotide polymorphisms (SNPs) on the differential expression of genes in brain tissues. However, a large majority of the associations are contributed by SNPs in the noncoding regions that can have significant regulatory function but are often ignored. Besides, mutations that are in high linkage disequilibrium with actual regulatory SNPs will also show significant associations. Therefore, it is important to differentiate a regulatory noncoding SNP from a nonregulatory one. To resolve this, we developed a deep learning model named Neur-Ally, which was trained on epigenomic datasets from nervous tissue and cell line samples. The model predicts differential occurrence of regulatory features like chromatin accessibility, histone modifications, and transcription factor binding on genomic regions using DNA sequence as input. The model was used to predict the regulatory effect of neurological condition-specific noncoding SNPs using in silico mutagenesis. The effects of associated SNPs reported in genome-wide association studies of neurological conditions, brain eQTLs, autism spectrum disorder, and reported probable regulatory SNPs in neurological conditions were predicted by Neur-Ally.
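In silico mutagenesis, as mentioned in this abstract, generally means scoring a sequence with the reference allele, scoring it again with the alternate allele, and comparing the two. The sketch below shows that pattern only. `toy_score` is a hypothetical stand-in (GC fraction) for a trained sequence model, and none of these names come from Neur-Ally itself.

```python
def toy_score(seq):
    """Hypothetical stand-in for a trained model: returns the GC fraction of seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def in_silico_mutagenesis(seq, pos, alt, score_fn=toy_score):
    """Return score(alt sequence) - score(ref sequence) for a single-base change.

    seq is the reference window, pos is a 0-based index into it, and alt is the
    alternate base. score_fn is any callable mapping a sequence to a number.
    """
    ref = seq[pos]
    if alt == ref:
        raise ValueError("alternate allele equals the reference base")
    mutated = seq[:pos] + alt + seq[pos + 1:]
    return score_fn(mutated) - score_fn(seq)


if __name__ == "__main__":
    # A>G substitution at position 0 of a toy 8-bp window.
    print(in_silico_mutagenesis("ACGTACGT", 0, "G"))
```

With a real model, `score_fn` would return a predicted regulatory feature (for example, an accessibility probability), and a large absolute difference would flag the SNP as potentially regulatory.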

NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf080. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12164584/pdf/

Citations: 0

Correction to 'Using paired-end read orientations to assess technical biases in capture Hi-C'.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-13 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf091

[This corrects the article DOI: 10.1093/nar/lqae156.].

Citations: 0

T2T-CHM13 versus hg38: accurate identification of immunoglobulin isotypes from scRNA-seq requires a genome reference matched for ancestry.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-11 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf074

Junli Nie, Julie Tellier, Ilariya Tarasova, Stephen L Nutt, Gordon K Smyth

Antibody production by B cells is essential for protective immunity. The clonal selection theory posits that each mature B cell has a unique immunoglobulin receptor generated through random gene recombination and, when stimulated to differentiate into an antibody-secreting cell, has the capacity to produce only a single antibody specificity. It follows from this 'one-cell-one-antibody' dogma that single-cell RNA-seq profiling of antibody-secreting cells should find that each cell expresses only a single form of each of the immunoglobulin heavy and light chains. However, when using GRCh38 as the genome reference, we found that many antibody-secreting cells appeared to express multiple immunoglobulin isotypes. When the newly published T2T-CHM13 genome was used instead as the genome reference, every antibody-secreting cell was found to express a unique isotype, and read mapping quality was also improved. We show that the superior performance of T2T-CHM13 was due to its European origin matching the genetic background of the query samples. On the other hand, T2T-CHM13 failed to appropriately fit the 'one-cell-one-antibody' dogma when applied to data derived from East Asia. Our results show that read assignment to human immunoglobulin isotype genes is very sensitive to the ancestral origin of the genome reference.

NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf074. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12153336/pdf/

Citations: 0

mitoLEAF: mitochondrial DNA Lineage, Evolution, Annotation Framework.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-11 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf079

Nicole Huber, Noah Hurmer, Arne Dür, Walther Parson

The study of mitochondrial DNA (mtDNA) provides invaluable insights into genetic variation, human evolution, and disease mechanisms. However, maintaining a consistent and reliable classification system requires continuous updates. Since Phylotree updates ended in 2016, the accumulation of new haplogroup findings in individual studies has highlighted the critical need for a centralized resource to ensure consistent classifications. To address this gap, we present mitoLEAF, a collaborative, freely accessible, and academically driven repository for mitochondrial phylogenetic analyses. Unlike commercial alternatives that restrict access to their customers through subscription or purchase, mitoLEAF is openly accessible and replicable, ensuring transparency and scientific reproducibility. Hosted as a GitHub repository and supported by an interactive website, mitoLEAF provides an evolving, quality-controlled phylogenetic resource derived from GenBank, EMPOP, and peer-reviewed literature. In this first release, it expands the haplogroup landscape from 5435 to 6409 haplogroups, integrating recent findings and improving phylogenetic accuracy. By excluding known pathogenic variants, mitoLEAF aims to mitigate ethical concerns associated with reporting medically relevant variants. By prioritizing open science over commercial interests, mitoLEAF serves as a vital, community-driven platform for mitochondrial research, fostering collaboration and continuous development.

NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf079. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12153335/pdf/

Citations: 0

Enhancing R-loop prediction with high-throughput sequencing data.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-11 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf077

Thomas Vanhaeren, Ludovica Cataneo, Federico Divina, Pedro Manuel Martínez-García

R-loops are three-stranded RNA and DNA hybrid structures that often occur in the genome and play important roles in a variety of cellular processes from bacteria to mammals. Sequencing methods profiling R-loops genome-wide have revealed that they can form co-transcriptionally at cell type specific genes and associate with specific chromatin states during cell differentiation and reprogramming. However, current computational methods for the prediction of R-loops rely solely on their DNA sequence properties, which precludes detection across cell types, tissues or developmental stages. Here, we conduct a machine learning approach that allows the prediction of mammalian cell type-specific R-loops using sequence information and high-throughput sequencing signals. Our predictive models are induced from human samples and achieve highly accurate predictions, with transcriptomics, DNA features, chromatin accessibility and the active gene body H3K36me3 epigenomic mark being the most informative datasets. We generate de novo virtual R-loop maps that show high concordance with experimental ones and capture cell type specificity. Our approach compares favorably to sequence-based methods and can be generalized to mouse datasets. Based on this, we generate virtual R-loop maps in 51 mammalian systems that are freely accessible to the scientific community.

NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf077. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12153340/pdf/

Citations: 0

scEVE: a single-cell RNA-seq ensemble clustering algorithm capitalizing on the differences of predictions between multiple clustering methods.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-09 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf073

Yanis Asloudj, Fleur Mougin, Patricia Thébault

Single-cell RNA sequencing measures individual cell transcriptomes in a sample. In the past decade, this technology has motivated the development of hundreds of clustering methods. These methods attempt to group cells into populations by leveraging the similarity of their transcriptomes. Because each method relies on specific hypotheses, their predictions can vary drastically. To address this issue, ensemble algorithms detect cell populations by integrating multiple clustering methods, and minimizing the differences of their predictions. While this approach is sensible, it has yet to address some conceptual challenges in single-cell data science; namely, ensemble algorithms have yet to generate clustering results with uncertainty values and multiple resolutions. In this work, we present an original approach to ensemble clustering that addresses these challenges, by describing the differences between clustering results, rather than minimizing them. We present the scEVE algorithm, and we evaluate it on 15 experimental datasets, and up to 1200 synthetic datasets. Our results reveal that scEVE outperforms the state of the art, and addresses both conceptual challenges. We also highlight how biological downstream analyses will benefit from addressing these challenges. We expect that this work will provide an alternative direction for developing single-cell ensemble clustering algorithms.
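One simple way to describe, rather than minimize, the differences between clustering results is to intersect the partitions: cells that receive the same label from every method form a block on which all methods agree, and the remaining structure shows where they diverge. The sketch below implements that partition intersection only. It is an illustrative stand-in, not the scEVE algorithm, and the function name `agreement_blocks` is an assumption.

```python
from collections import defaultdict


def agreement_blocks(labelings):
    """Group cell indices that share the same label under every clustering method.

    labelings is a list of label sequences, one per method, all of equal length.
    Returns a sorted list of blocks, each a list of cell indices.
    """
    n_cells = len(labelings[0])
    if any(len(labels) != n_cells for labels in labelings):
        raise ValueError("all labelings must cover the same cells")
    blocks = defaultdict(list)
    for cell in range(n_cells):
        # The tuple of labels across methods identifies the block.
        key = tuple(labels[cell] for labels in labelings)
        blocks[key].append(cell)
    return sorted(blocks.values())


if __name__ == "__main__":
    method_a = [0, 0, 1, 1]
    method_b = [5, 5, 5, 6]
    # Cells 0 and 1 are co-clustered by both methods; cells 2 and 3 are split.
    print(agreement_blocks([method_a, method_b]))
```

Blocks that stay large under intersection are candidates for robust populations, while cells scattered into small blocks are the ones whose assignment is uncertain across methods.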

NAR Genomics and Bioinformatics, volume 7, issue 2, article lqaf073. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12147100/pdf/

Citations: 0

Precise and scalable metagenomic profiling with sample-tailored minimizer libraries.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-09 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf076

Johan Nyström-Persson, Nishad Bapatdhar, Samik Ghosh

Reference-based metagenomic profiling requires large genome libraries to maximize detection and minimize false positives. However, as libraries grow, classification accuracy suffers, particularly in k-mer-based tools, as the growing overlap in genomic regions among organisms results in more high-level taxonomic assignments, blunting precision. To address this, we propose sample-tailored minimizer libraries, which improve on the minimizer-lowest common ancestor classification algorithm from the widely used Kraken 2. In this method, an initial filtering step using a large library removes non-resemblance genomes, followed by a refined classification step using a dynamically built smaller minimizer library. This 2-step classification method shows significant performance improvements compared to the state of the art. We develop a new computational tool called Slacken, a distributed and highly scalable platform based on Apache Spark, to implement the 2-step classification method, which improves speed while keeping the cost per sample comparable to Kraken 2. Specifically, in the CAMI2 'strain madness' samples, the fraction of reads classified at species level increased by 3.5×, while for in silico samples, it increased by 2.2×. The 2-step method achieves the sensitivity of large genomic libraries and the specificity of smaller ones, unlocking the true potential of large reference libraries for metagenomic read profiling.
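The 2-step idea described in the abstract can be illustrated with a toy sketch. This is not Slacken's or Kraken 2's implementation: the lexicographic minimizer ordering, the simplified path scoring, the `min_reads` filter criterion, and all names and sequences below are assumptions made for illustration only. The first pass classifies reads against the full library, genomes whose taxon received no direct read support are dropped, and a smaller library built from the remaining genomes is used for the second pass.

```python
from collections import Counter


def minimizers(seq, k, w):
    """Distinct (w, k)-minimizers: the smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < w:
        return {min(kmers)} if kmers else set()
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def lca(a, b, parent):
    """Lowest common ancestor of taxa a and b in a child -> parent map."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b


def build_library(genomes, parent, k, w):
    """Map each minimizer to the LCA of all taxa whose genome yields it."""
    library = {}
    for taxon, seq in genomes.items():
        for m in minimizers(seq, k, w):
            library[m] = lca(library[m], taxon, parent) if m in library else taxon
    return library


def classify(read, library, parent, k, w):
    """Simplified minimizer-LCA call: pick the hit taxon with the best root-to-taxon score."""
    hits = Counter(library[m] for m in minimizers(read, k, w) if m in library)
    if not hits:
        return None

    def path_score(taxon):
        # Hits on the taxon itself plus hits on all of its ancestors.
        score = 0
        while taxon is not None:
            score += hits.get(taxon, 0)
            taxon = parent.get(taxon)
        return score

    best = max(path_score(t) for t in hits)
    winners = [t for t in hits if path_score(t) == best]
    result = winners[0]
    for t in winners[1:]:  # ties are resolved by moving up to the LCA
        result = lca(result, t, parent)
    return result


def two_step(reads, genomes, parent, k, w, min_reads=1):
    """Step 1: filter genomes with the full library. Step 2: reclassify with the tailored one."""
    full = build_library(genomes, parent, k, w)
    first_pass = Counter(classify(r, full, parent, k, w) for r in reads)
    # Hypothetical filter: keep genomes whose taxon got at least min_reads direct calls.
    kept = {t: g for t, g in genomes.items() if first_pass[t] >= min_reads}
    tailored = build_library(kept, parent, k, w)
    return [classify(r, tailored, parent, k, w) for r in reads]


# Toy data: two species share a region; only sp_A is present in the sample.
shared = "GATTGCAGTCGATGCTAGTCAGT"
a_only = "ACCACAACCCAACACCA"
b_only = "GTTGTGGTTTGGTGTTG"
parent = {"sp_A": "genus", "sp_B": "genus", "genus": "root"}
genomes = {"sp_A": shared + a_only, "sp_B": shared + b_only}
reads = [a_only[2:14], shared[3:17]]
K, W = 4, 3

full_library = build_library(genomes, parent, K, W)
one_step_calls = [classify(r, full_library, parent, K, W) for r in reads]
two_step_calls = two_step(reads, genomes, parent, K, W)
print("one step:", one_step_calls)
print("two step:", two_step_calls)
```

In this toy setup the read drawn from the shared region can only be placed at genus level with the full library, because its minimizers occur in both genomes. Once the unsupported genome is filtered out, the same read resolves to species level, which is the effect the abstract attributes to sample-tailored libraries.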

Citations: 0

Evolutionary perspective of the CAG/CAA interplay coding for pure polyglutamine stretches in proteins.

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics

Pub Date : 2025-06-09 eCollection Date: 2025-06-01 DOI: 10.1093/nargab/lqaf075

Antonio Moreno-Rodríguez, Antonio J Pérez-Pulido, Pablo Mier

Polyglutamine regions appear in many eukaryotic proteins. Most research on these stretches has focused on humans and primates. We wanted to check whether patterns in their codon usage are shared across a wide taxonomic range. Protein-coding transcripts from 30 eukaryotic model species were searched for stretches of consecutive glutamine codons (CAA/CAG). Most species have higher CAG proportion in longer stretches, except fishes, which either reduced or kept a stable CAG use. CAA codons are located closer to the C-terminal side of the stretches in plants, invertebrates, and tetrapods; fungi showed no bias and fishes showed the opposite. Many tetrapods have codons flanking pure CAG stretches that hint at a mutational control of repeat growth. However, the maximum number of consecutive identical codons within the polyglutamine stretches in most species followed random expectations, with fishes as a main exception. We detected shared patterns in codon usage and position across taxonomically distant species, yet each group retained unique traits. Internal CAA position and external flanking codons both seemed to slow pure CAG expansion. Overall, a mix of random processes and species-specific factors drives how glutamine repeats are shaped and maintained in evolution.
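The kind of search the abstract describes, scanning coding sequences for runs of consecutive glutamine codons and summarizing their CAG/CAA composition, might look like the minimal sketch below. The minimum stretch length (`min_len=4`), the function names, and the example sequence are assumptions for illustration; the abstract does not state the thresholds or code the authors used.

```python
from itertools import groupby

GLN_CODONS = {"CAA", "CAG"}


def codons(cds):
    """Split a coding sequence into in-frame codons, dropping any trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def polyq_stretches(cds, min_len=4):
    """Return (start_codon_index, codon_list) for each run of >= min_len CAA/CAG codons."""
    stretches, run, start = [], [], 0
    # The trailing None is a sentinel that closes a run ending at the last codon.
    for i, codon in enumerate(codons(cds) + [None]):
        if codon in GLN_CODONS:
            if not run:
                start = i
            run.append(codon)
        else:
            if len(run) >= min_len:
                stretches.append((start, run))
            run = []
    return stretches


def summarize(run):
    """CAG share, longest run of identical codons, and mean relative CAA position."""
    n = len(run)
    # Relative position: 0 is the N-terminal end of the stretch, 1 the C-terminal end.
    caa_positions = [i / (n - 1) for i, c in enumerate(run) if c == "CAA"] if n > 1 else []
    return {
        "length": n,
        "cag_fraction": run.count("CAG") / n,
        "longest_identical_run": max(len(list(g)) for _, g in groupby(run)),
        "mean_caa_position": sum(caa_positions) / len(caa_positions) if caa_positions else None,
    }


# Hypothetical CDS: one 7-codon stretch, plus a 2-codon run that is too short to count.
example_cds = "ATG" + "CAG" * 5 + "CAA" + "CAG" + "GGT" + "CAACAG" + "TAA"
example_stretches = polyq_stretches(example_cds)
example_summary = summarize(example_stretches[0][1])
print(example_stretches)
print(example_summary)
```

Working on codons rather than on the raw nucleotide string keeps the search in frame, so an out-of-frame CAG repeat is not mistaken for a glutamine stretch. Comparing `longest_identical_run` against values from shuffled stretches would be one way to approach the random-expectation comparison mentioned in the abstract.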

Citations: 0