
GigaScience: Latest Articles

An Interpretable Graph-Regularized Optimal Transport Framework for Diagonal Single-Cell Integrative Analysis.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-02-09 DOI: 10.1093/gigascience/giag012
Zexuan Wang, Qipeng Zhan, Shu Yang, Zhuoping Zhou, Mengyuan Kan, Tianhuan Zhai, Li Shen

Background: Recent advancements in single-cell omics technologies have enabled detailed characterization of cellular processes. However, co-assay sequencing technologies remain limited, resulting in unpaired single-cell omics datasets with differing feature dimensions.

Findings: We present GROTIA (Graph-Regularized Optimal Transport Framework for Diagonal Single-Cell Integrative Analysis), a computational method to align multi-omics datasets without requiring any prior correspondence information. GROTIA achieves global alignment through optimal transport while preserving local relationships via graph regularization. Additionally, our approach provides interpretability by deriving domain-specific feature importance from partial derivatives, highlighting key biological markers. Moreover, the transport plan between modalities can be leveraged for post-integration clustering, enabling a data-driven approach to discover novel cell subpopulations.

Conclusions: We demonstrate GROTIA's superior performance on four simulated and four real-world datasets, surpassing state-of-the-art unsupervised alignment methods and confirming the biological significance of the top features identified in each domain.
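
The abstract above describes alignment through optimal transport and a transport plan between modalities. As a rough illustration of what a transport plan is, the sketch below runs entropy-regularized optimal transport (Sinkhorn iterations) on a small cost matrix. It is not the GROTIA algorithm: the abstract does not specify the solver, the graph regularizer is omitted, and the function name `sinkhorn`, the parameters `reg` and `n_iter`, and the toy cost matrix are all assumptions made for illustration.

```python
import math

def sinkhorn(cost, reg=0.1, n_iter=500):
    """Entropy-regularized optimal transport between two uniform distributions.

    cost[i][j] is the cost of moving mass from source item i to target item j.
    Returns a transport plan whose rows sum to about 1/n and columns to 1/m.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform source weights
    b = [1.0 / m] * m  # uniform target weights
    # Gibbs kernel: low cost gives a large kernel value.
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical cost matrix: 3 cells in one modality against 3 cells in the other.
cost = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]
plan = sinkhorn(cost)
```

Each entry `plan[i][j]` can be read as the amount of mass matched between cell `i` in one dataset and cell `j` in the other; with this toy cost matrix most of the mass stays on the diagonal.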

Citations: 0
Improved genome assembly of whale shark, the world's biggest fish: revealing intragenomic heterogeneity in molecular evolution.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-02-06 DOI: 10.1093/gigascience/giag014
Yawako W Kawaguchi, Rui Matsumoto, Shigehiro Kuraku

High-quality chromosome-level assemblies are essential for understanding genome evolution but remain difficult to obtain for large and complex genomes. Here we present a near gap-free genome assembly of the whale shark (Rhincodon typus) generated with long-read sequencing and Hi-C scaffolding, markedly improving contiguity and completeness. In particular, the X chromosome was extended to nearly twice its previous length, and putative pseudoautosomal regions were identified. Moreover, we report the first Y-linked scaffolds for this species. Comparative analyses with the zebra shark revealed exceptionally low substitution rates across the genome. We further detected a negative correlation between chromosome length and synonymous substitution rate (dS), explained by a positional gradient, here referred to as "chromocline", in which substitution rates gradually decrease from chromosomal ends toward central regions. Notably, the X chromosome exhibited low dS compared with autosomes of similar size, consistent with male-driven evolution. Our results highlight positional and sex-chromosome effects as key determinants of molecular evolutionary rates. The improved assembly will enable broad application to population-genetic and conservation genomic analyses in the whale shark.
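
The positional gradient described above (substitution rates decreasing from chromosomal ends toward central regions) can be illustrated by correlating dS with a window's relative distance from the nearest chromosome end. The sketch below is only an assumption about how such a check might look: the helper names, the window values, and the chromosome length are made up for illustration and are not data or methods from the paper.

```python
import math

def relative_distance_from_end(position, chrom_length):
    """Return 0.0 at either chromosome end and 1.0 at the midpoint."""
    return min(position, chrom_length - position) / (chrom_length / 2)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up windows on a hypothetical 100 Mb chromosome: (midpoint in bp, dS).
chrom_length = 100e6
windows = [(5e6, 0.030), (25e6, 0.024), (50e6, 0.020), (75e6, 0.023), (95e6, 0.031)]

dist = [relative_distance_from_end(pos, chrom_length) for pos, _ in windows]
ds = [value for _, value in windows]
r = pearson(dist, ds)  # negative when dS falls toward the chromosome center
```

A negative `r` in such a check would be consistent with a gradient of the kind the abstract calls a chromocline; the toy values here were chosen to show that pattern and prove nothing about the real genome.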

Citations: 0
Single-nucleus multiple-organ chromatin accessibility landscape in the adult rat.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-02-03 DOI: 10.1093/gigascience/giag013
Ronghai Li, Shanshan Duan, Qiuting Deng, Wen Ma, Chang Liu, Peng Gao, Li Lu, Yue Yuan

The chromatin accessibility landscape is the basis of cell-specific gene expression. We generated a multi-organ, single-nucleus chromatin accessibility landscape from the model organism Rattus norvegicus. For this single-cell atlas, we constructed 25 libraries via snATAC-seq from nine organs in the rat, with a total of over 110,000 cells. Cell classification integrating gene activity scores with known marker genes identified 77 cell types, which were strongly correlated with those in published mouse single-cell transcriptome atlases. We further investigated the enrichment of cell type- and organ-specific transcription factors (TFs), shared and organ-specific features of endothelial and stromal cells, cross-organ macrophage regulatory states, and the conservation and specificity of gene regulatory programs across species. Together, these findings provide a valuable foundation for dissecting tissue-specific regulatory logic and for advancing cross-organ and cross-species cell type annotation and functional inference in the rat model.

Citations: 0
A Proposed Unified, Scalable Platform for Integrative Research on Venomous Species.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-01-31 DOI: 10.1093/gigascience/giaf153
Shaadi Mehr, Todd Castoe, Marymegan Daly, Florence Jungo, Kim N Kirchhoff, Ivan Koludarov, Stephen P Mackessy, Jason Macrander, Praveena Naidu, Maria Vittoria Modica, Elda E Sanchez, Giulia Zancolli, Mandë Holford

Venomous animal research is hampered by fragmented, specialized, and non-interoperable databases (isolated genomic, proteomic, and ecological data). Despite the immense promise of venomous organisms to yield novel bioactive compounds for pharmacological and evolutionary applications, the informatics landscape for such taxa has remained patchy, lacking macro-scale integration across species. We present VenomsBase, an integrated, modular resource that synthesizes multi-omics data, ecological metadata, and functional annotations for venom-bearing organisms. Following the FAIR guidelines, VenomsBase combines an ontology-driven architecture with big-data cloud workflows for sequence integration, motif clustering, 3D display, and linking ecological metadata. Standardized tools and training modules facilitate worldwide access to resources for researchers in both developed countries and resource-limited areas. Its plug-and-play design allows for integration of additional analytical modules and extension to other species. One can also examine evolutionary trends and connect venom chemistry to ecological niches. VenomsBase would (i) accelerate the pace of venom discovery, whether for therapeutic purposes or evolutionary significance, by providing validated, cross-referenced data sets and community-driven curation, and (ii) foster an open, just, and innovation-ready venom research ecosystem.

Citations: 0
Expression-Driven Genetic Dependency Reveals Targets for Precision Oncology.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-01-29 DOI: 10.1093/gigascience/giag011
Abdulkadir Elmas, Hillary M Layden, Jacob D Ellis, Luke N Bartlett, Xian Zhao, Reika Kawabata-Iwakawa, Zishan Wang, Hideru Obinata, Scott W Hiebert, Kuan-Lin Huang

Background: Cancer cells are heterogeneous, each harboring distinct molecular aberrations and being dependent on different genes for their survival and proliferation. While targeted therapies based on driver DNA mutations have shown success, many tumors lack druggable mutations, limiting treatment options. We hypothesize that new precision oncology targets may be identified through "expression-driven dependency," where cancer cells with high expression of specific genes are more vulnerable to the knockout of those same genes.

Results: We developed BEACON, a Bayesian approach to identify expression-driven dependency targets by analyzing global transcriptomic and proteomic profiles alongside genetic dependency data from cancer cell lines across 17 tissue lineages. BEACON successfully identified known druggable genes, including BCL2, ERBB2, EGFR, ESR1, and MYC, while revealing novel targets confirmed by both mRNA- and protein-expression-driven dependency. The identified genes showed a 3.8-fold enrichment for approved drug targets and a 7- to 10-fold enrichment for druggable oncology targets. Experimental validation demonstrated that depletion of GRHL2, TP63, and PAX5 effectively reduced tumor cell growth and survival in their dependent cells.

Conclusions: Our approach provides a systematic method to identify precision oncology targets based on expression-driven dependency patterns. By integrating multi-omics data with genetic dependency screens, we've created a comprehensive catalog of potential therapeutic targets that may expand treatment options for cancer patients lacking druggable mutations. This resource offers new opportunities for precision oncology target discovery beyond mutation-based approaches.
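
The fold-enrichment figures quoted in the Results can be read as a ratio of two proportions: the share of drug targets among the identified genes divided by the share among all genes considered. The abstract does not give the exact definition or the underlying counts, so the function below and the counts passed to it are hypothetical and serve only to show the arithmetic.

```python
def fold_enrichment(hits_in_selected, n_selected, hits_in_background, n_background):
    """Ratio of the hit rate among selected genes to the background hit rate."""
    selected_rate = hits_in_selected / n_selected
    background_rate = hits_in_background / n_background
    return selected_rate / background_rate

# Hypothetical counts (not from the paper): 12 drug targets among 200 selected
# genes, against 300 drug targets among 20,000 background genes.
enrichment = fold_enrichment(12, 200, 300, 20000)
```

With these made-up counts the selected rate is 6% and the background rate is 1.5%, so the enrichment is about fourfold.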

Citations: 0
ViralBindPredict: Empowering Viral Protein-Ligand Binding Sites through Deep Learning and Protein Sequence-Derived Insights.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-01-24 DOI: 10.1093/gigascience/giag010
A M B Amorim, C Marques-Pereira, T Almeida, N Rosário-Ferreira, H S Pinto, C Vaz, A Francisco, I S Moreira

Background: The development of a single therapeutic compound can exceed 1.8 billion USD and take more than a decade, underscoring the urgent need to accelerate drug discovery. Computational methods have become indispensable; however, traditional approaches, such as docking simulations, face limitations because they depend on protein and ligand structures that may be unavailable, incomplete, or of low accuracy. Even recent breakthroughs, such as AlphaFold, do not consistently provide models precise enough to identify ligand-binding sites or drug-target interactions.

Results: We present ViralBindPredict, a deep learning framework that predicts viral protein-ligand binding sites directly from sequence. We also introduce the first curated large-scale benchmark of viral protein-ligand interactions, comprising >10,000 viral chains and ≈13,000 interactions processed using a 4.5 Å heavy-atom contact threshold. ViralBindPredict combines Mordred ligand descriptors with contextual protein embeddings from ESM-2 or ProtTrans, enabling structure-free learning of binding preferences. Leakage-controlled data splits were applied to prevent overlap across protein sequence clusters and ligand scaffolds (Cluster90%, NoRed90%→Cluster90%, Cluster40%, NoRed90%→Cluster40%). Across most regimes, multilayer perceptrons, especially with ESM-2 embeddings, outperformed LightGBM baselines, maintaining strong precision-recall for unseen ligands but showing larger drops for unseen proteins, indicating that the protein context dominates generalization.

Conclusions: ViralBindPredict introduces the first leakage-controlled benchmark for viral protein-ligand interactions and demonstrates accurate ligand-binding residue prediction directly from protein sequence. Together, these advances establish ViralBindPredict as a robust and extensible workflow for sequence-based antiviral discovery, supporting rapid target prioritization, compound repurposing, and de novo drug design, even in the absence of structural data.
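
The 4.5 Å heavy-atom contact threshold mentioned in the Results suggests how binding-site labels might be derived from a structure: a residue counts as binding if any of its heavy atoms lies within the cutoff of any ligand heavy atom. The sketch below works under that assumption. The function name, the input layout, the choice of an inclusive comparison, and the coordinates are hypothetical; the authors' actual pipeline is not described in the abstract.

```python
import math

CONTACT_CUTOFF = 4.5  # Å, the heavy-atom threshold named in the abstract

def binding_residues(residue_atoms, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """Return IDs of residues with a heavy atom within `cutoff` Å of the ligand.

    residue_atoms: dict mapping residue ID to a list of (x, y, z) coordinates
    ligand_atoms: list of (x, y, z) coordinates of ligand heavy atoms
    """
    labels = []
    for res_id, atoms in residue_atoms.items():
        # One close atom pair is enough to label the residue as binding.
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            labels.append(res_id)
    return labels

# Hypothetical coordinates, in Å.
residues = {
    "ALA10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # nearest atom is 2.5 Å away
    "GLY11": [(10.0, 0.0, 0.0)],                  # 6.0 Å away
    "SER12": [(4.0, 3.0, 0.0)],                   # 3.0 Å away
}
ligand = [(4.0, 0.0, 0.0)]
labels = binding_residues(residues, ligand)
print(labels)  # → ['ALA10', 'SER12']
```

Per-residue labels of this kind are what a sequence-based model could then be trained to predict without access to the structure.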

Citations: 0
High-definition likelihood inference of genetic colocalization reveals protein biomarkers for human complex diseases.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-01-23 DOI: 10.1093/gigascience/giaf155
Yuying Li, Ranran Zhai, Zhijian Yang, Ting Li, Yudi Pawitan, Xia Shen

Background: Genetic colocalization analysis is essential for understanding the shared genetic basis between phenotypic traits. Such an analysis is particularly useful for identifying plasma proteins with potential as therapeutic targets or clinical biomarkers. Improvements to existing tools are needed for more accurate inference of potentially causal biomarkers.

Findings: We develop HDL-C, a high-definition likelihood inference method for genetic colocalization analysis. Based on simulations and observed rediscovery rates in real data analyses, we demonstrate that the HDL-C approach outperforms state-of-the-art methods, COLOC, SuSiE, and SharePro, in detecting genetic colocalization, thus enabling a more complete understanding of genetic connections at specific loci. Analyses of the top 50 protein-disease pairs identified by HDL-C in the male and female cohorts of the UK Biobank uncovered 40 previously validated drug-protein-disease combinations with approved drugs matching the phenotypes and 62 combinations with potential drug repurposing opportunities. Additionally, we identified 63 novel protein-disease pairs that suggest promising candidates for future therapeutic interventions.

Conclusion: This research establishes a robust framework for detecting genetic colocalization signals, enabling the prioritization of disease-relevant protein targets and informing therapeutic development strategies.

Citations: 0
An integrative multiomics random forest framework for robust biomarker discovery.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf148
Wei Zhang, Hanchen Huang, Lily Wang, Brian D Lehmann, X Steven Chen

Background: High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. However, many existing integrative methods rely on linear assumptions or univariate feature importance, limiting their ability to capture nonlinear and interaction-driven dependencies across data modalities.

Results: We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance measure to prioritize shared biomarkers across omics layers. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or, on the response side, the maximal splitting response variable) appears across trees, yielding interpretable, cross-layer feature rankings. We provide two IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches sparse partial least squares/canonical correlation analysis under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (random forest, gradient boosting machine, XGBoost) underperform in the multivariate, unsupervised context. Applied to breast invasive carcinoma and colon adenocarcinoma in The Cancer Genome Atlas (TCGA), MRF-IMD identifies genes, CpGs, and microRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve a higher Adjusted Rand Index than alternatives and recover coherent tumor-type clusters; in the Alzheimer's Disease Neuroimaging Initiative (ADNI), the integrative signature improves dementia progression stratification over a published methylation risk score.
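To make the idea of "how early a predictor appears across trees" concrete, here is a minimal sketch of a minimal-depth score on toy trees. It is not the authors' MRF-IMD implementation: the tuple-based tree encoding, the `1 / (1 + depth)` inversion, the tree-height penalty for unused features, and the `power` parameter are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical tree encoding: a leaf is None; an internal node is
# (feature_name, left_subtree, right_subtree).

def minimal_depths(tree, depth=0, found=None):
    """Return {feature: shallowest depth at which it splits a node} for one tree."""
    if found is None:
        found = {}
    if tree is None:
        return found
    feature, left, right = tree
    if feature not in found or depth < found[feature]:
        found[feature] = depth
    minimal_depths(left, depth + 1, found)
    minimal_depths(right, depth + 1, found)
    return found

def tree_height(tree):
    """Number of split levels in the tree (0 for a bare leaf)."""
    if tree is None:
        return 0
    return 1 + max(tree_height(tree[1]), tree_height(tree[2]))

def inverse_minimal_depth(forest, features, power=1.0):
    """Average 1 / (1 + minimal depth) over trees, then apply an optional power.

    Assumption: a feature never used in a tree is assigned that tree's height.
    """
    scores = defaultdict(float)
    for tree in forest:
        depths = minimal_depths(tree)
        penalty = tree_height(tree)
        for f in features:
            scores[f] += 1.0 / (1.0 + depths.get(f, penalty))
    return {f: (scores[f] / len(forest)) ** power for f in features}

# Toy forest over three made-up predictors.
features = ["geneA", "geneB", "cpg1"]
forest = [
    ("geneA", ("cpg1", None, None), ("geneB", None, None)),
    ("geneA", None, ("cpg1", None, None)),
]
imd = inverse_minimal_depth(forest, features)
ranking = sorted(features, key=imd.get, reverse=True)
print(ranking)
```

Features that split near the root in many trees receive scores close to 1, so sorting by the score gives a ranking of the kind the abstract describes. A `power` above 1 widens the gap between early and late splitters in this sketch; whether the published power transform works this way is not stated in the abstract.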

Conclusions: MRF-IMD provides a scalable and interpretable framework for multiomics integration that reliably identifies cross-layer biomarkers when nonlinear and interaction-driven dependencies are present. This approach advances robust biomarker discovery beyond the limits of linear integrative methods.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12821379/pdf/
Citations: 0
deMEM: a novel divide-and-conquer framework based on de Bruijn graph for scalable multiple sequence alignment.
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf163
Yanming Wei, Zhaoyang Huang, Pinglu Zhang, Yizheng Wang, Yan Li, Liang Yu, Quan Zou

Background: Multiple sequence alignment (MSA) continues to be a central challenge in comparative genomics, where the quality of alignment plays a crucial role in determining the accuracy of downstream analyses. However, the challenge of large-scale alignment remains significant.

Findings: This article introduces deMEM, a novel and effective framework for DNA multiple sequence alignment, which enables existing MSA methods such as MAFFT to handle extremely large sequences. deMEM is a 3-stage alignment process: (i) representing maximum exact matches using a de Bruijn graph and clustering them based on their area, (ii) employing a novel divide-and-conquer framework for alignment, and (iii) providing profile-profile alignment between different clusters.
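As a rough illustration of stage (i), the sketch below builds a small de Bruijn graph over several sequences and lists the k-mers that occur in all of them. It is not deMEM's code: the function names, the toy sequences, and the choice of k are assumptions, and a real tool would go on to extend such shared k-mers into longer exact matches and cluster them.

```python
from collections import defaultdict

def de_bruijn_graph(sequences, k):
    """Map each edge ((k-1)-mer -> (k-1)-mer, i.e. one k-mer) to the set of
    sequence indices that contain that k-mer."""
    edges = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(seq_id)
    return edges

def shared_kmers(edges, n_sequences):
    """Return, sorted, the k-mers found in every input sequence.

    These are candidate anchors around which an alignment could be divided.
    """
    return sorted(
        prefix + suffix[-1]
        for (prefix, suffix), ids in edges.items()
        if len(ids) == n_sequences
    )

# Toy input; real genomes would use a much larger k.
seqs = ["ACGTACGA", "TTACGTAC", "ACGTAGGG"]
graph = de_bruijn_graph(seqs, k=4)
anchors = shared_kmers(graph, len(seqs))
print(anchors)
```

Segments between consecutive anchors are short enough to hand to an existing aligner, which is the general intuition behind a divide-and-conquer scheme of this kind.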

Conclusions: DeMEM enables existing methods like MAFFT to align an extremely large number of sequences, including long sequences that cannot be directly aligned, such as those in a dataset of a thousand monkeypox virus genomes. The deMEM package is freely available at https://github.com/malabz/deMEM.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12878729/pdf/
Citations: 0
Giant chromosomes of a tiny plant: the complete telomere-to-telomere genome assembly of the simple thalloid liverwort Apopellia endiviifolia (Jungermanniopsida, Marchantiophyta).
IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2026-01-21 DOI: 10.1093/gigascience/giaf145
Joanna Szablińska-Piernik, Paweł Sulima, Jakub Sawicki

Background: The liverwort Apopellia endiviifolia, a dioicous, simple thalloid species, is notable for its cryptic diversity, habitat adaptability, and genomic innovation, and it represents a clade that is sister to all other Jungermanniopsida. These features make A. endiviifolia an essential model for exploring speciation mechanisms and the evolution of genome structures within liverworts.

Findings: We present the genome assembly of a haploid A. endiviifolia isolate with a total size of 2,914,960,273 bp and an N50 of 468,157,909 bp, demonstrating high completeness (99.2% BUSCO) and a high consensus quality (quality value 47.6). The assembly consisted of 9 chromosomes, which included 18 telomeres and 9 centromeres (ranging from 1.9 to 5 Mbp in length). RNA sequencing-based annotation identified 34,615 genes, predominantly protein coding. The transposable elements comprised 12.16% long terminal repeat elements and 57 Helitrons. Among the retroelements, the Copia and Gypsy superfamilies comprised 8.94% and 2.95% of the genome, respectively. The Ty3/Gypsy superfamily was significantly enriched in centromeric regions. The average GC content ranged from 38.8% to 39.6%, with gene density varying between 5.52 and 9.78 genes per 500 kbp. Synteny analysis of related liverwort species has revealed complex chromosomal relationships, indicating extensive genome rearrangements among species.
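The summary statistics quoted above (N50, GC content, genes per 500 kbp) have simple definitions, sketched below for readers unfamiliar with them. The function names and toy inputs are illustrative assumptions, not taken from the paper's pipeline; only the gene count and assembly size in the last call come from the abstract.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0

def genes_per_window(n_genes, length_bp, window_bp=500_000):
    """Average number of genes per window of `window_bp` bases."""
    return n_genes * window_bp / length_bp

# Toy values to show the definitions.
toy_n50 = n50([50, 40, 30, 20, 10])
toy_gc = gc_content("ACGTGGCC")

# Genome-wide density from the figures reported in the abstract.
density = genes_per_window(34_615, 2_914_960_273)
print(toy_n50, toy_gc, round(density, 2))
```

With the reported totals, the genome-wide average comes to roughly 5.9 genes per 500 kbp, which sits inside the 5.52 to 9.78 range given for the density variation.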

Conclusions: This study provides the first high-quality reference genome assembly of the haploid liverwort A. endiviifolia. Assembly and annotation offer valuable resources for investigating liverwort evolution, centromere biology, and genome expansion in simple thalloid liverworts.

Citations: 0