GigaScience: Latest Articles

Large-scale genomic survey with deep learning-based method reveals strain-level phage specificity determinants
IF 9.2 | Zone 2, Biology | Q1 MULTIDISCIPLINARY SCIENCES | Pub Date: 2024-04-22 | DOI: 10.1093/gigascience/giae017
Yiyan Yang, Keith Dufault-Thompson, Wei Yan, Tian Cai, Lei Xie, Xiaofang Jiang
Background: Phage therapy, reemerging as a promising approach to counter antimicrobial-resistant infections, relies on a comprehensive understanding of the specificity of individual phages. Yet the significant diversity within phage populations presents a considerable challenge. Currently, there is a notable lack of tools designed for large-scale characterization of phage receptor-binding proteins, which are crucial in determining the phage host range.
Results: In this study, we present SpikeHunter, a deep learning method based on the ESM-2 protein language model. With SpikeHunter, we identified 231,965 diverse phage-encoded tailspike proteins, a crucial determinant of phage specificity that targets bacterial polysaccharide receptors, across 787,566 bacterial genomes from 5 virulent, antibiotic-resistant pathogens. Notably, 86.60% (143,200) of these proteins exhibited strong associations with specific bacterial polysaccharides. We discovered that phages with identical tailspike proteins can infect different bacterial species with similar polysaccharide receptors, underscoring the pivotal role of tailspike proteins in determining host range. The specificity is mainly attributed to the protein’s C-terminal domain, which strictly correlates with host specificity during domain swapping in tailspike proteins. Importantly, our dataset-driven predictions of phage–host specificity closely match the phage–host pairs observed in real-world phage therapy cases we studied.
Conclusions: Our research provides a rich resource, including both the method and a database derived from a large-scale genomics survey. This substantially enhances understanding of phage specificity determinants at the strain level and offers a valuable framework for guiding phage selection in therapeutic applications.
Citations: 0
An effective strategy for assembling the sex-limited chromosome
IF 9.2 | Zone 2, Biology | Q1 MULTIDISCIPLINARY SCIENCES | Pub Date: 2024-04-16 | DOI: 10.1093/gigascience/giae015
Xiao-Bo Wang, Hong-Wei Lu, Qing-You Liu, A-Lun Li, Hong-Ling Zhou, Yong Zhang, Tian-Qi Zhu, Jue Ruan
Background: Most currently available reference genomes lack the sequence map of sex-limited (such as Y and W) chromosomes, which results in incomplete assemblies that hinder further research on sex chromosomes. Recent advancements in long-read sequencing and population sequencing have provided the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts.
Findings: We introduce the first computational method, Sorting long Reads of Y or other sex-limited chromosome (SRY), which achieves improved assembly results compared to flow sorting. Specifically, SRY outperforms flow sorting in the heterochromatic region and performs comparably in other regions. Furthermore, SRY enhances the capabilities of the hybrid assembly software, resulting in improved continuity and accuracy.
Conclusions: Our method enables true complete genome assembly and facilitates downstream research of sex-limited chromosomes.
Citations: 0
Enhanced bovine genome annotation through integration of transcriptomics and epi-transcriptomics datasets facilitates genomic biology
IF 9.2 | Zone 2, Biology | Q1 MULTIDISCIPLINARY SCIENCES | Pub Date: 2024-04-16 | DOI: 10.1093/gigascience/giae019
Hamid Beiki, Brenda M Murdoch, Carissa A Park, Chandlar Kern, Denise Kontechy, Gabrielle Becker, Gonzalo Rincon, Honglin Jiang, Huaijun Zhou, Jacob Thorne, James E Koltes, Jennifer J Michal, Kimberly Davenport, Monique Rijnkels, Pablo J Ross, Rui Hu, Sarah Corum, Stephanie McKay, Timothy P L Smith, Wansheng Liu, Wenzhi Ma, Xiaohui Zhang, Xiaoqing Xu, Xuelei Han, Zhihua Jiang, Zhi-Liang Hu, James M Reecy
Background: The accurate identification of the functional elements in the bovine genome is a fundamental requirement for high-quality analysis of data informing both genome biology and genomic selection. Functional annotation of the bovine genome was performed to identify a more complete catalog of transcript isoforms across bovine tissues.
Results: A total of 160,820 unique transcripts (50% protein coding) representing 34,882 unique genes (60% protein coding) were identified across tissues. Among them, 118,563 transcripts (73% of the total) were structurally validated by independent datasets (PacBio isoform sequencing data, Oxford Nanopore Technologies sequencing data, de novo assembled transcripts from RNA sequencing data) and comparison with Ensembl and NCBI gene sets. In addition, all transcripts were supported by extensive data from different technologies such as whole transcriptome termini site sequencing, RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression, chromatin immunoprecipitation sequencing, and assay for transposase-accessible chromatin using sequencing. A large proportion of identified transcripts (69%) were unannotated, of which 86% were produced by annotated genes and 14% by unannotated genes. A median of two 5′ untranslated regions were expressed per gene. Around 50% of protein-coding genes in each tissue were bifunctional and transcribed both coding and noncoding isoforms. Furthermore, we identified 3,744 genes that functioned as noncoding genes in fetal tissues but as protein-coding genes in adult tissues. Our new bovine genome annotation extended more than 11,000 annotated gene borders compared to Ensembl or NCBI annotations. The resulting bovine transcriptome was integrated with publicly available quantitative trait loci data to study tissue–tissue interconnection involved in different traits and construct the first bovine trait similarity network.
Conclusions: These validated results show significant improvement over current bovine genome annotations.
Citations: 0
Korea4K: whole genome sequences of 4,157 Koreans with 107 phenotypes derived from extensive health check-ups
IF 9.2 | Zone 2, Biology | Q1 MULTIDISCIPLINARY SCIENCES | Pub Date: 2024-04-16 | DOI: 10.1093/gigascience/giae014
Sungwon Jeon, Hansol Choi, Yeonsu Jeon, Whan-Hyuk Choi, Hyunjoo Choi, Kyungwhan An, Hyojung Ryu, Jihun Bhak, Hyeonjae Lee, Yoonsung Kwon, Sukyeon Ha, Yeo Jin Kim, Asta Blazyte, Changjae Kim, Yeonkyung Kim, Younghui Kang, Yeong Ju Woo, Chanyoung Lee, Jeongwoo Seo, Changhan Yoon, Dan Bolser, Orsolya Biro, Eun-Seok Shin, Byung Chul Kim, Seon-Young Kim, Ji-Hwan Park, Jongbum Jeon, Dooyoung Jung, Semin Lee, Jong Bhak
Background: Phenome-wide association studies (PheWASs) have been conducted on Asian populations, including Koreans, but many were based on chip or exome genotyping data. Such studies have limitations regarding whole genome–wide association analysis, making it crucial to have genome-to-phenome association information with the largest possible whole genome and matched phenome data to conduct further population-genome studies and develop health care services based on population genomics.
Results: Here, we present 4,157 whole genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest genomic resource of the Korean Genome Project. It encompasses most of the variants with allele frequency >0.001 in Koreans, indicating that it sufficiently covers most of the common and rare genetic variants with commonly measured phenotypes for Koreans. Korea4K provides 45,537,252 variants, half of which were not present in Korea1K (1,094 samples). We also identified 1,356 new genotype–phenotype associations that were not found with the Korea1K dataset. Phenomics analyses further revealed 24 significant genetic correlations, 14 pleiotropic associations, and 127 causal relationships based on Mendelian randomization among 37 traits. In addition, the Korea4K imputation reference panel, the largest Korean variant reference to date, showed imputation performance superior to that of Korea1K across all allele frequency categories.
Conclusions: Collectively, Korea4K provides not only the largest Korean genome data but also corresponding health check-up parameters and novel genome–phenome associations. The large-scale pathological whole genome–wide omics data will become a powerful set for genome–phenome level association studies to discover causal markers for the prediction and diagnosis of health conditions in future studies.
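The statement that a panel of 4,157 genomes captures most variants with allele frequency above 0.001 can be sanity-checked with a simple binomial calculation. The sketch below is an illustration only, not the authors' analysis: it assumes diploid autosomes, independently sampled chromosomes, and perfect variant calling, and the function name `prob_observed` is made up for this example.

```python
def prob_observed(allele_freq: float, n_individuals: int, ploidy: int = 2) -> float:
    """Probability that an allele with the given population frequency is seen
    at least once among n_individuals * ploidy sampled chromosomes."""
    n_chromosomes = n_individuals * ploidy
    # Complement of "every sampled chromosome carries a different allele".
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# Sample sizes are taken from the abstract; 0.001 is the quoted frequency cutoff.
for label, n in [("Korea1K", 1094), ("Korea4K", 4157)]:
    p = prob_observed(0.001, n)
    print(f"{label}: {n * 2} chromosomes, P(seen at least once) = {p:.4f}")
```

Under these assumptions, an allele at frequency 0.001 has roughly an 89% chance of appearing in a 1,094-person panel and a better than 99.9% chance of appearing in a 4,157-person panel, which is consistent with the larger panel reaching further into the rare-variant range.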
Citations: 0
Improved integration of single-cell transcriptome data demonstrates common and unique signatures of heart failure in mice and humans
IF 9.2 | Zone 2, Biology | Q1 MULTIDISCIPLINARY SCIENCES | Pub Date: 2024-04-04 | DOI: 10.1093/gigascience/giae011
Mariano Ruz Jurado, Lukas S Tombor, Mani Arsalan, Tomas Holubec, Fabian Emrich, Thomas Walther, Wesley Abplanalp, Ariane Fischer, Andreas M Zeiher, Marcel H Schulz, Stefanie Dimmeler, David John
Background: Cardiovascular research heavily relies on mouse (Mus musculus) models to study disease mechanisms and to test novel biomarkers and medications. Yet, applying these results to patients remains a major challenge and often results in ineffective drugs. Therefore, it is an open challenge of translational science to develop models with high similarity and predictive value. This requires a comparison of disease models in mice with diseased tissue derived from humans.
Results: To compare the transcriptional signatures at single-cell resolution, we implemented an integration pipeline called OrthoIntegrate, which uniquely assigns orthologs and thereby merges single-cell RNA sequencing (scRNA-seq) data from different species. The pipeline has been designed to be easy to use and integrates fully into the standard Seurat workflow. We applied OrthoIntegrate to scRNA-seq data from cardiac tissue of patients with heart failure with reduced ejection fraction (HFrEF) and to scRNA-seq data from mice after chronic infarction, a commonly used mouse model of HFrEF. We discovered shared and distinct regulatory pathways between human HFrEF patients and the corresponding mouse model. Overall, 54% of genes were commonly regulated, including major changes in cardiomyocyte energy metabolism. However, several regulatory pathways (e.g., angiogenesis) were specifically regulated in humans.
Conclusions: The demonstration of unique pathways occurring in humans indicates limitations on the comparability between mouse models and human HFrEF and shows that results from the mouse model should be validated carefully. OrthoIntegrate is publicly accessible (https://github.com/MarianoRuzJurado/OrthoIntegrate) and can be used to integrate other large datasets to provide a general comparison of models with patient data.
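The idea of uniquely assigning orthologs before merging expression data from two species can be sketched in a few lines. The example below is a toy Python illustration and not the OrthoIntegrate API (which targets the Seurat workflow in R): the function names, the ortholog table, and the counts are all invented for this sketch, and only one-to-one pairs are kept, which is a simplifying assumption about how ambiguity might be resolved.

```python
from collections import Counter


def one_to_one_orthologs(pairs):
    """Keep only (mouse_gene, human_gene) pairs in which each gene occurs once."""
    mouse_counts = Counter(m for m, _ in pairs)
    human_counts = Counter(h for _, h in pairs)
    return {m: h for m, h in pairs if mouse_counts[m] == 1 and human_counts[h] == 1}


def rename_to_human(mouse_cells, ortholog_map):
    """Relabel mouse genes with human symbols; drop genes without a unique ortholog."""
    return {
        cell: {ortholog_map[g]: c for g, c in genes.items() if g in ortholog_map}
        for cell, genes in mouse_cells.items()
    }


def merge_shared_genes(human_cells, mouse_cells):
    """Merge two cell -> {gene: count} tables, restricted to genes seen in both."""
    human_genes = set().union(*(g.keys() for g in human_cells.values()))
    mouse_genes = set().union(*(g.keys() for g in mouse_cells.values()))
    shared = sorted(human_genes & mouse_genes)
    merged = {}
    for species, cells in (("human", human_cells), ("mouse", mouse_cells)):
        for cell, genes in cells.items():
            merged[f"{species}:{cell}"] = [genes.get(g, 0) for g in shared]
    return shared, merged


# Toy ortholog table: two mouse genes map to the same human gene and are dropped.
pairs = [("Myh6", "MYH6"), ("Nppa", "NPPA"), ("Hbb-bs", "HBB"), ("Hbb-bt", "HBB")]
mapping = one_to_one_orthologs(pairs)

human_cells = {"h1": {"MYH6": 7, "NPPA": 1, "HBB": 4}}
mouse_cells = rename_to_human({"m1": {"Myh6": 5, "Nppa": 2, "Hbb-bs": 9}}, mapping)

shared, merged = merge_shared_genes(human_cells, mouse_cells)
print(shared)
print(merged)
```

Restricting the merged table to uniquely mapped genes trades coverage for comparability: every column then refers to the same gene in both species, at the cost of discarding gene families with many-to-one relationships.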
Citations: 0
Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data
IF 9.2 | Zone 2, Biology | Q1 MULTIDISCIPLINARY SCIENCES | Pub Date: 2024-04-04 | DOI: 10.1093/gigascience/giae010
Michael B Hall, Lachlan J M Coin
Background: Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequences at the point of sequencing, typically involving the use of resource-constrained devices. Existing benchmarks have largely focused on the use of standardized databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity.
Results: We benchmarked host removal pipelines on simulated and artificial real Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained improved specificity and sensitivity, compared to the standard databases for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was superior to most standard approaches, allowing them to be executed on a laptop device.
Conclusions: Customized pangenome databases provide the best balance of accuracy and computational efficiency when compared to standard databases for the task of human read removal and M. tuberculosis read classification from metagenomic samples. Such databases allow for execution on a laptop, without sacrificing accuracy, an especially important consideration in low-resource settings. We make all customized databases and pipelines freely available.
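Why a database built from diverse human genomes can remove more host reads than a single reference can be illustrated with a toy exact k-mer matcher. This is a deliberately simplified sketch and not Kraken's algorithm (Kraken 2 uses minimizers and lowest-common-ancestor taxonomic assignment); the sequences, the value of `k`, the 0.5 threshold, and all function names are invented for the example.

```python
def kmers(seq: str, k: int):
    """Yield all overlapping k-mers of a sequence (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_host_index(host_sequences, k: int):
    """Collect k-mers from every host haplotype supplied."""
    index = set()
    for seq in host_sequences:
        index.update(kmers(seq, k))
    return index


def host_fraction(read: str, index, k: int) -> float:
    """Fraction of the read's k-mers that are present in the host index."""
    total = hits = 0
    for km in kmers(read, k):
        total += 1
        hits += km in index
    return hits / total if total else 0.0


def deplete(reads, index, k: int, threshold: float = 0.5):
    """Return the reads whose host k-mer fraction is below the threshold."""
    return [r for r in reads if host_fraction(r, index, k) < threshold]


K = 5
reference = "ACGTACGGTCATTGCA"       # toy single human reference
alt_haplotype = "ACGTATTCCGAGTGCA"   # toy divergent human haplotype

human_variant_read = "TATTCCGAGT"    # drawn from the divergent haplotype
bacterial_read = "GGCCGGCCAATT"      # unrelated to either human sequence
reads = [human_variant_read, bacterial_read]

single_index = build_host_index([reference], K)
pangenome_index = build_host_index([reference, alt_haplotype], K)

kept_single = deplete(reads, single_index, K)
kept_pangenome = deplete(reads, pangenome_index, K)
print(kept_single)
print(kept_pangenome)
```

In this toy case the read from the divergent haplotype shares no k-mers with the single reference and therefore leaks through, while the two-haplotype index removes it and still keeps the bacterial read.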
Citations: 0
A multi-omics data analysis workflow packaged as a FAIR Digital Object
IF 9.2 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2024-01-10 DOI: 10.1093/gigascience/giad115
Anna Niehues, Casper de Visser, Fiona A Hagenbeek, Purva Kulkarni, René Pool, Naama Karu, Alida S D Kindt, Gurnoor Singh, Robert R J M Vermeiren, Dorret I Boomsma, Jenny van Dongen, Peter A C ’t Hoen, Alain J van Gool
Background Applying good data management and FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object. Findings We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub. Conclusions Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
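To make the packaging step more concrete, here is a minimal sketch of what the metadata for a workflow packaged as a Research Object Crate could look like, following the general layout of the RO-Crate 1.1 specification. The file name `main.nf`, the workflow name, the license URL, and the helper `build_workflow_crate` are illustrative assumptions; this is not the authors' actual crate.

```python
import json


def build_workflow_crate(workflow_file, name, license_id):
    """Return a minimal RO-Crate 1.1 metadata document for one workflow file."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: points at the root dataset
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: the crate itself
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "license": {"@id": license_id},
                "hasPart": [{"@id": workflow_file}],
                "mainEntity": {"@id": workflow_file},
            },
            {
                # The workflow file, typed as a computational workflow
                "@id": workflow_file,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
                "programmingLanguage": {"@id": "#nextflow"},
            },
            {
                "@id": "#nextflow",
                "@type": "ComputerLanguage",
                "name": "Nextflow",
            },
        ],
    }


# Hypothetical values, for illustration only
crate = build_workflow_crate(
    "main.nf",
    "Example multi-omics workflow",
    "https://spdx.org/licenses/MIT",
)
metadata_json = json.dumps(crate, indent=2)
print(metadata_json)
```

Writing `metadata_json` to a file named `ro-crate-metadata.json` next to the workflow file would give a directory that tools expecting an RO-Crate can inspect; a real crate would carry much richer metadata (authors, inputs, outputs, containers).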
Citations: 0
Disentangling river and swamp buffalo genetic diversity: initial insights from the 1000 Buffalo Genomes Project.
IF 3.5 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2024-01-02 DOI: 10.1093/gigascience/giae053
Paulene S Pineda, Ester B Flores, Lilian P Villamor, Connie Joyce M Parac, Mehar S Khatkar, Hien To Thu, Timothy P L Smith, Benjamin D Rosen, Paolo Ajmone-Marsan, Licia Colli, John L Williams, Wai Yee Low

More people in the world depend on water buffalo for their livelihoods than on any other domesticated animal, but its genetics has not yet been extensively explored. The 1000 Buffalo Genomes Project (1000BGP) provides genetic resources for global buffalo population study and tools to breed more sustainable and productive buffaloes. Here we report the most contiguous swamp buffalo genome assembly (PCC_UOA_SB_1v2) with substantial resolution of telomeric and centromeric repeats, ∼4-fold more contiguous than the existing reference river buffalo assembly and exceeding a recently published male swamp buffalo genome. This assembly was used along with the current reference to align 140 water buffalo short-read sequences and produce a public genetic resource with an average of ∼41 million single nucleotide polymorphisms per swamp and river buffalo genome. Comparison of the swamp and river buffalo sequences showed ∼1.5% genetic difference, with an estimated divergence time of 3.1 million years ago (95% CI, 2.6-4.9). The open science model employed in the 1000BGP provides a key genomic resource and tools for a species with global economic relevance.

Citations: 0
CoCoPyE: feature engineering for learning and prediction of genome quality indices.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2024-01-02 DOI: 10.1093/gigascience/giae079
Niklas Birth, Nicolina Leppich, Julia Schirmacher, Nina Andreae, Rasmus Steinkamp, Matthias Blanke, Peter Meinicke

Background: The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices relies solely on a relatively small number of universal single-copy genes. Recent tools try to extend the genomic coverage of estimates to increase accuracy.

Results: We developed CoCoPyE, a fast tool based on a novel 2-stage feature extraction and transformation scheme. First, it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools. While the CoCoPyE web server offers an easy way to try out the tool, the freely available Python implementation enables integration into existing genome reconstruction pipelines.

Conclusions: CoCoPyE provides a new approach to assess the quality of genome data. It complements and improves existing tools and may help researchers to better distinguish between low-quality draft and high-quality genome assemblies in metagenome sequencing projects.
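The classical marker-based estimate mentioned in the Background can be sketched in a few lines. This is a deliberately simplified illustration and not CoCoPyE's algorithm, which according to the abstract refines marker-based estimates with a machine learning stage. It assumes completeness is the fraction of expected single-copy markers found at least once, and contamination is the number of surplus marker copies divided by the number of expected markers. The marker names and the function `marker_based_quality` are hypothetical.

```python
from collections import Counter


def marker_based_quality(marker_hits, marker_set):
    """Estimate (completeness, contamination) from single-copy marker hits.

    marker_hits: iterable of marker IDs detected in the genome (repeats allowed)
    marker_set:  set of marker IDs expected exactly once in a complete genome
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")
    # Count only hits that belong to the expected marker set
    counts = Counter(hit for hit in marker_hits if hit in marker_set)
    n_markers = len(marker_set)
    found = sum(1 for m in marker_set if counts[m] >= 1)
    surplus = sum(counts[m] - 1 for m in marker_set if counts[m] > 1)
    return found / n_markers, surplus / n_markers


# Hypothetical example: 4 expected markers, one missing, one duplicated
expected_markers = {"m1", "m2", "m3", "m4"}
hits = ["m1", "m2", "m2", "m3", "x9"]
completeness, contamination = marker_based_quality(hits, expected_markers)
print(completeness, contamination)  # prints 0.75 0.25
```

Because the estimate depends on a small marker set, a single missed or duplicated marker shifts the result by a large step, which is one motivation for extending the genomic coverage of such estimates.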

Citations: 0
Data processing solutions to render metabolomics more quantitative: case studies in food and clinical metabolomics using Metabox 2.0.
IF 3.5 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2024-01-02 DOI: 10.1093/gigascience/giae005
Kwanjeera Wanichthanarak, Ammarin In-On, Sili Fan, Oliver Fiehn, Arporn Wangwiwatsin, Sakda Khoomrung

In classic semiquantitative metabolomics, metabolite intensities are affected by biological factors and other unwanted variations. A systematic evaluation of the data processing methods is crucial to identify adequate processing procedures for a given experimental setup. Current comparative studies mostly focus on peak area data rather than on absolute concentrations. In this study, we evaluated data processing methods to produce outputs that were most similar to the corresponding absolute quantified data. We examined the data distribution characteristics, fold difference patterns between 2 metabolites, and sample variance. We used 2 metabolomic datasets from a retail milk study and a lupus nephritis cohort as test cases. When studying the impact of data normalization, transformation, scaling, and combinations of these methods, we found that the cross-contribution compensating multiple standard normalization (ccmn) method, followed by square root data transformation, was most appropriate for a well-controlled study such as the milk study dataset. Regarding the lupus nephritis cohort study, only ccmn normalization could slightly improve the data quality of the noisy cohort. Since the assessment accounted for the resemblance between processed data and the corresponding absolute quantified data, our results provide a helpful guideline for processing metabolomic datasets within a similar context (food and clinical metabolomics). Finally, we introduce Metabox 2.0, which enables thorough analysis of metabolomic data, including data processing, biomarker analysis, integrative analysis, and data interpretation. It was successfully used to process and analyze the data in this study. An online web version is available at http://metsysbio.com/metabox.
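A small numerical sketch may help show why the fold difference between 2 metabolites is a useful check when comparing processing methods. The code below applies a square root transformation to hypothetical peak areas and compares the fold difference before and after. It illustrates only the transformation step; it does not implement ccmn normalization or the study's evaluation procedure, and all values and function names are assumptions.

```python
import math


def sqrt_transform(intensities):
    """Apply a square root transformation to non-negative intensities."""
    return [math.sqrt(x) for x in intensities]


def mean_fold_difference(metabolite_a, metabolite_b):
    """Fold difference: mean intensity of A divided by mean intensity of B."""
    mean_a = sum(metabolite_a) / len(metabolite_a)
    mean_b = sum(metabolite_b) / len(metabolite_b)
    return mean_a / mean_b


# Hypothetical peak areas for two metabolites across two samples
peak_area_a = [400.0, 400.0]
peak_area_b = [100.0, 100.0]

raw_fold = mean_fold_difference(peak_area_a, peak_area_b)
sqrt_fold = mean_fold_difference(
    sqrt_transform(peak_area_a), sqrt_transform(peak_area_b)
)
print(raw_fold, sqrt_fold)  # prints 4.0 2.0
```

The transformation compresses the fold difference from 4 to 2, so whether a processed dataset resembles the absolute quantified data depends on which processing steps are applied.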

Citations: 0