
GigaScience: Latest Articles

IPEV: identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning
IF 9.2 Region 2 (Biology) Q1 Medicine Pub Date: 2024-04-22 DOI: 10.1093/gigascience/giae018
Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, Huaiqiu Zhu
Background: The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial to understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.
Findings: We present IPEV, a novel method to distinguish prokaryotic and eukaryotic viruses in viromes, using a 2-dimensional convolutional neural network that combines trinucleotide pair relative distance and frequency. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in accuracy on marine and gut virome samples based on annotations by sequence alignments. IPEV reduces runtime by up to 1,225-fold compared to existing methods under the same computing configuration. We also utilized IPEV to analyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.
Conclusions: IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.
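The abstract does not spell out how "trinucleotide pair relative distance and frequency" are computed, so the following is only a rough sketch under assumed definitions, not IPEV's actual encoding. It assumes an ordered pair is formed by the trinucleotides starting at positions i and i+d for gaps d up to a hypothetical `max_gap` parameter, and it produces two 64x64 matrices that could be stacked as input channels for a 2-dimensional convolutional network. The function name and both definitions are illustrative.

```python
from itertools import product

import numpy as np

# All 64 trinucleotides in a fixed order; the index defines matrix rows/columns.
TRINUCS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRINUCS)}


def trinucleotide_pair_features(seq, max_gap=10):
    """Return (frequency, mean_distance) 64x64 matrices for one sequence.

    Illustrative definition (an assumption, not IPEV's): for every position i
    and every gap d in 1..max_gap, the trinucleotides starting at i and i+d
    form an ordered pair. `frequency` is the share of all pairs that fall in
    each cell; `mean_distance` is the average gap d for that cell, scaled to
    (0, 1] by max_gap.
    """
    seq = seq.upper()
    # Map each position to a trinucleotide index, or -1 for ambiguous bases.
    codes = [INDEX.get(seq[i:i + 3], -1) for i in range(len(seq) - 2)]
    counts = np.zeros((64, 64), dtype=np.float64)
    gap_sums = np.zeros((64, 64), dtype=np.float64)
    for i, a in enumerate(codes):
        if a < 0:
            continue
        for d in range(1, max_gap + 1):
            if i + d >= len(codes):
                break
            b = codes[i + d]
            if b < 0:
                continue
            counts[a, b] += 1
            gap_sums[a, b] += d
    total = counts.sum()
    frequency = counts / total if total else counts
    # Divide only where a pair was observed; unobserved cells stay at zero.
    mean_distance = np.divide(gap_sums, counts * max_gap,
                              out=np.zeros_like(gap_sums), where=counts > 0)
    return frequency, mean_distance
```

Calling `np.stack(trinucleotide_pair_features(seq))` would give a 2x64x64 array per sequence. A fixed-size encoding of this kind is independent of fragment length, which is one plausible reason to prefer it for virome fragments of varying size.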
Citations: 0
Large-scale genomic survey with deep learning-based method reveals strain-level phage specificity determinants
IF 9.2 Region 2 (Biology) Q1 Medicine Pub Date: 2024-04-22 DOI: 10.1093/gigascience/giae017
Yiyan Yang, Keith Dufault-Thompson, Wei Yan, Tian Cai, Lei Xie, Xiaofang Jiang
Background: Phage therapy, reemerging as a promising approach to counter antimicrobial-resistant infections, relies on a comprehensive understanding of the specificity of individual phages. Yet the significant diversity within phage populations presents a considerable challenge. Currently, there is a notable lack of tools designed for large-scale characterization of phage receptor-binding proteins, which are crucial in determining the phage host range.
Results: In this study, we present SpikeHunter, a deep learning method based on the ESM-2 protein language model. With SpikeHunter, we identified 231,965 diverse phage-encoded tailspike proteins, a crucial determinant of phage specificity that targets bacterial polysaccharide receptors, across 787,566 bacterial genomes from 5 virulent, antibiotic-resistant pathogens. Notably, 86.60% (143,200) of these proteins exhibited strong associations with specific bacterial polysaccharides. We discovered that phages with identical tailspike proteins can infect different bacterial species with similar polysaccharide receptors, underscoring the pivotal role of tailspike proteins in determining host range. The specificity is mainly attributed to the protein’s C-terminal domain, which strictly correlates with host specificity during domain swapping in tailspike proteins. Importantly, our dataset-driven predictions of phage–host specificity closely match the phage–host pairs observed in real-world phage therapy cases we studied.
Conclusions: Our research provides a rich resource, including both the method and a database derived from a large-scale genomics survey. This substantially enhances understanding of phage specificity determinants at the strain level and offers a valuable framework for guiding phage selection in therapeutic applications.
Citations: 0
An effective strategy for assembling the sex-limited chromosome
IF 9.2 Region 2 (Biology) Q1 Medicine Pub Date: 2024-04-16 DOI: 10.1093/gigascience/giae015
Xiao-Bo Wang, Hong-Wei Lu, Qing-You Liu, A-Lun Li, Hong-Ling Zhou, Yong Zhang, Tian-Qi Zhu, Jue Ruan
Background: Most currently available reference genomes lack the sequence map of sex-limited (such as Y and W) chromosomes, which results in incomplete assemblies that hinder further research on sex chromosomes. Recent advancements in long-read sequencing and population sequencing have provided the opportunity to assemble sex-limited chromosomes without the traditional complicated experimental efforts.
Findings: We introduce the first computational method, Sorting long Reads of Y or other sex-limited chromosome (SRY), which achieves improved assembly results compared to flow sorting. Specifically, SRY outperforms flow sorting in the heterochromatic region and demonstrates comparable performance in other regions. Furthermore, SRY enhances the capabilities of the hybrid assembly software, resulting in improved continuity and accuracy.
Conclusions: Our method enables true complete genome assembly and facilitates downstream research of sex-limited chromosomes.
Citations: 0
Enhanced bovine genome annotation through integration of transcriptomics and epi-transcriptomics datasets facilitates genomic biology
IF 9.2 Region 2 (Biology) Q1 Medicine Pub Date: 2024-04-16 DOI: 10.1093/gigascience/giae019
Hamid Beiki, Brenda M Murdoch, Carissa A Park, Chandlar Kern, Denise Kontechy, Gabrielle Becker, Gonzalo Rincon, Honglin Jiang, Huaijun Zhou, Jacob Thorne, James E Koltes, Jennifer J Michal, Kimberly Davenport, Monique Rijnkels, Pablo J Ross, Rui Hu, Sarah Corum, Stephanie McKay, Timothy P L Smith, Wansheng Liu, Wenzhi Ma, Xiaohui Zhang, Xiaoqing Xu, Xuelei Han, Zhihua Jiang, Zhi-Liang Hu, James M Reecy
Background: The accurate identification of the functional elements in the bovine genome is a fundamental requirement for high-quality analysis of data informing both genome biology and genomic selection. Functional annotation of the bovine genome was performed to identify a more complete catalog of transcript isoforms across bovine tissues.
Results: A total of 160,820 unique transcripts (50% protein coding) representing 34,882 unique genes (60% protein coding) were identified across tissues. Among them, 118,563 transcripts (73% of the total) were structurally validated by independent datasets (PacBio isoform sequencing data, Oxford Nanopore Technologies sequencing data, de novo assembled transcripts from RNA sequencing data) and comparison with Ensembl and NCBI gene sets. In addition, all transcripts were supported by extensive data from different technologies such as whole transcriptome termini site sequencing, RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression, chromatin immunoprecipitation sequencing, and assay for transposase-accessible chromatin using sequencing. A large proportion of identified transcripts (69%) were unannotated, of which 86% were produced by annotated genes and 14% by unannotated genes. A median of two 5′ untranslated regions were expressed per gene. Around 50% of protein-coding genes in each tissue were bifunctional and transcribed both coding and noncoding isoforms. Furthermore, we identified 3,744 genes that functioned as noncoding genes in fetal tissues but as protein-coding genes in adult tissues. Our new bovine genome annotation extended more than 11,000 annotated gene borders compared to Ensembl or NCBI annotations. The resulting bovine transcriptome was integrated with publicly available quantitative trait loci data to study tissue–tissue interconnection involved in different traits and construct the first bovine trait similarity network.
Conclusions: These validated results show significant improvement over current bovine genome annotations.
Citations: 0
Korea4K: whole genome sequences of 4,157 Koreans with 107 phenotypes derived from extensive health check-ups
IF 9.2 Region 2 (Biology) Q1 Medicine Pub Date: 2024-04-16 DOI: 10.1093/gigascience/giae014
Sungwon Jeon, Hansol Choi, Yeonsu Jeon, Whan-Hyuk Choi, Hyunjoo Choi, Kyungwhan An, Hyojung Ryu, Jihun Bhak, Hyeonjae Lee, Yoonsung Kwon, Sukyeon Ha, Yeo Jin Kim, Asta Blazyte, Changjae Kim, Yeonkyung Kim, Younghui Kang, Yeong Ju Woo, Chanyoung Lee, Jeongwoo Seo, Changhan Yoon, Dan Bolser, Orsolya Biro, Eun-Seok Shin, Byung Chul Kim, Seon-Young Kim, Ji-Hwan Park, Jongbum Jeon, Dooyoung Jung, Semin Lee, Jong Bhak
Background: Phenome-wide association studies (PheWASs) have been conducted on Asian populations, including Koreans, but many were based on chip or exome genotyping data. Such studies have limitations regarding whole genome–wide association analysis, making it crucial to have genome-to-phenome association information with the largest possible whole genome and matched phenome data to conduct further population-genome studies and develop health care services based on population genomics.
Results: Here, we present 4,157 whole genome sequences (Korea4K) coupled with 107 health check-up parameters as the largest genomic resource of the Korean Genome Project. It encompasses most of the variants with allele frequency >0.001 in Koreans, indicating that it sufficiently covers most of the common and rare genetic variants with commonly measured phenotypes for Koreans. Korea4K provides 45,537,252 variants, half of which were not present in Korea1K (1,094 samples). We also identified 1,356 new genotype–phenotype associations that were not found with the Korea1K dataset. Phenomics analyses further revealed 24 significant genetic correlations, 14 pleiotropic associations, and 127 causal relationships based on Mendelian randomization among 37 traits. In addition, the Korea4K imputation reference panel, the largest Korean variants reference to date, showed imputation performance superior to that of Korea1K across all allele frequency categories.
Conclusions: Collectively, Korea4K provides not only the largest Korean genome data but also corresponding health check-up parameters and novel genome–phenome associations. The large-scale pathological whole genome–wide omics data will become a powerful set for genome–phenome level association studies to discover causal markers for the prediction and diagnosis of health conditions in future studies.
Citations: 0
Improved integration of single-cell transcriptome data demonstrates common and unique signatures of heart failure in mice and humans
IF 9.2 Region 2 (Biology) Q1 Medicine Pub Date: 2024-04-04 DOI: 10.1093/gigascience/giae011
Mariano Ruz Jurado, Lukas S Tombor, Mani Arsalan, Tomas Holubec, Fabian Emrich, Thomas Walther, Wesley Abplanalp, Ariane Fischer, Andreas M Zeiher, Marcel H Schulz, Stefanie Dimmeler, David John
Background: Cardiovascular research heavily relies on mouse (Mus musculus) models to study disease mechanisms and to test novel biomarkers and medications. Yet, applying these results to patients remains a major challenge and often results in ineffective drugs. Therefore, it is an open challenge of translational science to develop models with high similarities and predictive value. This requires a comparison of disease models in mice with diseased tissue derived from humans.
Results: To compare the transcriptional signatures at single-cell resolution, we implemented an integration pipeline called OrthoIntegrate, which uniquely assigns orthologs and thereby merges single-cell RNA sequencing (scRNA-seq) data from different species. The pipeline has been designed to be easy to use and is fully integrable in the standard Seurat workflow. We applied OrthoIntegrate to scRNA-seq data from cardiac tissue of heart failure patients with reduced ejection fraction (HFrEF) and to scRNA-seq data from mice after chronic infarction, a commonly used mouse model to mimic HFrEF. We discovered shared and distinct regulatory pathways between human HFrEF patients and the corresponding mouse model. Overall, 54% of genes were commonly regulated, including major changes in cardiomyocyte energy metabolism. However, several regulatory pathways (e.g., angiogenesis) were specifically regulated in humans.
Conclusions: The demonstration of unique pathways occurring in humans indicates limitations on the comparability between mouse models and human HFrEF and shows that results from the mouse model should be validated carefully. OrthoIntegrate is publicly accessible (https://github.com/MarianoRuzJurado/OrthoIntegrate) and can be used to integrate other large datasets to provide a general comparison of models with patient data.
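To make the idea of "uniquely assigning orthologs" before merging concrete, here is a minimal sketch in plain Python. It is not OrthoIntegrate's API (the pipeline itself lives in the Seurat workflow); the function names, the dictionary-based count tables, and the rule of discarding any gene that appears in more than one ortholog pair are all assumptions made for illustration.

```python
from collections import Counter


def one_to_one_orthologs(pairs):
    """Keep only (mouse, human) pairs where each gene appears exactly once."""
    mouse_seen = Counter(m for m, _ in pairs)
    human_seen = Counter(h for _, h in pairs)
    return {m: h for m, h in pairs if mouse_seen[m] == 1 and human_seen[h] == 1}


def rename_to_human(mouse_counts, orthologs):
    """Re-key a mouse gene -> per-cell-counts table with human gene symbols."""
    return {orthologs[g]: c for g, c in mouse_counts.items() if g in orthologs}


def shared_genes(human_counts, mouse_as_human):
    """Genes present in both species after renaming, in sorted order."""
    return sorted(set(human_counts) & set(mouse_as_human))
```

After this step both tables share one gene namespace, so they can be restricted to `shared_genes(...)` and passed to an ordinary integration routine. Dropping ambiguous pairs is the simplest possible policy; a real pipeline may resolve them instead of discarding them.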
引用次数: 0
Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data
IF 9.2 Zone 2 Biology Q1 Medicine Pub Date: 2024-04-04 DOI: 10.1093/gigascience/giae010
Michael B Hall, Lachlan J M Coin
Background: Culture-free real-time sequencing of clinical metagenomic samples promises both rapid pathogen detection and antimicrobial resistance profiling. However, this approach introduces the risk of patient DNA leakage. To mitigate this risk, we need near-comprehensive removal of human DNA sequences at the point of sequencing, typically involving the use of resource-constrained devices. Existing benchmarks have largely focused on the use of standardized databases and largely ignored the computational requirements of depletion pipelines as well as the impact of human genome diversity.

Results: We benchmarked host removal pipelines on simulated and artificial real Illumina and Nanopore metagenomic samples. We found that construction of a custom kraken database containing diverse human genomes results in the best balance of accuracy and computational resource usage. In addition, we benchmarked pipelines using kraken and minimap2 for taxonomic classification of Mycobacterium reads using standard and custom databases. With a database representative of the Mycobacterium genus, both tools obtained improved specificity and sensitivity, compared to the standard databases for classification of Mycobacterium tuberculosis. Computational efficiency of these custom databases was superior to most standard approaches, allowing them to be executed on a laptop device.

Conclusions: Customized pangenome databases provide the best balance of accuracy and computational efficiency when compared to standard databases for the task of human read removal and M. tuberculosis read classification from metagenomic samples. Such databases allow for execution on a laptop, without sacrificing accuracy, an especially important consideration in low-resource settings. We make all customized databases and pipelines freely available.
A multi-omics data analysis workflow packaged as a FAIR Digital Object
IF 9.2 Zone 2 Biology Q1 Medicine Pub Date: 2024-01-10 DOI: 10.1093/gigascience/giad115
Anna Niehues, Casper de Visser, Fiona A Hagenbeek, Purva Kulkarni, René Pool, Naama Karu, Alida S D Kindt, Gurnoor Singh, Robert R J M Vermeiren, Dorret I Boomsma, Jenny van Dongen, Peter A C ’t Hoen, Alain J van Gool
Background: Applying good data management and FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object.

Findings: We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub.

Conclusions: Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
Habitat suitability maps for Australian flora and fauna under CMIP6 climate scenarios.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2024-01-02 DOI: 10.1093/gigascience/giae002
Carla L Archibald, David M Summers, Erin M Graham, Brett A Bryan

Background: Spatial information about the location and suitability of areas for native plant and animal species under different climate futures is an important input to land use and conservation planning and management. Australia, renowned for its abundant species diversity and endemism, often relies on modeled data to assess species distributions due to the country's vast size and the challenges associated with conducting on-ground surveys on such a large scale. The objective of this article is to develop habitat suitability maps for Australian flora and fauna under different climate futures.

Results: Using MaxEnt, we produced Australia-wide habitat suitability maps under RCP2.6-SSP1, RCP4.5-SSP2, RCP7.0-SSP3, and RCP8.5-SSP5 climate futures for 1,382 terrestrial vertebrates and 9,251 vascular plants at 5 km² resolution for open access. This represents 60% of all Australian mammal species, 77% of amphibian species, 50% of reptile species, 71% of bird species, and 44% of vascular plant species. We also include tabular data, which include summaries of total quality-weighted habitat area of species under different climate scenarios and time periods.

Conclusions: The spatial data supplied can help identify important and sensitive locations for species under various climate futures. Additionally, the supplied tabular data can provide insights into the impacts of climate change on biodiversity in Australia. These habitat suitability maps can be used as input data for landscape and conservation planning or species management, particularly under different climate change scenarios in Australia.
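
The "total quality-weighted habitat area" mentioned in the Results can be illustrated with a minimal sketch. This is not the authors' code: it assumes that the quantity is the sum over grid cells of the suitability score (between 0 and 1) multiplied by the cell area, and the function name, the `None` no-data convention, and the `cell_area_km2` parameter are all hypothetical choices made for illustration.

```python
from typing import Iterable, Optional


def quality_weighted_area(
    suitability: Iterable[Optional[float]],
    cell_area_km2: float,
) -> float:
    """Sum suitability * cell area over all cells with data.

    Assumption: suitability scores lie in [0, 1] and no-data cells
    are represented as None. Both conventions are illustrative only.
    """
    total = 0.0
    for score in suitability:
        if score is None:
            # Skip cells outside the modeled extent (no data).
            continue
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"suitability out of range: {score}")
        total += score * cell_area_km2
    return total


# Four cells, one without data, with a hypothetical cell area of 10 km².
cells = [1.0, 0.5, None, 0.0]
print(quality_weighted_area(cells, cell_area_km2=10.0))
```

Comparing this value for the same species across scenarios and time periods gives one possible reading of how the tabular summaries could be used; the actual cell area and weighting scheme should be taken from the dataset's own documentation.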

Data processing solutions to render metabolomics more quantitative: case studies in food and clinical metabolomics using Metabox 2.0.
IF 11.8 Zone 2 Biology Q1 MULTIDISCIPLINARY SCIENCES Pub Date: 2024-01-02 DOI: 10.1093/gigascience/giae005
Kwanjeera Wanichthanarak, Ammarin In-On, Sili Fan, Oliver Fiehn, Arporn Wangwiwatsin, Sakda Khoomrung

In classic semiquantitative metabolomics, metabolite intensities are affected by biological factors and other unwanted variations. A systematic evaluation of the data processing methods is crucial to identify adequate processing procedures for a given experimental setup. Current comparative studies are mostly focused on peak area data but not on absolute concentrations. In this study, we evaluated data processing methods to produce outputs that were most similar to the corresponding absolute quantified data. We examined the data distribution characteristics, fold difference patterns between 2 metabolites, and sample variance. We used 2 metabolomic datasets from a retail milk study and a lupus nephritis cohort as test cases. When studying the impact of data normalization, transformation, scaling, and combinations of these methods, we found that the cross-contribution compensating multiple standard normalization (ccmn) method, followed by square root data transformation, was most appropriate for a well-controlled study such as the milk study dataset. Regarding the lupus nephritis cohort study, only ccmn normalization could slightly improve the data quality of the noisy cohort. Since the assessment accounted for the resemblance between processed data and the corresponding absolute quantified data, our results provide a helpful guideline for processing metabolomic datasets within a similar context (food and clinical metabolomics). Finally, we introduce Metabox 2.0, which enables thorough analysis of metabolomic data, including data processing, biomarker analysis, integrative analysis, and data interpretation. It was successfully used to process and analyze the data in this study. An online web version is available at http://metsysbio.com/metabox.
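
To make the square root transformation step concrete, the following minimal sketch shows how it changes the fold difference between 2 metabolites. It is an illustration only and not Metabox 2.0 code: the ccmn normalization itself is not implemented here, the input values are made up, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def sqrt_transform(intensities: Sequence[float]) -> List[float]:
    """Apply a square-root transformation to non-negative intensities.

    Assumption: the values have already been normalized by some
    upstream method (the ccmn step is NOT reproduced here).
    """
    transformed = []
    for value in intensities:
        if value < 0:
            raise ValueError(f"negative intensity: {value}")
        transformed.append(math.sqrt(value))
    return transformed


def fold_difference(a: float, b: float) -> float:
    """Ratio of metabolite a to metabolite b within one sample."""
    return a / b


# Hypothetical normalized intensities of 2 metabolites in one sample.
raw = [16.0, 4.0]
processed = sqrt_transform(raw)

print(fold_difference(raw[0], raw[1]))              # fold difference before transformation
print(fold_difference(processed[0], processed[1]))  # fold difference after transformation
```

Because the square root compresses large values more than small ones, the fold difference between the two metabolites shrinks after transformation, which is one reason the abstract compares fold difference patterns against absolute quantified data when judging processing methods.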
