The killer-cell immunoglobulin-like receptor (KIR) gene complex, a highly polymorphic region of the human genome that encodes proteins involved in immune responses, poses strong challenges in genotyping owing to its remarkable genetic diversity and structural intricacy. Accurate analysis of KIR alleles, including their structural variations, is crucial for understanding their roles in various immune responses. Leveraging the high-quality genome assemblies from the Human Pangenome Reference Consortium (HPRC), we present a novel bioinformatic tool, the structural KIR annoTator (SKIRT), to investigate gene diversity and facilitate precise KIR allele analysis. In 47 HPRC-phased assemblies, SKIRT identifies a recurrent novel KIR2DS4/3DL1 fusion gene in the paternal haplotype of HG02630 and maternal haplotype of NA19240. Additionally, SKIRT accurately identifies eight structural variants and 15 novel nonsynonymous alleles, all of which are independently validated using short-read data or quantitative polymerase chain reaction. Our study has discovered a total of 570 novel alleles, among which eight haplotypes harbor at least one KIR gene duplication, six haplotypes have lost at least one framework gene, and 75 out of 94 haplotypes (79.8%) carry at least five novel alleles, thus confirming KIR genetic diversity. These findings are pivotal in providing insights into KIR gene diversity and serve as a solid foundation for understanding the functional consequences of KIR structural variations. High-resolution genome assemblies offer unprecedented opportunities to explore polymorphic regions that are challenging to investigate using short-read sequencing methods. The SKIRT pipeline emerges as a highly efficient tool, enabling the comprehensive detection of the complete spectrum of KIR alleles within human genome assemblies.
{"title":"Genetic complexity of killer-cell immunoglobulin-like receptor genes in human pangenome assemblies","authors":"Tsung-Kai Hung, Wan-Chi Liu, Sheng-Kai Lai, Hui-Wen Chuang, Yi-Che Lee, Hong-Ye Lin, Chia-Lang Hsu, Chien-Yu Chen, Ya-Chien Yang, Jacob Shujui Hsu, Pei-Lung Chen","doi":"10.1101/gr.278358.123","DOIUrl":"https://doi.org/10.1101/gr.278358.123","url":null,"abstract":"The killer-cell immunoglobulin-like receptor (KIR) gene complex, a highly polymorphic region of the human genome that encodes proteins involved in immune responses, poses strong challenges in genotyping owing to its remarkable genetic diversity and structural intricacy. Accurate analysis of KIR alleles, including their structural variations, is crucial for understanding their roles in various immune responses. Leveraging the high-quality genome assemblies from the Human Pangenome Reference Consortium (HPRC), we present a novel bioinformatic tool, the structural KIR annoTator (SKIRT), to investigate gene diversity and facilitate precise KIR allele analysis. In 47 HPRC-phased assemblies, SKIRT identifies a recurrent novel <em>KIR2DS4/3DL1</em> fusion gene in the paternal haplotype of HG02630 and maternal haplotype of NA19240. Additionally, SKIRT accurately identifies eight structural variants and 15 novel nonsynonymous alleles, all of which are independently validated using short-read data or quantitative polymerase chain reaction. Our study has discovered a total of 570 novel alleles, among which eight haplotypes harbor at least one KIR gene duplication, six haplotypes have lost at least one framework gene, and 75 out of 94 haplotypes (79.8%) carry at least five novel alleles, thus confirming KIR genetic diversity. These findings are pivotal in providing insights into KIR gene diversity and serve as a solid foundation for understanding the functional consequences of KIR structural variations. 
High-resolution genome assemblies offer unprecedented opportunities to explore polymorphic regions that are challenging to investigate using short-read sequencing methods. The SKIRT pipeline emerges as a highly efficient tool, enabling the comprehensive detection of the complete spectrum of KIR alleles within human genome assemblies.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"68 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meir Goldenberg, Loay Mualem, Amit Shahar, Sagi Snir, Adi Akavia
DNA methylation data plays a crucial role in estimating chronological age in mammals, offering real-time insights into an individual’s aging process. The Epigenetic Pacemaker (EPM) model allows inference of the biological age as deviations from the population trend. Given the sensitivity of this data, it is essential to safeguard both inputs and outputs of the EPM model. A recent study introduced a privacy-preserving approach for EPM computation based on Fully Homomorphic Encryption (FHE). However, that method had limitations, including high communication complexity and impracticality for large datasets. Our work presents a new privacy-preserving protocol for EPM computation, analytically improving both privacy and complexity. Notably, we employ a single server for the secure computation phase while ensuring privacy even in the event of server corruption (compared to requiring two non-colluding servers). Using techniques from symbolic algebra and number theory, the new protocol eliminates the need for communication during secure computation, significantly improves asymptotic runtime, and offers better compatibility with parallel computing for further time complexity reduction. We have implemented our protocol, demonstrating its ability to produce results similar to the standard (insecure) EPM model with substantial performance improvement compared to previous methods. These findings hold promise for enhancing data security in medical applications where personal privacy is paramount. The generality of both the new approach and the EPM suggests that this protocol may be useful in other applications employing similar expectation maximization techniques.
{"title":"Privacy-preserving biological age prediction over federated human methylation data using fully homomorphic encryption","authors":"Meir Goldenberg, Loay Mualem, Amit Shahar, Sagi Snir, Adi Akavia","doi":"10.1101/gr.279071.124","DOIUrl":"https://doi.org/10.1101/gr.279071.124","url":null,"abstract":"DNA methylation data plays a crucial role in estimating chronological age in mammals, offering real-time insights into an individual’s aging process. The Epigenetic Pacemaker (EPM) model allows inference of the biological age as deviations from the population trend. Given the sensitivity of this data, it is essential to safeguard both inputs and outputs of the EPM model. A recent study introduced a privacy-preserving approach for EPM computation based on Fully Homomorphic Encryption (FHE). However, that method had limitations, including high communication complexity and impracticality for large datasets. Our work presents a new privacy-preserving protocol for EPM computation, analytically improving both privacy and complexity. Notably, we employ a single server for the secure computation phase while ensuring privacy even in the event of server corruption (compared to requiring two non-colluding servers). Using techniques from symbolic algebra and number theory, the new protocol eliminates the need for communication during secure computation, significantly improves asymptotic runtime, and offers better compatibility with parallel computing for further time complexity reduction. We have implemented our protocol, demonstrating its ability to produce results similar to the standard (insecure) EPM model with substantial performance improvement compared to previous methods. These findings hold promise for enhancing data security in medical applications where personal privacy is paramount. 
The generality of both the new approach and the EPM suggests that this protocol may be useful in other applications employing similar expectation maximization techniques.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"7 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cameron Y Park, Shouvik Mani, Nicolas Beltran-Velez, Katie Maurer, Teddy Huang, Shuqiang Li, Satyen Gohil, Kenneth J Livak, David A Knowles, Catherine J Wu, Elham Azizi
Characterizing cell-cell communication and tracking its variability over time are crucial for understanding the coordination of biological processes mediating normal development, disease progression, and responses to perturbations such as therapies. Existing tools fail to capture time-dependent intercellular interactions, and primarily rely on existing databases compiled from limited contexts. We introduce DIISCO, a Bayesian framework designed to characterize the temporal dynamics of cellular interactions using single-cell RNA sequencing data from multiple time points. Our method utilizes structured Gaussian process regression to unveil time-resolved interactions among diverse cell types according to their coevolution and incorporates prior knowledge of receptor-ligand complexes. We show the interpretability of DIISCO in simulated data and new data collected from T cells co-cultured with lymphoma cells, demonstrating its potential to uncover dynamic cell-cell crosstalk.
{"title":"A Bayesian framework for inferring dynamic intercellular interactions from time-series single-cell data","authors":"Cameron Y Park, Shouvik Mani, Nicolas Beltran-Velez, Katie Maurer, Teddy Huang, Shuqiang Li, Satyen Gohil, Kenneth J Livak, David A Knowles, Catherine J Wu, Elham Azizi","doi":"10.1101/gr.279126.124","DOIUrl":"https://doi.org/10.1101/gr.279126.124","url":null,"abstract":"Characterizing cell-cell communication and tracking its variability over time are crucial for understanding the coordination of biological processes mediating normal development, disease progression, and responses to perturbations such as therapies. Existing tools fail to capture time-dependent intercellular interactions, and primarily rely on existing databases compiled from limited contexts. We introduce DIISCO, a Bayesian framework designed to characterize the temporal dynamics of cellular interactions using single-cell RNA sequencing data from multiple time points. Our method utilizes structured Gaussian process regression to unveil time-resolved interactions among diverse cell types according to their coevolution and incorporates prior knowledge of receptor-ligand complexes. We show the interpretability of DIISCO in simulated data and new data collected from T cells co-cultured with lymphoma cells, demonstrating its potential to uncover dynamic cell-cell crosstalk.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"8 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues, or by applying matrix transformation techniques such as the discrete cosine transformation to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found. For example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not multi-domain proteins. Here we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the discrete cosine transformation to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, utilizes predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut for short) in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We showed that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark showed that DCTdomain was able to detect distant homologs by leveraging the structural information in the contextual embeddings.
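As a rough illustration of how a discrete cosine transformation can turn a variable-length matrix of residue embeddings into a fixed-size vector, the sketch below applies a DCT-II along both axes and keeps only the low-frequency block. It is not the DCTdomain implementation: the function names, the number of retained coefficients, and the toy embeddings are all assumptions made for this example, and neither ESM-2 nor RecCut is involved.

```python
import math


def dct_1d(signal):
    """Unnormalized DCT-II of a 1D sequence."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_fingerprint(embeddings, keep_rows=3, keep_cols=2):
    """Compress an (L x d) embedding matrix into keep_rows * keep_cols numbers.

    keep_rows and keep_cols are hypothetical parameters chosen for illustration.
    """
    length, dim = len(embeddings), len(embeddings[0])
    if length < keep_rows or dim < keep_cols:
        raise ValueError("matrix is smaller than the requested coefficient block")
    # DCT along the residue axis, one embedding dimension at a time.
    per_dim = [dct_1d([row[j] for row in embeddings]) for j in range(dim)]
    # Regroup so that by_freq[k][j] is residue-frequency k of dimension j.
    by_freq = [[per_dim[j][k] for j in range(dim)] for k in range(length)]
    # DCT along the embedding axis.
    coeffs = [dct_1d(row) for row in by_freq]
    # Keep only the low-frequency corner, flattened to a vector.
    return [coeffs[k][j] for k in range(keep_rows) for j in range(keep_cols)]


# Toy "domains" of different lengths (5 and 8 residues, 4-dimensional embeddings).
short_domain = [[math.sin(0.3 * i + j) for j in range(4)] for i in range(5)]
long_domain = [[math.sin(0.3 * i + j) for j in range(4)] for i in range(8)]
fp_short = dct_fingerprint(short_domain)
fp_long = dct_fingerprint(long_domain)
```

Both fingerprints have the same length regardless of domain length, which is what makes them directly comparable with a simple vector distance.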
{"title":"Protein domain embeddings for fast and accurate similarity search","authors":"Benjamin Giovanni Iovino, Haixu Tang, Yuzhen Ye","doi":"10.1101/gr.279127.124","DOIUrl":"https://doi.org/10.1101/gr.279127.124","url":null,"abstract":"Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins, however limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins of single domains but not multi-domain proteins. Here we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the discrete cosine transformation to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, utilizes predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in quadratic time to the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We showed such domain-level contextual vectors (termed as DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. 
In addition, tests on a database search benchmark showed that DCTdomain was able to detect distant homologs by leveraging the structural information in the contextual embeddings.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"4 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stefan Schrod, Niklas Lück, Robert Lohmayer, Stefan Solbrig, Dennis Völkl, Tina Wipfler, Katherine H. Shutta, Marouen Ben Guebila, Andreas Schäfer, Tim Beißbarth, Helena U. Zacharias, Peter Oefner, John Quackenbush, Michael Altenbuchinger
Advances in omics technologies have allowed spatially resolved molecular profiling of single cells, providing a window not only into the diversity and distribution of cell types within a tissue but also into the effects of interactions between cells in shaping the transcriptional landscape. Cells send chemical and mechanical signals which are received by other cells, where they can subsequently initiate context-specific gene regulatory responses. These interactions and their responses shape the individual molecular phenotype of a cell in a given microenvironment. RNAs or proteins measured in individual cells, together with the cells' spatial distribution, provide invaluable information about these mechanisms and the regulation of genes beyond processes occurring independently in each individual cell. SpaCeNet is a method designed to elucidate both the intracellular molecular networks (how molecular variables affect each other within the cell) and the intercellular molecular networks (how cells affect molecular variables in their neighbors). This is achieved by estimating conditional independence relations between captured variables within individual cells and by disentangling these from conditional independence relations between variables of different cells.
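The distinction between correlation and conditional independence can be made concrete with a worked example. The sketch below uses the textbook first-order partial correlation formula for three variables; for jointly Gaussian variables, a partial correlation of zero is equivalent to conditional independence. This is only an illustration of the concept, not SpaCeNet's estimator, and the correlation values are made up.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after conditioning on z (first-order formula)."""
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical chain x -> z -> y: x and y correlate only through z.
r_xz, r_yz = 0.8, 0.5
r_xy_chain = r_xz * r_yz  # marginal correlation implied by the chain
chain_pc = partial_correlation(r_xy_chain, r_xz, r_yz)

# Hypothetical case with an additional direct x-y association.
direct_pc = partial_correlation(0.7, r_xz, r_yz)
```

In the chain case the marginal correlation is 0.4 but the partial correlation vanishes, so no direct edge between x and y would be drawn; in the second case a substantial partial correlation remains.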
{"title":"Spatial Cellular Networks from omics data with SpaCeNet","authors":"Stefan Schrod, Niklas Lück, Robert Lohmayer, Stefan Solbrig, Dennis Völkl, Tina Wipfler, Katherine H. Shutta, Marouen Ben Guebila, Andreas Schäfer, Tim Beißbarth, Helena U. Zacharias, Peter Oefner, John Quackenbush, Michael Altenbuchinger","doi":"10.1101/gr.279125.124","DOIUrl":"https://doi.org/10.1101/gr.279125.124","url":null,"abstract":"Advances in omics technologies have allowed spatially resolved molecular profiling of single cells, providing a window not only into the diversity and distribution of cell types within a tissue but also into the effects of interactions between cells in shaping the transcriptional landscape. Cells send chemical and mechanical signals which are received by other cells, where they can subsequently initiate context-specific gene regulatory responses. These interactions and their responses shape the individual molecular phenotype of a cell in a given microenvironment. RNAs or proteins measured in individual cells, together with the cells' spatial distribution, provide invaluable information about these mechanisms and the regulation of genes beyond processes occurring independently in each individual cell. SpaCeNet is a method designed to elucidate both the intracellular molecular networks (how molecular variables affect each other within the cell) and the intercellular molecular networks (how cells affect molecular variables in their neighbors). 
This is achieved by estimating conditional independence relations between captured variables within individual cells and by disentangling these from conditional independence relations between variables of different cells.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"8 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear mixed models (LMMs) have been widely used in genome-wide association studies (GWAS) to control for population stratification and cryptic relatedness. However, estimating LMM parameters is computationally expensive, necessitating large-scale matrix operations to build the genetic relatedness matrix (GRM). Over the past 25 years, Randomized Linear Algebra has provided alternative approaches to such matrix operations by leveraging matrix sketching, which often results in provably accurate, fast, and efficient approximations. We leverage matrix sketching to develop a fast and efficient LMM method called Matrix-Sketching LMM (MaSk-LMM) by sketching the genotype matrix to reduce its dimensions and speed up computations. Our framework comes with both theoretical guarantees and strong empirical performance compared to the current state of the art on simulated traits and complex diseases.
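To give a feel for what sketching a genotype matrix means, the following minimal example compresses the marker dimension with a Gaussian random projection and compares the resulting approximate GRM with the exact one. It is a sketch under stated assumptions, not MaSk-LMM itself: the abstract does not say which dimension is sketched or which sketching matrix is used, the helper names are invented here, and random Gaussian values stand in for standardized genotypes.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def grm(x, n_markers):
    """Relatedness matrix X X^T / m for an (n samples x columns) matrix X."""
    xt = [list(col) for col in zip(*x)]
    return [[v / n_markers for v in row] for row in matmul(x, xt)]


def sketch_markers(x, sketch_size, rng):
    """Right-multiply X (n x m) by an m x s Gaussian sketch with N(0, 1/s) entries."""
    m = len(x[0])
    scale = 1.0 / math.sqrt(sketch_size)
    s = [[rng.gauss(0.0, 1.0) * scale for _ in range(sketch_size)] for _ in range(m)]
    return matmul(x, s)


rng = random.Random(0)
n_samples, n_markers, sketch_size = 5, 300, 150
# Stand-in for a standardized genotype matrix (assumption: real data would be SNP dosages).
genotypes = [[rng.gauss(0.0, 1.0) for _ in range(n_markers)] for _ in range(n_samples)]

exact = grm(genotypes, n_markers)
approx = grm(sketch_markers(genotypes, sketch_size, rng), n_markers)

# Relative Frobenius-norm error of the sketched GRM.
diff = sum((a - e) ** 2 for ra, re in zip(approx, exact) for a, e in zip(ra, re))
norm = sum(e ** 2 for row in exact for e in row)
rel_err = math.sqrt(diff / norm)
```

Because the sketch has expected value equal to the identity when multiplied by its own transpose, the approximate GRM is unbiased, and `rel_err` shrinks as `sketch_size` grows, at the cost of more computation.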
{"title":"Matrix sketching framework for linear mixed models in association studies","authors":"Myson C Burch, Aritra Bose, Gregory Dexter, Laxmi Parida, Petros Drineas","doi":"10.1101/gr.279230.124","DOIUrl":"https://doi.org/10.1101/gr.279230.124","url":null,"abstract":"Linear mixed models (LMMs) have been widely used in genome-wide association studies (GWAS) to control for population stratification and cryptic relatedness. However, estimating LMM parameters is computationally expensive, necessitating large-scale matrix operations to build the genetic relatedness matrix (GRM). Over the past 25 years, Randomized Linear Algebra has provided alternative approaches to such matrix operations by leveraging matrix sketching, which often results in provably accurate fast and efficient approximations. We leverage matrix sketching to develop a fast and efficient LMM method called Matrix-Sketching LMM (MaSk-LMM) by sketching the genotype matrix to reduce its dimensions and speed up computations. Our framework comes with both theoretical guarantees and a strong empirical performance compared to current state-of-the-art for simulated traits and complex diseases.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"25 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lechuan Li, Ruth Dannenfelser, Charlie Cruz, Vicky Yao
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose ANDES, a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein-protein interactions, can be used as a novel overrepresentation-based and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multi-organism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
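The best-match idea can be sketched in a few lines: for every gene in one set, take its highest similarity to any gene in the other set, average those maxima, and do the same in the reverse direction. The code below is a generic illustration of that idea rather than the ANDES implementation. The abstract does not specify the similarity measure, any normalization, or the statistical assessment, so cosine similarity, the symmetric averaging, and the gene names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def best_match_score(set_a, set_b, emb):
    """Symmetric average of each gene's best cosine match in the other set."""

    def one_way(src, dst):
        return sum(max(cosine(emb[g], emb[h]) for h in dst) for g in src) / len(src)

    return 0.5 * (one_way(set_a, set_b) + one_way(set_b, set_a))


# Hypothetical 2D embeddings: each set spans two unrelated directions.
emb = {
    "geneA": [1.0, 0.0],
    "geneB": [0.0, 1.0],
    "geneC": [2.0, 0.0],
    "geneD": [0.0, 3.0],
}
matched = best_match_score(["geneA", "geneB"], ["geneC", "geneD"], emb)
unrelated = best_match_score(["geneA"], ["geneB"], emb)
```

Here `matched` is 1.0 even though each set is internally diverse, because every gene has a perfect counterpart in the other set; a comparison of set centroids would blur that structure.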
{"title":"A best-match approach for gene set analysis in embedding spaces","authors":"Lechuan Li, Ruth Dannenfelser, Charlie Cruz, Vicky Yao","doi":"10.1101/gr.279141.124","DOIUrl":"https://doi.org/10.1101/gr.279141.124","url":null,"abstract":"Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose ANDES, a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein-protein interactions, can be used as a novel overrepresentation-based and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multi-organism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. 
Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"23 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kavitha Uthanumallian, Andrea Del Cortona, Susana Coelho, Olivier De Clerck, Sebastian Duchene, Heroen Verbruggen
There are many gaps in our knowledge of how life cycle variation and organismal body architecture associate with molecular evolution. Using the diverse range of green algal body architectures and life cycle types as a test case, we hypothesize that increases in cytomorphological complexity are likely to be associated with a decrease in the effective population size, since larger-bodied organisms typically have smaller populations, resulting in increased drift. For life cycles, we expect haploid-dominant lineages to evolve under stronger selection intensity relative to diploid-dominant life cycles due to masking of deleterious alleles in heterozygotes. We use a genome-scale dataset spanning the phylogenetic diversity of green algae and phylogenetic comparative approaches to measure the relative selection intensity across different trait categories. We show stronger signatures of drift in lineages with more complex body architectures compared to unicellular lineages, which we consider to be a consequence of smaller effective population sizes of the more complex algae. Significantly higher rates of synonymous as well as nonsynonymous substitutions relative to other algal body architectures highlight that siphonous and siphonocladous body architectures, characteristic of many green seaweeds, form an interesting test case to study the potential impacts of genome redundancy on molecular evolution. Contrary to expectations, we show that levels of selection efficacy do not show a strong association with life cycle types in green algae. Taken together, our results underline the prominent impact of body architecture on the molecular evolution of green algal genomes.
{"title":"Genome–wide patterns of selection-drift variation strongly associate with organismal traits across the green plant lineage","authors":"Kavitha Uthanumallian, Andrea Del Cortona, Susana Coelho, Olivier De Clerck, Sebastian Duchene, Heroen Verbruggen","doi":"10.1101/gr.279002.124","DOIUrl":"https://doi.org/10.1101/gr.279002.124","url":null,"abstract":"There are many gaps in our knowledge of how life cycle variation and organismal body architecture associate with molecular evolution. Using the diverse range of green algal body architectures and life cycle types as a test case, we hypothesize that increases in cytomorphological complexity are likely to be associated with a decrease in the effective population size, since larger-bodied organisms typically have smaller populations, resulting in increased drift. For life cycles, we expect haploid-dominant lineages to evolve under stronger selection intensity relative to diploid-dominant life cycles due to masking of deleterious alleles in heterozygotes. We use a genome-scale dataset spanning the phylogenetic diversity of green algae and phylogenetic comparative approaches to measure the relative selection intensity across different trait categories. We show stronger signatures of drift in lineages with more complex body architectures compared to unicellular lineages, which we consider to be a consequence of smaller effective population sizes of the more complex algae. Significantly higher rates of synonymous as well as nonsynonymous substitutions relative to other algal body architectures highlight that siphonous and siphonocladous body architectures, characteristic of many green seaweeds, form an interesting test case to study the potential impacts of genome redundancy on molecular evolution. Contrary to expectations, we show that levels of selection efficacy do not show a strong association with life cycle types in green algae. 
Taken together, our results underline the prominent impact of body architecture on the molecular evolution of green algal genomes.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":"23 1","pages":""},"PeriodicalIF":7.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142101033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matching sequences by their k-mers is increasingly common in bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. The k-mers are kept in memory at query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling k-mers have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of the k-mers present in an ultra-large dataset to include in a library such that the classification of reads suffers the least. Our experiments demonstrate the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called KRANK (K-mer RANKer) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive-hashing classifier CONSULT-II to build a taxonomic classification and profiling method. On several benchmarks, KRANK k-mer selection dramatically reduces memory consumption with minimal loss in classification accuracy. We show in extensive analyses based on CAMI benchmarks that KRANK outperforms k-mer-based alternatives in terms of taxonomic profiling and comes close to the best marker-based methods in terms of accuracy.
Memory-bound k-mer selection for large and evolutionary diverse reference libraries. Ali Osman Berk Sapci, Siavash Mirarab. Genome Research, published 2024-08-29. DOI: 10.1101/gr.279339.124 (https://doi.org/10.1101/gr.279339.124)
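The KRANK abstract names minimizers as one of the existing k-mer subsampling strategies it compares against. As a rough illustration of that baseline only (not of KRANK itself, whose ranking procedure is not described here), the sketch below keeps the smallest k-mer from every window of w consecutive k-mers. The function name `minimizers`, the parameters `k` and `w`, and the use of plain lexicographic order instead of a hash are assumptions made for this example.

```python
def minimizers(seq: str, k: int, w: int) -> set[tuple[int, str]]:
    """Return (position, k-mer) pairs chosen as window minimizers.

    Illustrative sketch: for each window of w consecutive k-mers, keep the
    lexicographically smallest one (leftmost on ties). Real tools usually
    order k-mers by a hash value rather than alphabetically.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first index with the smallest k-mer, i.e. the leftmost.
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return selected


if __name__ == "__main__":
    picked = minimizers("ACGTACGT", k=3, w=2)
    for pos, kmer in sorted(picked):
        print(pos, kmer)
```

Because neighbouring windows often share the same smallest k-mer, the selected set is smaller than the full k-mer list, which is the memory saving that subsampling is after. The abstract's argument is that a rule like this ignores how unevenly taxa are represented in the reference set.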
Boyang Fu, Prateek Anand, Aakarsh Anand, Joel Mefford, Sriram Sankararaman
Our knowledge of the contribution of genetic interactions (epistasis) to variation in human complex traits remains limited, partly due to the lack of efficient, powerful, and interpretable algorithms to detect interactions. Recently proposed approaches for set-based association tests show promise in improving power to detect epistasis by examining the aggregated effects of multiple variants. Nevertheless, these methods either do not scale to large Biobank datasets or lack interpretability. We propose QuadKAST, a scalable algorithm that tests pairwise interaction effects (quadratic effects) on a trait within small- to medium-sized sets of genetic variants ($\le 100$ SNPs) and provides a quantified interpretation of these effects. Comprehensive simulations showed that QuadKAST is well-calibrated. Additionally, QuadKAST is highly sensitive in detecting loci with epistatic signals and accurate in its estimation of quadratic effects. We applied QuadKAST to 52 quantitative phenotypes measured in ~300,000 unrelated white British individuals in the UK Biobank to test for quadratic effects within each of 9,515 protein-coding genes. We detected 32 trait-gene pairs across 17 traits and 29 genes that demonstrate statistically significant signals of quadratic effects ($p \le 0.05/(9{,}515 \times 52)$, accounting for the number of genes and traits tested). Across these trait-gene pairs, the proportion of trait variance explained by quadratic effects is similar to that explained by additive effects (median $\sigma^{2}_{\mathrm{quad}} / \sigma^{2}_{g} = 0.15$), with five pairs having a ratio greater than one. Our method enables the detailed investigation of epistasis on a large scale, offering new insights into its role and importance.
A scalable adaptive quadratic kernel method for interpretable epistasis analysis in complex traits. Genome Research, published 2024-08-29. DOI: 10.1101/gr.279140.124 (https://doi.org/10.1101/gr.279140.124)
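The significance threshold quoted in the QuadKAST abstract is a Bonferroni correction over every gene-trait combination tested. The short sketch below only reproduces that arithmetic from the numbers given in the abstract. The variable names and the helper `is_significant` are illustrative and are not part of the QuadKAST software.

```python
# Counts and nominal level as stated in the abstract.
n_genes = 9515      # protein-coding genes tested
n_traits = 52       # quantitative phenotypes tested
alpha = 0.05        # nominal family-wise error rate

# Bonferroni correction: divide alpha by the total number of tests.
n_tests = n_genes * n_traits
threshold = alpha / n_tests


def is_significant(p_value: float) -> bool:
    """Hypothetical helper: True if p_value passes the corrected threshold."""
    return p_value <= threshold


if __name__ == "__main__":
    print(n_tests)               # 494780 gene-trait tests
    print(f"{threshold:.2e}")    # roughly 1.01e-07
```

A trait-gene pair therefore has to reach a p-value of about 1e-7 or smaller to be counted among the 32 reported pairs.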