Laura E. Tibbs-Cortes, Tingting Guo, Carson M. Andorf, Xianran Li, Jianming Yu
Maize phenotypes are plastic, determined by the complex interplay of genetics and environmental variables. Uncovering the genes responsible and understanding how their effects change across a large geographic region are challenging. In this study, we conducted a systematic analysis to identify environmental indices that strongly influence 19 traits (including flowering time, plant architecture, and yield component traits) measured in the maize nested association mapping (NAM) population grown in 11 environments. The identified environmental indices, based on day length, temperature, moisture, and combinations of these, are biologically meaningful. Next, we leveraged a total of more than 20 million SNP and SV markers derived from recent de novo sequencing of the NAM founders for trait prediction and dissection. When combined with the identified environmental indices, genomic prediction enables accurate performance predictions. Genome-wide association studies (GWASs) detected genetic loci associated with the plastic response to the identified environmental indices for all examined traits. By systematically uncovering the major environmental and genomic factors underlying phenotypic plasticity in a wide variety of traits and depositing our results as a track on the MaizeGDB genome browser, we provide a community resource as well as a comprehensive analytical framework to facilitate continuing complex trait dissection and prediction in maize and other crops. Our findings also provide a conceptual framework for the genetic architecture of phenotypic plasticity by accommodating two alternative models, the regulatory gene model and the allelic sensitivity model, as special cases of a continuum.
{"title":"Comprehensive identification of genomic and environmental determinants of phenotypic plasticity in maize","authors":"Laura E. Tibbs-Cortes, Tingting Guo, Carson M. Andorf, Xianran Li, Jianming Yu","doi":"10.1101/gr.279027.124","DOIUrl":"https://doi.org/10.1101/gr.279027.124","url":null,"abstract":"Maize phenotypes are plastic, determined by the complex interplay of genetics and environmental variables. Uncovering the genes responsible and understanding how their effects change across a large geographic region are challenging. In this study, we conducted systematic analysis to identify environmental indices that strongly influence 19 traits (including flowering time, plant architecture, and yield component traits) measured in the maize nested association mapping (NAM) population grown in 11 environments. Identified environmental indices based on day length, temperature, moisture, and combinations of these are biologically meaningful. Next, we leveraged a total of more than 20 million SNP and SV markers derived from recent de novo sequencing of the NAM founders for trait prediction and dissection. When combined with identified environmental indices, genomic prediction enables accurate performance predictions. Genome-wide association studies (GWASs) detected genetic loci associated with the plastic response to the identified environmental indices for all examined traits. By systematically uncovering the major environmental and genomic factors underlying phenotypic plasticity in a wide variety of traits and depositing our results as a track on the MaizeGDB genome browser, we provide a community resource as well as a comprehensive analytical framework to facilitate continuing complex trait dissection and prediction in maize and other crops. 
Our findings also provide a conceptual framework for the genetic architecture of phenotypic plasticity by accommodating two alternative models, regulatory gene model and allelic sensitivity model, as special cases of a continuum.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142231609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
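The idea of a "plastic response to an environmental index" in the abstract above can be illustrated with a reaction norm: for each genotype, the phenotype is regressed on the environmental index, and the slope is read as that genotype's plasticity. The following is only a minimal sketch under that assumption. The function name `fit_reaction_norm`, the index values, and the trait values are hypothetical and not taken from the study, and the authors' actual pipeline may differ.

```python
def fit_reaction_norm(env_index, phenotype):
    """Ordinary least-squares fit of phenotype = intercept + slope * env_index.

    env_index: one environmental index value per environment.
    phenotype: the trait value of one genotype in each of those environments.
    Returns (intercept, slope); the slope is read here as plasticity.
    """
    n = len(env_index)
    if n != len(phenotype) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(env_index) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in env_index)
    if sxx == 0:
        raise ValueError("environmental index must vary across environments")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(env_index, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up example: an index measured in four environments and two genotypes.
env_index = [10.0, 12.0, 14.0, 16.0]
line_a = [60.0, 64.0, 68.0, 72.0]  # responds strongly to the index
line_b = [66.0, 66.5, 67.0, 67.5]  # responds weakly to the index

intercept_a, slope_a = fit_reaction_norm(env_index, line_a)
intercept_b, slope_b = fit_reaction_norm(env_index, line_b)
print(f"line A: intercept={intercept_a:.2f}, slope={slope_a:.2f}")
print(f"line B: intercept={intercept_b:.2f}, slope={slope_b:.2f}")
```

In this toy setting, the per-genotype intercepts and slopes would be the quantities passed on to genomic prediction or association mapping in place of raw trait values.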
Jan Müller, Christina Hartwig, Mirko Sonntag, Lisa Bitzer, Christopher Adelmann, Yevhen Vainshtein, Karolina Glanz, Sebastian O. Decker, Thorsten Brenner, Georg F. Weber, Arndt von Haeseler, Kai Sohn
The diagnostic potential of short double-stranded cell-free DNA (cfDNA) in blood plasma has not yet been recognized. Here, we present a method for enrichment of double-stranded cfDNA with an average length of ∼40 bp from cfDNA for high-throughput DNA sequencing. This class of cfDNA is enriched at gene promoters and binding sites of transcription factors or structural DNA-binding proteins, so that a genome-wide DNA footprint is directly captured from liquid biopsies. In short double-stranded cfDNA from healthy individuals, we find significant enrichment of 203 transcription factor motifs. Additionally, short double-stranded cfDNA signals at specific genomic regions correlate negatively with DNA methylation and positively with H3K4me3 histone modifications and gene transcription. When comparing short double-stranded cfDNA from patient samples of pancreatic ductal adenocarcinoma with those of colorectal carcinoma, or from septic patients with postoperative controls, we identify 136 and 241 differentially enriched loci, respectively. Using these differentially enriched loci, the disease types can be clearly distinguished by principal component analysis, demonstrating the diagnostic potential of short double-stranded cfDNA signals as a new class of biomarkers for liquid biopsies.
{"title":"A novel approach for in vivo DNA footprinting using short double-stranded cell-free DNA from plasma","authors":"Jan Müller, Christina Hartwig, Mirko Sonntag, Lisa Bitzer, Christopher Adelmann, Yevhen Vainshtein, Karolina Glanz, Sebastian O. Decker, Thorsten Brenner, Georg F. Weber, Arndt von Haeseler, Kai Sohn","doi":"10.1101/gr.279326.124","DOIUrl":"https://doi.org/10.1101/gr.279326.124","url":null,"abstract":"Here, we present a method for enrichment of double-stranded cfDNA with an average length of ∼40 bp from cfDNA for high-throughput DNA sequencing. This class of cfDNA is enriched at gene promoters and binding sites of transcription factors or structural DNA-binding proteins, so that a genome-wide DNA footprint is directly captured from liquid biopsies. In short double-stranded cfDNA from healthy individuals, we find significant enrichment of 203 transcription factor motifs. Additionally, short double-stranded cfDNA signals at specific genomic regions correlate negatively with DNA methylation, positively with H3K4me3 histone modifications and gene transcription. The diagnostic potential of short double-stranded cell-free DNA (cfDNA) in blood plasma has not yet been recognized. When comparing short double-stranded cfDNA from patient samples of pancreatic ductal adenocarcinoma with colorectal carcinoma or septic with postoperative controls, we identify 136 and 241 differentially enriched loci, respectively. 
Using these differentially enriched loci, the disease types can be clearly distinguished by principal component analysis, demonstrating the diagnostic potential of short double-stranded cfDNA signals as a new class of biomarkers for liquid biopsies.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142231563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zane Kliesmete, Peter Orchard, Victor Yan Kin Lee, Johanna Geuder, Simon M. Krauß, Mari Ohnuki, Jessica Jocher, Beate Vieth, Wolfgang Enard, Ines Hellmann
Pleiotropy, measured as expression breadth across tissues, is one of the best predictors for protein sequence and expression conservation. In this study, we investigated its effect on the evolution of cis-regulatory elements (CREs). To this end, we carefully reanalyzed the Epigenomics Roadmap data for nine fetal tissues, assigning a measure of pleiotropic degree to nearly half a million CREs. To assess the functional conservation of CREs, we generated ATAC-seq and RNA-seq data from humans and macaques. We found that more pleiotropic CREs exhibit greater conservation in accessibility, and the mRNA expression levels of the associated genes are more conserved. This trend of higher conservation for higher degrees of pleiotropy persists when analyzing the transcription factor binding repertoire. In contrast, simple DNA sequence conservation of orthologous sites between species tends to be even lower for pleiotropic CREs than for species-specific CREs. Combining various lines of evidence, we propose that the lack of sequence conservation in functionally conserved pleiotropic CREs is due to within-element compensatory evolution. In summary, our findings suggest that pleiotropy is also a good predictor for the functional conservation of CREs, even though this is not reflected in the sequence conservation of pleiotropic CREs.
{"title":"Evidence for compensatory evolution within pleiotropic regulatory elements","authors":"Zane Kliesmete, Peter Orchard, Victor Yan Kin Lee, Johanna Geuder, Simon M. Krauß, Mari Ohnuki, Jessica Jocher, Beate Vieth, Wolfgang Enard, Ines Hellmann","doi":"10.1101/gr.279001.124","DOIUrl":"https://doi.org/10.1101/gr.279001.124","url":null,"abstract":"Pleiotropy, measured as expression breadth across tissues, is one of the best predictors for protein sequence and expression conservation. In this study, we investigated its effect on the evolution of <em>cis</em>-regulatory elements (CREs). To this end, we carefully reanalyzed the Epigenomics Roadmap data for nine fetal tissues, assigning a measure of pleiotropic degree to nearly half a million CREs. To assess the functional conservation of CREs, we generated ATAC-seq and RNA-seq data from humans and macaques. We found that more pleiotropic CREs exhibit greater conservation in accessibility, and the mRNA expression levels of the associated genes are more conserved. This trend of higher conservation for higher degrees of pleiotropy persists when analyzing the transcription factor binding repertoire. In contrast, simple DNA sequence conservation of orthologous sites between species tends to be even lower for pleiotropic CREs than for species-specific CREs. Combining various lines of evidence, we propose that the lack of sequence conservation in functionally conserved pleiotropic CREs is due to within-element compensatory evolution. 
In summary, our findings suggest that pleiotropy is also a good predictor for the functional conservation of CREs, even though this is not reflected in the sequence conservation of pleiotropic CREs.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shujun Ou, Armin Scheben, Tyler Collins, Yinjie Qiu, Arun S. Seetharam, Claire C. Menard, Nancy Manchanda, Jonathan I. Gent, Michael C. Schatz, Sarah N. Anderson, Matthew B. Hufford, Candice N. Hirsch
Much of the profound interspecific variation in genome content has been attributed to transposable elements (TEs). To explore the extent of TE variation within species, we developed an optimized open-source algorithm, panEDTA, to de novo annotate TEs in a pangenome context. We then generated a unified TE annotation for a maize pangenome derived from 26 reference-quality genomes, which reveals an excess of 35.1 Mb of TE sequences per genome in tropical maize relative to temperate maize. A small number (n = 216) of TE families, mainly LTR retrotransposons, drive these differences. Evidence from the methylome, transcriptome, LTR age distribution, and LTR insertional polymorphisms reveals that 64.7% of the variability is contributed by LTR families that are young, less methylated, and more expressed in tropical maize, whereas 18.5% is driven by LTR families with removal or loss in temperate maize. Additionally, we find enrichment for Young LTR families adjacent to nucleotide-binding and leucine-rich repeat (NLR) clusters of varying copy number across lines, suggesting TE activity may be associated with disease resistance in maize.
{"title":"Differences in activity and stability drive transposable element variation in tropical and temperate maize","authors":"Shujun Ou, Armin Scheben, Tyler Collins, Yinjie Qiu, Arun S. Seetharam, Claire C. Menard, Nancy Manchanda, Jonathan I. Gent, Michael C. Schatz, Sarah N. Anderson, Matthew B. Hufford, Candice N. Hirsch","doi":"10.1101/gr.278131.123","DOIUrl":"https://doi.org/10.1101/gr.278131.123","url":null,"abstract":"Much of the profound interspecific variation in genome content has been attributed to transposable elements (TEs). To explore the extent of TE variation within species, we developed an optimized open-source algorithm, panEDTA, to de novo annotate TEs in a pangenome context. We then generated a unified TE annotation for a maize pangenome derived from 26 reference-quality genomes, which reveals an excess of 35.1 Mb of TE sequences per genome in tropical maize relative to temperate maize. A small number (<em>n</em> = 216) of TE families, mainly LTR retrotransposons, drive these differences. Evidence from the methylome, transcriptome, LTR age distribution, and LTR insertional polymorphisms reveals that 64.7% of the variability is contributed by LTR families that are young, less methylated, and more expressed in tropical maize, whereas 18.5% is driven by LTR families with removal or loss in temperate maize. 
Additionally, we find enrichment for Young LTR families adjacent to nucleotide-binding and leucine-rich repeat (NLR) clusters of varying copy number across lines, suggesting TE activity may be associated with disease resistance in maize.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The killer-cell immunoglobulin-like receptor (KIR) gene complex, a highly polymorphic region of the human genome that encodes proteins involved in immune responses, poses strong challenges in genotyping owing to its remarkable genetic diversity and structural intricacy. Accurate analysis of KIR alleles, including their structural variations, is crucial for understanding their roles in various immune responses. Leveraging the high-quality genome assemblies from the Human Pangenome Reference Consortium (HPRC), we present a novel bioinformatic tool, the structural KIR annoTator (SKIRT), to investigate gene diversity and facilitate precise KIR allele analysis. In 47 HPRC-phased assemblies, SKIRT identifies a recurrent novel KIR2DS4/3DL1 fusion gene in the paternal haplotype of HG02630 and maternal haplotype of NA19240. Additionally, SKIRT accurately identifies eight structural variants and 15 novel nonsynonymous alleles, all of which are independently validated using short-read data or quantitative polymerase chain reaction. Our study has discovered a total of 570 novel alleles, among which eight haplotypes harbor at least one KIR gene duplication, six haplotypes have lost at least one framework gene, and 75 out of 94 haplotypes (79.8%) carry at least five novel alleles, thus confirming KIR genetic diversity. These findings are pivotal in providing insights into KIR gene diversity and serve as a solid foundation for understanding the functional consequences of KIR structural variations. High-resolution genome assemblies offer unprecedented opportunities to explore polymorphic regions that are challenging to investigate using short-read sequencing methods. The SKIRT pipeline emerges as a highly efficient tool, enabling the comprehensive detection of the complete spectrum of KIR alleles within human genome assemblies.
{"title":"Genetic complexity of killer-cell immunoglobulin-like receptor genes in human pangenome assemblies","authors":"Tsung-Kai Hung, Wan-Chi Liu, Sheng-Kai Lai, Hui-Wen Chuang, Yi-Che Lee, Hong-Ye Lin, Chia-Lang Hsu, Chien-Yu Chen, Ya-Chien Yang, Jacob Shujui Hsu, Pei-Lung Chen","doi":"10.1101/gr.278358.123","DOIUrl":"https://doi.org/10.1101/gr.278358.123","url":null,"abstract":"The killer-cell immunoglobulin-like receptor (KIR) gene complex, a highly polymorphic region of the human genome that encodes proteins involved in immune responses, poses strong challenges in genotyping owing to its remarkable genetic diversity and structural intricacy. Accurate analysis of KIR alleles, including their structural variations, is crucial for understanding their roles in various immune responses. Leveraging the high-quality genome assemblies from the Human Pangenome Reference Consortium (HPRC), we present a novel bioinformatic tool, the structural KIR annoTator (SKIRT), to investigate gene diversity and facilitate precise KIR allele analysis. In 47 HPRC-phased assemblies, SKIRT identifies a recurrent novel <em>KIR2DS4/3DL1</em> fusion gene in the paternal haplotype of HG02630 and maternal haplotype of NA19240. Additionally, SKIRT accurately identifies eight structural variants and 15 novel nonsynonymous alleles, all of which are independently validated using short-read data or quantitative polymerase chain reaction. Our study has discovered a total of 570 novel alleles, among which eight haplotypes harbor at least one KIR gene duplication, six haplotypes have lost at least one framework gene, and 75 out of 94 haplotypes (79.8%) carry at least five novel alleles, thus confirming KIR genetic diversity. These findings are pivotal in providing insights into KIR gene diversity and serve as a solid foundation for understanding the functional consequences of KIR structural variations. 
High-resolution genome assemblies offer unprecedented opportunities to explore polymorphic regions that are challenging to investigate using short-read sequencing methods. The SKIRT pipeline emerges as a highly efficient tool, enabling the comprehensive detection of the complete spectrum of KIR alleles within human genome assemblies.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meir Goldenberg, Loay Mualem, Amit Shahar, Sagi Snir, Adi Akavia
DNA methylation data plays a crucial role in estimating chronological age in mammals, offering real-time insights into an individual’s aging process. The Epigenetic Pacemaker (EPM) model allows inference of the biological age as deviations from the population trend. Given the sensitivity of this data, it is essential to safeguard both inputs and outputs of the EPM model. In a recent study, a privacy-preserving approach for EPM computation was introduced, utilizing Fully Homomorphic Encryption (FHE). However, their method had limitations, including having high communication complexity and being impractical for large datasets. Our work presents a new privacy-preserving protocol for EPM computation, analytically improving both privacy and complexity. Notably, we employ a single server for the secure computation phase while ensuring privacy even in the event of server corruption (compared to requiring two non-colluding servers). Using techniques from symbolic algebra and number theory, the new protocol eliminates the need for communication during secure computation, significantly improves asymptotic runtime, and offers better compatibility with parallel computing for further time complexity reduction. We have implemented our protocol, demonstrating its ability to produce results similar to the standard (insecure) EPM model with substantial performance improvement compared to previous methods. These findings hold promise for enhancing data security in medical applications where personal privacy is paramount. The generality of both the new approach and the EPM suggests that this protocol may be useful for other applications employing similar expectation maximization techniques.
{"title":"Privacy-preserving biological age prediction over federated human methylation data using fully homomorphic encryption","authors":"Meir Goldenberg, Loay Mualem, Amit Shahar, Sagi Snir, Adi Akavia","doi":"10.1101/gr.279071.124","DOIUrl":"https://doi.org/10.1101/gr.279071.124","url":null,"abstract":"DNA methylation data plays a crucial role in estimating chronological age in mammals, offering real-time insights into an individual’s aging process. The Epigenetic Pacemaker (EPM) model allows inference of the biological age as deviations from the population trend. Given the sensitivity of this data, it is essential to safeguard both inputs and outputs of the EPM model. In a recent study, a privacy-preserving approach for EPM computation was introduced, utilizing Fully Homomorphic Encryption (FHE). However, their method had limitations, including having high communication complexity and being impractical for large datasets. Our work presents a new privacy preserving protocol for EPM computation, analytically improving both privacy and complexity. Notably, we employ a single server for the secure computation phase while ensuring privacy even in the event of server corruption (compared to requiring two non-colluding servers). Using techniques from symbolic algebra and number theory, the new protocol eliminates the need for communication during secure computation, significantly improves asymptotic runtime and offers better compatibility to parallel computing for further time complexity reduction. We have implemented our protocol, demonstrating its ability to produce results similar to the standard (insecure) EPM model with substantial performance improvement compared to previous methods. These findings hold promise for enhancing data security in medical applications where personal privacy is paramount. 
The generality of both the new approach and the EPM suggests that this protocol may be useful to other uses employing similar expectation maximization techniques.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cameron Y Park, Shouvik Mani, Nicolas Beltran-Velez, Katie Maurer, Teddy Huang, Shuqiang Li, Satyen Gohil, Kenneth J Livak, David A Knowles, Catherine J Wu, Elham Azizi
Characterizing cell-cell communication and tracking its variability over time are crucial for understanding the coordination of biological processes mediating normal development, disease progression, and responses to perturbations such as therapies. Existing tools fail to capture time-dependent intercellular interactions, and primarily rely on existing databases compiled from limited contexts. We introduce DIISCO, a Bayesian framework designed to characterize the temporal dynamics of cellular interactions using single-cell RNA sequencing data from multiple time points. Our method utilizes structured Gaussian process regression to unveil time-resolved interactions among diverse cell types according to their coevolution and incorporates prior knowledge of receptor-ligand complexes. We show the interpretability of DIISCO in simulated data and new data collected from T cells co-cultured with lymphoma cells, demonstrating its potential to uncover dynamic cell-cell crosstalk.
{"title":"A Bayesian framework for inferring dynamic intercellular interactions from time-series single-cell data","authors":"Cameron Y Park, Shouvik Mani, Nicolas Beltran-Velez, Katie Maurer, Teddy Huang, Shuqiang Li, Satyen Gohil, Kenneth J Livak, David A Knowles, Catherine J Wu, Elham Azizi","doi":"10.1101/gr.279126.124","DOIUrl":"https://doi.org/10.1101/gr.279126.124","url":null,"abstract":"Characterizing cell-cell communication and tracking its variability over time are crucial for understanding the coordination of biological processes mediating normal development, disease progression, and responses to perturbations such as therapies. Existing tools fail to capture time-dependent intercellular interactions, and primarily rely on existing databases compiled from limited contexts. We introduce DIISCO, a Bayesian framework designed to characterize the temporal dynamics of cellular interactions using single-cell RNA sequencing data from multiple time points. Our method utilizes structured Gaussian process regression to unveil time-resolved interactions among diverse cell types according to their coevolution and incorporates prior knowledge of receptor-ligand complexes. We show the interpretability of DIISCO in simulated data and new data collected from T cells co-cultured with lymphoma cells, demonstrating its potential to uncover dynamic cell-cell crosstalk.","PeriodicalId":12678,"journal":{"name":"Genome research","volume":null,"pages":null},"PeriodicalIF":7.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently developed protein language models have enabled a variety of applications through the contextual protein embeddings they produce. Per-protein representations (in which each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues or by applying matrix transformation techniques, such as the discrete cosine transformation, to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches for similar proteins; however, limitations have been found. For example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not multi-domain proteins. Here we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the discrete cosine transformation to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, utilizes predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut for short) in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We showed that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but have undefined extended regions between shared domains, and between those that share only local similarities. In addition, tests on a database search benchmark showed that DCTdomain was able to detect distant homologs by leveraging the structural information in the contextual embeddings.
Protein domain embeddings for fast and accurate similarity search. Genome Research, published 2024-09-05. https://doi.org/10.1101/gr.279127.124
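The DCTdomain abstract above describes turning the residue embeddings of each domain into a fixed-size vector with a discrete cosine transform. The following is a minimal sketch of that general idea only, not the authors' implementation: it applies an unnormalized two-dimensional DCT-II to a residues-by-dimensions matrix and keeps a small low-frequency block. The function names `dct2` and `dct_fingerprint`, the block size (`n_rows`, `n_cols`), and the toy inputs are all assumptions made for illustration; the abstract does not specify how the real fingerprints are sized, scaled, or quantized.

```python
import math


def dct2(values):
    """Unnormalized one-dimensional DCT-II of a list of floats."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_fingerprint(embeddings, n_rows=3, n_cols=4):
    """Compress an L x D residue-embedding matrix into n_rows * n_cols numbers.

    Hypothetical helper: keeps the low-frequency corner of a 2D DCT-II and
    zero-pads if the domain is shorter than n_rows or narrower than n_cols.
    """
    length = len(embeddings)
    dim = len(embeddings[0])
    # DCT along the residue axis, one embedding dimension at a time.
    by_dim = [dct2([embeddings[i][j] for i in range(length)]) for j in range(dim)]
    fingerprint = [0.0] * (n_rows * n_cols)
    for k in range(min(n_rows, length)):
        # DCT along the embedding axis for the k-th residue-axis frequency.
        across = dct2([by_dim[j][k] for j in range(dim)])
        for c in range(min(n_cols, dim)):
            fingerprint[k * n_cols + c] = across[c]
    return fingerprint


# Toy "domains" of different lengths (made-up numbers, not real embeddings).
short_domain = [[1.0] * 6 for _ in range(5)]
long_domain = [[math.sin(0.3 * i + j) for j in range(6)] for i in range(40)]
fp_short = dct_fingerprint(short_domain)
fp_long = dct_fingerprint(long_domain)
```

Both toy domains yield vectors of the same length even though one has 5 residues and the other 40, which is what makes such vectors convenient for fast comparison. Because the transform here is unnormalized, coefficient magnitudes grow with domain length; a practical version would likely rescale them.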
Stefan Schrod, Niklas Lück, Robert Lohmayer, Stefan Solbrig, Dennis Völkl, Tina Wipfler, Katherine H. Shutta, Marouen Ben Guebila, Andreas Schäfer, Tim Beißbarth, Helena U. Zacharias, Peter Oefner, John Quackenbush, Michael Altenbuchinger
Advances in omics technologies have allowed spatially resolved molecular profiling of single cells, providing a window not only into the diversity and distribution of cell types within a tissue but also into the effects of interactions between cells in shaping the transcriptional landscape. Cells send chemical and mechanical signals which are received by other cells, where they can subsequently initiate context-specific gene regulatory responses. These interactions and their responses shape the individual molecular phenotype of a cell in a given microenvironment. RNAs or proteins measured in individual cells, together with the cells' spatial distribution, provide invaluable information about these mechanisms and the regulation of genes beyond processes occurring independently in each individual cell. SpaCeNet is a method designed to elucidate both the intracellular molecular networks (how molecular variables affect each other within the cell) and the intercellular molecular networks (how cells affect molecular variables in their neighbors). This is achieved by estimating conditional independence relations between captured variables within individual cells and by disentangling these from conditional independence relations between variables of different cells.
Spatial Cellular Networks from omics data with SpaCeNet. Genome Research, published 2024-09-04. https://doi.org/10.1101/gr.279125.124
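The SpaCeNet abstract above rests on conditional independence relations between variables. As a generic illustration of that concept (not SpaCeNet's estimator, which additionally separates within-cell from between-cell relations), the sketch below uses the standard Gaussian result that a zero entry in the inverse covariance (precision) matrix corresponds to conditional independence given all other variables. The helper names `invert` and `partial_correlations` and the three-variable chain are assumptions chosen for the example.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation of each pair given all remaining variables."""
    prec = invert(cov)
    n = len(prec)
    return [
        [
            1.0 if i == j else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Covariance of a chain X -> Y -> Z with unit-variance noise at each step:
# X ~ N(0, 1), Y = X + e1, Z = Y + e2.
cov = [
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 2.0],
    [1.0, 2.0, 3.0],
]
pcor = partial_correlations(cov)
```

In this chain, X and Z are marginally correlated but their partial correlation given Y is zero, so only the X-Y and Y-Z edges would appear in the network. With real omics data the covariance is estimated from samples and regularization is usually required; those steps are outside this sketch.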
Lechuan Li, Ruth Dannenfelser, Charlie Cruz, Vicky Yao
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose ANDES, a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein-protein interactions, can be used as a novel overrepresentation-based and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multi-organism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
A best-match approach for gene set analysis in embedding spaces. Genome Research, published 2024-09-04. https://doi.org/10.1101/gr.279141.124
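The ANDES abstract above describes comparing gene sets in an embedding space by best match. A generic best-match set similarity can be sketched as follows; this is an illustration under stated assumptions, not the published ANDES procedure, whose scoring and normalization details are not given in the abstract. The function names, the choice of cosine similarity, the symmetric averaging, and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match_score(set_a, set_b, embedding):
    """Average, in both directions, of each gene's best similarity to the other set."""
    a_to_b = [max(cosine(embedding[g], embedding[h]) for h in set_b) for g in set_a]
    b_to_a = [max(cosine(embedding[h], embedding[g]) for g in set_a) for h in set_b]
    return 0.5 * (sum(a_to_b) / len(a_to_b) + sum(b_to_a) / len(b_to_a))


# Toy two-dimensional gene embeddings (made-up values for illustration).
embedding = {
    "g1": [1.0, 0.0],
    "g2": [0.0, 1.0],
    "g3": [1.0, 0.1],
    "g4": [0.1, 1.0],
    "g5": [-1.0, 0.0],
}
similar = best_match_score(["g1", "g2"], ["g3", "g4"], embedding)
dissimilar = best_match_score(["g1", "g2"], ["g5"], embedding)
identical = best_match_score(["g1", "g2"], ["g1", "g2"], embedding)
```

Matching each gene to its closest counterpart lets a set made of two distinct subgroups (here `g1` and `g2` point in different directions) still score highly against a set with the same subgroups, which an all-pairs average would penalize.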