Pub Date: 2024-12-05 DOI: 10.1038/s43588-024-00739-9
Drew DeHaas, Ziqing Pan, Xinzhu Wei
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. However, encoding genetic data in existing tabular data structures and file formats has become costly and unsustainable. Here we introduce the genotype representation graph (GRG), a fully connected hierarchical data structure that losslessly encodes phased whole-genome polymorphisms. Exploiting variant-sharing across samples enables GRG to compress 200,000 UK Biobank phased human genomes to 5–26 gigabytes per chromosome and lets graph-traversal algorithms reuse computed values in random access memory. Constructing and processing GRG files scales to a million whole genomes. Using allele frequencies and association effects as examples, we show that computation on GRG via graph traversal runs the fastest among all tested alternatives. GRG-based algorithms have the potential to increase the scalability and reduce the cost of analyzing large genomic datasets.
The genotype representation graph (GRG) is a compact data structure that encodes 200,000 human genomes in just 5–26 gigabytes per chromosome. Computation on GRG via graph traversal greatly accelerates genome-wide analysis.
Title: Enabling efficient analysis of biobank-scale data with genotype representation graphs
Nature Computational Science, volume 5, issue 2, pages 112-124.
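The GRG abstract above says that graph traversal can reuse computed values when calculating allele frequencies. The following is a minimal sketch of that idea on a toy graph, not the authors' implementation or file format. The node names, the `CHILDREN` table and the function names are all hypothetical, and the sketch assumes each haplotype is reachable from a given node by exactly one path, so that child counts can simply be summed.

```python
from functools import lru_cache

# Hypothetical toy graph: mutation nodes point down to shared internal nodes
# or directly to sample haplotypes. A haplotype carries a mutation if it is
# reachable from that mutation's node.
CHILDREN = {
    "mut_A": ("shared_1",),
    "mut_B": ("shared_1", "hap_2"),
    "mut_C": ("hap_3",),
    "shared_1": ("hap_0", "hap_1"),
}
HAPLOTYPES = ("hap_0", "hap_1", "hap_2", "hap_3")


@lru_cache(maxsize=None)
def carriers_below(node: str) -> int:
    """Count haplotypes reachable from `node`.

    The cache means a shared internal node such as "shared_1" is counted
    once and its result is reused by every mutation above it.
    """
    if node in HAPLOTYPES:
        return 1
    return sum(carriers_below(child) for child in CHILDREN.get(node, ()))


def allele_frequencies() -> dict:
    """Return the frequency of each mutation across all haplotypes."""
    total = len(HAPLOTYPES)
    return {
        node: carriers_below(node) / total
        for node in CHILDREN
        if node.startswith("mut_")
    }


if __name__ == "__main__":
    for mutation, freq in allele_frequencies().items():
        print(mutation, freq)
```

In this toy graph, `mut_A` and `mut_B` both sit above `shared_1`, so the count for `shared_1` is computed once and reused; a tabular encoding would instead scan the same two haplotypes separately for each variant.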
Pub Date: 2024-12-04 DOI: 10.1038/s43588-024-00727-z
Yongle Li, Yuhao Chen, Xiao He
By developing an efficient spin symmetry penalty, a recent study has substantially accelerated the calculation of accurate energies with correct spin states in variational Monte Carlo for both ground and excited states of quantum many-particle systems.
Title: Teaching spin symmetry while learning neural network wave functions
Nature Computational Science, volume 4, issue 12, pages 884-885.
Pub Date: 2024-12-04 DOI: 10.1038/s43588-024-00728-y
Inspired by recent approaches for natural language processing and computer vision, we developed Annotatability, a framework that analyzes deep neural network training dynamics to interpret pre-annotated single-cell and spatial omics data. Annotatability identified erroneous annotations and ambiguous cell states, inferred trajectories from binary labels, and revealed underlying biological signals.
Title: Deep learning training dynamics analysis for single-cell data
Nature Computational Science, volume 4, issue 12, pages 886-887.
The integration of deep neural networks with the variational Monte Carlo (VMC) method has marked a substantial advancement in solving the Schrödinger equation. In this work we enforce spin symmetry in the neural-network-based VMC calculation using a modified optimization target. Our method is designed to solve for the ground state and multiple excited states with target spin symmetry at a low computational cost. It predicts accurate energies while maintaining the correct symmetry in strongly correlated systems, even in cases in which different spin states are nearly degenerate. Our approach also excels at spin–gap calculations, including the singlet–triplet gap in biradical systems, which is of high interest in photochemistry. Overall, this work establishes a robust framework for efficiently calculating various quantum states with specific spin symmetry in correlated systems.
An efficient approach is developed to enforce spin symmetry for neural network wavefunctions when solving the many-body Schrödinger equation. This enables accurate and spin-pure simulations of both ground and excited states.
Title: Spin-symmetry-enforced solution of the many-body Schrödinger equation with a deep neural network
Authors: Zhe Li, Zixiang Lu, Ruichen Li, Xuelan Wen, Xiang Li, Liwei Wang, Ji Chen, Weiluo Ren
Pub Date: 2024-12-04 DOI: 10.1038/s43588-024-00730-4
Nature Computational Science, volume 4, issue 12, pages 910-919.
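The spin-symmetry abstract above refers to a "modified optimization target" without stating its form. As an illustration only, one generic way to penalize spin contamination in a variational calculation is shown below. This textbook-style form is an assumption and may differ from the penalty the authors actually use; the symbols $\lambda$ (penalty weight) and $s$ (target spin quantum number) are introduced here for the sketch.

```latex
\mathcal{L}(\theta)
  = \frac{\langle \psi_\theta | \hat{H} | \psi_\theta \rangle}
         {\langle \psi_\theta | \psi_\theta \rangle}
  + \lambda \,
    \frac{\langle \psi_\theta | \bigl(\hat{S}^2 - s(s+1)\bigr)^2 | \psi_\theta \rangle}
         {\langle \psi_\theta | \psi_\theta \rangle},
  \qquad \lambda > 0 .
```

The second term is non-negative and vanishes only when $\psi_\theta$ is an eigenstate of $\hat{S}^2$ with eigenvalue $s(s+1)$, so minimizing $\mathcal{L}$ favors low-energy states with the target spin.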
Pub Date: 2024-12-04 DOI: 10.1038/s43588-024-00721-5
Jonathan Karin, Reshef Mintz, Barak Raveh, Mor Nitzan
Single-cell and spatial omics datasets can be organized and interpreted by annotating single cells to distinct types, states, locations or phenotypes. However, cell annotations are inherently ambiguous, as discrete labels with subjective interpretations are assigned to heterogeneous cell populations on the basis of noisy, sparse and high-dimensional data. Here we developed Annotatability, a framework for identifying annotation mismatches and characterizing biological data structure by monitoring the dynamics and difficulty of training a deep neural network over such annotated data. Following this, we developed a signal-aware graph embedding method that enables downstream analysis of biological signals. This embedding captures cellular communities associated with target signals. Using Annotatability, we address key challenges in the interpretation of genomic data, demonstrated over eight single-cell RNA sequencing and spatial omics datasets, including identifying erroneous annotations and intermediate cell states, delineating developmental or disease trajectories, and capturing cellular heterogeneity. These results underscore the broad applicability of annotation-trainability analysis via Annotatability for unraveling cellular diversity and interpreting collective cell behaviors in health and disease.
The Annotatability framework analyzes neural network training dynamics to interpret single-cell and spatial omics data. It identifies erroneous annotations and ambiguous cell states, infers trajectories from binary labels and enables signal-aware analysis.
Title: Interpreting single-cell and spatial omics data using deep neural network training dynamics
Nature Computational Science, volume 4, issue 12, pages 941-954.
Open access PDF: https://www.nature.com/articles/s43588-024-00721-5.pdf
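The Annotatability abstract above describes monitoring the dynamics and difficulty of training a network on annotated cells. A minimal sketch of one common training-dynamics statistic from the NLP literature follows: the mean and spread, across epochs, of the probability the network assigns to each cell's given label. Whether Annotatability uses exactly these statistics is an assumption; the function names, thresholds and category labels are hypothetical.

```python
from statistics import mean, pstdev


def training_dynamics(probs_per_epoch):
    """Summarize how consistently each sample's given label was learned.

    probs_per_epoch[e][i] is the probability the network assigned to
    sample i's annotated label at the end of epoch e.
    Returns one (confidence, variability) pair per sample.
    """
    n_samples = len(probs_per_epoch[0])
    stats = []
    for i in range(n_samples):
        series = [epoch[i] for epoch in probs_per_epoch]
        stats.append((mean(series), pstdev(series)))
    return stats


def categorize(confidence, conf_hi=0.75, conf_lo=0.25):
    """Bucket a sample by mean confidence (thresholds are illustrative)."""
    if confidence >= conf_hi:
        return "easy-to-learn"
    if confidence <= conf_lo:
        return "hard-to-learn"  # candidate erroneous annotation
    return "ambiguous"  # candidate intermediate cell state


if __name__ == "__main__":
    # Three epochs, three cells; values are made up for illustration.
    probs = [
        [0.60, 0.40, 0.20],
        [0.90, 0.70, 0.10],
        [0.95, 0.30, 0.05],
    ]
    for conf, var in training_dynamics(probs):
        print(f"confidence={conf:.2f} variability={var:.2f} -> {categorize(conf)}")
```

Cells whose annotated label is never learned are the ones worth checking for annotation errors, while cells with intermediate confidence are candidates for ambiguous or transitional states.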
Pub Date: 2024-11-27 DOI: 10.1038/s43588-024-00733-1
Boming Kang, Rui Fan, Chunmei Cui, Qinghua Cui
Human essential proteins (HEPs) are indispensable for individual viability and development. However, experimental methods to identify HEPs are often costly, time-consuming and labor-intensive. In addition, existing computational methods predict HEPs only at the cell line level, but HEPs vary across living humans, cell lines and animal models. Here we develop a sequence-based deep learning model, Protein Importance Calculator (PIC), by fine-tuning a pretrained protein language model. PIC not only substantially outperforms existing methods for predicting HEPs but also provides comprehensive prediction results across three levels: human, cell line and mouse. Furthermore, we define the protein essential score, derived from PIC, to quantify human protein essentiality and validate its effectiveness by a series of biological analyses. We also demonstrate the biomedical value of the protein essential score by identifying potential prognostic biomarkers for breast cancer and quantifying the essentiality of 617,462 human microproteins.
Title: Comprehensive prediction and analysis of human protein essentiality based on a pretrained large language model
Nature Computational Science.
Pub Date: 2024-11-21 DOI: 10.1038/s43588-024-00742-0
We discuss the thirty-year anniversary of the seminal work on DNA computing and its implications for the field of biotechnology.
Title: Harnessing the power of DNA for computing
Nature Computational Science, volume 4, issue 11, page 801.
Open access PDF: https://www.nature.com/articles/s43588-024-00742-0.pdf
Pub Date: 2024-11-14 DOI: 10.1038/s43588-024-00725-1
Orestis A. Ntintas, Theodoros Daglis, Vassilis G. Gorgoulis
A recent study proposes DeepBlock, a deep learning-based approach for generating ligands with targeted properties, such as low toxicity and high affinity with the given target. This approach outperforms existing methods in the field while maintaining synthetic accessibility and drug-likeness.
Title: Harnessing deep learning to build optimized ligands
Nature Computational Science, volume 4, issue 11, pages 809-810.
Pub Date: 2024-11-11 DOI: 10.1038/s43588-024-00714-4
Nessim Raouraoua, Claudio Mirabello, Thibaut Véry, Christophe Blanchet, Björn Wallner, Marc F. Lensink, Guillaume Brysbaert
Massive sampling in AlphaFold enables access to increased structural diversity. In combination with its efficient confidence ranking, this unlocks elevated modeling capabilities for monomeric structures and, above all, for protein assemblies. However, the approach struggles with GPU cost and data storage. Here we introduce MassiveFold, an optimized and customizable version of AlphaFold that runs predictions in parallel, reducing the computing time from several months to hours. MassiveFold is scalable and able to run on anything from a single computer to a large GPU infrastructure, where it can fully benefit from all the computing nodes.
Although AlphaFold is very efficient for protein structure prediction, massive sampling is a very GPU-demanding task. MassiveFold overcomes this limitation by parallelizing the structure prediction computation.
Title: MassiveFold: unveiling AlphaFold’s hidden potential with optimized and parallelized massive sampling
Nature Computational Science, volume 4, issue 11, pages 824-828.
Open access PDF: https://www.nature.com/articles/s43588-024-00714-4.pdf
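The MassiveFold abstract above says that predictions are run in parallel to cut wall-clock time. The general pattern, splitting a large number of independent predictions into batches and dispatching the batches to workers, can be sketched as below. This is a generic illustration and not MassiveFold's interface: `make_batches`, `run_batch` and `run_all` are hypothetical names, and `run_batch` is a placeholder that returns labels instead of running a structure prediction.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(total_predictions, batch_size):
    """Split prediction indices 0..total-1 into (start, end) half-open ranges."""
    return [
        (start, min(start + batch_size, total_predictions))
        for start in range(0, total_predictions, batch_size)
    ]


def run_batch(batch):
    """Placeholder for one independent job (for example, one GPU task).

    A real job would launch structure inference for each index in the batch;
    here it only returns a label per prediction.
    """
    start, end = batch
    return [f"prediction_{i}" for i in range(start, end)]


def run_all(total_predictions, batch_size, workers=4):
    """Run all batches concurrently and gather results in index order."""
    batches = make_batches(total_predictions, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the input batches.
        results = list(pool.map(run_batch, batches))
    return [prediction for batch in results for prediction in batch]


if __name__ == "__main__":
    print(make_batches(10, 4))
    print(len(run_all(10, 4)))
```

Because the batches are independent, the same split works whether the workers are threads on one machine or separate jobs on a GPU cluster; only the dispatch layer changes.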