Martin Hemberg, Federico Marini, Shila Ghazanfar, Ahmad Al Ajami, Najla Abassi, Benedict Anchang, Bérénice A. Benayoun, Yue Cao, Ken Chen, Yesid Cuesta-Astroz, Zach DeBruine, Calliope A. Dendrou, Iwijn De Vlaminck, Katharina Imkeller, Ilya Korsunsky, Alex R. Lederer, Pieter Meysman, Clint Miller, Kerry Mullan, Uwe Ohler, Nikolaos Patikas, Jonas Schuck, Jacqueline HY Siu, Timothy J. Triche Jr., Alex Tsankov, Sander W. van der Laan, Masanao Yajima, Jean Yang, Fabio Zanini, Ivana Jelic
The field of single-cell biology is growing rapidly and is generating large amounts of data from a variety of species, disease conditions, tissues, and organs. Coordinated efforts such as CZI CELLxGENE, HuBMAP, the Broad Institute Single Cell Portal, and DISCO allow researchers to access large volumes of curated datasets. Although the majority of the data come from scRNA-seq experiments, a wide range of other modalities are represented as well. These resources have created an opportunity to build and expand the computational biology ecosystem, developing the tools needed for data reuse and for extracting novel biological insights. Here, we highlight achievements made so far, areas where further development is needed, and specific challenges that need to be overcome.
"Insights, opportunities and challenges provided by large cell atlases." arXiv - QuanBio - Genomics, 2024-08-13. https://doi.org/arxiv-2408.06563
Emerging evidence indicates that human cancers are intricately linked to human microbiomes. However, sample sizes are limited and a significant amount of data is lost during collection, so several machine learning methods have been proposed to address missing data. These methods have not fully utilized the known clinical information of patients to enhance the accuracy of data imputation. Therefore, we introduce mbVDiT, a novel pre-trained conditional diffusion model for microbiome data imputation and denoising, which uses the unmasked data and patient metadata as conditional guidance for imputing missing values. It also uses a VAE to integrate other public microbiome datasets to enhance model performance. The results on microbiome datasets from three different cancer types demonstrate the performance of our method in comparison with existing methods.
Xinyuan Shi, Fangfang Zhu, Wenwen Min. "Pretrained-Guided Conditional Diffusion Models for Microbiome Data Analysis." arXiv - QuanBio - Genomics, 2024-08-10. https://doi.org/arxiv-2408.07709
Wenwen Min, Zhen Wang, Fangfang Zhu, Taosheng Xu, Shunfang Wang
Single-cell RNA sequencing (scRNA-seq) data analysis is pivotal for understanding cellular heterogeneity. However, the high sparsity and complex noise patterns inherent in scRNA-seq data present significant challenges for traditional clustering methods. To address these issues, we propose a deep clustering method, Attention-Enhanced Structural Deep Embedding Graph Clustering (scASDC), which integrates multiple advanced modules to improve clustering accuracy and robustness. Our approach employs a multi-layer graph convolutional network (GCN), which we term the graph autoencoder module, to capture high-order structural relationships between cells. To mitigate the oversmoothing issue in GCNs, we introduce a ZINB-based autoencoder module that extracts content information from the data and learns latent representations of gene expression. These modules are further integrated through an attention fusion mechanism, ensuring effective combination of gene expression and structural information at each layer of the GCN. Additionally, a self-supervised learning module is incorporated to enhance the robustness of the learned embeddings. Extensive experiments demonstrate that scASDC outperforms existing state-of-the-art methods, providing a robust and effective solution for single-cell clustering tasks. Our method paves the way for more accurate and meaningful analysis of single-cell RNA sequencing data, contributing to a better understanding of cellular heterogeneity and biological processes. All code and public datasets used in this paper are available at https://github.com/wenwenmin/scASDC and https://zenodo.org/records/12814320.
"scASDC: Attention Enhanced Structural Deep Clustering for Single-cell RNA-seq Data." arXiv - QuanBio - Genomics, 2024-08-09. https://doi.org/arxiv-2408.05258
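The scASDC abstract above mentions a ZINB-based autoencoder. As a minimal sketch of what "ZINB" refers to, the following computes the zero-inflated negative binomial probability and negative log-likelihood for raw counts. The parameter names (`mu`, `theta`, `pi`) and the toy values are assumptions chosen for illustration; this is not taken from the scASDC code base, where such parameters would typically be produced per gene and per cell by the decoder.

```python
import math


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a negative binomial
    with mean mu and dispersion (inverse overdispersion) theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x: int, mu: float, theta: float, pi: float) -> float:
    """Zero-inflated NB: with probability pi the count is a structural zero,
    otherwise it is drawn from NB(mu, theta)."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


def zinb_nll(counts, mu: float, theta: float, pi: float) -> float:
    """Negative log-likelihood of a list of counts under one shared ZINB."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)


# Toy usage with invented parameters (illustrative only).
sparse_counts = [0, 0, 0, 1, 0, 4, 0, 2]
loss = zinb_nll(sparse_counts, mu=2.0, theta=1.5, pi=0.3)
```

The extra mass `pi` placed on zero is what lets the model account for the dropout-driven sparsity that the abstract describes.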
Donghai Fang, Fangfang Zhu, Dongting Xie, Wenwen Min
With the rapid advancement of Spatially Resolved Transcriptomics (SRT) technology, it is now possible to comprehensively measure gene transcription while preserving the spatial context of tissues. Spatial domain identification and gene denoising are key objectives in SRT data analysis. We propose a Contrastively Augmented Masked Graph Autoencoder (STMGAC) to learn low-dimensional latent representations for domain identification. In the latent space, persistent signals for representations are obtained through self-distillation to guide self-supervised matching. At the same time, positive and negative anchor pairs are constructed using triplet learning to augment the discriminative ability. We evaluated the performance of STMGAC on five datasets, achieving results superior to those of existing baseline methods. All code and public datasets used in this paper are available at https://github.com/wenwenmin/STMGAC and https://zenodo.org/records/13253801.
"Masked Graph Autoencoders with Contrastive Augmentation for Spatially Resolved Transcriptomics Data." arXiv - QuanBio - Genomics, 2024-08-09. https://doi.org/arxiv-2408.06377
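The STMGAC abstract above refers to triplet learning over positive and negative anchor pairs. A minimal sketch of the standard triplet margin loss is given below, assuming Euclidean distance and a margin of 1.0. Both choices, and the toy embeddings, are assumptions for illustration; the abstract does not specify how STMGAC defines its distance, margin, or pair construction.

```python
import math


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Standard triplet margin loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy latent embeddings (invented): an easy triplet and a hard one.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
easy_negative = [3.0, 4.0]   # far from the anchor, loss is zero
hard_negative = [1.0, 0.0]   # as close as the positive, loss equals the margin

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Minimizing this loss pulls spots assumed to share a domain together in latent space and pushes the others apart, which is the "discriminative ability" the abstract mentions.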
Sina Tabakhi, Charlotte Vandermeulen, Ian Sudbery, Haiping Lu
The increase in high-dimensional multiomics data demands advanced integration models to capture the complexity of human diseases. Graph-based deep learning integration models, despite their promise, struggle with small patient cohorts and high-dimensional features, often applying independent feature selection without modeling relationships among omics. Furthermore, conventional graph-based omics models focus on homogeneous graphs, lacking multiple types of nodes and edges to capture diverse structures. We introduce a Heterogeneous Graph ATtention network for omics integration (HeteroGATomics) to improve cancer diagnosis. HeteroGATomics performs joint feature selection through a multi-agent system, creating dedicated networks of feature and patient similarity for each omic modality. These networks are then combined into one heterogeneous graph for learning holistic omic-specific representations and integrating predictions across modalities. Experiments on three cancer multiomics datasets demonstrate HeteroGATomics' superior performance in cancer diagnosis. Moreover, HeteroGATomics enhances interpretability by identifying important biomarkers contributing to the diagnosis outcomes.
"Heterogeneous graph attention network improves cancer multiomics integration." arXiv - QuanBio - Genomics, 2024-08-05. https://doi.org/arxiv-2408.02845
Variant calling refinement is crucial for distinguishing true genetic variants from technical artifacts in high-throughput sequencing data. Manual review is time-consuming while heuristic filtering often lacks optimal solutions. Traditional variant calling methods often struggle with accuracy, especially in regions of low read coverage, leading to false-positive or false-negative calls. Here, we introduce VariantTransformer, a Transformer-based deep learning model, designed to automate variant calling refinement directly from VCF files in low-coverage data (10-15X). VariantTransformer, trained on two million variants, including SNPs and short InDels, from low-coverage sequencing data, achieved an accuracy of 89.26% and a ROC AUC of 0.88. When integrated into conventional variant calling pipelines, VariantTransformer outperformed traditional heuristic filters and approached the performance of state-of-the-art AI-based variant callers like DeepVariant. Comparative analysis demonstrated VariantTransformer's superiority in functionality, variant type coverage, training size, and input data type. VariantTransformer represents a significant advancement in variant calling refinement for low-coverage genomic studies.
Omar Abdelwahab, Davoud Torkamaneh. "Refinement of genetic variants needs attention." arXiv - QuanBio - Genomics, 2024-08-01. https://doi.org/arxiv-2408.00659
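The VariantTransformer abstract above reports an accuracy and a ROC AUC. As a reminder of what these two metrics measure for a true-variant versus artifact classifier, here is a small dependency-free sketch. The labels, scores, and the 0.5 threshold are invented toy values and have no connection to the paper's data or results.

```python
def accuracy(labels, predictions) -> float:
    """Fraction of calls whose predicted class matches the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores) -> float:
    """ROC AUC as the probability that a randomly chosen true variant
    scores higher than a randomly chosen artifact (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = true variant, 0 = technical artifact (invented values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, predictions)   # 4 of 6 calls are correct
auc = roc_auc(labels, scores)         # 8 of 9 positive/negative pairs are ranked correctly
```

Accuracy depends on the chosen threshold, while ROC AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.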
Boyi Guo, Wodan Ling, Sang Ho Kwon, Pratibha Panwar, Shila Ghazanfar, Keri Martinowich, Stephanie C. Hicks
Advances in spatially-resolved transcriptomics (SRT) technologies have propelled the development of new computational analysis methods to unlock biological insights. As the cost of generating these data decreases, these technologies provide an exciting opportunity to create large-scale atlases that integrate SRT data across multiple tissues, individuals, species, or phenotypes to perform population-level analyses. Here, we describe unique challenges of varying spatial resolutions in SRT data, as well as highlight the opportunities for standardized preprocessing methods along with computational algorithms amenable to atlas-scale datasets leading to improved sensitivity and reproducibility in the future.
"Integrating spatially-resolved transcriptomics data across tissues and individuals: challenges and opportunities." arXiv - QuanBio - Genomics, 2024-08-01. https://doi.org/arxiv-2408.00367
Michael Hartung, Andreas Maier, Fernando Delgado-Chaves, Yuliya Burankova, Olga I. Isaeva, Fábio Malta de Sá Patroni, Daniel He, Casey Shannon, Katharina Kaufmann, Jens Lohmann, Alexey Savchik, Anne Hartebrodt, Zoe Chervontseva, Farzaneh Firoozbakht, Niklas Probul, Evgenia Zotova, Olga Tsoy, David B. Blumenthal, Martin Ester, Tanja Laske, Jan Baumbach, Olga Zolotareva
Most complex diseases, including cancer and non-malignant diseases like asthma, have distinct molecular subtypes that require distinct clinical approaches. However, existing computational patient stratification methods have been benchmarked almost exclusively on cancer omics data and only perform well when mutually exclusive subtypes can be characterized by many biomarkers. Here, we contribute a large-scale evaluation, quantitatively exploring the power of 22 unsupervised patient stratification methods using both simulated and real transcriptome data. Building on this experience, we developed UnPaSt (https://apps.cosy.bio/unpast/), which optimizes unsupervised patient stratification and works even with only a limited number of subtype-predictive biomarkers. We evaluated all 23 methods on real-world breast cancer and asthma transcriptomics data. Although many methods reliably detected major breast cancer subtypes, only a few identified Th2-high asthma, and UnPaSt significantly outperformed its closest competitors in both test datasets. In summary, we showed that UnPaSt can detect many biologically insightful and reproducible patterns in omic datasets.
"UnPaSt: unsupervised patient stratification by differentially expressed biclusters in omics data." arXiv - QuanBio - Genomics, 2024-07-31. https://doi.org/arxiv-2408.00200
Monica Isgut, Andrew Hornback, Yunan Luo, Asma Khimani, Neha Jain, May D. Wang
Polygenic risk scores (PRSs) can significantly enhance breast cancer risk prediction when combined with clinical risk factor data. While many studies have explored the value-add of PRSs, little is known about the potential impact of gene-by-gene or gene-by-environment interactions towards enhancing the risk discrimination capabilities of multi-modal models combining PRSs with clinical data. In this study, we integrated data on 318 individual genotype variants along with clinical data in a neural network to explore whether gene-by-gene (i.e., between individual variants) and/or gene-by-environment (between clinical risk factors and variants) interactions could be leveraged jointly during training to improve breast cancer risk prediction performance. We benchmarked our approach against a baseline model combining traditional univariate PRSs with clinical data in a logistic regression model and ran an interpretability analysis to identify feature interactions. While our model did not demonstrate improved performance over the baseline, we discovered 248 (<1%) statistically significant gene-by-gene and gene-by-environment interactions out of the ~53.6k possible feature pairs, the most contributory of which included rs6001930 (MKL1) and rs889312 (MAP3K1), with age and menopause being the most heavily interacting non-genetic risk factors. We also modeled the significant interactions as a network of highly connected features, suggesting that potential higher-order interactions are captured by the model. Although gene-by-environment (or gene-by-gene) interactions did not enhance breast cancer risk prediction performance in neural networks, our study provides evidence that these interactions can be leveraged by these models to inform their predictions. This study represents the first application of neural networks to screen for interactions impacting breast cancer risk using real-world data.
"Are gene-by-environment interactions leveraged in multi-modality neural networks for breast cancer prediction?" arXiv - QuanBio - Genomics, 2024-07-30. https://doi.org/arxiv-2407.20978
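The figures quoted in the abstract above (248 significant interactions, under 1% of about 53.6k possible feature pairs) can be checked with a short calculation. The abstract gives 318 genotype variants but does not state the number of clinical features. The value of 10 used below is an assumption, chosen only because it reproduces a pair count close to 53.6k.

```python
from math import comb


def count_feature_pairs(n_features: int) -> int:
    """Number of unordered pairs among n_features features."""
    return comb(n_features, 2)


n_variants = 318    # stated in the abstract
n_clinical = 10     # ASSUMPTION: not stated in the abstract
n_pairs = count_feature_pairs(n_variants + n_clinical)   # 328 * 327 / 2 = 53628

n_significant = 248  # stated in the abstract
fraction = n_significant / n_pairs                        # roughly 0.46%, i.e. under 1%
```

Under this assumption the significant interactions make up less than half a percent of all candidate pairs, consistent with the "<1%" in the abstract.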
PyamilySeq is a Python-based tool designed for interpretable gene clustering and pangenomic inference, supporting analyses at both species and genus levels. It facilitates the clustering of gene sequences into families based on sequence similarity using CD-HIT, and can take the output of tried-and-tested sequence clustering tools such as CD-HIT, BLAST, DIAMOND, and MMseqs2. PyamilySeq is distinctive in its ability to integrate new sequences into existing clusters, providing a robust framework for iterative analysis while preserving the original clusters, which is useful when reannotating genomes. In addition to the standard Species mode, which, as with other tools, performs core-gene analysis across a species range, PyamilySeq can be run in Genus mode, where it detects the presence of gene families shared across multiple genera. These features enhance the tool's applicability for ongoing and past genomic studies and comparative analyses. PyamilySeq generates comprehensive outputs, including gene presence-absence matrices and aligned sequence data, enabling downstream analysis and interpretation of the identified gene groups and pangenomic data.
Nicholas J. Dimonaco. "PyamilySeq: A Python Tool for Interpretable Gene (Re)Clustering and Pangenomic Inference Across Species and Genera." arXiv - QuanBio - Genomics, 2024-07-27. https://doi.org/arxiv-2407.19328
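The PyamilySeq abstract above mentions gene presence-absence matrices and core-gene analysis. The sketch below shows the general idea of deriving both from clustered gene families. The function names, the input layout, and the genome and gene identifiers are hypothetical, invented for illustration; this is not PyamilySeq's actual API or file format.

```python
def presence_absence(clusters, genomes):
    """Build a presence-absence row (1/0 per genome) for each gene family.

    clusters: dict mapping family name -> list of (genome, gene_id) members.
    genomes:  ordered list of genome names defining the matrix columns.
    """
    matrix = {}
    for family, members in clusters.items():
        present = {genome for genome, _gene in members}
        matrix[family] = [1 if g in present else 0 for g in genomes]
    return matrix


def core_families(matrix, min_fraction: float = 1.0):
    """Families present in at least min_fraction of genomes (1.0 = strict core)."""
    return sorted(f for f, row in matrix.items() if sum(row) / len(row) >= min_fraction)


# Hypothetical clustering result for three genomes.
genomes = ["genomeA", "genomeB", "genomeC"]
clusters = {
    "family_1": [("genomeA", "a_001"), ("genomeB", "b_014"), ("genomeC", "c_007")],
    "family_2": [("genomeA", "a_002"), ("genomeC", "c_020")],
    "family_3": [("genomeB", "b_003"), ("genomeB", "b_004")],  # paralogs in one genome
}

matrix = presence_absence(clusters, genomes)
core = core_families(matrix)   # only family_1 is present in every genome
```

Lowering `min_fraction` (for example to 0.95) gives a "soft core" definition, a common relaxation when some genomes are incomplete.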