In this paper, we introduce QuST-LLM, an extension of QuPath that uses large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. The tool makes intricate, high-dimensional ST data easier to handle by offering a comprehensive workflow that covers data loading, region selection, gene expression analysis, and functional annotation. QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby improving the interpretability of ST data. As a result, users can interact with their own ST data using natural language. QuST-LLM thus gives researchers a means to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research.
{"title":"QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis","authors":"Chao Hui Huang","doi":"arxiv-2406.14307","DOIUrl":"https://doi.org/arxiv-2406.14307","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-20"}
Tomasz Strzoda, Lourdes Cruz-Garcia, Mustafa Najim, Christophe Badie, Joanna Polanska
In unforeseen situations, such as nuclear power plant or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. A new-generation in-situ mapper that is fast, consumes little energy, and works at the level of single nanopore output is in demand. We aim to create a sequence identification tool that utilizes Natural Language Processing (NLP) techniques and ensures a high negative predictive value (NPV) compared to the classical approach. The training dataset consisted of RNASeq data from 6 samples. Of the multiple NLP models tested, the best configuration analyses the entire sequence and uses a word length of 3 base pairs with one neighboring word on each side. For the considered FDXR gene, the achieved mean balanced accuracy (BACC) was 98.29% and the NPV 99.25%, compared to minimap2's performance in a cross-validation scenario. Reducing the dictionary from 1024 to 145 changed the BACC to 96.49% and the NPV to 98.15%. The obtained NLP model, validated on an external independent genome sequencing dataset, gave an NPV of 99.64% for the complete dictionary and 95.87% for the reduced one. The salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one. We conclude that for long Oxford Nanopore reads, an NLP-based approach can successfully replace classical mapping in an emergency. The developed NLP model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. Our results demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences and represent a significant advancement in this field of science.
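The word-based representation and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say whether words overlap or how a trailing partial word is treated, so the sketch assumes non-overlapping 3-base words and drops any remainder. All function names are hypothetical, and the confusion-matrix counts in the usage note are made up.

```python
from typing import Iterator, List, Tuple


def split_into_words(sequence: str, word_length: int = 3) -> List[str]:
    """Cut a nucleotide sequence into consecutive words.

    Assumption: words do not overlap and a trailing partial word is dropped.
    """
    usable = len(sequence) - len(sequence) % word_length
    return [sequence[i:i + word_length] for i in range(0, usable, word_length)]


def context_windows(words: List[str], neighbors: int = 1) -> Iterator[Tuple[str, List[str]]]:
    """Yield each word with up to `neighbors` words on each side of it."""
    for i, center in enumerate(words):
        left = words[max(0, i - neighbors):i]
        right = words[i + 1:i + 1 + neighbors]
        yield center, left + right


def negative_predictive_value(tn: int, fn: int) -> float:
    """NPV = TN / (TN + FN): the share of negative calls that are correct."""
    return tn / (tn + fn)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """BACC = mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy read: 10 bases give three complete 3-base words.
words = split_into_words("ACGTTGCAAC")
windows = list(context_windows(words))
```

Here `words` is `["ACG", "TTG", "CAA"]`, and the middle entry of `windows` pairs `"TTG"` with its two neighbors. With invented counts, `negative_predictive_value(tn=90, fn=10)` gives 0.9.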
{"title":"A mapping-free NLP-based technique for sequence search in Nanopore long-reads","authors":"Tomasz Strzoda, Lourdes Cruz-Garcia, Mustafa Najim, Christophe Badie, Joanna Polanska","doi":"arxiv-2406.14187","DOIUrl":"https://doi.org/arxiv-2406.14187","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-20"}
Rishabh Anand, Chaitanya K. Joshi, Alex Morehead, Arian R. Jamasb, Charles Harris, Simon V. Mathis, Kieran Didi, Bryan Hooi, Pietro Liò
We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon SE(3) flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue). Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score >= 0.45, at which two RNAs have the same global fold. Open-source code: https://github.com/rish-16/rna-backbone-design
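The validity criterion quoted above reduces to a simple pass rate over self-consistency TM-scores. The Python sketch below shows only that final thresholding step, assuming the scores have already been produced by the inverse-folding and forward-folding pipeline. The function name and the example scores are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence

VALIDITY_THRESHOLD = 0.45  # self-consistency TM-score cut-off quoted in the abstract


def validity_rate(sc_tm_scores: Sequence[float],
                  threshold: float = VALIDITY_THRESHOLD) -> float:
    """Return the fraction of generated backbones whose score meets the threshold."""
    if not sc_tm_scores:
        raise ValueError("need at least one score")
    passed = sum(1 for score in sc_tm_scores if score >= threshold)
    return passed / len(sc_tm_scores)


# Made-up scores for five generated backbones; three are >= 0.45.
example_scores = [0.12, 0.45, 0.61, 0.30, 0.52]
rate = validity_rate(example_scores)
```

The comparison is inclusive (`>=`), matching the ">= 0.45" wording of the abstract.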
{"title":"RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design","authors":"Rishabh Anand, Chaitanya K. Joshi, Alex Morehead, Arian R. Jamasb, Charles Harris, Simon V. Mathis, Kieran Didi, Bryan Hooi, Pietro Liò","doi":"arxiv-2406.13839","DOIUrl":"https://doi.org/arxiv-2406.13839","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-19"}
Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang
Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, and is crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intensive and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive dataset comprising approximately 30 species of viruses and bacteria, including the ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
{"title":"PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model","authors":"Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang","doi":"arxiv-2406.13133","DOIUrl":"https://doi.org/arxiv-2406.13133","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-19"}
Xiaolei Brian Zhang, Grace Oualline, Jim Shaw, Yun William Yu
Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health because MGEs can harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often disagree on what should count as one. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65,000 representative genomes in a few minutes using 19 GB of memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10 kbp), recall was 48% and 47% for skandiver, 59% and 17% for MobileElementFinder, and 86% and 32% for geNomad, respectively. On isolated large plasmids, skandiver's recall is therefore lower than that of the state-of-the-art reference-based methods geNomad and MobileElementFinder. However, skandiver achieves higher recall on integrated plasmids and, unlike the other methods, does so without comparing against a curated database, making it suitable for the discovery of novel MGEs. Availability: https://github.com/YoukaiFromAccounting/skandiver
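One plausible reading of the divergence-based idea is that a genome fragment becomes a candidate MGE when it is nearly identical to a sequence from a species that diverged from the query long ago. The Python sketch below illustrates that reading only. It does not call skani or reproduce skandiver's code, and the class, function names, and both thresholds are hypothetical placeholders rather than values from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FragmentHit:
    fragment_id: str
    ani_percent: float          # ANI between the fragment and a matched genome
    divergence_time_mya: float  # divergence time between query and matched species


def fragment_genome(sequence: str, fragment_length: int) -> List[str]:
    """Split an assembly into consecutive fragments (the last one may be shorter)."""
    return [sequence[i:i + fragment_length]
            for i in range(0, len(sequence), fragment_length)]


def flag_candidate_mges(hits: Iterable[FragmentHit],
                        min_ani: float = 95.0,            # hypothetical threshold
                        min_divergence_mya: float = 100.0  # hypothetical threshold
                        ) -> List[str]:
    """Return IDs of fragments that are highly similar to a distantly related genome."""
    flagged = {hit.fragment_id for hit in hits
               if hit.ani_percent >= min_ani
               and hit.divergence_time_mya >= min_divergence_mya}
    return sorted(flagged)


# Made-up hits: only frag2 combines high identity with a distant relative.
hits = [
    FragmentHit("frag1", ani_percent=99.0, divergence_time_mya=5.0),
    FragmentHit("frag2", ani_percent=98.5, divergence_time_mya=800.0),
    FragmentHit("frag3", ani_percent=82.0, divergence_time_mya=800.0),
]
candidates = flag_candidate_mges(hits)
```

In practice the ANI values would come from an ANI tool and the divergence times from an external source. Both are supplied by hand here to keep the sketch self-contained.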
{"title":"skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements","authors":"Xiaolei Brian Zhang, Grace Oualline, Jim Shaw, Yun William Yu","doi":"arxiv-2406.12064","DOIUrl":"https://doi.org/arxiv-2406.12064","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-17"}
Huiming Xia, My Hoang, Evelyn Schmidt, Susanna Kiwala, Joshua McMichael, Zachary L. Skidmore, Bryan Fisk, Jonathan J. Song, Jasreet Hundal, Thomas Mooney, Jason R. Walker, S. Peter Goedegebuure, Christopher A. Miller, William E. Gillanders, Obi L. Griffith, Malachi Griffith
Neoantigen-targeting therapies, including personalized vaccines, have shown promise in the treatment of cancers. Accurate identification and prioritization of neoantigens is highly relevant to designing clinical trials, predicting treatment response, and understanding mechanisms of resistance. With the advent of massively parallel sequencing technologies, it is now possible to predict neoantigens based on patient-specific variant information. However, numerous factors must be considered when prioritizing neoantigens for use in personalized therapies. Complexities such as alternative transcript annotations, various binding, presentation and immunogenicity prediction algorithms, and variable peptide lengths/registers all potentially impact the neoantigen selection process. While computational tools generate numerous algorithmic predictions for neoantigen characterization, results from these pipelines are difficult to navigate and require extensive knowledge of the underlying tools for accurate interpretation. Due to the intricate nature and number of salient neoantigen features, presenting all relevant information to facilitate candidate selection for downstream applications is a challenge that current tools fail to address. We have created pVACview, the first interactive tool designed to aid in the prioritization and selection of neoantigen candidates for personalized neoantigen therapies. pVACview has a user-friendly and intuitive interface where users can upload, explore, select and export their neoantigen candidates. The tool allows users to visualize candidates using variant, transcript and peptide information. pVACview will allow researchers to analyze and prioritize neoantigen candidates with greater efficiency and accuracy in basic and translational settings. The application is available as part of the pVACtools pipeline at pvactools.org and as an online server at pvacview.org.
{"title":"pVACview: an interactive visualization tool for efficient neoantigen prioritization and selection","authors":"Huiming Xia, My Hoang, Evelyn Schmidt, Susanna Kiwala, Joshua McMichael, Zachary L. Skidmore, Bryan Fisk, Jonathan J. Song, Jasreet Hundal, Thomas Mooney, Jason R. Walker, S. Peter Goedegebuure, Christopher A. Miller, William E. Gillanders, Obi L. Griffith, Malachi Griffith","doi":"arxiv-2406.06985","DOIUrl":"https://doi.org/arxiv-2406.06985","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-11"}
Recent advancements in single-cell genomics necessitate precision in gene panel selection to interpret complex biological data effectively. Gene panel selection methods aim to streamline the analysis of scRNA-seq data by focusing on the most informative genes that contribute significantly to the specific analysis task. Traditional selection methods, which often rely on expert domain knowledge, embedded machine learning models, or heuristic-based iterative optimization, are prone to biases and inefficiencies that may obscure critical genomic signals. Recognizing these limitations, we aim to overcome them with a refined strategy. In this study, we introduce an iterative gene panel selection strategy that is applicable to clustering tasks in single-cell genomics. Our method integrates results from other gene selection algorithms, using them as preliminary boundaries or prior knowledge that guide the initial search and enhance the efficiency of our framework. Furthermore, we incorporate the stochastic nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization through reward-based feedback. This combination mitigates the biases inherent in the initial boundaries and harnesses RL's adaptability to refine and target gene panel selection dynamically. To illustrate the effectiveness of our method, we conducted detailed comparative experiments, case studies, and visualization analysis.
{"title":"Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization","authors":"Weiliang Zhang, Zhen Meng, Dongjie Wang, Min Wu, Kunpeng Liu, Yuanchun Zhou, Meng Xiao","doi":"arxiv-2406.07418","DOIUrl":"https://doi.org/arxiv-2406.07418","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-11"}
Daigo Okada, Jianshen Zhu, Kan Shota, Yuuki Nishimura, Kazuya Haraguchi
While single-cell RNA-seq enables the investigation of the cell type effect on the transcriptome, the pure tissue environmental effect has not been well investigated. Because tissues and cell types co-occur in the body in a biased way, it is difficult to evaluate the effect of the pure tissue environment by omics data mining. It is important to prevent statistical confounding among discrete variables such as cell type, tissue, and other categorical variables when evaluating the effects of these variables. We propose a novel method to enumerate suitable analysis units of variables for estimating the effects of tissue environment by extending the maximal biclique enumeration problem for bipartite graphs to $k$-partite hypergraphs. We applied the proposed method to a large mouse single-cell transcriptome dataset, Tabula Muris Senis, to evaluate pure tissue environmental effects on gene expression. Data mining using the proposed method revealed pure tissue environment effects on gene expression and their age-related change among adipose sub-tissues. The method proposed in this study supports the evaluation of the effects of discrete variables in exploratory data mining of large-scale genomics datasets.
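The bipartite base case that the method extends can be made concrete with a small brute-force enumerator in Python. The sketch covers only maximal bicliques in an ordinary tissue-by-cell-type bipartite graph, not the $k$-partite hypergraph generalization proposed in the paper, and it is exponential in the number of tissues, so it suits toy inputs only. The function name and the example graph are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Biclique = Tuple[FrozenSet[str], FrozenSet[str]]


def maximal_bicliques(adjacency: Dict[str, Set[str]]) -> List[Biclique]:
    """Enumerate maximal bicliques with both sides non-empty.

    `adjacency` maps each left vertex (here a tissue) to the set of right
    vertices (here cell types) observed together with it.
    """
    left_vertices = sorted(adjacency)
    found: Set[Biclique] = set()
    for size in range(1, len(left_vertices) + 1):
        for subset in combinations(left_vertices, size):
            # Right vertices shared by every left vertex in the subset.
            common = set.intersection(*(adjacency[v] for v in subset))
            if not common:
                continue
            # Close the left side: every left vertex adjacent to all of `common`.
            closed_left = frozenset(v for v in left_vertices
                                    if common <= adjacency[v])
            found.add((closed_left, frozenset(common)))
    return sorted(found, key=lambda pair: (sorted(pair[0]), sorted(pair[1])))


# Made-up observations of which cell types were sampled in which tissues.
observed = {
    "fat": {"T cell", "B cell"},
    "lung": {"T cell", "B cell", "macrophage"},
    "liver": {"macrophage"},
}
units = maximal_bicliques(observed)
```

Each returned pair is a set of tissues together with a set of cell types that are all observed in all of those tissues. For example, `({"fat", "lung"}, {"T cell", "B cell"})` is a unit in which tissue can be compared while the cell types are held fixed.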
{"title":"Data mining method of single-cell omics data to evaluate a pure tissue environmental effect on gene expression level","authors":"Daigo Okada, Jianshen Zhu, Kan Shota, Yuuki Nishimura, Kazuya Haraguchi","doi":"arxiv-2406.06969","DOIUrl":"https://doi.org/arxiv-2406.06969","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-11"}
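The bipartite base case named in the abstract above (maximal biclique enumeration) can be illustrated with a small brute-force sketch. This is not the authors' algorithm, and it does not cover the $k$-partite hypergraph extension; the function name `maximal_bicliques` and the toy tissue/cell type table are hypothetical and chosen only for illustration. Here an edge means "this cell type was observed in this tissue", so a maximal biclique is a fully crossed set of tissues and cell types that cannot be enlarged.

```python
from itertools import combinations


def maximal_bicliques(observed):
    """Enumerate maximal bicliques of a bipartite graph by brute force.

    `observed` maps each left vertex (here: a tissue) to the set of right
    vertices (here: cell types) it is connected to. Exponential in the
    number of tissues, so only suitable for small illustrations.
    """
    tissues = sorted(observed)
    found = set()
    for size in range(1, len(tissues) + 1):
        for subset in combinations(tissues, size):
            # Cell types shared by every tissue in the subset.
            shared = set.intersection(*(set(observed[t]) for t in subset))
            if not shared:
                continue
            # Close the tissue side: every tissue containing all shared cell types.
            closed = frozenset(t for t in tissues if shared <= set(observed[t]))
            found.add((closed, frozenset(shared)))
    return found


# Hypothetical toy data: which cell types were observed in which tissue.
toy = {
    "fat": {"T cell", "B cell", "endothelial"},
    "lung": {"T cell", "B cell"},
    "liver": {"B cell", "hepatocyte"},
}
result = maximal_bicliques(toy)

# Assumed filter (not from the source): keep units with at least two
# tissues and two cell types, so both factors vary within the unit.
crossed = [(t, c) for t, c in result if len(t) >= 2 and len(c) >= 2]
for tissue_set, cell_type_set in crossed:
    print(sorted(tissue_set), sorted(cell_type_set))
```

In this toy table, the only unit that passes the assumed filter pairs the tissues "fat" and "lung" with the cell types "B cell" and "T cell". Within such a unit every tissue contains every cell type, which is what allows a tissue effect to be compared without being confounded by cell type composition.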
Recent advances in multi-modal algorithms have driven, and been driven by, the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing both cancerous and healthy regions, while the accompanying text specifies only that the image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with a gene expression profile of 15,000-30,000 dimensions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology and beyond.
{"title":"STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics","authors":"Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li","doi":"arxiv-2406.06393","DOIUrl":"https://doi.org/arxiv-2406.06393","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-10"}
Zicheng Liu, Jiahui Li, Siyuan Li, Zelin Zang, Cheng Tan, Yufei Huang, Yajing Bai, Stan Z. Li
The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their use across a spectrum of downstream applications. Despite advancements, the lack of an evaluation framework makes it difficult to ensure equitable assessment, owing to differences in experimental settings, model intricacy, and benchmark datasets, as well as reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We systematically evaluate datasets spanning diverse biological domains, with particular emphasis on both short-range and long-range genomic tasks, beginning with the three most important DNA tasks, covering Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFMs.
{"title":"GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models","authors":"Zicheng Liu, Jiahui Li, Siyuan Li, Zelin Zang, Cheng Tan, Yufei Huang, Yajing Bai, Stan Z. Li","doi":"arxiv-2406.01627","DOIUrl":"https://doi.org/arxiv-2406.01627","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-06-01"}