Pub Date: 2025-02-06 | DOI: 10.1186/s12859-025-06062-y
Francesco Caredda, Andrea Pagnani
Proteins are involved in nearly all cellular functions, including transport, signaling, and enzymatic activity. Their function depends crucially on their complex three-dimensional structure. For this reason, predicting structure from the amino acid sequence has long been a formidable computational challenge, one that AlphaFold addressed with unprecedented accuracy. However, the inherent complexity of AlphaFold's architecture makes it difficult to understand the rules that ultimately shape the predicted structure. This study investigates a single-layer unsupervised model based on the attention mechanism. More precisely, we explore a Direct Coupling Analysis (DCA) method that mimics the attention mechanism of several popular Transformer architectures, including AlphaFold itself. The model's parameters, notably fewer than those of standard DCA-based algorithms, can be used directly to extract structural determinants such as the contact map of the protein family under study. Additionally, the functional form of the model's energy function enables a multi-family learning strategy that integrates information across multiple protein families, whereas standard DCA algorithms are typically limited to a single family. Finally, we implemented a generative version of the model using an autoregressive architecture, capable of efficiently generating new proteins in silico.
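As a rough illustration of what a DCA model that "mimics the attention mechanism" can look like, the block below writes a Potts-style energy whose coupling tensor is factored through a softmax attention map. The abstract does not state the functional form, so every symbol here is an assumption introduced for illustration: H attention heads, query and key matrices Q^h and K^h of inner dimension d, a value matrix V^h over an alphabet of q symbols, and sequence length L. Single-site fields are left out for brevity.

```latex
E(a_1,\dots,a_L) = -\sum_{i<j} J_{ij}(a_i,a_j),
\qquad
J_{ij}(a,b) = \sum_{h=1}^{H}
  \operatorname{softmax}_{j}\!\Big(\sum_{n=1}^{d} Q^{h}_{in} K^{h}_{jn}\Big)\,
  V^{h}(a,b)
```

Under this assumed parameterization the model carries H(2Ld + q^2) parameters, compared with roughly L(L-1)q^2/2 couplings in a full Potts model, which is one way the "notably fewer" parameters mentioned above could arise.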
Title: Direct coupling analysis and the attention mechanism.
Journal: BMC Bioinformatics 26(1):41 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11804077/pdf/
Pub Date: 2025-02-06 | DOI: 10.1186/s12859-025-06066-8
Bo Guan, Guangdi Chu, Ziying Wang, Jianmin Li, Bo Yi
Background: Accurate segmentation and classification of cell nuclei are crucial for histopathological image analysis. However, existing deep neural network-based methods often struggle to capture complex morphological features and global spatial distributions of cell nuclei due to their reliance on local receptive fields.
Methods: This study proposes a graph neural structure encoding framework based on a vision-language model. The framework incorporates: (1) A multi-scale feature fusion and knowledge distillation module utilizing the Contrastive Language-Image Pre-training (CLIP) model's image encoder; (2) A method to transform morphological features of cells into textual descriptions for semantic representation; and (3) A graph neural network approach to learn spatial relationships and contextual information between cell nuclei.
Results: Experimental results demonstrate that the proposed method significantly improves the accuracy of cell nucleus segmentation and classification compared to existing approaches. The framework effectively captures complex nuclear structures and global distribution features, leading to enhanced performance in histopathological image analysis.
Conclusions: By deeply mining the morphological features of cell nuclei and their spatial topological relationships, our graph neural structure encoding framework achieves high-precision nuclear segmentation and classification. This approach shows significant potential for enhancing histopathological image analysis, potentially leading to more accurate diagnoses and improved understanding of cellular structures in pathological tissues.
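The graph component described in the Methods can be pictured with a small sketch: each nucleus becomes a node, and edges connect spatially close nuclei so that a graph neural network can pass information between neighbours. The k-nearest-neighbour rule, the function name `knn_graph`, and the toy centroids are all assumptions made for illustration; the abstract does not say how the authors build their graph.

```python
import math

def knn_graph(centroids, k=2):
    """Return an undirected edge set linking each nucleus to its k nearest neighbours.

    `centroids` is a list of (x, y) nucleus centres (hypothetical input format).
    """
    edges = set()
    for i, (xi, yi) in enumerate(centroids):
        # Distances from nucleus i to every other nucleus, nearest first.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centroids)
            if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # store each edge once
    return edges

# Three nuclei close together and one far away.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 9.0)]
edges = knn_graph(centroids, k=1)
```

The resulting edge set would typically be turned into an adjacency list or matrix and paired with per-nucleus feature vectors before being fed to a graph neural network.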
Title: Instance-level semantic segmentation of nuclei based on multimodal structure encoding.
Journal: BMC Bioinformatics 26(1):42 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11804060/pdf/
Pub Date: 2025-02-06 | DOI: 10.1186/s12859-025-06067-7
Dimitrios Kioroglou
Multi-omic integration involves the management of diverse omic datasets. Effective analysis of these datasets requires a data management system that meets a specific set of requirements: rapid storage and retrieval of data with varying numbers of features and mixed data types, reliable and secure database transactions, row-wise and column-wise extension of stored data, and easy data distribution. SQLite and DuckDB are embedded databases that fulfil these requirements. However, both rely on Structured Query Language (SQL), which hinders adoption by users unfamiliar with it and complicates repetitive tasks because queries must be written by hand. This study presents Omilayers, a Python package that wraps these two databases and exposes a subset of their functionality geared towards frequent and repetitive analytical procedures. Synthetic data were used to demonstrate the use of Omilayers and to compare the performance of SQLite and DuckDB.
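To make the idea of hiding SQL behind a few Python calls concrete, here is a minimal sketch built on the standard-library `sqlite3` module. It is not the Omilayers API: the helper names `store_layer` and `select_columns`, the table name, and the toy values are hypothetical and only show the general pattern of generating the SQL inside a wrapper.

```python
import sqlite3

def store_layer(conn, name, columns, rows):
    """Create (or replace) a table called `name` and bulk-insert `rows`.

    Table and column names are interpolated into the SQL, so they must
    come from trusted code, not from user input.
    """
    col_defs = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commits on success, rolls back on error
        conn.execute(f'DROP TABLE IF EXISTS "{name}"')
        conn.execute(f'CREATE TABLE "{name}" ({col_defs})')
        conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)

def select_columns(conn, name, columns):
    """Return the requested columns of a stored layer as a list of tuples."""
    col_list = ", ".join(f'"{c}"' for c in columns)
    return conn.execute(f'SELECT {col_list} FROM "{name}"').fetchall()

conn = sqlite3.connect(":memory:")
store_layer(
    conn,
    "proteomics",
    ["sample", "P1", "P2"],
    [("s1", 0.5, 1.2), ("s2", 0.7, 0.9)],
)
result = select_columns(conn, "proteomics", ["sample", "P2"])
```

The caller never writes a query; repetitive store and fetch operations reduce to two function calls, which is the kind of convenience the abstract describes.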
Title: Omilayers: a Python package for efficient data management to support multi-omic analysis.
Journal: BMC Bioinformatics 26(1):40 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11800426/pdf/
Pub Date: 2025-02-05 | DOI: 10.1186/s12859-025-06063-x
Samuel S Boyd, Chad Slawson, Jeffrey A Thompson
Background: Multi-omic studies provide comprehensive insight into biological systems by evaluating cellular changes between normal and pathological conditions at multiple levels of measurement. Biological networks, which represent interactions or associations between biomolecules, have been highly effective in facilitating omic analysis. However, current network-based methods lack generalizability to accommodate multiple data types across a range of diverse experiments.
Results: We present AMEND 2.0, an updated active module identification method that can analyze multiplex and/or heterogeneous networks integrated with multi-omic data in a highly generalizable framework, in contrast to existing methods, most of which handle at most two specific omic types. It is powered by Random Walk with Restart for multiplex-heterogeneous networks, with additional capabilities including degree bias adjustment and biased random walk for multi-objective module identification. AMEND was applied to two real-world multi-omic datasets: renal cell carcinoma data from The Cancer Genome Atlas and an O-GlcNAc Transferase knockout study. Additional analyses investigate the performance of various subroutines of AMEND on tasks of node ranking and degree bias adjustment.
Conclusions: While the analysis of multi-omic datasets in a network context is poised to provide a deeper understanding of health and disease, new methods are required to take full advantage of these increasingly complex data. The current study combines several network analysis techniques into a single versatile method for analyzing biological networks with multi-omic data that can be applied in many diverse scenarios. Software is freely available in the R programming language at https://github.com/samboyd0/AMEND.
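Random Walk with Restart, the engine named in the Results, can be sketched on an ordinary single-layer graph. This is the textbook iteration only, written in Python for illustration (AMEND itself is an R package); the multiplex-heterogeneous extension, degree bias adjustment, and biased walk are not shown. The function name, the restart probability of 0.5, and the toy path graph are assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at the seeds.

    adj   : dict mapping node -> list of neighbours (undirected, every node
            must have at least one neighbour)
    seeds : dict mapping node -> non-negative seed weight
    """
    nodes = sorted(adj)
    total = sum(seeds.values())
    p0 = {n: seeds.get(n, 0.0) / total for n in nodes}  # restart distribution
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` jump back to the seeds ...
        nxt = {n: restart * p0[n] for n in nodes}
        # ... otherwise move to a uniformly chosen neighbour.
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Path graph a - b - c - d with a single seed at "a".
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(adj, {"a": 1.0})
```

Nodes closer to the seed receive higher scores, which is what makes the resulting ranking useful for selecting a connected module around nodes with strong omic signal.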
Title: AMEND 2.0: module identification and multi-omic data integration with multiplex-heterogeneous graphs.
Journal: BMC Bioinformatics 26(1):39 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11800622/pdf/
Pub Date: 2025-02-04 | DOI: 10.1186/s12859-025-06060-0
Antoine Allard, Maxime Liboz, Raphaël Crépin, Sid Labdi, Olek Maciejak, Michel Malo, Clément Campillo, Guillaume Lamour
Atomic force microscopy (AFM) is the gold-standard technique for simultaneously mapping the morphology and viscoelastic properties of living cells. Although existing software tools, both open-source and from AFM manufacturers, can analyze cells individually, there is a growing need for fast, accessible code that compiles data from multiple cells into a single dataset. To address this, we present CellMAP, a user-friendly software tool that streamlines the batch processing of AFM-derived topography and stiffness maps of living cells. Our analysis pipeline includes, but is not limited to, flattening of the underlying substrate surface, filtering of outlier values, measurement of the cell surface and volume, and measurement of height and stiffness distributions. CellMAP can also generate a composite cell that reflects the height and stiffness properties of an entire cell population.
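The first pipeline step, flattening the substrate, can be illustrated with a first-order plane fit: fit z = a*x + b*y + c to pixels known to be bare substrate, then subtract that plane from the whole map. This is a generic sketch under stated assumptions, not CellMAP's implementation; the function names, the boolean substrate mask, and the synthetic height map are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three row tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) points.

    Solves the 3x3 normal equations with Cramer's rule; the points must
    not all lie on one line, otherwise the determinant is zero.
    """
    n = float(len(points))
    sx = sum(x for x, y, z in points)
    sy = sum(y for x, y, z in points)
    sz = sum(z for x, y, z in points)
    sxx = sum(x * x for x, y, z in points)
    syy = sum(y * y for x, y, z in points)
    sxy = sum(x * y for x, y, z in points)
    sxz = sum(x * z for x, y, z in points)
    syz = sum(y * z for x, y, z in points)
    m = [(sxx, sxy, sx), (sxy, syy, sy), (sx, sy, n)]
    rhs = (sxz, syz, sz)
    d = det3(m)
    coeffs = []
    for col in range(3):
        # Replace one column of the matrix by the right-hand side.
        mc = [tuple(rhs[r] if c == col else m[r][c] for c in range(3)) for r in range(3)]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)

def flatten(height_map, substrate_mask):
    """Subtract the plane fitted to substrate pixels from every pixel."""
    pts = [
        (x, y, z)
        for y, row in enumerate(height_map)
        for x, z in enumerate(row)
        if substrate_mask[y][x]
    ]
    a, b, c = fit_plane(pts)
    return [[z - (a * x + b * y + c) for x, z in enumerate(row)]
            for y, row in enumerate(height_map)]

# Synthetic 4x4 map: a tilted substrate plus one 3-unit-high "cell" pixel.
tilted = [[0.1 * x + 0.2 * y + 5.0 for x in range(4)] for y in range(4)]
tilted[1][2] += 3.0
mask = [[not (x == 2 and y == 1) for x in range(4)] for y in range(4)]
flat = flatten(tilted, mask)
```

After flattening, substrate pixels sit at zero height, so cell height, surface, and volume can be measured relative to a common baseline across many cells.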
Title: CellMAP: an open-source software tool to batch-process cell topography and stiffness maps collected with an atomic force microscope.
Journal: BMC Bioinformatics 26(1):38 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11796028/pdf/
Pub Date: 2025-02-04 | DOI: 10.1186/s12859-025-06050-2
Qiaowang Li, Yaser Gamallat, Jon George Rokne, Tarek A Bismar, Reda Alhajj
Biomedical researchers often deal with large amounts of raw data whose analysis can yield significant insights, but the sheer size of the data can make those insights difficult to uncover. This paper presents BioLake, a data framework that provides minimalist interactive methods to help researchers conduct bioinformatics data analysis. Unlike some existing analytical tools, BioLake supports a wide range of web-based analyses of public datasets while also allowing researchers to analyze their private datasets instantly. The tool also enhances result interpretability by providing the source code and detailed instructions. For data storage, BioLake adopts the data lakehouse architecture to provide storage scalability and analysis flexibility. To further improve analysis efficiency, BioLake supports online analysis of custom data: researchers upload their own data via a designed procedure without waiting for server-side approval. A single upload of custom data may be up to 500 MB, which is intended to keep researchers from running into upload size limits. For built-in data, BioLake applies reactive continuous data integration, which lets the analysis pipeline skip most preprocessing steps. The only built-in dataset in the first public version is TCGA-PRAD mRNA expression data for prostate cancer research, the primary focus of the BioLake development team. In summary, BioLake offers a lightweight online tool that facilitates bioinformatic mRNA data analysis and supports custom online data processing.
Title: BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse.
Journal: BMC Bioinformatics 26(1):37 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792420/pdf/
Pub Date: 2025-02-01 | DOI: 10.1186/s12859-025-06045-z
Jia Tian, Ziyu Gao, Minghao Li, Ergude Bao, Jin Zhao
Background: Viruses can inhabit their hosts as an ensemble of mutant strains. Reconstructing a robust consensus representation of these diverse strains is essential for recognizing the genetic variation among them and for studying virulence, pathogenesis, and therapy selection. Virus genomes are typically small, often comprising only a few thousand to several hundred thousand nucleotides. While constructing a high-quality consensus of virus strains might therefore seem feasible, most current assemblers generate only fragmented contigs. Assembling a single full-length consensus contig is vital for identifying genetic diversity and estimating strain abundance accurately.
Results: In this paper, we developed FC-Virus, a de novo genome assembly strategy specifically targeting highly diverse viral populations. FC-Virus first identifies the k-mers that are common across most viral strains, and then uses these k-mers as a backbone to build a full-length consensus sequence covering the entire genome. We benchmark FC-Virus against state-of-the-art genome assemblers.
Conclusion: Experimental results confirm that FC-Virus can construct a single, accurate full-length consensus, whereas other assemblers only manage to produce fragmented contigs. FC-Virus is freely available at https://github.com/qdu-bioinfo/FC-Virus.git .
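The first step described in the Results, finding k-mers shared by most strains, can be sketched as follows. This is a simplified illustration that works on whole strain sequences and uses a presence threshold; FC-Virus itself assembles from sequencing data and the abstract does not give its selection criterion, so the function names, the `min_fraction` parameter, and the toy sequences are assumptions.

```python
from collections import Counter

def kmers(seq, k):
    """Set of all substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def common_kmers(strains, k, min_fraction=0.8):
    """k-mers present in at least `min_fraction` of the sequences."""
    counts = Counter()
    for s in strains:
        counts.update(kmers(s, k))  # a set, so each strain votes once per k-mer
    threshold = min_fraction * len(strains)
    return {km for km, c in counts.items() if c >= threshold}

# Four toy "strains" that differ by single substitutions.
strains = ["ACGTACGA", "ACGTTCGA", "ACGTACGA", "ACCTACGA"]
backbone = common_kmers(strains, k=3, min_fraction=0.75)
```

The shared k-mers mark the conserved positions; a consensus builder would then chain them in genome order and fill the variable regions between them.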
Title: Accurate assembly of full-length consensus for viral quasispecies.
Journal: BMC Bioinformatics 26(1):36 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11787740/pdf/
Pub Date: 2025-01-31 | DOI: 10.1186/s12859-025-06054-y
Meng Zhang, Joel Parker, Lingling An, Yiwen Liu, Xiaoxiao Sun
Motivation: Spatial transcriptomics is a state-of-the-art technique that allows researchers to study gene expression patterns in tissues over the spatial domain. Because of technical limitations, most spatial transcriptomics techniques provide bulk data for each sequencing spot, so deconvolution is essential for obtaining high-resolution spatial transcriptomics data. Most existing deconvolution methods rely on reference data (e.g., single-cell data), which may not be available in real applications. Current reference-free methods are limited by their dependence on distribution assumptions, their reliance on marker genes, or their failure to leverage histology and spatial information. Consequently, there is a critical need for flexible, robust, and user-friendly reference-free deconvolution methods capable of unifying or leveraging case-specific information in the analysis of spatial transcriptomics data.
Results: We propose a novel reference-free method based on regularized non-negative matrix factorization (NMF), named Flexible Analysis of Spatial Transcriptomics (FAST), that effectively incorporates gene expression, spatial, and histology information into a unified deconvolution framework. Compared to existing methods, FAST imposes fewer distribution assumptions, utilizes the spatial structure of tissues, and encourages interpretable factorization results. These features enable greater flexibility and accuracy, making FAST an effective tool for deciphering the complex cell-type composition of tissues and advancing our understanding of various biological processes and diseases. Extensive simulation studies show that FAST outperforms other existing reference-free methods. In real data applications, FAST uncovers the underlying tissue structures and identifies the corresponding marker genes.
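The general idea of regularized NMF with spatial information can be sketched with graph-regularized NMF, a well-known variant in which a Laplacian penalty encourages neighbouring spots to have similar factor loadings. This is a swapped-in textbook technique, not FAST's actual objective, which the abstract does not specify. The matrix layout (genes by spots), the chain adjacency, the penalty weight `lam`, and all function names are assumptions.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def objective(x, w, h, adj, lam):
    """||X - WH||^2 plus lam times the Laplacian smoothness penalty on H."""
    wh = matmul(w, h)
    fit = sum((xv - rv) ** 2 for xr, rr in zip(x, wh) for xv, rv in zip(xr, rr))
    n = len(adj)
    smooth = 0.5 * sum(
        adj[s][t] * (row[s] - row[t]) ** 2
        for row in h for s in range(n) for t in range(n)
    )
    return fit + lam * smooth

def spatial_nmf(x, adj, k, lam=0.1, iters=200, seed=0, eps=1e-12):
    """Factor X (genes x spots) into non-negative W (genes x k) and H (k x spots)."""
    rng = random.Random(seed)
    genes, spots = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(genes)]
    h = [[rng.random() + 0.1 for _ in range(spots)] for _ in range(k)]
    deg = [sum(row) for row in adj]  # spot degrees in the spatial graph
    for _ in range(iters):
        # W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(genes)]
        # H <- H * (W^T X + lam * H A) / (W^T W H + lam * H D)
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        ha = matmul(h, adj)
        h = [[h[i][s] * (num[i][s] + lam * ha[i][s])
              / (den[i][s] + lam * h[i][s] * deg[s] + eps)
              for s in range(spots)] for i in range(k)]
    return w, h

# Four genes measured at four spots arranged in a chain 0 - 1 - 2 - 3.
x = [[5, 4, 0, 0], [4, 5, 1, 0], [0, 1, 5, 4], [0, 0, 4, 5]]
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
w, h = spatial_nmf(x, adj, k=2)
```

Each column of H can be normalized to sum to one and read as estimated proportions of the k latent cell types at that spot, while the columns of W give the corresponding expression signatures.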
Title: Flexible analysis of spatial transcriptomics data (FAST): a deconvolution approach.
Journal: BMC Bioinformatics 26(1):35 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11786350/pdf/
Background: Biomedical text mining extracts essential information from scientific articles using named entity recognition (NER). Traditional NER methods rely on dictionaries, rules, or curated corpora, which may not always be accessible. Deep learning (DL) methods have emerged to overcome these limitations. However, DL-based NER methods may struggle to identify long-distance relationships within text and require large annotated datasets.
Results: This research proposes a novel model to address these challenges: the Improved Green Anaconda-assisted Bi-GRU-based Hierarchical ResNet BNER model (IGa-BiHR BNERM), which shows promising results in accurately identifying named entities. The MACCROBAT dataset was obtained from Kaggle and underwent several pre-processing steps, such as stop-word filtering, WordNet processing, removal of non-alphanumeric characters, stemming, segmentation, and tokenization, which standardize the text and improve its quality. The pre-processed text was fed into a feature extraction model (the Robustly Optimized BERT-Whole Word Masking model), which provides word embeddings with semantic information. The BNER step then used the IGa-BiHR BNERM.
Conclusion: To improve the training phase of the IGa-BiHR BNERM, the Improved Green Anaconda Optimization technique was used to select optimal weight coefficients for the model. When tested on the MACCROBAT dataset, the model outperformed previous models with an accuracy of 99.11%. This model effectively and accurately identifies biomedical named entities within text, significantly advancing this field.
{"title":"Biomedical named entity recognition using improved green anaconda-assisted Bi-GRU-based hierarchical ResNet model.","authors":"Ram Chandra Bhushan, Rakesh Kumar Donthi, Yojitha Chilukuri, Ulligaddala Srinivasarao, Polisetty Swetha","doi":"10.1186/s12859-024-06008-w","DOIUrl":"10.1186/s12859-024-06008-w","url":null,"abstract":"<p><strong>Background: </strong>Biomedical text mining is a technique that extracts essential information from scientific articles using named entity recognition (NER). Traditional NER methods rely on dictionaries, rules, or curated corpora, which may not always be accessible. To overcome these challenges, deep learning (DL) methods have emerged. However, DL-based NER methods may need help identifying long-distance relationships within text and require significant annotated datasets.</p><p><strong>Results: </strong>This research has proposed a novel model to address the challenges in natural language processing. The Improved Green anaconda-assisted Bi-GRU based Hierarchical ResNet BNER model (IGa-BiHR BNERM) is the model. IGa-BiHR BNERM model has shown promising results in accurately identifying named entities. The MACCROBAT dataset was obtained from Kaggle and underwent several pre-processing steps such as Stop Word Filtering, WordNet processing, Removal of non-alphanumeric characters, stemming Segmentation, and Tokenization, which is standardized and improves its quality. The pre-processed text was fed into a feature extraction model like the Robustly Optimized BERT -Whole Word Masking model. This model provides word embeddings with semantic information. Then, the BNER process utilized an Improved Green Anaconda-assisted Bi-GRU-based Hierarchical ResNet BNER model (IGa-BiHR BNERM).</p><p><strong>Conclusion: </strong>To improve the training phase of the IGa-BiHR BNERM, the Improved Green Anaconda Optimization technique was used to select optimal weight parameter coefficients for training the model parameters. 
After the model was tested using the MACCROBAT dataset, it outperformed previous models with a tremendous accuracy rate of 99.11%. This model effectively and accurately identifies biomedical names within the text, significantly advancing this field.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"34"},"PeriodicalIF":2.9,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11780922/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143063556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
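Three of the pre-processing steps named in the abstract above (removal of non-alphanumeric characters, tokenization, and stop-word filtering) can be sketched with the Python standard library. This is a minimal illustration, not the authors' pipeline: the stop-word list is a tiny hypothetical sample, whitespace tokenization is an assumption, and WordNet processing, stemming, and segmentation are left out.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "with", "was", "to", "in"}

def preprocess(text):
    """Clean a sentence and return its tokens without stop words."""
    # Replace every character that is not a letter, digit, or whitespace.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Whitespace tokenization (an assumption; the paper's tokenizer is not specified here).
    tokens = cleaned.split()
    # Compare in lower case but keep the original casing of surviving tokens.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

if __name__ == "__main__":
    print(preprocess("The patient was treated with aspirin (100 mg)."))
```

Keeping the original casing of the surviving tokens is a design choice made here because capitalization often carries signal for entity recognition.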
Background: Single-cell RNA sequencing (scRNA-seq) has transformed biological research by offering new insights into cellular heterogeneity, developmental processes, and disease mechanisms. As scRNA-seq technology advances, its role in modern biology has become increasingly vital. This study explores the application of deep learning to single-cell data clustering, with a particular focus on managing sparse, high-dimensional data.
Results: We propose the SMD deep learning model, which integrates nonlinear dimensionality reduction techniques with a porous dilated attention gate component. Built upon a convolutional autoencoder and informed by the negative binomial distribution, the SMD model efficiently captures essential cell clustering features and dynamically adjusts feature weights. Comprehensive evaluation on both public datasets and proprietary osteosarcoma data highlights the SMD model's efficacy in achieving precise classifications for single-cell data clustering, showcasing its potential for advanced transcriptomic analysis.
Conclusion: This study underscores the potential of deep learning, specifically the SMD model, in advancing single-cell RNA sequencing data analysis. By integrating innovative computational techniques, the SMD model provides a powerful framework for unraveling cellular complexities, enhancing our understanding of biological processes, and elucidating disease mechanisms. The code is available from https://github.com/xiaoxuc/scSMD.
{"title":"scSMD: a deep learning method for accurate clustering of single cells based on auto-encoder.","authors":"Xiaoxu Cui, Renkai Wu, Yinghao Liu, Peizhan Chen, Qing Chang, Pengchen Liang, Changyu He","doi":"10.1186/s12859-025-06047-x","DOIUrl":"10.1186/s12859-025-06047-x","url":null,"abstract":"<p><strong>Background: </strong>Single-cell RNA sequencing (scRNA-seq) has transformed biological research by offering new insights into cellular heterogeneity, developmental processes, and disease mechanisms. As scRNA-seq technology advances, its role in modern biology has become increasingly vital. This study explores the application of deep learning to single-cell data clustering, with a particular focus on managing sparse, high-dimensional data.</p><p><strong>Results: </strong>We propose the SMD deep learning model, which integrates nonlinear dimensionality reduction techniques with a porous dilated attention gate component. Built upon a convolutional autoencoder and informed by the negative binomial distribution, the SMD model efficiently captures essential cell clustering features and dynamically adjusts feature weights. Comprehensive evaluation on both public datasets and proprietary osteosarcoma data highlights the SMD model's efficacy in achieving precise classifications for single-cell data clustering, showcasing its potential for advanced transcriptomic analysis.</p><p><strong>Conclusion: </strong>This study underscores the potential of deep learning-specifically the SMD model-in advancing single-cell RNA sequencing data analysis. By integrating innovative computational techniques, the SMD model provides a powerful framework for unraveling cellular complexities, enhancing our understanding of biological processes, and elucidating disease mechanisms. 
The code is available from https://github.com/xiaoxuc/scSMD .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"33"},"PeriodicalIF":2.9,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11780796/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143063557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
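The scSMD abstract above says the model is informed by the negative binomial distribution, which is commonly used for over-dispersed count data such as scRNA-seq reads. The sketch below shows one common parameterization of its log-probability (mean `mu`, inverse dispersion `theta`) and a mean negative log-likelihood built from it. How scSMD actually uses the distribution is not stated in the abstract, so the function names and the single shared `theta` are assumptions.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial distribution
    with mean mu > 0 and inverse dispersion theta > 0
    (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def nb_loss(counts, means, theta):
    """Mean negative log-likelihood of observed counts given predicted means.

    A hypothetical reconstruction loss; theta is shared across all entries
    here purely to keep the example short.
    """
    pairs = list(zip(counts, means))
    return -sum(nb_log_pmf(x, mu, theta) for x, mu in pairs) / len(pairs)

if __name__ == "__main__":
    observed = [0, 3, 1, 7]           # made-up counts for one cell
    predicted = [0.5, 2.0, 1.5, 6.0]  # made-up decoder outputs (means)
    print(nb_loss(observed, predicted, theta=1.5))
```

A smaller `theta` means more over-dispersion; as `theta` grows, the distribution approaches a Poisson with the same mean.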