
BMC Bioinformatics: Latest Articles

Omilayers: a Python package for efficient data management to support multi-omic analysis.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-02-06 DOI: 10.1186/s12859-025-06067-7
Dimitrios Kioroglou

Multi-omic integration involves the management of diverse omic datasets. Analyzing these datasets effectively requires a data management system that meets a specific set of requirements: rapid storage and retrieval of data with varying numbers of features and mixed data types, reliable and secure database transactions, row-wise and column-wise extension of stored data, and easy data distribution. SQLite and DuckDB are embedded databases that fulfil these requirements. However, both rely on the structured query language (SQL), which hinders adoption by uninitiated users and complicates repetitive tasks because SQL queries must be written by hand. This study presents Omilayers, a Python package that encapsulates these two databases and exposes a subset of their functionality geared towards frequent and repetitive analytical procedures. Synthetic data were used to demonstrate the use of Omilayers and to compare the performance of SQLite and DuckDB.
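
To make the idea of hiding SQL behind a small Python interface concrete, here is a minimal sketch built on the standard-library `sqlite3` module. It is not the Omilayers API: the class `LayerStore` and its methods `create_layer`, `add_column`, `append_rows`, and `select` are hypothetical names chosen for illustration, and table and column names are assumed to be trusted identifiers.

```python
import sqlite3


class LayerStore:
    """Hypothetical wrapper that hides SQL; NOT the Omilayers API."""

    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)

    def create_layer(self, name, columns, rows):
        # One table per omic "layer"; names are assumed to be trusted.
        cols = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        with self.con:  # commits on success, rolls back on error
            self.con.execute(f'CREATE TABLE "{name}" ({cols})')
            self.con.executemany(f'INSERT INTO "{name}" VALUES ({marks})', rows)

    def add_column(self, name, column, values_by_rowid):
        # Column-wise extension: add a feature and fill it row by row.
        with self.con:
            self.con.execute(f'ALTER TABLE "{name}" ADD COLUMN "{column}"')
            self.con.executemany(
                f'UPDATE "{name}" SET "{column}" = ? WHERE rowid = ?',
                [(value, rowid) for rowid, value in values_by_rowid.items()],
            )

    def append_rows(self, name, rows):
        # Row-wise extension: every row must match the current column count.
        marks = ", ".join("?" for _ in rows[0])
        with self.con:
            self.con.executemany(f'INSERT INTO "{name}" VALUES ({marks})', rows)

    def select(self, name, columns):
        cols = ", ".join(f'"{c}"' for c in columns)
        query = f'SELECT {cols} FROM "{name}" ORDER BY rowid'
        return self.con.execute(query).fetchall()


store = LayerStore()
store.create_layer("proteomics", ["sample", "P1"], [("s1", 0.5), ("s2", 1.2)])
store.add_column("proteomics", "P2", {1: 3.0, 2: 4.0})
store.append_rows("proteomics", [("s3", 0.9, 2.5)])
result = store.select("proteomics", ["sample", "P2"])
```

The `with self.con:` blocks wrap each operation in a transaction, which is one simple way to obtain the reliable transactions listed among the requirements. A DuckDB backend would need its own connection handling and is not shown here.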

Citations: 0
AMEND 2.0: module identification and multi-omic data integration with multiplex-heterogeneous graphs.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-02-05 DOI: 10.1186/s12859-025-06063-x
Samuel S Boyd, Chad Slawson, Jeffrey A Thompson

Background: Multi-omic studies provide comprehensive insight into biological systems by evaluating cellular changes between normal and pathological conditions at multiple levels of measurement. Biological networks, which represent interactions or associations between biomolecules, have been highly effective in facilitating omic analysis. However, current network-based methods lack generalizability to accommodate multiple data types across a range of diverse experiments.

Results: We present AMEND 2.0, an updated active module identification method that can analyze multiplex and/or heterogeneous networks integrated with multi-omic data in a highly generalizable framework. Existing methods, by contrast, are mostly suited to at most two specific omic types. AMEND 2.0 is powered by Random Walk with Restart for multiplex-heterogeneous networks, with additional capabilities including degree bias adjustment and biased random walk for multi-objective module identification. AMEND was applied to two real-world multi-omic datasets: renal cell carcinoma data from The Cancer Genome Atlas and an O-GlcNAc Transferase knockout study. Additional analyses investigate the performance of various subroutines of AMEND on tasks of node ranking and degree bias adjustment.

Conclusions: While the analysis of multi-omic datasets in a network context is poised to provide deeper understanding of health and disease, new methods are required to fully take advantage of this increasingly complex data. The current study combines several network analysis techniques into a single versatile method for analyzing biological networks with multi-omic data that can be applied in many diverse scenarios. Software is freely available in the R programming language at https://github.com/samboyd0/AMEND .
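
The Results paragraph names Random Walk with Restart (RWR) as the engine behind AMEND 2.0. As a rough illustration, the sketch below implements the classic single-layer RWR iteration in plain Python. It is not the multiplex-heterogeneous variant used by AMEND and omits degree bias adjustment. The function name `random_walk_with_restart`, the default restart probability of 0.5, and the toy graph are assumptions made for this example.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Classic single-layer RWR on a dense adjacency matrix (list of lists).

    adj[i][j] is the weight of the edge between nodes i and j;
    seeds[i] is the non-negative seed weight of node i (e.g. an omic score).
    """
    n = len(adj)
    # Column-normalize so that column j holds the transition probabilities out of j.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    trans = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = sum(seeds)
    p0 = [s / total for s in seeds]  # restart distribution
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(trans[i][j] * p[j] for j in range(n))
            + restart * p0[i]
            for i in range(n)
        ]
        diff = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if diff < tol:  # stop once the scores have stabilized
            break
    return p


# Toy path graph 0 - 1 - 2 with the walk restarting at node 0.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = random_walk_with_restart(path, seeds=[1, 0, 0])
```

Nodes closer to the seed receive higher stationary scores, which is what makes RWR useful for ranking nodes around omic signals in a network.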

Citations: 0
CellMAP: an open-source software tool to batch-process cell topography and stiffness maps collected with an atomic force microscope.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-02-04 DOI: 10.1186/s12859-025-06060-0
Antoine Allard, Maxime Liboz, Raphaël Crépin, Sid Labdi, Olek Maciejak, Michel Malo, Clément Campillo, Guillaume Lamour

Atomic force microscopy (AFM) is the gold-standard technique to simultaneously map the morphology and viscoelastic properties of living cells. Although existing software tools, both open-source and from AFM manufacturers, can analyze cells individually, there is a growing need for fast and accessible codes to compile data from multiple cells into a single dataset. To address this, we present CellMAP, a user-friendly software tool that streamlines the batch-processing of AFM-derived topography and stiffness maps of living cells. Our analysis pipeline includes but is not limited to: flattening of the underlying substrate surface, filtering of outlier values, measurement of the cell surface and volume, and measurement of height and stiffness distributions. CellMAP can also generate a composite cell that reflects the height and stiffness properties of an entire cell population.

Citations: 0
BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-02-04 DOI: 10.1186/s12859-025-06050-2
Qiaowang Li, Yaser Gamallat, Jon George Rokne, Tarek A Bismar, Reda Alhajj

Biomedical researchers must often deal with large amounts of raw data, and analysis of this data might provide significant insights. However, if the raw data size is large, it might be difficult to uncover these insights. In this paper, a data framework named BioLake is presented that provides minimalist interactive methods to help researchers conduct bioinformatics data analysis. Unlike some existing analytical tools on the market, BioLake supports a wide range of web-based bioinformatics data analysis for public datasets, while allowing researchers to analyze their private datasets instantly. The tool also significantly enhances result interpretability by providing the source code and detailed instructions. In terms of data storage design, BioLake adopts the data lakehouse architecture to provide storage scalability and analysis flexibility. To further enhance analysis efficiency, BioLake supports online analysis of custom data, allowing researchers to upload their own data via a designed procedure without waiting for server-side approval. BioLake accepts a one-time upload of up to 500 MB of custom data, so that researchers avoid issues with data being too large to upload. For the built-in dataset, BioLake applies reactive continuous data integration, which lets the analysis pipeline skip most preprocessing steps. The only built-in dataset in the first public version of BioLake is TCGA-PRAD mRNA expression data for prostate cancer research, which is the primary focus of the BioLake development team. In summary, BioLake offers a lightweight online tool to facilitate bioinformatic mRNA data analysis with the support of custom online data processing.

Citations: 0
Accurate assembly of full-length consensus for viral quasispecies.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-02-01 DOI: 10.1186/s12859-025-06045-z
Jia Tian, Ziyu Gao, Minghao Li, Ergude Bao, Jin Zhao

Background: Viruses can inhabit their hosts in the form of an ensemble of various mutant strains. Reconstructing a robust consensus representation for these diverse mutant strains is essential for recognizing the genetic variations among strains and for studying aspects such as virulence, pathogenesis, and therapy selection. Virus genomes are typically small, often composed of only a few thousand to several hundred thousand nucleotides. While constructing a high-quality consensus of virus strains might seem feasible, most current assemblers generate only fragmented contigs. Assembling a single full-length consensus contig is vital for identifying genetic diversity and estimating strain abundance accurately.

Results: In this paper, we developed FC-Virus, a de novo genome assembly strategy specifically targeting highly diverse viral populations. FC-Virus first identifies the k-mers that are common across most viral strains, and then uses these k-mers as a backbone to build a full-length consensus sequence covering the entire genome. We benchmark FC-Virus against state-of-the-art genome assemblers.

Conclusion: Experimental results confirm that FC-Virus can construct a single, accurate full-length consensus, whereas other assemblers only manage to produce fragmented contigs. FC-Virus is freely available at https://github.com/qdu-bioinfo/FC-Virus.git .
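
The first step described in the Results, finding k-mers that are common across most strains, can be illustrated with a small Python sketch. This is a toy version under stated assumptions: FC-Virus works from sequencing data and the abstract does not say how "common" is decided, so the function names `kmers` and `shared_kmers`, the `min_fraction` threshold, and the use of complete toy strain sequences are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(strains, k, min_fraction=0.8):
    """Return k-mers present in at least `min_fraction` of the strains."""
    counts = Counter()
    for seq in strains:
        # Using a set means each strain votes at most once per k-mer.
        counts.update(kmers(seq, k))
    needed = min_fraction * len(strains)
    return {kmer for kmer, count in counts.items() if count >= needed}


# Three toy "strains" that share a prefix but diverge afterwards.
toy_strains = ["ACGTAC", "ACGTTC", "ACGAAC"]
backbone_all = shared_kmers(toy_strains, k=3, min_fraction=1.0)
backbone_most = shared_kmers(toy_strains, k=3, min_fraction=0.6)
```

In this toy input only `ACG` occurs in all three strains, while `CGT` occurs in two of them, so lowering the threshold widens the candidate backbone. How such k-mers are then chained into a full-length consensus is not described in the abstract and is not attempted here.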

Citations: 0
Flexible analysis of spatial transcriptomics data (FAST): a deconvolution approach.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-31 DOI: 10.1186/s12859-025-06054-y
Meng Zhang, Joel Parker, Lingling An, Yiwen Liu, Xiaoxiao Sun

Motivation: Spatial transcriptomics is a state-of-the-art technique that allows researchers to study gene expression patterns in tissues over the spatial domain. Because of technical limitations, most spatial transcriptomics techniques provide bulk data for each sequencing spot, so deconvolution is essential for obtaining high-resolution spatial transcriptomics data. Most existing deconvolution methods rely on reference data (e.g., single-cell data), which may not be available in real applications. Current reference-free methods are limited by their dependence on distribution assumptions, their reliance on marker genes, or their failure to leverage histology and spatial information. There is therefore a critical need for highly flexible, robust, and user-friendly reference-free deconvolution methods capable of unifying or leveraging case-specific information in the analysis of spatial transcriptomics data.

Results: We propose a novel reference-free method based on regularized non-negative matrix factorization (NMF), named Flexible Analysis of Spatial Transcriptomics (FAST), that can effectively incorporate gene expression data, spatial, and histology information into a unified deconvolution framework. Compared to existing methods, FAST imposes fewer distribution assumptions, utilizes the spatial structure information of tissues, and encourages interpretable factorization results. These features enable greater flexibility and accuracy, making FAST an effective tool for deciphering the complex cell-type composition of tissues and advancing our understanding of various biological processes and diseases. Extensive simulation studies have shown that FAST outperforms other existing reference-free methods. In real data applications, FAST is able to uncover the underlying tissue structures and identify the corresponding marker genes.
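
For readers unfamiliar with regularized NMF, a generic form of the objective is sketched below. The abstract does not give FAST's actual objective, so every symbol here is an assumption: $X$ is a genes-by-spots expression matrix, $W$ holds genes-by-cell-type signatures, $H$ holds cell-type-by-spot proportions, $L$ is a graph Laplacian built from spot neighborhoods, and $\lambda$ is a tuning weight. The Laplacian penalty is one common way to encode spatial smoothness, not necessarily the one FAST uses.

```latex
\min_{W \ge 0,\; H \ge 0} \;
  \lVert X - W H \rVert_F^2
  \;+\; \lambda \, \operatorname{tr}\!\left( H L H^{\top} \right)
```

The first term asks the factorization to reconstruct the observed spot-level expression, while the second term penalizes neighboring spots whose estimated compositions differ sharply.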

Citations: 0
Biomedical named entity recognition using improved green anaconda-assisted Bi-GRU-based hierarchical ResNet model.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-30 DOI: 10.1186/s12859-024-06008-w
Ram Chandra Bhushan, Rakesh Kumar Donthi, Yojitha Chilukuri, Ulligaddala Srinivasarao, Polisetty Swetha

Background: Biomedical text mining is a technique that extracts essential information from scientific articles using named entity recognition (NER). Traditional NER methods rely on dictionaries, rules, or curated corpora, which may not always be accessible. To overcome these challenges, deep learning (DL) methods have emerged. However, DL-based NER methods may struggle to identify long-distance relationships within text and require large annotated datasets.

Results: This research proposes a novel model to address these challenges in natural language processing: the Improved Green Anaconda-assisted Bi-GRU-based Hierarchical ResNet BNER model (IGa-BiHR BNERM), which has shown promising results in accurately identifying named entities. The MACCROBAT dataset was obtained from Kaggle and underwent several pre-processing steps, including stop word filtering, WordNet processing, removal of non-alphanumeric characters, stemming, segmentation, and tokenization, which standardize the text and improve its quality. The pre-processed text was fed into a feature extraction model, the Robustly Optimized BERT Whole Word Masking model, which provides word embeddings with semantic information. The BNER process then used the IGa-BiHR BNERM.

Conclusion: To improve the training phase of the IGa-BiHR BNERM, the Improved Green Anaconda Optimization technique was used to select optimal weight parameter coefficients for training the model parameters. When tested on the MACCROBAT dataset, the model outperformed previous models with an accuracy of 99.11%. This model effectively and accurately identifies biomedical names within the text, significantly advancing this field.

Citations: 0
scSMD: a deep learning method for accurate clustering of single cells based on auto-encoder.
IF 2.9 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2025-01-29 DOI: 10.1186/s12859-025-06047-x
Xiaoxu Cui, Renkai Wu, Yinghao Liu, Peizhan Chen, Qing Chang, Pengchen Liang, Changyu He

Background: Single-cell RNA sequencing (scRNA-seq) has transformed biological research by offering new insights into cellular heterogeneity, developmental processes, and disease mechanisms. As scRNA-seq technology advances, its role in modern biology has become increasingly vital. This study explores the application of deep learning to single-cell data clustering, with a particular focus on managing sparse, high-dimensional data.

Results: We propose the SMD deep learning model, which integrates nonlinear dimensionality reduction techniques with a porous dilated attention gate component. Built upon a convolutional autoencoder and informed by the negative binomial distribution, the SMD model efficiently captures essential cell clustering features and dynamically adjusts feature weights. Comprehensive evaluation on both public datasets and proprietary osteosarcoma data highlights the SMD model's efficacy in achieving precise classifications for single-cell data clustering, showcasing its potential for advanced transcriptomic analysis.

Conclusion: This study underscores the potential of deep learning, and specifically the SMD model, in advancing single-cell RNA sequencing data analysis. By integrating innovative computational techniques, the SMD model provides a powerful framework for unraveling cellular complexities, enhancing our understanding of biological processes, and elucidating disease mechanisms. The code is available from https://github.com/xiaoxuc/scSMD.
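The abstract says the autoencoder is "informed by the negative binomial distribution" but does not spell out how. One common way to do this in scRNA-seq autoencoders is to score reconstructions with a negative binomial negative log-likelihood. The following is a minimal sketch of that loss under this assumption, not the authors' implementation; the function names and the mean/dispersion parameterisation are illustrative choices.

```python
import math


def nb_neg_log_likelihood(count, mean, dispersion):
    """Negative log-likelihood of one count under a negative binomial.

    Assumed parameterisation: variance = mean + mean**2 / dispersion.
    """
    eps = 1e-10  # guards against log(0) when mean is exactly zero
    log_prob = (
        math.lgamma(count + dispersion)
        - math.lgamma(dispersion)
        - math.lgamma(count + 1.0)
        + dispersion * math.log(dispersion / (dispersion + mean) + eps)
        + count * math.log(mean / (dispersion + mean) + eps)
    )
    return -log_prob


def reconstruction_loss(counts, means, dispersion):
    """Average NB negative log-likelihood over one cell's genes.

    `counts` are observed counts; `means` are the decoder's predicted means
    (hypothetical inputs for illustration).
    """
    total = sum(
        nb_neg_log_likelihood(x, m, dispersion) for x, m in zip(counts, means)
    )
    return total / len(counts)
```

A decoder whose predicted means sit close to the observed counts yields a lower loss than one whose means are far off, which is the signal a training loop would minimise.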

BMC Bioinformatics 26(1), 33. Published 2025-01-29. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11780796/pdf/
Citations: 0
Correcting scale distortion in RNA sequencing data.
IF 2.9 · Zone 3 (Biology) · Q2 BIOCHEMICAL RESEARCH METHODS · Pub Date: 2025-01-28 · DOI: 10.1186/s12859-025-06041-3
Christopher Thron, Farhad Jafari

RNA sequencing (RNA-seq) is the conventional genome-scale approach used to capture the expression levels of all detectable genes in a biological sample. It is now regularly used for population-based studies designed to identify genetic determinants of various diseases. Naturally, the accuracy of these tests should be verified and improved if possible. In this study, we aimed to detect and correct for expression-level-dependent errors that are not corrected by conventional normalization techniques. We examined several RNA-seq datasets from the Cancer Genome Atlas (TCGA), Stand Up 2 Cancer (SU2C), and GTEx databases with various types of preprocessing. By applying local averaging, we found expression-level-dependent biases that differ from sample to sample in all datasets studied. Using simulations, we show that these biases corrupt gene-gene correlation estimations and t-tests between subpopulations. To mitigate these biases, we introduce two nonlinear transforms, based on statistical considerations, that correct the observed biases. We demonstrate that these transforms effectively remove the observed per-sample biases, reduce sample-to-sample variance, and improve the characteristics of gene-gene correlation distributions. Using a novel simulation methodology that creates controlled differences between subpopulations, we show that these transforms reduce variability and increase the sensitivity of two-population tests. The improvements in sensitivity and specificity were of the order of 3-5% in most instances after the data was corrected for bias. Altogether, these results improve our capacity to understand gene-gene relationships, and may lead to novel ways to utilize the information derived from clinical tests.
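The abstract mentions local averaging as the way expression-level-dependent bias was detected, without giving details. A rough sketch of the general idea, assuming a per-gene reference level (for example, the across-sample mean on a log scale) and a simple sliding window, might look like this. The function name, the window size, and the choice of reference are assumptions for illustration, not the authors' procedure.

```python
def local_average_bias(reference, sample, window=5):
    """Running mean of (sample - reference) over genes ordered by reference level.

    Returns a list of (reference_level, local_mean_residual) pairs. A curve
    that drifts away from zero as the level changes suggests a bias that
    depends on expression level.
    """
    if len(reference) != len(sample):
        raise ValueError("reference and sample must have the same length")
    # Order genes by their reference expression level.
    order = sorted(range(len(reference)), key=lambda i: reference[i])
    levels = [reference[i] for i in order]
    residuals = [sample[i] - reference[i] for i in order]
    half = window // 2
    curve = []
    for k in range(len(residuals)):
        lo = max(0, k - half)               # window is truncated at the edges
        hi = min(len(residuals), k + half + 1)
        curve.append((levels[k], sum(residuals[lo:hi]) / (hi - lo)))
    return curve
```

For a sample that matches the reference, the curve stays at zero; for a sample whose values are stretched relative to the reference, the local mean residual grows with expression level.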

BMC Bioinformatics 26(1), 32. Published 2025-01-28. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11776150/pdf/
Citations: 0
Marigold: a machine learning-based web app for zebrafish pose tracking.
IF 2.9 · Zone 3 (Biology) · Q2 BIOCHEMICAL RESEARCH METHODS · Pub Date: 2025-01-28 · DOI: 10.1186/s12859-025-06042-2
Gregory Teicher, R Madison Riffe, Wayne Barnaby, Gabrielle Martin, Benjamin E Clayton, Josef G Trapani, Gerald B Downes

Background: High-throughput behavioral analysis is important for drug discovery, toxicological studies, and the modeling of neurological disorders such as autism and epilepsy. Zebrafish embryos and larvae are ideal for such applications because they are spawned in large clutches, develop rapidly, feature a relatively simple nervous system, and have orthologs to many human disease genes. However, existing software for video-based behavioral analysis can be incompatible with recordings that contain dynamic backgrounds or foreign objects, lack support for multiwell formats, require expensive hardware, and/or demand considerable programming expertise. Here, we introduce Marigold, a free and open source web app for high-throughput behavioral analysis of embryonic and larval zebrafish.

Results: Marigold features an intuitive graphical user interface, tracks up to 10 user-defined keypoints, supports both single- and multiwell formats, and exports a range of kinematic parameters in addition to publication-quality data visualizations. By leveraging a highly efficient, custom-designed neural network architecture, Marigold achieves reasonable training and inference speeds even on modestly powered computers lacking a discrete graphics processing unit. Notably, as a web app, Marigold does not require any installation and runs within popular web browsers on ChromeOS, Linux, macOS, and Windows. To demonstrate Marigold's utility, we used two sets of biological experiments. First, we examined novel aspects of the touch-evoked escape response in techno trousers (tnt) mutant embryos, which contain a previously described loss-of-function mutation in the gene encoding Eaat2b, a glial glutamate transporter. We identified differences and interactions between touch location (head vs. tail) and genotype. Second, we investigated the effects of feeding on larval visuomotor behavior at 5 and 7 days post-fertilization (dpf). We found differences in the number and vigor of swimming bouts between fed and unfed fish at both time points, as well as interactions between developmental stage and feeding regimen.

Conclusions: In both biological experiments presented here, the use of Marigold facilitated novel behavioral findings. Marigold's ease of use, robust pose tracking, amenability to diverse experimental paradigms, and flexibility regarding hardware requirements make it a powerful tool for analyzing zebrafish behavior, especially in low-resource settings such as course-based undergraduate research experiences. Marigold is available at https://downeslab.github.io/marigold/.
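The abstract states that Marigold exports kinematic parameters derived from tracked keypoints but does not list them. As an illustration of the kind of quantity that can be computed from an ordered chain of keypoints, the sketch below sums the signed turning angles along the body. This is a hypothetical example with an invented function name, not Marigold's code or one of its documented outputs.

```python
import math


def cumulative_bend_angle(keypoints):
    """Sum of signed turning angles (radians) along an ordered keypoint chain.

    `keypoints` is a list of (x, y) tuples ordered from head to tail.
    A straight body gives 0; counter-clockwise bends are positive.
    """
    total = 0.0
    triples = zip(keypoints, keypoints[1:], keypoints[2:])
    for (x0, y0), (x1, y1), (x2, y2) in triples:
        ax, ay = x1 - x0, y1 - y0  # incoming segment
        bx, by = x2 - x1, y2 - y1  # outgoing segment
        # atan2(cross, dot) is the signed angle from the incoming segment
        # to the outgoing one.
        total += math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    return total
```

Evaluated frame by frame, a measure like this gives a time series from which bout counts or bend amplitudes could be derived.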

BMC Bioinformatics 26(1), 30. Published 2025-01-28. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11773884/pdf/
Citations: 0