BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse.

IF 2.9 3区 生物学 Q2 BIOCHEMICAL RESEARCH METHODS BMC Bioinformatics Pub Date : 2025-02-04 DOI:10.1186/s12859-025-06050-2
Qiaowang Li, Yaser Gamallat, Jon George Rokne, Tarek A Bismar, Reda Alhajj
{"title":"BioLake: an RNA expression analysis framework for prostate cancer biomarker powered by data lakehouse.","authors":"Qiaowang Li, Yaser Gamallat, Jon George Rokne, Tarek A Bismar, Reda Alhajj","doi":"10.1186/s12859-025-06050-2","DOIUrl":null,"url":null,"abstract":"<p><p>Biomedical researchers must often deal with large amounts of raw data, and analysis of this data might provide significant insights. However, if the raw data size is large, it might be difficult to uncover these insights. In this paper, a data framework named BioLake is presented that provides minimalist interactive methods to help researchers conduct bioinformatics data analysis. Unlike some existing analytical tools on the market, BioLake supports a wide range of web-based bioinformatics data analysis for public datasets, while allowing researchers to analyze their private datasets instantly. The tool also significantly enhances result interpretability by providing the source code and detailed instructions. In terms of data storage design, BioLake adopts the data lakehouse architecture to provide storage scalability and analysis flexibility. To further enhance the analysis efficiency, BioLake supports online analysis for custom data, allowing researchers to upload their own data via a designed procedure without waiting for server-side approval. BioLake allows a one-time upload of custom data of up to 500 MB to ensure that researchers avoid issues with data being too large for upload. In terms of the built-in dataset, BioLake applies reactive continuous data integration, helping the analysis pipeline to get rid of most preprocessing steps. The only pre-built-in dataset of BioLake in the first public version is TCGA-PRAD mRNA expression data for prostate cancer research, which is the primary focus of the development team of BioLake. In summary, BioLake offers a lightweight online tool to facilitate bioinformatic mRNA data analysis with the support of custom online data processing.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"37"},"PeriodicalIF":2.9000,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792420/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12859-025-06050-2","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Biomedical researchers must often deal with large amounts of raw data, and analysis of this data might provide significant insights. However, if the raw data size is large, it might be difficult to uncover these insights. In this paper, a data framework named BioLake is presented that provides minimalist interactive methods to help researchers conduct bioinformatics data analysis. Unlike some existing analytical tools on the market, BioLake supports a wide range of web-based bioinformatics data analysis for public datasets, while allowing researchers to analyze their private datasets instantly. The tool also significantly enhances result interpretability by providing the source code and detailed instructions. In terms of data storage design, BioLake adopts the data lakehouse architecture to provide storage scalability and analysis flexibility. To further enhance the analysis efficiency, BioLake supports online analysis for custom data, allowing researchers to upload their own data via a designed procedure without waiting for server-side approval. BioLake allows a one-time upload of custom data of up to 500 MB to ensure that researchers avoid issues with data being too large for upload. In terms of the built-in dataset, BioLake applies reactive continuous data integration, helping the analysis pipeline to get rid of most preprocessing steps. The only pre-built-in dataset of BioLake in the first public version is TCGA-PRAD mRNA expression data for prostate cancer research, which is the primary focus of the development team of BioLake. In summary, BioLake offers a lightweight online tool to facilitate bioinformatic mRNA data analysis with the support of custom online data processing.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
生物医学研究人员必须经常处理大量的原始数据,而对这些数据的分析可能会提供重要的见解。然而,如果原始数据规模庞大,就很难发现这些见解。本文介绍了一个名为 BioLake 的数据框架,它提供了简约的交互式方法来帮助研究人员进行生物信息学数据分析。与市场上现有的一些分析工具不同,BioLake 支持对公共数据集进行各种基于网络的生物信息学数据分析,同时允许研究人员即时分析他们的私人数据集。该工具还通过提供源代码和详细说明,大大提高了结果的可解释性。在数据存储设计方面,BioLake 采用数据湖架构,提供存储的可扩展性和分析的灵活性。为进一步提高分析效率,BioLake 支持自定义数据的在线分析,允许研究人员通过设计好的程序上传自己的数据,而无需等待服务器端的审批。BioLake 允许一次性上传最大 500 MB 的自定义数据,以确保研究人员避免因数据过大而无法上传的问题。在内置数据集方面,BioLake 采用反应式连续数据集成,帮助分析管道省去了大部分预处理步骤。BioLake 第一个公开版本的唯一预内置数据集是 TCGA-PRAD 用于前列腺癌研究的 mRNA 表达数据,这也是 BioLake 开发团队的主要关注点。总之,BioLake 提供了一个轻量级在线工具,通过支持自定义在线数据处理,促进生物信息学 mRNA 数据分析。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
BMC Bioinformatics
BMC Bioinformatics 生物-生化研究方法
CiteScore
5.70
自引率
3.30%
发文量
506
审稿时长
4.3 months
期刊介绍: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.
期刊最新文献
Conditional similarity triplets enable covariate-informed representations of single-cell data. TockyPrep: data preprocessing methods for flow cytometric fluorescent timer analysis. Correction: HISS: Snakemake-based workflows for performing SMRT-RenSeq assembly, AgRenSeq and dRenSeq for the discovery of novel plant disease resistance genes. Direct coupling analysis and the attention mechanism. Instance-level semantic segmentation of nuclei based on multimodal structure encoding.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1