SampleExplorer: using language models to discover relevant transcriptome data.

Bioinformatics (Oxford, England) Pub Date : 2024-12-26 DOI:10.1093/bioinformatics/btae759

Wee Loong Chin, Timo Lassmann

{"title":"SampleExplorer: using language models to discover relevant transcriptome data.","authors":"Wee Loong Chin, Timo Lassmann","doi":"10.1093/bioinformatics/btae759","DOIUrl":null,"url":null,"abstract":"Motivation: Over the last two decades, transcriptomics has become a standard technique in biomedical research. We now have large databases of RNA-seq data, accompanied by valuable metadata detailing scientific objectives and the experimental procedures used. The metadata is crucial in understanding and replicating published studies, but so far has been underutilized in helping researchers to discover existing datasets.Results: We present SampleExplorer, a tool allowing researchers to search for relevant data using both text and gene set queries. SampleExplorer embeds sample metadata and uses a transformer-based language model to retrieve similar datasets. Extensive benchmarking (see Supplementary Materials and Methods) using the ARCHS4 database demonstrates that SampleExplorer provides an effective approach for retrieving biologically relevant samples from large-scale transcriptomicdata. This tool provides an efficient approach for discovering relevant gene expression datasets in large public repositories. It improves sample and dataset identification across diverse experimental contexts, helping researchers leverage existing transcriptomic data for potential replication or verification studies.Availability and implementation: SampleExplorer is available as a Python package compatible with versions 3.9 to 3.11, available for installation via the Python Package Index (PyPI). The codebase and documentation are accessible at https://github.com/wlchin/SampleExplorer. Supplementary data (Supplementary Materials and Methods) provides detailed methodological information, including an algorithmic description of the retrieval process and data preparation steps.","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11751629/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btae759","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Motivation: Over the last two decades, transcriptomics has become a standard technique in biomedical research. We now have large databases of RNA-seq data, accompanied by valuable metadata detailing scientific objectives and the experimental procedures used. The metadata is crucial in understanding and replicating published studies, but so far has been underutilized in helping researchers to discover existing datasets.

Results: We present SampleExplorer, a tool allowing researchers to search for relevant data using both text and gene set queries. SampleExplorer embeds sample metadata and uses a transformer-based language model to retrieve similar datasets. Extensive benchmarking (see Supplementary Materials and Methods) using the ARCHS4 database demonstrates that SampleExplorer provides an effective approach for retrieving biologically relevant samples from large-scale transcriptomicdata. This tool provides an efficient approach for discovering relevant gene expression datasets in large public repositories. It improves sample and dataset identification across diverse experimental contexts, helping researchers leverage existing transcriptomic data for potential replication or verification studies.

Availability and implementation: SampleExplorer is available as a Python package compatible with versions 3.9 to 3.11, available for installation via the Python Package Index (PyPI). The codebase and documentation are accessible at https://github.com/wlchin/SampleExplorer. Supplementary data (Supplementary Materials and Methods) provides detailed methodological information, including an algorithmic description of the retrieval process and data preparation steps.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

SampleExplorer：使用语言模型来发现相关转录组数据。

动机：在过去的二十年里，转录组学已经成为生物医学研究的标准技术。我们现在拥有大量的RNA-seq数据数据库，并附有详细说明科学目标和实验程序的有价值的元数据。元数据对于理解和复制已发表的研究至关重要，但是迄今为止在帮助研究人员发现现有数据集方面还没有得到充分利用。结果：我们提出了SampleExplorer，一个允许研究人员使用文本和基因集查询来搜索相关数据的工具。SampleExplorer嵌入样例元数据，并使用基于转换器的语言模型（LM）检索类似的数据集。使用ARCHS4数据库进行广泛的基准测试（见材料和方法）表明，SampleExplorer为从大规模转录组学数据中检索生物学相关样本提供了有效的方法。结论：SampleExplorer为在大型公共存储库中发现相关基因表达数据集提供了有效的方法。它改善了不同实验背景下的样本和数据集识别，帮助研究人员利用现有的转录组学数据进行潜在的复制或验证研究。补充信息：补充数据（材料和方法）可在生物信息学在线获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Bioinformatics (Oxford, England)

自引率

0.00%

发文量