GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians
Haoyang Liu, Haohan Wang
arXiv - QuanBio - Genomics, 2024-06-21 (arXiv:2406.15341)
Recent advancements in machine learning have significantly improved the
identification of disease-associated genes from gene expression datasets.
However, these processes often require extensive expertise and manual effort,
limiting their scalability. Large Language Model (LLM)-based agents have shown
promise in automating these tasks due to their increasing problem-solving
abilities. To support the evaluation and development of such methods, we
introduce GenoTEX, a benchmark dataset for the automatic exploration of gene
expression data, involving the tasks of dataset selection, preprocessing, and
statistical analysis. GenoTEX provides annotated code and results for solving a
wide range of gene identification problems, in a full analysis pipeline that
follows the standards of computational genomics. These annotations are curated
by human bioinformaticians who carefully analyze the datasets to ensure
accuracy and reliability. To provide baselines for these tasks, we present
GenoAgents, a team of LLM-based agents designed with context-aware planning,
iterative correction, and domain expert consultation to collaboratively explore
gene datasets. Our experiments with GenoAgents demonstrate the potential of
LLM-based approaches in genomics data analysis, while error analysis highlights
the challenges and areas for future improvement. We propose GenoTEX as a
promising resource for benchmarking and enhancing AI-driven methods for
genomics data analysis. We make our benchmark publicly available at
https://github.com/Liu-Hy/GenoTex.