Steering veridical large language model analyses by correcting and enriching generated database queries: first steps toward ChatGPT bioinformatics.

IF 7.7 2区生物学 Q1 BIOCHEMICAL RESEARCH METHODS Briefings in bioinformatics Pub Date : 2024-11-22 DOI:10.1093/bib/bbaf045

Olivier Cinquin

{"title":"Steering veridical large language model analyses by correcting and enriching generated database queries: first steps toward ChatGPT bioinformatics.","authors":"Olivier Cinquin","doi":"10.1093/bib/bbaf045","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models (LLMs) leverage factual knowledge from pretraining. Yet this knowledge remains incomplete and sometimes challenging to retrieve-especially in scientific domains not extensively covered in pretraining datasets and where information is still evolving. Here, we focus on genomics and bioinformatics. We confirm and expand upon issues with plain ChatGPT functioning as a bioinformatics assistant. Poor data retrieval and hallucination lead ChatGPT to err, as do incorrect sequence manipulations. To address this, we propose a system basing LLM outputs on up-to-date, authoritative facts and facilitating LLM-guided data analysis. Specifically, we introduce NagGPT, a middleware tool to insert between LLMs and databases, designed to bridge gaps in LLM knowledge and usage of database application programming interfaces. NagGPT proxies LLM-generated database queries, with special handling of incorrect queries. It acts as a gatekeeper between query responses and the LLM prompt, redirecting large responses to files but providing a synthesized snippet and injecting comments to steer the LLM. A companion OpenAI custom GPT, Genomics Fetcher-Analyzer, connects ChatGPT with NagGPT. It steers ChatGPT to generate and run Python code, performing bioinformatics tasks on data dynamically retrieved from a dozen common genomics databases (e.g. NCBI, Ensembl, UniProt, WormBase, and FlyBase). We implement partial mitigations for encountered challenges: detrimental interactions between code generation style and data analysis, confusion between database identifiers, and hallucination of both data and actions taken. Our results identify avenues to augment ChatGPT as a bioinformatics assistant and, more broadly, to improve factual accuracy and instruction following of unmodified LLMs.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":7.7000,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11798674/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbaf045","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Large language models (LLMs) leverage factual knowledge from pretraining. Yet this knowledge remains incomplete and sometimes challenging to retrieve-especially in scientific domains not extensively covered in pretraining datasets and where information is still evolving. Here, we focus on genomics and bioinformatics. We confirm and expand upon issues with plain ChatGPT functioning as a bioinformatics assistant. Poor data retrieval and hallucination lead ChatGPT to err, as do incorrect sequence manipulations. To address this, we propose a system basing LLM outputs on up-to-date, authoritative facts and facilitating LLM-guided data analysis. Specifically, we introduce NagGPT, a middleware tool to insert between LLMs and databases, designed to bridge gaps in LLM knowledge and usage of database application programming interfaces. NagGPT proxies LLM-generated database queries, with special handling of incorrect queries. It acts as a gatekeeper between query responses and the LLM prompt, redirecting large responses to files but providing a synthesized snippet and injecting comments to steer the LLM. A companion OpenAI custom GPT, Genomics Fetcher-Analyzer, connects ChatGPT with NagGPT. It steers ChatGPT to generate and run Python code, performing bioinformatics tasks on data dynamically retrieved from a dozen common genomics databases (e.g. NCBI, Ensembl, UniProt, WormBase, and FlyBase). We implement partial mitigations for encountered challenges: detrimental interactions between code generation style and data analysis, confusion between database identifiers, and hallucination of both data and actions taken. Our results identify avenues to augment ChatGPT as a bioinformatics assistant and, more broadly, to improve factual accuracy and instruction following of unmodified LLMs.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

通过纠正和丰富生成的数据库查询来指导验证大型语言模型分析：ChatGPT生物信息学的第一步。

大型语言模型（llm）利用预训练中的事实知识。然而，这些知识仍然不完整，有时很难检索，特别是在预训练数据集没有广泛覆盖的科学领域，以及信息仍在不断发展的领域。在这里，我们关注基因组学和生物信息学。我们确认并扩展了普通ChatGPT作为生物信息学助手的问题。糟糕的数据检索和幻觉导致ChatGPT出错，错误的序列操作也是如此。为了解决这个问题，我们提出了一个基于最新权威事实的法学硕士输出系统，并促进法学硕士指导的数据分析。具体来说，我们将介绍NagGPT，这是一种用于在LLM和数据库之间插入的中间件工具，旨在弥合LLM知识和数据库应用程序编程接口使用方面的差距。NagGPT代理llm生成的数据库查询，并对不正确的查询进行特殊处理。它充当查询响应和LLM提示符之间的看门人，将大型响应重定向到文件，但提供合成代码片段并注入注释来引导LLM。OpenAI自定义GPT， Genomics Fetcher-Analyzer，将ChatGPT与NagGPT连接起来。它引导ChatGPT生成并运行Python代码，对从十几个常见基因组数据库（例如NCBI， Ensembl, UniProt， WormBase和FlyBase）动态检索的数据执行生物信息学任务。我们对遇到的挑战实现了部分缓解：代码生成风格和数据分析之间的有害交互，数据库标识符之间的混淆，以及对数据和所采取的操作的幻觉。我们的研究结果确定了增强ChatGPT作为生物信息学助手的途径，更广泛地说，可以提高未修改llm的事实准确性和指令遵循。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Briefings in bioinformatics 生物-生化研究方法

CiteScore

13.20

自引率

13.70%

发文量

549

审稿时长

6 months

期刊介绍： Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.