Criteria2Query 3.0: Leveraging generative large language models for clinical trial eligibility query generation

IF 4 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Journal of Biomedical Informatics Pub Date : 2024-04-30 DOI:10.1016/j.jbi.2024.104649

Jimyung Park , Yilu Fang , Casey Ta , Gongbo Zhang , Betina Idnay , Fangyi Chen , David Feng , Rebecca Shyu , Emily R. Gordon , Matthew Spotnitz , Chunhua Weng

{"title":"Criteria2Query 3.0: Leveraging generative large language models for clinical trial eligibility query generation","authors":"Jimyung Park , Yilu Fang , Casey Ta , Gongbo Zhang , Betina Idnay , Fangyi Chen , David Feng , Rebecca Shyu , Emily R. Gordon , Matthew Spotnitz , Chunhua Weng","doi":"10.1016/j.jbi.2024.104649","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><p>Automated identification of eligible patients is a bottleneck of clinical research. We propose Criteria2Query (C2Q) 3.0, a system that leverages GPT-4 for the semi-automatic transformation of clinical trial eligibility criteria text into executable clinical database queries.</p></div><div><h3>Materials and Methods</h3><p>C2Q 3.0 integrated three GPT-4 prompts for concept extraction, SQL query generation, and reasoning. Each prompt was designed and evaluated separately. The concept extraction prompt was benchmarked against manual annotations from 20 clinical trials by two evaluators, who later also measured SQL generation accuracy and identified errors in GPT-generated SQL queries from 5 clinical trials. The reasoning prompt was assessed by three evaluators on four metrics: readability, correctness, coherence, and usefulness, using corrected SQL queries and an open-ended feedback questionnaire.</p></div><div><h3>Results</h3><p>Out of 518 concepts from 20 clinical trials, GPT-4 achieved an F1-score of 0.891 in concept extraction. For SQL generation, 29 errors spanning seven categories were detected, with logic errors being the most common (n = 10; 34.48 %). Reasoning evaluations yielded a high coherence rating, with the mean score being 4.70 but relatively lower readability, with a mean of 3.95. Mean scores of correctness and usefulness were identified as 3.97 and 4.37, respectively.</p></div><div><h3>Conclusion</h3><p>GPT-4 significantly improves the accuracy of extracting clinical trial eligibility criteria concepts in C2Q 3.0. Continued research is warranted to ensure the reliability of large language models.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"154 ","pages":"Article 104649"},"PeriodicalIF":4.0000,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Informatics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1532046424000674","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

Objective

Automated identification of eligible patients is a bottleneck of clinical research. We propose Criteria2Query (C2Q) 3.0, a system that leverages GPT-4 for the semi-automatic transformation of clinical trial eligibility criteria text into executable clinical database queries.

Materials and Methods

C2Q 3.0 integrated three GPT-4 prompts for concept extraction, SQL query generation, and reasoning. Each prompt was designed and evaluated separately. The concept extraction prompt was benchmarked against manual annotations from 20 clinical trials by two evaluators, who later also measured SQL generation accuracy and identified errors in GPT-generated SQL queries from 5 clinical trials. The reasoning prompt was assessed by three evaluators on four metrics: readability, correctness, coherence, and usefulness, using corrected SQL queries and an open-ended feedback questionnaire.

Results

Out of 518 concepts from 20 clinical trials, GPT-4 achieved an F1-score of 0.891 in concept extraction. For SQL generation, 29 errors spanning seven categories were detected, with logic errors being the most common (n = 10; 34.48 %). Reasoning evaluations yielded a high coherence rating, with the mean score being 4.70 but relatively lower readability, with a mean of 3.95. Mean scores of correctness and usefulness were identified as 3.97 and 4.37, respectively.

Conclusion

GPT-4 significantly improves the accuracy of extracting clinical trial eligibility criteria concepts in C2Q 3.0. Continued research is warranted to ensure the reliability of large language models.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Criteria2Query 3.0：利用生成式大语言模型生成临床试验资格查询

目标自动识别符合条件的患者是临床研究的一个瓶颈。我们提出了 Criteria2Query（C2Q）3.0，这是一个利用 GPT-4 将临床试验资格标准文本半自动转换为可执行临床数据库查询的系统。每个提示都是单独设计和评估的。概念提取提示由两名评估人员根据 20 项临床试验的人工注释进行基准测试，他们随后还测量了 SQL 生成的准确性，并找出了 GPT 生成的 5 项临床试验 SQL 查询中的错误。推理提示由三位评估员根据四项指标进行评估：可读性、正确性、连贯性和实用性，并使用校正后的 SQL 查询和开放式反馈问卷。结果在 20 项临床试验的 518 个概念中，GPT-4 的概念提取 F1 分数达到 0.891。在 SQL 生成过程中，发现了 7 个类别的 29 个错误，其中逻辑错误最为常见（n = 10; 34.48 %）。推理评估的一致性评分较高，平均得分为 4.70，但可读性相对较低，平均得分为 3.95。结论 GPT-4 显著提高了 C2Q 3.0 中提取临床试验资格标准概念的准确性。为确保大型语言模型的可靠性，有必要继续开展研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Biomedical Informatics 医学-计算机：跨学科应用

CiteScore

8.90

自引率

6.70%

发文量

243

审稿时长

32 days

期刊介绍： The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.