Extracting the Sample Size From Randomized Controlled Trials in Explainable Fashion Using Natural Language Processing

Paul Windisch, Fabio Dennstaedt, Carole Koechli, Robert Foerster, Christina Schroeder, Daniel M. Aebersold, Daniel R. Zwahlen

medRxiv - Health Informatics, published 2024-07-10. DOI: 10.1101/2024.07.09.24310155
Abstract
Background: Extracting the sample size from randomized controlled trials (RCTs) remains a challenge for developing better search functionalities or automating systematic reviews. Most current approaches rely on the sample size being explicitly mentioned in the abstract.
Methods: 847 RCTs from high-impact medical journals were tagged with six different entities that could indicate the sample size. A named entity recognition (NER) model was trained to extract the entities and then deployed on a test set of 150 RCTs. The entities' performance in predicting the actual number of trial participants who were randomized was assessed, and possible combinations of the entities were evaluated to create predictive models.
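The abstract does not name the NER framework or the exact entity inventory, so the following is only a minimal sketch, assuming a spaCy-style setup and two hypothetical labels (TOTAL_RANDOMIZED and ARM_SIZE) standing in for the six entities described above. It shows how tagged abstract sentences could be serialized into training data for a NER model.

```python
# Minimal sketch: preparing NER training data for sample-size entities.
# Assumptions: spaCy is used (not stated in the paper) and the labels
# TOTAL_RANDOMIZED / ARM_SIZE are hypothetical stand-ins for the six entities.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Each example is (text, [(char_start, char_end, label), ...]).
examples = [
    (
        "A total of 412 patients were randomized, 206 to each arm.",
        [(11, 14, "TOTAL_RANDOMIZED"), (41, 44, "ARM_SIZE")],
    ),
]

db = DocBin()
for text, spans in examples:
    doc = nlp.make_doc(text)
    ents = [doc.char_span(start, end, label=label) for start, end, label in spans]
    doc.ents = [e for e in ents if e is not None]  # drop spans that do not align with token boundaries
    db.add(doc)

db.to_disk("train.spacy")  # consumed by `python -m spacy train` with a NER pipeline config
```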
Results: The most accurate model could make predictions for 64.7% of trials in the test set, and the resulting predictions were within 10% of the ground truth in 96.9% of cases. A less strict model could make a prediction for 96.0% of trials, and its predictions were within 10% of the ground truth in 88.2% of cases.
Conclusion: Training a named entity recognition model to predict the sample size from randomized controlled trials is feasible, not only when the sample size is explicitly mentioned but also when it can be calculated, e.g., by adding up the number of patients in each arm.
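To illustrate the kind of combination logic this implies, a minimal sketch follows; the labels and the preference order (use an explicitly stated total if present, otherwise sum the per-arm counts) are assumptions for illustration, not the paper's actual models.

```python
# Minimal sketch: combining extracted entities into a sample-size prediction.
# The labels TOTAL_RANDOMIZED / ARM_SIZE and the decision rule are assumed for
# illustration; the paper evaluates several such combinations of entities.
from typing import Optional


def predict_sample_size(entities: list[tuple[str, int]]) -> Optional[int]:
    """Predict the number of randomized participants from (label, value) pairs."""
    totals = [value for label, value in entities if label == "TOTAL_RANDOMIZED"]
    if totals:
        return totals[0]  # prefer an explicitly stated total
    arms = [value for label, value in entities if label == "ARM_SIZE"]
    if len(arms) >= 2:
        return sum(arms)  # otherwise add up the number of patients in each arm
    return None  # abstain when the extracted entities do not support a prediction


# Example: only per-arm counts were extracted from the abstract.
print(predict_sample_size([("ARM_SIZE", 206), ("ARM_SIZE", 206)]))  # -> 412
```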