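Clinical text mining of the performance status and progression-free survival to facilitate data collection in cancer research: an exploratory study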

L. Lin, M. Singer-van den Hout, L.F.A. Wessels, A.J. de Langen, J.H. Beijnen, A.D.R. Huitema
ESMO Real World Data and Digital Oncology, volume 5, Article 100059. Publication date: 2024-08-13.
DOI: 10.1016/j.esmorw.2024.100059
Full text: https://www.sciencedirect.com/science/article/pii/S2949820124000377

Abstract


Background

Modern electronic medical records (EMRs) contain a large amount of valuable data. These data can be unlocked for research by manual data collection, but that is highly labor intensive. We therefore explored whether automated text mining (TM) could be used to extract the performance status (PS) and progression-free survival (PFS) in a cohort of 328 patients with non-small-cell lung cancer.

Materials and methods

Unstructured Dutch text data were derived from several EMR fields containing mainly information recorded during outpatient visits. A rule-based TM approach using regular expressions, implemented in the R programming language, was used to extract PS and PFS. For PS, quantitative evaluation metrics, such as the weighted F1-score, were used to assess the accuracy of the TM-extracted data. For PFS, the median PFS was compared between manual curation and TM curation using the Kaplan–Meier method. In addition, the C-index was determined.
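To make the rule-based idea concrete, here is a minimal sketch of regular-expression extraction of a performance status from a free-text note. It is written in Python rather than the R used in the study, and it is not the authors' rule set: the pattern `PS_PATTERN`, the function `extract_ps`, the cue words, and the example notes are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical rule: a cue word (WHO, ECOG, PS, "performance status/score")
# followed, within at most 15 non-digit characters, by a single grade 0-4.
# The trailing \b rejects multi-digit numbers such as years.
PS_PATTERN = re.compile(
    r"\b(?:WHO|ECOG|PS|performance[\s-]?(?:status|score))\b"
    r"[^\d\n]{0,15}?([0-4])\b",
    re.IGNORECASE,
)


def extract_ps(note: str) -> Optional[int]:
    """Return the first performance status found in a note, or None."""
    match = PS_PATTERN.search(note)
    return int(match.group(1)) if match else None


# Invented example notes (not taken from the study data).
notes = [
    "Goede conditie, WHO PS 1.",
    "ECOG performance status: 2",
    "Controle over 6 weken.",
]
for note in notes:
    print(repr(note), "->", extract_ps(note))
```

A real rule set would presumably need many more patterns, for example for negations, ranges, and local abbreviations; patients whose notes match no rule would be left without a TM-extracted PS.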

Results

A PS was obtained for 196 patients (60%) using the TM approach. In 189 (96%) patients, the TM-curated PS matched the manually curated PS. The weighted F1-score was 96.5%. The median PFS was 7.42 months for the manually curated data (n = 328) and 8.00 months for the TM-curated data (n = 301). The C-index was 0.916.
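As an illustration of the reported metric, the sketch below computes a weighted F1-score in the usual way: the F1 of each class is averaged with weights proportional to the number of true instances of that class (the same definition as `average="weighted"` in scikit-learn's `f1_score`). The function name `weighted_f1` and the toy labels are assumptions; they are not the study's code or data, and the abstract does not state how the authors computed the score.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's true count.

    Assumes y_true is non-empty and both sequences have the same length.
    """
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        # F1 is defined as 0 when there are no true positives.
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy labels invented for illustration: manually curated PS vs. TM-extracted PS.
manual = [0, 0, 1, 1, 1, 2]
mined = [0, 0, 1, 1, 2, 2]
print(round(weighted_f1(manual, mined), 3))  # 0.844
```

Weighting by class frequency keeps rare PS grades from dominating the score, which matters when most patients fall into one or two grades.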

Conclusions

The developed TM approach extracted PS and PFS from the EMR with very good performance. It can therefore increase the efficiency of reliable data collection from EMRs, facilitating the use of real-world data (RWD) in clinical research.
