L. Lin, M. Singer-van den Hout, L.F.A. Wessels, A.J. de Langen, J.H. Beijnen, A.D.R. Huitema
ESMO Real World Data and Digital Oncology, volume 5, Article 100059. Published 2024-08-13. DOI: 10.1016/j.esmorw.2024.100059. Full text: https://www.sciencedirect.com/science/article/pii/S2949820124000377
Clinical text mining of the performance status and progression-free survival to facilitate data collection in cancer research: an exploratory study
Background
Modern electronic medical records (EMRs) contain a large amount of valuable data. These data can be unlocked for research by manual data collection, but this is highly labor-intensive. We therefore explored whether automated text mining (TM) could be used to extract the performance status (PS) and progression-free survival (PFS) in a cohort of 328 patients with non-small-cell lung cancer.
Materials and methods
Unstructured Dutch text data were derived from different EMR fields, mainly containing information recorded during outpatient visits. A rule-based TM approach using regular expressions, implemented in the R programming language, was used to extract PS and PFS. For PS, quantitative evaluation metrics, such as the weighted F1-score, were used to determine the accuracy of the TM-extracted data against the manually curated data. For PFS, the median PFS was estimated with the Kaplan–Meier method and compared between the manual and TM approaches. In addition, the C-index was determined.
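As an illustration of what rule-based extraction with regular expressions can look like, here is a minimal sketch in Python. It is not the authors' R implementation: the abstract does not list the actual patterns, so the pattern, the function name extract_ps and the example notes below are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern (the study's real expressions are not given in the
# abstract). It accepts a keyword such as "WHO", "ECOG", "performance status"
# or "PS", optional separators, and a single digit from 0 to 4.
PS_PATTERN = re.compile(
    r"\b(?:WHO|ECOG|performance\s*status|PS)\b[\s:=-]*(?:PS\b[\s:=-]*)?([0-4])\b",
    re.IGNORECASE,
)


def extract_ps(note: str) -> Optional[int]:
    """Return the first performance status (0-4) found in a note, or None."""
    match = PS_PATTERN.search(note)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Invented example notes, not taken from the study data.
    for note in ["Patiënt in goede conditie, WHO PS 1.", "ECOG: 2", "Geen klachten."]:
        print(repr(note), "->", extract_ps(note))
```

Returning None when no pattern matches mirrors the situation in which a PS cannot be obtained for every patient. A production rule set would likely need many more variants, including negations and free-text descriptions.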
Results
Using the TM approach, a PS was obtained for 196 of the 328 patients (60%). In 189 of these 196 patients (96%), the TM-curated PS matched the manually curated PS. The weighted F1-score was 96.5%. The median PFS was 7.42 months for the manually curated data (n = 328) and 8.00 months for the TM-curated data (n = 301). The C-index was 0.916.
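To make two of the reported quantities concrete, the following Python sketch shows one common way to compute a weighted F1-score and a Kaplan–Meier median. It is an illustration under stated assumptions, not the study's R code: the function names weighted_f1 and km_median are hypothetical, and the median is taken as the first event time at which the estimated survival is at or below 0.5, a convention that may differ from the software used in the study (some packages average over a plateau at exactly 0.5).

```python
from collections import Counter
from fractions import Fraction
from typing import Optional, Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = len(y_true)
    score = 0.0
    for label, n_true in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


def km_median(times: Sequence[float], events: Sequence[int]) -> Optional[float]:
    """Kaplan-Meier median: first event time where survival drops to <= 0.5.

    events[i] is 1 if progression or death was observed at times[i], 0 if the
    patient was censored. Returns None if survival never reaches 0.5.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = Fraction(1)  # exact arithmetic avoids rounding issues at 0.5
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= Fraction(at_risk - n_events, at_risk)
            if survival <= Fraction(1, 2):
                return t
        at_risk -= n_leaving
    return None


if __name__ == "__main__":
    # Proportions reported in the abstract, recomputed from the counts.
    print(round(100 * 196 / 328), round(100 * 189 / 196))
    # Invented toy data, not study data.
    print(weighted_f1([0, 0, 1, 1], [0, 0, 1, 0]))
    print(km_median([1, 2, 3, 4, 5], [1, 0, 1, 0, 1]))
```

Weighting each class's F1 by its frequency keeps rare PS values from dominating the summary score, while still penalizing systematic errors on common ones.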
Conclusions
The developed TM approach was able to extract PS and PFS from the EMR with very good performance. This approach can therefore increase the efficiency of reliable data collection from EMRs, facilitating the use of real-world data (RWD) in clinical research.