Nam H Nguyen, Elissa B Dodd-Eaton, Gang Peng, Jessica L Corredor, Wenwei Jiao, Jacynda Woodman-Ross, Banu K Arun, Wenyi Wang
Purpose: LFSPRO is an R library that implements risk prediction models for Li-Fraumeni syndrome (LFS), a genetic disorder characterized by deleterious germline mutations in the TP53 gene. To facilitate the use of these models in clinics, we developed LFSPROShiny, an interactive R/Shiny interface to LFSPRO that allows genetic counselors (GCs) to perform risk predictions without any programming and to visualize the risk profiles of their patients to aid the decision-making process.
Methods: LFSPROShiny implements two models that have been validated on multiple LFS patient cohorts: a competing risk model that predicts cancer-specific risks for the first primary and a recurrent-event model that predicts the risk of a second primary tumor. Starting with a visualization template, we kept regular contact with GCs, who ran LFSPROShiny in their counseling sessions, to collect feedback and discuss potential improvements. On receiving the family history as input, LFSPROShiny renders the family into a pedigree and displays the risk estimates of the family members in a tabular format. The software offers interactive overlaid side-by-side bar charts for visualization of the patients' cancer risks relative to the general population.
Results: We walk through a detailed example to illustrate how GCs can run LFSPROShiny in clinics from data preparation to downstream analyses and interpretation of results with an emphasis on the utilities that LFSPROShiny provides to aid decision making.
Conclusion: Since December 2021, we have applied LFSPROShiny to over 100 families from counseling sessions at the MD Anderson Cancer Center. Our study suggests that software tools with easy-to-use interfaces are crucial for the dissemination of risk prediction models in clinical settings, and it can serve as a guideline for future development of similar models.
{"title":"LFSPROShiny: An Interactive R/Shiny App for Prediction and Visualization of Cancer Risks in Families With Deleterious Germline <i>TP53</i> Mutations.","authors":"Nam H Nguyen, Elissa B Dodd-Eaton, Gang Peng, Jessica L Corredor, Wenwei Jiao, Jacynda Woodman-Ross, Banu K Arun, Wenyi Wang","doi":"10.1200/CCI.23.00167","DOIUrl":"10.1200/CCI.23.00167","url":null,"abstract":"<p><strong>Purpose: </strong>LFSPRO is an R library that implements risk prediction models for Li-Fraumeni syndrome (LFS), a genetic disorder characterized by deleterious germline mutations in the <i>TP53</i> gene. To facilitate the use of these models in clinics, we developed LFSPROShiny, an interactive R/Shiny interface of LFSPRO that allows genetic counselors (GCs) to perform risk predictions without any programming components and further visualize the risk profiles of their patients to aid the decision-making process.</p><p><strong>Methods: </strong>LFSPROShiny implements two models that have been validated on multiple LFS patient cohorts: a competing risk model that predicts cancer-specific risks for the first primary and a recurrent-event model that predicts the risk of a second primary tumor. Starting with a visualization template, we keep regular contact with GCs, who ran LFSPROShiny in their counseling sessions, to collect feedback and discuss potential improvement. On receiving the family history as input, LFSPROShiny renders the family into a pedigree and displays the risk estimates of the family members in a tabular format. 
The software offers interactive overlaid side-by-side bar charts for visualization of the patients' cancer risks relative to the general population.</p><p><strong>Results: </strong>We walk through a detailed example to illustrate how GCs can run LFSPROShiny in clinics from data preparation to downstream analyses and interpretation of results with an emphasis on the utilities that LFSPROShiny provides to aid decision making.</p><p><strong>Conclusion: </strong>Since December 2021, we have applied LFSPROShiny to over 100 families from counseling sessions at the MD Anderson Cancer Center. Our study suggests that software tools with easy-to-use interfaces are crucial for the dissemination of risk prediction models in clinical settings, hence serving as a guideline for future development of similar models.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10871774/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139724933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can you rely on national health records after an earthquake for cancer care? When nothing is left!
{"title":"Cancer Care After the Earthquake? What Is Left to Us?","authors":"Ismail Beypinar","doi":"10.1200/CCI.23.00253","DOIUrl":"10.1200/CCI.23.00253","url":null,"abstract":"<p><p>Can you rely on national health records after an earthquake for cancer care? When nothing is left!</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139934120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kawther Abdilleh, Omar Khalid, Dennis Ladnier, Wenshuai Wan, Sara Seepo, Garrett Rupp, Valentin Corelj, Zelia F Worman, Divya Sain, Jack DiGiovanna, Bruce Press, Satty Chandrashekhar, Eric Collisson, Karen Y Cui, Anirban Maitra, Paul A Rejto, Kevin P White, Lynn Matrisian, Sudheer Doss
Purpose: Pancreatic cancer is currently the third deadliest cancer in the United States, and its 5-year survival rate is among the lowest for major cancers at just 12%. Thus, continued research efforts to better understand the clinical and molecular underpinnings of pancreatic cancer are critical to developing both early detection methodologies and improved therapeutic options. This study introduces Pancreatic Cancer Action Network's (PanCAN's) SPARK, a cloud-based data and analytics platform that integrates patient health data from PanCAN's research initiatives and aims to accelerate pancreatic cancer research by making real-world patient health data and analysis tools easier to access and use.
Materials and methods: The SPARK platform integrates clinical, molecular, multiomic, imaging, and patient-reported data generated from PanCAN's research initiatives. The platform is built on a cloud-based infrastructure powered by Velsera. Cohort exploration and browser capabilities are built using Velsera ARIA, a specialized product for leveraging clinicogenomic data to build cohorts, query variant information, and drive downstream association analyses. Data science and analytic capabilities are also built into the platform, allowing researchers to perform simple to complex analyses.
Results: Version 1 of the SPARK platform was released to pilot users, who represented diverse end users, including molecular biologists, clinicians, and bioinformaticians. Included in the pilot release of SPARK are deidentified clinical (including treatment and outcomes), molecular, and multiomic data and whole-slide pathology images for over 600 patients enrolled in PanCAN's Know Your Tumor molecular profiling service.
Conclusion: The pilot release of the SPARK platform introduces qualified researchers to PanCAN real-world patient health data and analytical resources in a centralized location.
{"title":"Pancreatic Cancer Action Network's SPARK: A Cloud-Based Patient Health Data and Analytics Platform for Pancreatic Cancer.","authors":"Kawther Abdilleh, Omar Khalid, Dennis Ladnier, Wenshuai Wan, Sara Seepo, Garrett Rupp, Valentin Corelj, Zelia F Worman, Divya Sain, Jack DiGiovanna, Bruce Press, Satty Chandrashekhar, Eric Collisson, Karen Y Cui, Anirban Maitra, Paul A Rejto, Kevin P White, Lynn Matrisian, Sudheer Doss","doi":"10.1200/CCI.23.00119","DOIUrl":"10.1200/CCI.23.00119","url":null,"abstract":"<p><strong>Purpose: </strong>Pancreatic cancer currently holds the position of third deadliest cancer in the United States and the 5-year survival rate is among the lowest for major cancers at just 12%. Thus, continued research efforts to better understand the clinical and molecular underpinnings of pancreatic cancer are critical to developing both early detection methodologies as well as improved therapeutic options. This study introduces Pancreatic Cancer Action Network's (PanCAN's) SPARK, a cloud-based data and analytics platform that integrates patient health data from the PanCAN's research initiatives and aims to accelerate pancreatic cancer research by making real-world patient health data and analysis tools easier to access and use.</p><p><strong>Materials and methods: </strong>The SPARK platform integrates clinical, molecular, multiomic, imaging, and patient-reported data generated from PanCAN's research initiatives. The platform is built on a cloud-based infrastructure powered by Velsera. Cohort exploration and browser capabilities are built using Velsera ARIA, a specialized product for leveraging clinicogenomic data to build cohorts, query variant information, and drive downstream association analyses. 
Data science and analytic capabilities are also built into the platform allowing researchers to perform simple to complex analysis.</p><p><strong>Results: </strong>Version 1 of the SPARK platform was released to pilot users, who represented diverse end users, including molecular biologists, clinicians, and bioinformaticians. Included in the pilot release of SPARK are deidentified clinical (including treatment and outcomes data), molecular, multiomic, and whole-slide pathology images for over 600 patients enrolled in PanCAN's Know Your Tumor molecular profiling service.</p><p><strong>Conclusion: </strong>The pilot release of the SPARK platform introduces qualified researchers to PanCAN real-world patient health data and analytical resources in a centralized location.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10803046/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139081039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Min Yuan, Haolun Ding, Bangwei Guo, Miaomiao Yang, Yaning Yang, Xu Steven Xu
Purpose: To apply deep learning algorithms to histopathology images, construct image-based subtypes independent of known clinical and molecular classifications for glioblastoma, and produce novel insights into molecular and immune characteristics of the glioblastoma tumor microenvironment.
Materials and methods: Using whole-slide hematoxylin and eosin images from 214 patients with glioblastoma in The Cancer Genome Atlas (TCGA), a fine-tuned convolutional neural network model extracted deep learning features. Biclustering was used to identify subtypes and image feature modules. Prognostic value of image subtypes was assessed via Cox regression on survival outcomes and validated with 189 samples from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) data set. Morphological, molecular, and immune characteristics of glioblastoma image subtypes were analyzed.
Results: Four distinct subtypes and modules (imClust1-4) were identified for the TCGA patients with glioblastoma on the basis of the image feature data. The glioblastoma image subtypes were significantly associated with overall survival (OS; P = .028) and progression-free survival (P = .003). An apparent association was also observed for disease-specific survival (P = .096). imClust2 had the best prognosis for all three survival end points (eg, after 25 months, imClust2 had >7% more surviving patients than the other subtypes). Examination of OS in the external validation using the unseen CPTAC data set showed consistent patterns. Multivariable Cox analyses confirmed that the image subtypes carry unique prognostic information independent of known clinical and molecular predictors. Molecular and immune profiling revealed distinct immune compositions of the tumor microenvironment in different image subtypes and may provide biologic explanations for the patterns in patients' outcomes.
Conclusion: Our image-based subtype classification on the basis of deep learning models is a novel tool to refine risk stratification in cancers. The image subtypes detected for glioblastoma represent a promising prognostic biomarker with distinct molecular and immune characteristics and may facilitate developing novel, individualized immunotherapies for glioblastoma.
{"title":"Image-Based Subtype Classification for Glioblastoma Using Deep Learning: Prognostic Significance and Biologic Relevance.","authors":"Min Yuan, Haolun Ding, Bangwei Guo, Miaomiao Yang, Yaning Yang, Xu Steven Xu","doi":"10.1200/CCI.23.00154","DOIUrl":"10.1200/CCI.23.00154","url":null,"abstract":"<p><strong>Purpose: </strong>To apply deep learning algorithms to histopathology images, construct image-based subtypes independent of known clinical and molecular classifications for glioblastoma, and produce novel insights into molecular and immune characteristics of the glioblastoma tumor microenvironment.</p><p><strong>Materials and methods: </strong>Using whole-slide hematoxylin and eosin images from 214 patients with glioblastoma in The Cancer Genome Atlas (TCGA), a fine-tuned convolutional neural network model extracted deep learning features. Biclustering was used to identify subtypes and image feature modules. Prognostic value of image subtypes was assessed via Cox regression on survival outcomes and validated with 189 samples from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) data set. Morphological, molecular, and immune characteristics of glioblastoma image subtypes were analyzed.</p><p><strong>Results: </strong>Four distinct subtypes and modules (imClust1-4) were identified for the TCGA patients with glioblastoma on the basis of the image feature data. The glioblastoma image subtypes were significantly associated with overall survival (OS; <i>P</i> = .028) and progression-free survival (<i>P</i> = .003). Apparent association was also observed for disease-specific survival (<i>P</i> = .096). imClust2 had the best prognosis for all three survival end points (eg, after 25 months, imClust2 had >7% surviving patients than the other subtypes). Examination of OS in the external validation using the unseen CPTAC data set showed consistent patterns. 
Multivariable Cox analyses confirmed that the image subtypes carry unique prognostic information independent of known clinical and molecular predictors. Molecular and immune profiling revealed distinct immune compositions of the tumor microenvironment in different image subtypes and may provide biologic explanations for the patterns in patients' outcomes.</p><p><strong>Conclusion: </strong>Our image-based subtype classification on the basis of deep learning models is a novel tool to refine risk stratification in cancers. The image subtypes detected for glioblastoma represent a promising prognostic biomarker with distinct molecular and immune characteristics and may facilitate developing novel, individualized immunotherapies for glioblastoma.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139477763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Elvis Duran Sierra, Raul Valenzuela, Mathew A Canjirathinkal, Colleen M Costelloe, Heerod Moradi, John E Madewell, William A Murphy, Behrang Amini
Purpose: Limitations from commercial software applications prevent the implementation of a robust and cost-efficient high-throughput cancer imaging radiomic feature extraction and perfusion analysis workflow. This study aimed to develop and validate a cancer research computational solution using open-source software for vendor- and sequence-neutral high-throughput image processing and feature extraction.
Methods: The Cancer Radiomic and Perfusion Imaging (CARPI) automated framework is a Python-based software application that is vendor- and sequence-neutral. CARPI uses contour files generated using an application of the user's choice and performs automated radiomic feature extraction and perfusion analysis. This workflow solution was validated using two clinical data sets: the first consisted of 82 patients (40 pelvic chondrosarcomas and 42 sacral chordomas), and the second consisted of 26 patients with undifferentiated pleomorphic sarcoma (UPS) imaged at multiple points during presurgical treatment.
Results: Three hundred sixteen volumetric contour files were processed using CARPI. The application automatically extracted 107 radiomic features from multiple magnetic resonance imaging sequences and seven semiquantitative perfusion parameters from time-intensity curves. Statistically significant differences (P < .00047) were found in 18 of 107 radiomic features in chordoma versus chondrosarcoma, including six first-order and 12 high-order features. In UPS postradiation, the apparent diffusion coefficient mean increased 41% in good responders (P = .0017), while firstorder_10Percentile (P = .0312) was statistically significant between good and partial/nonresponders.
Conclusion: The CARPI processing of two clinical validation data sets confirmed the software application's ability to differentiate between different types of tumors and help predict patient response to treatment on the basis of radiomic features. Benchmark comparison with five similar open-source solutions demonstrated the advantages of CARPI in automated perfusion feature extraction, relational database generation, and graphic report export, although CARPI lacks a user-friendly graphical user interface and predictive model building.
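The significance cutoff reported in the Results (P < .00047) is numerically consistent with a Bonferroni correction of a family-wise alpha of .05 across the 107 radiomic features. The abstract does not name the correction method, so the sketch below is an assumption offered only to show where such a threshold could come from; the function names and the p-values in the usage note are hypothetical and are not CARPI's API or the study's data.

```python
# Assumption: P < .00047 is a Bonferroni-adjusted threshold
# (alpha = 0.05 divided by the 107 features tested).
# The abstract does not state which correction was used.

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_features(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Return the names of features whose p-value beats the corrected threshold.

    The number of tests is taken to be the number of features supplied.
    """
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sorted(name for name, p in p_values.items() if p < threshold)


# 0.05 / 107 rounds to the cutoff quoted in the abstract.
print(f"{bonferroni_threshold(0.05, 107):.5f}")  # prints 0.00047
```

With three made-up features, `significant_features({"a": 0.0001, "b": 0.03, "c": 0.5})` keeps only `"a"`, because the corrected threshold is 0.05 / 3.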
{"title":"Cancer Radiomic and Perfusion Imaging Automated Framework: Validation on Musculoskeletal Tumors.","authors":"Elvis Duran Sierra, Raul Valenzuela, Mathew A Canjirathinkal, Colleen M Costelloe, Heerod Moradi, John E Madewell, William A Murphy, Behrang Amini","doi":"10.1200/CCI.23.00118","DOIUrl":"10.1200/CCI.23.00118","url":null,"abstract":"<p><strong>Purpose: </strong>Limitations from commercial software applications prevent the implementation of a robust and cost-efficient high-throughput cancer imaging radiomic feature extraction and perfusion analysis workflow. This study aimed to develop and validate a cancer research computational solution using open-source software for vendor- and sequence-neutral high-throughput image processing and feature extraction.</p><p><strong>Methods: </strong>The Cancer Radiomic and Perfusion Imaging (CARPI) automated framework is a Python-based software application that is vendor- and sequence-neutral. CARPI uses contour files generated using an application of the user's choice and performs automated radiomic feature extraction and perfusion analysis. This workflow solution was validated using two clinical data sets, one consisted of 40 pelvic chondrosarcomas and 42 sacral chordomas with a total of 82 patients, and a second data set consisted of 26 patients with undifferentiated pleomorphic sarcoma (UPS) imaged at multiple points during presurgical treatment.</p><p><strong>Results: </strong>Three hundred sixteen volumetric contour files were processed using CARPI. The application automatically extracted 107 radiomic features from multiple magnetic resonance imaging sequences and seven semiquantitative perfusion parameters from time-intensity curves. Statistically significant differences (<i>P</i> < .00047) were found in 18 of 107 radiomic features in chordoma versus chondrosarcoma, including six first-order and 12 high-order features. 
In UPS postradiation, the apparent diffusion coefficient mean increased 41% in good responders (<i>P</i> = .0017), while firstorder_10Percentile (<i>P</i> = .0312) was statistically significant between good and partial/nonresponders.</p><p><strong>Conclusion: </strong>The CARPI processing of two clinical validation data sets confirmed the software application's ability to differentiate between different types of tumors and help predict patient response to treatment on the basis of radiomic features. Benchmark comparison with five similar open-source solutions demonstrated the advantages of CARPI in the automated perfusion feature extraction, relational database generation, and graphic report export features, although lacking a user-friendly graphical user interface and predictive model building.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10793993/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139106811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Catherine C Lerro, Marie C Bradley, Richard A Forshee, Donna R Rivera
{"title":"The Bar Is High: Evaluating Fit-for-Use Oncology Real-World Data for Regulatory Decision Making.","authors":"Catherine C Lerro, Marie C Bradley, Richard A Forshee, Donna R Rivera","doi":"10.1200/CCI.23.00261","DOIUrl":"10.1200/CCI.23.00261","url":null,"abstract":"","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10807892/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139503228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Erratum: Chatbot Artificial Intelligence for Genetic Cancer Risk Assessment and Counseling: A Systematic Review and Meta-Analysis.","authors":"","doi":"10.1200/CCI.23.00240","DOIUrl":"10.1200/CCI.23.00240","url":null,"abstract":"","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139432360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyunwook Kim, Won Seok Jang, Woo Seob Sim, Han Sang Kim, Jeong Eun Choi, Eun Sil Baek, Yu Rang Park, Sang Joon Shin
Purpose: In artificial intelligence-based modeling, working with a limited number of patient groups is challenging. This retrospective study aimed to evaluate whether applying synthetic data generation methods to the clinical data of small patient groups can enhance the performance of prediction models.
Materials and methods: A data set collected by the Cancer Registry Library Project from the Yonsei Cancer Center (YCC), Severance Hospital, between January 2008 and October 2020 was reviewed. Patients with colorectal cancer younger than 50 years who started their initial treatment at YCC were included. A Bayesian network-based synthesizing model was used to generate a synthetic data set, combined with the differential privacy (DP) method.
Results: A synthetic population of 5,005 was generated from a data set of 1,253 patients with 93 clinical features. The Hellinger distance and correlation difference metric were below 0.3 and 0.5, respectively, indicating no statistical difference. The overall survival by disease stage did not differ between the synthetic and original populations. Training with the synthetic data and validating with the original data showed the highest performances of 0.850, 0.836, and 0.790 for the Decision Tree, Random Forest, and XGBoost models, respectively. Compared with the original data sets, synthetic data sets generated with different epsilon parameters showed performance improvements of >0.1%. For extremely small data sets, models using synthetic data outperformed those using only original data sets. The reidentification risk measures demonstrated that the epsilons between 0.1 and 100 fell below the baseline, indicating a preserved privacy state.
Conclusion: The synthetic data generation approach enhances predictive modeling performance by maintaining statistical and clinical integrity, and simultaneously reduces privacy risks through the application of DP techniques.
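The Hellinger distance cited in the Results compares how closely the distribution of a feature in the synthetic data matches the original. A minimal sketch for one categorical feature follows, using the standard definition for discrete distributions. The category counts are toy values invented for illustration, not figures from the study, and the abstract does not say how the metric was aggregated across the 93 features.

```python
import math


def to_proportions(counts: list[float]) -> list[float]:
    """Normalize raw category counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def hellinger_distance(p: list[float], q: list[float]) -> float:
    """Hellinger distance between two discrete distributions.

    Ranges from 0 (identical) to 1 (no overlapping support).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Toy counts for a three-level feature (hypothetical, for illustration only).
original = to_proportions([30, 40, 30])
synthetic = to_proportions([28, 45, 27])

distance = hellinger_distance(original, synthetic)
print(distance < 0.3)  # the abstract treats values below 0.3 as acceptable
```

Normalizing counts first lets populations of different sizes, such as 1,253 original and 5,005 synthetic records, be compared on the same scale.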
{"title":"Synthetic Data Improve Survival Status Prediction Models in Early-Onset Colorectal Cancer.","authors":"Hyunwook Kim, Won Seok Jang, Woo Seob Sim, Han Sang Kim, Jeong Eun Choi, Eun Sil Baek, Yu Rang Park, Sang Joon Shin","doi":"10.1200/CCI.23.00201","DOIUrl":"10.1200/CCI.23.00201","url":null,"abstract":"<p><strong>Purpose: </strong>In artificial intelligence-based modeling, working with a limited number of patient groups is challenging. This retrospective study aimed to evaluate whether applying synthetic data generation methods to the clinical data of small patient groups can enhance the performance of prediction models.</p><p><strong>Materials and methods: </strong>A data set collected by the Cancer Registry Library Project from the Yonsei Cancer Center (YCC), Severance Hospital, between January 2008 and October 2020 was reviewed. Patients with colorectal cancer younger than 50 years who started their initial treatment at YCC were included. A Bayesian network-based synthesizing model was used to generate a synthetic data set, combined with the differential privacy (DP) method.</p><p><strong>Results: </strong>A synthetic population of 5,005 was generated from a data set of 1,253 patients with 93 clinical features. The Hellinger distance and correlation difference metric were below 0.3 and 0.5, respectively, indicating no statistical difference. The overall survival by disease stage did not differ between the synthetic and original populations. Training with the synthetic data and validating with the original data showed the highest performances of 0.850, 0.836, and 0.790 for the Decision Tree, Random Forest, and XGBoost models, respectively. Comparison of synthetic data sets with different epsilon parameters from the original data sets showed improved performance >0.1%. For extremely small data sets, models using synthetic data outperformed those using only original data sets. 
The reidentification risk measures demonstrated that the epsilons between 0.1 and 100 fell below the baseline, indicating a preserved privacy state.</p><p><strong>Conclusion: </strong>The synthetic data generation approach enhances predictive modeling performance by maintaining statistical and clinical integrity, and simultaneously reduces privacy risks through the application of DP techniques.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10830088/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139565231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ricardo Ahumada, Jocelyn Dunstan, Matías Rojas, Sergio Peñafiel, Inti Paredes, Pablo Báez
Purpose: A critical task in oncology is extracting information related to cancer metastasis from electronic health records. Metastasis-related information is crucial for planning treatment, evaluating patient prognoses, and cancer research. However, the unstructured way in which findings of distant metastasis are often written in radiology reports makes it difficult to extract information automatically. The main aim of this study was to extract distant metastasis findings from free-text imaging and nuclear medicine reports to classify the patient status according to the presence or absence of distant metastasis.
Materials and methods: We created a distant metastasis annotated corpus using positron emission tomography-computed tomography and computed tomography reports of patients with prostate, colorectal, and breast cancers. Entities were labeled M1 or M0 according to affirmative or negative metastasis descriptions. We used a named entity recognition model on the basis of a bidirectional long short-term memory model and conditional random fields to identify entities. Mentions were subsequently used to classify whole reports into M1 or M0.
Results: The model detected distant metastasis mentions with a weighted average F1 score performance of 0.84. Whole reports were classified with an F1 score of 0.92 for M0 documents and 0.90 for M1 documents.
Conclusion: These results show the usefulness of the model in detecting distant metastasis findings in three different types of cancer and the consequent classification of reports. The relevance of this study is to generate structured distant metastasis information from free-text imaging reports in Spanish. In addition, the manually annotated corpus, annotation guidelines, and code are freely released to the research community.
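The evaluation described above can be sketched in two steps: turning mention-level labels into a report-level class, then scoring with per-class and weighted-average F1. The aggregation rule used here (a report is M1 if any detected mention is M1, otherwise M0) is an assumption, since the abstract does not specify the rule, and the labels are toy data rather than the released corpus.

```python
from collections import Counter


def classify_report(mention_labels: list[str]) -> str:
    """Hypothetical rule: M1 if any mention is affirmative, otherwise M0."""
    return "M1" if "M1" in mention_labels else "M0"


def f1_for_label(y_true: list[str], y_pred: list[str], label: str) -> float:
    """F1 score (harmonic mean of precision and recall) for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Average of per-class F1 scores, weighted by each class's true support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_label(y_true, y_pred, label) * n / total
               for label, n in support.items())


# Toy example: mentions detected in four reports, plus gold report labels.
detected_mentions = [["M0"], ["M0", "M1"], [], ["M1"]]
gold = ["M0", "M1", "M0", "M0"]
predicted = [classify_report(m) for m in detected_mentions]

f1_m0 = f1_for_label(gold, predicted, "M0")
f1_m1 = f1_for_label(gold, predicted, "M1")
overall = weighted_f1(gold, predicted)
```

In this toy run the fourth report is a false positive for M1, which lowers M1 precision and M0 recall; the weighted average then leans toward the M0 score because M0 has three of the four gold labels.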
{"title":"Automatic Detection of Distant Metastasis Mentions in Radiology Reports in Spanish.","authors":"Ricardo Ahumada, Jocelyn Dunstan, Matías Rojas, Sergio Peñafiel, Inti Paredes, Pablo Báez","doi":"10.1200/CCI.23.00130","DOIUrl":"10.1200/CCI.23.00130","url":null,"abstract":"<p><strong>Purpose: </strong>A critical task in oncology is extracting information related to cancer metastasis from electronic health records. Metastasis-related information is crucial for planning treatment, evaluating patient prognoses, and cancer research. However, the unstructured way in which findings of distant metastasis are often written in radiology reports makes it difficult to extract information automatically. The main aim of this study was to extract distant metastasis findings from free-text imaging and nuclear medicine reports to classify the patient status according to the presence or absence of distant metastasis.</p><p><strong>Materials and methods: </strong>We created a distant metastasis annotated corpus using positron emission tomography-computed tomography and computed tomography reports of patients with prostate, colorectal, and breast cancers. Entities were labeled M1 or M0 according to affirmative or negative metastasis descriptions. We used a named entity recognition model on the basis of a bidirectional long short-term memory model and conditional random fields to identify entities. Mentions were subsequently used to classify whole reports into M1 or M0.</p><p><strong>Results: </strong>The model detected distant metastasis mentions with a weighted average <i>F</i><sub>1</sub> score performance of 0.84. Whole reports were classified with an <i>F</i><sub>1</sub> score of 0.92 for M0 documents and 0.90 for M1 documents.</p><p><strong>Conclusion: </strong>These results show the usefulness of the model in detecting distant metastasis findings in three different types of cancer and the consequent classification of reports. 
The relevance of this study is to generate structured distant metastasis information from free-text imaging reports in Spanish. In addition, the manually annotated corpus, annotation guidelines, and code are freely released to the research community.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10793975/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139405182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emily H Castellanos, Brett K Wittmershaus, Sheenu Chandwani
Purpose: Electronic health record (EHR)-based real-world data (RWD) are integral to oncology research, and understanding fitness for use is critical for data users. The complexity of data sources and curation methods necessitates transparency about how quality is approached. We describe the application of data quality dimensions in curating EHR-derived oncology RWD.
Methods: A targeted review was conducted to summarize data quality dimensions in frameworks published by the European Medicines Agency, the National Institute for Health and Care Excellence, the US Food and Drug Administration, the Duke-Margolis Center for Health Policy, and the Patient-Centered Outcomes Research Institute. We then characterized the quality processes applied to the curation of Flatiron Health RWD, which originate from EHRs of a nationwide network of academic and community cancer clinics, across the summarized quality dimensions.
Results: The primary quality dimensions across frameworks were relevance (including subdimensions of availability, sufficiency, and representativeness) and reliability (including subdimensions of accuracy, completeness, provenance, and timeliness). Flatiron Health RWD quality processes were aligned to each dimension. Relevance to broad or specific use cases is optimized through data set size and through the breadth and depth of variables. Accuracy is addressed using validation approaches, such as comparison with external or internal reference standards or indirect benchmarking, and verification checks for conformance, consistency, and plausibility, selected on the basis of feasibility and the criticality of the variable to the intended use case. Completeness is assessed against expected source documentation; provenance by recording data transformation, management procedures, and auditable metadata; and timeliness by setting refresh frequency to minimize data lags.
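To make the verification checks concrete, the Python sketch below shows what simple completeness, conformance, and plausibility checks might look like on patient-level records. It is an illustration under stated assumptions, not a description of Flatiron Health's processes: the field names (`stage`, `birth_date`, `diagnosis_date`), the allowed stage values, the rules, and the records are all hypothetical.

```python
from datetime import date
from typing import Any, Dict, List

Record = Dict[str, Any]

ALLOWED_STAGES = {"I", "II", "III", "IV"}  # assumed value set for conformance


def completeness(records: List[Record], field: str) -> float:
    """Share of records in which `field` is populated."""
    if not records:
        return 0.0
    return sum(r.get(field) is not None for r in records) / len(records)


def conformance_ok(record: Record) -> bool:
    """Conformance: stage, when present, must come from the allowed value set."""
    stage = record.get("stage")
    return stage is None or stage in ALLOWED_STAGES


def plausibility_ok(record: Record) -> bool:
    """Plausibility: diagnosis cannot precede birth or lie in the future."""
    birth, dx = record.get("birth_date"), record.get("diagnosis_date")
    if birth is None or dx is None:
        return True  # missing values are counted by completeness, not here
    return birth <= dx <= date.today()


# Made-up records for illustration only.
records = [
    {"stage": "II", "birth_date": date(1950, 3, 1), "diagnosis_date": date(2019, 6, 12)},
    {"stage": "5", "birth_date": date(1962, 7, 9), "diagnosis_date": date(2020, 1, 30)},
    {"stage": None, "birth_date": date(1980, 1, 1), "diagnosis_date": date(1975, 1, 1)},
]
conformance_failures = [i for i, r in enumerate(records) if not conformance_ok(r)]
plausibility_failures = [i for i, r in enumerate(records) if not plausibility_ok(r)]
print("stage completeness:", completeness(records, "stage"))
print("conformance failures:", conformance_failures)
print("plausibility failures:", plausibility_failures)
```

Keeping missingness out of the conformance and plausibility checks means each record failure maps to exactly one quality dimension, which makes the resulting counts easier to interpret.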
Conclusion: Development of high-quality, scaled, EHR-based RWD requires integration of systematic processes across the data lifecycle. Approaches to quality are optimized through knowledge of data sources, curation processes, and use case needs. By addressing quality dimensions from published frameworks, Flatiron Health RWD enable transparency in determining fitness for real-world evidence generation.
{"title":"Raising the Bar for Real-World Data in Oncology: Approaches to Quality Across Multiple Dimensions.","authors":"Emily H Castellanos, Brett K Wittmershaus, Sheenu Chandwani","doi":"10.1200/CCI.23.00046","DOIUrl":"10.1200/CCI.23.00046","url":null,"abstract":"<p><strong>Purpose: </strong>Electronic health record (EHR)-based real-world data (RWD) are integral to oncology research, and understanding fitness for use is critical for data users. Complexity of data sources and curation methods necessitate transparency into how quality is approached. We describe the application of data quality dimensions in curating EHR-derived oncology RWD.</p><p><strong>Methods: </strong>A targeted review was conducted to summarize data quality dimensions in frameworks published by the European Medicines Agency, The National Institute for Healthcare and Excellence, US Food and Drug Administration, Duke-Margolis Center for Health Policy, and Patient-Centered Outcomes Research Institute. We then characterized quality processes applied to curation of Flatiron Health RWD, which originate from EHRs of a nationwide network of academic and community cancer clinics, across the summarized quality dimensions.</p><p><strong>Results: </strong>The primary quality dimensions across frameworks were <i>relevance</i> (including subdimensions of availability, sufficiency, and representativeness) and <i>reliability</i> (including subdimensions of accuracy, completeness, provenance, and timeliness). Flatiron Health RWD quality processes were aligned to each dimension. Relevancy to broad or specific use cases is optimized through data set size and variable breadth and depth. Accuracy is addressed using validation approaches, such as comparison with external or internal reference standards or indirect benchmarking, and verification checks for conformance, consistency, and plausibility, selected on the basis of feasibility and criticality of the variable to the intended use case. 
Completeness is assessed against expected source documentation; provenance by recording data transformation, management procedures, and auditable metadata; and timeliness by setting refresh frequency to minimize data lags.</p><p><strong>Conclusion: </strong>Development of high-quality, scaled, EHR-based RWD requires integration of systematic processes across the data lifecycle. Approaches to quality are optimized through knowledge of data sources, curation processes, and use case needs. By addressing quality dimensions from published frameworks, Flatiron Health RWD enable transparency in determining fitness for real-world evidence generation.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10807898/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139503224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}