Pub Date: 2024-05-01; Epub Date: 2024-03-05; DOI: 10.1055/s-0044-1778693
Xavier Tannier, Perceval Wajsbürt, Alice Calliger, Basile Dura, Alexandre Mouchet, Martin Hilka, Romain Bey
Objective: The objective of this study is to address the critical issue of deidentification of clinical reports to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP for Assistance Publique-Hôpitaux de Paris) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse.
Methods: We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system that merges the results of a deep learning model with those of manual rules.
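The idea of merging model predictions with rule matches can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two regular expressions, the `merge_spans` policy (on overlap, keep the longer span, with rules winning ties), and the `redact` helper are hypothetical and are not taken from the published system.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)

# Hypothetical hand-written rules; the patterns are illustrative only.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by one of the rules."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; on overlap keep the longer span (assumed policy)."""
    candidates = [(span, 0) for span in rules] + [(span, 1) for span in model]
    # Longest spans first; rule spans (priority 0) win ties against model spans.
    candidates.sort(key=lambda c: (-(c[0][1] - c[0][0]), c[1], c[0][0]))
    kept: List[Span] = []
    for span, _ in candidates:
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each non-overlapping span with a <LABEL> placeholder."""
    out, pos = [], 0
    for start, end, label in sorted(spans):
        out.append(text[pos:start])
        out.append(f"<{label}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

In this sketch the model output is passed in as a plain list of spans, so any named entity recognizer could stand in for the deep learning component.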
Results and discussion: Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
Title: Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse. Journal: Methods of Information in Medicine, pp. 21-34. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495938/pdf/
Pub Date: 2024-05-01; Epub Date: 2024-08-13; DOI: 10.1055/a-2385-1355
Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala
Background: Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.
Objectives: The aim of this study is to investigate how trustworthy group differences discovered by independent sample tests from DP-synthetic data are. The evaluation is carried out in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries.
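How a Type I error rate can be estimated empirically is sketched below, under stated assumptions. The sketch repeatedly draws two groups from the same distribution, so the null hypothesis holds by construction, and counts how often a test rejects. The two-sample z-test used here is a simple stand-in chosen to keep the example dependency-free; it is not one of the tests evaluated in the study, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance
from typing import List


def z_test_pvalue(a: List[float], b: List[float]) -> float:
    """Two-sided p-value of a large-sample z-test for equal means."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def type_i_error_rate(n_per_group: int, n_repeats: int,
                      alpha: float, rng: random.Random) -> float:
    """Fraction of rejections when both groups share one distribution."""
    rejections = 0
    for _ in range(n_repeats):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_pvalue(a, b) < alpha:
            rejections += 1
    return rejections / n_repeats
```

A valid test should yield a rate close to `alpha`. Running the same loop on synthetic data generated from null data, instead of on the null data itself, would show whether the generation step inflates that rate.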
Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as from bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.
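A basic DP histogram release can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes the Laplace mechanism with sensitivity 1 (neighbouring datasets differ by adding or removing one record), clips negative noisy counts to zero as post-processing, and samples synthetic values uniformly within each bin. All function names are hypothetical.

```python
import random
from typing import List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values: List[float], n_bins: int, lo: float, hi: float,
                 epsilon: float, rng: random.Random) -> List[float]:
    """Noisy bin counts; assumes sensitivity 1 (add/remove one record)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
    noisy = [c + laplace_noise(1.0 / epsilon, rng) for c in counts]
    return [max(x, 0.0) for x in noisy]  # post-processing keeps the DP guarantee


def sample_synthetic(noisy_counts: List[float], lo: float, hi: float,
                     n: int, rng: random.Random) -> List[float]:
    """Draw n synthetic values: pick a bin by weight, then a uniform point in it."""
    n_bins = len(noisy_counts)
    width = (hi - lo) / n_bins
    weights = noisy_counts if sum(noisy_counts) > 0 else [1.0] * n_bins
    bins = rng.choices(range(n_bins), weights=weights, k=n)
    return [lo + (b + rng.random()) * width for b in bins]
```

Smaller values of `epsilon` give a larger noise scale, which is the regime in which the abstract reports the most inflated Type I errors.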
Conclusion: A large portion of the evaluation results showed dramatically inflated Type I errors, especially at levels of ϵ ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP Smoothed Histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget (ϵ ≥ 5) in order to have reasonable Type II error levels.
Title: Does Differentially Private Synthetic Data Lead to Synthetic Discoveries? Journal: Methods of Information in Medicine, pp. 35-51. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495942/pdf/
Pub Date: 2024-05-01; Epub Date: 2024-01-23; DOI: 10.1055/s-0044-1778694
Marja Fleitmann, Hristina Uzunova, René Pallenberg, Andreas M Stroth, Jan Gerlach, Alexander Fürschke, Jörg Barkhausen, Arpad Bischof, Heinz Handels
Objectives: In this paper, an artificial intelligence-based algorithm for predicting the optimal contrast medium dose for computed tomography (CT) angiography of the aorta is presented and evaluated in a clinical study. The prediction of the contrast dose reduction is modelled as a classification problem using the image contrast as the main feature.
Methods: This classification is performed by random decision forests (RDF) and k-nearest-neighbor (KNN) methods. To select optimal parameter subsets, all possible combinations of the 22 clinical parameters (age, blood pressure, etc.) are considered, using the classification accuracy and precision of the KNN classifier and the RDF as quality criteria. Subsequently, the results of the evaluation were optimized by means of feature transformation using regression neural networks (RNN). These were used for a direct classification based on regressed Hounsfield units as well as for preprocessing before a subsequent KNN classification.
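An exhaustive search over feature subsets with a KNN classifier can be sketched as below. This is an illustrative sketch under assumptions, not the authors' code: it scores each subset by leave-one-out accuracy only (the abstract also uses precision), uses Euclidean distance, and keeps the first subset that reaches the best score. With 22 parameters such a search visits 2^22 - 1 = 4,194,303 non-empty subsets, so the toy version here is meant for small feature counts.

```python
from collections import Counter
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple


def knn_predict(train_X: List[Tuple[float, ...]], train_y: List[int],
                x: Tuple[float, ...], k: int) -> int:
    """Majority label among the k nearest training points (Euclidean)."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def loo_accuracy(X: List[Sequence[float]], y: List[int],
                 features: Tuple[int, ...], k: int) -> float:
    """Leave-one-out accuracy of KNN restricted to the given feature indices."""
    proj = [tuple(row[f] for f in features) for row in X]
    hits = 0
    for i in range(len(proj)):
        train_X = proj[:i] + proj[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, proj[i], k) == y[i]
    return hits / len(proj)


def best_subset(X: List[Sequence[float]], y: List[int],
                k: int = 3) -> Tuple[Tuple[int, ...], float]:
    """Try every non-empty feature subset; return (subset, accuracy)."""
    n_features = len(X[0])
    best: Tuple[Tuple[int, ...], float] = ((), -1.0)
    for size in range(1, n_features + 1):
        for feats in combinations(range(n_features), size):
            acc = loo_accuracy(X, y, feats, k)
            if acc > best[1]:  # strict: the smallest, earliest subset wins ties
                best = (feats, acc)
    return best
```

Because ties are broken in favour of the subset found first, smaller subsets are preferred, which matches the practical concern of limiting how many parameters must be measured per patient.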
Results: For feature selection, an RDF model achieved the highest accuracy of 84.42% and a KNN model achieved the best precision of 86.21%. The most important parameters include age, height, and hemoglobin. The feature transformation using an RNN considerably exceeded these values with an accuracy of 90.00% and a precision of 97.62% using all 22 parameters as input. However, the feasibility of the parameter sets in routine clinical practice must also be considered, because some of the 22 parameters are not measured routinely and require an additional measurement time of 15 to 20 minutes per patient. Using the standard feature set available in clinical routine, the best accuracy of 86.67% and precision of 93.18% were achieved by the RNN.
Conclusion: We developed a reliable hybrid system that helps radiologists determine the optimal contrast dose for CT angiography based on patient-specific parameters.
Title: Artificial Intelligence-Based Prediction of Contrast Medium Doses for Computed Tomography Angiography Using Optimized Clinical Parameter Sets. Journal: Methods of Information in Medicine, pp. 11-20. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495943/pdf/
Pub Date: 2024-05-01; Epub Date: 2024-05-13; DOI: 10.1055/s-0044-1786839
Sarah Riepenhausen, Max Blumenstock, Christian Niklas, Stefan Hegselmann, Philipp Neuhaus, Alexandra Meidt, Cornelia Püttmann, Michael Storck, Matthias Ganzinger, Julian Varghese, Martin Dugas
Background: Structural metadata from the majority of clinical studies and routine health care systems is not yet available to the scientific community.
Objective: To provide an overview of available contents in the Portal of Medical Data Models (MDM Portal).
Methods: The MDM Portal is a registered European information infrastructure for research and health care, and its contents are curated and semantically annotated by medical experts. It enables users to search, view, discuss, and download existing medical data models.
Results: The most frequent keyword is "clinical trial" (n = 18,777), and the most frequent disease-specific keyword is "breast neoplasms" (n = 1,943). Most data items are available in English (n = 545,749) and German (n = 109,267). Manually curated semantic annotations are available for 805,308 elements (554,352 items, 58,101 item groups, and 192,855 code list items), which were derived from 25,257 data models. In total, 1,609,225 Unified Medical Language System (UMLS) codes have been assigned, with 66,373 unique UMLS codes.
Conclusion: To our knowledge, the MDM Portal constitutes Europe's largest collection of medical data models with semantically annotated elements. As such, it can be used to increase the compatibility of medical datasets and can serve as a large expert-annotated medical text corpus for natural language processing.
Title: Europe's Largest Research Infrastructure for Curated Medical Data Models with Semantic Annotations. Journal: Methods of Information in Medicine, pp. 52-61. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495939/pdf/
Pub Date: 2023-12-01; Epub Date: 2023-09-04; DOI: 10.1055/a-2165-5552
Jasmine Kashkoush, Mudit Gupta, Matthew A Meissner, Matthew E Nielsen, H Lester Kirchner, Tullika Garg
Background: Two million patients per year are referred to urologists for hematuria, or blood in the urine. The American Urological Association recently adopted a risk-stratified hematuria evaluation guideline to limit multi-phase computed tomography to individuals at highest risk of occult malignancy.
Objectives: To understand population-level hematuria evaluations, we developed an algorithm to accurately identify hematuria cases from electronic health records (EHRs).
Methods: We used International Classification of Diseases (ICD)-9/ICD-10 diagnosis codes, urine color, and urine microscopy values to identify hematuria cases and to differentiate between gross and microscopic hematuria. Using an iterative process, we refined the ICD-9 algorithm on a gold-standard, chart-reviewed cohort of 3,094 hematuria cases, and the ICD-10 algorithm on a 300-patient cohort. We applied the algorithm to Geisinger patients ≥35 years (n = 539,516) and determined performance by conducting chart review (n = 500).
Results: After applying the hematuria algorithm, we identified 51,500 hematuria cases and 488,016 clean controls. Of the hematuria cases, 11,435 were categorized as gross, 26,658 as microscopic, 12,562 as indeterminate, and 845 were uncategorized. The positive predictive value (PPV) of identifying hematuria cases using the algorithm was 100% and the negative predictive value (NPV) was 99%. The gross hematuria algorithm had a PPV of 100% and an NPV of 99%. The microscopic hematuria algorithm had a lower PPV of 78% and an NPV of 100%.
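PPV and NPV are computed from chart-review counts as follows. The helper below is a minimal sketch with a hypothetical function name; any counts passed to it in an example are illustrative inputs, since the abstract reports only the resulting percentages and not the underlying confusion matrices.

```python
from typing import Tuple


def ppv_npv(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV = TP / (TP + FP): share of algorithm-positive patients confirmed by review.
    NPV = TN / (TN + FN): share of algorithm-negative patients confirmed by review.
    """
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("need at least one predicted positive and one predicted negative")
    return tp / (tp + fp), tn / (tn + fn)
```

Both values depend on how common the condition is in the reviewed sample, so they describe performance in that population rather than a fixed property of the algorithm.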
Conclusion: We developed an algorithm utilizing diagnosis codes and urine laboratory values to accurately identify hematuria and categorize it as gross or microscopic in EHRs. Applying the algorithm will help researchers to understand patterns of care for this common condition.
Title: Performance Characteristics of a Rule-Based Electronic Health Record Algorithm to Identify Patients with Gross and Microscopic Hematuria. Journal: Methods of Information in Medicine, pp. 183-192.
Pub Date: 2023-12-01; Epub Date: 2024-02-20; DOI: 10.1055/s-0043-1777733
Jonas Bienzeisler, Ariadna Perez-Garriga, Lea C Brandl, Ann-Kristin Kock-Schoppenhauer, Yasmin Hollenbenders, Maximilian Kurscheidt, Christina Schüttler
Title: Report from the 68th GMDS Annual Meeting: Science. Close to People. Journal: Methods of Information in Medicine, vol. 62 (5-06), pp. 202-205.
Pub Date: 2023-12-01; Epub Date: 2023-12-29; DOI: 10.1055/s-0043-1777732
Kerstin Denecke, Elia Gabarron, Carolyn Petersen
Title: Current Trends and New Approaches in Participatory Health Informatics. Journal: Methods of Information in Medicine, pp. 151-153.
Pub Date: 2023-12-01; Epub Date: 2023-12-20; DOI: 10.1055/a-2233-2736
Elizabeth I Harrison, Laura A Kirkpatrick, Patrick W Harrison, Traci M Kazmerski, Yoshimi Sogawa, Harry S Hochheiser
Objectives: This study aimed to enable clinical researchers without expertise in natural language processing (NLP) to extract and analyze information about sexual and reproductive health (SRH), or other sensitive health topics, from large sets of clinical notes.
Methods: (1) We retrieved text from the electronic health record as individual notes. (2) We segmented notes into sentences using one of scispaCy's NLP toolkits. (3) We exported sentences to the labeling application Watchful and annotated subsets of these as relevant or irrelevant to various SRH categories by applying a combination of regular expressions and manual annotation. (4) The labeled sentences served as training data to create machine learning models for classifying text; specifically, we used spaCy's default text classification ensemble, comprising a bag-of-words model and a neural network with attention. (5) We applied each model to unlabeled sentences to identify additional references to SRH with novel relevant vocabulary. We used this information and repeated steps 3 to 5 iteratively until the models identified no new relevant sentences for each topic. Finally, we aggregated the labeled data for analysis.
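The regular-expression part of step 3 can be illustrated with a small sketch. The category names follow the abstract, but the patterns and the `prelabel` function are hypothetical examples and are not the expressions used in the study; the real workflow also relied on the Watchful application and manual annotation, neither of which is modelled here.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical seed patterns for two of the SRH categories (illustrative only).
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "contraception": re.compile(r"\b(contracepti\w*|birth control|iud)\b", re.IGNORECASE),
    "folic acid": re.compile(r"\bfol(ic acid|ate)\b", re.IGNORECASE),
}


def prelabel(sentences: List[str]) -> List[Tuple[str, List[str]]]:
    """Attach to each sentence the categories whose pattern matches it."""
    return [
        (s, sorted(cat for cat, pattern in PATTERNS.items() if pattern.search(s)))
        for s in sentences
    ]
```

Sentences left without a label by such patterns are the ones a trained classifier would later be applied to, in order to surface vocabulary the patterns missed.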
Results: This methodology was applied to 3,663 Child Neurology notes for 971 female patients. Our search focused on six SRH categories. We validated the approach using two subject matter experts, who independently labeled a sample of 400 sentences. Cohen's kappa values were calculated for each category between the reviewers (menstruation: 1, sexual activity: 0.9499, contraception: 0.9887, folic acid: 1, teratogens: 0.8864, pregnancy: 0.9499). After removing the sentences on which reviewers did not agree, we compared the reviewers' labels to those produced via our methodology, again using Cohen's kappa (menstruation: 1, sexual activity: 1, contraception: 0.9885, folic acid: 1, teratogens: 0.9841, pregnancy: 0.9871).
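Cohen's kappa, the agreement statistic reported above, corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The function below is a minimal sketch with a hypothetical name; returning 1.0 when both raters use a single identical label throughout is a convention assumed here to avoid division by zero.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # assumed convention: identical constant labels
    return (p_o - p_e) / (1.0 - p_e)
```

For binary relevant/irrelevant labels, as in the validation described above, a kappa of 1 means the two label sequences are identical.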
Conclusion: Our methodology is reproducible, enables analysis of large amounts of text, and has produced results that are highly comparable to manual review by subject matter experts.
Title: Use of Natural Language Processing to Identify Sexual and Reproductive Health Information in Clinical Text. Journal: Methods of Information in Medicine, pp. 193-201.
Pub Date: 2023-12-01; Epub Date: 2023-07-24; DOI: 10.1055/s-0043-1771378
Martti Juhola, Tommi Nikkanen, Juho Niemi, Maiju Welling, Olli Kampman
Background: Adverse events are common in health care. In psychiatric treatment, compensation claims for patient injuries appear to be less common than in other medical specialties. The most common types of patient injury claims in psychiatry include diagnostic flaws, unprevented suicide, and coercive treatment deemed unnecessary or harmful.
Objectives: The objective was to study whether it is possible to form different categories of patient injury types associated with the psychiatric evaluations of compensation claims and to base machine learning classification on these categories. Further, the binary classification of positive and negative decisions for compensation claims was the other objective.
Methods: Finnish psychiatric specialist evaluations for the compensation claims of patient injuries were classified into six different categories called classes applying the machine learning methods of artificial intelligence. In addition, another classification of the same data into two classes was performed to test whether it was possible to classify data cases according to their known decisions, either accepted or declined compensation claim.
Results: The six-class task produced relatively good results in separating the classes, whereas the binary task proved more difficult. However, the classification accuracy of both tasks could be improved by generating artificial data cases in the preprocessing phase before classification. This preprocessing improved the accuracy of the six-class classification up to 88% when random forests were used, and that of the binary classification to 89%.
Conclusion: The results show that both objectives could be met reasonably well.
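The abstract does not say how the artificial data cases were generated. As an illustration only, the sketch below uses SMOTE-style interpolation, one common way to create synthetic cases for an under-represented class: each new case lies on the line segment between an existing case and one of its nearest same-class neighbours. The function name `interpolate_cases`, its parameters, and the feature vectors are hypothetical and are not taken from the study.

```python
import random


def interpolate_cases(samples, n_new, k=3, seed=0):
    """Create n_new synthetic feature vectors from one class's samples by
    interpolating between a sample and one of its k nearest neighbours.
    Assumed technique (SMOTE-style); the study's own method is not specified."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples by squared Euclidean distance to the base.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((x - y) ** 2 for x, y in zip(base, s)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([x + t * (y - x) for x, y in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-feature vectors for an under-represented class.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_cases = interpolate_cases(minority, n_new=6)
```

Synthetic cases of this kind would be added to the training data only, so that the held-out evaluation cases remain real evaluations.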
{"title":"Machine Learning Classification of Psychiatric Data Associated with Compensation Claims for Patient Injuries.","authors":"Martti Juhola, Tommi Nikkanen, Juho Niemi, Maiju Welling, Olli Kampman","doi":"10.1055/s-0043-1771378","DOIUrl":"10.1055/s-0043-1771378","url":null,"PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":" ","pages":"174-182"},"PeriodicalIF":1.7,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10878742/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9868179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Patient-generated health data (PGHD) are data collected through technologies such as mobile devices and health apps. The integration of PGHD into health care workflows can support the care of chronic conditions such as multiple sclerosis (MS). Patients are often willing to share data with health care professionals (HCPs) in their care team; however, the benefits of PGHD can be limited if HCPs do not find them useful, eventually leading patients to discontinue data tracking and sharing. Therefore, understanding the usefulness of mobile health (mHealth) solutions, which provide PGHD and serve as enablers of the HCPs' involvement in participatory care, could motivate them to continue using these technologies.
Objective: The objective of this study is to explore the perceived utility of different types of PGHD from mHealth solutions which could serve as tools for HCPs to support participatory care in MS.
Method: A mixed-methods approach was used, combining qualitative research and participatory design. This study includes three sequential phases: data collection, assessment of PGHD utility, and design of data visualizations. In the first phase, 16 HCPs were interviewed. The second and third phases were carried out through participatory workshops, where PGHD types were conceptualized in terms of utility.
Results: The study found that HCPs are optimistic about PGHD in MS care. The most useful types of PGHD for HCPs in MS care are patients' habits, lifestyles, and fatigue-inducing activities. Although these subjective data seem more useful for HCPs, it is more challenging to visualize them in a useful and actionable way.
Conclusion: HCPs are optimistic about mHealth and PGHD as tools to further understand their patients' needs and support care in MS. HCPs from different disciplines have different perceptions of what types of PGHD are useful; however, subjective types of PGHD seem potentially more useful for MS care.
{"title":"An Exploratory Study on the Utility of Patient-Generated Health Data as a Tool for Health Care Professionals in Multiple Sclerosis Care.","authors":"Sharon Guardado, Vasiliki Mylonopoulou, Octavio Rivera-Romero, Nadine Patt, Jens Bansi, Guido Giunti","doi":"10.1055/s-0043-1775718","DOIUrl":"10.1055/s-0043-1775718","url":null,"PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":" ","pages":"165-173"},"PeriodicalIF":1.7,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10878743/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41137368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}