Benjamin Smith, Senne Van Steelandt, Anahita Khojandi
Background: Deep generative models (DGMs) present a promising avenue for generating realistic, synthetic data to augment existing health care datasets. However, exactly how the completeness of the original dataset affects the quality of the generated synthetic data is unclear.
Objectives: In this paper, we investigate the effect of data completeness on samples generated by the most common DGM paradigms.
Methods: We create both cross-sectional and panel datasets with varying missingness and subset rates and train generative adversarial networks, variational autoencoders, and autoregressive models (Transformers) on these datasets. We then compare the distributions of generated data with original training data to measure similarity.
Results: We find that increased incompleteness is directly correlated with increased dissimilarity between original and generated samples produced through DGMs.
Conclusions: Care must be taken when using DGMs to generate synthetic data as data completeness issues can affect the quality of generated data in both panel and cross-sectional datasets.
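The similarity measurement described in the Methods can be sketched with a simple two-sample statistic. The snippet below is illustrative only: the paper does not specify its metrics or model internals, and the drift applied to the "generated" samples is an assumed stand-in for the bias a DGM trained on incomplete data might exhibit.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    def ecdf(xs, t):
        return bisect.bisect_right(xs, t) / len(xs)  # fraction of xs <= t
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in a + b)

random.seed(0)
original = [random.gauss(0, 1) for _ in range(1000)]
# Assumed drift: samples from a model trained on incomplete data shift away
# from the original distribution as the missingness rate grows.
for missing_rate in (0.0, 0.3, 0.6):
    generated = [random.gauss(missing_rate * 0.5, 1) for _ in range(1000)]
    print(f"missing={missing_rate:.1f}  KS={ks_statistic(original, generated):.3f}")
```

A larger KS value means the generated distribution has moved further from the original, mirroring the paper's finding that incompleteness correlates with dissimilarity.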
Evaluating the Impact of Health Care Data Completeness for Deep Generative Models. Methods of Information in Medicine. 2023;62(1-02):31-39. doi:10.1055/a-2023-9181
Paul Quindroit, Mathilde Fruchart, Samuel Degoul, Renaud Périchon, Julien Soula, Romaric Marcilly, Antoine Lamer
Introduction: Health care information systems can generate and/or record huge volumes of data, some of which may be reused for research, clinical trials, or teaching. However, these databases can be affected by data quality problems; hence, an important step in the data reuse process consists in detecting and rectifying these issues. With a view to facilitating the assessment of data quality, we developed a taxonomy of data quality problems in operational databases.
Material: We searched the literature for publications that mentioned "data quality problems," "data quality taxonomy," "data quality assessment," or "dirty data." The publications were then reviewed, compared, summarized, and structured using a bottom-up approach, to provide an operational taxonomy of data quality problems. The latter were illustrated with fictional examples (though based on reality) from clinical databases.
Results: Twelve publications were selected, and 286 instances of data quality problems were identified and classified according to six distinct levels of granularity. We used the classification defined by Oliveira et al. to structure our taxonomy. The extracted items were grouped into 53 data quality problems.
Discussion: This taxonomy facilitated the systematic assessment of data quality in databases by presenting the data's quality according to their granularity. The definition of this taxonomy is the first step in the data cleaning process. The subsequent steps include the definition of associated quality assessment methods and data cleaning methods.
Conclusion: Our new taxonomy enabled the classification and illustration of 53 data quality problems found in hospital databases.
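To make the idea of granularity levels concrete, here is a minimal sketch of how a few problem types from such a taxonomy might be detected programmatically. The field names and the three checks are hypothetical examples, not the paper's 53 problems or six levels.

```python
# Illustrative checks for data-quality problems at different granularities
# (value-level, record-level/table-level). All names are hypothetical.
rows = [
    {"id": 1, "age": 42, "sex": "F"},
    {"id": 2, "age": -3, "sex": "M"},    # value-level: out-of-domain age
    {"id": 3, "age": 65, "sex": None},   # value-level: missing value
    {"id": 3, "age": 65, "sex": "M"},    # table-level: duplicate id
]

problems = []
seen_ids = set()
for r in rows:
    if r["age"] is not None and not (0 <= r["age"] <= 120):
        problems.append(("out_of_domain_value", r["id"], "age"))
    if r["sex"] is None:
        problems.append(("missing_value", r["id"], "sex"))
    if r["id"] in seen_ids:
        problems.append(("duplicate_record", r["id"], None))
    seen_ids.add(r["id"])

for p in problems:
    print(p)
```

Classifying each finding under a named problem type, as above, is what lets quality assessment and cleaning methods be attached to the taxonomy in later steps.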
Definition of a Practical Taxonomy for Referencing Data Quality Problems in Health Care Databases. Methods of Information in Medicine. 2023;62(1-02):19-30. doi:10.1055/a-1976-2371
In the 19th century, Florence Nightingale pointed to the importance of nursing documentation for patient care and the necessity of data-based statistics for quality improvement. In the same century, John Snow plotted his observations of cholera patients on a street map, laying the groundwork for modern epidemiology. These historical examples demonstrate that proper data are the foundation of relevant information about individuals and of new scientific evidence. In the ideal case of Ackoff's pyramid, information, knowledge, understanding, and wisdom arise from data.
Jürgen Stausberg, Sonja Harkener. High-Quality Data for Health Care and Health Research. Methods of Information in Medicine. 2023;62(1-02):1-4. doi:10.1055/a-2045-8287
Heekyong Park, Taowei David Wang, Nich Wattanasin, Victor M Castro, Vivian Gainer, Sergey Goryachev, Shawn Murphy
Objective: To provide high-quality data for coronavirus disease 2019 (COVID-19) research, we validated derived COVID-19 clinical indicators and 22 associated machine learning phenotypes, in the Mass General Brigham (MGB) COVID-19 Data Mart.
Methods: Fifteen reviewers performed a retrospective manual chart review for 150 COVID-19-positive patients in the data mart. To support rapid chart review across a wide range of target data, we offered a natural language processing (NLP)-based chart review tool, the Digital Analytic Patient Reviewer (DAPR). For this work, we designed a dedicated patient summary view and developed 127 new NLP logics to extract COVID-19-relevant medical concepts and target phenotypes. Moreover, we adapted DAPR for research use, so that patient information is accessed only for an approved research purpose, and enabled fast access to the integrated patient information. Lastly, we performed a survey to evaluate the difficulty of the validation task and the usefulness of DAPR.
Results: The concepts for the COVID-19-positive cohort, COVID-19 index date, COVID-19-related admission, and the admission date scored highly on all evaluation metrics. However, three phenotypes showed notably degraded positive predictive value relative to the prepandemic population, and we therefore removed them from the data mart. In the survey, participants expressed positive attitudes toward using DAPR for chart review; they assessed the validation as easy and found that DAPR helped them locate relevant information. Some validation difficulties were also discussed.
Conclusion: Use of NLP technology in the chart review helped to cope with the challenges of the COVID-19 data validation task and accelerated the process. As a result, we could provide more reliable research data promptly and respond to the COVID-19 crisis. DAPR's benefit can be expanded to other domains. We plan to operationalize it for wider research groups.
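One "NLP logic" of the kind DAPR applies can be imagined as a pattern that pulls a target concept out of free text. The actual 127 logics are not public; the patterns and note below are fabricated stand-ins for illustration.

```python
import re

# Toy stand-ins for two NLP logics: extract a COVID-19 test result and an
# admission date from a free-text note. Patterns are illustrative only.
LOGICS = {
    "covid_result": re.compile(r"covid-?19.*?\b(positive|negative)\b", re.I),
    "admit_date": re.compile(r"admitted on (\d{4}-\d{2}-\d{2})", re.I),
}

def extract(note: str) -> dict:
    out = {}
    for name, pattern in LOGICS.items():
        m = pattern.search(note)
        if m:
            out[name] = m.group(1).lower()
    return out

note = "Patient tested COVID-19 positive and was admitted on 2020-04-02."
print(extract(note))  # {'covid_result': 'positive', 'admit_date': '2020-04-02'}
```

Surfacing such extractions next to the source text is what lets a reviewer validate many target variables quickly instead of reading each chart end to end.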
The Digital Analytic Patient Reviewer (DAPR) for COVID-19 Data Mart Validation. Methods of Information in Medicine. 2022;61(5-06):167-173. doi:10.1055/a-1938-0436
Mansoureh Yari Eili, Safar Vafadar, Jalal Rezaeenour, Mahdi Sharif-Alhoseini
Background: Although process-mining algorithms have evolved over the past decade, little attention has been paid to automatically extracting event logs from the raw data of databases. Such logs are readily available in process-aware information systems, but in other settings (e.g., trauma registries) their extraction remains a challenge.
Objective: Registry data are recorded manually, follow an unstructured ad hoc pattern, and are prone to noise and errors; consequently, registry logs sit at maturity level one, and extracting process-centric information from them is not trivial. This study reports the experience gained while building an event log from a trauma registry.
Results: The results indicate that the three-phase self-service registry log builder addresses these issues by filtering and enriching the raw data, making them ready for any level of process-mining analysis. The tool is demonstrated through process discovery in the National Trauma Registry of Iran, and the challenges and limitations encountered are reported.
Conclusion: This tool is an interactive, visual event log builder for trauma registry data and is freely available for studies involving other registries. Future research directions derived from this case study are also suggested.
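The core filter-and-enrich step of a registry log builder can be sketched as follows. Field names and timestamp formats are assumptions, since the registry schema is not reproduced in the abstract.

```python
from datetime import datetime

# Sketch: turn flat registry rows into a case-grouped, time-ordered event log
# (case id, activity, timestamp), dropping rows that cannot be parsed.
raw = [
    {"patient": "P1", "activity": "triage",    "ts": "2021-01-01 08:00"},
    {"patient": "P1", "activity": "surgery",   "ts": "2021-01-01 11:30"},
    {"patient": "P2", "activity": "triage",    "ts": "not recorded"},  # noisy row
    {"patient": "P2", "activity": "discharge", "ts": "2021-01-02 09:15"},
]

def build_log(rows):
    log = {}
    for r in rows:
        try:
            ts = datetime.strptime(r["ts"], "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # filter: unparseable timestamps are dropped
        log.setdefault(r["patient"], []).append((ts, r["activity"]))
    for events in log.values():
        events.sort()  # enrich: order each case's trace by time
    return log

log = build_log(raw)
print({case: [a for _, a in events] for case, events in log.items()})
# {'P1': ['triage', 'surgery'], 'P2': ['discharge']}
```

A log in this (case, activity, timestamp) shape is the minimum input that process-discovery algorithms expect.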
Self-Service Registry Log Builder: A Case Study in National Trauma Registry of Iran. Methods of Information in Medicine. 2022;61(5-06):185-194. doi:10.1055/a-1911-9088
Background: Since COVID-19 (coronavirus disease 2019) was discovered in December 2019, it has spread worldwide. Early isolation and medical observation of cases and their close contacts are key to controlling the spread of the epidemic. However, traditional medical observation requires medical staff to measure body temperature and other vital signs face to face and record them manually. There is a general shortage of staff and personal protective equipment and a high risk of occupational exposure, which seriously threatens the safety of medical staff.
Methods: We designed an intelligent crowd-isolation medical observation management system based on the Internet of Things, using wireless telemetry and remote management on a big data cloud platform. Through a smart wearable device with built-in sensors, vital sign data and geographical locations of medical observation subjects are collected and automatically uploaded to the monitoring platform on demand. Subjects whose readings exceed the set threshold parameters are flagged, and activity tracking and health status monitoring are performed through early-warning management and retrospective data traceability. In the trial of this system, the subjects wore the wristwatches designed in this study and were monitored in real time throughout. For comparison, the traditional method was also applied: medical staff measured their temperature twice a day. The subjects were 1,128 overseas Chinese returning from Europe.
Results: Compared with the traditional vital sign detection method, the system designed in this study offers fast response, low error, stability, and good endurance. It can monitor the temperature, pulse, blood pressure, and heart rate of the monitored subject in real time. Both the system and the traditional method were used to monitor the 1,128 close contacts. Six cases of abnormal body temperature (0.53%) were missed by the twice-daily manual measurements and were sent to the hospital for further diagnosis; the system detected the abnormal temperature in all six. The sensitivity and specificity of the system were both 100%.
Conclusion: The system designed in this study can monitor the body temperature, blood oxygen, blood pressure, heart rate, and geographical location of the monitored subject in real time. It can be extended to COVID-19 medical observation isolation points, shel
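The reported 100% sensitivity and specificity follow directly from the study's counts, as this worked check shows (1,128 subjects, 6 true abnormal temperatures, no misses and no false alarms by the system):

```python
# Confusion-matrix check of the reported screening performance.
def sens_spec(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)  # true abnormal cases detected
    specificity = tn / (tn + fp)  # normal subjects not falsely flagged
    return sensitivity, specificity

tp, fn = 6, 0            # all 6 abnormal-temperature subjects were flagged
tn, fp = 1128 - 6, 0     # no normal subject was falsely flagged
print(sens_spec(tp, fn, tn, fp))          # (1.0, 1.0)
print(round(6 / 1128 * 100, 2))           # 0.53 (% abnormal, as reported)
```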
Wensheng Sun, Chunmei Wang, Jimin Sun, Ziping Miao, Feng Ling, Guangsong Wu. An Intelligent Medical Isolation Observation Management System Based on the Internet of Things. Methods of Information in Medicine. 2022;61(5-06):155-166. doi:10.1055/s-0042-1757185
Jay Sureshbhai Patel, Ryan Brandon, Marisol Tellez, Jasim M Albandar, Rishi Rao, Joachim Krois, Huanmei Wu
Objective: Our objective was to phenotype periodontal disease (PD) diagnoses from three different sections (diagnosis codes, clinical notes, and periodontal charting) of the electronic dental records (EDR) by developing two automated computer algorithms.
Methods: We conducted a retrospective study using EDR data from patients (n = 27,138) who received care at the Temple University Maurice H. Kornberg School of Dentistry from January 1, 2017 to August 31, 2021. We first determined the completeness of patient demographics, periodontal charting, and PD diagnosis information in the EDR. Next, we developed two computer algorithms to automatically derive patients' PD statuses from clinical notes and from periodontal charting data. Last, we phenotyped PD diagnoses using these algorithms and reported the improved completeness of diagnosis.
Results: The completeness of PD diagnosis from the EDR was as follows: periodontal diagnosis codes 36% (n = 9,834), diagnoses in clinical notes 18% (n = 4,867), and charting information 80% (n = 21,710). After phenotyping, the completeness of PD diagnoses improved to 100%. Eleven percent of patients had healthy periodontium, 43% were with gingivitis, 3% with stage I, 36% with stage II, and 7% with stage III/IV periodontitis.
Conclusions: We successfully developed, tested, and deployed two automated algorithms on big EDR datasets to improve the completeness of PD diagnoses. After phenotyping, EDR provided 100% completeness of PD diagnoses of 27,138 unique patients for research purposes. This approach is recommended for use in other large databases for the evaluation of their EDR data quality and for phenotyping PD diagnoses and other relevant variables.
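A charting-based phenotyping rule of the kind the paper describes might look like the sketch below. The inputs and thresholds are illustrative assumptions only; real periodontitis staging relies on clinical criteria (e.g., clinical attachment loss and radiographic bone loss) that the abstract does not detail.

```python
# Hypothetical rule-based phenotyping from periodontal charting values.
# Thresholds are made up for illustration and are NOT clinical guidance.
def phenotype(max_cal_mm, bleeding_on_probing):
    if max_cal_mm == 0:
        return "gingivitis" if bleeding_on_probing else "healthy"
    if max_cal_mm <= 2:
        return "stage I periodontitis"
    if max_cal_mm <= 4:
        return "stage II periodontitis"
    return "stage III/IV periodontitis"

charts = [(0, False), (0, True), (2, True), (4, True), (6, True)]
for cal, bop in charts:
    print(cal, bop, "->", phenotype(cal, bop))
```

Because charting data were far more complete (80%) than diagnosis codes (36%) or notes (18%), a deterministic rule over charting values is what lets every patient receive a phenotype.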
Developing Automated Computer Algorithms to Phenotype Periodontal Disease Diagnoses in Electronic Dental Records. Methods of Information in Medicine. 2022;61(S 02):e125-e133. doi:10.1055/s-0042-1757880
Joseph Sirrianni, Emre Sezgin, Daniel Claman, Simon L Linwood
Background: Generative pretrained transformer (GPT) models are among the latest large pretrained natural language processing models; they enable training with limited data and reduce dependency on large datasets, which are scarce and costly to establish and maintain. There is rising interest in exploring the use of GPT models in health care.
Objective: We investigate the performance of GPT-2 and GPT-Neo models for medical text prediction using 374,787 free-text dental notes.
Methods: We fine-tune pretrained GPT-2 and GPT-Neo models for next-word prediction on a dataset of over 374,000 manually written sections of dental clinical notes. Each model was trained on 80% of the dataset, validated on 10%, and tested on the remaining 10%. We report model performance in terms of next-word prediction accuracy and loss. For comparison, we also fine-tuned a non-GPT pretrained neural network model, XLNet (large), for next-word prediction. Finally, we annotated each token in 100 randomly sampled notes by category (e.g., names, abbreviations, clinical terms, punctuation) and compared the performance of each model by token category.
Results: The models achieve acceptable accuracy (GPT-2: 76%; GPT-Neo: 53%), and GPT-2 also performs better in manual evaluations, especially for names, abbreviations, and punctuation. Both GPT models outperformed XLNet in accuracy. We share lessons learned, insights, and suggestions for future implementations.
Conclusion: The results suggest that pretrained models have the potential to assist medical charting in the future. Our study presented one of the first implementations of the GPT model used with medical notes.
{"title":"Medical Text Prediction and Suggestion Using Generative Pretrained Transformer Models with Dental Medical Notes.","authors":"Joseph Sirrianni, Emre Sezgin, Daniel Claman, Simon L Linwood","doi":"10.1055/a-1900-7351","DOIUrl":"https://doi.org/10.1055/a-1900-7351","url":null,"abstract":"<p><strong>Background: </strong>Generative pretrained transformer (GPT) models are one of the latest large pretrained natural language processing models that enables model training with limited datasets and reduces dependency on large datasets, which are scarce and costly to establish and maintain. There is a rising interest to explore the use of GPT models in health care.</p><p><strong>Objective: </strong>We investigate the performance of GPT-2 and GPT-Neo models for medical text prediction using 374,787 free-text dental notes.</p><p><strong>Methods: </strong>We fine-tune pretrained GPT-2 and GPT-Neo models for next word prediction on a dataset of over 374,000 manually written sections of dental clinical notes. Each model was trained on 80% of the dataset, validated on 10%, and tested on the remaining 10%. We report model performance in terms of next word prediction accuracy and loss. Additionally, we analyze the performance of the models on different types of prediction tokens for categories. For comparison, we also fine-tuned a non-GPT pretrained neural network model, XLNet (large), for next word prediction. We annotate each token in 100 randomly sampled notes by category (e.g., names, abbreviations, clinical terms, punctuation, etc.) and compare the performance of each model by token category.</p><p><strong>Results: </strong>Models present acceptable accuracy scores (GPT-2: 76%; GPT-Neo: 53%), and the GPT-2 model also performs better in manual evaluations, especially for names, abbreviations, and punctuation. Both GPT models outperformed XLNet in terms of accuracy. The results suggest that pretrained models have the potential to assist medical charting in the future. 
We share the lessons learned, insights, and suggestions for future implementations.</p><p><strong>Conclusion: </strong>The results suggest that pretrained models have the potential to assist medical charting in the future. Our study presented one of the first implementations of the GPT model used with medical notes.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"61 5-06","pages":"195-200"},"PeriodicalIF":1.7,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9254100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robert Gött, Sebastian Stäubert, Alexander Strübing, Alfred Winter, Angela Merzweiler, Björn Bergh, Knut Kaulke, Thomas Bahls, Wolfgang Hoffmann, Martin Bialke
Objectives: The TMF (Technology, Methods, and Infrastructure for Networked Medical Research) Data Protection Guide (TMF-DP) makes path-breaking recommendations on the subject of data protection in research projects. It includes comprehensive requirements for applications such as patient lists, pseudonymization services, and consent management services. Nevertheless, it lacks a structured, categorized list of requirements for simplified application in research projects and systematic evaluation. The 3LGM2IHE ("Three-Layer Graph-Based Metamodel - Integrating the Healthcare Enterprise [IHE]") project is funded by the German Research Foundation (DFG). 3LGM2IHE aims to define modeling paradigms and implement modeling tools for planning health care information systems. In addition, one of its goals is to create and publish 3LGM2 information system architecture design patterns (short: "design patterns") for the community as design models in terms of a framework. A structured list of data protection-related requirements based on the TMF-DP is a precondition for integrating functions (3LGM2 Domain Layer) and building blocks (3LGM2 Logical Tool Layer) into 3LGM2 design patterns.
Methods: In order to structure the continuous text of the TMF-DP, requirement types were defined in a first step. In a second step, dependencies and delineations of the definitions were identified. In a third step, the requirements from the TMF-DP were systematically extracted. Based on the identified lists of requirements, a fourth step included the comparison of the identified requirements with exemplary open source tools as provided by the "Independent Trusted Third Party of the University Medicine Greifswald" (TTP tools).
Results: Four lists of requirements were created, containing requirements for the "patient list", the "pseudonymization service", and the "consent management", as well as cross-component requirements from TMF-DP chapter 6, in a structured form. In addition to requirements (1), possible implementation variants (2) for fulfilling individual requirements, and recommendations (3) were identified. A comparison of the requirements lists with the functional scopes of the open source tools E-PIX (record linkage), gPAS (pseudonym management), and gICS (consent management) has shown that these fulfill more than 80% of the requirements.
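The comparison step, checking each catalogued requirement against a tool's functional scope and reporting the fulfillment ratio, can be sketched as follows. The requirement identifiers are hypothetical; the actual TMF-DP requirement lists are not reproduced here.

```python
def coverage(requirements, fulfilled):
    """Fraction of catalogued requirements a tool's functional scope covers."""
    met = [r for r in requirements if r in fulfilled]
    return len(met) / len(requirements)

# Hypothetical requirement IDs for the "patient list" catalogue and a tool's
# (e.g., E-PIX) documented functional scope.
patient_list_reqs = ["PL-01", "PL-02", "PL-03", "PL-04", "PL-05"]
tool_scope = {"PL-01", "PL-02", "PL-03", "PL-04"}
ratio = coverage(patient_list_reqs, tool_scope)
```

Running this per requirement list and per tool yields the kind of fulfillment percentages the abstract reports (more than 80%).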
Conclusions: A structured set of data protection-related requirements facilitates a systematic evaluation of implementations with respect to the fulfillment of the TMF-DP guidelines. These reusable lists provide a decision aid for the selection of suitable tools for new research projects. As a result, these lists form the basis for the development of data protection-related 3LGM2 design patterns as part of the 3LGM2IHE project.
{"title":"3LGM2IHE: Requirements for Data-Protection-Compliant Research Infrastructures-A Systematic Comparison of Theory and Practice-Oriented Implementation.","authors":"Robert Gött, Sebastian Stäubert, Alexander Strübing, Alfred Winter, Angela Merzweiler, Björn Bergh, Knut Kaulke, Thomas Bahls, Wolfgang Hoffmann, Martin Bialke","doi":"10.1055/a-1950-2791","DOIUrl":"https://doi.org/10.1055/a-1950-2791","url":null,"abstract":"<p><strong>Objectives: </strong>The TMF (Technology, Methods, and Infrastructure for Networked Medical Research) Data Protection Guide (TMF-DP) makes path-breaking recommendations on the subject of data protection in research projects. It includes comprehensive requirements for applications such as patient lists, pseudonymization services, and consent management services. Nevertheless, it lacks a structured, categorized list of requirements for simplified application in research projects and systematic evaluation. The 3LGM2IHE (\"Three-layer Graphbased meta model - Integrating the Healthcare Enterprise [IHE] \" ) project is funded by the German Research Foundation (DFG). 3LGM2IHE aims to define modeling paradigms and implement modeling tools for planning health care information systems. In addition, one of the goals is to create and publish 3LGM<sup>2</sup> information system architecture design patterns (short \"design patterns\") for the community as design models in terms of a framework. A structured list of data protection-related requirements based on the TMF-DP is a precondition to integrate functions (3LGM<sup>2</sup> Domain Layer) and building blocks (3LGM<sup>2</sup> Logical Tool Layer) in 3LGM<sup>2</sup> design patterns.</p><p><strong>Methods: </strong>In order to structure the continuous text of the TMF-DP, requirement types were defined in a first step. In a second step, dependencies and delineations of the definitions were identified. In a third step, the requirements from the TMF-DP were systematically extracted. 
Based on the identified lists of requirements, a fourth step included the comparison of the identified requirements with exemplary open source tools as provided by the \"Independent Trusted Third Party of the University Medicine Greifswald\" (TTP tools).</p><p><strong>Results: </strong>As a result, four lists of requirements were created, which contain requirements for the \"patient list\", the \"pseudonymization service\", and the \"consent management\", as well as cross-component requirements from the TMF-DP chapter 6 in a structured form. Further to requirements (1), possible variants (2) of implementations (to fulfill a single requirement) and recommendations (3) were identified. A comparison of the requirements lists with the functional scopes of the open source tools E-PIX (record linkage), gPAS (pseudonym management), and gICS (consent management) has shown that these fulfill more than 80% of the requirements.</p><p><strong>Conclusions: </strong>A structured set of data protection-related requirements facilitates a systematic evaluation of implementations with respect to the fulfillment of the TMF-DP guidelines. These re-usable lists provide a decision aid for the selection of suitable tools for new research projects. 
As a result, these lists form the basis for the development of data protection-related 3LGM<sup>2</sup> design patterns as part of the 3LGM2IHE project.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"61 S 02","pages":"e134-e148"},"PeriodicalIF":1.7,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/d1/2e/10-1055-a-1950-2791.PMC9788907.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9259948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zahra Meidani, Alireza Moravveji, Shirin Gohari, Hamideh Ghaffarian, Sahar Zare, Fatemeh Vaseghi, Gholam Abbas Moosavi, Ali Mohammad Nickfarjam, Felix Holl
Background: Management of child health care can be negatively affected by incomplete recording, low data quality, and lack of data integration of health management information systems to support decision making and public health program needs. Given the importance of identifying key determinants of child health by capturing and integrating accurate, high-quality information, we aim to address this gap by developing and testing requirements for an integrated child health information system.
Subjects and methods: A five-phase design thinking approach including empathizing, defining, ideation, prototyping, and testing was applied. We employed observations and interviews with the health workers at the primary health care network to identify end-users' challenges and needs using tools in human-centered design and focus group discussion. Then, a potential solution to the identified problems was developed as an integrated maternal and child health information system (IMCHIS) prototype and tested using Software Quality Requirements and Evaluation Model (SQuaRE) ISO/IEC 25000.
Results: IMCHIS was developed as a web-based system with 74 data elements and seven maternal and child health care requirements. The requirements "child disease" (weight 0.26), "child nutrition" (0.20), and "prenatal care" (0.16) received the highest weight coefficients. In the testing phase, the highest scores, with weight coefficients of 0.48 and 0.73, were attributed to the efficiency and functionality characteristics, which focus on the software's capability to fulfill tasks that meet users' needs.
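A weighted-sum aggregation of per-characteristic scores, in the spirit of the SQuaRE (ISO/IEC 25000) evaluation used here, can be sketched as below. The characteristic names mirror the abstract; the raw scores and the exact aggregation formula are illustrative assumptions, not the study's published computation.

```python
def weighted_score(weights, scores):
    """Aggregate per-characteristic quality scores into one overall
    score using normalized weight coefficients (weighted mean)."""
    total_w = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_w

# Weight coefficients from the abstract; raw 0-5 scores are hypothetical.
weights = {"functionality": 0.73, "efficiency": 0.48, "usability": 0.30}
scores = {"functionality": 4.0, "efficiency": 3.5, "usability": 3.0}
overall = weighted_score(weights, scores)
```

Normalizing by the total weight keeps the overall score on the same scale as the inputs regardless of how many characteristics are evaluated.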
Conclusion: A successful child health care system integrates maternal and child health information systems to track the effect of maternal conditions on child health and to support performance management and service delivery optimization. The high quality scores of IMCHIS for the efficiency and functionality characteristics confirm its capability to identify key determinants of child health.
{"title":"Development and Testing Requirements for an Integrated Maternal and Child Health Information System in Iran: A Design Thinking Case Study.","authors":"Zahra Meidani, Alireza Moravveji, Shirin Gohari, Hamideh Ghaffarian, Sahar Zare, Fatemeh Vaseghi, Gholam Abbas Moosavi, Ali Mohammad Nickfarjam, Felix Holl","doi":"10.1055/a-1860-8618","DOIUrl":"https://doi.org/10.1055/a-1860-8618","url":null,"abstract":"<p><strong>Background: </strong>Management of child health care can be negatively affected by incomplete recording, low data quality, and lack of data integration of health management information systems to support decision making and public health program needs. Given the importance of identifying key determinants of child health via capturing and integrating accurate and high-quality information, we aim to address this gap through the development and testing requirements for an integrated child health information system.</p><p><strong>Subjects and methods: </strong>A five-phase design thinking approach including empathizing, defining, ideation, prototyping, and testing was applied. We employed observations and interviews with the health workers at the primary health care network to identify end-users' challenges and needs using tools in human-centered design and focus group discussion. Then, a potential solution to the identified problems was developed as an integrated maternal and child health information system (IMCHIS) prototype and tested using Software Quality Requirements and Evaluation Model (SQuaRE) ISO/IEC 25000.</p><p><strong>Results: </strong>IMCHIS was developed as a web-based system with 74 data elements and seven maternal and child health care requirements. The requirements of \"child disease\" with weight (0.26), \"child nutrition\" with weight (0.20), and \"prenatal care\" with weight (0.16) acquired the maximum weight coefficient. 
In the testing phase, the highest score with the weight coefficient of 0.48 and 0.73 was attributed to efficiency and functionality characteristics, focusing on software capability to fulfill the tasks that meet users' needs.</p><p><strong>Conclusion: </strong>Implementing a successful child health care system integrates both maternal and child health care information systems to track the effect of maternal conditions on child health and support managing performance and optimizing service delivery. The highest quality score of IMCHIS in efficiency and functionality characteristics confirms that it owns the capability to identify key determinants of child health.</p>","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":"61 S 02","pages":"e64-e72"},"PeriodicalIF":1.7,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c1/3d/10-1055-a-1860-8618.PMC9788911.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9247243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}