Predicting mortality in critically ill patients with hypertension using machine learning and deep learning models
Pub Date: 2024-08-22 | DOI: 10.1101/2024.08.21.24312399
Ziyang Zhang, Jiancheng Ye
Background: Accurate prediction of mortality in critically ill patients with hypertension admitted to the Intensive Care Unit (ICU) is essential for guiding clinical decision-making and improving patient outcomes. Traditional prognostic tools often fall short in capturing the complex interactions between clinical variables in this high-risk population. Recent advances in machine learning (ML) and deep learning (DL) offer the potential for developing more sophisticated and accurate predictive models. Objective: This study aims to evaluate the performance of various ML and DL models in predicting mortality among critically ill patients with hypertension, with a particular focus on identifying key clinical predictors and assessing the comparative effectiveness of these models. Methods: We conducted a retrospective analysis of 30,096 critically ill patients with hypertension admitted to the ICU. Various ML models, including logistic regression, decision trees, and support vector machines, were compared with advanced DL models, including 1D convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. Model performance was evaluated using area under the receiver operating characteristic curve (AUC) and other performance metrics. SHapley Additive exPlanations (SHAP) values were used to interpret model outputs and identify key predictors of mortality. Results: The 1D CNN model with an initial selection of predictors achieved the highest AUC (0.7744), outperforming both traditional ML models and other DL models. Key clinical predictors of mortality identified across models included the APS-III score, age, and length of ICU stay. The SHAP analysis revealed that these predictors had a substantial influence on model predictions, underscoring their importance in assessing mortality risk in this patient population. Conclusion: Deep learning models, particularly the 1D CNN, demonstrated superior predictive accuracy compared to traditional ML models in predicting mortality among critically ill patients with hypertension. The integration of these models into clinical workflows could enhance the early identification of high-risk patients, enabling more targeted interventions and improving patient outcomes. Future research should focus on the prospective validation of these models and the ethical considerations associated with their implementation in clinical practice.
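As a rough illustration of the modelling approach described above, the sketch below trains a small Conv1D network on a vector of tabular ICU features and attributes its predictions with SHAP. The feature names, network shape, and synthetic data are assumptions made for the example; they are not the authors' architecture, hyperparameters, or cohort.

```python
# Minimal sketch (not the authors' code): a 1D CNN over tabular ICU features
# with SHAP attributions, using synthetic data in place of the real cohort.
import numpy as np
import shap
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
feature_names = ["aps_iii", "age", "icu_los_days", "heart_rate", "creatinine"]  # hypothetical features
X = rng.normal(size=(1000, len(feature_names)))
y = (0.8 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(size=1000) > 1).astype(int)

# Treat each feature vector as a length-5 "sequence" with one channel so Conv1D applies.
model = keras.Sequential([
    layers.Input(shape=(len(feature_names), 1)),
    layers.Conv1D(16, kernel_size=3, padding="same", activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[keras.metrics.AUC(name="auc")])
model.fit(X[..., None], y, epochs=5, batch_size=64, validation_split=0.2, verbose=0)

# Model-agnostic SHAP attributions on a small background sample (slow but generic).
def predict_fn(data):
    return model.predict(data[..., None], verbose=0).ravel()

explainer = shap.KernelExplainer(predict_fn, shap.sample(X, 50))
shap_values = explainer.shap_values(X[:100], nsamples=100)
for name, val in zip(feature_names, np.abs(shap_values).mean(axis=0)):
    print(f"{name}: mean |SHAP| = {val:.3f}")
```

Treating the feature vector as a one-channel sequence is one common way to make a 1D convolution applicable to tabular inputs; the abstract does not specify the paper's exact input encoding.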
{"title":"Predicting mortality in critically ill patients with hypertension using machine learning and deep learning models","authors":"Ziyang Zhang, Jiancheng Ye","doi":"10.1101/2024.08.21.24312399","DOIUrl":"https://doi.org/10.1101/2024.08.21.24312399","url":null,"abstract":"Background:\u0000Accurate prediction of mortality in critically ill patients with hypertension admitted to the Intensive Care Unit (ICU) is essential for guiding clinical decision-making and improving patient outcomes. Traditional prognostic tools often fall short in capturing the complex interactions between clinical variables in this high-risk population. Recent advances in machine learning (ML) and deep learning (DL) offer the potential for developing more sophisticated and accurate predictive models. Objective:\u0000This study aims to evaluate the performance of various ML and DL models in predicting mortality among critically ill patients with hypertension, with a particular focus on identifying key clinical predictors and assessing the comparative effectiveness of these models. Methods:\u0000We conducted a retrospective analysis of 30,096 critically ill patients with hypertension admitted to the ICU. Various ML models, including logistic regression, decision trees, and support vector machines, were compared with advanced DL models, including 1D convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. Model performance was evaluated using area under the receiver operating characteristic curve (AUC) and other performance metrics. SHapley Additive exPlanations (SHAP) values were used to interpret model outputs and identify key predictors of mortality. Results:\u0000The 1D CNN model with an initial selection of predictors achieved the highest AUC (0.7744), outperforming both traditional ML models and other DL models. Key clinical predictors of mortality identified across models included the APS-III score, age, and length of ICU stay. The SHAP analysis revealed that these predictors had a substantial influence on model predictions, underscoring their importance in assessing mortality risk in this patient population. Conclusion:\u0000Deep learning models, particularly the 1D CNN, demonstrated superior predictive accuracy compared to traditional ML models in predicting mortality among critically ill patients with hypertension. The integration of these models into clinical workflows could enhance the early identification of high-risk patients, enabling more targeted interventions and improving patient outcomes. Future research should focus on the prospective validation of these models and the ethical considerations associated with their implementation in clinical practice.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Future Pandemics: AI-Designed Assays for Detecting Mpox, General and Clade 1b Specific
Pub Date: 2024-08-22 | DOI: 10.1101/2024.08.22.24312441
Lucero Mendoza-Maldonado, John MacSharry, Johan Garssen, Aletta D. Kraneveld, Alberto Tonda, Alejandro Lopez-Rincon
The global outbreak of human monkeypox (mpox) in 2022, declared a Public Health Emergency of International Concern by the WHO, has underscored the urgent need for effective diagnostic tools. In August 2024, the WHO again declared mpox a Public Health Emergency of International Concern. This study presents an innovative approach using artificial intelligence (AI) to design primers for the rapid and accurate detection of mpox. Leveraging evolutionary algorithms, we developed primer sets with high specificity and sensitivity, validated in silico for the main mpox lineage and for Clade 1b. These primers are crucial for distinguishing mpox from other viruses, enabling precise diagnosis and timely public health responses. Our findings highlight the potential of AI-driven methodologies to enhance surveillance, vaccination strategies, and outbreak management, particularly for emerging zoonotic diseases. The emergence of new mpox clades, such as Clade 1b, with higher mortality rates, further emphasizes the necessity for continuous monitoring and preparedness for future pandemics. This study advocates for the integration of AI in molecular diagnostics to improve public health outcomes.
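The evolutionary-algorithm idea can be illustrated with a toy search over primer candidates. The random "genome", the fitness terms, and the population sizes below are simplified stand-ins; the study's actual fitness function and the real mpox and Clade 1b sequences are not reproduced here.

```python
# Toy evolutionary search for a 20-mer primer that occurs in a target sequence,
# avoids an off-target sequence, and has roughly balanced GC content.
import random

random.seed(42)
BASES = "ACGT"
target = "".join(random.choice(BASES) for _ in range(2000))      # placeholder for an mpox genome region
off_target = "".join(random.choice(BASES) for _ in range(2000))  # placeholder for a related orthopoxvirus

def fitness(primer: str) -> float:
    gc = (primer.count("G") + primer.count("C")) / len(primer)
    score = 1.0 if primer in target else -1.0      # must anneal to the target
    score -= 1.0 if primer in off_target else 0.0  # penalise cross-reactivity
    score -= abs(gc - 0.5)                         # prefer ~50% GC content
    return score

def mutate(primer: str) -> str:
    i = random.randrange(len(primer))
    return primer[:i] + random.choice(BASES) + primer[i + 1:]

# Seed the population with random 20-mers drawn from the target, then evolve.
population = [target[i:i + 20] for i in random.sample(range(len(target) - 20), 50)]
for _ in range(200):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(random.choice(parents)) for _ in range(40)]

best = max(population, key=fitness)
print("best candidate:", best, "fitness:", round(fitness(best), 3))
```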
{"title":"Future Pandemics: AI-Designed Assays for Detecting Mpox, General and Clade 1b Specific","authors":"Lucero Mendoza-Maldonado, John MacSharry, Johan Garssen, Aletta D. Kraneveld, Alberto Tonda, Alejandro Lopez-Rincon","doi":"10.1101/2024.08.22.24312441","DOIUrl":"https://doi.org/10.1101/2024.08.22.24312441","url":null,"abstract":"The global outbreak of human monkeypox (mpox) in 2022, declared a Public Health Emergency of International Concern by the WHO, has underscored the urgent need for effective diagnostic tools. In August 2024 WHO again declared mpox as a Public Health Emergency of International Concern. This study presents an innovative approach using artificial intelligence (AI) to design primers for the rapid and accurate detection of mpox. Leveraging evolutionary algorithms, we developed primer sets with high specificity and sensitivity, validated in silico for mpox main lineage and the Clade 1b. These primers are crucial for distinguishing mpox from other viruses, enabling precise diagnosis and timely public health responses. Our findings highlight the potential of AI-driven methodologies to enhance surveillance, vaccination strategies, and outbreak management, particularly for emerging zoonotic diseases. The emergence of new mpox clades, such as Clade 1b, with higher mortality rates, further emphasizes the necessity for continuous monitoring and preparedness for future pandemics. This study advocates for the integration of AI in molecular diagnostics to improve public health outcomes.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multidimensional assessment of adverse events of finasteride: a real-world pharmacovigilance analysis based on FDA Adverse Event Reporting System (FAERS) from 2004 to April 2024
Pub Date: 2024-08-22 | DOI: 10.1101/2024.08.21.24312383
Xiaoling Zhong, Yihan Yang, Sheng Wei, Yuchen Liu
Background: Finasteride is commonly used in clinical practice for treating androgenetic alopecia, but real-world data regarding the long-term safety of its adverse events remain incomplete, necessitating ongoing supplementation. This study aims to evaluate the adverse events (AEs) associated with finasteride use, based on data from the US Food and Drug Administration Adverse Event Reporting System (FAERS), to contribute to its safety assessment. Methods: We reviewed adverse event reports associated with finasteride from the FAERS database, covering the period from the first quarter of 2004 to the first quarter of 2024. We assessed the association between finasteride use and AEs using four disproportionality analyses: the reporting odds ratio (ROR), the proportionate reporting ratio (PRR), the Bayesian confidence propagation neural network (BCPNN), and multi-item gamma Poisson shrinkage (MGPS). These methods were used to evaluate whether there is a significant association between finasteride use and AEs. To investigate potential safety issues related to drug use, we further analyzed the similarities and differences in onset time and AEs by gender, as well as the similarities and differences in AEs by age. Results: Among the 11,557 adverse event reports in which finasteride was the primary suspected drug, most affected patients were male (86.04%), with a substantial proportion being young adults aged 18-45 years (27.22%). We categorized 73 AEs into 7 system organ classes (SOCs), which included common AEs such as erectile dysfunction and sexual dysfunction. Notably, Peyronie's disease and post-5α-reductase inhibitor syndrome were AEs not listed in the drug insert. We identified 102 AEs for men and 7 for women; depression and anxiety were notable AEs for both males and females. Additionally, we examined 17 AEs in patients under 18 years old, 157 in patients aged 18 to 65 years, and 133 in patients aged 65 years and older. Each age group exhibited unique AEs, although erectile dysfunction, decreased libido, depression, suicidal ideation, psychotic disorder, and attention disturbance were common AEs observed across the different age brackets. The median onset time across all cases was 61 days. Onset occurred mainly within one month of initiating finasteride, and it is noteworthy that the second-highest number of cases involved adverse drug reactions persisting beyond one year of treatment. Conclusion: Our study uncovered both known and novel AEs associated with finasteride. Some of these AEs were consistent with the drug's label, while others signaled AEs not documented in the label. In addition, some AEs showed variations based on gender and age. Consequently, our findings offer valuable insights for future research on the safety of finasteride and are anticipated to enhance its safe use.
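For readers less familiar with these pharmacovigilance metrics, the two frequentist measures reduce to arithmetic on a 2x2 contingency table of reports; the sketch below uses invented counts, not values extracted from FAERS.

```python
# Worked sketch of the 2x2 disproportionality arithmetic behind the ROR and PRR.
# The counts are illustrative placeholders, not FAERS data.
import math

a = 120    # reports with finasteride AND the AE of interest
b = 880    # reports with finasteride, other AEs
c = 3000   # reports with other drugs AND the AE of interest
d = 96000  # reports with other drugs, other AEs

ror = (a / b) / (c / d)              # reporting odds ratio
prr = (a / (a + b)) / (c / (c + d))  # proportionate reporting ratio

# 95% CI for ln(ROR) via the usual normal approximation.
se_ln_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(ror) - 1.96 * se_ln_ror)
ci_high = math.exp(math.log(ror) + 1.96 * se_ln_ror)

print(f"ROR = {ror:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), PRR = {prr:.2f}")
# A signal is conventionally flagged when, e.g., the lower CI bound of the ROR exceeds 1;
# BCPNN and MGPS apply Bayesian shrinkage instead of these simple ratios.
```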
{"title":"Multidimensional assessment of adverse events of finasteride:a real-world pharmacovigilance analysis based on FDA Adverse Event Reporting System (FAERS) from 2004 to April 2024","authors":"Xiaoling Zhong, Yihan Yang, Sheng Wei, Yuchen Liu","doi":"10.1101/2024.08.21.24312383","DOIUrl":"https://doi.org/10.1101/2024.08.21.24312383","url":null,"abstract":"Background Finasteride is commonly utilized in clinical practice for treating androgenetic alopecia, but real-world data regarding the long-term safety of its adverse events remains incomplete, necessitating ongoing supplementation. This study aims to evaluate the adverse events (AEs) associated with finasteride use, based on data from the US Food and Drug Administration Adverse Event Reporting System (FAERS), to contribute to its safety assessment. Methods We reviewed adverse event reports associated with finasteride from the FAERS database, covering the period from the first quarter of 2004 to the first quarter of 2024. We assessed the safety of finasteride medication and AEs using four proportional disproportionality analyses: reported odds ratio, proportionate reporting ratio (PRR), Bayesian Confidence Propagation Neural Network (BCPN), and Multi-Item Gamma Poisson Shrinkage (MGPS). These methods were used to evaluate the of finasteride medication and AEs. whether there is a significant association between finasteride drug use and AEs. To investigate potential safety issues related to drug use, we further analyzed the similarities and differences in the onset time and AEs by gender, as well as the similarities and differences in AEs by age. Results Among the 11,557 adverse event reports where finasteride was the primary suspected drug, most patients affected were male (86.04%), with a significant proportion being the young adult aged 18-45 years (27.22%). We categorized 73 adverse events (AEs) into 7 different system organ categories (SOCs), which included common AEs like erectile dysfunction and sexual dysfunction. Notably, Peyronie's disease and post 5α reductase inhibitor syndrome were AEs not listed in the drug insert. We identified 102 AEs for men and 7 for women. Depression and anxiety were notable AEs for both male and female. Additionally, we examined 17 adverse events (AEs) in patients under 18 years old, 157 in patients aged 18 to 65 years, and 133 in patients aged 65 years and older. Each age group exhibited unique AEs, although erectile dysfunction, decreased libido, depression, suicidal ideation, psychotic disorder, and attention disturbance were common AEs observed across different age brackets. Ultimately, the median onset time for all instances was 61 days. The onset was mainly within one month after initiation of finasteride and it is noteworthy that the second highest number of cases involved adverse drug reactions persisted beyond one year of treatment. Conclusion The results of our study uncovered both known and novel AEs associated with finasteride medication. Some of these AEs were identical to the specification, and some of them signaled AEs that were not demonstrated in the specification. In addition, some AEs showed variations based on gender and age in our study. 
Consequently, our findings offer valuable insights for future research on the safety of finasteride medication and are anticipated to enhance its safe use i","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Text-to-Image Generated Photorealistic Images of Human Anatomy
Pub Date: 2024-08-21 | DOI: 10.1101/2024.08.21.24312353
Paula Muhr, Yating Pan, Charlotte Tumescheit, Ann-Kathrin Kuebler, Hatice Kuebra Parmaksiz, Cheng Chen, Pablo Sebastian Bolanos Orozco, Soeren S. Lienkamp, Janna Hastings
Background: Generative AI models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and synthetic data. However, it can be challenging to evaluate and compare their range of heterogeneous outputs, and thus there is a need for a systematic approach enabling image and model comparisons. Methods: We develop an error classification system for annotating errors in AI-generated photorealistic images of humans and apply our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL and Stable Cascade) using 10 prompts with 8 images per prompt. The error classification system identifies five different error types with three different severities across five anatomical regions and specifies an associated quantitative scoring method based on aggregated proportions of errors per expected count of anatomical components for the generated image. We assess inter-rater agreement by double-annotating 25% of the images and calculating Krippendorff's alpha, and compare results across the three models and ten prompts quantitatively using a cumulative score per image. Findings: The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed that DALL-E 3 performed consistently better than Stable Diffusion; however, the latter generated images reflecting more diversity in personal attributes. Images with groups of people were more challenging for all the models than individuals or pairs; some prompts were challenging for all models. Interpretation: Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications.
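The exact scoring scheme lives in the authors' repository; the sketch below only illustrates the general shape of a severity-weighted, per-image cumulative score normalised by the expected count of each anatomical component. The severity weights, region names, and expected counts are assumptions for the example, not the published scheme.

```python
# Hedged sketch of a cumulative per-image error score: each annotated error is
# weighted by severity and divided by the expected count of that anatomical
# component for the prompt.
from dataclasses import dataclass

SEVERITY_WEIGHT = {"minor": 1, "moderate": 2, "severe": 3}  # assumed weighting

@dataclass
class ErrorAnnotation:
    region: str      # e.g. "hand", "face", "torso", "leg", "arm"
    error_type: str  # e.g. "missing", "extra", "deformed", "fused", "misplaced"
    severity: str    # "minor" | "moderate" | "severe"

def cumulative_score(errors: list[ErrorAnnotation], expected_counts: dict[str, int]) -> float:
    """Sum severity-weighted errors, normalising by how many of that region the image should contain."""
    return sum(SEVERITY_WEIGHT[e.severity] / expected_counts[e.region] for e in errors)

# Example: a prompt depicting two people, so 4 hands and 2 faces are expected.
expected = {"hand": 4, "face": 2, "torso": 2, "leg": 4, "arm": 4}
annotations = [
    ErrorAnnotation("hand", "extra", "severe"),
    ErrorAnnotation("hand", "deformed", "moderate"),
    ErrorAnnotation("face", "deformed", "minor"),
]
print(f"cumulative error score: {cumulative_score(annotations, expected):.2f}")  # (3 + 2)/4 + 1/2 = 1.75
```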
{"title":"Evaluating Text-to-Image Generated Photorealistic Images of Human Anatomy","authors":"Paula Muhr, Yating Pan, Charlotte Tumescheit, Ann-Kathrin Kuebler, Hatice Kuebra Parmaksiz, Cheng Chen, Pablo Sebastian Bolanos Orozco, Soeren S. Lienkamp, Janna Hastings","doi":"10.1101/2024.08.21.24312353","DOIUrl":"https://doi.org/10.1101/2024.08.21.24312353","url":null,"abstract":"Background: Generative AI models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and synthetic data. However, it can be challenging to evaluate and compare their range of heterogeneous outputs, and thus there is a need for a systematic approach enabling image and model comparisons. Methods: We develop an error classification system for annotating errors in AI-generated photorealistic images of humans and apply our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL and Stable Cascade) using 10 prompts with 8 images per prompt. The error classification system identifies five different error types with three different severities across five anatomical regions and specifies an associated quantitative scoring method based on aggregated proportions of errors per expected count of anatomical components for the generated image. We assess inter-rater agreement by double-annotating 25% of the images and calculating Krippendorf's alpha and compare results across the three models and ten prompts quantitatively using a cumulative score per image. Findings: The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed DALL-E 3 performed consistently better than Stable Diffusion, however, the latter generated images reflecting more diversity in personal attributes. Images with groups of people were more challenging for all the models than individuals or pairs; some prompts were challenging for all models. Interpretation: Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic diagnostic support for diagnosis of pulmonary fibrosis
Pub Date: 2024-08-20 | DOI: 10.1101/2024.08.14.24312012
Ravi Pal, Anna Barney, Giacomo Sgalla, Simon L. F. Walsh, Nicola Sverzellati, Sophie Fletcher, Stefania Cerri, Maxime Cannesson, Luca Richeldi
Patients with pulmonary fibrosis (PF) often experience long waits before getting a correct diagnosis, and this delay in reaching specialized care is associated with increased mortality, regardless of the severity of the disease. Early diagnosis and timely treatment of PF can potentially extend life expectancy and maintain a better quality of life. Crackles present in recorded lung sounds may be crucial for the early diagnosis of PF. This paper describes an automated system for differentiating lung sounds related to PF from other pathological lung conditions using the average number of crackles per breath cycle (NOC/BC). The system is divided into four main parts: (1) preprocessing, (2) separation of crackles from normal breath sounds, (3) crackle verification and counting, and (4) estimating NOC/BC. The system was tested on a dataset consisting of 48 (24 fibrotic and 24 non-fibrotic) subjects, and the results were compared with an assessment by two expert respiratory physicians. The set of HRCT images, reviewed by two expert radiologists for the presence or absence of pulmonary fibrosis, was used as the ground truth for evaluating the PF and non-PF classification performance of the system. The overall performance of the automatic classifier, based on a receiver operating characteristic curve-derived cut-off value for average NOC/BC of 18.65 (AUC=0.845, 95% CI 0.739-0.952, p<0.001; sensitivity=91.7%; specificity=59.3%), compares favorably with the averaged performance of the physicians (sensitivity=83.3%; specificity=56.25%). Although radiological assessment should remain the gold standard for diagnosis of fibrotic interstitial lung disease, the automatic classification system has strong potential for diagnostic support, especially in assisting general practitioners in the auscultatory assessment of lung sounds to prompt further diagnostic work-up of patients with suspected interstitial lung disease.
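The final decision stage amounts to thresholding each subject's average NOC/BC at the reported cut-off of 18.65. The sketch below applies that threshold to synthetic values (not the study data) and recomputes the headline metrics, assuming the HRCT reading is available as the binary ground truth.

```python
# Sketch of the classification stage only: flag a subject as likely fibrotic when
# the estimated average number of crackles per breath cycle exceeds 18.65.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(1)
noc_bc_fibrotic = rng.normal(loc=28, scale=8, size=24).clip(min=0)      # placeholder NOC/BC values
noc_bc_non_fibrotic = rng.normal(loc=14, scale=7, size=24).clip(min=0)  # placeholder NOC/BC values

scores = np.concatenate([noc_bc_fibrotic, noc_bc_non_fibrotic])
labels = np.concatenate([np.ones(24, dtype=int), np.zeros(24, dtype=int)])  # 1 = PF on HRCT

predictions = (scores > 18.65).astype(int)
tn, fp, fn, tp = confusion_matrix(labels, predictions).ravel()
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
print(f"sensitivity = {tp / (tp + fn):.3f}, specificity = {tn / (tn + fp):.3f}")
```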
{"title":"Automatic diagnostic support for diagnosis of pulmonary fibrosis","authors":"Ravi Pal, Anna Barney, Giacomo Sgalla, Simon L. F. Walsh, Nicola Sverzellati, Sophie Fletcher, Stefania Cerri, Maxime Cannesson, Luca Richeldi","doi":"10.1101/2024.08.14.24312012","DOIUrl":"https://doi.org/10.1101/2024.08.14.24312012","url":null,"abstract":"Patients with pulmonary fibrosis (PF) often experience long waits before getting a correct diagnosis, and this delay in reaching specialized care is associated with increased mortality, regardless of the severity of the disease. Early diagnosis and timely treatment of PF can potentially extend life expectancy and maintain a better quality of life. Crackles present in the recorded lung sounds may be crucial for the early diagnosis of PF. This paper describes an automated system for differentiating lung sounds related to PF from other pathological lung conditions using the average number of crackles per breath cycle (NOC/BC). The system is divided into four main parts: (1) preprocessing, (2) separation of crackles from normal breath sounds, (3) crackle verification and counting, and (4) estimating NOC/BC. The system was tested on a dataset consisting of 48 (24 fibrotic and 24 non-fibrotic) subjects and the results were compared with an assessment by two expert respiratory physicians. The set of HRCT images, reviewed by two expert radiologists for the presence or absence of pulmonary fibrosis, was used as the ground truth for evaluating the PF and non-PF classification performance of the system. The overall performance of the automatic classifier based on receiver operating curve-derived cut-off value for average NOC/BC of 18.65 (AUC=0.845, 95 % CI 0.739-0.952, p<0.001; sensitivity=91.7 %; specificity=59.3 %) compares favorably with the averaged performance of the physicians (sensitivity=83.3 %; specificity=56.25 %). Although radiological assessment should remain the gold standard for diagnosis of fibrotic interstitial lung disease, the automatic classification system has strong potential for diagnostic support, especially in assisting general practitioners in the auscultatory assessment of lung sounds to prompt further diagnostic work up of patients with suspect of interstitial lung disease.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks
Pub Date: 2024-08-20 | DOI: 10.1101/2024.08.14.24312010
Shelly Soffer, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander Charney, Girish Nadkarni, Eyal Klang
Text embeddings convert textual information into numerical representations, enabling machines to perform semantic tasks like information retrieval. Despite their potential, the application of text embeddings in healthcare is underexplored, in part due to a lack of benchmarking studies using biomedical data. This study provides a flexible framework for benchmarking embedding models to identify those most effective for healthcare-related semantic tasks. We selected thirty embedding models of various parameter sizes and architectures from the Massive Text Embedding Benchmark (MTEB) Hugging Face resource. Models were tested on real-world semantic retrieval medical tasks using (1) PubMed abstracts, (2) synthetic Electronic Health Records (EHRs) generated by the Llama-3-70b model, (3) real-world patient data from the Mount Sinai Health System, and (4) the MIMIC-IV database. Tasks were split into Short Tasks, involving brief text pair interactions such as triage notes and chief complaints, and Long Tasks, which required processing extended documentation such as progress notes and history & physical notes. We assessed models by correlating their performance with data integrity levels, ranging from 0% (fully mismatched pairs) to 100% (perfectly matched pairs), using Spearman correlation. Additionally, we examined correlations between the average Spearman scores across tasks and two MTEB leaderboard benchmarks: the overall recorded average and the average Semantic Textual Similarity (STS) score. We evaluated 30 embedding models across seven clinical tasks (each involving 2,000 text pairs), across five levels of data integrity, totaling 2.1 million comparisons. Some models performed consistently well, while models based on Mistral-7b excelled in long-context tasks. NV-Embed-v1, despite being the top performer in short tasks, did not perform as well in long tasks. Our average task performance score (ATPS) correlated better with the MTEB STS score (0.73) than with the MTEB average score (0.67). The suggested framework is flexible, scalable, and resistant to the risk of models overfitting on published benchmarks. Adopting this method can improve embedding technologies in healthcare.
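A minimal version of the benchmarking loop might look like the sketch below: corrupt a controlled fraction of note-complaint pairs, score each integrity level by the model's mean pair similarity, and Spearman-correlate the score against the integrity level. The checkpoint name and the toy pairs are generic stand-ins, not the thirty MTEB models or the clinical datasets used in the study.

```python
# Hedged sketch of correlating an embedding model's pair-similarity score with
# data integrity levels (fraction of correctly matched note/complaint pairs).
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic example checkpoint

notes = [
    "Chest pain radiating to the left arm with diaphoresis.",
    "Fever and productive cough for three days.",
    "Right lower quadrant abdominal pain with nausea.",
    "Sudden-onset slurred speech and right-sided weakness.",
]
complaints = ["chest pain", "cough and fever", "abdominal pain", "possible stroke"]
emb_notes = model.encode(notes, convert_to_tensor=True)

levels, level_scores = [], []
for integrity in (0.0, 0.5, 1.0):
    paired = complaints.copy()
    n_mismatch = round(len(paired) * (1 - integrity))
    if n_mismatch >= 2:
        # rotate the first n_mismatch complaints so those pairs no longer match
        paired[:n_mismatch] = paired[1:n_mismatch] + paired[:1]
    emb_pairs = model.encode(paired, convert_to_tensor=True)
    pair_sims = util.cos_sim(emb_notes, emb_pairs).diagonal()
    levels.append(integrity)
    level_scores.append(float(pair_sims.mean()))

rho, _ = spearmanr(levels, level_scores)
print("Spearman rho between integrity level and mean pair similarity:", round(rho, 3))
```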
{"title":"A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks","authors":"Shelly Soffer, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander Charney, Girish Nadkarni, Eyal Klang","doi":"10.1101/2024.08.14.24312010","DOIUrl":"https://doi.org/10.1101/2024.08.14.24312010","url":null,"abstract":"Text embeddings convert textual information into numerical representations, enabling machines to perform semantic tasks like information retrieval. Despite its potential, the application of text embeddings in healthcare is underexplored in part due to a lack of benchmarking studies using biomedical data. This study provides a flexible framework for benchmarking embedding models to identify those most effective for healthcare-related semantic tasks. We selected thirty embedding models from the multilingual text embedding benchmarks (MTEB) Hugging Face resource, of various parameter sizes and architectures. Models were tested with real-world semantic retrieval medical tasks on (1) PubMed abstracts, (2) synthetic Electronic Health Records (EHRs) generated by the Llama-3-70b model, (3) real-world patient data from the Mount Sinai Health System, and the (4) MIMIC IV database. Tasks were split into Short Tasks, involving brief text pair interactions such as triage notes and chief complaints, and Long Tasks, which required processing extended documentation such as progress notes and history & physical notes. We assessed models by correlating their performance with data integrity levels, ranging from 0% (fully mismatched pairs) to 100% (perfectly matched pairs), using Spearman correlation. Additionally, we examined correlations between the average Spearman scores across tasks and two MTEB leaderboard benchmarks: the overall recorded average and the average Semantic Textual Similarity (STS) score. We evaluated 30 embedding models across seven clinical tasks (each involving 2,000 text pairs), across five levels of data integrity, totaling 2.1 million comparisons. Some models performed consistently well, while models based on Mistral-7b excelled in long-context tasks. NV-Embed-v1, despite being top performer in short tasks, did not perform as well in long tasks. Our average task performance score (ATPS) correlated better with the MTEB STS score (0.73) than with MTEB average score (0.67). The suggested framework is flexible, scalable and resistant to the risk of models overfitting on published benchmarks. Adopting this method can improve embedding technologies in healthcare.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"256 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel fusion method of 3D MRI and test results through deep learning for the early detection of Alzheimer’s disease
Pub Date: 2024-08-20 | DOI: 10.1101/2024.08.15.24312032
Arman Atalar, Nihat Adar, Savaş Okyay
Alzheimer’s disease (AD) is a prevalent form of dementia that impacts brain cells. Although its likelihood increases with age, there is no transitional period between its stages. In order to enhance diagnostic precision, physicians rely on clinical judgments derived from interpreting health data, considering demographics, clinical history, and laboratory results to detect AD at an early stage. While patient cognitive tests and demographic information are primarily presented in text, brain scan images are presented in graphic formats. Researchers typically use different classifiers for each data format and then merge the classifier outcomes to maximize classification accuracy and utilize all patient-related data for the final decision. However, this approach leads to low performance, diminishing predictive abilities and model effectiveness. We propose an innovative approach that combines diverse textual health records (HR) with three-dimensional structural magnetic resonance imaging (3D sMRI) to achieve a similar objective in computer-aided diagnosis, utilizing a novel deep learning technique. Health records, encompassing demographic features like age, gender, apolipoprotein gene, and mini-mental state examination score, are fused with 3D sMRI, enabling a graphic-based deep learning strategy for early AD detection. The fusion of data is accomplished by representing textual information as graphic pipes and integrating them into 3D sMRI, a method referred to as the “pipe-laying” method. Experimental results from over 4000 sMRI scans of 780 patients in the AD Neuroimaging Initiative (ADNI) dataset demonstrate that the pipe-laying method enhances recognition accuracy rates for Early and Late Mild Cognitive Impairment (MCI) patients, accurately classifying all AD patients. In a 4-class AD diagnosis scenario, accuracy improved from 86.87% when only 3D images were used to 90.00% when 3D sMRI and patient health records were included. Thus, the positive impact of combining 3D sMRI with HR on 4-class AD diagnosis was established.
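Reading the abstract's description of the fusion step literally, the "pipe-laying" idea could be sketched as below: each normalised health-record feature is rendered as a constant-intensity bar appended to the sMRI volume so that a single 3D network sees both modalities. The pipe geometry, feature scaling, and placement here are assumptions; the authors' exact construction may differ.

```python
# Hedged numpy sketch of embedding scalar health-record features as "pipes"
# alongside a 3D sMRI volume.
import numpy as np

def lay_pipes(volume: np.ndarray, features: dict[str, float], pipe_width: int = 4) -> np.ndarray:
    """Append one constant-intensity 'pipe' per normalised feature along the last axis."""
    d, h, _ = volume.shape
    margin = np.zeros((d, h, pipe_width * len(features)), dtype=volume.dtype)
    for i, (name, value) in enumerate(sorted(features.items())):
        margin[:, :, i * pipe_width:(i + 1) * pipe_width] = value  # value assumed pre-scaled to [0, 1]
    return np.concatenate([volume, margin], axis=2)

# Toy example: a random 64^3 "scan" plus four normalised record features.
scan = np.random.default_rng(0).random((64, 64, 64), dtype=np.float32)
record = {"age": 0.74, "mmse": 0.80, "apoe4_alleles": 0.5, "sex": 1.0}  # hypothetical scaling
fused = lay_pipes(scan, record)
print(scan.shape, "->", fused.shape)  # (64, 64, 64) -> (64, 64, 80)
```

The fused array can then be fed to any 3D CNN in place of the image alone, which is the practical appeal the abstract highlights.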
{"title":"A novel fusion method of 3D MRI and test results through deep learning for the early detection of Alzheimer’s disease","authors":"Arman Atalar, Nihat Adar, Savaş Okyay","doi":"10.1101/2024.08.15.24312032","DOIUrl":"https://doi.org/10.1101/2024.08.15.24312032","url":null,"abstract":"Alzheimer’s disease (AD) is a prevalent form of dementia that impacts brain cells. Although its likelihood increases with age, there is no transitional period between its stages. In order to enhance diagnostic precision, physicians rely on clinical judgments derived from interpreting health data, considering demographics, clinical history, and laboratory results to detect AD at an early stage. While patient cognitive tests and demographic information are primarily presented in text, brain scan images are presented in graphic formats. Researchers typically use different classifiers for each data format and then merge the classifier outcomes to maximize classification accuracy and utilize all patient-related data for the final decision. However, this approach leads to low performance, diminishing predictive abilities and model effectiveness.\u0000We propose an innovative approach that combines diverse textual health records (HR) with three-dimensional structural magnetic resonance imaging (3D sMRI) to achieve a similar objective in computer-aided diagnosis, utilizing a novel deep learning technique. Health records, encompassing demographic features like age, gender, apolipoprotein gene, and mini-mental state examination score, are fused with 3D sMRI, enabling a graphic-based deep learning strategy for early AD detection. The fusion of data is accomplished by representing textual information as graphic pipes and integrating them into 3D sMRI, a method referred to as the “pipe-laying” method.\u0000Experimental results from over 4000 sMRI scans of 780 patients in the AD Neuroimaging Initiative (ADNI) dataset demonstrate that the pipe-laying method enhances recognition accuracy rates for Early and Late Mild Cognitive Impairment (MCI) patients, accurately classifying all AD patients. In a 4-class AD diagnosis scenario, accuracy improved from 86.87% when only 3D images were used to 90.00% when 3D sMRI and patient health records were included. Thus, the positive impact of combining 3D sMRI with HR on 4-class AD diagnosis was established.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ChatGPT as a bioinformatic partner
Pub Date: 2024-08-20 | DOI: 10.1101/2024.08.20.24312291
Gianluca Mondillo, Alessandra Perrotta, Simone Colosimo, Vittoria Frattolillo
The advanced Large Language Model ChatGPT4o, developed by OpenAI, can be used in the field of bioinformatics to analyze and understand cross-reactive allergic reactions. This study explores the use of ChatGPT4o to support research on allergens, particularly in the cross-reactivity syndrome between cat and pork. Using a hypothetical clinical case of a child with a confirmed allergy to Fel d 2 (cat albumin) and Sus s 1 (pork albumin), the model guided data collection, protein sequence analysis, and three-dimensional structure visualization. Through the use of bioinformatics tools like SDAP 2.0 and BepiPRED, the epitope regions of the allergenic proteins were predicted, confirming their accessibility to immunoglobulin E (IgE) and probability of cross-reactivity. The results show that regions with high epitope probability exhibit high surface accessibility and predominantly coil and helical structures. The construction of a phylogenetic tree further supported the evolutionary relationships among the studied allergens. ChatGPT4o has demonstrated its usefulness in guiding non-specialist researchers through complex bioinformatics processes, making advanced science accessible and improving analytical and innovation capabilities.
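One of the steps described, comparing the two albumin sequences, can be sketched with a standard pairwise alignment. The fragments below are placeholders rather than the actual Fel d 2 and Sus s 1 entries, which in the guided workflow would be retrieved from a database such as UniProt.

```python
# Sketch of a protein sequence comparison step using Biopython's PairwiseAligner.
from Bio import Align
from Bio.Align import substitution_matrices

# Placeholder albumin-like fragments, not the real Fel d 2 / Sus s 1 sequences.
fel_d_2_fragment = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAF"
sus_s_1_fragment = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEQHFKGLVLIAF"

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5

alignment = aligner.align(fel_d_2_fragment, sus_s_1_fragment)[0]
aligned_a, aligned_b = alignment[0], alignment[1]
matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
print(f"alignment length: {len(aligned_a)}, identity: {matches / len(aligned_a):.1%}")
```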
{"title":"ChatGPT as a bioinformatic partner.","authors":"Gianluca Mondillo, Alessandra Perrotta, Simone Colosimo, Vittoria Frattolillo","doi":"10.1101/2024.08.20.24312291","DOIUrl":"https://doi.org/10.1101/2024.08.20.24312291","url":null,"abstract":"The advanced Large Language Model ChatGPT4o, developed by OpenAI, can be used in the field of bioinformatics to analyze and understand cross-reactive allergic reactions. This study explores the use of ChatGPT4o to support research on allergens, particularly in the cross-reactivity syndrome between cat and pork. Using a hypothetical clinical case of a child with a confirmed allergy to Fel d 2 (cat albumin) and Sus s 1 (pork albumin), the model guided data collection, protein sequence analysis, and three-dimensional structure visualization. Through the use of bioinformatics tools like SDAP 2.0 and BepiPRED, the epitope regions of the allergenic proteins were predicted, con-firming their accessibility to immunoglobulin E (IgE) and probability of cross-reactivity. The results show that regions with high epitope probability exhibit high surface accessibility and predominantly coil and helical structures. The construction of a phylogenetic tree further sup-ported the evolutionary relationships among the studied allergens. ChatGPT4o has demonstrated its usefulness in guiding non-specialist researchers through complex bioinformatics processes, making advanced science accessible and improving analytical and innovation capabilities.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Meta-Transformers for Multimodal Clinical Decision Support and Evidence-Based Medicine
Pub Date: 2024-08-20 | DOI: 10.1101/2024.08.14.24312001
Sabah Mohammed, Jinan Fiaidhi, Abel Serracin Martinez
Advancements in computer vision and natural language processing are key to thriving modern healthcare systems and their applications. Nonetheless, they have largely been researched and used as separate technical entities, without integrating the predictive knowledge that can be discovered when they are combined. Such integration would benefit virtually every clinical/medical problem, as these problems are inherently multimodal: they involve several distinct forms of data, such as images and text. However, recent advancements in machine learning have brought these fields closer through the notion of meta-transformers. At the core of this synergy are models that can process and relate information from multiple modalities, where the raw input data from the various modalities are mapped into a shared token space, allowing an encoder to extract high-level semantic features of the input data. Nevertheless, the task of automatically identifying arguments in a clinical/medical text and finding their multimodal relationships remains challenging, as it relies not only on relevancy measures (e.g., how close the text is to other modalities such as an image) but also on the evidence supporting that relevancy. Relevancy based on evidence is normal practice in medicine, as every practice is evidence-based. In this article, we experiment with a variety of fine-tuned medical meta-transformers, including PubmedCLIP, CLIPMD, BiomedCLIP-PubMedBERT, and BioCLIP, to see which provides evidence-based, relevant multimodal information. Our experimentation uses the TTi-Eval open-source platform to accommodate multimodal data embeddings; this platform simplifies the integration and evaluation of different meta-transformer models and supports a variety of datasets for testing and fine-tuning. Additionally, we conduct experiments to test how relevant any multimodal prediction is to the published medical literature, especially articles indexed in PubMed. Our experiments revealed that the BiomedCLIP-PubMedBERT model provides more reliable evidence-based relevance than the other models on randomized samples from the ROCO V2 dataset and other multimodal datasets such as MedCat. In the next stage of this research, we are extending the use of the winning evidence-based multimodal learning model by adding components that enable medical practitioners to use it to predict answers to clinical questions based on a sound medical questioning protocol such as PICO and on standardized medical terminologies such as UMLS.
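The scoring step these meta-transformers share can be sketched with a generic CLIP checkpoint from Hugging Face: embed an image and candidate texts in one space and rank the texts by similarity. The public checkpoint and toy inputs below are stand-ins; the medical variants named above ship with their own weights and, in some cases, their own loaders.

```python
# Hedged sketch of CLIP-style image-text relevance scoring with a generic checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="gray")  # stand-in for a real radiology image
candidates = [
    "Chest radiograph showing right lower lobe consolidation.",
    "Histology slide of colonic adenocarcinoma.",
    "Normal 12-lead electrocardiogram.",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)

for text, p in sorted(zip(candidates, probs.tolist()), key=lambda pair: -pair[1]):
    print(f"{p:.3f}  {text}")
```

Evidence-grounding, i.e. checking a prediction against PubMed-indexed literature, would sit on top of this relevance score and is not shown here.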
{"title":"Using Meta-Transformers for Multimodal Clinical Decision Support and Evidence-Based Medicine","authors":"Sabah Mohammed, Jinan Fiaidhi, Abel Serracin Martinez","doi":"10.1101/2024.08.14.24312001","DOIUrl":"https://doi.org/10.1101/2024.08.14.24312001","url":null,"abstract":"The advancements in computer vision and natural language processing are keys to thriving modern healthcare systems and its applications. Nonetheless, they have been researched and used as separate technical entities without integrating their predictive knowledge discovery when they are combined. Such integration will benefit every clinical/medical problem as they are inherently multimodal - they involve several distinct forms of data, such as images and text. However, the recent advancements in machine learning have brought these fields closer using the notion of meta-transformers. At the core of this synergy is building models that can process and relate information from multiple modalities where the raw input data from various modalities are mapped into a shared token space, allowing an encoder to extract high-level semantic features of the input data. Nerveless, the task of automatically identifying arguments in a clinical/medical text and finding their multimodal relationships remains challenging as it does not rely only on relevancy measures (e.g. how close that text to other modalities like an image) but also on the evidence supporting that relevancy. Relevancy based on evidence is a normal practice in medicine as every practice is an evidence-based. In this article we are experimenting with meta-transformers that can benefit evidence based predictions. In this article, we are experimenting with variety of fine tuned medical meta-transformers like PubmedCLIP, CLIPMD, BiomedCLIP-PubMedBERT and BioCLIP to see which one provide evidence-based relevant multimodal information. Our experimentation uses the TTi-Eval open-source platform to accommodate multimodal data embeddings. This platform simplifies the integration and evaluation of different meta-transformers models but also to variety of datasets for testing and fine tuning. Additionally, we are conducting experiments to test how relevant any multimodal prediction to the published medical literature especially those that are published by PubMed. Our experimentations revealed that the BiomedCLIP-PubMedBERT model provide more reliable evidence-based relevance compared to other models based on randomized samples from the ROCO V2 dataset or other multimodal datasets like MedCat. In this next stage of this research we are extending the use of the winning evidence-based multimodal learning model by adding components that enable medical practitioner to use this model to predict answers to clinical questions based on sound medical questioning protocol like PICO and based on standardized medical terminologies like UMLS.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Replicating a COVID-19 study in a national England database to assess the generalisability of research with regional electronic health record data
Pub Date: 2024-08-08 | DOI: 10.1101/2024.08.06.24311538
Richard Williams, David Jenkins, Thomas Bolton, Adrian Heald, Mehrdad A Mizani, Matthew Sperrin, Niels Peek, CVD-COVID-UK/COVID-IMPACT Consortium
Introduction: The replication of observational studies using electronic health record data is critical for the evidence base of epidemiology. We have previously performed a study using linked primary and secondary care data in a large, urbanised region (Greater Manchester Care Record, Greater Manchester, UK) to compare the hospitalization rates of patients with diabetes (type 1 or type 2) after contracting COVID-19 with matched controls. Methods: In this study we repeated the analysis using a national database covering the whole of England, UK (NHS England's Secure Data Environment service for England, accessed via the BHF Data Science Centre's CVD-COVID-UK/COVID-IMPACT Consortium). Results: We found that many of the effect sizes did not show a statistically significant difference. Where effect sizes were statistically significant in the regional study, they remained significant in the national study, and the effect size was in the same direction and of similar magnitude. Conclusion: There is some evidence that findings from studies in smaller regional datasets can be extrapolated to a larger, national setting. However, there were some significant differences, and therefore replication studies remain an essential part of healthcare research.
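One simple way to formalise "same direction and similar magnitude" when comparing a regional estimate with a national one is a z-test on the difference of log effect sizes recovered from each study's confidence interval. The numbers below are invented for illustration and are not estimates from either analysis.

```python
# Illustrative comparison of two odds ratios via a z-test on the log scale.
import math
from scipy.stats import norm

def log_or_and_se(or_point: float, ci_low: float, ci_high: float) -> tuple[float, float]:
    """Recover ln(OR) and its standard error from a point estimate and 95% CI."""
    return math.log(or_point), (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)

regional = log_or_and_se(1.45, 1.10, 1.91)  # hypothetical regional OR (95% CI)
national = log_or_and_se(1.38, 1.25, 1.52)  # hypothetical national OR (95% CI)

diff = regional[0] - national[0]
se_diff = math.sqrt(regional[1] ** 2 + national[1] ** 2)
z = diff / se_diff
p = 2 * (1 - norm.cdf(abs(z)))
print(f"difference in ln(OR) = {diff:.3f}, z = {z:.2f}, p = {p:.3f}")
```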
{"title":"Replicating a COVID-19 study in a national England database to assess the generalisability of research with regional electronic health record data","authors":"Richard Williams, David Jenkins, Thomas Bolton, Adrian Heald, Mehrdad A Mizani, Matthew Sperrin, Niels Peek, CVD-COVID-UK/COVID-IMPACT Consortium","doi":"10.1101/2024.08.06.24311538","DOIUrl":"https://doi.org/10.1101/2024.08.06.24311538","url":null,"abstract":"Introduction\u0000The replication of observational studies using electronic health record data is critical for the evidence base of epidemiology. We have previously performed a study using linked primary and secondary care data in a large, urbanised region (Greater Manchester Care Record, Greater Manchester, UK) to compare the hospitalization rates of patients with diabetes (type 1 or type 2) after contracting COVID-19 with matched controls.\u0000Methods\u0000In this study we repeated the analysis using a national database covering the whole of England, UK (NHS England's Secure Data Environment service for England, accessed via the BHF Data Science Centre's CVD-COVID-UK/COVID-IMPACT Consortium).\u0000Results\u0000We found that many of the effect sizes did not show a statistically significant difference. Where effect sizes were statistically significant in the regional study, then they remained significant in the national study and the effect size was the same direction and of similar magnitude.\u0000Conclusion\u0000There is some evidence that the findings from studies in smaller regional datasets can be extrapolated to a larger, national setting. However, there were some significant differences and therefore replication studies remain an essential part of healthcare research.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141969760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}