Applications of Text Mining techniques to extract meaningful information from gastroenterology medical reports
Rosarina Vallelunga, Ileana Scarpino, Maria Chiara Martinis, Francesco Luzza, Chiara Zucco
Journal of Computational Science, Volume 83, Article 102458. Published 2024-11-05. DOI: 10.1016/j.jocs.2024.102458
https://www.sciencedirect.com/science/article/pii/S1877750324002515
Abstract
Text mining techniques, particularly topic modeling, can be used for the automatic extraction of information from medical reports. The ability to autonomously analyze texts and identify topics within them can provide meaningful clinical insights that support physicians in diagnostic settings and enhance the characterization of intestinal diseases, leading to more efficient and automated systems.
This study evaluates the effectiveness of Latent Dirichlet Allocation (LDA) and BERTopic in modeling topics from colonoscopy reports related to Crohn’s Disease, Ulcerative Colitis, and Polyps. We compared these models in terms of their ability to identify clinically relevant topics, their influence on the performance of machine learning classifiers trained on the derived topic features, and their scalability.
Our analysis, based on average results across five iterations of train-test splits, showed that BERTopic generally outperformed LDA in clustering metrics, achieving Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Purity scores of 0.5637, 0.5953, and 0.8447, respectively, compared to LDA’s scores of 0.5349, 0.5254, and 0.8149. Additionally, classifiers trained on BERTopic-derived features exhibited improved predictive accuracy and F1-scores, with Logistic Regression reaching a mean accuracy of 0.8464 and a mean F1-score of 0.8507, compared to 0.8319 and 0.8351 for LDA-based features. Despite BERTopic’s overall superior performance, LDA demonstrated greater stability and interpretability, making it a viable option in scenarios where computational efficiency is a priority.
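The three clustering metrics named above can be illustrated with a minimal sketch that uses only the Python standard library. It follows the textbook definitions of Purity, ARI, and NMI; the abstract does not say which implementation the authors used, and the arithmetic-mean normalization for NMI is an assumption. The function names and the toy labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter
from math import comb, log


def _counts(labels_true, labels_pred):
    """Joint and marginal counts for two labelings of the same documents."""
    joint = Counter(zip(labels_true, labels_pred))
    return joint, Counter(labels_true), Counter(labels_pred), len(labels_true)


def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority class of their topic."""
    joint, _, _, n = _counts(labels_true, labels_pred)
    best = Counter()
    for (_, topic), count in joint.items():
        best[topic] = max(best[topic], count)
    return sum(best.values()) / n


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting agreement between two labelings, corrected for chance."""
    joint, classes, topics, n = _counts(labels_true, labels_pred)
    if n < 2:
        return 1.0
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_cls = sum(comb(c, 2) for c in classes.values())
    sum_top = sum(comb(c, 2) for c in topics.values())
    expected = sum_cls * sum_top / comb(n, 2)
    max_index = (sum_cls + sum_top) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


def normalized_mutual_information(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies (assumed)."""
    joint, classes, topics, n = _counts(labels_true, labels_pred)

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_cls, h_top = entropy(classes), entropy(topics)
    if h_cls == 0 and h_top == 0:
        return 1.0
    mi = sum(
        (c / n) * log(n * c / (classes[cls] * topics[top]))
        for (cls, top), c in joint.items()
    )
    return mi / ((h_cls + h_top) / 2)


# Hypothetical toy example: diagnosis labels vs. dominant topic per report.
true_labels = ["crohn", "crohn", "crohn", "uc", "uc", "uc", "polyp", "polyp"]
topic_ids = [0, 0, 1, 1, 1, 1, 2, 2]

scores = {
    "ARI": adjusted_rand_index(true_labels, topic_ids),
    "NMI": normalized_mutual_information(true_labels, topic_ids),
    "Purity": purity(true_labels, topic_ids),
}
print(scores)
```

Purity rewards topics dominated by a single diagnosis but can be inflated by using many small topics, which is one reason to report it alongside ARI and NMI, as the abstract does.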
About the journal:
Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory.
Recent advances in experimental techniques, such as detectors, on-line sensor networks and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation.
This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods.
Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g. numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g. problem solving environments).