Authors: Sumaira Saeed, Quratulain Rajput, Sajjad Haider
Journal: Information Processing & Management (Impact Factor 7.4; JCR Q1, Computer Science, Information Systems)
Publication date: 2024-06-18
DOI: 10.1016/j.ipm.2024.103771
URL: https://www.sciencedirect.com/science/article/pii/S0306457324001316
SUMEX: A hybrid framework for Semantic textUal siMilarity and EXplanation generation
Measuring semantic similarity between two pieces of text is a well-known problem in natural language processing (NLP). It has many applications, such as finding similar patient medical notes to accelerate diagnosis, plagiarism detection, and document clustering. Most state-of-the-art models are based on machine or deep learning and do not adequately explain their results, which limits their adoption in critical domains such as healthcare. This paper presents SUMEX (Semantic textUal siMilarity and EXplanation generation), a hybrid framework that uniquely combines an ontology with a state-of-the-art embedding-based model for semantic textual similarity. The primary strength of the framework is that it explains its results in human-understandable natural language, which is vital in critical domains such as healthcare. Experiments were conducted on two datasets of clinical notes using four embeddings: ScispaCy, BioWord2Vec, ClinicalBERT, and a customized Word2Vec trained on clinical notes. On the ClinicalSTS benchmark datasets, SUMEX outperforms the embedding-based model, improving average precision scores by 7% and reducing the false-positive rate by 23%. On the Patients Similarity Dataset, SUMEX improved the average top-five and top-three precision scores by 14% and 10%, respectively. SUMEX also generates natural-language explanations for its results. Domain experts evaluated these explanations and rated them highly, with scores of 90% for Completeness and 93% for Correctness. ChatGPT was also used to produce similarity scores and explanations; in the experiments, SUMEX performed better than ChatGPT.
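The abstract reports results in terms of embedding-based similarity, top-k precision, and false-positive rate. As a rough illustration only, the sketch below uses the textbook definitions of these quantities; the function names, the toy vectors, and the note identifiers are all hypothetical, and the paper's actual evaluation protocol and its ontology component are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def false_positive_rate(predicted, actual):
    """FP / (FP + TN) over paired boolean predictions and ground-truth labels."""
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy data (assumed for illustration): rank candidate notes against a query note.
query = [1.0, 0.0, 1.0]
candidates = {
    "note_a": [1.0, 0.0, 1.0],
    "note_b": [0.0, 1.0, 0.0],
    "note_c": [1.0, 1.0, 1.0],
}
ranking = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
top2_precision = precision_at_k(ranking, {"note_a", "note_c"}, k=2)
```

Here `ranking` orders the candidate notes by similarity to the query, and `top2_precision` measures how many of the two best-ranked notes are in the (assumed) set of truly similar notes.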
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.