Title: A precise machine learning model: Detecting cervical cancer using feature selection and explainable AI
Authors: Rashiduzzaman Shakil, Sadia Islam, Bonna Akter
Journal: Journal of Pathology Informatics, volume 15, Article 100398 (JCR Q2, Medicine)
Publication type: Journal Article
Publication date: 2024-09-26
DOI: 10.1016/j.jpi.2024.100398
URL: https://www.sciencedirect.com/science/article/pii/S2153353924000373
Citation count: 0
Abstract
Cervical cancer remains a significant global health challenge. Because of inadequate early-stage screening and healthcare disparities, a large number of women suffer from this disease, and the mortality rate continues to rise. In this study, we present an approach that uses six machine learning models (decision tree, logistic regression, naïve Bayes, random forest, k-nearest neighbors, and support vector machine) to predict early-stage cervical cancer by analysing 36 risk factor attributes of 858 individuals. Two data balancing techniques, Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN), were used to mitigate the data imbalance. Two feature selection methods, Chi-square and Least Absolute Shrinkage and Selection Operator (LASSO), were applied to rank the features most strongly associated with the disease, and an explainable artificial intelligence technique, Shapley Additive Explanations (SHAP), was integrated to clarify the model outcomes. The models were evaluated using accuracy, sensitivity, specificity, precision, F1-score, false-positive rate, false-negative rate, and area under the receiver operating characteristic curve. The decision tree (DT) performed best with Chi-square feature selection, achieving 97.60% accuracy, 98.73% sensitivity, 80% specificity, and 98.73% precision. Under data imbalance, DT achieved 97% accuracy, 99.35% sensitivity, 69.23% specificity, and 97.45% precision. This research focuses on developing diagnostic frameworks with automated tools to improve the detection and management of cervical cancer and to help healthcare professionals deliver more efficient and personalized care to their patients.
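The evaluation metrics named in the abstract all derive from the four cells of a binary confusion matrix. The sketch below shows one way they could be computed. It is not the authors' code: the class and function names are hypothetical, and the counts in the usage example are illustrative and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the threshold-based metrics listed in the abstract."""
    sensitivity = ratio(c.tp, c.tp + c.fn)   # recall / true-positive rate
    specificity = ratio(c.tn, c.tn + c.fp)   # true-negative rate
    precision = ratio(c.tp, c.tp + c.fp)     # positive predictive value
    return {
        "accuracy": ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1_score": ratio(2 * precision * sensitivity, precision + sensitivity),
        "false_positive_rate": ratio(c.fp, c.fp + c.tn),
        "false_negative_rate": ratio(c.fn, c.fn + c.tp),
    }


# Illustrative counts only (assumed for the example, not the study's results).
example = evaluation_metrics(ConfusionCounts(tp=90, fp=10, tn=40, fn=10))
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Area under the ROC curve is left out because it depends on ranked prediction scores, not on a single confusion matrix. The example also shows why specificity deserves attention on imbalanced data: accuracy and sensitivity can stay high while specificity drops, as in the abstract's second set of results.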
About the journal:
The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, symposia, meeting abstracts, book reviews, and correspondence to the editors. All submissions are subject to rigorous peer review by the well-regarded editorial board and by expert referees in appropriate specialties.