基于输入数据的乳腺癌复发分类:模拟研究。

IF 7.9 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Biodata Mining Pub Date : 2022-12-07 DOI:10.1186/s13040-022-00316-8

Rahibu A Abassi, Amina S Msengwa

{"title":"基于输入数据的乳腺癌复发分类:模拟研究。","authors":"Rahibu A Abassi, Amina S Msengwa","doi":"10.1186/s13040-022-00316-8","DOIUrl":null,"url":null,"abstract":"Several studies have been conducted to classify various real life events but few are in medical fields; particularly about breast recurrence under statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers' discriminative ability on breast cancer recurrence in presence of imputed missing data. Therefore, this article aims to fill this analysis gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) using several datasets resulted from imputation process using various simulation conditions. Our study aids the knowledge about how classifiers' accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% of missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputations via chained equation, expected-maximisation, and predictive mean matching were used to impute incomplete datasets. For each classifier, correct classification accuracy and area under the Receiver Operating Characteristic (ROC) curves under MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) based on mean-imputed data at 45% of missing data under MCAR mechanism. As a classifier, the logistic regression based on predictive mean matching imputed-data yields the greatest areas under ROC curves (0.6418) at 30% missingness while k-nearest neighbour tops the value (0.6428) at 60% of missing data under MCAR mechanism.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"15 1","pages":"30"},"PeriodicalIF":7.9000,"publicationDate":"2022-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9727846/pdf/","citationCount":"0","resultStr":"{\"title\":\"Classification of breast cancer recurrence based on imputed data: a simulation study.\",\"authors\":\"Rahibu A Abassi, Amina S Msengwa\",\"doi\":\"10.1186/s13040-022-00316-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Several studies have been conducted to classify various real life events but few are in medical fields; particularly about breast recurrence under statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers' discriminative ability on breast cancer recurrence in presence of imputed missing data. Therefore, this article aims to fill this analysis gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) using several datasets resulted from imputation process using various simulation conditions. Our study aids the knowledge about how classifiers' accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% of missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputations via chained equation, expected-maximisation, and predictive mean matching were used to impute incomplete datasets. For each classifier, correct classification accuracy and area under the Receiver Operating Characteristic (ROC) curves under MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) based on mean-imputed data at 45% of missing data under MCAR mechanism. As a classifier, the logistic regression based on predictive mean matching imputed-data yields the greatest areas under ROC curves (0.6418) at 30% missingness while k-nearest neighbour tops the value (0.6428) at 60% of missing data under MCAR mechanism.\",\"PeriodicalId\":48947,\"journal\":{\"name\":\"Biodata Mining\",\"volume\":\"15 1\",\"pages\":\"30\"},\"PeriodicalIF\":7.9000,\"publicationDate\":\"2022-12-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9727846/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Biodata Mining\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s13040-022-00316-8\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-022-00316-8","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

已经进行了一些研究来对各种现实生活事件进行分类，但在医学领域的研究很少;特别是在统计技术下的乳房复发。据我们所知，在存在缺失数据的情况下，没有关于统计分类准确率和分类器对乳腺癌复发的判别能力的比较报道。因此，本文旨在通过使用不同模拟条件下由输入过程产生的多个数据集，比较二元分类器(逻辑回归、线性和二次判别分析)的性能来填补这一分析空白。我们的研究有助于了解分类器在分类二元结果变量时的准确性和判别能力如何受到输入的数值缺失数据的影响。我们在随机缺失(MAR)和完全随机缺失(MCAR)机制下模拟了缺失率分别为15%、30%、45%和60%的不完整数据集。均值归算、热甲板、k近邻、通过链式方程的多重归算、期望最大化和预测均值匹配被用于归算不完整数据集。对每个分类器在MAR和MCAR机制下的正确分类精度和受试者工作特征(ROC)曲线下的面积进行比较。在MCAR机制下，基于平均输入数据的线性判别分类器在45%的缺失数据下获得了最高的分类准确率(73.9%)。作为分类器，在MCAR机制下，基于预测均值匹配输入数据的逻辑回归在缺失30%时产生的ROC曲线下面积最大(0.6418)，而k近邻在缺失60%数据时最大(0.6428)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

摘要图片

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Classification of breast cancer recurrence based on imputed data: a simulation study.

Several studies have been conducted to classify various real life events but few are in medical fields; particularly about breast recurrence under statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers' discriminative ability on breast cancer recurrence in presence of imputed missing data. Therefore, this article aims to fill this analysis gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) using several datasets resulted from imputation process using various simulation conditions. Our study aids the knowledge about how classifiers' accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% of missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputations via chained equation, expected-maximisation, and predictive mean matching were used to impute incomplete datasets. For each classifier, correct classification accuracy and area under the Receiver Operating Characteristic (ROC) curves under MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) based on mean-imputed data at 45% of missing data under MCAR mechanism. As a classifier, the logistic regression based on predictive mean matching imputed-data yields the greatest areas under ROC curves (0.6418) at 30% missingness while k-nearest neighbour tops the value (0.6428) at 60% of missing data under MCAR mechanism.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.