X. Kong, Lin Zhang, Quanda Zhang, Jiun Choong, Sicong Ma, X. Qin, Z. Qi, Ran Cheng, Yi Fang, Z. Ge, Yu Jiang, Jing Wang
Journal: Biomaterials eJournal, volume 24 1. Published: 2020-07-03. DOI: 10.2139/ssrn.3642585 (https://doi.org/10.2139/ssrn.3642585)
Construction of an Assistant Forecast System for Breast Cancer Oncotype Dx Recurrence Risk by Machine Learning
Background:
TAILORx data confirm that using the 21-gene expression assay Oncotype DX (ODX; Genomic Health, Redwood City, CA) to assess the risk of recurrence in early-stage breast cancer can spare women unnecessary chemotherapy. However, its high up-front cost (list price, $4,175) could discourage its use. In addition, for technical reasons the test cannot be widely used in developing countries, especially in relatively poor areas.
Methods:
Using the Surveillance, Epidemiology, and End Results (SEER) database, we first used logistic regression models to identify significant variables that might be associated with breast cancer patients’ ODX recurrence scores (RS) and risk levels. We then applied a series of machine learning (ML) techniques, including random forest (RF), gradient boosting decision tree (GBDT), and XGBoost, to develop an assistant forecast system that predicts the ODX recurrence risk level [low-to-intermediate risk (RS = 2 to 25) or high risk (RS = 26 to 100)] from an individual’s sociodemographic and clinicopathological information. The system was then validated on a held-out validation set obtained by a training-test split of the original data set.
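The risk grouping and training-test split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the names `risk_label` and `train_test_split`, the 30% test fraction, the toy records, and the choice to treat every score below 26 as low-to-intermediate risk are all hypothetical. The abstract does not report the split ratio, and fitting the RF, GBDT, or XGBoost models is omitted here.

```python
import random

# From the abstract: high risk is RS 26 to 100; everything below is
# treated here as low-to-intermediate risk (an assumption for scores
# below 2, which the abstract does not mention).
HIGH_RISK_MIN_RS = 26


def risk_label(rs):
    """Return 1 for high risk (RS >= 26), 0 for low-to-intermediate risk."""
    if not 0 <= rs <= 100:
        raise ValueError(f"recurrence score out of range: {rs}")
    return 1 if rs >= HIGH_RISK_MIN_RS else 0


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (train, test).

    The 0.3 default is a placeholder; the abstract gives no ratio.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy records: (patient features, recurrence score).
records = [
    ({"age": 45, "grade": 2}, 12),
    ({"age": 61, "grade": 3}, 31),
    ({"age": 52, "grade": 1}, 8),
    ({"age": 38, "grade": 3}, 44),
    ({"age": 70, "grade": 2}, 25),
    ({"age": 49, "grade": 2}, 26),
    ({"age": 57, "grade": 1}, 17),
    ({"age": 66, "grade": 3}, 58),
    ({"age": 43, "grade": 2}, 21),
    ({"age": 59, "grade": 1}, 5),
]

# Turn each recurrence score into the binary target, then split.
labelled = [(features, risk_label(rs)) for features, rs in records]
train, test = train_test_split(labelled, test_fraction=0.3, seed=0)
print(len(train), "training records,", len(test), "test records")
```

Any classifier would then be fitted on `train` only and scored on `test`, so that the reported accuracy reflects records the model has not seen.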
Findings:
We identified 111,635 patients with breast cancer, of whom 86,617 (77.59%) were no older than 50 years. ODX RSs fell in the low-risk group for 23,514 patients (21.1%), the intermediate-risk group for 71,439 patients (64.0%), and the high-risk group for 16,682 patients (14.9%). In the multinomial ordinal logit regression, the variables significantly associated with the ODX recurrence score were age, sex, race, tumor primary site, histopathological grade, tumor size, pathology, PR status, and HER2 status (all P < 0.05). Given a breast cancer patient’s sociodemographic and clinicopathological information as input, the system automatically forecasts the patient’s ODX recurrence risk level together with an associated probability. In validation, the best overall accuracy was 87.02% (ordered logistic regression), the best specificity was 99.06% (ordered logistic regression), and the best sensitivity was 86.0% (RF).
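The three validation metrics reported above (accuracy, specificity, sensitivity) follow from the counts of a binary confusion matrix. The sketch below shows how they are computed. It assumes that high risk is the positive class, which the abstract does not state explicitly, and the function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels.

    Label 1 is treated as the positive class (assumed here: high risk).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of truly high-risk patients flagged as high risk.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Share of truly low-to-intermediate patients predicted as such.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical toy labels: 3 high-risk and 7 low-to-intermediate patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because the classes are imbalanced (14.9% of patients are high risk), a model can reach high accuracy and specificity while still missing many high-risk patients, which is one reason to report sensitivity alongside the other two figures.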
Interpretation:
Our assistant forecast system, based on sociodemographic and clinicopathological data, gives clinicians an alternative tool for estimating a breast cancer patient’s ODX recurrence risk level, which could help inform adjuvant treatment decisions. The tool merits further retrospective validation in clinical practice and application in real clinical scenarios.