Feature Importance and Predictive Modeling for Multi-source Healthcare Data with Missing Values
Karthik Srinivasan, Faiz Currim, S. Ram, Casey Lindberg, Esther Sternberg, Perry Skeath, B. Najafi, J. Razjouyan, Hyo-Ki Lee, Colin Foe-Parker, Nicole Goebel, Reuben Herzl, M. Mehl, Brian Gilligan, J. Heerwagen, Kevin Kampschroer, Kelli Canada
Proceedings of the 6th International Conference on Digital Health Conference, 2016-04-11
DOI: 10.1145/2896338.2896347 (https://doi.org/10.1145/2896338.2896347)
Citation count: 7
Abstract
With the rapid development of sensor technologies and the Internet of Things, research in connected health is growing in importance and complexity, with wide-reaching impacts on public health. As data sources such as mobile (wearable) sensors get cheaper, smaller, and smarter, important research questions can be answered by combining information from multiple data sources. However, integrating multiple heterogeneous data streams often results in a dataset with several empty cells or missing values. The challenge is to use such sparsely populated integrated datasets without compromising model performance. Naïve approaches to dataset modification, such as discarding observations or ad hoc replacement of missing values, often lead to misleading results. In this paper, we discuss and evaluate current best practices for modeling such data with missing values and then propose an ensemble-learning-based sparse-data modeling framework. We develop a predictive model using this framework and compare it with existing models using a study in a healthcare setting. Instead of generating a single score for variable/feature importance, our framework enables the user to understand the importance of a variable based on the existing data values and their localized impact on the outcome.
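The missing-value problem described in the abstract can be illustrated with a minimal sketch. It is not the paper's framework: it only shows how combining sources that cover different participants leaves empty cells, and what the two naïve approaches named above (discarding observations, ad hoc replacement) do to the result. The source names (`heart_rate`, `noise_db`, `steps`), the participant IDs, and all values are assumptions made up for illustration.

```python
from statistics import mean

# Hypothetical readings keyed by participant ID; names and values are illustrative only.
heart_rate = {"p1": 72.0, "p2": 80.0, "p4": 65.0}
noise_db = {"p1": 55.0, "p3": 60.0, "p4": 48.0}
steps = {"p2": 4000.0, "p3": 5200.0, "p4": 6100.0}


def outer_join(sources):
    """Combine per-source dicts into one row per key; absent readings become None."""
    keys = sorted(set().union(*(s.keys() for s in sources.values())))
    return {k: {name: src.get(k) for name, src in sources.items()} for k in keys}


def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing value."""
    return {k: r for k, r in rows.items() if all(v is not None for v in r.values())}


def mean_impute(rows):
    """Ad hoc replacement: fill each missing cell with its column's observed mean."""
    columns = next(iter(rows.values())).keys()
    col_means = {
        c: mean(r[c] for r in rows.values() if r[c] is not None) for c in columns
    }
    return {
        k: {c: (v if v is not None else col_means[c]) for c, v in r.items()}
        for k, r in rows.items()
    }


merged = outer_join({"heart_rate": heart_rate, "noise_db": noise_db, "steps": steps})
missing = sum(v is None for r in merged.values() for v in r.values())
complete = drop_incomplete(merged)
imputed = mean_impute(merged)

print(f"{len(merged)} rows, {missing} missing cells")
print(f"rows left after listwise deletion: {len(complete)}")
print(f"imputed p3 heart rate: {imputed['p3']['heart_rate']:.1f}")
```

In this toy data, listwise deletion keeps only the one participant covered by every source, while mean imputation keeps every row but assigns the same filled-in value regardless of that participant's other readings. Both effects are the kind of distortion the abstract warns about.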