Interpreting Social Media-Based Substance Use Prediction Models with Knowledge Distillation
Tao Ding, Fatema Hasan, W. Bickel, Shimei Pan
2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), 2018-11-01
DOI: 10.1109/ICTAI.2018.00100
People now spend a significant amount of time on social media platforms such as Twitter, Facebook, and Instagram. As a result, social media data capture rich behavioral evidence that can help us understand users' thoughts, behavior, and decision-making processes. Social media data, however, are mostly unstructured (e.g., text and images) and may involve a large number of raw features (e.g., millions of raw text and image features). Moreover, ground-truth data about human behavior and decision making can be difficult to obtain at scale. Consequently, most state-of-the-art social media-based human behavior models employ sophisticated unsupervised feature learning to leverage large amounts of unlabeled data. Unfortunately, these advanced models often rely on latent features that are hard to explain. Because understanding the knowledge captured in these models is important for behavioral scientists, public health providers, and policymakers, this research employs a knowledge distillation framework to build machine learning models that offer not only state-of-the-art predictive performance but also interpretable results. We evaluate the effectiveness of the proposed framework in explaining Substance Use Disorder (SUD) prediction models. Our best models achieved a ROC AUC of 87% for predicting tobacco use, 84% for alcohol use, and 93% for drug use, which is comparable to existing state-of-the-art SUD prediction models. Since these models are also interpretable (e.g., a logistic regression model and a gradient boosting tree model), we combine their results to gain insight into the relationship between a user's social media behavior (e.g., social media likes and word usage) and substance use.
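The general idea of distilling an opaque model into an interpretable one can be illustrated with a minimal sketch. This is not the authors' implementation: the teacher below is a hypothetical stand-in function rather than a model trained on latent social media features, the three input features are synthetic, and the student is a plain logistic regression fitted by gradient descent to the teacher's soft probabilities. All names (`teacher_predict`, `distill_logistic`, `fidelity_gap`) are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def teacher_predict(x):
    """Hypothetical opaque teacher: returns a soft probability for one user.

    It includes an interaction term that a linear student cannot
    represent exactly, and it ignores the third feature entirely.
    """
    return sigmoid(2.0 * x[0] - 1.5 * x[1] + 0.5 * x[0] * x[1])


def distill_logistic(X, soft_targets, epochs=1000, lr=0.5):
    """Fit a logistic regression student to the teacher's soft labels.

    Minimizes cross-entropy between student and teacher probabilities
    with full-batch gradient descent.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, soft_targets):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - t  # gradient of soft cross-entropy w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


random.seed(0)
# Synthetic, illustrative features (e.g., scaled counts of likes or words).
X = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
soft = [teacher_predict(x) for x in X]

w, b = distill_logistic(X, soft)

# Fidelity: mean absolute gap between student and teacher probabilities.
student = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in X]
fidelity_gap = sum(abs(s - t) for s, t in zip(student, soft)) / len(X)

print("student weights:", [round(wi, 2) for wi in w])
print("mean |student - teacher|:", round(fidelity_gap, 3))
```

The student's weights can then be read directly: the sign and magnitude of each weight indicate how the corresponding feature moves the predicted probability, which is the kind of inspection an interpretable student makes possible. The fidelity gap indicates how closely the student tracks the teacher on this synthetic data.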