Interpreting Social Media-Based Substance Use Prediction Models with Knowledge Distillation

Tao Ding, Fatema Hasan, W. Bickel, Shimei Pan
DOI: 10.1109/ICTAI.2018.00100 (https://doi.org/10.1109/ICTAI.2018.00100)
Published in: 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI)
Publication date: 2018-11-01
Citations: 15

Abstract

People now spend a significant amount of time on social media such as Twitter, Facebook, and Instagram. As a result, social media data capture rich behavioral evidence that can help us understand people's thoughts, behavior, and decision-making processes. Social media data, however, are mostly unstructured (e.g., text and images) and may involve a large number of raw features (e.g., millions of raw text and image features). Moreover, ground truth data about human behavior and decision making can be difficult to obtain at a large scale. As a result, most state-of-the-art social media-based human behavior models employ sophisticated unsupervised feature learning to leverage a large amount of unlabeled data. Unfortunately, these advanced models often rely on latent features that are hard to explain. Because understanding the knowledge captured in these models is important for behavior scientists, public health providers, and policymakers, this research employs a knowledge distillation framework to build machine learning models that offer not only state-of-the-art predictive performance but also interpretable results. We evaluate the effectiveness of the proposed framework in explaining Substance Use Disorder (SUD) prediction models. Our best models achieved 87% ROC AUC for predicting tobacco use, 84% for alcohol use, and 93% for drug use, which is comparable to existing state-of-the-art SUD prediction models. Since these models are also interpretable (e.g., a logistic regression model and a gradient boosting tree model), we combine their results to gain insight into the relationship between a user's social media behavior (e.g., social media likes and word usage) and substance use.
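
The distillation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the teacher function, the two toy features, and every name below are hypothetical stand-ins, chosen only to show how an interpretable student (here a logistic regression) might be fit to the soft probabilities of an opaque teacher rather than to hard labels.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def teacher_predict(x):
    """Hypothetical stand-in for an opaque teacher: returns P(label = 1 | x).

    A real teacher would be a trained model over latent features.
    """
    return sigmoid(2.0 * x[0] - 1.5 * x[1] + 0.5 * math.tanh(x[0] * x[1]))


def distill_logistic(samples, soft_targets, epochs=300, lr=0.5):
    """Fit a logistic regression student to the teacher's soft probabilities
    by full-batch gradient descent on cross-entropy against those targets."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, t in zip(samples, soft_targets):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - t  # gradient of the cross-entropy w.r.t. the logit
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def student_predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


random.seed(0)
# Toy interpretable features (assumed), e.g. standardized usage of two word categories.
samples = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
soft_targets = [teacher_predict(x) for x in samples]
weights, bias = distill_logistic(samples, soft_targets)

# Fidelity: share of samples where student and teacher make the same hard decision.
agree = sum(
    (student_predict(weights, bias, x) >= 0.5) == (t >= 0.5)
    for x, t in zip(samples, soft_targets)
)
fidelity = agree / len(samples)
print("student weights:", weights, "bias:", bias, "fidelity:", fidelity)
```

In this sketch the student's coefficients can be read directly as the direction and relative strength of each feature's association with the teacher's prediction, which is the kind of insight an interpretable student is meant to provide.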