A Case for Two-stage Inference with Knowledge Caching

Geonha Park, Changho Hwang, KyoungSoo Park
Venue: The 3rd International Workshop on Deep Learning for Mobile Systems and Applications - EMDL '19
Publication date: 2019-06-13
DOI: 10.1145/3325413.3329789 (https://doi.org/10.1145/3325413.3329789)
Citation count: 2

Abstract

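Real-world intelligent services employing deep learning technology typically take a two-tier system architecture: a dumb front-end device and smart back-end cloud servers. The front-end device simply forwards a human query, while the back-end servers run a complex deep model to resolve the query and respond to the front-end device. While simple and effective, this architecture not only increases the load on the servers but also risks harming user privacy. In this paper, we present knowledge caching, which exploits the front-end device as a smart cache of a generalized deep model. The cache locally resolves a subset of popular or privacy-sensitive queries and forwards the rest to the back-end cloud servers. We discuss the feasibility of knowledge caching as well as the technical challenges around deep model specialization and compression. We show our prototype two-stage inference system, which populates a front-end cache with 10 voice commands out of 35 commands. We demonstrate that our specialization and compression techniques reduce the cached model size by 17.4x from the original model, with a 1.8x improvement in inference accuracy.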
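The routing idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how the cache decides whether it can resolve a query, so the confidence threshold, the `TwoStageInference` class, and the toy models below are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# Hypothetical type: a model maps a query to (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class TwoStageInference:
    cache_model: Model             # small specialized model on the front-end device
    backend_model: Model           # stand-in for a request to the cloud servers
    cached_labels: FrozenSet[str]  # commands the cache is allowed to resolve locally
    threshold: float = 0.8         # assumed confidence cutoff, not taken from the paper

    def infer(self, query: str) -> Tuple[str, str]:
        """Return (label, stage), where stage is 'cache' or 'backend'."""
        # Stage 1: try to resolve the query on the device.
        label, confidence = self.cache_model(query)
        if label in self.cached_labels and confidence >= self.threshold:
            return label, "cache"
        # Stage 2: forward everything else to the general back-end model.
        label, _ = self.backend_model(query)
        return label, "backend"


def toy_cache(query: str) -> Tuple[str, float]:
    # Toy stand-in for a specialized, compressed model covering two commands.
    known = {"lights on": "on", "lights off": "off"}
    if query in known:
        return known[query], 0.95
    return "unknown", 0.10


def toy_backend(query: str) -> Tuple[str, float]:
    # Toy stand-in for the full model covering every command.
    table = {"lights on": "on", "lights off": "off", "play jazz": "play_music"}
    return table.get(query, "unrecognized"), 0.99


system = TwoStageInference(toy_cache, toy_backend, frozenset({"on", "off"}))
print(system.infer("lights on"))
print(system.infer("play jazz"))
```

In this sketch, a query the cache recognizes with high confidence never leaves the device, which is where the reduced server load and the privacy benefit would come from; anything else falls through to the back end.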