LAMP: data provenance for graph based machine learning algorithms through derivative computation

Shiqing Ma, Yousra Aafer, Zhaogui Xu, Wen-Chuan Lee, Juan Zhai, Yingqi Liu, X. Zhang
{"title":"LAMP: data provenance for graph based machine learning algorithms through derivative computation","authors":"Shiqing Ma, Yousra Aafer, Zhaogui Xu, Wen-Chuan Lee, Juan Zhai, Yingqi Liu, X. Zhang","doi":"10.1145/3106237.3106291","DOIUrl":null,"url":null,"abstract":"Data provenance tracking determines the set of inputs related to a given output. It enables quality control and problem diagnosis in data engineering. Most existing techniques work by tracking program dependencies. They cannot quantitatively assess the importance of related inputs, which is critical to machine learning algorithms, in which an output tends to depend on a huge set of inputs while only some of them are of importance. In this paper, we propose LAMP, a provenance computation system for machine learning algorithms. Inspired by automatic differentiation (AD), LAMP quantifies the importance of an input for an output by computing the partial derivative. LAMP separates the original data processing and the more expensive derivative computation to different processes to achieve cost-effectiveness. In addition, it allows quantifying importance for inputs related to discrete behavior, such as control flow selection. The evaluation on a set of real world programs and data sets illustrates that LAMP produces more precise and succinct provenance than program dependence based techniques, with much less overhead. Our case studies demonstrate the potential of LAMP in problem diagnosis in data engineering.","PeriodicalId":313494,"journal":{"name":"Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3106237.3106291","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

Data provenance tracking determines the set of inputs related to a given output. It enables quality control and problem diagnosis in data engineering. Most existing techniques work by tracking program dependencies. They cannot quantitatively assess the importance of related inputs, which is critical to machine learning algorithms, in which an output tends to depend on a huge set of inputs while only some of them are of importance. In this paper, we propose LAMP, a provenance computation system for machine learning algorithms. Inspired by automatic differentiation (AD), LAMP quantifies the importance of an input for an output by computing the partial derivative. LAMP separates the original data processing and the more expensive derivative computation to different processes to achieve cost-effectiveness. In addition, it allows quantifying importance for inputs related to discrete behavior, such as control flow selection. The evaluation on a set of real world programs and data sets illustrates that LAMP produces more precise and succinct provenance than program dependence based techniques, with much less overhead. Our case studies demonstrate the potential of LAMP in problem diagnosis in data engineering.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
LAMP:通过导数计算的基于图的机器学习算法的数据来源
数据来源跟踪确定与给定输出相关的一组输入。它使数据工程中的质量控制和问题诊断成为可能。大多数现有技术都是通过跟踪程序依赖关系来工作的。它们不能定量地评估相关输入的重要性,这对机器学习算法至关重要,因为机器学习算法的输出往往依赖于大量的输入,而其中只有一部分是重要的。在本文中,我们提出了LAMP,一个用于机器学习算法的来源计算系统。受自动微分(AD)的启发,LAMP通过计算偏导数来量化输入对输出的重要性。LAMP将原始数据处理和较昂贵的导数计算分离到不同的过程中,以达到成本效益。此外,它允许量化与离散行为相关的输入的重要性,例如控制流选择。对一组真实世界的程序和数据集的评估表明,LAMP比基于程序依赖的技术产生更精确和简洁的来源,开销更小。我们的案例研究证明了LAMP在数据工程问题诊断中的潜力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Serverless computing: economic and architectural impact The rising tide lifts all boats: the advancement of science in cyber security (invited talk) User- and analysis-driven context aware software development in mobile computing Continuous variable-specific resolutions of feature interactions Attributed variability models: outside the comfort zone
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1