Data Science Through the Looking Glass

Fotis Psallidas, Yiwen Zhu, Bojan Karlas, Jordan Henkel, Matteo Interlandi, Subru Krishnan, Brian Kroth, Venkatesh Emani, Wentao Wu, Ce Zhang, Markus Weimer, A. Floratou, C. Curino, Konstantinos Karanasos
{"title":"Data Science Through the Looking Glass","authors":"Fotis Psallidas, Yiwen Zhu, Bojan Karlas, Jordan Henkel, Matteo Interlandi, Subru Krishnan, Brian Kroth, Venkatesh Emani, Wentao Wu, Ce Zhang, Markus Weimer, A. Floratou, C. Curino, Konstantinos Karanasos","doi":"10.1145/3552490.3552496","DOIUrl":null,"url":null,"abstract":"The recent success of machine learning (ML) has led to an explosive growth of systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date, focusing on questions that can advance our understanding of the field and determine investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GITHUB and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, finegrained analysis of libraries and pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret and draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) what technologies should practitioners rely on.","PeriodicalId":346332,"journal":{"name":"ACM SIGMOD Record","volume":"63 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM SIGMOD Record","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3552490.3552496","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

The recent success of machine learning (ML) has led to an explosive growth of systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date, focusing on questions that can advance our understanding of the field and determine investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GITHUB and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, finegrained analysis of libraries and pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret and draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) what technologies should practitioners rely on.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
透视镜子中的数据科学
最近机器学习(ML)的成功导致了系统和应用程序的爆炸式增长,这些系统和应用程序由不断增长的系统构建者和数据科学(DS)从业者组成。然而,这种快速变化的全景对于系统构建者和实践者来说都是具有挑战性的。在本文中,我们开始通过广角镜头捕捉这一全景,对迄今为止最大规模的DS项目进行分析,重点关注可以促进我们对该领域的理解并决定投资的问题。具体来说,我们下载并分析了(a) GITHUB上公开提供的800多万台笔记本电脑和(b)微软开发的200多万台企业机器学习管道。我们的分析包括粗粒度的统计特征,对库和管道的细粒度分析,以及跨数据集和时间的比较研究。我们报告了大量的测量结果,以供读者解释并得出可操作的结论(a)系统构建者应该关注哪些方面以更好地为从业者服务,以及(b)从业者应该依赖哪些技术。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Technical Perspective: Efficient and Reusable Lazy Sampling Unicorn: A Unified Multi-Tasking Matching Model Learning to Restructure Tables Automatically DBSP: Incremental Computation on Streams and Its Applications to Databases Efficient and Reusable Lazy Sampling
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1