Parallelization of Data Science Tasks, an Experimental Overview

Oscar Castro, P. Bruneau, Jean-Sébastien Sottet, Dario Torregrossa
Published in: Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition
Publication date: 2022-11-17
DOI: 10.1145/3581807.3581878 (https://doi.org/10.1145/3581807.3581878)
Citations: 1

Abstract

The practice of data science and machine learning often involves training many kinds of models, either to infer a target variable or to extract structured knowledge from data. Training procedures generally require lengthy and intensive computations, so a natural step for data scientists is to accelerate them, typically through parallelization across multiple CPU cores and GPU devices. In this paper, we focus on Python libraries commonly used by machine learning practitioners, and propose a case-based experimental approach to survey mainstream tools for software acceleration. For each use case, we highlight and quantify the optimizations that lead from the baseline implementation to the optimized version. Finally, we draw up a taxonomy of the tools and techniques involved in our experiments and identify common pitfalls, with a view to providing actionable guidelines to data scientists and developers of code optimization tools.
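To make the baseline-versus-optimized comparison concrete, here is a minimal sketch of how a speedup from CPU parallelization might be measured in Python. It is not taken from the paper: the toy workload (a sum of squares) and every function name below are hypothetical, chosen only for illustration, and only the standard library is assumed.

```python
import time
from concurrent.futures import Executor, ProcessPoolExecutor


def sum_of_squares(bounds):
    """CPU-bound toy task (hypothetical): sum i*i for i in [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def make_chunks(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) pairs (n >= 1)."""
    step = -(-n // parts)  # ceiling division
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


def run_baseline(chunks):
    """Baseline: process every chunk sequentially in the current process."""
    return sum(sum_of_squares(c) for c in chunks)


def run_parallel(chunks, executor: Executor):
    """Optimized version: distribute the chunks over the executor's workers."""
    return sum(executor.map(sum_of_squares, chunks))


def timed(fn, *args):
    """Return (result, elapsed wall-clock seconds) for fn(*args)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    chunks = make_chunks(2_000_000, parts=8)
    base, t_base = timed(run_baseline, chunks)
    with ProcessPoolExecutor() as pool:  # one worker process per CPU by default
        par, t_par = timed(run_parallel, chunks, pool)
    assert base == par  # check correctness before comparing timings
    print(f"baseline {t_base:.3f}s, parallel {t_par:.3f}s, "
          f"speedup x{t_base / t_par:.2f}")
```

Worker processes are used rather than threads because, in standard CPython, the global interpreter lock keeps pure-Python CPU-bound code from running on several cores at once within one process. The measured speedup depends on the machine and on process start-up and data-transfer overhead, so on small inputs the parallel version may well be slower than the baseline.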