dislib: Large Scale High Performance Machine Learning in Python

J. Á. Cid-Fuentes, S. Solà, Pol Álvarez, A. Castro-Ginard, Rosa M. Badia
Published in: 2019 15th International Conference on eScience (eScience)
Publication date: 2019-09-01
DOI: 10.1109/eScience.2019.00018 (https://doi.org/10.1109/eScience.2019.00018)
Citations: 17

Abstract

In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. It can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard for non-experts to use, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library built on top of the PyCOMPSs programming model that addresses the issues of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster than other popular distributed machine learning libraries, such as MLlib, and can process data sets up to 16 times larger. In addition, we show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.
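To put the reduction from 18 hours to 17 minutes in perspective, the short calculation below converts both run times to minutes and computes the implied speedup. The variable names are illustrative only, and the inputs are the rounded figures quoted in the abstract, so the resulting ratio is approximate.

```python
# Run times as quoted in the abstract (rounded figures).
baseline_minutes = 18 * 60   # 18 hours expressed in minutes
dislib_minutes = 17          # 17 minutes

# Implied speedup factor of the dislib run over the baseline.
speedup = baseline_minutes / dislib_minutes
print(f"approximate speedup: {speedup:.1f}x")
```

Under these figures the application runs roughly 60 times faster, a far larger gain than the 9 times quoted for the comparison with other distributed libraries; the abstract reports the two results separately and does not say what accounts for the difference.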