dislib: Large Scale High Performance Machine Learning in Python

2019 15th International Conference on eScience (eScience) Pub Date : 2019-09-01 DOI:10.1109/eScience.2019.00018

J. Á. Cid-Fuentes, S. Solà, Pol Álvarez, A. Castro-Ginard, Rosa M. Badia


Citations: 17

Abstract

In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. It can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard for non-experts to use, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library built on top of the PyCOMPSs programming model that addresses the issues of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster, and can process data sets up to 16 times larger, than other popular distributed machine learning libraries, such as MLlib. We also show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.

