A Cloud System for Machine Learning Exploiting a Parallel Array DBMS
Yiqun Zhang, C. Ordonez, S. Johnsson
2017 28th International Workshop on Database and Expert Systems Applications (DEXA), published 2017-08-01
DOI: 10.1109/DEXA.2017.21 (https://doi.org/10.1109/DEXA.2017.21)
Citations: 3
Abstract
Computing machine learning models in the cloud remains a central problem in big data analytics. In this work, we introduce a cloud analytic system exploiting a parallel array DBMS based on a classical shared-nothing architecture. Our approach combines in-DBMS data summarization with mathematical processing in an external program. We study how to summarize a data set in parallel assuming a large number of processing nodes and how to further accelerate it with GPUs. In contrast to most big data analytic systems, we do not use Java, HDFS, MapReduce or Spark: our system is programmed in C++ and C on top of a traditional Unix file system. In our system, models are efficiently computed using a suite of innovative parallel matrix operators, which compute comprehensive statistical summaries of a large input data set (matrix) in one pass, leaving the remaining mathematically complex computations, with matrices that fit in RAM, to R. In order to be competitive with the Hadoop ecosystem (i.e., HDFS and Spark RDDs), we also introduce a parallel load operator for large matrices and an automated, yet flexible, cluster configuration in the cloud. Experiments compare our system with Spark, showing orders of magnitude time improvement. A GPU with many cores widens the gap further. In summary, our system is a competitive solution.
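The one-pass, partitioned summarization described in the abstract can be illustrated with a small sketch. The abstract does not say which summaries the operators compute, so this example assumes a common choice: the row count n, the linear sum L, and the quadratic sum Q of each partition. The function names (`summarize`, `merge`, `mean_and_covariance`) are hypothetical, the sequential loop over partitions stands in for parallel nodes, and Python is used only for illustration (the system itself is written in C++ and C, with R for the final step).

```python
from typing import List, Sequence, Tuple

# (n, L, Q): row count, per-column sum, and sum of outer products x * x^T.
Summary = Tuple[int, List[float], List[List[float]]]


def summarize(rows: Sequence[Sequence[float]], d: int) -> Summary:
    """Scan one partition exactly once and accumulate n, L and Q."""
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for i in range(d):
            L[i] += x[i]
            for j in range(d):
                Q[i][j] += x[i] * x[j]
    return n, L, Q


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partial summaries; plain addition, so order does not matter."""
    n = a[0] + b[0]
    L = [u + v for u, v in zip(a[1], b[1])]
    Q = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, L, Q


def mean_and_covariance(s: Summary):
    """Derive a small d x d model from the summary alone (population covariance)."""
    n, L, Q = s
    d = len(L)
    mean = [v / n for v in L]
    cov = [[Q[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov


# Two partitions stand in for two processing nodes (illustrative data only).
partitions = [
    [[1.0, 2.0], [2.0, 4.0]],
    [[3.0, 6.0], [4.0, 8.0]],
]

total = summarize([], 2)  # empty summary as the starting point
for part in partitions:
    total = merge(total, summarize(part, 2))

mean, cov = mean_and_covariance(total)
```

Because the merged summary has size d x d regardless of the number of rows, the final step works on matrices that fit in RAM, which is the division of labor the abstract describes between the DBMS and the external program.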