BigDataDIRAC:部署分布式大数据应用

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing Pub Date : 2015-05-04 DOI:10.1109/CCGrid.2015.109

Víctor Fernández, V. Muñoz, T. F. Pena

{"title":"BigDataDIRAC:部署分布式大数据应用","authors":"Víctor Fernández, V. Muñoz, T. F. Pena","doi":"10.1109/CCGrid.2015.109","DOIUrl":null,"url":null,"abstract":"The Distributed Infrastructure with Remote Agent Control (DIRAC) software framework allows a user community to manage computing activities in different grid and cloud environments. Many communities from several fields (LHCb, Belle II, Creatis, DIRAC4EGI multiple community portal, etc.) use DIRAC to run jobs in distributed environments. Google created the MapReduce programming model offering an efficient way of performing distributed computation over large data sets. Several enterprises are providing Hadoop cloud based resources to their users, and are trying to simplify the usage of Hadoop in the cloud. Based in these two robust technologies, we have created BigDataDIRAC, a solution which gives users the opportunity to access multiple Big Data resources scattered in different geographical areas, such as access to grid resources. This approach opens the possibility of offering not only grid and cloud to the users, but also Big Data resources from the same DIRAC environment. Proof of concept is shown using three computing centers in two countries, and with four Hadoop clusters. Our results demonstrate the ability of BigDataDIRAC to manage jobs driven by dataset location stored in the Hadoop File System (HDFS) of the Hadoop distributed clusters. DIRAC is used to monitor the execution, collect the necessary statistical data, and upload the results from the remote HDFS to the SandBox Storage machine. The tests produced the equivalent of 5 days continuous processing.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"26 1","pages":"1177-1180"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"BigDataDIRAC: Deploying Distributed Big Data Applications\",\"authors\":\"Víctor Fernández, V. Muñoz, T. F. Pena\",\"doi\":\"10.1109/CCGrid.2015.109\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Distributed Infrastructure with Remote Agent Control (DIRAC) software framework allows a user community to manage computing activities in different grid and cloud environments. Many communities from several fields (LHCb, Belle II, Creatis, DIRAC4EGI multiple community portal, etc.) use DIRAC to run jobs in distributed environments. Google created the MapReduce programming model offering an efficient way of performing distributed computation over large data sets. Several enterprises are providing Hadoop cloud based resources to their users, and are trying to simplify the usage of Hadoop in the cloud. Based in these two robust technologies, we have created BigDataDIRAC, a solution which gives users the opportunity to access multiple Big Data resources scattered in different geographical areas, such as access to grid resources. This approach opens the possibility of offering not only grid and cloud to the users, but also Big Data resources from the same DIRAC environment. Proof of concept is shown using three computing centers in two countries, and with four Hadoop clusters. Our results demonstrate the ability of BigDataDIRAC to manage jobs driven by dataset location stored in the Hadoop File System (HDFS) of the Hadoop distributed clusters. DIRAC is used to monitor the execution, collect the necessary statistical data, and upload the results from the remote HDFS to the SandBox Storage machine. The tests produced the equivalent of 5 days continuous processing.\",\"PeriodicalId\":6664,\"journal\":{\"name\":\"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing\",\"volume\":\"26 1\",\"pages\":\"1177-1180\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-05-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGrid.2015.109\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid.2015.109","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

具有远程代理控制的分布式基础设施(DIRAC)软件框架允许用户社区管理不同网格和云环境中的计算活动。来自多个领域的许多社区(LHCb、Belle II、Creatis、DIRAC4EGI多社区门户等)使用DIRAC在分布式环境中运行作业。Google创建了MapReduce编程模型，提供了在大型数据集上执行分布式计算的有效方法。一些企业正在为他们的用户提供基于Hadoop云的资源，并试图简化Hadoop在云中的使用。基于这两项强大的技术，我们创建了BigDataDIRAC，该解决方案使用户有机会访问分散在不同地理区域的多个大数据资源，例如访问网格资源。这种方法不仅可以为用户提供网格和云，还可以从相同的DIRAC环境中提供大数据资源。概念验证使用了两个国家的三个计算中心和四个Hadoop集群。我们的结果证明了BigDataDIRAC管理由存储在Hadoop分布式集群的Hadoop文件系统(HDFS)中的数据集位置驱动的作业的能力。DIRAC用于监控执行，收集必要的统计数据，并将结果从远程HDFS上传到SandBox Storage机器。测试产生了相当于5天连续处理的结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

BigDataDIRAC: Deploying Distributed Big Data Applications

The Distributed Infrastructure with Remote Agent Control (DIRAC) software framework allows a user community to manage computing activities in different grid and cloud environments. Many communities from several fields (LHCb, Belle II, Creatis, DIRAC4EGI multiple community portal, etc.) use DIRAC to run jobs in distributed environments. Google created the MapReduce programming model offering an efficient way of performing distributed computation over large data sets. Several enterprises are providing Hadoop cloud based resources to their users, and are trying to simplify the usage of Hadoop in the cloud. Based in these two robust technologies, we have created BigDataDIRAC, a solution which gives users the opportunity to access multiple Big Data resources scattered in different geographical areas, such as access to grid resources. This approach opens the possibility of offering not only grid and cloud to the users, but also Big Data resources from the same DIRAC environment. Proof of concept is shown using three computing centers in two countries, and with four Hadoop clusters. Our results demonstrate the ability of BigDataDIRAC to manage jobs driven by dataset location stored in the Hadoop File System (HDFS) of the Hadoop distributed clusters. DIRAC is used to monitor the execution, collect the necessary statistical data, and upload the results from the remote HDFS to the SandBox Storage machine. The tests produced the equivalent of 5 days continuous processing.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

自引率

0.00%

发文量