R. Ramos-Pollán, F. González, Juan C. Caicedo, Angel Cruz-Roa, Jorge E. Camargo, Jorge A. Vanegas, Santiago A. Pérez-Rubiano, J. Bermeo, Juan Sebastian Otálora Montenegro, Paola K. Rozo, John Arevalo
BIGS: A framework for large-scale image processing and analysis over distributed and heterogeneous computing resources
2012 IEEE 8th International Conference on E-Science, 2012-10-01
DOI: 10.1109/eScience.2012.6404424 (https://doi.org/10.1109/eScience.2012.6404424)
Citations: 12
Abstract
This paper presents BIGS, the Big Image Data Analysis Toolkit, a software framework for large-scale image processing and analysis over heterogeneous computing resources, such as those available in clouds, grids, and computer clusters, or scattered computers (desktops, labs) used opportunistically. Through BIGS, eScience for image processing and analysis is conceived to exploit coarse-grained parallelism based on data partitioning and parameter sweeps, which avoids the need for inter-process communication and therefore enables loosely coupled computing nodes (BIGS workers). It adopts an uncommitted resource allocation model in which (1) experimenters define their image processing pipelines in a simple configuration file, (2) a schedule of jobs is generated, and (3) workers, as they become available, take over pending jobs whose dependencies on other jobs are fulfilled. BIGS workers act autonomously, querying the job schedule to determine which job to take over. This removes the need for a central scheduling node and requires only that all workers have access to a shared information source. Furthermore, BIGS workers are encapsulated within different technologies to enable their agile deployment over the available computing resources. Currently they can be launched on Amazon's cloud resources through the EC2 service, from any desktop computer through Java Web Start, and through regular scripting or SSH commands. This suits different kinds of research environments, whether they access dedicated computing clusters or clouds with committed computing capacity or use opportunistic computing resources that are seldom accessible or cannot be provisioned in advance. We also adopt a NoSQL storage model to ensure the scalability of the shared information sources required by all workers; BIGS includes support for HBase and Amazon's DynamoDB service. Overall, BIGS enables researchers to run large-scale image processing pipelines in an easy, affordable, and unplanned manner, taking over computing resources as they become available at run time. The paper demonstrates this by using BIGS in different experimental setups, both in the Amazon cloud and opportunistically, showing its configurability, adaptability, and scalability.
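
The allocation model described in the abstract (a pipeline definition expanded into a job schedule, then autonomous workers pulling jobs whose dependencies are fulfilled) can be illustrated with a minimal Python sketch. This is not the BIGS API: the function names, the stage and parameter names, and the in-memory dictionary standing in for the shared NoSQL store are all hypothetical, and the dependency rule (each job waits for every job of the previous stage) is a simplifying assumption.

```python
from itertools import product

PENDING, RUNNING, DONE = "pending", "running", "done"


def build_schedule(stages, n_partitions):
    """Expand stages into jobs: one per (parameter combination, data partition).

    Assumption for this sketch: every job of a stage depends on all jobs
    of the preceding stage.
    """
    schedule = {}
    previous = []
    for stage in stages:
        names = sorted(stage["params"])
        current = []
        # Parameter sweep: Cartesian product of all parameter values.
        for values in product(*(stage["params"][n] for n in names)):
            # Data partitioning: one job per partition of the dataset.
            for part in range(n_partitions):
                job_id = len(schedule)
                schedule[job_id] = {
                    "stage": stage["name"],
                    "params": dict(zip(names, values)),
                    "partition": part,
                    "depends_on": list(previous),
                    "status": PENDING,
                    "owner": None,
                }
                current.append(job_id)
        previous = current
    return schedule


def claim_next_job(schedule, worker_id):
    """Worker-side query: take the first pending job whose dependencies are done."""
    for job_id, job in schedule.items():
        ready = all(schedule[d]["status"] == DONE for d in job["depends_on"])
        if job["status"] == PENDING and ready:
            # With a real shared store this update would have to be an
            # atomic conditional write so two workers cannot claim one job.
            job["status"] = RUNNING
            job["owner"] = worker_id
            return job_id
    return None


def run_workers(schedule, worker_ids):
    """Simulate workers taking turns until no job can be claimed."""
    log = []
    progressed = True
    while progressed:
        progressed = False
        for worker_id in worker_ids:
            job_id = claim_next_job(schedule, worker_id)
            if job_id is None:
                continue
            # Placeholder for the actual image processing work.
            schedule[job_id]["status"] = DONE
            log.append((worker_id, job_id))
            progressed = True
    return log


# Hypothetical two-stage pipeline with a small parameter sweep per stage.
stages = [
    {"name": "extract_features", "params": {"patch_size": [16, 32]}},
    {"name": "train_classifier", "params": {"C": [0.1, 1.0, 10.0]}},
]
schedule = build_schedule(stages, n_partitions=2)
log = run_workers(schedule, ["w1", "w2", "w3"])
```

No function in the sketch assigns jobs to workers: each worker decides for itself by reading the shared schedule, which is the property that lets workers join or leave at run time without a central scheduling node.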