Digitization and search: A non-traditional use of HPC
Pub Date: 2012-10-08 | DOI: 10.1109/eScience.2012.6404445
Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry
Automated search of handwritten content is a subject of considerable interest and practical value, especially today given the public availability of large digitized document collections. We describe our efforts with the National Archives (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the proposed framework. Instead of trying to recognize the handwritten text, still a very difficult task, we use a content-based image retrieval technique known as Word Spotting. Under this paradigm, the system is queried with images of handwritten text instead of ASCII text, and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed to pre-process the data in order to make this search capability available for an archive. We discuss the required preprocessing steps and the open-source framework we developed, focusing specifically on the HPC considerations that are relevant when preparing to provide searchable access to sizeable collections such as the US Census. Having processed the state of North Carolina from the 1930 Census using 98,000 SUs, we estimate that processing the entire country for 1940 could require up to 2.5 million SUs. The proposed framework can serve as an alternative to costly manual transcription for a variety of digitized paper archives.
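The abstract does not give implementation details; as a rough illustration of how a word-spotting comparison of this kind can work, the sketch below ranks candidate word images against a query image using per-column profile features and dynamic time warping, a formulation common in the word-spotting literature but not necessarily the authors' exact pipeline. All function names are illustrative.

```python
import numpy as np

def profile_features(word_img):
    """Per-column features of a grayscale word image:
    ink density, upper profile, lower profile (all normalized)."""
    binary = word_img < 128                  # ink = dark pixels
    h, _ = binary.shape
    has_ink = binary.any(axis=0)
    density = binary.sum(axis=0) / h
    upper = np.where(has_ink, np.argmax(binary, axis=0) / h, 1.0)
    lower = np.where(has_ink,
                     (h - 1 - np.argmax(binary[::-1], axis=0)) / h, 0.0)
    return np.stack([density, upper, lower], axis=1)   # shape (width, 3)

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)                 # length-normalized

def rank_matches(query_img, candidate_imgs):
    """Indices of candidates, most similar to the query first."""
    q = profile_features(query_img)
    dists = [dtw_distance(q, profile_features(c)) for c in candidate_imgs]
    return np.argsort(dists)
```

Because every word image in the archive must be segmented and featurized before any query can run, this per-image cost multiplied across millions of census forms is what drives the SU figures quoted above.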
eResearch environment for remote instrumentation: VBL, RLI, VisLab1 & 2
Pub Date: 2012-10-08 | DOI: 10.1109/eScience.2012.6404465
C. Myers, Michael D'Silva
This talk demonstrates the current remote experimentation capabilities deployed at the Australian Synchrotron and La Trobe University, as well as the remote data transfer services deployed at those locations and at the Bragg Institute (ANSTO), a metadata extraction tool, MyTardis nodes, remote analysis and visualisation environments for medical imaging and IR spectroscopy, and the use of high-resolution multi-screen displays.
Partial replica selection for spatial datasets
Pub Date: 2012-10-08 | DOI: 10.1109/eScience.2012.6404473
Yun Tian, P. J. Rhodes
The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial Spatial Replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. Accessing only a subset of a spatial replica usually results in a large number of relatively small read requests made to the underlying storage device. For this reason, an accurate model of disk access is important when working with spatial subsets. We make two primary contributions in this paper. First, we describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Second, making a few simplifying assumptions, we propose a fast replica selection algorithm for partial spatial replicas. The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is, on average, at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
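The abstract does not spell out the algorithm; the sketch below illustrates the greedy idea it describes: repeatedly choosing the replica that contributes the most still-needed data at the highest estimated read rate, under a deliberately crude cost model. The Replica fields and the scoring function are assumptions for illustration, not the paper's disk-access model.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    node: str
    blocks: frozenset      # spatial blocks this partial replica holds
    read_rate: float       # estimated MB/s for this node's storage

def greedy_select(query_blocks, replicas):
    """Greedily pick (node, blocks) assignments covering the query.

    At each step, take the replica that delivers the largest
    benefit: still-uncovered blocks weighted by estimated read rate.
    """
    remaining = set(query_blocks)
    plan = []
    while remaining:
        best, best_score, best_hit = None, 0.0, None
        for r in replicas:
            hit = remaining & r.blocks
            if not hit:
                continue
            score = r.read_rate * len(hit)   # crude benefit proxy
            if score > best_score:
                best, best_score, best_hit = r, score, hit
        if best is None:                     # query not fully coverable
            break
        plan.append((best.node, sorted(best_hit)))
        remaining -= best_hit
    return plan, remaining                   # remaining == uncovered blocks
```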
A system for management of Computational Fluid Dynamics simulations for civil engineering
Pub Date: 2012-10-08 | DOI: 10.1109/eScience.2012.6404433
Peter Sempolinski, D. Thain, Daniel Wei, A. Kareem
We introduce a web-based system for the management of Computational Fluid Dynamics (CFD) simulations. The system provides a web-browser interface that gives users an intuitive, user-friendly means of dispatching and controlling long-running simulations. CFD challenges its users through the complexity of its underlying mathematics, the high computational demands of its simulations, and the complexity of the inputs to those simulations and related tasks. We designed the system to be as extensible as possible so that it is suitable for many different civil engineering applications. The front-end of the system is a web server, which provides the user interface. The back-end is responsible for starting and stopping jobs as requested. There are also numerous components specifically for facilitating CFD computation. We discuss our experience presenting this system to real users and the future ambitions for this project.
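As a sketch of the dispatch pattern described here, a web front-end that starts, polls, and stops long-running back-end jobs, the following minimal Flask service is illustrative only; Flask, the in-memory job table, and the run_case.sh solver script are assumptions, not the authors' implementation.

```python
import subprocess
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}   # job_id -> Popen handle

@app.route("/jobs", methods=["POST"])
def start_job():
    case_dir = request.json["case_dir"]          # simulation input directory
    job_id = str(uuid.uuid4())
    # Launch the solver asynchronously so the web request returns at once.
    jobs[job_id] = subprocess.Popen(["./run_case.sh", case_dir])
    return jsonify({"job_id": job_id}), 202

@app.route("/jobs/<job_id>", methods=["GET"])
def job_status(job_id):
    proc = jobs[job_id]
    state = "running" if proc.poll() is None else f"exited({proc.returncode})"
    return jsonify({"job_id": job_id, "state": state})

@app.route("/jobs/<job_id>", methods=["DELETE"])
def stop_job(job_id):
    jobs[job_id].terminate()                     # request job shutdown
    return "", 204
```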
Temporal representation for scientific data provenance
Pub Date: 2012-10-01 | DOI: 10.1109/eScience.2012.6404477
Peng Chen, Beth Plale, M. Aktaş
Provenance of digital scientific data is an important piece of the metadata of a data object. It can, however, quickly grow voluminous because provenance can be captured at a high level of granularity, and it can also be quite feature rich. We propose a representation of provenance data based on logical time that reduces the feature space. Creating time- and frequency-domain representations of the provenance, we apply clustering, classification and association rule mining to the abstract representations to determine the usefulness of the temporal representation. We evaluate the temporal representation using an existing 10 GB database of provenance captured from a range of scientific workflows.
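As an illustration of the kind of temporal encoding the abstract describes, the sketch below bins provenance events by logical time into a count signal and takes its magnitude spectrum; the event alphabet and feature construction are assumptions for illustration, not the authors' exact representation.

```python
import numpy as np

EVENT_TYPES = ["read", "write", "invoke"]        # illustrative event alphabet

def time_domain(events, n_steps):
    """events: list of (logical_time, event_type) pairs.
    Returns an (n_steps, len(EVENT_TYPES)) count matrix."""
    signal = np.zeros((n_steps, len(EVENT_TYPES)))
    for t, etype in events:
        signal[t, EVENT_TYPES.index(etype)] += 1
    return signal

def frequency_domain(signal):
    """Magnitude spectrum of each event-type channel (real FFT)."""
    return np.abs(np.fft.rfft(signal, axis=0))

events = [(0, "read"), (1, "invoke"), (1, "read"), (3, "write")]
td = time_domain(events, n_steps=4)
fd = frequency_domain(td)     # compact feature vector for clustering
```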
A data-driven urban research environment for Australia
Pub Date: 2012-10-01 | DOI: 10.1109/eScience.2012.6404481
R. Sinnott, Christopher Bayliss, G. Galang, Phillip Greenwood, George Koetsier, D. Mannix, L. Morandini, Marcos Nino-Ruiz, C. Pettit, Martin Tomko, M. Sarwar, R. Stimson, W. Voorsluys, I. Widjaja
The Australian Urban Research Infrastructure Network (AURIN) project (www.aurin.org.au) is tasked with developing an e-Infrastructure to support urban and built environment research across Australia. As identified in [1], this e-Infrastructure must provide seamless access to highly distributed and heterogeneous data sets from multiple organisations, with accompanying analytical and visualization capabilities. The project is to deliver a secure, web-based unifying environment offering a one-stop shop for Australia-wide urban and built environment research. This paper describes the architectural design and implementation of the AURIN data-driven e-Infrastructure, in which data is not just a passive entity accessed and used as research demand dictates, but instead directly shapes the possibilities for computational access, processing and intelligent utilization. This is demonstrated in a situational context.
High-performance computing without commitment: SC2IT: A cloud computing interface that makes computational science available to non-specialists
Pub Date: 2012-10-01 | DOI: 10.1109/eScience.2012.6404441
K. Jorissen, W. Johnson, F. Vila, J. Rehr
Computational work is a vital part of many scientific studies. In materials science research in particular, theoretical models are often needed to understand measurements. There is currently a double barrier that keeps a broad class of researchers from using state-of-the-art materials science codes: the software typically lacks user-friendliness, and the hardware requirements can demand a significant investment, e.g. the purchase of a Beowulf cluster. Scientific Cloud Computing has the potential to remove this barrier and make computational science accessible to a wider class of scientists who are not computational specialists. We present a set of interface tools, SC2IT, that enables seamless control of virtual compute clusters in the Amazon EC2 cloud and is designed to be embedded in user-friendly Java GUIs. We present applications of our Scientific Cloud Computing method to the materials science codes FEFF9, WIEN2k, and MEEP-mpi. SC2IT and the paradigm described here are applicable to other fields of research outside materials science within current Cloud Computing capability.
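SC2IT itself is a set of Java-embeddable tools; purely to illustrate the kind of EC2 calls such a toolkit wraps, here is a minimal virtual-cluster sketch using the boto3 Python SDK (which postdates the paper). The AMI ID, instance type, and region are placeholders, and this is not the SC2IT API.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_cluster(n_nodes, ami="ami-0123456789abcdef0",
                   instance_type="c5.xlarge"):
    """Start n_nodes identical worker instances and return their IDs."""
    resp = ec2.run_instances(ImageId=ami, MinCount=n_nodes,
                             MaxCount=n_nodes, InstanceType=instance_type)
    ids = [inst["InstanceId"] for inst in resp["Instances"]]
    # Block until all nodes pass status checks before dispatching work.
    ec2.get_waiter("instance_status_ok").wait(InstanceIds=ids)
    return ids

def terminate_cluster(ids):
    """Release the virtual cluster so billing stops."""
    ec2.terminate_instances(InstanceIds=ids)
```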
Discovering drug targets for neglected diseases using a pharmacophylogenomic cloud workflow
Pub Date: 2012-10-01 | DOI: 10.1109/eScience.2012.6404431
Kary A. C. S. Ocaña, Daniel de Oliveira, Jonas Dias, Eduardo S. Ogasawara, M. Mattoso
Illnesses caused by parasitic protozoa are a research priority. A representative group of these illnesses is commonly known as the Neglected Tropical Diseases (NTDs). NTDs especially afflict low-socioeconomic populations around the world; new anti-protozoan inhibitors are needed, and several drug discovery projects focus on the search for new drug targets. Pharmacophylogenomics is a novel bioinformatics field that aims to reduce the time and financial cost of the drug discovery process. Pharmacophylogenomic analyses are applied mainly in the early stages of the research phase of drug discovery. A pharmacophylogenomic analysis executes several bioinformatics programs in a coherent flow to identify homologous sequences, construct phylogenetic trees, and run evolutionary and structural experiments; it can therefore be modeled as a scientific workflow. Such workflows are complex, compute and data intensive, and may run for weeks, so they benefit from parallel execution. We propose SciPPGx, a scientific workflow that aims to provide thorough inference support for pharmacophylogenomic hypotheses. SciPPGx is executed in parallel in a cloud using the SciCumulus workflow engine. Experiments show that SciPPGx reduces the total execution time considerably, by up to 97.1% compared to a sequential execution. We also present representative biological results obtained through this inference, covering several related bioinformatics perspectives.
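For context, a 97.1% reduction in total execution time corresponds to roughly a 34x speedup over sequential execution; the snippet below just makes that arithmetic explicit (a derived figure, not one stated in the paper).

```python
# Convert the reported runtime reduction into an equivalent speedup factor.
reduction = 0.971
speedup = 1.0 / (1.0 - reduction)   # parallel time = (1 - reduction) * serial
print(f"speedup = {speedup:.1f}x")  # about 34.5x over sequential execution
```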
BIGS: A framework for large-scale image processing and analysis over distributed and heterogeneous computing resources
Pub Date: 2012-10-01 | DOI: 10.1109/eScience.2012.6404424
R. Ramos-Pollán, F. González, Juan C. Caicedo, Angel Cruz-Roa, Jorge E. Camargo, Jorge A. Vanegas, Santiago A. Pérez-Rubiano, J. Bermeo, Juan Sebastian Otálora Montenegro, Paola K. Rozo, John Arevalo
This paper presents BIGS, the Big Image Data Analysis Toolkit, a software framework for large-scale image processing and analysis over heterogeneous computing resources, such as those available in clouds, grids, computer clusters or scattered computer resources (desktops, labs) used in an opportunistic manner. Through BIGS, eScience for image processing and analysis is conceived to exploit coarse-grained parallelism based on data partitioning and parameter sweeps, avoiding the need for inter-process communication and therefore enabling loosely coupled computing nodes (BIGS workers). It adopts an uncommitted resource allocation model in which (1) experimenters define their image processing pipelines in a simple configuration file, (2) a schedule of jobs is generated, and (3) workers, as they become available, take over pending jobs as long as their dependencies on other jobs are fulfilled. BIGS workers act autonomously, querying the job schedule to determine which job to take over. This removes the need for a central scheduling node, requiring only that all workers have access to a shared information source. Furthermore, BIGS workers are encapsulated within different technologies to enable their agile deployment over the available computing resources. Currently they can be launched through the Amazon EC2 service over its cloud resources, through Java Web Start from any desktop computer, and through regular scripting or SSH commands. This suits different kinds of research environments well, both when accessing dedicated computing clusters or clouds with committed computing capacity and when using opportunistic computing resources whose access seldom can be provisioned in advance. We also adopt a NoSQL storage model to ensure the scalability of the shared information sources required by all workers, including support within BIGS for HBase and Amazon's DynamoDB service. Overall, BIGS enables researchers to run large-scale image processing pipelines in an easy, affordable and unplanned manner, with the capability to take over computing resources as they become available at run time. We show this by using BIGS in different experimental setups in the Amazon cloud and in an opportunistic manner, demonstrating its configurability, adaptability and scalability.
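The scheduler-less worker pattern described above can be illustrated with a small polling loop. A minimal sketch: a dict stands in for the shared HBase/DynamoDB job table, and the job names and claim logic are assumptions for illustration; a real deployment would claim jobs with a conditional write rather than a plain assignment.

```python
import time

jobs = {
    "extract":  {"deps": [],          "state": "pending"},
    "cluster":  {"deps": ["extract"], "state": "pending"},
    "evaluate": {"deps": ["cluster"], "state": "pending"},
}

def claim_next():
    """Return a runnable pending job, marking it as taken.
    A shared NoSQL store would make this claim a conditional write."""
    for name, job in jobs.items():
        ready = all(jobs[d]["state"] == "done" for d in job["deps"])
        if job["state"] == "pending" and ready:
            job["state"] = "running"
            return name
    return None

def worker_loop():
    """Each autonomous worker runs this loop; no central scheduler."""
    while any(j["state"] != "done" for j in jobs.values()):
        name = claim_next()
        if name is None:
            time.sleep(1)           # nothing runnable yet; poll again
            continue
        print(f"running {name}")    # image-processing stage would go here
        jobs[name]["state"] = "done"

worker_loop()
```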
IRMIS: The care and feeding of a generalized relatively relational database for accelerator components with a connection to the real time EPICS Input output controllers
Pub Date: 2012-10-01 | DOI: 10.1109/eScience.2012.6404469
R. Farnsworth, S. Benes
This paper describes a relational database approach to documenting and maintaining accelerator components, the "care and feeding" of the title. It describes the automated process used to generate accelerator and synchrotron component data for the relational tables, and the roles of devices and components. The data thus obtained may in turn be used or presented in a variety of ways to the end user, either to optimize maintenance or to provide machine metadata for experimental performance purposes.
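The abstract gives no schema; purely to make the component-table idea concrete, here is a hypothetical sqlite3 sketch of component records linked to EPICS process variables. Table and column names are invented for illustration and are not IRMIS's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE component (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    type      TEXT,                 -- e.g. magnet, power supply
    parent_id INTEGER REFERENCES component(id)   -- installation hierarchy
);
CREATE TABLE ioc_record (
    id           INTEGER PRIMARY KEY,
    component_id INTEGER REFERENCES component(id),
    pv_name      TEXT NOT NULL      -- EPICS process variable served by an IOC
);
""")
conn.execute("INSERT INTO component (name, type) VALUES ('QM1', 'magnet')")
conn.execute(
    "INSERT INTO ioc_record (component_id, pv_name) VALUES (1, 'SR:QM1:Cur')")
for row in conn.execute("""
    SELECT c.name, r.pv_name FROM component c
    JOIN ioc_record r ON r.component_id = c.id"""):
    print(row)   # -> ('QM1', 'SR:QM1:Cur')
```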