Scientific Analysis by Queries in Extended SPARQL over a Scalable e-Science Data Store
Andrej Andrejev, S. Toor, A. Hellander, S. Holmgren, T. Risch
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/ESCIENCE.2013.19
Data-intensive applications in e-Science require scalable solutions for storage as well as interactive tools for the analysis of scientific data. It is important to be able to query the data in a storage-independent way, and to obtain the results of the data analysis incrementally (in contrast to traditional batch solutions). We use the RDF data model, extended with multidimensional numeric arrays, to represent the results, parameters, and other metadata describing scientific experiments, and SciSPARQL, an extension of the SPARQL language, to combine massive numeric array data and metadata in queries. To address the scalability problem, we present an architecture that enables the same SciSPARQL queries to be executed on the RDF dataset whether it is stored in a relational DBMS or mapped over a specialized, geographically distributed e-Science data store. To minimize access and communication costs, we represent the arrays with proxy objects and retrieve their content lazily. We formulate typical analysis tasks from a computational biology application as SciSPARQL queries, and compare the query processing performance with manually written MATLAB scripts.
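The lazy-retrieval proxy idea described above can be sketched in a few lines of Python. This is a minimal illustration with a local stub standing in for the distributed data store; `ArrayProxy` and `fetch` are hypothetical names, not the authors' API:

```python
import numpy as np

class ArrayProxy:
    """Stand-in for a remote numeric array: metadata (shape) travels with
    the query result, but element data is fetched only on first access."""

    def __init__(self, shape, fetch):
        self.shape = shape      # known from metadata; no data transfer yet
        self._fetch = fetch     # callable that retrieves the real array
        self._data = None

    def _materialize(self):
        if self._data is None:  # lazy retrieval on first element access
            self._data = self._fetch()
        return self._data

    def __getitem__(self, idx):
        return self._materialize()[idx]

# Hypothetical backend: in the paper this would be a call into the
# geographically distributed e-Science data store; here it is a stub.
calls = []
def fetch():
    calls.append(1)
    return np.arange(12).reshape(3, 4)

proxy = ArrayProxy((3, 4), fetch)
assert calls == []           # creating the proxy moves no array data
assert proxy[1, 2] == 6      # the first subscript triggers the fetch
assert len(calls) == 1
_ = proxy[0, 0]
assert len(calls) == 1       # subsequent access reuses the cached data
```

The point of the pattern is that a query can bind many array-valued variables cheaply and pay the transfer cost only for the elements it actually dereferences.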
Identity Management for Virtual Organizations: An Experience-Based Model
Robert Cowles, Craig Jackson, Von Welch
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.47
In this paper we present our Virtual Organization (VO) Identity Management (IdM) Model, an overview of the 14 interviews that informed it, and a preliminary analysis of the factors that guide VOs and Resource Providers (RPs) toward a particular IdM implementation. This model will help both existing and future VOs and RPs understand and implement their IdM relationships more effectively. The Virtual Organization has emerged as a fundamental way of structuring modern scientific collaborations and has shaped the computing infrastructure that supports them. One key aspect of this infrastructure is identity management, and the emergence of VOs introduces the question of how much of the IdM process should be delegated from the RP to the VO. Because many different implementation choices have been made in practice, we conducted semi-structured interviews with 14 different VOs and RPs regarding their IdM choices and the reasoning behind those decisions. We analyzed the interview results to extract common parameters and values, which we used to inform our VO IdM Model.
Developing Sustainable Data Services in Cyberinfrastructure for Higher Education: Requirements and Lessons Learned
Wilfred W. Li, R. Moore, Matthew Kullberg, B. Battistuz, S. Meier, Ronald Joyce, R. Wagner, T. Reynales, Qian Liu
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.46
The University of California, San Diego (UC San Diego) Research Cyberinfrastructure (RCI) program provides long-term, high-quality services in centralized storage, colocation, computing, data curation, networking, and technical expertise. To help define data storage needs and set priorities, the RCI data services (RCIDS) team conducted a series of interviews with faculty and senior staff members between September 2012 and February 2013. A total of 50 groups from 29 separate departments and organized research units (ORUs) participated, representing more than 600 UC San Diego researchers. From human genomic sequences and marine natural products to cosmological simulations, their diverse datasets are shared with hundreds of thousands of users worldwide. The top 10 requirements for data services and the top 5 existing challenges and risks reported by UC San Diego researchers have been identified. Based upon these requirements, the RCIDS team recommends that a Network Attached Storage (NAS) data service be deployed first, with a sustainable business model. Additional services will be developed through further discussion with the research community and in view of emerging cloud computing technologies. An extensive discussion is provided on the implementation plan, cloud-based data services, and the lessons learned in building sustainable e-science infrastructure for higher education research.
An e-Science Environment for Ecological and Hydrological Simulation Research
Yaonan Zhang, Yingpin Long, Guohui Zhao, Yufang Min, Jianfang Kang, L. Luo, Zhenfang He, Yang Wang
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.37
Comprehensive, integrated research on ecological and hydrological processes and the simulation of river basin environments are critical foundations for decision making by governments and river-basin managers, and the demand for a holistic understanding of environmental systems such as river basins is increasing. Eco-hydrological research needs three types of platforms: a monitoring platform to access and collect data from basins; a modeling platform to support accessing, selecting, and running models online, and building new models with the collected data; and a manipulation platform to generate forcing data, run models, and visualize the results. Consequently, we developed an e-science environment framework comprising these three platforms. The framework supports automatic data transmission, storage, management, and analysis, as well as model management, simulation, computing, and result visualization. The e-science environment integrates land surface models such as the Simplified Simple Biosphere model (SSiB), the Revised Simple Biosphere model, and WRF; hydrological models such as SWAT and TOPMODEL; data assimilation filters such as the ensemble Kalman filter; and several tools and methods for dealing with data, principally artificial neural networks and Markov chains. We demonstrate an application of the framework in which the SSiB land surface model is combined with an ensemble Kalman filter to improve the simulation of evapotranspiration, soil moisture, and ground temperature in the Heihe inland river basin. The approach proves suitable for environmental simulation in inland river research.
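The analysis step of the ensemble Kalman filter used in such land-surface data assimilation can be sketched for a single observed scalar state (a toy illustration with NumPy, not the SSiB configuration from the paper; `enkf_update` and the numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def enkf_update(ensemble, obs, obs_err):
    """Stochastic EnKF analysis step for a scalar observed state.

    ensemble: (n,) array of forecast states (e.g., soil moisture)
    obs, obs_err: observed value and its error standard deviation
    """
    forecast_var = ensemble.var(ddof=1)
    gain = forecast_var / (forecast_var + obs_err ** 2)  # Kalman gain
    # Perturb the observation independently for each ensemble member
    perturbed = obs + rng.normal(0.0, obs_err, ensemble.size)
    return ensemble + gain * (perturbed - ensemble)

prior = rng.normal(10.0, 2.0, 100)   # forecast ensemble around 10.0
posterior = enkf_update(prior, obs=14.0, obs_err=0.5)

# The analysis mean moves toward the observation, and the ensemble
# spread shrinks, reflecting the information gained from the data.
assert abs(posterior.mean() - 14.0) < abs(prior.mean() - 14.0)
assert posterior.var() < prior.var()
```

Each model state (evapotranspiration, soil moisture, ground temperature) would be updated this way whenever a matching observation arrives, with the gain generalizing to a matrix in the multivariate case.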
CloudDRN: A Lightweight, End-to-End System for Sharing Distributed Research Data in the Cloud
M. Humphrey, Jacob Steele, I. Kim, M. Kahn, J. Bondy, Michael Ames
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.53
The cloud has proven itself as a scalable platform for Web-based applications. However, scientists and medical researchers are still searching for a simple cloud-based architecture that enables secure collaboration and sharing of distributed datasets. To date, attempts at using the cloud for this purpose generally view the cloud as simply a pool of servers upon which to run legacy software. This approach fails to leverage the unique platform capabilities of the cloud. In this paper, we describe our Cloud Distributed Research Network (CloudDRN). We leverage the cloud for availability, reliability, scalability, and improved security as compared to legacy distributed systems, while still supporting site autonomy. Our philosophy is to adapt commercial software tooling that was originally designed for business use-cases, thereby benefiting from the large built-in user community. We describe our general architecture and show an example of our system created to share distributed clinical research data. We evaluate our system in Amazon Web Services (AWS) and in Microsoft Windows Azure and find that while each cloud achieves similar financial cost, representative queries are 3.5x slower on average in Windows Azure.
Constructing a Social Content Delivery Network for eScience
Kai Kugler, K. Chard, Simon Caton, O. Rana, D. Katz
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/ESCIENCE.2013.52
Increases in the size of research data and the move towards citizen science, in which everyday users contribute data and analyses, have resulted in a research data deluge. Researchers must now carefully determine how to store, transfer, and analyze "Big Data" in collaborative environments. This task is even more complicated when budget and locality constraints on data storage and access are considered. In this paper we investigate the potential to construct a Social Content Delivery Network (S-CDN) based upon the social networks that exist between researchers. The S-CDN model builds upon the incentives of collaborating researchers within a given scientific community to address their data challenges collectively and in proven, trusted settings. We present a prototype implementation of an S-CDN and investigate the performance of its data transfer mechanisms (using Globus Online) and the potential cost advantages of this approach.
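One core decision in such a network is which socially connected peer to fetch a replica from. A minimal sketch of replica selection by collaboration-graph distance (a breadth-first search; the graph, names, and `nearest_replica` helper are illustrative assumptions, not the paper's algorithm):

```python
from collections import deque

def nearest_replica(social_graph, me, holders):
    """Return the closest peer (by collaboration hops) holding a replica,
    preferring directly connected, and hence most trusted, collaborators."""
    seen, frontier = {me}, deque([(me, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node in holders and node != me:
            return node, dist
        for peer in social_graph.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                frontier.append((peer, dist + 1))
    return None, None  # no socially connected peer holds the data

# Hypothetical collaboration graph; edges = co-authorship or shared projects
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob"],
}
peer, hops = nearest_replica(graph, "alice", holders={"dave", "carol"})
assert (peer, hops) == ("carol", 1)  # direct collaborator beats 2-hop peer
```

A production version would rank candidates by measured bandwidth and storage cost as well as social distance, which is where the paper's cost analysis comes in.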
Operation Properties: A Representation and their Role in the Propagation of Meta-Data
Juan Amiguet-Vercher, P. Apers, A. Wombacher
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.13
To facilitate the sharing and re-use of data in scientific studies, we propose an automated technique for annotating operation results. The annotated output has to preserve, as much as possible, the properties of the input annotations. This preservation is achieved by taking the properties of the operations themselves into account, and is evaluated with information-theoretic metrics.
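One simple information-theoretic measure of how well an operation preserves annotations is the fraction of the input annotations' Shannon entropy that survives in the output. The sketch below is an illustrative assumption about what such a metric could look like, not the paper's definition:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of annotation labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def preservation(inp, out):
    """Fraction of input annotation entropy surviving in the output:
    1.0 = annotation information fully preserved, 0.0 = all lost."""
    h_in = entropy(inp)
    return 1.0 if h_in == 0 else entropy(out) / h_in

sensor = ["calibrated", "calibrated", "raw", "raw"]      # H = 1 bit
assert preservation(sensor, sensor) == 1.0               # identity operation
assert preservation(sensor, ["calibrated"] * 4) == 0.0   # collapsing operation
```

An operation that aggregates many differently annotated inputs into one uniform label scores near zero, flagging it as a step where provenance detail is lost.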
e-Enabling International Cancer Research: Lessons Being Learnt in the ENS@T-CANCER Project
A. Stell, R. Sinnott
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.33
Breakthroughs in biomedicine are driven by research, which more often than not takes place outside of a healthcare setting. However, access to and use of clinical data for research purposes has many challenges that must be overcome, not least of which are the lack of standardized nomenclature and the heterogeneity of healthcare IT systems. For rare conditions this challenge is particularly acute, since the scarcity of data makes scientific breakthroughs increasingly difficult. Adrenal tumours represent one rare disease area where consolidation of clinical and biological information is urgently required. This paper describes the lessons being learnt in the development and rollout of an advanced, security-oriented virtual research environment (VRE) as part of the EU-funded ENS@T-CANCER project. The system is currently used by 39 major cancer research centres across Europe and provides a unique resource for adrenal cancer research, underpinning an expanding portfolio of major international clinical trials and studies.
OzTrack -- E-Infrastructure to Support the Management, Analysis and Sharing of Animal Tracking Data
J. Hunter, C. Brooking, Wilfred Brimblecombe, R. G. Dwyer, H. Campbell, Matthew E. Watts, C. Franklin
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/ESCIENCE.2013.38
The aim of the OzTrack project is to provide common e-Science infrastructure to support the management, pre-processing, analysis, and visualization of animal tracking data generated by researchers who use telemetry devices to study animal behavior and ecology in Australia. This paper describes the technical challenges and design decisions associated with the development of the OzTrack system. It also describes the pre-processing, analysis, and visualization services we have developed to help researchers understand how their study species move across space and time. Finally, the paper outlines the system's current limitations and presents preliminary results and feedback from its evaluation to date.
Data Pipeline in MapReduce
Jiaan Zeng, Beth Plale
2013 IEEE 9th International Conference on e-Science. Pub Date: 2013-10-22. DOI: 10.1109/eScience.2013.21
MapReduce is an effective programming model for large-scale text and data analysis. Traditional MapReduce implementations such as Hadoop have the restriction that, before any analysis can take place, the entire input dataset must be loaded into the cluster. This can introduce sizable latency when the dataset is large, and when it is not possible to load the data once and process it many times - a situation that exists for log files, health records, and protected texts, for instance. We propose a data pipeline approach that hides data upload latency in MapReduce analysis. Our implementation, which is based on Hadoop MapReduce, is completely transparent to the user. It introduces a distributed concurrency queue to coordinate data block allocation and synchronization so as to overlap data upload and execution. The paper addresses two scheduling challenges: scheduling with a fixed number of map tasks, and scheduling with a dynamic number of map tasks, which allows better handling of input datasets of unknown size. We also employ a delay scheduler to achieve data locality in the pipeline. An evaluation on different applications over real-world datasets shows that our approach yields performance gains.
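The overlap of upload and map execution via a shared concurrency queue can be sketched with Python threads (a single-machine analogue of the paper's distributed queue; `pipelined_mapreduce` and the block sizes are illustrative assumptions):

```python
import queue
import threading

def pipelined_mapreduce(blocks, map_fn, workers=2):
    """Overlap 'upload' and map execution: a loader thread feeds data
    blocks into a shared bounded queue while worker threads consume
    them, so mapping starts before the whole dataset has arrived."""
    q = queue.Queue(maxsize=4)          # bounded: applies back-pressure
    results, lock = [], threading.Lock()

    def loader():                       # simulates incremental upload
        for block in blocks:
            q.put(block)
        for _ in range(workers):
            q.put(None)                 # one termination sentinel per worker

    def worker():
        while (block := q.get()) is not None:
            r = map_fn(block)           # map task runs as data streams in
            with lock:
                results.append(r)

    threads = [threading.Thread(target=loader)]
    threads += [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Word counts per block, computed while later blocks are still "uploading"
counts = pipelined_mapreduce(["a b", "b c", "a a"], lambda s: len(s.split()))
assert sorted(counts) == [2, 2, 2]
```

In the distributed setting the queue also carries block locations, so a delay scheduler can hold a map task briefly until a node holding that block becomes free, preserving data locality.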