2013 IEEE 9th International Conference on e-Science: Latest Publications

Scientific Analysis by Queries in Extended SPARQL over a Scalable e-Science Data Store

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/ESCIENCE.2013.19

Andrej Andrejev, S. Toor, A. Hellander, S. Holmgren, T. Risch

Data-intensive applications in e-Science require scalable solutions for storage as well as interactive tools for analysis of scientific data. It is important to be able to query the data in a storage-independent way, and to be able to obtain the results of the data-analysis incrementally (in contrast to traditional batch solutions). We use the RDF data model extended with multidimensional numeric arrays to represent the results, parameters, and other metadata describing scientific experiments, and SciSPARQL, an extension of the SPARQL language, to combine massive numeric array data and metadata in queries. To address the scalability problem we present an architecture that enables the same SciSPARQL queries to be executed on the RDF dataset whether it is stored in a relational DBMS or mapped over a specialized geographically distributed e-Science data store. In order to minimize access and communication costs, we represent the arrays with proxy objects, and retrieve their content lazily. We formulate typical analysis tasks from a computational biology application in terms of SciSPARQL queries, and compare the query processing performance with manually written scripts in MATLAB.
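
The proxy-object idea in this abstract can be illustrated with a minimal sketch. It is not SciSPARQL or the authors' implementation: `ArrayProxy`, `select_means`, and the in-memory `loader` callables are hypothetical names standing in for remote array storage. The sketch only shows why filtering on metadata first and fetching array content lazily avoids unnecessary transfers.

```python
class ArrayProxy:
    """Hypothetical stand-in for a remotely stored numeric array.

    The content is fetched only when it is first needed.
    """

    def __init__(self, shape, loader):
        self.shape = shape        # metadata, available without fetching
        self._loader = loader     # callable that retrieves the content
        self._data = None
        self.fetch_count = 0      # number of (simulated) remote fetches

    def _materialize(self):
        # Fetch once, then reuse the cached content.
        if self._data is None:
            self._data = self._loader()
            self.fetch_count += 1
        return self._data

    def __getitem__(self, index):
        return self._materialize()[index]

    def mean(self):
        data = self._materialize()
        return sum(data) / len(data)


def select_means(experiments, max_rate):
    """Filter on metadata first, then touch only the arrays that survive."""
    return {
        name: proxy.mean()
        for name, (rate, proxy) in experiments.items()
        if rate <= max_rate
    }
```

Under these assumptions, an experiment whose metadata fails the filter never has its array fetched, so its `fetch_count` stays at zero.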

Citations: 9

Identity Management for Virtual Organizations: An Experience-Based Model

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/eScience.2013.47

Robert Cowles, Craig Jackson, Von Welch

In this paper we present our Virtual Organization (VO) Identity Management (IdM) Model, an overview of the 14 interviews that informed it, and a preliminary analysis of the factors that guide VOs and Resource Providers (RPs) in choosing a particular IdM implementation. This model will help both existing and future VOs and RPs to understand and implement their IdM relationships more effectively. The Virtual Organization has emerged as a fundamental way of structuring modern scientific collaborations and has shaped the computing infrastructure that supports those collaborations. One key aspect of this infrastructure is identity management, and the emergence of VOs introduces challenges regarding how much of the IdM process should be delegated from the RP to the VO. Many different implementation choices have been made; we conducted semi-structured interviews with 14 different VOs or RPs regarding their IdM choices and the bases for those decisions. We analyzed the interview results to extract common parameters and values, which we used to inform our VO IdM Model.

Citations: 3

Developing Sustainable Data Services in Cyberinfrastructure for Higher Education: Requirements and Lessons Learned

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/eScience.2013.46

Wilfred W. Li, R. Moore, Matthew Kullberg, B. Battistuz, S. Meier, Ronald Joyce, R. Wagner, T. Reynales, Qian Liu

The University of California, San Diego (UC San Diego) Research Cyberinfrastructure (RCI) program provides long-term quality services in centralized storage, colocation, computing, data curation, networking and technical expertise. To help define the data storage needs and set priorities, the RCI data services (RCIDS) team conducted a series of interviews with faculty and senior staff members between September 2012 and February 2013. A total of 50 groups from 29 separate departments and organized research units (ORUs) participated in the interviews, representing more than 600 UC San Diego researchers. Their diverse datasets, ranging from human genomic sequences and marine natural products to cosmological simulations, are shared with hundreds of thousands of users worldwide. The top 10 requirements on data services and the top 5 existing challenges and risks as reported by UC San Diego researchers have been identified. Based upon these requirements, the RCIDS team recommends that a Network Attached Storage (NAS) data service be deployed first, with a sustainable business model. Additional services will be developed through further discussion with the research community and in view of emerging cloud computing technologies. An extensive discussion is provided on the implementation plan, cloud-based data services, and the lessons learned in building sustainable e-science infrastructure for higher education research.

Citations: 1

An e-Science Environment for Ecological and Hydrological Simulation Research

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/eScience.2013.37

Yaonan Zhang, Yingpin Long, Guohui Zhao, Yufang Min, Jianfang Kang, L. Luo, Zhenfang He, Yang Wang

Comprehensive integrated research on ecological and hydrological processes and the simulation of river basin environments are critical foundations for decision making by governments and river-basin managers. The demand for a holistic understanding of environmental systems such as river basins is increasing. Eco-hydrological research needs three kinds of platform: a monitoring platform to access and collect data from basins; a modeling platform to access, select, and run models online and to build new models with the collected data; and a manipulation platform to generate forcing data, run models, and visualize the results. Consequently, we developed an e-science environment framework comprising these three platforms: a monitoring platform, a model platform, and a manipulation platform. The framework allows automatic data transmission, storage, management, analysis, model management, simulation, computing, and result visualization. The e-science environment integrates land surface models such as the Simplified Simple Biosphere model, the Revised Simple Biosphere model, and WRF; hydrological models such as SWAT and TOPMODEL; data assimilation filters such as the Kalman filter algorithm; and several tools and methods for dealing with data, principally artificial neural networks and Markov chains. We demonstrate an application of the framework that couples the SSIB land surface model with an ensemble Kalman filter to improve evapotranspiration, soil moisture, and ground temperature simulation in the Heihe inland river basin. The approach proves suitable for environmental simulation in inland river research.

Citations: 1

CloudDRN: A Lightweight, End-to-End System for Sharing Distributed Research Data in the Cloud

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/eScience.2013.53

M. Humphrey, Jacob Steele, I. Kim, M. Kahn, J. Bondy, Michael Ames

The cloud has proven itself as a scalable platform for Web-based applications. However, scientists and medical researchers are still searching for a simple cloud-based architecture that enables secure collaboration and sharing of distributed datasets. To date, attempts at using the cloud for this purpose generally view the cloud as simply a pool of servers upon which to run their legacy software. This approach fails to leverage the unique platform capabilities of the cloud. In this paper, we describe our Cloud Distributed Research Network (CloudDRN). We leverage the cloud for availability, reliability, scalability, and improved security as compared to legacy distributed systems while still supporting site autonomy. Our philosophy is to adapt commercial software tooling that was originally designed for business use-cases, thereby benefiting from the large built-in user community. We describe our general architecture and show an example of our system created to share distributed clinical research data. We evaluate our system in Amazon Web Services (AWS) and in Microsoft Windows Azure and find that while each cloud achieves similar financial cost, representative queries are 3.5x slower on average in Windows Azure.

Citations: 11

Constructing a Social Content Delivery Network for eScience

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/ESCIENCE.2013.52

Kai Kugler, K. Chard, Simon Caton, O. Rana, D. Katz

Increases in the size of research data and the move towards citizen science, in which everyday users contribute data and analyses, have resulted in a research data deluge. Researchers must now carefully determine how to store, transfer and analyze "Big Data" in collaborative environments. This task is even more complicated when considering budget and locality constraints on data storage and access. In this paper we investigate the potential to construct a Social Content Delivery Network (S-CDN) based upon the social networks that exist between researchers. The S-CDN model builds upon the incentives of collaborative researchers within a given scientific community to address their data challenges collaboratively and in proven trusted settings. We present a prototype implementation of an S-CDN and investigate the performance of the data transfer mechanisms (using Globus Online) and the potential cost advantages of this approach.

Citations: 8

Operation Properties: A Representation and their Role in the Propagation of Meta-Data

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/eScience.2013.13

Juan Amiguet-Vercher, P. Apers, A. Wombacher

To facilitate the sharing and re-use of data in scientific studies we propose an automated technique for annotating operation results. The annotated output has to preserve, as much as possible, the properties of the input annotations. The preservation of properties is achieved by taking into account operation properties. Property preservation is evaluated with information theory metrics.

Citations: 0

e-Enabling International Cancer Research: Lessons Being Learnt in the ENS@T-CANCER Project

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/eScience.2013.33

A. Stell, R. Sinnott

Breakthroughs in biomedicine are driven by research. More often than not, research takes place outside of a healthcare setting. However, access to and use of clinical data for research purposes poses many challenges that must be overcome, not least of which are the lack of standardized nomenclature and the heterogeneity of healthcare IT systems. For rare conditions, this challenge is particularly acute since the scarcity of data makes scientific breakthroughs increasingly difficult. Adrenal tumours represent one rare disease area where consolidation of clinical and biological information is urgently required. This paper describes the lessons being learnt in the development and rollout of an advanced security-oriented virtual research environment (VRE) as part of the EU-funded ENS@T-CANCER project. This system is currently used by 39 major cancer research centres across Europe and provides a unique resource for adrenal cancer research, underpinning an expanding portfolio of major international clinical trials and studies.

Citations: 7

OzTrack -- E-Infrastructure to Support the Management, Analysis and Sharing of Animal Tracking Data

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/ESCIENCE.2013.38

J. Hunter, C. Brooking, Wilfred Brimblecombe, R. G. Dwyer, H. Campbell, Matthew E. Watts, C. Franklin

The aim of the OzTrack project is to provide common e-Science infrastructure to support the management, pre-processing, analysis and visualization of animal tracking data generated by researchers who are using telemetry devices to study animal behavior and ecology in Australia. This paper describes the technical challenges and design decisions associated with the development of the OzTrack system. It also describes the pre-processing, analysis and visualization services that we have developed to help researchers understand how their study species move across space and time. Finally, this paper outlines the system's current limitations, along with preliminary results and feedback from its evaluation to date.

Citations: 18

Data Pipeline in MapReduce

2013 IEEE 9th International Conference on e-Science

Pub Date : 2013-10-22 DOI: 10.1109/eScience.2013.21

Jiaan Zeng, Beth Plale

MapReduce is an effective programming model for large scale text and data analysis. Traditional MapReduce implementations, e.g., Hadoop, have the restriction that before any analysis can take place, the entire input dataset must be loaded into the cluster. This can introduce sizable latency when the data set is large and when it is not possible to load the data once and process it many times, a situation that exists for log files, health records and protected texts, for instance. We propose a data pipeline approach to hide data upload latency in MapReduce analysis. Our implementation, which is based on Hadoop MapReduce, is completely transparent to the user. It introduces a distributed concurrency queue to coordinate data block allocation and synchronization so as to overlap data upload and execution. The paper addresses two scheduling challenges: scheduling with a fixed number of maps and scheduling with a dynamic number of maps, the latter allowing better handling of input data sets of unknown size. We also employ a delay scheduler to achieve data locality for the data pipeline. Evaluation of the solution with different applications on real-world data sets shows that our approach yields performance gains.
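
The overlap of upload and execution described in this abstract can be sketched in a single process with a bounded queue. This is only an analogy under stated assumptions, not the authors' Hadoop-based system: `upload_blocks` and `pipelined_word_count` are hypothetical names, a thread stands in for the uploader, and word counting stands in for the map step.

```python
import queue
import threading
from collections import Counter

_DONE = object()  # sentinel marking the end of the upload stream


def upload_blocks(blocks, block_queue):
    """Producer: 'upload' blocks one at a time, announcing each as it lands."""
    for block in blocks:
        block_queue.put(block)   # blocks while the bounded queue is full
    block_queue.put(_DONE)


def pipelined_word_count(blocks):
    """Consumer: process each block as soon as it is available.

    `blocks` may be any iterable, so the total input size need not be
    known in advance.
    """
    block_queue = queue.Queue(maxsize=2)
    uploader = threading.Thread(target=upload_blocks, args=(blocks, block_queue))
    uploader.start()

    totals = Counter()
    while True:
        block = block_queue.get()
        if block is _DONE:
            break
        totals.update(block.split())  # map plus local reduce for this block
    uploader.join()
    return totals
```

Because the consumer starts as soon as the first block arrives, processing time overlaps with the remaining upload rather than following it.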

Citations: 4
