2019 15th International Conference on eScience (eScience): Latest Publications

Incorporating New Concepts Into the Scientific Variables Ontology
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00073
M. Stoica, S. Peckham
We present a preliminary methodology, currently in development, for the automated generation of domain-specific, machine-readable representations of qualitative and quantitative scientific variable concepts. The method is based on the top-level universal categories and modular design patterns declared within the Scientific Variables Ontology (v 1.0.0) blueprint. These scientific variable representations can be used to annotate electronic resources, such as data and models, and, together with reasoning algorithms, to provide explainable automated resource alignment capabilities in the assembly of scientific workflows.
Citations: 4
A Hybrid Algorithm for Mineral Dust Detection Using Satellite Data
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00012
Peichang Shi, Qianqian Song, Janita Patwardhan, Zhibo Zhang, Jianwu Wang, A. Gangopadhyay
Mineral dust, defined as aerosol originating from the soil, can have various harmful effects on the environment and human health. The detection of dust, and particularly of incoming dust storms, may help prevent some of these negative impacts. In this paper, using satellite observations from the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO), we compared several machine learning algorithms to traditional physical models and evaluated their performance in mineral dust detection. Based on the comparison results, we proposed a hybrid algorithm that integrates the physical model with the data mining model and achieved the highest accuracy among all the methods. Further, we ranked the different channels of MODIS data by the importance of their band wavelengths in dust detection. Our model also showed the quantitative relationships between dust and the different band wavelengths.
Citations: 7
A Historical Big Data Analysis to Understand the Social Construction of Juvenile Delinquency in the United States
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00094
Sandeep Puthanveetil Satheesan, Alan B. Craig, Yu Zhang
Social construction is the theoretical position that social reality is created through human definition and interaction, as opposed to being something that exists by default. As one type of social reality, juvenile delinquency is perceived as a social problem, deeply contextualized and socially constructed in American society. The social construction of juvenile delinquency started far earlier than the establishment of the first U.S. juvenile court in 1899. Scholars have used traditional historical analysis to explore the timeline of this social construction, but examining hundreds of years of documents with traditional paper-and-pencil methods is inefficient. We propose to research, develop, and apply image and text analysis methods to analyze hundreds of years of newspaper data and show a clear development of the social construction of juvenile delinquency in American society. The project aims to explore how the media started depicting certain types of juvenile behavior as delinquency; how they described those behaviors; who those juveniles are (age, race, gender, family background, community background, etc.); how other social institutions treat the juveniles in those stories; how the depiction of juvenile delinquency has changed during the past 100 years; and whether the analysis results support the social construction perspective on juvenile delinquency. In this paper, we present our ongoing work on image analysis of the newspaper collection from the Library of Congress Chronicling America website, along with initial results, observations, current conclusions, and future work.
Citations: 2
Out-of-the-Box Reproducibility: A Survey of Machine Learning Platforms
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00017
R. Isdahl, Odd Erik Gundersen
Even machine learning experiments that are conducted entirely on computers are not necessarily reproducible. An increasing number of open source and commercial, closed source machine learning platforms are being developed to help address this problem. However, there is no standard for assessing and comparing which features are required to fully support reproducibility. We propose a quantitative method that alleviates this problem. Based on the proposed method, we assess and compare current state-of-the-art machine learning platforms on how well they support making empirical results reproducible. Our results show that BEAT and Floydhub have the best support for reproducibility, with Codalab and Kaggle as close contenders. The most commonly used machine learning platforms provided by the big tech companies have poor support for reproducibility.
Citations: 26
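The abstract of the reproducibility survey does not spell out its scoring method, so the following is only a minimal sketch of what a quantitative comparison of platforms could look like: each platform is scored by the fraction of a feature checklist it supports. The checklist entries and the platform names `platform_a` and `platform_b` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical feature checklist; the paper's actual criteria are not listed in the abstract.
FEATURES = ("code_versioning", "data_versioning", "environment_capture", "experiment_logging")


def support_score(supported: Set[str]) -> float:
    """Fraction of checklist features that a platform supports (0.0 to 1.0)."""
    return sum(1 for feature in FEATURES if feature in supported) / len(FEATURES)


def rank_platforms(platforms: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from best to worst support."""
    scores = [(name, support_score(feats)) for name, feats in platforms.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical platforms used only to exercise the functions.
example = {
    "platform_a": {"code_versioning", "data_versioning", "environment_capture"},
    "platform_b": {"code_versioning"},
}
ranking = rank_platforms(example)
```

An unweighted fraction is the simplest choice; a real assessment would likely weight features differently, which this sketch does not attempt.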
Characterizing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00027
M. Taufer, E. Deelman, Stephen Thomas, Michael R. Wyatt, T. Do, L. Pottier, Rafael Ferreira da Silva, H. Weinstein, M. Cuendet, Trilce Estrada
Molecular Dynamics (MD) simulations executed on state-of-the-art supercomputers are producing data at rates faster than it can be written out to disk. In situ and in transit analysis of data generated by MD simulations reduces the original volume of information by several orders of magnitude, thereby alleviating the negative impact of I/O bottlenecks. This work focuses on characterizing the impact of in situ and in transit analytics on the overall MD workflow performance, and on the capability for capturing rapid, rare events in the simulated molecular system. The MD simulation and analysis processes share data via remote direct memory access (RDMA) using DataSpaces. Our metrics of interest are the time the MD simulation spends waiting in I/O, the frames of the MD simulation that are lost, and the idle time of the analysis. We measure these metrics for a diverse set of molecular systems and characterize their trends for in situ and in transit configurations. We then model which frames are dropped and which ones are analyzed for a real use case. The insights gained from this study are generally applicable to in situ and in transit workflows that require optimization of parameters to minimize loss in workflow performance and analytic accuracy.
Citations: 10
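The trade-off between lost frames and analysis idle time described in the MD abstract can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the authors' model: the simulation emits one frame every `sim_step` seconds and never waits, the analysis needs `ana_step` seconds per frame, and a frame that arrives while the analysis is busy is dropped. The function and parameter names are hypothetical.

```python
from typing import List, Tuple


def simulate(n_frames: int, sim_step: float, ana_step: float) -> Tuple[List[int], List[int], float]:
    """Toy model of a producer (simulation) and a single consumer (analysis).

    Returns the indices of analyzed frames, the indices of dropped frames,
    and the total time the analysis spent idle waiting for a frame
    (including the wait for the very first frame).
    """
    free_at = 0.0  # time at which the analysis is next able to accept a frame
    analyzed: List[int] = []
    dropped: List[int] = []
    idle = 0.0
    for i in range(n_frames):
        arrival = (i + 1) * sim_step  # frame i is produced at this time
        if arrival >= free_at:
            idle += arrival - free_at  # analysis was waiting for this frame
            analyzed.append(i)
            free_at = arrival + ana_step
        else:
            dropped.append(i)  # analysis still busy, so the frame is lost
    return analyzed, dropped, idle


# Analysis slower than the simulation: most frames are lost.
slow_analysis = simulate(6, sim_step=1.0, ana_step=2.5)
# Analysis faster than the simulation: nothing is lost, but the analysis idles.
fast_analysis = simulate(4, sim_step=2.0, ana_step=1.0)
```

In this model a slow analysis loses frames while a fast analysis accumulates idle time, which is the kind of tension that parameter tuning in such workflows has to balance.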
Serverless Science for Simple, Scalable, and Shareable Scholarship
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00056
K. Chard, Ian T Foster
The adoption of computation- and data-intensive science, or eScience, makes research progress increasingly dependent on the availability, management, and use of sophisticated cyberinfrastructure. An unfortunate consequence is that researchers face increasingly burdensome demands for managing and maintaining cyberinfrastructure. The advent of virtualization and cloud computing has helped, by allowing outsourcing of some such tasks to reliable and scalable cloud providers. But much more progress is needed before we can create a research cyberinfrastructure that allows researchers to focus on creative thought rather than systems management. We examine here how the emerging paradigm of serverless computing, in which arbitrary functions can be dispatched seamlessly to scalable, secure, and reliable service providers, can move us in that direction. To demonstrate how serverless computing can transform scientific computing, we describe three serverless computing models: service-oriented computing, research automation, and function as a service, presenting illustrative case studies for each.
Citations: 3
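To make the function-as-a-service idea from the serverless abstract concrete, here is a minimal local sketch. It is not the API of any real serverless platform or of the systems the paper describes: the class `LocalFunctionService` and its methods are hypothetical, and a thread pool stands in for a remote, scalable provider.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, List


class LocalFunctionService:
    """Toy stand-in for a function-as-a-service endpoint (runs functions in local threads)."""

    def __init__(self, max_workers: int = 4) -> None:
        self._registry: Dict[str, Callable[..., Any]] = {}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def register(self, name: str, func: Callable[..., Any]) -> str:
        """Make a function available under a name, as a caller would publish it to a service."""
        self._registry[name] = func
        return name

    def invoke(self, name: str, *args: Any, **kwargs: Any) -> Future:
        """Dispatch a registered function asynchronously and return a future for its result."""
        return self._pool.submit(self._registry[name], *args, **kwargs)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


service = LocalFunctionService()
service.register("mean", mean)
# The caller only names the function and passes data; where it runs is hidden.
futures = [service.invoke("mean", chunk) for chunk in ([1, 2, 3], [10, 20])]
results = [future.result() for future in futures]
service.shutdown()
```

The point of the sketch is the shape of the interaction: the researcher registers a function once and then dispatches work by name, without managing the machines that execute it.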
Enabling Transparent Access to Heterogeneous Architectures for IS-ENES Climate4Impact using the DARE Platform
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00089
C. Pagé, W. S. D. Cerff, M. Plieger, A. Spinuso, Xavier Pivan
Access to climate data is crucial to sustain research and climate change impact assessments. It has a strong societal impact, because those changes will have to be mitigated as much as possible. The whole climate data archive is expected to reach a volume of 30 PB in 2019 and up to 2000 PB in 2024 (estimated), up from 30 TB in 2007 and 2 PB in 2014. Data processing and analysis must now happen remotely for the users: they have to rely on heterogeneous infrastructures and services between the data and their location. Developers of Research Infrastructures have to provide services to those users, and therefore have to define standards and generic services to fulfill those requirements. It will be shown how the DARE eScience Platform (http://project-dare.eu) will help developers to more rapidly develop the services needed by a wide range of scientific researchers. The platform is designed for efficient and traceable development of complex experiments and domain-specific services on the Cloud. It will also be shown how the integration of the DARE platform with climate4impact (C4I: https://climate4impact.eu/), the front-end of the climate IS-ENES (https://is.enes.org) Research Infrastructure, will help developers leverage heterogeneous architectures transparently for the benefit of researchers.
Citations: 1
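The archive volumes quoted in the climate data abstract (30 TB in 2007, 2 PB in 2014, 30 PB in 2019, an estimated 2000 PB in 2024) can be turned into implied average annual growth factors. The sketch below assumes the decimal convention 1 PB = 1000 TB, which the abstract does not state, and assumes smooth compound growth between the quoted years.

```python
from typing import Dict, Tuple

TB_PER_PB = 1000.0  # assumption: decimal units; the abstract does not specify


def annual_growth_factor(start_tb: float, end_tb: float, years: int) -> float:
    """Average yearly multiplier that takes start_tb to end_tb over the given number of years."""
    return (end_tb / start_tb) ** (1.0 / years)


# (year, archive volume in TB), taken from the figures quoted in the abstract.
milestones = [
    (2007, 30.0),
    (2014, 2 * TB_PER_PB),
    (2019, 30 * TB_PER_PB),
    (2024, 2000 * TB_PER_PB),  # estimated in the abstract
]

growth: Dict[Tuple[int, int], float] = {}
for (y0, v0), (y1, v1) in zip(milestones, milestones[1:]):
    growth[(y0, y1)] = annual_growth_factor(v0, v1, y1 - y0)
```

Under these assumptions the archive grows by a factor of roughly 1.8 per year from 2007 to 2014, roughly 1.7 per year from 2014 to 2019, and roughly 2.3 per year in the projection to 2024.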
Data Science Model Curriculum Implementation for Various Types of Big Data Infrastructure Courses
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00074
T. Wiktorski, Y. Demchenko, O. Chertov
This paper presents experience in developing and teaching three different types of Big Data Infrastructure courses as part of the general Data Science curricula. The authors built the discussed courses on the EDISON Data Science Framework (EDSF), in particular the Data Science Body of Knowledge (DS-BoK) related to the Data Science Engineering knowledge area group (KAG-DSENG). The paper provides an overview of the sandboxes, Cloud-based platforms, and tools for Big Data Analytics, and stresses the importance of including practical work with Clouds in the curriculum to improve the workplace adaptability of future graduates and specialists. The paper discusses the relationship between the DSENG BoK and Big Data technologies and platforms, in particular Hadoop-based applications and tools for data analytics, which should be promoted through all course activities: lectures, practical activities, and self-study.
Citations: 2
SDM: A Scientific Dataset Delivery Platform
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00049
Illyoung Choi, Jude C. Nelson, Larry L. Peterson, J. Hartman
Scientific computing is becoming more data-centric and more collaborative, requiring increasingly large datasets to be transferred across the Internet. Transferring these datasets efficiently and making them accessible to scientific workflows is an increasingly difficult task. In addition, the data transfer time can be a significant portion of the overall workflow running time. This paper presents SDM (Syndicate Dataset Manager), a scientific dataset delivery platform. Unlike general-purpose data transfer tools, SDM offers on-demand access to remote scientific datasets. On-demand access doesn't require staging datasets to local file systems prior to computing on them, and it also enables overlapping computation and I/O. In addition, SDM offers a simple interface for users to locate and access datasets. To validate the usefulness of SDM, we performed realistic metagenomic sequence analysis workflows on remote genomic datasets. In general, SDM configured with a CDN outperforms existing data access methods. With warm CDN caches, SDM completes the workflow 17-20% faster than staging methods. Its performance is even comparable to that of local storage: SDM is only 9% slower than local HDD storage and 18% slower than local SSD storage. Together, its performance and its ease of use make SDM an attractive platform for performing scientific workflows on remote datasets.
Citations: 3
Transparency by Design in eScience Research
Pub Date : 2019-09-01 DOI: 10.1109/eScience.2019.00055
Beth Plale
Both the landscape of eScience research and the environment in which the research is conducted are undergoing change. Transparency by design in eScience is proposed as a term to describe transparency in eScience practices, processes, methodologies, and research results. We break down different aspects of transparency and urge the eScience community towards a renewed commitment to scientific rigor because of the important role that we as scientists have to improve society and protect the good will that society has bestowed on science.
Citations: 2