SciInc: A Container Runtime for Incremental Recomputation
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00040
A. Youngdahl, Dai Hai Ton That, T. Malik
The conduct of reproducible science improves when computations are portable and verifiable. A container runtime provides an isolated environment for running computations and is thus useful for porting applications to new machines. Current container engines, such as LXC and Docker, however, do not track provenance, which is essential for verifying computations. In this paper, we present SciInc, a container runtime that tracks the provenance of computations during container creation. We show how container engines can use audited provenance data for efficient container replay. SciInc observes the inputs to computations and, if they change, propagates the changes, reusing memoized partial computations and data that are identical across the original run and the replay. We chose lightweight data structures for storing the provenance trace so that the container runtime remains shareable and portable. To determine the effectiveness of change propagation and memoization, we compared popular container technologies and incremental recomputation methods using published data analysis experiments.
{"title":"SciInc: A Container Runtime for Incremental Recomputation","authors":"A. Youngdahl, Dai Hai Ton That, T. Malik","doi":"10.1109/eScience.2019.00040","DOIUrl":"https://doi.org/10.1109/eScience.2019.00040","url":null,"abstract":"The conduct of reproducible science improves when computations are portable and verifiable. A container runtime provides an isolated environment for running computations and thus is useful for porting applications on new machines. Current container engines, such as LXC and Docker, however, do not track provenance, which is essential for verifying computations. In this paper, we present SciInc, a container runtime that tracks the provenance of computations during container creation. We show how container engines can use audited provenance data for efficient container replay. SciInc observes inputs to computations, and, if they change, propagates the changes, re-using partially memoized computations and data that are identical across replay and original run. We chose light-weight data structures for storing the provenance trace to maintain the invariant of shareable and portable container runtime. To determine the effectiveness of change propagation and memoization, we compared popular container technology and incremental recomputation methods using published data analysis experiments.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126609091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cyberinfrastructure Center of Excellence Pilot: Connecting Large Facilities Cyberinfrastructure
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00058
Ewa Deelman, Ryan Mitchell, Loïc Pottier, M. Rynge, Erik Scott, K. Vahi, Marina Kogan, Jasmine Mann, Tom Gulbransen, Daniel Allen, David Barlow, A. Mandal, Santiago Bonarrigo, Chris Clark, Leslie Goldman, Tristan Goulden, Phil Harvey, David Hulsander, Steve Jacobs, Christine Laney, Ivan Lobo-Padilla, Jeremy Sampson, Valerio Pascucci, John Staarmann, Steve Stone, Susan Sons, J. Wyngaard, Charles Vardeman, Steve Petruzza, I. Baldin, L. Christopherson
The National Science Foundation's Large Facilities are major, multi-user research facilities that operate and manage sophisticated and diverse research instruments and platforms (e.g., large telescopes, interferometers, distributed sensor arrays) that serve a variety of scientific disciplines, from astronomy and physics to geology and biology and beyond. Large Facilities are increasingly dependent on advanced cyberinfrastructure (i.e., computing, data, and software systems; networking; and associated human capital) to enable the broad delivery and analysis of facility-generated data. These cyberinfrastructure tools enable scientists and the public to gain new insights into fundamental questions about the structure and history of the universe, the world we live in today, and how our environment may change in the coming decades. This paper describes a pilot project that aims to develop a model for a Cyberinfrastructure Center of Excellence (CI CoE) that facilitates community building and knowledge sharing and that disseminates and applies best practices and innovative solutions for facility CI.
{"title":"Cyberinfrastructure Center of Excellence Pilot: Connecting Large Facilities Cyberinfrastructure","authors":"Ewa Deelman, Ryan Mitchell, Loïc Pottier, M. Rynge, Erik Scott, K. Vahi, Marina Kogan, Jasmine Mann, Tom Gulbransen, Daniel Allen, David Barlow, A. Mandal, Santiago Bonarrigo, Chris Clark, Leslie Goldman, Tristan Goulden, Phil Harvey, David Hulsander, Steve Jacobs, Christine Laney, Ivan Lobo-Padilla, Jeremy Sampson, Valerio Pascucci, John Staarmann, Steve Stone, Susan Sons, J. Wyngaard, Charles Vardeman, Steve Petruzza, I. Baldin, L. Christopherson","doi":"10.1109/eScience.2019.00058","DOIUrl":"https://doi.org/10.1109/eScience.2019.00058","url":null,"abstract":"The National Science Foundation's Large Facilities are major, multi-user research facilities that operate and manage sophisticated and diverse research instruments and platforms (e.g., large telescopes, interferometers, distributed sensor arrays) that serve a variety of scientific disciplines, from astronomy and physics to geology and biology and beyond. Large Facilities are increasingly dependent on advanced cyberinfrastructure (i.e., computing, data, and software systems; networking; and associated human capital) to enable the broad delivery and analysis of facility-generated data. These cyberinfrastructure tools enable scientists and the public to gain new insights into fundamental questions about the structure and history of the universe, the world we live in today, and how our environment may change in the coming decades. This paper describes a pilot project that aims to develop a model for a Cyberinfrastructure Center of Excellence (CI CoE) that facilitates community building and knowledge sharing and that disseminates and applies best practices and innovative solutions for facility CI.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128539058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Identification and Process Monitoring for Reproducible Earth Observation Research
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00011
Bernhard Gößwein, Tomasz Miksa, A. Rauber, W. Wagner
Earth observation researchers use specialised computing services for satellite image processing offered by various data backends. The source of data is often the same, for example the Sentinel-2 satellites operated by Copernicus, but how the data are pre-processed, corrected, updated, and later analysed may differ among the backends. Backends often lack mechanisms for data versioning; for example, data corrections are not tracked. Furthermore, the evolving software stack used for data processing remains a black box to researchers. Researchers have no means to identify why executions of the same code deliver different results. This hinders the reproducibility of earth observation experiments. In this paper, we present how the infrastructure of existing earth observation data backends can be modified to support reproducibility. The proposed extensions are based on the recommendations of the Research Data Alliance regarding data identification and on the VFramework for automated process provenance documentation. We implemented these extensions at the Earth Observation Data Centre, a partner in the openEO consortium. We evaluated the solution on a variety of usage scenarios, also providing performance and storage measurements to assess the impact of the modifications. The results indicate that reproducibility can be supported with minimal performance and storage overhead.
{"title":"Data Identification and Process Monitoring for Reproducible Earth Observation Research","authors":"Bernhard Gößwein, Tomasz Miksa, A. Rauber, W. Wagner","doi":"10.1109/eScience.2019.00011","DOIUrl":"https://doi.org/10.1109/eScience.2019.00011","url":null,"abstract":"Earth observation researchers use specialised computing services for satellite image processing offered by various data backends. The source of data is often the same, for example Sentinel-2 satellites operated by Copernicus, but the way how data is pre-processed, corrected, updated, and later analysed may differ among the backends. Backends often lack mechanisms for data versioning, for example, data corrections are not tracked. Furthermore, an evolving software stack used for data processing remains a black box to researchers. Researchers have no means to identify why executions of the same code deliver different results. This hinders reproducibility of earth observation experiments. In this paper, we present how infrastructure of existing earth observation data backends can be modified to support reproducibility. The proposed extensions are based on recommendations of the Research Data Alliance regarding data identification and the VFramework for automated process provenance documentation. We implemented these extensions at the Earth Observation Data Centre, a partner in the openEO consortium. We evaluated the solution on a variety of usage scenarios, providing also performance and storage measures to evaluate the impact of the modifications. The results indicate reproducibility can be supported with minimal performance and storage overhead.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126845969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The International Forest Risk Model (INFORM): A Method for Assessing Supply Chain Deforestation Risk with Imperfect Data
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00009
N. Caithness, Cécile Lachaux, D. Wallom
A method is presented for quantifiably estimating the deforestation risk exposure of agricultural Forest Risk Commodities in commercial supply chains. The model consists of a series of equations applied to end-to-end data representing quantitative descriptors of the supply chain and its effect on deforestation. A robust penalty is included for historical deforestation, together with a corresponding reward for reductions in the rate of deforestation. The INternational FOrest Risk Model (INFORM) is a data analysis method that answers a particular question for any Forest Risk Commodity in a supply chain: what is its cumulative deforestation risk exposure? To illustrate the methodology, a case study is described and calculated for a livestock producer in France who sources soya-based animal feed from Brazil and wishes to document the deforestation risk associated with the product. Building on this example, the future applicability of INFORM within emerging supply-chain transparency initiatives is discussed, including clear shortcomings in the method and how it may also be used to motivate the production of better data by those who may be the subject of its analysis.
{"title":"The International Forest Risk Model (INFORM): A Method for Assessing Supply Chain Deforestation Risk with Imperfect Data","authors":"N. Caithness, Cécile Lachaux, D. Wallom","doi":"10.1109/eScience.2019.00009","DOIUrl":"https://doi.org/10.1109/eScience.2019.00009","url":null,"abstract":"A method for quantifiably estimating the deforestation risk exposure of agricultural Forest Risk Commodities in commercial supply chains is presented. The model consists of a series of equations applied using end-to-end data representing quantitative descriptors of the supply chain and its effect on deforestation. A robust penalty is included for historical deforestation and a corresponding reward for reductions in the rate of deforestation. The INternational FOrest Risk Model (INFORM) is a method for data analysis that answers a particular question for any Forest Risk Commodity in a supply chain: what is its cumulative deforestation risk exposure? To illustrate the methodology a case study of a livestock producer in France who sources soya-based animal feed from Brazil and wishes to document the deforestation risk associated with the product is described and calculated. Building on this example a discussion of the future applicability of INFORM within emerging supply-chain transparency initiatives is made including describing clear shortcomings in the method and how it may also be used to motivate the production of better data by those that may be subject of its analysis.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121692432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Learning Yields Better Training Data for Scientific Named Entity Recognition
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00021
Roselyne B. Tchoua, Aswathy Ajith, Zhi Hong, Logan T. Ward, K. Chard, Debra J. Audus, Shrayesh Patel, Juan J. de Pablo, Ian T Foster
Despite significant progress in natural language processing, machine learning models require substantial expert-annotated training data to perform well in tasks such as named entity recognition (NER) and entity relation extraction. Furthermore, NER is often more complicated when working with scientific text. For example, in polymer science, chemical structure may be encoded using nonstandard naming conventions, the same concept can be expressed using many different terms (synonymy), and authors may refer to polymers with ad hoc labels. These challenges, which are not unique to polymer science, make it difficult to generate training data, as specialized skills are needed to label text correctly. We previously designed PolyNER, a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based classifier. PolyNER facilitates a labeling process that is otherwise tedious and expensive. Here, we use active learning to obtain more annotations from experts efficiently and improve performance. Our approach requires just five hours of expert time to achieve discrimination capacity comparable to that of a state-of-the-art chemical NER toolkit.
{"title":"Active Learning Yields Better Training Data for Scientific Named Entity Recognition","authors":"Roselyne B. Tchoua, Aswathy Ajith, Zhi Hong, Logan T. Ward, K. Chard, Debra J. Audus, Shrayesh Patel, Juan J. de Pablo, Ian T Foster","doi":"10.1109/eScience.2019.00021","DOIUrl":"https://doi.org/10.1109/eScience.2019.00021","url":null,"abstract":"Despite significant progress in natural language processing, machine learning models require substantial expertannotated training data to perform well in tasks such as named entity recognition (NER) and entity relations extraction. Furthermore, NER is often more complicated when working with scientific text. For example, in polymer science, chemical structure may be encoded using nonstandard naming conventions, the same concept can be expressed using many different terms (synonymy), and authors may refer to polymers with ad-hoc labels. These challenges, which are not unique to polymer science, make it difficult to generate training data, as specialized skills are needed to label text correctly. We have previously designed polyNER, a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based classifier. PolyNER facilitates a labeling process that is otherwise tedious and expensive. Here, we use active learning to efficiently obtain more annotations from experts and improve performance. Our approach requires just five hours of expert time to achieve discrimination capacity comparable to that of a state-of-the-art chemical NER toolkit.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124351486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Expanding Library Resources for Data and Compute-Intensive Education and Research
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00100
S. Labou, Reid Otsuji
As reproducible research tools and skills become increasingly in demand across disciplines, so too does the need for innovative and collaborative training. While some academic departments incorporate software such as R or Python in coursework and research, many students remain reliant on self-teaching to gain the skills needed to work with their data. However, given the growing number of students interested in computational tools and resources for research automation, relying on student self-teaching is not an efficient way to train the next generation of scholars. To address the educational need for computational thinking and learning across academic departments on campus, the UC San Diego Library has been running Software Carpentry workshops (two-day bootcamps that introduce foundational programming concepts and best practices) since 2015. The Library, as a discipline-agnostic entity with a history of serving as a trusted resource for information, has been well positioned to provide training for this new era of research methodology. The core of our success is the collaboration with the growing community of Software and Data Carpentry instructors at UC San Diego with expertise in various research disciplines. Building on this strong partnership and leveraging the Library’s resources and expertise in digital literacy, the campus can better support data-driven and technologically focused education and research.
{"title":"Expanding Library Resources for Data and Compute-Intensive Education and Research","authors":"S. Labou, Reid Otsuji","doi":"10.1109/eScience.2019.00100","DOIUrl":"https://doi.org/10.1109/eScience.2019.00100","url":null,"abstract":"As reproducible research tools and skills become increasingly in-demand across disciplines, so too does the need for innovative and collaborative training. While some academic departments incorporate software like R or Python in coursework and research, many students remain reliant on self-teaching in order to gain the necessary skills to work with their data. However, given the growing number of students interested in computational tools and resources for research automation, relying on student self-teaching and learning is not an efficient method for training the next generation of scholars. To address the educational need for computational thinking and learning across various academic departments on campus, the UC San Diego Library has been running Software Carpentry workshops (two day bootcamps to introduce foundational programming concepts and best practices) since 2015. The Library, as a discipline-agnostic entity with a history of serving as a trusted resource for information, has been well positioned to provide training for this new era of research methodology. The core of our success is the collaboration with the growing community of Software and Data Carpentry instructors at UC San Diego with expertise in various research disciplines. Building on this strong partnership and leveraging the Library’s resources and expertise in digital literacy, the campus can better support data-driven and technologically-focused education and research.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126348052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Encoding in Lossless Prediction-Based Compression Algorithms
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00032
Ugur Çayoglu, Frank Tristram, Jörg Meyer, J. Schröter, T. Kerzenmacher, P. Braesicke, A. Streit
The increase in compute power and the development of sophisticated simulation models with higher-resolution output trigger a need for compression algorithms for scientific data. Several compression algorithms are currently under development. Most of them are prediction-based: each value is predicted, and the residual between the prediction and the true value is saved to disk. Currently, there are two established forms of residual calculation: exclusive-or and numerical difference. In this paper, we summarize both techniques and show their strengths and weaknesses. We show that shifting the prediction and the true value to a binary number with certain properties results in a better compression factor at minimal additional computational cost. This gain in compression factor allows the use of less sophisticated prediction algorithms to achieve higher throughput during compression and decompression. In addition, we introduce a new encoding scheme that achieves a 9% increase in compression factor on average compared to the current state of the art.
{"title":"Data Encoding in Lossless Prediction-Based Compression Algorithms","authors":"Ugur Çayoglu, Frank Tristram, Jörg Meyer, J. Schröter, T. Kerzenmacher, P. Braesicke, A. Streit","doi":"10.1109/eScience.2019.00032","DOIUrl":"https://doi.org/10.1109/eScience.2019.00032","url":null,"abstract":"The increase in compute power and development of sophisticated simulation models with higher resolution output triggers a need for compression algorithms for scientific data. Several compression algorithms are currently under development. Most of these algorithms are using prediction-based compression algorithms, where each value is predicted and the residual between the prediction and true value is saved on disk. Currently there are two established forms of residual calculation: Exclusive-or and numerical difference. In this paper we will summarize both techniques and show their strengths and weaknesses. We will show that shifting the prediction and true value to a binary number with certain properties results in a better compression factor with minimal additional computational costs. This gain in compression factor allows for the usage of less sophisticated prediction algorithms to achieve a higher throughput during compression and decompression. In addition, we will introduce a new encoding scheme to achieve an 9% increase in compression factor on average compared to the current state-of-the-art.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132222961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Photon Propagation using GPUs by the IceCube Neutrino Observatory
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00050
D. Chirkin, J. C. Díaz-Vélez, C. Kopper, A. Olivas, B. Riedel, M. Rongen, D. Schultz, J. Santen
The IceCube Neutrino Observatory is a cubic-kilometer neutrino detector located at the South Pole, designed to detect high-energy astrophysical neutrinos. To thoroughly understand the detected neutrinos and their properties, the detector response to simulated signal and background has to be modeled using Monte Carlo techniques. An integral part of these studies is the optical properties of the ice into which the observatory is built. The propagation of individual photons from particles produced by neutrino interactions in the ice can be greatly accelerated using graphics processing units (GPUs). In this paper, we describe how we perform the photon propagation and how we create a global pool of GPU resources for both production and individual users.
{"title":"Photon Propagation using GPUs by the IceCube Neutrino Observatory","authors":"D. Chirkin, J. C. Díaz-Vélez, C. Kopper, A. Olivas, B. Riedel, M. Rongen, D. Schultz, J. Santen","doi":"10.1109/eScience.2019.00050","DOIUrl":"https://doi.org/10.1109/eScience.2019.00050","url":null,"abstract":"IceCube Neutrino Observatory is a cubic kilometer neutrino detector located at the South Pole designed to detect high-energy astrophysical neutrinos. To thoroughly understand the detected neutrinos and their properties, the detector response to simulated signal and background has to be modeled using Monte Carlo techniques. An integral part of these studies are the optical properties of the ice the observatory is built into. The propagation of individual photons from particles produced by neutrino interactions in the ice can be greatly accelerated using graphics processing units (GPUs). In this paper, we will describe how we perform the photon propagation and create a global pool of GPU resources for both production and individual users.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131788875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Increasing Life Science Resources Re-Usability using Semantic Web Technologies
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00031
Marine Louarn, F. Chatonnet, Xavier Garnier, T. Fest, A. Siegel, O. Dameron
In the life sciences, current standardization and integration efforts are directed towards reference data and knowledge bases. However, the results of original studies are generally provided in non-standardized, study-specific formats. In addition, the formalization of analysis pipelines is often limited to textual descriptions in the methods sections. Both factors impair the reproducibility of results, their maintenance, and their reuse for advancing other studies. Semantic Web technologies have proven effective for facilitating the integration and reuse of reference data and knowledge bases. We thus hypothesize that Semantic Web technologies also facilitate the reproducibility and reuse of life-science studies involving pipelines that compute associations between entities according to intermediary relations and dependencies. To assess this hypothesis, we considered a case study in systems biology (http://regulatorycircuits.org), which provides tissue-specific regulatory interaction networks to elucidate perturbations across complex diseases. Our approach consisted of surveying the complete set of provided supplementary files to reveal the underlying structure between the biological entities described in the data. We relied on this structure and used Semantic Web technologies (i) to integrate the Regulatory Circuits data, and (ii) to formalize the analysis pipeline as SPARQL queries. The result was a dataset of 335,429,988 triples on which two SPARQL queries were sufficient to extract each tissue-specific regulatory network.
{"title":"Increasing Life Science Resources Re-Usability using Semantic Web Technologies","authors":"Marine Louarn, F. Chatonnet, Xavier Garnier, T. Fest, A. Siegel, O. Dameron","doi":"10.1109/eScience.2019.00031","DOIUrl":"https://doi.org/10.1109/eScience.2019.00031","url":null,"abstract":"In life sciences, current standardization and integration efforts are directed towards reference data and knowledge bases. However, original studies results are generally provided in non standardized and specific formats. In addition, the only formalization of analysis pipelines is often limited to textual descriptions in the method sections. Both factors impair the results reproducibility, their maintenance and their reuse for advancing other studies. Semantic Web technologies have proven their efficiency for facilitating the integration and reuse of reference data and knowledge bases. We thus hypothesize that Semantic Web technologies also facilitate reproducibility and reuse of life sciences studies involving pipelines that compute associations between entities according to intermediary relations and dependencies. In order to assess this hypothesis, we considered a case-study in systems biology (http://regulatorycircuits.org), which provides tissue-specific regulatory interaction networks to elucidate perturbations across complex diseases. Our approach consisted in surveying the complete set of provided supplementary files to reveal the underlying structure between the biological entities described in the data. We relied on this structure and used Semantic Web technologies (i) to integrate the Regulatory Circuits data, and (ii) to formalize the analysis pipeline as SPARQL queries. Our result was a 335,429,988 triples dataset on which two SPARQL queries were sufficient to extract each single tissuespecific regulatory network.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131421232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
dislib: Large Scale High Performance Machine Learning in Python
Pub Date: 2019-09-01 | DOI: 10.1109/eScience.2019.00018
J. Á. Cid-Fuentes, S. Solà, Pol Álvarez, A. Castro-Ginard, Rosa M. Badia
In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard to use by non-experts, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library on top of the PyCOMPSs programming model that addresses the limitations of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster and can process data sets up to 16 times larger than other popular distributed machine learning libraries, such as MLlib. In addition, we show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.
{"title":"dislib: Large Scale High Performance Machine Learning in Python","authors":"J. '. Cid-Fuentes, S. Solà, Pol Álvarez, A. Castro-Ginard, Rosa M. Badia","doi":"10.1109/eScience.2019.00018","DOIUrl":"https://doi.org/10.1109/eScience.2019.00018","url":null,"abstract":"In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard to use by non-experts, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists to exploit the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library on top of PyCOMPSs programming model that addresses the issues of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster, and can process data sets up to 16 times larger than other popular distributed machine learning libraries, such as MLlib. In addition to this, we also show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.","PeriodicalId":142614,"journal":{"name":"2019 15th International Conference on eScience (eScience)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115795067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}