
Data Intelligence: Latest Publications

Scaling Notebooks as Re-configurable Cloud Workflows
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-04-01 · DOI: 10.1162/dint_a_00140
Yuandou Wang, Spiros Koulouzis, Riccardo Bianchi, N. Li, Yifang Shi, J. Timmermans, W. Kissling, Zhiming Zhao
Abstract Literate computing environments, such as Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of their experiments in an integrated document. To scale up scientific analyses, many Jupyter environment architectures encapsulate whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). Existing solutions are still limited in several ways: 1) the workflow (or pipeline) is implicit in a notebook, and although some steps are generic enough to be reused by different code and executed in parallel, the tight cell structure forces all steps in a Jupyter notebook to execute sequentially and makes core code fragments hard to reuse; and 2) performance bottlenecks limit parallelism and scalability when handling extensive input data and complex computation. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portal components, 2) provide a composition tool for describing the workflow logic of those reusable components, and 3) automate execution on remote cloud infrastructure. Empirically, we validate the solution's usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible but needs further improvement, especially in integrating distributed workflow scheduling, automatic deployment, and execution, before it matures into a complete approach.
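The cell-as-service pattern at the heart of this approach can be illustrated with a minimal sketch, assuming Flask as the service layer; the route, payload shape, and the `tile_lidar` fragment are invented for illustration and are not the paper's actual tooling.

```python
# Minimal sketch of the "reusable cell as RESTful service" pattern: a core
# code fragment lifted from a notebook cell is exposed behind an HTTP
# endpoint so a workflow composer can call it, possibly in parallel.
from flask import Flask, jsonify, request

app = Flask(__name__)

def tile_lidar(points: list[list[float]], tile_size: float) -> dict:
    """Hypothetical reusable cell: bucket 3-D points into square tiles."""
    tiles: dict[tuple[int, int], list[list[float]]] = {}
    for x, y, z in points:
        key = (int(x // tile_size), int(y // tile_size))
        tiles.setdefault(key, []).append([x, y, z])
    # JSON cannot carry tuple keys, so stringify them for the response.
    return {f"{i},{j}": pts for (i, j), pts in tiles.items()}

@app.route("/cells/tile-lidar", methods=["POST"])
def tile_lidar_endpoint():
    payload = request.get_json()
    result = tile_lidar(payload["points"], payload.get("tile_size", 100.0))
    return jsonify(result)

if __name__ == "__main__":
    # Containerizing this service (e.g., with a small Dockerfile) would turn
    # it into a portal component that workflow logic can compose and scale.
    app.run(host="0.0.0.0", port=8080)
```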
Citations: 4
Analysis of Pioneering Computable Biomedical Knowledge Repositories and their Emerging Governance Structures
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-14 · DOI: 10.1162/dint_a_00148
P. Amara, M. Conte, Allen J. Flynn, Jodyn E. Platt, Grace Trinidad
Abstract A growing interest in producing and sharing computable biomedical knowledge artifacts (CBKs) is increasing the demand for repositories that validate, catalog, and provide shared access to CBKs. However, there is a lack of evidence on how best to manage and sustain CBK repositories. In this paper, we present the results of interviews with several pioneering CBK repository owners. These interviews were informed by the Trusted Repositories Audit and Certification (TRAC) framework. Insights gained from these interviews suggest that the organizations operating CBK repositories are somewhat new, that their initial approaches to repository governance are informal, and that achieving economic sustainability for their CBK repositories is a major challenge. To enable a learning health system to make better use of its data intelligence, future approaches to CBK repository management will require enhanced governance and closer adherence to best practice frameworks to meet the needs of myriad biomedical science and health communities. More effort is needed to find sustainable funding models for accessible CBK artifact collections.
Citations: 0
Canonical Workflows in Simulation-based Climate Sciences
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00127
I. Anders, Karsten Peters-von Gehlen, H. Thiemann
Abstract In this paper we present the derivation of Canonical Workflow Modules from current workflows in simulation-based climate science, in support of elaborating a corresponding framework for simulation-based research. We first identified the different users and user groups in simulation-based climate science based on their reasons for using the resources provided at the German Climate Computing Center (DKRZ). What is special here is that the DKRZ provides the climate science community with resources like high performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC); users can therefore perform their entire research workflows, up to the publication of the data, on the same infrastructure. Our analysis shows that the resources are used by two primary user types: those who require the HPC system to perform resource-intensive simulations and subsequently analyse them, and those who reuse, build on and analyse existing data. We then further subdivided these top-level user categories based on their specific goals and analysed the typical, idealised workflows applied to achieve the respective project goals. We find that, owing to the subdivision and finer granulation of the user groups, the workflows show clear differences. Nevertheless, similar "Canonical Workflow Modules" can clearly be identified. These modules are "Data and Software (Re)use", "Compute", "Data and Software Storing", "Data and Software Publication" and "Generating Knowledge", and in their entirety they form the basis for a Canonical Workflow Framework for Research (CWFR). It is desirable that parts of the workflows in a CWFR act as FAIR Digital Objects (FDOs), but we view this aspect critically. Also, we reflect on whether the derivation of Canonical Workflow Modules from the analysis of current user behaviour still holds for future systems and work processes.
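A minimal sketch of how the five named modules might compose, representing each module as a function over a shared context; the bodies are placeholders and the sequence is illustrative, not DKRZ's actual tooling.

```python
# Illustrative composition of the Canonical Workflow Modules named above.
from typing import Any, Callable

Context = dict[str, Any]
Module = Callable[[Context], Context]

def data_and_software_reuse(ctx: Context) -> Context:
    ctx["inputs"] = ["existing_dataset_id"]   # placeholder: resolve reusable data/software
    return ctx

def compute(ctx: Context) -> Context:
    ctx["results"] = "simulation_output"      # placeholder: run the HPC job
    return ctx

def data_and_software_storing(ctx: Context) -> Context:
    ctx["archived"] = True                    # placeholder: write to long-term storage
    return ctx

def data_and_software_publication(ctx: Context) -> Context:
    ctx["doi"] = "10.xxxx/placeholder"        # placeholder: mint an identifier
    return ctx

def generating_knowledge(ctx: Context) -> Context:
    ctx["figures"] = ["analysis.png"]         # placeholder: analysis and plots
    return ctx

# The paper finds the sequence differs per user group; only the module
# vocabulary is shared. This is one plausible ordering.
PIPELINE: list[Module] = [
    data_and_software_reuse,
    compute,
    data_and_software_storing,
    data_and_software_publication,
    generating_knowledge,
]

def run(pipeline: list[Module]) -> Context:
    ctx: Context = {}
    for module in pipeline:
        ctx = module(ctx)
    return ctx

if __name__ == "__main__":
    print(run(PIPELINE))
```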
Citations: 2
Reproducible Research Publication Workflow: A Canonical Workflow Framework and FAIR Digital Object Approach to Quality Research Output
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00133
Limor Peer, Claudia Biniossek, Dirk Betz, Thu-Mai Christian
Abstract In this paper we present the Reproducible Research Publication Workflow (RRPW) as an example of how generic canonical workflows can be applied to a specific context. The RRPW includes the essential steps between submission and final publication of a manuscript and the research artefacts (e.g., data and code) that underlie the scholarly claims in the manuscript. A key aspect of the RRPW is the inclusion of artefact review and metadata creation as part of the publication workflow. The paper discusses a formalized technical structure around a set of canonical steps that helps codify and standardize the process for researchers, curators, and publishers. The proposed application of canonical workflows can help improve transparency and reproducibility, increase FAIR compliance of all research artefacts at every step, and facilitate better exchange of annotated and machine-readable metadata.
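One way to picture the gating role that artefact review and metadata creation play in such a workflow is sketched below; the step names, checks, and record shapes are invented for illustration, not the paper's formal model.

```python
# Sketch: publication only proceeds once the artefact-review and
# metadata-creation steps have produced their records.
from dataclasses import dataclass, field

@dataclass
class Submission:
    manuscript: str
    artefacts: list
    records: dict = field(default_factory=dict)

def review_artefacts(sub: Submission) -> None:
    # placeholder check: code runs, data loads, results reproduce
    sub.records["artefact_review"] = {"status": "passed"}

def create_metadata(sub: Submission) -> None:
    # placeholder: machine-readable metadata for each artefact
    sub.records["metadata"] = [
        {"name": a, "license": "CC-BY-4.0"} for a in sub.artefacts
    ]

def publish(sub: Submission) -> str:
    required = {"artefact_review", "metadata"}
    missing = required - sub.records.keys()
    if missing:
        raise RuntimeError(f"cannot publish, missing: {missing}")
    return f"published {sub.manuscript} with {len(sub.artefacts)} artefacts"

sub = Submission("paper.pdf", ["analysis.R", "survey.csv"])
review_artefacts(sub)
create_metadata(sub)
print(publish(sub))
```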
Citations: 1
Using a Workflow Management Platform in Textual Data Management
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00139
T. Doan, S. Bingert, R. Yahyapour
Abstract The paper gives a brief introduction to the workflow management platform Flowable and how it is used for textual-data management. Flowable is relatively new, with its first release on 13 October 2016; despite its short time on the market, it has quickly gained attention, with 4.6 thousand stars on GitHub at the time of writing. The focus of our project is to build a platform for large-scale text analysis by including many different text resources. Currently, we have successfully connected to four different text resources and obtained more than one million works. Some resources are dynamic: they may add more data or modify their current data. Therefore, both the metadata and the raw data on our side must be kept up to date with the resources. In addition, to comply with the FAIR principles, each work is assigned a persistent identifier (PID) and indexed for searching purposes. In the last step, we perform some standard analyses on the data to enhance our search engine and to generate a knowledge graph. End-users can use our platform to search our data or access the knowledge graph. Furthermore, they can submit code for their analyses to the system; the code is executed on a High-Performance Cluster (HPC), and users receive the results later. Here, Flowable can take advantage of PIDs for digital object identification and management to facilitate communication with the HPC system. As one may already have noticed, the whole process can be expressed as a workflow. A workflow, including error handling and notification, has been created and deployed; workflow execution can be triggered manually or at predefined time intervals. According to our evaluation, the Flowable platform proves to be powerful and flexible, and further usage of the platform is already planned or implemented for many of our projects.
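Flowable itself is a Java BPMN engine, but deployed processes can also be started over its REST interface, which fits the service-oriented setup described here. A minimal sketch in Python, assuming the stock flowable-rest demo application at its default path and credentials, with an invented 'textIngest' process key and variables; none of this is the paper's actual configuration.

```python
# Start one instance of a hypothetical text-ingestion process via the
# Flowable REST API (POST runtime/process-instances).
import requests

FLOWABLE = "http://localhost:8080/flowable-rest/service"  # assumed deployment
AUTH = ("rest-admin", "test")  # assumed demo-app credentials

def start_text_ingest(resource_url: str) -> str:
    """Kick off a (hypothetical) process that harvests a text resource,
    assigns a PID, and indexes the work."""
    body = {
        "processDefinitionKey": "textIngest",  # assumed process key
        "variables": [
            {"name": "resourceUrl", "type": "string", "value": resource_url},
        ],
    }
    resp = requests.post(
        f"{FLOWABLE}/runtime/process-instances", json=body, auth=AUTH
    )
    resp.raise_for_status()
    return resp.json()["id"]

if __name__ == "__main__":
    instance_id = start_text_ingest("https://example.org/corpus/work-1")
    print("started process instance", instance_id)
```

A timer start event in the BPMN definition would cover the interval-triggered executions the abstract mentions; the REST call above corresponds to the manual trigger.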
Citations: 0
Canonical Workflows to Make Data FAIR
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00132
P. Wittenburg, A. Hardisty, Yann Le Franc, A. Mozaffari, Limor Peer, N. Skvortsov, Zhiming Zhao, A. Spinuso
Abstract The FAIR principles have been accepted globally as guidelines for improving data-driven science and data management practices, yet the incentives for researchers to change their practices are presently weak. In addition, data-driven science has been slow to embrace workflow technology despite clear evidence of recurring practices. To overcome these challenges, the Canonical Workflow Frameworks for Research (CWFR) initiative suggests a large-scale introduction of self-documenting workflow scripts to automate recurring processes or fragments thereof. This standardised approach, with FAIR Digital Objects as anchors, will be a significant milestone in the transition to FAIR data without adding additional load onto the researchers who stand to benefit most from it. This paper describes the CWFR approach and the activities of the CWFR initiative over the course of the last year or so, highlights several projects that hold promise for the CWFR approach, including Galaxy, Jupyter Notebook, and RO-Crate, and concludes with an assessment of the state of the field and the challenges ahead.
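What a "FAIR Digital Object as anchor" might look like when a self-documenting workflow step emits one is sketched below; the field names follow the general FDO idea (PID, registered type, metadata, payload reference) but are assumptions, not a published FDO schema.

```python
# Illustrative FDO-style record emitted by a self-documenting workflow step.
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class FairDigitalObject:
    pid: str                 # persistent identifier, e.g. a Handle
    fdo_type: str            # registered type a consumer can resolve
    payload_ref: str         # where the bit sequence lives
    metadata: dict = field(default_factory=dict)
    created: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def record_step_output(step_name: str, output_uri: str) -> FairDigitalObject:
    """What a canonical workflow step could emit so the run documents itself."""
    return FairDigitalObject(
        pid="hdl:21.T11148/placeholder",  # placeholder PID, not minted
        fdo_type="workflow-step-output",
        payload_ref=output_uri,
        metadata={"producedBy": step_name},
    )

if __name__ == "__main__":
    fdo = record_step_output("preprocess", "s3://bucket/run-42/clean.parquet")
    print(json.dumps(asdict(fdo), indent=2))
```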
Citations: 3
S-ProvFlow. Storing and Exploring Lineage Data as a Service
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00128
A. Spinuso, M. Atkinson, F. Magnoni
Abstract We present s-ProvFlow, a set of configurable Web services and interactive tools for managing and exploiting records that track data lineage during workflow runs. It facilitates detailed analysis of single executions, and it helps users manage complex tasks by exposing the relationships between data, people, equipment and workflow runs that are intended to combine productively. Its logical model extends the PROV standard to precisely record parallel data-streaming applications. Its metadata handling encourages users to capture the application context by specifying how application attributes, often drawn from standard vocabularies, should be added. These metadata records immediately help productivity, as the interactive tools support their use in selection and bulk operations, and users rapidly appreciate the power of the encoded semantics as they reap the benefits. This improves the quality of provenance for users and management, which in turn facilitates analysis of collections of runs, enabling users to manage results and validate procedures. It fosters reuse of data and methods and facilitates diagnostic investigations and optimisations. We present s-ProvFlow's use by scientists, research engineers and managers as part of the DARE hyper-platform as they create, validate and use their data-driven scientific workflows.
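The kind of PROV statements such a lineage model builds on can be sketched with the Python `prov` package; this is not the s-ProvFlow API, and the namespace, entities, and activity names are invented for illustration.

```python
# Minimal PROV lineage record: one processing step consumes a raw stream
# and produces a filtered one, with derivation and attribution links.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/run-17/")

raw = doc.entity("ex:raw-stream", {"ex:format": "miniSEED"})
filtered = doc.entity("ex:filtered-stream")
step = doc.activity("ex:bandpass-filter")
operator = doc.agent("ex:research-engineer")

doc.used(step, raw)                    # the step consumed the raw stream
doc.wasGeneratedBy(filtered, step)     # ...and produced the filtered one
doc.wasDerivedFrom(filtered, raw)      # derivation link for later queries
doc.wasAssociatedWith(step, operator)  # who ran it

print(doc.serialize(indent=2))         # PROV-JSON, ready to send to a store
```

A lineage service in the spirit of s-ProvFlow would persist such documents per run and let users query across collections of them; the parallel data-streaming extensions the abstract mentions go beyond what plain PROV expresses here.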
Citations: 0
The Specimen Data Refinery: A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00134
A. Hardisty, P. Brack, C. Goble, Laurence Livermore, Ben Scott, Q. Groom, S. Owen, S. Soiland-Reyes
Abstract A key limiting factor in organising and using information from physical specimens curated in natural science collections is making that information computable; institutional digitization tends to focus more on imaging the specimens themselves than on efficiently capturing computable data about them. Label data are still transcribed manually, at high cost and low throughput, a task many collection-holding institutions cannot sustain at current funding levels. We show how computer vision, optical character recognition, handwriting recognition, named entity recognition and language translation technologies can be implemented into canonical workflow component libraries with findable, accessible, interoperable, and reusable (FAIR) characteristics. These libraries are being developed in a cloud-based workflow platform, the 'Specimen Data Refinery' (SDR), founded on the Galaxy workflow engine, the Common Workflow Language, Research Object Crates (RO-Crate) and WorkflowHub technologies. The SDR can be applied to specimens' labels and other artefacts, offering the prospect of greatly accelerated and more accurate data capture in computable form. Two kinds of FAIR Digital Objects (FDOs) are created by packaging the outputs of SDR workflows and workflow components as digital objects with metadata, a persistent identifier, and a specific type definition. The first kind of FDO are computable Digital Specimen (DS) objects that can be consumed and produced by workflows and other applications. A single DS is the input data structure submitted to a workflow; each workflow component modifies it in turn to produce a refined DS at the end. The Specimen Data Refinery provides a library of such components that can be used individually or in series. To work together, each library component declares the fields it requires from the DS and the fields it will in turn populate or enrich. The second kind of FDO, RO-Crates, gather and archive the diverse set of digital and real-world resources, configurations, and actions (the provenance) contributing to a unit of research work, allowing that work to be faithfully recorded and reproduced. Here we describe the Specimen Data Refinery with its motivating requirements, focusing on what is essential in the creation of canonical workflow component libraries and its conformance with the requirements of an emerging FDO Core Specification being developed by the FDO Forum.
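The DS contract described here, where each component declares the fields it requires and the fields it enriches, can be sketched as follows; the component names, field names, and the stubbed OCR/NER outputs are invented for illustration.

```python
# Sketch of the Digital Specimen (DS) field contract: components refuse to
# run unless their required fields are present, then enrich the DS in turn.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DigitalSpecimen:
    fields: dict = field(default_factory=dict)

@dataclass
class Component:
    name: str
    requires: set
    provides: set
    run: Callable[[dict], dict]  # reads required fields, returns new ones

    def apply(self, ds: DigitalSpecimen) -> DigitalSpecimen:
        missing = self.requires - ds.fields.keys()
        if missing:
            raise ValueError(f"{self.name} missing fields: {missing}")
        ds.fields.update(self.run(ds.fields))
        return ds

ocr = Component(
    name="label-ocr",
    requires={"label_image"},
    provides={"label_text"},
    run=lambda f: {"label_text": "Quercus robur L., 1903"},  # stubbed OCR
)
ner = Component(
    name="named-entity-recognition",
    requires={"label_text"},
    provides={"taxon_name", "collection_year"},
    run=lambda f: {"taxon_name": "Quercus robur", "collection_year": "1903"},
)

ds = DigitalSpecimen(fields={"label_image": "specimen-001.png"})
for component in (ocr, ner):  # the refinery applies components in series
    ds = component.apply(ds)
print(ds.fields)
```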
Citations: 7
HPC-oriented Canonical Workflows for Machine Learning Applications in Climate and Weather Prediction
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00131
A. Mozaffari, M. Langguth, Bing Gong, Jessica Ahring, Adrian Rojas Campos, Pascal Nieters, Otoniel José Campos Escobar, M. Wittenbrink, P. Baumann, M. Schultz
Abstract Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power pave the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities are slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss FAIR Digital Objects (FDOs) and Research Objects (ROs) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages a raster datacube providing data harmonization and fast, scalable data access. We suggest the Jupyter notebook as the unit of a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to researchers via an easy-to-use graphical interface.
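The datacube access pattern the concept relies on can be sketched with xarray; the file name, variable, and coordinate ranges are invented, and a production setup might query a datacube server rather than a local file.

```python
# Illustrative datacube access: lazy, chunked reads give the fast and
# scalable access the abstract refers to.
import xarray as xr

# Open a (time, lat, lon) precipitation cube lazily with dask-backed chunks.
cube = xr.open_dataset("precipitation_cube.nc", chunks={"time": 24})

# Harmonized subsetting: one expression selects variable, region, and period.
subset = cube["precip"].sel(
    lat=slice(47.0, 55.0), lon=slice(5.0, 15.0),
    time=slice("2018-01-01", "2018-12-31"),
)

monthly_mean = subset.resample(time="1MS").mean()  # training-ready aggregate
print(monthly_mean)
```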
Citations: 3
A Workflow Demonstrator for Processing Catalysis Research Data
IF 3.9 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2022-03-07 · DOI: 10.1162/dint_a_00143
A. N. L. Hidalga, Donato Decarolis, Shaojun Xu, S. Matam, Willinton Yesid Hernández Enciso, Joseph B. Goodall, B. Matthews, C. Catlow
Abstract The UK Catalysis Hub (UKCH) is designing a virtual research environment to support data processing and analysis, the Catalysis Research Workbench (CRW). Developing this platform requires identifying the processing and analysis needs of the UKCH members and mapping them to potential solutions. This paper presents a proposal for a demonstrator that analyses the use of scientific workflows for large-scale data processing. The demonstrator provides a concrete target to promote further discussion of the processing and analysis needs of the UKCH community. In this paper, we discuss the main data-processing requirements elicited, the proposed adaptations that will be incorporated in the design of the CRW, and how to integrate the proposed solutions with existing UKCH practices. The demonstrator has been used in discussions with researchers and in presentations to the UKCH community, generating increased interest and motivating further development.
Citations: 4