Boran Sekeroglu, Y. K. Ever, Kamil Dimililer, F. Al-Turjman
Abstract Artificial intelligence and machine learning applications are of significant importance in almost every field of human life, whether to solve problems or to support human experts. However, determining which machine learning model will achieve a superior result for a particular problem among the wide range of real-life application areas is still a challenging task for researchers. The success of a model can be affected by several factors, such as dataset characteristics, training strategy, and model responses. Therefore, a comprehensive analysis is required to determine model ability and the efficiency of the considered strategies. This study implemented ten benchmark machine learning models on seventeen varied datasets. Experiments were performed using four training strategies: 60:40, 70:30, and 80:20 hold-out splits and five-fold cross-validation. We used three metrics to evaluate the experimental results: mean squared error, mean absolute error, and coefficient of determination (R2 score). Each considered model is analyzed, and its advantages, disadvantages, and data dependencies are indicated. Across this extensive set of experiments, the deep Long Short-Term Memory (LSTM) neural network outperformed the other considered models, namely decision tree, linear regression, support vector regression with linear and radial basis function kernels, random forest, gradient boosting, extreme gradient boosting, shallow neural network, and deep neural network. It has also been shown that cross-validation has a tremendous impact on the results and should be considered for model evaluation in regression studies where data mining or selection is not performed.
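The evaluation protocol described above, three hold-out ratios plus five-fold cross-validation scored with MSE, MAE, and R2, can be sketched as follows. This is an illustrative sketch only: it uses synthetic data and two stand-in scikit-learn models, not the paper's seventeen datasets or its full set of ten models.

```python
# Illustrative sketch (not the paper's exact setup): comparing hold-out splits
# against five-fold cross-validation with the three reported metrics, on
# synthetic data and two of the ten listed model families.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
models = {"linear": LinearRegression(),
          "random_forest": RandomForestRegressor(n_estimators=50, random_state=0)}

scores = {}  # (model, strategy) -> (MSE, MAE, R2)
for name, model in models.items():
    # Hold-out strategies: 60:40, 70:30, and 80:20 train/test splits.
    for test_size in (0.4, 0.3, 0.2):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=0)
        pred = model.fit(X_tr, y_tr).predict(X_te)
        scores[name, f"hold-out {test_size}"] = (
            mean_squared_error(y_te, pred),
            mean_absolute_error(y_te, pred),
            r2_score(y_te, pred))
    # Five-fold cross-validation: metrics averaged over the folds.
    folds = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        pred = model.fit(X[tr], y[tr]).predict(X[te])
        folds.append((mean_squared_error(y[te], pred),
                      mean_absolute_error(y[te], pred),
                      r2_score(y[te], pred)))
    scores[name, "5-fold CV"] = tuple(np.mean(folds, axis=0))

for key, (mse, mae, r2) in scores.items():
    print(key, f"MSE={mse:.1f} MAE={mae:.1f} R2={r2:.3f}")
```

Comparing the per-split scores against the fold-averaged CV scores makes the abstract's point concrete: a single hold-out split can over- or under-state a model's quality relative to the cross-validated estimate.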
Comparative Evaluation and Comprehensive Analysis of Machine Learning Models for Regression Problems. Data Intelligence 4(1): 620–652, July 2022. doi:10.1162/dint_a_00155
Abstract Research on graph pattern matching (GPM) has attracted a lot of attention. However, most of it has focused on complex networks, and there is little research on GPM in the medical field. Hence, this paper applies GPM to support breast cancer-oriented diagnosis before surgery. Technically, this paper first gives a new definition of GPM, aiming to explore GPM in the medical field, especially in Medical Knowledge Graphs (MKGs). Then, for the matching process itself, this paper introduces fuzzy calculation and proposes a multi-threaded bidirectional routing exploration (M-TBRE) algorithm, based on depth-first search together with a two-way routing matching strategy that exploits multi-threading. In addition, fuzzy constraints are introduced into the M-TBRE algorithm, which leads to the Fuzzy-M-TBRE algorithm. Experimental results on two datasets show that, compared with existing algorithms, our proposed algorithm is more efficient and effective.
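The core idea of fuzzy-constrained pattern matching can be illustrated with a deliberately tiny toy. This is a hypothetical simplification, not the paper's M-TBRE algorithm: the graph, the membership function, and the threshold below are all invented for illustration. It shows only the central mechanism, a depth-first search over pattern edges that prunes any candidate binding whose fuzzy membership degree falls below the edge's threshold.

```python
# Toy fuzzy-constrained matcher (NOT the paper's M-TBRE algorithm).
# Data graph: node -> list of (neighbor, edge weight); the weights could encode
# e.g. association strengths in a medical knowledge graph (hypothetical data).
graph = {
    "p1": [("tumor", 0.9), ("pain", 0.4)],
    "p2": [("tumor", 0.3)],
    "tumor": [],
    "pain": [],
}

def strong(w):
    """Fuzzy membership for 'strongly associated': a simple ramp on [0.2, 0.8]."""
    return max(0.0, min(1.0, (w - 0.2) / 0.6))

# Pattern: list of edges (src_var, dst_var, membership_fn, threshold).
pattern = [("X", "Y", strong, 0.8)]

def match(edges, assign):
    """DFS over pattern edges, binding variables to graph nodes and pruning
    branches that violate a fuzzy constraint."""
    if not edges:
        yield assign
        return
    (u, v, mu, theta), rest = edges[0], edges[1:]
    sources = [assign[u]] if u in assign else list(graph)
    for src in sources:
        for dst, w in graph[src]:
            if v in assign and assign[v] != dst:
                continue          # inconsistent with an earlier binding
            if mu(w) < theta:
                continue          # fuzzy constraint violated: prune this branch
            yield from match(rest, {**assign, u: src, v: dst})

results = list(match(pattern, {}))
print(results)
```

Here only the edge p1→tumor (weight 0.9, membership 1.0) survives the 0.8 threshold; the weaker edges are pruned during the search rather than filtered afterwards, which is what makes constraint-aware exploration cheaper than match-then-filter.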
Fuzzy-Constrained Graph Pattern Matching in Medical Knowledge Graphs. Lei Li, Xun Du, Zan Zhang, Zhenchao Tao. Data Intelligence 4(1): 599–619, July 2022. doi:10.1162/dint_a_00153
Danyang Hu, Meng Wang, Feng Gao, Fangfang Xu, J. Gu
Abstract Temporal information is pervasive and crucial in medical records and other clinical text, as it traces the development of medical conditions and is vital for clinical decision making. However, providing a holistic knowledge representation and reasoning framework for the various time expressions in clinical text is challenging. To capture complex temporal semantics in clinical text, we propose a novel Clinical Time Ontology (CTO) as an extension of the OWL framework. More specifically, we identified eight time-related problems in clinical text and created 11 core temporal classes to conceptualize fuzzy time, cyclic time, irregular time, negations, and other complex aspects of clinical time. Then, we extended Allen's and TEO's temporal relations and defined relation concept descriptions between complex and simple time. We also provide formulaic and graphical presentations of complex time and complex time relationships. We carried out an empirical study of the expressiveness and usability of CTO using real-world healthcare datasets. Finally, experimental results demonstrate that CTO can faithfully represent and reason over 93% of the temporal expressions, and that it covers a wider range of time-related classes in the clinical domain.
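The classical layer that CTO extends is Allen's interval algebra: any pair of crisp intervals stands in exactly one of 13 basic relations. The sketch below classifies such a pair; it is an illustration of that classical layer only, not of CTO itself, which adds fuzzy, cyclic, and irregular clinical time on top of it. The example intervals are invented.

```python
# Illustrative sketch of Allen's 13 basic interval relations (the crisp layer
# that CTO extends). Endpoints may be any comparable values: numbers, dates, etc.
from datetime import date

def allen(s1, e1, s2, e2):
    """Return the Allen relation of interval [s1, e1] with respect to [s2, e2]."""
    if e1 < s2: return "before"
    if e2 < s1: return "after"
    if e1 == s2: return "meets"
    if e2 == s1: return "met-by"
    if (s1, e1) == (s2, e2): return "equal"
    if s1 == s2: return "starts" if e1 < e2 else "started-by"
    if e1 == e2: return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2: return "during"
    if s1 < s2 and e2 < e1: return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"

# Hypothetical clinical example: a course of medication overlapping the start
# of a hospital stay.
medication = (date(2022, 1, 1), date(2022, 1, 10))
stay = (date(2022, 1, 5), date(2022, 2, 1))
print(allen(*medication, *stay))  # overlaps
```

Exactly one branch fires for any pair of well-formed intervals, which is what makes the 13 relations jointly exhaustive and mutually exclusive, a property an ontology can rely on when reasoning over temporal assertions.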
Knowledge Representation and Reasoning for Complex Time Expression in Clinical Text. Data Intelligence 4(1): 573–598, July 2022. doi:10.1162/dint_a_00152
Abstract A recurring pattern of activities is observable during clinical research work: access to existing databases, data analyses, formulation of new hypotheses, use of an experimental design, institutional review board approvals, and data collection, curation, and storage within trusted digital repositories. The workflows that support the repeated nature of these activities can be described as a Canonical Workflow Framework for Research (CWFR). Disease-area clinical research is protocol specific, and during data collection the electronic case report forms can use Common Data Elements (CDEs), which have precisely defined questions and are associated with specified values as responses. The CDE-based CWFR is integrated with a biomedical research informatics computing system, which consists of a complete stack of technical layers including the Protocol and Form Research Management System. The unique data dictionaries associated with the CWFR for Traumatic Brain Injury and Parkinson's Disease resulted in the development of the Federal Interagency Traumatic Brain Injury and Parkinson's Disease Biomarker systems. Because they share a canonical workflow, these two systems can use similar tools, applications, and service modules to create findable, accessible, interoperable, and reusable Digital Objects. The Digital Objects for Traumatic Brain Injury and Parkinson's Disease contain all relevant information needed from the time data is collected, validated, and maintained within a storage repository for future access. All Traumatic Brain Injury and Parkinson's Disease studies can be shared as Research Objects, which are produced by aggregating related resources into information packages and are findable on the Internet using unique identifiers. Overall, the integration of CWFR with an informatics system has resulted in the reuse of software applications across several National Institutes of Health-supported biomedical research programs.
The Integration of a Canonical Workflow Framework with an Informatics System for Disease Area Research. V. Navale, Matthew McAuliffe. Data Intelligence 4(1): 186–195, April 2022. doi:10.1162/dint_a_00125
T. Jejkal, Sabrine Chelbi, A. Pfeil, P. Wittenburg
Abstract In the Canonical Workflow Framework for Research (CWFR), “packages” are relevant in two different senses. In data science, workflows are generally executed on a set of files that has been aggregated for a specific purpose, such as training a model in deep learning. We call this type of “package” a data collection; its aggregation and metadata description are motivated by research interests. The other type of “package” relevant for CWFR is meant to represent workflows in a self-describing and self-contained way for later execution. In this paper, we review different packaging technologies and investigate their usability in the context of CWFR. For this purpose, we draw on an exemplary use case and show how packaging technologies can support its realization. We conclude that packaging technologies of different flavors help to provide inputs and outputs for workflow steps in a machine-readable way, as well as to represent a workflow and all its artifacts in a self-describing and self-contained way.
Evaluation of Application Possibilities for Packaging Technologies in Canonical Workflows. Data Intelligence 4(1): 372–385, April 2022. doi:10.1162/dint_a_00137
Beatriz Serrano-Solano, A. Fouilloux, Ignacio Eguinoa, Matúš Kalaš, B. Grüning, Frederik Coppens
Abstract Despite recent encouragement to follow the FAIR principles, day-to-day research practices have not changed substantially. Due to new developments and the increasing pressure to apply best practices, initiatives to improve the efficiency and reproducibility of scientific workflows are becoming more prevalent. In this article, we discuss the importance of well-annotated tools and the specific requirements needed to ensure reproducible research with FAIR outputs. We detail how Galaxy, an open-source workflow management system with a web-based interface, has implemented the concepts put forward by the Canonical Workflow Framework for Research (CWFR) whilst minimising changes to the practices of scientific communities. Although we showcase concrete applications from two different domains, this approach is generalisable to any domain and particularly useful in interdisciplinary research and science-based applications.
Galaxy: A Decade of Realising CWFR Concepts. Data Intelligence 4(1): 358–371, April 2022. doi:10.1162/dint_a_00136
P. Wittenburg, A. Hardisty, Amirpasha Mozzafari, Limor Peer, N. Skvortsov, A. Spinuso, Zhiming Zhao
1Gemeindweg 55, 47533 Kleve, Germany 2Cardiff University, Cardiff, South Glamorgan , CF14 3UX, Wales, UK 3Forschungszentrum Jülich GmbH, 52425 Jülich, Germany 4Institution for Social and Policy Studies, Yale University, New Haven, CT 06520, USA 5Vavilov 44/2, 121351 Moscow, Russia 6Utrechtseweg 297, 3731 GA De Bilt, the Netherlands 7University of Amsterdam, PO-Box 94323, 1090 GH Amsterdam, the Netherlands
Editors' Note: Special Issue on Canonical Workflow Frameworks for Research. Data Intelligence 4(1): 149–154, April 2022. doi:10.1162/dint_e_00122
Dirk Betz, Claudia Biniossek, Christophe Blanchi, Felix Henninger, T. Lauer, P. Wieder, P. Wittenburg, M. Zünkeler
Abstract The overall expectation of introducing a Canonical Workflow for Experimental Research and FAIR Digital Objects (FDOs) can be summarised as reducing the gap between workflow technology and research practices, making experimental work more efficient and improving FAIRness without adding administrative load on the researchers. In this document we describe, with the help of an example, how CWFR could work in detail and improve research procedures. We have chosen the example of “experiments with human subjects”, which stretches from planning an experiment to storing the collected data in a repository. While we focus on experiments with human subjects, we are convinced that CWFR can be applied to many other experiment-based data generation processes. The main challenge is to identify repeating patterns in existing research practices that can be abstracted to create CWFR. We include detailed examples from different disciplines to demonstrate that CWFR can be implemented without violating specific disciplinary or methodological requirements. We do not claim to be comprehensive in all aspects, since these examples are meant to prove the concept of CWFR.
Canonical Workflow for Experimental Research. Data Intelligence 4(1): 155–172, April 2022. doi:10.1162/dint_a_00123
Abstract There is a huge gap between (1) the state of workflow technology and the practices of the many labs working with data-driven methods, and (2) the awareness of the FAIR principles and the lack of change in practices during the last five years. The CWFR concept was defined to combine these two intentions: increasing the use of workflow technology and improving FAIR compliance. In the study described in this paper, we indicate how this could be applied to machine learning, which is now used by almost all research disciplines, with the well-known effect of a severe lack of repeatability and reproducibility. Researchers will only change practices if they can work efficiently and are not loaded with additional tasks. A comprehensive CWFR framework would be an umbrella for all steps that need to be carried out to do machine learning on selected data collections, and would immediately create comprehensive, FAIR-compliant documentation. The researcher is guided by such a framework, and information, once entered, can easily be shared and reused. The many iterations normally required in machine learning can be dealt with efficiently using CWFR methods. Libraries of components that can easily be orchestrated, using FAIR Digital Objects as a common entity to document all actions and to exchange information between steps without the researcher needing to understand anything about PIDs and FDO details, are probably the way to increase efficiency in repeating research workflows. As the Galaxy project indicates, the availability of supporting tools will be important for letting researchers use these methods. Unlike the Galaxy framework, however, it would be necessary to include all steps of a machine learning task, including those that require human interaction, and to document all phases with the help of structured FDOs.
Canonical Workflow for Machine Learning Tasks. Christophe Blanchi, B. Gebre, P. Wittenburg. Data Intelligence 4(1): 173–185, April 2022. doi:10.1162/dint_a_00124
S. Schröder, Eleonora Epp, A. Mozaffari, M. Romberg, Niklas Selke, M. Schultz
Abstract Data harmonization and documentation of the data processing are essential prerequisites for enabling Canonical Analysis Workflows. The recently revised Terabyte-scale air quality database system created by the Tropospheric Ozone Assessment Report (TOAR) contains one of the world's largest collections of near-surface air quality measurements and treats the FAIR data principles as an integral part of its design. A special feature of our data service is the on-demand processing and product generation of several air quality metrics directly from the underlying database. In this paper, we show that the data harmonization necessary for establishing such online analysis services goes much deeper than the obvious issues of common data formats, variable names, and measurement units, and we explore how the generation of FAIR Digital Objects (FDOs) in combination with automatically generated documentation may support Canonical Analysis Workflows for air quality and related data.
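On-demand metric generation from harmonized hourly values can be illustrated with one standard ozone metric, the daily maximum 8-hour running mean (MDA8). This is a simplified standalone sketch, not TOAR's actual service code: real implementations also apply missing-data and data-capture rules, which are omitted here, and the example day is invented.

```python
# Simplified sketch of on-demand metric generation (not TOAR's service code):
# the daily maximum 8-hour running mean (MDA8) of ozone, computed directly from
# one day of harmonized hourly values. Missing-data handling is omitted.
def mda8(hourly_ppb):
    """Maximum of all 8-hour running means over a day of hourly ozone values."""
    if len(hourly_ppb) < 8:
        raise ValueError("need at least 8 hourly values")
    return max(sum(hourly_ppb[i:i + 8]) / 8
               for i in range(len(hourly_ppb) - 7))

# Hypothetical day with an afternoon ozone peak (values in ppb):
day = [30] * 6 + [40, 50, 60, 70, 80, 80, 70, 60] + [40] * 10
print(mda8(day))  # 63.75
```

Computing such metrics directly in the service, rather than shipping raw data for users to aggregate themselves, only works because the hourly inputs have already been harmonized in units, variable naming, and time reference, which is the deeper point the abstract makes.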
Enabling Canonical Analysis Workflows Documented Data Harmonization on Global Air Quality Data. Data Intelligence 4(1): 259–270, April 2022. doi:10.1162/dint_a_00130