Data Intelligence: Latest Articles, Page 6

HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-11-01 DOI: 10.1109/ICKG55886.2022.00049

Xindong Wu, Hao Chen, Chenyang Bu, Shengwei Ji, Zan Zhang, Victor S. Sheng

Spreadsheets contain a lot of valuable data and have many practical applications. The key technology behind these applications is enabling machines to understand the semantic structure of spreadsheets, e.g., identifying cell function types and discovering relationships between cell pairs. Most existing methods for understanding the semantic structure of spreadsheets do not make use of the semantic information of cells. A few studies do, but they ignore the layout structure information of spreadsheets, which affects the performance of cell function classification and the discovery of different relationship types of cell pairs. In this paper, we propose a Heuristic algorithm for Understanding the Semantic Structure of spreadsheets (HUSS). Specifically, to improve cell function classification, we propose an error correction mechanism (ECM) based on an existing cell function classification model [11] and the layout features of spreadsheets. To improve table structure analysis, we propose five types of heuristic rules to extract four different types of cell pairs, based on cell style and spatial location information. Our experimental results on five real-world datasets demonstrate that HUSS can effectively understand the semantic structure of spreadsheets and outperforms the corresponding baselines.
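
To make the idea of a style-and-position heuristic concrete, the following is a minimal sketch of one toy rule, not one of the five HUSS rule types (the abstract does not specify them). It assumes a hypothetical `Cell` record in which a `bold` flag stands in for cell style, and it pairs each bold top-row cell with the non-bold cells below it in the same column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record; `bold` stands in for style information."""
    row: int
    col: int
    text: str
    bold: bool = False


def header_value_pairs(cells):
    """Toy rule: pair each bold top-row cell with the non-bold cells
    below it in the same column (style + spatial location)."""
    headers = [c for c in cells if c.bold and c.row == 0]
    pairs = []
    for header in headers:
        for cell in cells:
            if cell.col == header.col and cell.row > header.row and not cell.bold:
                pairs.append((header.text, cell.text))
    return pairs


# Illustrative 3x2 sheet: one header row, two data rows.
sheet = [
    Cell(0, 0, "Region", bold=True), Cell(0, 1, "Sales", bold=True),
    Cell(1, 0, "North"), Cell(1, 1, "120"),
    Cell(2, 0, "South"), Cell(2, 1, "95"),
]
pairs = header_value_pairs(sheet)
print(pairs)
```

A real system would presumably combine several such rules and handle merged cells, multi-row headers, and side headers; none of that is attempted here.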

Citations: 0

An Analysis of Crosswalks from Research Data Schemas to Schema.org

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-10-07 DOI: 10.1162/dint_a_00186

Mingfang Wu, S. Richard, C. Verhey, L. J. Castro, Baptiste Cecconi, N. Juty

The increased number of data repositories has greatly increased the availability of open data. To enable broad discovery of and access to research datasets, some data repositories have begun leveraging the web architecture by embedding structured metadata markup in dataset web landing pages using vocabularies from Schema.org and extensions. This paper aims to examine metadata interoperability for supporting global data discovery. Specifically, the paper reports a survey on which metadata schemas have been adopted by participating data repositories, and presents an analysis of crosswalks from fourteen research data schemas to Schema.org. The analysis indicates that most descriptive metadata are interoperable among the schemas, that the most inconsistent mapping is for rights metadata, and that a large gap exists in the structural metadata and in the controlled vocabularies used to specify various property values. The analysis and collated crosswalks can serve as a reference for data repositories when they develop crosswalks from their own schemas to Schema.org, and provide the research data community with a benchmark of structured metadata implementation.
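
As an illustration of what a crosswalk does in practice, here is a minimal sketch that maps a flat source record to a Schema.org `Dataset` description in JSON-LD. The source field names (`title`, `abstract`, `doi`, `rights`, `subjects`, `issued`, `file_structure`) and the sample values are hypothetical and do not come from any of the fourteen schemas analysed; the target properties (`name`, `description`, `identifier`, `license`, `keywords`, `datePublished`) are existing Schema.org properties.

```python
import json

# Hypothetical crosswalk: source field name -> Schema.org Dataset property.
CROSSWALK = {
    "title": "name",
    "abstract": "description",
    "doi": "identifier",
    "rights": "license",
    "subjects": "keywords",
    "issued": "datePublished",
}


def to_schema_org(record):
    """Map a flat source record to a Schema.org Dataset dict.

    Returns the mapped document and the list of source fields that the
    crosswalk does not cover, so gaps stay visible instead of being dropped
    silently."""
    doc = {"@context": "https://schema.org/", "@type": "Dataset"}
    unmapped = []
    for field, value in record.items():
        target = CROSSWALK.get(field)
        if target is None:
            unmapped.append(field)
        else:
            doc[target] = value
    return doc, unmapped


# Placeholder record; the DOI and values are made up for illustration.
record = {
    "title": "Example dataset",
    "abstract": "An illustrative record.",
    "doi": "https://doi.org/10.1234/example",
    "rights": "https://creativecommons.org/licenses/by/4.0/",
    "file_structure": "3 CSV files",
}
doc, unmapped = to_schema_org(record)
print(json.dumps(doc, indent=2))
print("Unmapped fields:", unmapped)
```

Returning the unmapped fields mirrors the kind of gap the abstract reports for structural metadata: a field such as the hypothetical `file_structure` has no obvious target in this small mapping.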

Citations: 5

FAIR Equivalency in Indonesia's Digital Health Framework

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-10-01 DOI: 10.1162/dint_a_00171

Putu Hadi Purnama Jati

The objective of this study was to assess the regulatory framework for health data in Indonesia in order to understand the policy context and explore the possibility of expanding the adoption and implementation in Indonesia of the FAIR Guidelines, which state that data should be Findable, Accessible, Interoperable and Reusable (FAIR). Although the FAIR Guidelines were not explicitly mentioned in any of the policy documents relevant to the Indonesian digital health sector, six out of the eight documents analysed contained FAIR Equivalent principles. In particular, Indonesia's Population Identification Number (NIK) has the potential, as a unique identifier, to support the integration and interoperability (findability) of data, which is crucial to all other aspects of the FAIR Guidelines. There is also a plan to build standards and protocols into the implementation of information systems in each ministry and government agency to improve data accessibility (accessibility); the integration of the various information systems is planned or ongoing (interoperability); and the need for a standardised arrangement for health information systems related to health data, following the community standard, is recognised (reusability). The documents at the core of Indonesia's digital health/eHealth policy have the highest FAIR Equivalency Score (FE-Score), showing some degree of alignment between the Indonesian digital health implementation vision and the FAIR Guidelines. This indicates that Indonesia's digital health sector is open to using the FAIR Guidelines.

Citations: 7

FAIREST: A Framework for Assessing Research Repositories

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-09-28 DOI: 10.1162/dint_a_00159

M. d’Aquin, Fabian Kirstein, Daniela Oliveira, Sonja Schimmler, Sebastian Urbanek

The open science movement has gained significant momentum within the last few years. This comes along with the need to store and share research artefacts, such as publications and research data. For this purpose, research repositories need to be established. A variety of solutions exist for implementing such repositories, covering diverse features, ranging from custom depositing workflows to social media-like functions. In this article, we introduce the FAIREST principles, a framework inspired by the well-known FAIR principles, but designed to provide a set of metrics for assessing and selecting solutions for creating digital repositories for research artefacts. The goal is to support decision makers in choosing such a solution when planning for a repository, especially at an institutional level. The metrics included are therefore based on two pillars: (1) an analysis of established features and functionalities, drawn from existing dedicated, general-purpose and commonly used solutions, and (2) a literature review on general requirements for digital repositories for research artefacts and related systems. We further describe an assessment of 11 widespread solutions, with the goal of providing an overview of the current landscape of research data repository solutions and identifying gaps and research challenges to be addressed.

Citations: 3

The FAIR Data Point: Interfaces and Tooling

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-09-28 DOI: 10.1162/dint_a_00161

Ousamma Mohammed Benhamed, K. Burger, R. Kaliyaperumal, Luiz Olavo Bonino da Silva Santos, M. Suchánek, Jan Slifka, Mark D. Wilkinson

While the FAIR Principles do not specify a technical solution for ‘FAIRness’, it was clear from the outset of the FAIR initiative that it would be useful to have commodity software and tooling that would simplify the creation of FAIR-compliant resources. The FAIR Data Point is a metadata repository that follows the DCAT(2) schema, and utilizes the Linked Data Platform to manage the hierarchical metadata layers as LDP Containers. There has been a recent flurry of development activity around the FAIR Data Point that has significantly improved its power and ease-of-use. Here we describe five specific tools (an installer, a loader, two Web-based interfaces, and an indexer) aimed at maximizing the uptake and utility of the FAIR Data Point.
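
The hierarchical metadata layers mentioned above can be pictured with a small sketch. It builds a nested, JSON-LD-style description using DCAT terms (`dcat:Catalog`, `dcat:dataset`, `dcat:Dataset`, `dcat:distribution`, `dcat:Distribution`, `dcat:downloadURL`) and Dublin Core `dct:title`, then walks the layers from parent to child. The `example.org` URIs and all titles are placeholders, the helper names are hypothetical, and the sketch does not model LDP Containers or any actual FAIR Data Point API.

```python
import json

# Namespace prefixes for the vocabularies used below.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def distribution(uri, title, download_url):
    """Lowest layer: one concrete, downloadable form of a dataset."""
    return {"@id": uri, "@type": "dcat:Distribution", "dct:title": title,
            "dcat:downloadURL": {"@id": download_url}}


def dataset(uri, title, distributions):
    """Middle layer: a dataset that points to its distributions."""
    return {"@id": uri, "@type": "dcat:Dataset", "dct:title": title,
            "dcat:distribution": distributions}


def catalog(uri, title, datasets):
    """Top layer in this sketch: a catalog that points to its datasets."""
    return {"@context": CONTEXT, "@id": uri, "@type": "dcat:Catalog",
            "dct:title": title, "dcat:dataset": datasets}


def walk(node, depth=0):
    """Yield (depth, type, title) for each metadata layer, parent first."""
    yield depth, node["@type"], node["dct:title"]
    for key in ("dcat:dataset", "dcat:distribution"):
        for child in node.get(key, []):
            yield from walk(child, depth + 1)


# Placeholder URIs under example.org.
cat = catalog(
    "https://example.org/catalog/1", "Demo catalog",
    [dataset(
        "https://example.org/dataset/1", "Demo dataset",
        [distribution("https://example.org/distribution/1", "CSV export",
                      "https://example.org/files/demo.csv")],
    )],
)
layers = list(walk(cat))
print(json.dumps(cat, indent=2))
for depth, kind, title in layers:
    print("  " * depth + f"{kind}: {title}")
```

Keeping each layer as its own addressable resource with links to its children is what lets a client start at the catalog and discover datasets and distributions by following links.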

Citations: 3

FAIR data and metadata: GNSS precise positioning user perspective

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-09-28 DOI: 10.1162/dint_a_00185

I. Ivánová, R. Keenan, Christopher Marshall, Lori Mancell, E. Rubinov, R. Ruddick, Nicholas Brown, Graeme Kernich

The FAIR principles of Wilkinson et al. [1] are finding their way from research into application domains, one of which is precise positioning with global navigation satellite systems (GNSS). Current GNSS users demand that data and services be findable online, accessible via open protocols (by both machines and humans), interoperable with their legacy systems and reusable in various settings. Comprehensive metadata are essential for seamless communication between GNSS data and service providers and their users, and, for decades, geodetic and geospatial standards have been efficiently implemented to support this. However, the GNSS user community is shifting from highly specialised precise positioning by geodetic professionals to everyday precise positioning by autonomous vehicles or wellness-obsessed citizens. Moreover, rapid technological developments allow alternative ways of offering data and services to their users. These changing circumstances warrant a review of whether the metadata defined in the generic geospatial and geodetic standards in use still support FAIR use of modern GNSS data and services across this novel user spectrum. This paper reports on the requirements of current GNSS users in various application sectors regarding the way data, metadata and services are provided. We engaged with GNSS stakeholders to validate our findings and to gain an understanding of their perception of the FAIR principles. Our results confirm that offering FAIR GNSS data and services is fundamental, but for these to be used with confidence, there is a need to review the way metadata are offered to the community. Defining a standards-compliant GNSS community metadata profile and providing relevant metadata with data on demand, the approach outlined in this paper, is a way to manage current GNSS users’ expectations and to improve FAIR GNSS data and service delivery for both humans and machines.

Citations: 1

Automated metadata annotation: What is and is not possible with machine learning

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-09-28 DOI: 10.1162/dint_a_00162

Mingfang Wu, Hans Brandhorst, M. Marinescu, J. M. López, Marjorie M. K. Hlava, J. Busch

Automated metadata annotation is only as good as the training dataset or rules that are available for the domain. It's important to learn what type of data content a pre-trained machine learning algorithm has been trained on to understand its limitations and potential biases. Consider what type of content is readily available to train an algorithm: what's popular and what's available. However, scholarly and historical content is often not available in consumable, homogenized, and interoperable formats at the large volume that is required for machine learning. There are exceptions such as science and medicine, where large, well-documented collections are available. This paper presents the current state of automated metadata annotation in cultural heritage and research data, discusses challenges identified from use cases, and proposes solutions.

Citations: 7

Terminology for a FAIR Framework for the Virus Outbreak Data Network-Africa

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-08-18 DOI: 10.1162/dint_a_00167

Ruduan Plug, Yan Liang, Aliya Aktau, Mariam Basajja, Francisca Onaolapo Oladipo, M. van Reisen

The field of health data management poses unique challenges in relation to data ownership, the privacy of data subjects, and the reusability of data. The FAIR Guidelines have been developed to address these challenges. The Virus Outbreak Data Network (VODAN) architecture builds on these principles, using the European Union's General Data Protection Regulation (GDPR) framework to ensure compliance with local data regulations, while using information knowledge management concepts to further improve data provenance and interoperability. In this article we provide an overview of the terminology used in the field of FAIR data management, with a specific focus on FAIR-compliant health information management, as implemented in the VODAN architecture.

Citations: 2

FAIR Machine Learning Model Pipeline Implementation of COVID-19 Data

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-08-18 DOI: 10.1162/dint_a_00182

Sakinat Folorunso, E. Ogundepo, Mariam Basajja, Joseph Awotunde, A. Kawu, Francisca Onaolapo Oladipo, Ibrahim Abdullahi

Research and development are gradually becoming data-driven, and the implementation of the FAIR Guidelines (that data should be Findable, Accessible, Interoperable, and Reusable) for scientific data administration and stewardship has the potential to remarkably enhance the framework for the reuse of research data. In this way, FAIR is aiding digital transformation. The ‘FAIRification’ of data increases the interoperability and (re)usability of data, so that new and robust analytical tools, such as machine learning (ML) models, can access the data to deduce meaningful insights, extract actionable information, and identify hidden patterns. This article aims to build a FAIR ML model pipeline using the generic FAIRification workflow to make the whole ML analytics process FAIR. Accordingly, FAIR input data was modelled using a FAIR ML model. The output data from the FAIR ML model was also made FAIR. For this, a hybrid hierarchical k-means (HHK) clustering ML algorithm was applied to group the data into homogeneous subgroups and ascertain the underlying structure of the data, using a Nigerian-based FAIR dataset that contains data on economic factors, healthcare facilities, and coronavirus occurrences in all 36 states of Nigeria. The model showed that research data and the ML pipeline can be FAIRified, shared, and reused by following the proposed FAIRification workflow and implementing the technical architecture.
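
The abstract does not spell out the HHK algorithm, so the following is only a sketch of one common formulation of hybrid hierarchical k-means: run agglomerative clustering until k clusters remain, take their centroids as seeds, then refine with standard k-means (Lloyd's algorithm). The centroid-distance merge criterion, the function names, and the six sample points are assumptions made for illustration; they are not taken from the paper or its Nigerian dataset.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm from the given centers; returns (centers, labels)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: recompute each center; keep it if its cluster is empty.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def hybrid_hierarchical_kmeans(points, k):
    """Seed k-means with the centroids of a k-cluster hierarchical solution."""
    seeds = [centroid(cluster) for cluster in agglomerate(points, k)]
    return kmeans(points, seeds)


# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = hybrid_hierarchical_kmeans(points, k=2)
print(labels)
print(centers)
```

The point of the hybrid is that k-means no longer depends on random initial centers, so repeated runs on the same data give the same grouping. The pairwise search in `agglomerate` is quadratic per merge, which is acceptable for a few dozen rows but not for large datasets.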

Citations: 4

Expanding Non-Patient COVID-19 Data: Towards the FAIRification of Migrants’ Data in Tunisia, Libya and Niger

IF 3.9 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence

Pub Date : 2022-08-18 DOI: 10.1162/dint_a_00181

M. Ghardallou, Morgane Wirtz, Sakinat Folorunso, Z. Touati, E. Ogundepo, Klara Smits, A. Mtiraoui, M. Reisen

This article describes the FAIRification process (which involves making data Findable, Accessible, Interoperable and Reusable, or FAIR, for both machines and humans) for data related to the impact of COVID-19 on migrants, refugees and asylum seekers in Tunisia, Libya and Niger, according to the scheme adopted by GO FAIR. This process was divided into three phases: pre-FAIRification, FAIRification and post-FAIRification. Each phase consisted of seven steps. In the first phase, 118 in-depth interviews and 565 press articles and research reports were collected by students and researchers at the University of Sousse in Tunisia and researchers in Niger. These interviews, articles and reports constitute the dataset for this research. In the second phase, the data were sorted and converted into a machine-actionable format and published on a FAIR Data Point hosted at the University of Sousse. In the third phase, an assessment of the implementation of the FAIR Guidelines was undertaken. Certain barriers and challenges were faced in this process and solutions were found. For FAIR data curation, certain changes need to be made to the technical process. People need to be convinced to make these changes and that implementing FAIR will generate a long-term return on investment. Although the implementation of the FAIR Guidelines is not straightforward, making our resources FAIR is essential to achieving better science together.

Citations: 2