Cluster Analysis of Open Research Data: A Case for Replication Metadata

International Journal of Digital Curation. Pub Date: 2023-02-02. DOI: 10.2218/ijdc.v17i1.833

Ana Trisovic


Citations: 0

Abstract

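Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets from numerous disciplines, ranging from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR. Aiming to contribute to the discussions on the development of metadata for research outputs, I conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. I use the content of over 40,000 datasets from the Harvard Dataverse research data repository as my sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample no meaningful clusters can be identified. To interpret the results, I use the metadata standard employed by DataCite, a leading organization for documenting the scholarly record, and map its existing resource types to my results. About 65% of the sample can be described with single-type metadata (such as Dataset, Software or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described by a Replication resource metadata type. Such a resource type would be particularly useful in facilitating research reproducibility.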
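The mapping from deposited files to resource types can be illustrated with a small sketch. This is not the clustering method used in the paper, which the abstract does not name; it is a rule-based illustration under stated assumptions. The extension table, the three file categories, the function names, and the example deposits are all hypothetical. `Dataset`, `Software`, `Report`, `Collection` and `Other` are existing DataCite resource types, while `Replication` is the type the abstract proposes.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-category table (an assumption for illustration).
FILE_CATEGORY = {
    ".csv": "data", ".tab": "data", ".dta": "data",
    ".py": "code", ".r": "code", ".do": "code",
    ".pdf": "document", ".docx": "document",
}

# Single-category deposits map onto existing DataCite resource types.
SINGLE_TYPE = {"data": "Dataset", "code": "Software", "document": "Report"}


def categories(filenames):
    """Return the set of file categories present in one deposit."""
    found = set()
    for name in filenames:
        ext = PurePosixPath(name).suffix.lower()
        found.add(FILE_CATEGORY.get(ext, "other"))
    return found


def suggest_resource_type(filenames):
    """Suggest a resource type from the mix of files in a deposit."""
    cats = categories(filenames)
    if not cats:
        return "Other"  # empty deposit: nothing to classify
    if len(cats) == 1:
        (only,) = cats
        return SINGLE_TYPE.get(only, "Other")
    if {"data", "code"} <= cats:
        # Proposed in the abstract; not an existing DataCite type.
        return "Replication"
    return "Collection"  # any other mix falls back to the aggregate type


# Hypothetical deposits, keyed by made-up identifiers.
deposits = {
    "deposit-1": ["survey.csv", "codebook.csv"],
    "deposit-2": ["model.py"],
    "deposit-3": ["results.dta", "analysis.do", "readme.pdf"],
    "deposit-4": ["paper.pdf", "figures.csv"],
}

summary = Counter(suggest_resource_type(files) for files in deposits.values())
print(summary)
```

In this sketch, any deposit that holds both data and code is labelled `Replication` even when it also contains documents, which mirrors the abstract's description of such deposits as "containing both data and code files"; a real classifier would need a far more complete file-type table.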


