Wikidata subsetting: Approaches, tools, and evaluation

Semantic Web (IF 2.9; Region 3, Computer Science; Q2, Computer Science, Artificial Intelligence) Pub Date: 2023-12-27 DOI: 10.3233/sw-233491

Seyed Amir Hosseini Beghaeiraveri, J. E. Labra Gayo, A. Waagmeester, Ammar Ammar, Carolina Gonzalez, D. Slenter, Sabah Ul-Hasan, E. Willighagen, Fiona McNeill, A. Gray



Abstract

Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers require only a limited range of data from Wikidata, focused on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Several approaches and dedicated tools have been developed for subsetting, but they have not yet been evaluated. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific to Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases were defined and the extracted subsets were analyzed, yielding valuable information about the variety and quality of Wikidata that could not otherwise be obtained through the public Wikidata SPARQL endpoint.
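To make the idea of subsetting concrete, the following is a minimal Python sketch, not a description of how WDSub, KGTK, WDumper, or WDF work internally. It assumes a subset defined as "all items that are an instance of (P31) one of a given set of classes" and reads entities in the one-entity-per-line layout of the Wikidata JSON dump. The function names, the toy dump, and the accuracy measure (the fraction of expected items that appear in the extracted subset) are illustrative assumptions; the paper's own accuracy definition may differ.

```python
import json
from typing import Iterable, Iterator, Set


def instance_of_ids(entity: dict) -> Set[str]:
    """Return the QIDs an entity lists as 'instance of' (P31) values."""
    ids = set()
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim.get("mainsnak", {})
        # Skip 'novalue' / 'somevalue' snaks, which carry no datavalue.
        if snak.get("snaktype") == "value":
            ids.add(snak["datavalue"]["value"]["id"])
    return ids


def subset(lines: Iterable[str], wanted_classes: Set[str]) -> Iterator[dict]:
    """Yield entities whose P31 values intersect wanted_classes.

    Each input line is assumed to hold one JSON entity, optionally followed
    by a comma, with '[' and ']' on their own lines (JSON dump layout).
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        if instance_of_ids(entity) & wanted_classes:
            yield entity


def item_accuracy(extracted_ids: Set[str], expected_ids: Set[str]) -> float:
    """Assumed measure: share of expected items present in the subset."""
    if not expected_ids:
        return 1.0
    return len(extracted_ids & expected_ids) / len(expected_ids)


def toy_entity(qid: str, classes: Iterable[str]) -> dict:
    """Hypothetical helper: build a tiny entity with only P31 claims."""
    return {
        "id": qid,
        "claims": {
            "P31": [
                {
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P31",
                        "datavalue": {
                            "value": {"entity-type": "item", "id": c},
                            "type": "wikibase-entityid",
                        },
                    }
                }
                for c in classes
            ]
        },
    }


# Toy dump: two humans (Q5) and one city (Q515).
TOY_DUMP = (
    ["["]
    + [json.dumps(e) + "," for e in (
        toy_entity("Q42", ["Q5"]),
        toy_entity("Q64", ["Q515"]),
        toy_entity("Q1339", ["Q5"]),
    )]
    + ["]"]
)

if __name__ == "__main__":
    extracted = {e["id"] for e in subset(TOY_DUMP, {"Q5"})}
    print(sorted(extracted))
    print(item_accuracy(extracted, {"Q42", "Q1339", "Q7251"}))
```

Streaming the dump line by line keeps memory use flat, which matters for a dump of the size the abstract describes; a real subset definition would typically also filter on statements, qualifiers, and references rather than on P31 alone.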


Source journal

Semantic Web (categories: Computer Science, Artificial Intelligence; Computer Science, Information Systems)

CiteScore: 8.30

Self-citation rate: 6.70%


About the journal: The journal Semantic Web – Interoperability, Usability, Applicability brings together researchers from various fields who share the vision and need for more effective and meaningful ways to share information across agents and services on the future internet and elsewhere. As such, Semantic Web technologies shall support the seamless integration of data, on-the-fly composition and interoperation of Web services, as well as more intuitive search engines. The semantics – or meaning – of information, however, cannot be defined without a context, which makes personalization, trust, and provenance core topics for Semantic Web research. New retrieval paradigms, user interfaces, and visualization techniques have to unleash the power of the Semantic Web and at the same time hide its complexity from the user. Based on this vision, the journal welcomes contributions ranging from theoretical and foundational research through methods and tools to descriptions of concrete ontologies and applications in all areas. We especially welcome papers which add a social, spatial, and temporal dimension to Semantic Web research, as well as application-oriented papers making use of formal semantics.
