{"title":"维基百科引文:从多语言维基百科中提取可复制的引文","authors":"Natallia Kokash, Giovanni Colavizza","doi":"arxiv-2406.19291","DOIUrl":null,"url":null,"abstract":"Wikipedia is an essential component of the open science ecosystem, yet it is\npoorly integrated with academic open science initiatives. Wikipedia Citations\nis a project that focuses on extracting and releasing comprehensive datasets of\ncitations from Wikipedia. A total of 29.3 million citations were extracted from\nEnglish Wikipedia in May 2020. Following this one-off research project, we\ndesigned a reproducible pipeline that can process any given Wikipedia dump in\nthe cloud-based settings. To demonstrate its usability, we extracted 40.6\nmillion citations in February 2023 and 44.7 million citations in February 2024.\nFurthermore, we equipped the pipeline with an adapted Wikipedia citation\ntemplate translation module to process multilingual Wikipedia articles in 15\nEuropean languages so that they are parsed and mapped into a generic structured\ncitation template. This paper presents our open-source software pipeline to\nretrieve, classify, and disambiguate citations on demand from a given Wikipedia\ndump.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"25 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia\",\"authors\":\"Natallia Kokash, Giovanni Colavizza\",\"doi\":\"arxiv-2406.19291\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Wikipedia is an essential component of the open science ecosystem, yet it is\\npoorly integrated with academic open science initiatives. Wikipedia Citations\\nis a project that focuses on extracting and releasing comprehensive datasets of\\ncitations from Wikipedia. A total of 29.3 million citations were extracted from\\nEnglish Wikipedia in May 2020. Following this one-off research project, we\\ndesigned a reproducible pipeline that can process any given Wikipedia dump in\\nthe cloud-based settings. To demonstrate its usability, we extracted 40.6\\nmillion citations in February 2023 and 44.7 million citations in February 2024.\\nFurthermore, we equipped the pipeline with an adapted Wikipedia citation\\ntemplate translation module to process multilingual Wikipedia articles in 15\\nEuropean languages so that they are parsed and mapped into a generic structured\\ncitation template. 
This paper presents our open-source software pipeline to\\nretrieve, classify, and disambiguate citations on demand from a given Wikipedia\\ndump.\",\"PeriodicalId\":501285,\"journal\":{\"name\":\"arXiv - CS - Digital Libraries\",\"volume\":\"25 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Digital Libraries\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.19291\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.19291","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia
Wikipedia is an essential component of the open science ecosystem, yet it is
poorly integrated with academic open science initiatives. Wikipedia Citations
is a project that focuses on extracting and releasing comprehensive datasets of
citations from Wikipedia. A total of 29.3 million citations were extracted from
English Wikipedia in May 2020. Following this one-off research project, we
designed a reproducible pipeline that can process any given Wikipedia dump in
a cloud-based setting. To demonstrate its usability, we extracted 40.6
million citations in February 2023 and 44.7 million citations in February 2024.
Furthermore, we equipped the pipeline with an adapted Wikipedia citation
template translation module to process multilingual Wikipedia articles in 15
European languages, so that their citations are parsed and mapped onto a generic structured
citation template. This paper presents our open-source software pipeline to
retrieve, classify, and disambiguate citations on demand from a given Wikipedia
dump.
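
Below is a minimal illustrative sketch (not the authors' released pipeline) of how citation templates in wikitext can be parsed and mapped onto a generic structured record. It assumes the third-party Python library mwparserfromhell; the template-name and parameter-name mappings are hypothetical examples standing in for the multilingual template translation module described above.

# Minimal sketch: extract citation templates from wikitext and map them
# onto a generic structured record. Assumes mwparserfromhell is installed.
import mwparserfromhell

# Hypothetical mapping of language-specific citation template names to a generic type.
TEMPLATE_MAP = {
    "cite journal": "journal",       # English
    "cita publicación": "journal",   # Spanish (example)
    "cite book": "book",             # English
    "literatur": "book",             # German (example)
}

# Hypothetical mapping of localized parameter names to generic field names.
PARAM_MAP = {
    "title": "title", "titel": "title", "título": "title",
    "author": "author", "autor": "author",
    "year": "year", "jahr": "year",
}

def extract_citations(wikitext: str) -> list[dict]:
    """Parse wikitext and return recognized citation templates as generic dicts."""
    citations = []
    for tpl in mwparserfromhell.parse(wikitext).filter_templates():
        name = str(tpl.name).strip().lower()
        if name not in TEMPLATE_MAP:
            continue  # skip non-citation templates
        record = {"type": TEMPLATE_MAP[name]}
        for param in tpl.params:
            key = str(param.name).strip().lower()
            if key in PARAM_MAP:
                record[PARAM_MAP[key]] = str(param.value).strip()
        citations.append(record)
    return citations

if __name__ == "__main__":
    sample = "{{cite journal |title=Example Paper |author=Doe, J. |year=2020}}"
    print(extract_citations(sample))
    # -> [{'type': 'journal', 'title': 'Example Paper', 'author': 'Doe, J.', 'year': '2020'}]

A production pipeline would additionally classify citation targets (e.g. journal articles, books, web pages) and disambiguate them against external identifiers such as DOIs, which this sketch does not attempt.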