
arXiv - CS - Digital Libraries: Latest Publications

Exploring the applicability of Large Language Models to citation context analysis
Pub Date: 2024-09-04 DOI: arxiv-2409.02443
Kai Nishikawa, Hitoshi Koshiba
Unlike traditional citation analysis -- which assumes that all citations in a paper are equivalent -- citation context analysis considers the contextual information of individual citations. However, citation context analysis requires creating large amounts of data through annotation, which hinders the widespread use of this methodology. This study explored the applicability of Large Language Models (LLMs) -- particularly ChatGPT -- to citation context analysis by comparing LLM and human annotation results. The results show that LLM annotation is as good as or better than human annotation in terms of consistency but poor in terms of predictive performance. Thus, having LLMs immediately replace human annotators in citation context analysis is inappropriate. However, the annotation results obtained by LLMs can be used as reference information when narrowing the annotation results obtained by multiple human annotators down to one, or LLMs can be used as one of the annotators when it is difficult to recruit sufficient human annotators. This study provides basic findings important for the future development of citation context analysis.
Citations: 0
Coverage and metadata availability of African publications in OpenAlex: A comparative analysis
Pub Date: 2024-09-02 DOI: arxiv-2409.01120
Patricia Alonso-Alvarez, Nees Jan van Eck
Unlike traditional proprietary data sources like Scopus and Web of Science (WoS), OpenAlex emphasizes its comprehensive coverage, particularly highlighting its inclusion of the humanities, non-English languages, and research from the Global South. Strengthening diversity and inclusivity in science is crucial for ethical and practical reasons. This paper analyses OpenAlex's coverage and metadata availability of African-based publications. For this purpose, we compare OpenAlex with Scopus, WoS, and African Journals Online (AJOL). We first compare the coverage of African research publications in OpenAlex against that of WoS, Scopus, and AJOL. We then assess and compare the available metadata for OpenAlex, Scopus, and WoS publications. Our analysis shows that OpenAlex offers the most extensive publication coverage. In terms of metadata, OpenAlex offers a high coverage of publication and author information. It performs worse regarding affiliations, references, and funder information. Importantly, our results also show that metadata availability in OpenAlex is better for publications that are also indexed in Scopus or WoS.
Citations: 0
Simbanex: Similarity-based Exploration of IEEE VIS Publications
Pub Date: 2024-08-31 DOI: arxiv-2409.00478
Daniel Witschard, Ilir Jusufi, Andreas Kerren
Embeddings are powerful tools for transforming complex and unstructured data into numeric formats suitable for computational analysis tasks. In this work, we use multiple embeddings for similarity calculations to be applied in bibliometrics and scientometrics. We build a multivariate network (MVN) from a large set of scientific publications and explore an aspect-driven analysis approach to reveal similarity patterns in the given publication data. By dividing our MVN into separately embeddable aspects, we are able to obtain a flexible vector representation which we use as input to a novel method of similarity-based clustering. Based on these preprocessing steps, we developed a visual analytics application, called Simbanex, that has been designed for the interactive visual exploration of similarity patterns within the underlying publications.
Citations: 0
Post-OCR Text Correction for Bulgarian Historical Documents
Pub Date: 2024-08-31 DOI: arxiv-2409.00527
Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov
The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem, as standard OCR tools are not tailored to deal with historical orthography or with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with diagonal attention loss and copy and coverage mechanisms, to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.
Citations: 0
CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models
Pub Date: 2024-08-30 DOI: arxiv-2408.17428
Jonathan Bourne
The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
Citations: 0
Evaluating the Accuracy of the Labeling System in Web of Science for the Sustainable Development Goals
Pub Date: 2024-08-30 DOI: arxiv-2408.17084
Yu Zhao, Li Li, Zhesi Shen
Monitoring and fostering research aligned with the Sustainable Development Goals (SDGs) is crucial for formulating evidence-based policies, identifying best practices, and promoting global collaboration. The key step is developing a labeling system to map research publications to their related SDGs. The SDG labeling system integrated in Web of Science (WoS), which assigns citation topics instead of individual publications to SDGs, has emerged as a promising tool. However, we still lack a comprehensive evaluation of the performance of the WoS labeling system. By comparing with the Bergen approach, we systematically assessed the relatedness between citation topics and SDGs. Our analysis identified 15% of topics showing low relatedness to their assigned SDGs at a 1% threshold. Notably, SDGs such as '11 Cities', '07 Energy', and '13 Climate' exhibited higher percentages of low-related topics. In addition, we revealed that certain topics are significantly underrepresented in their relevant SDGs, particularly for '02 Hunger', '12 Consumption', and '15 Land'. This study underscores the critical need for continual refinement and validation of SDG labeling systems in WoS.
Citations: 0
μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context
Pub Date: 2024-08-28 DOI: arxiv-2408.15646
Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara
Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social sciences and humanities fields. In this work, we focus on the Regesta Pontificum Romanorum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documental sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually markup language. However, current models focus on scientific and business documents, and most of them consider only single-page documents. To overcome this limitation, in this work, we propose μgat, an extension of the recently proposed Nougat document parsing architecture, which can handle elements spanning beyond single-page limits. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach, also in the case of the challenging Regesta Pontificum Romanorum.
Citations: 0
LyCon: Lyrics Reconstruction from the Bag-of-Words Using Large Language Models
Pub Date: 2024-08-27 DOI: arxiv-2408.14750
Haven Kim, Kahyun Choi
This paper addresses the unique challenge of conducting research in lyric studies, where direct use of lyrics is often restricted due to copyright concerns. Unlike typical data, internet-sourced lyrics are frequently protected under copyright law, necessitating alternative approaches. Our study introduces a novel method for generating copyright-free lyrics from publicly available Bag-of-Words (BoW) datasets, which contain the vocabulary of lyrics but not the lyrics themselves. Utilizing metadata associated with BoW datasets and large language models, we successfully reconstructed lyrics. We have compiled and made available a dataset of reconstructed lyrics, LyCon, aligned with metadata from renowned sources including the Million Song Dataset, Deezer Mood Detection Dataset, and AllMusic Genre Dataset, available for public access. We believe that the integration of metadata such as mood annotations or genres enables a variety of academic experiments on lyrics, such as conditional lyric generation.
Citations: 0
Transdisciplinary research: How much is academia heeding the call to work more closely with societal stakeholders such as industry, government, and nonprofits?
Pub Date: 2024-08-26 DOI: arxiv-2408.14024
Philip James Purnell
Transdisciplinary research, the co-creation of scientific knowledge by multiple stakeholders, is considered essential for addressing major societal problems. Research policy makers and academic leaders frequently call for closer collaboration between academia and societal stakeholders to address the grand challenges of our time. This bibliometric study evaluates progress in collaboration between academia and three societal stakeholders: industry, government, and nonprofit organisations. It analyses the level of co-publishing between academia and these societal stakeholders over the period 2013-2022. We found that research collaboration between academia and all stakeholder types studied grew in absolute terms. However, academia-industry collaboration declined 16% relative to overall academic output, while academia-government and academia-nonprofit collaboration grew at roughly the same pace as academic output. Country and field of research breakdowns revealed wide variance. In light of previous work, we consider potential explanations for the gap between policymakers' aspirations and the real global trends. This study is a useful demonstration of large-scale, quantitative bibliometric techniques for research policymakers to track the impact of decisions related to funding, intellectual property law, and nonprofit support.
Citations: 0
Comparison of Sustainable Development Goals Labeling Systems based on Topic Coverage
Pub Date: 2024-08-24 DOI: arxiv-2408.13455
Li Li, Yu Zhao, Zhesi Shen
With the growing importance of the Sustainable Development Goals (SDGs), various labeling systems have emerged for effective monitoring and evaluation. This study assesses six labeling systems across 1.85 million documents at both the paper level and the topic level. Our findings indicate that the SDGO and SDSN systems are more aggressive, while systems such as Auckland, Aurora, SIRIS, and Elsevier exhibit significant topic consistency, with similarity scores exceeding 0.75 for most SDGs. However, similarities at the paper level generally fall short, particularly for specific SDGs like SDG 10. We highlight the crucial role of contextual information in keyword-based labeling systems, noting that overlooking context can introduce bias in the retrieval of papers (e.g., variations in "migration" between biomedical and geographical contexts). These results reveal substantial discrepancies among SDG labeling systems, emphasizing the need for improved methodologies to enhance the accuracy and relevance of SDG evaluations.
Citations: 0