International Journal on Digital Libraries: Latest Publications

Robots still outnumber humans in web archives in 2019, but less than in 2015 and 2012
IF 1.5 Q1 Social Sciences Pub Date : 2024-03-07 DOI: 10.1007/s00799-024-00397-2
Himarsha R. Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, Michele C. Weigle

The significance of the web and the crucial role of web archives in its preservation highlight the necessity of understanding how users, both human and robot, access web archive content, and how best to satisfy the disparate needs of both types of users. To identify robots and humans in web archives and analyze their respective access patterns, we used the Internet Archive’s (IA) Wayback Machine access logs from 2012, 2015, and 2019, as well as Arquivo.pt’s (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified those sessions as human or robot based on their browsing behavior. To better understand how users navigate through the web archives, we evaluated these sessions to discover user access patterns. Across the two archives and the three years of IA access logs (2012 vs. 2015 vs. 2019), we present a comparison of detected robots vs. humans, their access patterns, and their temporal preferences. The proportion of requests from robots detected in IA 2012 (91%) and IA 2015 (88%) is greater than in IA 2019 (70%). Robots account for 98% of requests in Arquivo.pt (2019). We found that the robots are almost entirely limited to “Dip” and “Skim” access patterns in IA 2012 and 2015, but exhibit all the patterns and their combinations in IA 2019. Both humans and robots show a preference for web pages archived in the near past.
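
As a rough illustration of the session analysis described above, the sketch below groups access-log requests into per-client sessions and labels each session with simple browsing-behaviour heuristics. The log fields, the 30-minute inactivity cut-off, and the heuristics are illustrative assumptions, not the authors' actual classifier.

```python
from collections import defaultdict
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cut-off between sessions

def sessionize(requests):
    """Group request dicts (ip, timestamp, path) into per-client sessions."""
    by_client = defaultdict(list)
    for r in sorted(requests, key=lambda r: r["timestamp"]):
        by_client[r["ip"]].append(r)
    sessions = []
    for reqs in by_client.values():
        current = [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur["timestamp"] - prev["timestamp"] > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions

def looks_like_robot(session):
    """Toy heuristics: robots.txt access, high request rate, no embedded resources."""
    paths = [r["path"] for r in session]
    if any(p.endswith("/robots.txt") for p in paths):
        return True
    span = (session[-1]["timestamp"] - session[0]["timestamp"]).total_seconds()
    rate = len(session) / max(span, 1.0)  # requests per second within the session
    fetches_embeds = any(p.endswith((".png", ".jpg", ".gif", ".css")) for p in paths)
    return rate > 1.0 or not fetches_embeds
```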

Citations: 0
Stance prediction with a relevance attribute to political issues in comparing the opinions of citizens and city councilors
IF 1.5 Q1 Social Sciences Pub Date : 2024-02-26 DOI: 10.1007/s00799-024-00396-3
Ko Senoo, Yohei Seki, Wakako Kashino, Atsushi Keyaki, Noriko Kando

This study focuses on a method for differentiating between the stances of citizens and city councilors on political issues (i.e., in favor or against) and attempts to compare the arguments of both sides. We created a dataset by annotating citizen tweets and city council minutes with labels for four attributes: stance, usefulness, regional dependence, and relevance. We then fine-tuned a pretrained large language model using this dataset to assign the attribute labels to a large quantity of unlabeled data automatically. We introduced multitask learning to train each attribute jointly with relevance, identifying clues by focusing on the sentences that were relevant to the political issues. Our prediction models are based on T5, a large language model suitable for multitask learning. We compared the results from our system with those that used BERT or RoBERTa. Our experimental results showed that with multitask learning, the macro-F1 scores for stance improved by 1.8% for citizen tweets and 1.7% for city council minutes. Using the fine-tuned model to analyze real opinion gaps, we found that although the vaccination regime was positively evaluated by city councilors in Fukuoka city, it was not rated very highly by citizens.
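
A minimal sketch of how such multitask fine-tuning can be framed with T5's text-to-text interface: each attribute becomes a task prefix, so stance and relevance are learned jointly by one model. The prefixes, label strings, and example data are illustrative assumptions, not the authors' exact configuration.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    {"task": "stance",    "text": "The vaccination rollout was far too slow.", "label": "against"},
    {"task": "relevance", "text": "The vaccination rollout was far too slow.", "label": "relevant"},
]

# One prefix per attribute lets a single model learn all attributes jointly.
inputs  = [f'{ex["task"]}: {ex["text"]}' for ex in examples]
targets = [ex["label"] for ex in examples]

batch = tokenizer(inputs, padding=True, return_tensors="pt")
labels = tokenizer(targets, padding=True, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

loss = model(input_ids=batch.input_ids,
             attention_mask=batch.attention_mask,
             labels=labels).loss
loss.backward()  # in practice, wrap this in an optimizer and training loop
```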

Citations: 0
Towards privacy-aware exploration of archived personal emails
IF 1.5 Q1 Social Sciences Pub Date : 2024-02-21 DOI: 10.1007/s00799-024-00394-5
Zoe Bartliff, Yunhyong Kim, Frank Hopfgartner

This paper examines how privacy measures, such as anonymisation and aggregation processes for email collections, can affect the perceived usefulness of email visualisations for research, especially in the humanities and social sciences. The work is intended to inform archivists and data managers who are faced with the challenge of accessioning and reviewing increasingly sizeable and complex personal digital collections. The research in this paper provides a focused user study to investigate the usefulness of data visualisation as a mediator between privacy-aware data management and maximisation of the data's research value. The research is carried out with researchers and archivists with a vested interest in using, making sense of, and/or archiving the data to derive meaningful results. Participants tend to perceive email visualisations as useful, with an average rating of 4.281 (out of 7) for all the visualisations in the study, and above-average ratings for mountain graphs and word trees. The study shows that while participants voice a strong desire for information identifying individuals in email data, they perceive visualisations as almost equally useful for their research and/or work when aggregation is employed in addition to anonymisation.
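
For concreteness, here is a small sketch of the two privacy measures the study contrasts: pseudonymising senders (anonymisation) and rolling messages up into monthly counts (aggregation) of the kind that could feed a mountain graph. The field names and salting scheme are illustrative assumptions.

```python
import hashlib
from collections import Counter

SALT = "archive-specific-secret"  # assumed per-collection salt, kept out of the data

def pseudonymise(address: str) -> str:
    """Replace an email address with a stable, non-reversible pseudonym."""
    digest = hashlib.sha256((SALT + address.lower()).encode()).hexdigest()
    return f"person-{digest[:8]}"

def monthly_volume(messages):
    """Aggregate to (sender pseudonym, month) counts; each msg: {"from": str, "date": datetime}."""
    counts = Counter()
    for msg in messages:
        counts[(pseudonymise(msg["from"]), msg["date"].strftime("%Y-%m"))] += 1
    return counts
```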

Citations: 0
Exploiting the untapped functional potential of Memento aggregators beyond aggregation
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-27 DOI: 10.1007/s00799-023-00391-0
Mat Kelly

Web archives capture, retain, and present historical versions of web pages. Viewing web archives often amounts to a user visiting the Wayback Machine homepage, typing in a URL, then choosing the date and time of a relevant capture. Other web archives also capture the web and use Memento as an interoperable point for querying their captures. Memento aggregators are web-accessible software packages that allow clients to send requests for past web pages to a single endpoint, which then relays each request to a set of web archives. Though few deployed aggregator instances exhibit this aggregation trait, they all, for the most part, align to a model of serving a client's request for the URI of an original resource (URI-R) by first querying and then aggregating the responses from a collection of web archives. This single-tier querying need not be the logical flow of an aggregator, so long as a user can still utilize the aggregator from a single URL. In this paper, we discuss theoretical aggregation models of web archives. We first describe the status quo as the conventional behavior exhibited by an aggregator. We then build on prior work to describe a multi-tiered, structured querying model that may be exhibited by an aggregator. We highlight some potential issues and high-level optimizations to ensure efficient aggregation while also extending the state of the art of Memento aggregation. Part of our contribution is the extension of an open-source, user-deployable Memento aggregator to exhibit the capability described in this paper. We also extend a browser extension that typically consults an aggregator to be able to aggregate itself rather than needing to consult an external service. A purely client-side, browser-based Memento aggregator is novel to this work.
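
The "status quo" single-tier model the paper describes can be sketched as below: for a requested URI-R, query each archive's TimeMap, then merge the responses into one time-ordered list of mementos. The endpoint URL patterns are assumptions for illustration; a real aggregator would also handle redirects, pagination, and archive-specific quirks.

```python
import re
import requests
from email.utils import parsedate_to_datetime

TIMEMAP_ENDPOINTS = [                                # assumed URL patterns
    "https://web.archive.org/web/timemap/link/{}",
    "https://arquivo.pt/wayback/timemap/link/{}",
]

# Matches link-format TimeMap entries; assumes rel appears before datetime.
MEMENTO_RE = re.compile(r'<([^>]+)>;\s*rel="memento";\s*datetime="([^"]+)"')

def aggregate(uri_r: str):
    """Conventional single-tier aggregation: fan out, parse, merge, sort."""
    mementos = []
    for endpoint in TIMEMAP_ENDPOINTS:
        try:
            body = requests.get(endpoint.format(uri_r), timeout=10).text
        except requests.RequestException:
            continue  # one slow or failing archive must not break the merge
        mementos += [{"uri_m": uri, "datetime": parsedate_to_datetime(dt)}
                     for uri, dt in MEMENTO_RE.findall(body)]
    return sorted(mementos, key=lambda m: m["datetime"])
```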

Citations: 0
Image searching in an open photograph archive: search tactics and faced barriers in historical research
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-24 DOI: 10.1007/s00799-023-00390-1
E. Late, Hille Ruotsalainen, Sanna Kumpulainen
{"title":"Image searching in an open photograph archive: search tactics and faced barriers in historical research","authors":"E. Late, Hille Ruotsalainen, Sanna Kumpulainen","doi":"10.1007/s00799-023-00390-1","DOIUrl":"https://doi.org/10.1007/s00799-023-00390-1","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139601840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A BERT-based sequential deep neural architecture to identify contribution statements and extract phrases for triplets from scientific publications
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-23 DOI: 10.1007/s00799-023-00393-y
Komal Gupta, Ammaar Ahmad, Tirthankar Ghosal, Asif Ekbal

Research in Natural Language Processing (NLP) is increasing rapidly; as a result, a large number of research papers are being published. It is challenging to find the contributions of a research paper in any specific domain from the huge amount of unstructured data. There is a need for structuring the relevant contributions in a Knowledge Graph (KG). In this paper, we describe our work to accomplish four tasks toward building the Scientific Knowledge Graph (SKG). We propose a pipelined system that performs contribution sentence identification, phrase extraction from contribution sentences, and Information Units (IUs) classification, and organizes phrases into triplets (subject, predicate, object) from NLP scholarly publications. We develop a multitasking system (ContriSci) for contribution sentence identification with two supporting tasks, viz. Section Identification and Citance Classification. We use the Bidirectional Encoder Representations from Transformers (BERT)–Conditional Random Field (CRF) model for the phrase extraction and train with two additional datasets: SciERC and SciClaim. To classify the contribution sentences into IUs, we use a BERT-based model. For the triplet extraction, we categorize the triplets into five categories and classify them with a BERT-based classifier. Our proposed approach yields F1 scores of 64.21%, 77.47%, 84.52%, and 62.71% for contribution sentence identification, phrase extraction, IUs classification, and triplet extraction, respectively, in the non-end-to-end setting. The relative improvement for contribution sentence identification, IUs classification, and triplet extraction is 8.08, 2.46, and 2.31 in terms of F1 score on the NLPContributionGraph (NCG) dataset. Our system achieves the best performance (57.54% F1 score) in the end-to-end pipeline with all four sub-tasks combined. We make our code available at: https://github.com/92Komal/pipeline_triplet_extraction.
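
A schematic of the four-stage pipeline, with an off-the-shelf BERT head standing in for the trained first stage; the checkpoint, label convention, and stage interfaces are illustrative assumptions rather than the released ContriSci components.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
sent_clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # contribution vs. non-contribution

def is_contribution(sentence: str) -> bool:
    """Stage 1 stand-in; the head is untrained here and needs fine-tuning in practice."""
    logits = sent_clf(**tok(sentence, return_tensors="pt")).logits
    return bool(logits.argmax(-1).item())

def run_pipeline(sentences, phrase_extractor, iu_classifier, triplet_builder):
    """Chain the four stages; each later stage sees only earlier-stage output."""
    contrib = [s for s in sentences if is_contribution(s)]
    phrases = [phrase_extractor(s) for s in contrib]  # a BERT-CRF tagger in the paper
    ius = [iu_classifier(s) for s in contrib]         # a BERT classifier in the paper
    return [triplet_builder(p, iu) for p, iu in zip(phrases, ius)]
```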

Citations: 0
Sequential sentence classification in research papers using cross-domain multi-task learning
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-22 DOI: 10.1007/s00799-023-00392-z
Arthur Brack, Elias Entrup, Markos Stamatakis, Pascal Buschermöhle, Anett Hoppe, Ralph Ewerth

The automatic semantic structuring of scientific text allows for more efficient reading of research articles and is an important indexing step for academic search engines. Sequential sentence classification is an essential structuring task and targets the categorisation of sentences based on their content and context. However, the potential of transfer learning for sentence classification across different scientific domains and text types, such as full papers and abstracts, has not yet been explored in prior work. In this paper, we present a systematic analysis of transfer learning for scientific sequential sentence classification. For this purpose, we derive seven research questions and present several contributions to address them: (1) We suggest a novel uniform deep learning architecture and multi-task learning for cross-domain sequential sentence classification in scientific text. (2) We tailor two transfer learning methods to deal with the given task, namely sequential transfer learning and multi-task learning. (3) We compare the results of the two best models using qualitative examples in a case study. (4) We provide an approach for the semi-automatic identification of semantically related classes across annotation schemes and analyse the results for four annotation schemes. The clusters and underlying semantic vectors are validated using k-means clustering. (5) Our comprehensive experimental results indicate that when using the proposed multi-task learning architecture, models trained on datasets from different scientific domains benefit from one another. Our approach significantly outperforms state of the art on full paper datasets while being on par for datasets consisting of abstracts.
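
One way to realize the shared-encoder multi-task design described above is a single transformer encoder with one classification head per dataset or annotation scheme, as in the sketch below; the hidden size, task names, and label counts are assumptions, and using the [CLS] vector per sentence is a simplification of a full sequential model.

```python
import torch.nn as nn

class MultiTaskSentenceClassifier(nn.Module):
    """Shared encoder, one linear head per dataset/annotation scheme."""
    def __init__(self, encoder, hidden_size, task_labels):
        super().__init__()
        self.encoder = encoder  # e.g. a BERT-style model
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, n) for task, n in task_labels.items()})

    def forward(self, input_ids, attention_mask, task):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the sentence vector
        return self.heads[task](cls)       # route through this dataset's head

# Training alternates batches across corpora so domains inform one another:
# model = MultiTaskSentenceClassifier(bert, 768, {"pubmed": 5, "csabstruct": 5})
# logits = model(batch["input_ids"], batch["attention_mask"], task="pubmed")
```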

Citations: 0
Academics’ experience of online reading lists and the use of reading list notes
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-12 DOI: 10.1007/s00799-023-00387-w
P. P. N. V. Kumara, Annika Hinze, Nicholas Vanderschantz, Claire Timpany

Reading list systems are widely used in tertiary education as a pedagogical tool and for tracking copyrighted material. This paper explores academics' experiences with reading lists and in particular the use of the reading list notes feature. A mixed-methods approach was employed in which we first conducted interviews with academics about their experience with reading lists. We identified the need for streamlining the workflow of the reading lists set-up, improved usability of the interfaces, and better synchronization with other teaching support systems. Next, we performed a log analysis of the use of the notes feature throughout one academic year. Our log analysis showed that the notes feature is under-utilized by academics. We recommend improving the systems’ usability by re-engineering the user workflows and better integrating the notes feature into academic teaching.

Citations: 0
SciND: a new triplet-based dataset for scientific novelty detection via knowledge graphs
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-08 DOI: 10.1007/s00799-023-00386-x
Komal Gupta, Ammaar Ahmad, Tirthankar Ghosal, Asif Ekbal

Detecting texts that contain semantic-level new information is not straightforward. The problem becomes more challenging for research articles. Over the years, many datasets and techniques have been developed to attempt automatic novelty detection. However, the majority of existing textual novelty detection investigations target general domains like newswire. A comprehensive dataset for scientific novelty detection is not available in the literature. In this paper, we present a new triplet-based corpus (SciND) for scientific novelty detection from research articles via knowledge graphs. The proposed dataset consists of three types of triplets: (i) triplets for the knowledge graph, (ii) novel triplets, and (iii) non-novel triplets. We build a scientific knowledge graph for research articles using triplets across several natural language processing (NLP) domains and extract novel triplets from papers published in the year 2021. For the non-novel articles, we use blog post summaries of the research articles. Our knowledge graph is domain-specific. We build the knowledge graph for seven NLP domains. We further use a feature-based novelty detection scheme from the research articles as a baseline. Moreover, we show the applicability of our proposed dataset using our baseline novelty detection algorithm. Our algorithm yields a baseline F1 score of 72%. We present an analysis and discuss the future scope of our proposed dataset. To the best of our knowledge, this is the very first dataset for scientific novelty detection via a knowledge graph. We make our code and dataset publicly available at https://github.com/92Komal/Scientific_Novelty_Detection.
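
At the triplet level, the novelty decision can be framed as simply as membership against the domain knowledge graph, as in this toy baseline; the paper's actual baseline is feature-based, so treat this purely as an illustration of the task framing, with made-up example triplets.

```python
def novel_triplets(candidate_triplets, knowledge_graph):
    """A (subject, predicate, object) triplet is novel if the KG lacks it."""
    kg = set(knowledge_graph)
    return [t for t in candidate_triplets if t not in kg]

kg = {("BERT", "used-for", "classification"), ("CRF", "used-for", "tagging")}
candidates = [("BERT", "used-for", "classification"),
              ("prompting", "used-for", "novelty-detection")]
print(novel_triplets(candidates, kg))  # -> [("prompting", "used-for", "novelty-detection")]
```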

Citations: 0
Human-in-the-loop latent space learning for biblio-record-based literature management
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-04 DOI: 10.1007/s00799-023-00389-8

Every researcher must conduct a literature review, and the document management needs of researchers working on various research topics vary. However, there are two major challenges. First, traditional methods such as tree hierarchies of document folders and tag-based management are no longer effective given the enormous volume of publications. Second, although the bibliographic information of papers is available to everyone, many of the papers themselves can only be accessed through paid services. This study attempts to develop an interactive tool for personal literature management based solely on bibliographic records. To make such a tool possible, we developed a principled “human-in-the-loop latent space learning” method that estimates the management criteria of each researcher based on his or her feedback to calculate the positions of documents in a two-dimensional space on the screen. As a set of bibliographic records forms a graph, our model is naturally designed as a graph-based encoder–decoder model that connects the graph and the space. In addition, we devised an active learning framework for it using uncertainty sampling. The challenge here is to define the uncertainty in this problem setting. Experiments with ten researchers from the humanities, science, and engineering domains show that the proposed framework provides superior results to a typical graph convolutional encoder–decoder model. In addition, we found that our active learning framework was effective in selecting good samples.
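
The active-learning component can be sketched as classic uncertainty sampling: ask the researcher to place the document whose predicted position the model is least certain about. The entropy criterion and the data below are illustrative assumptions, not the authors' exact uncertainty definition (which the abstract notes is itself the challenge).

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy per row; higher means the model is less certain."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -(probs * np.log(probs)).sum(axis=-1)

def next_query(unlabeled_ids, prob_matrix):
    """Pick the unlabeled document with maximum predictive entropy."""
    h = predictive_entropy(prob_matrix)  # one row of class probabilities per document
    return unlabeled_ids[int(np.argmax(h))]

ids = ["doc-3", "doc-7", "doc-9"]
probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]])
print(next_query(ids, probs))  # -> "doc-7", the most uncertain placement
```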

Citations: 0