International Journal on Digital Libraries: latest articles

Methods for generation, recommendation, exploration and analysis of scholarly publications

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-09-03 DOI: 10.1007/s00799-024-00409-1

Gianmaria Silvello, Oscar Corcho, Paolo Manghi

In the shifting landscape of sharing knowledge, it is no longer only about writing papers. After a paper is written, what comes next is an integral part of the process. This special issue delves into the transformative landscape of scholarly communication, exploring novel methodologies and technologies reshaping how scholarly content is generated, recommended, explored and analysed. Indeed, the contemporary perspective on scholarly publication recognizes the centrality of post-publication activities. The criticality of refining and scrutinizing manuscripts has gained prominence, surpassing the act of dissemination. The emphasis has shifted from publication to ensuring visibility and comprehension of the conveyed content. The papers compiled in this special issue scrutinize these evolving dynamics. They delve into the intricacies of post-processing and close examination of manuscripts, acknowledging the impact of these aspects. The overarching objective is to stimulate scholarly discussions on the evolving nature of communication in academia.

Citations: 0

Comparing free reference extraction pipelines

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-06-20 DOI: 10.1007/s00799-024-00404-6

Tobias Backes, Anastasiia Iurshina, Muhammad Ahsan Shahid, Philipp Mayr

In this paper, we compare the performance of several popular pre-trained reference extraction and segmentation toolkits combined in different pipeline configurations on three different datasets. The extraction is end-to-end, i.e. the input is PDF documents, and the output is parsed reference objects. We evaluate both the reference strings and the individual fields of the reference objects, aligning references by identical fields and close-to-identical values. Our results show that Grobid and AnyStyle perform best among the compared tools, although one may want to use them in combination. Our work is meant to serve as a reference for researchers interested in applying out-of-the-box reference extraction and parsing tools, for example, as a preprocessing step to a more complex research question. Our detailed results on different datasets, broken down by individual parsed field, will allow them to focus on aspects that are particularly important to them.
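A per-field evaluation of this kind can be sketched as follows. This is only an illustration, not the authors' evaluation code: it assumes the gold and predicted references are already aligned by position, and the `normalize` helper and the `field_scores` function are hypothetical names that stand in for the paper's "close-to-identical values" matching.

```python
from typing import Dict, List, Tuple

Reference = Dict[str, str]  # one parsed reference, e.g. {"title": ..., "year": ...}


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so near-identical values compare equal.

    Assumption: this is a crude stand-in for a close-to-identical comparison.
    """
    return " ".join(value.lower().split())


def field_scores(
    gold: List[Reference], predicted: List[Reference], field: str
) -> Tuple[float, float]:
    """Return (precision, recall) for one field.

    Assumption: gold[i] and predicted[i] describe the same reference.
    """
    pairs = list(zip(gold, predicted))
    n_pred = sum(1 for _, p in pairs if field in p)
    n_gold = sum(1 for g, _ in pairs if field in g)
    n_match = sum(
        1
        for g, p in pairs
        if field in g and field in p and normalize(g[field]) == normalize(p[field])
    )
    precision = n_match / n_pred if n_pred else 0.0
    recall = n_match / n_gold if n_gold else 0.0
    return precision, recall
```

Reporting precision and recall separately per field makes it visible when a tool extracts a field often but inaccurately, or accurately but rarely.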

Citations: 0

Digital detection of play characters’ relationships in Shakespeare’s plays: extended cross-correlation analysis of the character appearance frequencies

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-05-27 DOI: 10.1007/s00799-024-00401-9

Miyuki Yamada, Yuichi Murai, Ichiro Kumagai

We propose a method for visualizing literary works that quantitatively extracts the mutual relationships among play characters from the narrative of a storyline. The method first determines the cross-correlation of the appearance frequencies in the time domain between two play characters, which is calculated for all pairs of characters in each narrative. We also calculate the correlation among three play characters to find unique triangular relationships. Then we create a graphical representation of the relationships using node-link representations based on a physical potential model. The method is suitable for dramas, as demonstrated for ten famous Shakespeare plays. The resulting visualizations show good agreement with the conventional understanding of each play and also provide new insight into Shakespearean criticism.
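The first step, correlating the appearance frequencies of two characters over time, might look like the sketch below. It assumes each character is represented by a list of appearance counts per time step (for example, lines spoken per scene) and uses a plain Pearson-normalized cross-correlation with an optional lag; the abstract does not give the exact definition of the "extended" analysis, so the function name and normalization are assumptions.

```python
from math import sqrt
from typing import Sequence


def cross_correlation(a: Sequence[float], b: Sequence[float], lag: int = 0) -> float:
    """Normalized cross-correlation of a[t] with b[t + lag].

    Returns a value in [-1, 1], or 0.0 when either series is constant or
    the overlap is empty.
    """
    if lag >= 0:
        x, y = a[: len(a) - lag], b[lag:]
    else:
        x, y = a[-lag:], b[: len(b) + lag]
    n = min(len(x), len(y))
    if n == 0:
        return 0.0
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

Computing this value for every pair of characters yields a symmetric matrix at lag 0, which could then serve as edge weights for a node-link layout.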

Citations: 0

Book recommendation system: reviewing different techniques and approaches

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-05-14 DOI: 10.1007/s00799-024-00403-7

P. Devika, A. Milton

E-reading has grown in popularity and has increased the number of book readers. With online book reading websites, it is much simpler to read any book at any time by simply typing its name into a search engine. These websites offer users a free reading platform with an unlimited number of choices without exceeding any rights. However, statistics reveal that reading is dwindling, particularly among young people. In this survey, we present several existing approaches employed to design book recommendation systems from 2012 to 2023. We discuss the different types of datasets used to extract information about books and users in terms of features, source, and usage. We identify and discuss six categories of book recommendation techniques, which lay the groundwork for future study in this area. We also briefly discuss issues related to book recommendation systems and analyze the performance reported by various research works. Finally, we highlight research concerns and the future scope for improving the performance of book recommender systems. We hope these findings will help researchers explore book recommender systems further.

Citations: 0

Structured abstract generator (SAG) model: analysis of IMRAD structure of articles and its effect on extractive summarization

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-05-07 DOI: 10.1007/s00799-024-00402-8

Ayşe Esra Özkan Çelik, Umut Al

An abstract is the most crucial element that may convince readers to read the complete text of a scientific publication. However, studies show that in terms of organization, readability, and style, abstracts are also among the most troublesome parts of the pertinent manuscript. The ultimate goal of this article is to produce more understandable abstracts using automatic methods, thereby contributing to scientific communication in Turkish. We propose a summarization system based on extractive techniques combining general features that have been shown to be beneficial for Turkish. To construct the data set for this aim, a sample of 421 peer-reviewed Turkish articles in the field of librarianship and information science was developed. First, the structure of the full-texts, and their readability in comparison with author abstracts, were examined for text quality evaluation. A content-based evaluation of the system outputs was then carried out. System outputs produced with and without the structural features of the full-texts were compared. Structured outputs outperformed classical outputs in terms of content and text quality. Each output group is more readable than the original abstracts. Additionally, it was discovered that higher-quality outputs are correlated with more structured full-texts, highlighting the importance of structural writing. Finally, it was determined that our system can facilitate the scholarly communication process as an auxiliary tool for authors and editors. Findings also indicate the significance of structural writing for better scholarly communication.

Citations: 0

Building datasets to support information extraction and structure parsing from electronic theses and dissertations

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-05-03 DOI: 10.1007/s00799-024-00395-4

William A. Ingram, Jian Wu, Sampanna Yashwant Kahu, Javaid Akbar Manzoor, Bipasha Banerjee, Aman Ahuja, Muntabir Hasan Choudhury, Lamia Salsabil, Winston Shields, Edward A. Fox

Despite the millions of electronic theses and dissertations (ETDs) publicly available online, digital library services for ETDs have not evolved past simple search and browse at the metadata level. We need better digital library services that allow users to discover and explore the content buried in these long documents. Recent advances in machine learning have shown promising results for decomposing documents into their constituent parts, but these models and techniques require data for training and evaluation. In this article, we present high-quality datasets to train, evaluate, and compare machine learning methods in tasks that are specifically suited to identify and extract key elements of ETD documents. We explain how we construct the datasets by manually labeling the data or by deriving labeled data through synthetic processes. We demonstrate how our datasets can be used to develop downstream applications and to evaluate, retrain, or fine-tune pre-trained machine learning models. We describe our ongoing work to compile benchmark datasets and exploit machine learning techniques to build intelligent digital libraries for ETDs.

Citations: 0

Robots still outnumber humans in web archives in 2019, but less than in 2015 and 2012

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-03-07 DOI: 10.1007/s00799-024-00397-2

Himarsha R. Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, Michele C. Weigle

The significance of the web and the crucial role of web archives in its preservation highlight the necessity of understanding how users, both human and robot, access web archive content, and how best to satisfy the disparate needs of both types of users. To identify robots and humans in web archives and analyze their respective access patterns, we used the Internet Archive’s (IA) Wayback Machine access logs from 2012, 2015, and 2019, as well as Arquivo.pt’s (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified those sessions as human or robot based on their browsing behavior. To better understand how users navigate through the web archives, we evaluated these sessions to discover user access patterns. Across the two archives and the three years of IA access logs (2012 vs. 2015 vs. 2019), we present a comparison of detected robots vs. humans and their user access patterns and temporal preferences. The share of requests attributed to robots in IA 2012 (91% of requests) and IA 2015 (88% of requests) is greater than in IA 2019 (70% of requests). Robots account for 98% of requests in Arquivo.pt (2019). We found that the robots are almost entirely limited to “Dip” and “Skim” access patterns in IA 2012 and 2015, but exhibit all the patterns and their combinations in IA 2019. Both humans and robots show a preference for web pages archived in the near past.
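Session identification from access logs can be sketched as below. The abstract does not state how sessions were delimited, so everything here is an assumption made for illustration: requests are reduced to a client key plus a timestamp in seconds, and a session ends after a hypothetical inactivity `timeout` of 600 seconds.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Request = Tuple[str, float]  # (client key, e.g. IP plus user agent; timestamp in seconds)


def split_sessions(
    requests: List[Request], timeout: float = 600.0
) -> Dict[str, List[List[float]]]:
    """Group request timestamps into sessions per client.

    A new session starts whenever the gap to the previous request of the
    same client exceeds `timeout` (an assumed threshold, not the paper's).
    """
    by_client: Dict[str, List[float]] = defaultdict(list)
    for client, timestamp in requests:
        by_client[client].append(timestamp)

    sessions: Dict[str, List[List[float]]] = {}
    for client, stamps in by_client.items():
        stamps.sort()
        current = [stamps[0]]
        client_sessions = [current]
        for timestamp in stamps[1:]:
            if timestamp - current[-1] > timeout:
                current = [timestamp]  # gap too long: open a new session
                client_sessions.append(current)
            else:
                current.append(timestamp)
        sessions[client] = client_sessions
    return sessions
```

Features such as request rate or session length could then be computed per session and passed to whatever human-versus-robot classifier is used.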

Citations: 0

Stance prediction with a relevance attribute to political issues in comparing the opinions of citizens and city councilors

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-02-26 DOI: 10.1007/s00799-024-00396-3

Ko Senoo, Yohei Seki, Wakako Kashino, Atsushi Keyaki, Noriko Kando

This study focuses on a method for differentiating between the stance of citizens and city councilors on political issues (i.e., in favor or against) and attempts to compare the arguments of both sides. We created a dataset by annotating citizen tweets and city council minutes with labels for four attributes: stance, usefulness, regional dependence, and relevance. We then fine-tuned a pretrained large language model using this dataset to assign the attribute labels to a large quantity of unlabeled data automatically. We introduced multitask learning to train each attribute jointly with relevance to identify the clues by focusing on those sentences that were relevant to the political issues. Our prediction models are based on T5, a large language model suitable for multitask learning. We compared the results from our system with those that used BERT or RoBERTa. Our experimental results showed that the macro-F1-scores for stance were improved by 1.8% for citizen tweets and 1.7% for city council minutes with multitask learning. Using the fine-tuned model to analyze real opinion gaps, we found that although the vaccination regime was positively evaluated by city councilors in Fukuoka city, it was not rated very highly by citizens.
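For readers unfamiliar with the reported metric, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare stance label counts as much as a frequent one. The sketch below is a generic implementation of that definition; the label names in the usage note are made up, and nothing here reproduces the authors' models or data.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

With gold labels `["favor", "favor", "against", "against"]` and predictions `["favor", "against", "against", "against"]`, the per-class scores are 2/3 and 4/5, giving a macro-F1 of 11/15.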

Citations: 0

Towards privacy-aware exploration of archived personal emails

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-02-21 DOI: 10.1007/s00799-024-00394-5

Zoe Bartliff, Yunhyong Kim, Frank Hopfgartner

This paper examines how privacy measures, such as anonymisation and aggregation processes for email collections, can affect the perceived usefulness of email visualisations for research, especially in the humanities and social sciences. The work is intended to inform archivists and data managers who are faced with the challenge of accessioning and reviewing increasingly sizeable and complex personal digital collections. The research in this paper provides a focused user study to investigate the usefulness of data visualisation as a mediator between privacy-aware management of data and maximisation of research value of data. The research is carried out with researchers and archivists with vested interest in using, making sense of, and/or archiving the data to derive meaningful results. Participants tend to perceive email visualisations as useful, with an average rating of 4.281 (out of 7) for all the visualisations in the study, with above average ratings for mountain graphs and word trees. The study shows that while participants voice a strong desire for information identifying individuals in email data, they perceive visualisations as almost equally useful for their research and/or work when aggregation is employed in addition to anonymisation.

Citations: 0

Exploiting the untapped functional potential of Memento aggregators beyond aggregation

IF 1.5 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

International Journal on Digital Libraries

Pub Date : 2024-01-27 DOI: 10.1007/s00799-023-00391-0

Mat Kelly

Web archives capture, retain, and present historical versions of web pages. Viewing web archives often amounts to a user visiting the Wayback Machine homepage, typing in a URL, then choosing the date and time of a capture. Other web archives also capture the web and use Memento as an interoperable point of querying their captures. Memento aggregators are web-accessible software packages that allow clients to send requests for past web pages to a single endpoint, which then relays each request to a set of web archives. Though few deployed aggregator instances exist, they all, for the most part, follow the same model: they serve a client's request for the URI of an original resource (URI-R) by first querying a collection of web archives and then aggregating the responses. This single-tier querying need not be the logical flow of an aggregator, so long as a user can still utilize the aggregator from a single URL. In this paper, we discuss theoretical aggregation models of web archives. We first describe the status quo as the conventional behavior exhibited by an aggregator. We then build on prior work to describe a multi-tiered, structured querying model that may be exhibited by an aggregator. We highlight some potential issues and high-level optimizations to ensure efficient aggregation while also extending the state of the art of Memento aggregation. Part of our contribution is the extension of an open-source, user-deployable Memento aggregator to exhibit the capability described in this paper. We also extend a browser extension that typically consults an aggregator so that it can aggregate by itself rather than consulting an external service. A purely client-side, browser-based Memento aggregator is novel to this work.
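The conventional single-tier flow described above (query every archive for a URI-R, then merge the answers) can be sketched as follows. This is a toy model under stated assumptions, not the aggregator extended in the paper: archives are represented as in-memory callables rather than HTTP endpoints, the `aggregate` function and the `Memento` tuple are hypothetical names, and capture datetimes are assumed to be ISO 8601 strings in UTC so that they sort lexicographically.

```python
from typing import Callable, Dict, List, Tuple

Memento = Tuple[str, str]  # (ISO 8601 UTC capture datetime, URI-M of the capture)
Archive = Callable[[str], List[Memento]]  # maps a URI-R to that archive's captures


def aggregate(uri_r: str, archives: Dict[str, Archive]) -> List[Memento]:
    """Query every archive for uri_r and return one merged, time-ordered list.

    Archives that raise an error are skipped so that a single failing
    archive does not break the aggregated response.
    """
    merged: List[Memento] = []
    for _name, lookup in archives.items():
        try:
            merged.extend(lookup(uri_r))
        except Exception:
            continue  # assumption: failures are ignored rather than reported
    # Deduplicate identical entries and order by capture datetime.
    return sorted(set(merged))
```

A multi-tiered variant could replace the flat loop with ordered groups of archives that are consulted only when earlier groups return nothing, which is one way to read the structured querying model the abstract mentions.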

Citations: 0