Pub Date : 2024-09-03 DOI: 10.1007/s00799-024-00409-1
Gianmaria Silvello, Oscar Corcho, Paolo Manghi
In the shifting landscape of knowledge sharing, scholarly communication is no longer only about writing papers: what comes after a paper is written is an integral part of the process. This special issue delves into the transformative landscape of scholarly communication, exploring novel methodologies and technologies reshaping how scholarly content is generated, recommended, explored, and analysed. Indeed, the contemporary perspective on scholarly publication recognizes the centrality of post-publication activities. Refining and scrutinizing manuscripts has gained prominence beyond the act of dissemination, and the emphasis has shifted from publication itself to ensuring the visibility and comprehension of the conveyed content. The papers compiled in this special issue scrutinize these evolving dynamics, delving into the post-processing and close examination of manuscripts and acknowledging the impact of these activities. The overarching objective is to stimulate scholarly discussion on the evolving nature of communication in academia.
Title: Methods for generation, recommendation, exploration and analysis of scholarly publications (International Journal on Digital Libraries)
Pub Date : 2024-06-20 DOI: 10.1007/s00799-024-00404-6
Tobias Backes, Anastasiia Iurshina, Muhammad Ahsan Shahid, Philipp Mayr
In this paper, we compare the performance of several popular pre-trained reference extraction and segmentation toolkits, combined in different pipeline configurations, on three different datasets. The extraction is end-to-end: the input is PDF documents, and the output is parsed reference objects. The evaluation covers both reference strings and individual fields of the reference objects, using alignment by identical fields and close-to-identical values. Our results show that Grobid and AnyStyle perform best of all the compared tools, although one may want to use them in combination. Our work is meant to serve as a reference for researchers interested in applying out-of-the-box reference extraction and parsing tools, for example, as a preprocessing step for a more complex research question. Our detailed results on different datasets, with results for individual parsed fields, will allow them to focus on the aspects that are particularly important to them.
Title: Comparing free reference extraction pipelines
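The field-level evaluation the abstract describes (alignment by identical fields and close-to-identical values) can be sketched as follows. This is an illustrative reconstruction, not the authors' exact protocol: the field names, the 0.9 similarity threshold, and the use of a simple character-ratio measure are all assumptions.

```python
from difflib import SequenceMatcher

def field_match(pred, gold, threshold=0.9):
    """Fields count as matching if identical or close to identical."""
    if pred == gold:
        return True
    return SequenceMatcher(None, pred, gold).ratio() >= threshold

def field_f1(pred_refs, gold_refs, fields=("author", "title", "year")):
    """Micro-averaged F1 over individual fields of pre-aligned reference pairs."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_refs, gold_refs):  # assumes references are aligned
        for f in fields:
            p, g = pred.get(f), gold.get(f)
            if g is None:
                if p is not None:
                    fp += 1
            elif p is None:
                fn += 1
            elif field_match(p, g):
                tp += 1
            else:
                fp += 1  # produced a value, but the wrong one
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

With a fuzzy threshold, a trailing-punctuation difference such as "Smith, J" vs. "Smith, J." still counts as a correct field, which is the spirit of the close-to-identical alignment.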
Pub Date : 2024-05-27 DOI: 10.1007/s00799-024-00401-9
Miyuki Yamada, Yuichi Murai, Ichiro Kumagai
We propose a method for visualizing literary works that quantitatively extracts the mutual relationships among play characters from the narrative of a storyline. The method first determines the cross-correlation of the appearance frequencies in the time domain between two play characters, which is calculated for all pairs of characters in each narrative. We also calculate the correlation among three play characters to find unique triangular relationships. Then we create a graphical representation of the relationships using node-link representations based on a physical potential model. The method is suitable for dramas, as demonstrated for ten famous Shakespeare plays. The resulting visualizations show good agreement with the conventional understanding of each play and also provide new insight into Shakespearean criticism.
Title: Digital detection of play characters’ relationships in Shakespeare’s plays: extended cross-correlation analysis of the character appearance frequencies
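As a rough illustration of the zero-lag case of such a cross-correlation, the sketch below correlates per-scene appearance counts for pairs of characters. The character series are invented for illustration, and the paper's actual frequency extraction and lag handling may differ.

```python
from math import sqrt

def appearance_correlation(a, b):
    """Zero-lag normalized cross-correlation of two appearance-frequency series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [x - ma for x in a]  # mean-centred series
    db = [x - mb for x in b]
    denom = sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

# Hypothetical per-scene line counts for three characters
romeo  = [5, 0, 7, 3, 0, 6]
juliet = [4, 0, 6, 2, 0, 5]
tybalt = [0, 6, 0, 0, 5, 0]
```

Characters who share scenes correlate strongly; characters who appear in complementary scenes correlate negatively, which is the kind of signal a node-link layout can then render as attraction or repulsion.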
Pub Date : 2024-05-14 DOI: 10.1007/s00799-024-00403-7
P. Devika, A. Milton
E-reading has grown in popularity, greatly increasing the number of book readers. With online book-reading websites, it is much simpler to read any book at any time by simply typing its name into a search engine. These websites offer users a free reading platform with an unlimited number of choices without infringing any rights. However, statistics reveal that reading is dwindling, particularly among young people. In this survey, we present several existing approaches employed to design book recommendation systems from 2012 to 2023. We discuss the different types of datasets used to extract information about books and users, in terms of their features, sources, and usage. We recognize and discuss six categories of book recommendation techniques, which lay the groundwork for future study in this area. We also briefly discuss the issues related to book recommendation systems and analyse the performance reported by various research works on them. Finally, we highlight open research concerns and future directions for improving the performance of book recommender systems. We hope these findings will help researchers to explore book recommender systems further.
Title: Book recommendation system: reviewing different techniques and approaches
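As one concrete instance of the surveyed technique families, a minimal user-based collaborative filtering recommender might look like the sketch below. The ratings data and the cosine-similarity weighting are illustrative assumptions, not drawn from the survey itself.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(ratings, user, k=2):
    """Rank books the user has not read, weighting neighbours' ratings by similarity."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = cosine(ratings[user], ratings[other])
        for book, r in ratings[other].items():
            if book not in ratings[user]:
                scores[book] = scores.get(book, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

ratings = {  # hypothetical user-book ratings on a 1-5 scale
    "ann": {"dune": 5, "emma": 3},
    "bob": {"dune": 4, "hamlet": 5},
    "eve": {"emma": 4, "hamlet": 2},
}
```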
Pub Date : 2024-05-07 DOI: 10.1007/s00799-024-00402-8
Ayşe Esra Özkan Çelik, Umut Al
An abstract is the most crucial element in convincing readers to read the complete text of a scientific publication. However, studies show that, in terms of organization, readability, and style, abstracts are also among the most troublesome parts of a manuscript. The ultimate goal of this article is to produce more understandable abstracts with automatic methods that will contribute to scientific communication in Turkish. We propose a summarization system based on extractive techniques, combining general features that have been shown to be beneficial for Turkish. To construct the dataset for this aim, a sample of 421 peer-reviewed Turkish articles in the field of librarianship and information science was developed. First, the structure of the full texts and their readability in comparison with the author abstracts were examined for text quality evaluation. A content-based evaluation of the system outputs was then carried out, comparing outputs produced with and without the structural features of the full texts. Structured outputs outperformed classical outputs in terms of content and text quality, and both output groups had better readability than the original abstracts. Additionally, we discovered that higher-quality outputs are correlated with more structured full texts, highlighting the importance of structural writing. Finally, we determined that our system can facilitate the scholarly communication process as an auxiliary tool for authors and editors. Our findings also indicate the significance of structural writing for better scholarly communication.
Title: Structured abstract generator (SAG) model: analysis of IMRAD structure of articles and its effect on extractive summarization
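A generic extractive scorer of the kind the abstract describes might combine term-frequency and position features as in this sketch. The specific features and weights are illustrative assumptions, not the authors' SAG model.

```python
from collections import Counter

def summarize(sentences, k=2):
    """Score sentences by average term frequency plus a lead/tail position
    bonus, then keep the top k in their original document order."""
    words = [s.lower().split() for s in sentences]
    tf = Counter(w for ws in words for w in ws)  # corpus-wide term frequencies
    scores = []
    for i, ws in enumerate(words):
        freq = sum(tf[w] for w in ws) / len(ws) if ws else 0.0
        position = 1.0 if i == 0 or i == len(sentences) - 1 else 0.0
        scores.append(freq + position)
    top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k])
    return [sentences[i] for i in top]
```

A structure-aware variant could add a feature per IMRAD section (introduction, methods, results, discussion), which is where full-text structure would enter the scoring.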
Pub Date : 2024-05-03 DOI: 10.1007/s00799-024-00395-4
William A. Ingram, Jian Wu, Sampanna Yashwant Kahu, Javaid Akbar Manzoor, Bipasha Banerjee, Aman Ahuja, Muntabir Hasan Choudhury, Lamia Salsabil, Winston Shields, Edward A. Fox
Despite the millions of electronic theses and dissertations (ETDs) publicly available online, digital library services for ETDs have not evolved past simple search and browse at the metadata level. We need better digital library services that allow users to discover and explore the content buried in these long documents. Recent advances in machine learning have shown promising results for decomposing documents into their constituent parts, but these models and techniques require data for training and evaluation. In this article, we present high-quality datasets to train, evaluate, and compare machine learning methods on tasks specifically suited to identifying and extracting key elements of ETD documents. We explain how we construct the datasets by manually labeling the data or by deriving labeled data through synthetic processes. We demonstrate how our datasets can be used to develop downstream applications and to evaluate, retrain, or fine-tune pre-trained machine learning models. We describe our ongoing work to compile benchmark datasets and exploit machine learning techniques to build intelligent digital libraries for ETDs.
Title: Building datasets to support information extraction and structure parsing from electronic theses and dissertations
Pub Date : 2024-03-07 DOI: 10.1007/s00799-024-00397-2
Himarsha R. Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, Michele C. Weigle
The significance of the web and the crucial role of web archives in its preservation highlight the necessity of understanding how users, both human and robot, access web archive content, and how best to satisfy the disparate needs of both types of users. To identify robots and humans in web archives and analyze their respective access patterns, we used the Internet Archive’s (IA) Wayback Machine access logs from 2012, 2015, and 2019, as well as Arquivo.pt’s (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified those sessions as human or robot based on their browsing behavior. To better understand how users navigate through the web archives, we evaluated these sessions to discover user access patterns. Based on the two archives and the three years of IA access logs (2012 vs. 2015 vs. 2019), we present a comparison of detected robots vs. humans, their access patterns, and their temporal preferences. The proportion of robot requests detected in IA 2012 (91% of requests) and IA 2015 (88% of requests) is greater than in IA 2019 (70% of requests). Robots account for 98% of requests in Arquivo.pt (2019). We found that the robots are almost entirely limited to “Dip” and “Skim” access patterns in IA 2012 and 2015, but exhibit all the patterns and their combinations in IA 2019. Both humans and robots show a preference for web pages archived in the near past.
Title: Robots still outnumber humans in web archives in 2019, but less than in 2015 and 2012
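Classifying a session as robot or human from browsing behavior could be approximated with heuristics like the sketch below. The thresholds and feature names are illustrative assumptions, not the paper's actual detection rules.

```python
def classify_session(session):
    """Heuristic robot detection from session features, loosely following
    common log-analysis cues: robots.txt hits, very high request rates, and
    absent embedded-image fetches (browsers fetch page images; most crawlers
    do not)."""
    if session.get("robots_txt_hits", 0) > 0:
        return "robot"
    if session["requests"] / max(session["duration_s"], 1) > 1.0:  # >1 req/s
        return "robot"
    if session["requests"] >= 10 and session.get("image_requests", 0) == 0:
        return "robot"
    return "human"
```

Once sessions are labelled, the sequence of requested archived URLs and datetimes within each session can be matched against access patterns such as “Dip” (a single capture) or “Skim” (many TimeMaps, few captures).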
Pub Date : 2024-02-26 DOI: 10.1007/s00799-024-00396-3
Ko Senoo, Yohei Seki, Wakako Kashino, Atsushi Keyaki, Noriko Kando
This study focuses on a method for differentiating between the stances of citizens and city councilors on political issues (i.e., in favor or against) and attempts to compare the arguments of both sides. We created a dataset by annotating citizen tweets and city council minutes with labels for four attributes: stance, usefulness, regional dependence, and relevance. We then fine-tuned a pretrained large language model using this dataset to assign the attribute labels to a large quantity of unlabeled data automatically. We introduced multitask learning to train each attribute jointly with relevance, identifying clues by focusing on the sentences that were relevant to the political issues. Our prediction models are based on T5, a large language model suitable for multitask learning. We compared the results from our system with those that used BERT or RoBERTa. Our experimental results showed that with multitask learning, the macro-F1 scores for stance improved by 1.8% for citizen tweets and 1.7% for city council minutes. Using the fine-tuned model to analyze real opinion gaps, we found that although the vaccination regime was positively evaluated by city councilors in Fukuoka city, it was not rated very highly by citizens.
Title: Stance prediction with a relevance attribute to political issues in comparing the opinions of citizens and city councilors
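The multitask setup can be illustrated with T5-style task prefixes: one (input, target) pair per attribute, so that a single model learns all four attributes jointly. The field names and prefix format below are assumptions for illustration, not the authors' exact preprocessing.

```python
def make_multitask_examples(record):
    """Build T5-style (input, target) text pairs from one annotated record,
    marking each task with a prefix so one model can be trained on all of
    them jointly."""
    tasks = {
        "stance": record["stance"],                        # e.g. "favor" / "against"
        "usefulness": record["usefulness"],
        "regional dependence": record["regional_dependence"],
        "relevance": record["relevance"],
    }
    return [(f"{task}: {record['text']}", label) for task, label in tasks.items()]
```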
Pub Date : 2024-02-21 DOI: 10.1007/s00799-024-00394-5
Zoe Bartliff, Yunhyong Kim, Frank Hopfgartner
This paper examines how privacy measures, such as anonymisation and aggregation processes for email collections, can affect the perceived usefulness of email visualisations for research, especially in the humanities and social sciences. The work is intended to inform archivists and data managers who are faced with the challenge of accessioning and reviewing increasingly sizeable and complex personal digital collections. The research in this paper provides a focused user study to investigate the usefulness of data visualisation as a mediator between privacy-aware management of data and maximisation of the research value of the data. The research is carried out with researchers and archivists with a vested interest in using, making sense of, and/or archiving the data to derive meaningful results. Participants tend to perceive email visualisations as useful, with an average rating of 4.281 (out of 7) across all the visualisations in the study, and above-average ratings for mountain graphs and word trees.
Title: Towards privacy-aware exploration of archived personal emails
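The two privacy measures can be sketched minimally as pseudonymisation (stable salted hashes that preserve link structure without exposing identities) and aggregation (month-level counts of the kind a mountain graph would plot). Both choices below, the salt handling and the month granularity, are illustrative assumptions rather than the study's exact pipeline.

```python
import hashlib
from collections import Counter

def pseudonymise(address, salt="local-secret"):  # salt value is illustrative
    """Map an email address to a stable pseudonym: the same sender always
    gets the same label, so correspondence networks stay intact."""
    digest = hashlib.sha256((salt + address.lower()).encode()).hexdigest()
    return f"person-{digest[:8]}"

def monthly_volume(messages):
    """Aggregate messages into per-month counts ("YYYY-MM") for a
    mountain-graph style volume view."""
    return Counter(m["date"][:7] for m in messages)
```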
Pub Date : 2024-01-27DOI: 10.1007/s00799-023-00391-0
Mat Kelly
Web archives capture, retain, and present historical versions of web pages. Viewing web archives often amounts to a user visiting the Wayback Machine homepage, typing in a URL, then choosing the date and time of a capture. Other web archives also capture the web and use Memento as an interoperable point of querying their captures. Memento aggregators are web-accessible software packages that allow clients to send requests for past web pages to a single endpoint, which then relays each request to a set of web archives. Though few deployed aggregator instances exhibit this aggregation trait, they all, for the most part, align to a model of serving a client's request for the URI of an original resource (URI-R) by first querying and then aggregating the responses from a collection of web archives. This single-tier querying need not be the logical flow of an aggregator, so long as a user can still utilize the aggregator from a single URL. In this paper, we discuss theoretical aggregation models of web archives. We first describe the status quo as the conventional behavior exhibited by an aggregator. We then build on prior work to describe a multi-tiered, structured querying model that an aggregator may exhibit. We highlight potential issues and high-level optimizations to ensure efficient aggregation while extending the state of the art of Memento aggregation. Part of our contribution is the extension of an open-source, user-deployable Memento aggregator to exhibit the capability described in this paper. We also extend a browser extension that typically consults an aggregator so that it can perform aggregation itself rather than needing to consult an external service. A purely client-side, browser-based Memento aggregator is novel to this work.
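The conventional single-tier model described above (one endpoint fans a URI-R request out to several archives and merges the responses) can be sketched as below. The endpoint URL templates and the canned responses are illustrative assumptions, not real archive APIs; a real aggregator would HTTP-GET each archive's TimeMap and parse the link-format response.

```python
from typing import Callable, List, Tuple

Memento = Tuple[str, str]  # (memento datetime, URI-M)

def aggregate_timemaps(uri_r: str,
                       endpoints: List[str],
                       fetch: Callable[[str], List[Memento]]) -> List[Memento]:
    """Single-tier aggregation: query each archive's TimeMap endpoint for
    URI-R, then merge the returned mementos chronologically, deduplicated."""
    mementos: List[Memento] = []
    for template in endpoints:
        try:
            mementos.extend(fetch(template.format(uri_r)))
        except Exception:
            continue  # one unreachable archive should not fail the whole request
    return sorted(set(mementos))

# Stand-in fetch with canned responses (hypothetical archives and URI-Ms).
CANNED = {
    "https://archive-a.example/timemap/http://example.com/": [
        ("2019-06-01T00:00:00Z", "https://archive-a.example/m/1")],
    "https://archive-b.example/timemap/http://example.com/": [
        ("2018-01-05T00:00:00Z", "https://archive-b.example/m/2")],
}
merged = aggregate_timemaps(
    "http://example.com/",
    ["https://archive-a.example/timemap/{}",
     "https://archive-b.example/timemap/{}"],
    lambda url: CANNED[url])
```

The multi-tiered model the paper proposes would differ only in what sits behind `fetch`: an endpoint may itself be another aggregator rather than a terminal archive, making the querying structure recursive.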
{"title":"Exploiting the untapped functional potential of Memento aggregators beyond aggregation","authors":"Mat Kelly","doi":"10.1007/s00799-023-00391-0","DOIUrl":"https://doi.org/10.1007/s00799-023-00391-0","url":null,"abstract":"<p>Web archives capture, retain, and present historical versions of web pages. Viewing web archives often amounts to a user visiting the Wayback Machine homepage, typing in a URL, then choosing the date and time of a capture. Other web archives also capture the web and use Memento as an interoperable point of querying their captures. Memento aggregators are web-accessible software packages that allow clients to send requests for past web pages to a single endpoint, which then relays each request to a set of web archives. Though few deployed aggregator instances exhibit this aggregation trait, they all, for the most part, align to a model of serving a client's request for the URI of an original resource (URI-R) by first querying and then aggregating the responses from a collection of web archives. This single-tier querying need not be the logical flow of an aggregator, so long as a user can still utilize the aggregator from a single URL. In this paper, we discuss theoretical aggregation models of web archives. We first describe the status quo as the conventional behavior exhibited by an aggregator. We then build on prior work to describe a multi-tiered, structured querying model that an aggregator may exhibit. We highlight potential issues and high-level optimizations to ensure efficient aggregation while extending the state of the art of Memento aggregation. Part of our contribution is the extension of an open-source, user-deployable Memento aggregator to exhibit the capability described in this paper. We also extend a browser extension that typically consults an aggregator so that it can perform aggregation itself rather than needing to consult an external service. A purely client-side, browser-based Memento aggregator is novel to this work.</p>","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139582920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}