International Journal on Digital Libraries: Latest Publications

Robots still outnumber humans in web archives in 2019, but less than in 2015 and 2012
IF 1.5 Q1 Social Sciences Pub Date : 2024-03-07 DOI: 10.1007/s00799-024-00397-2
Himarsha R. Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, Michele C. Weigle

The significance of the web and the crucial role of web archives in its preservation highlight the necessity of understanding how users, both human and robot, access web archive content, and how best to satisfy the disparate needs of both types of users. To identify robots and humans in web archives and analyze their respective access patterns, we used the Internet Archive’s (IA) Wayback Machine access logs from 2012, 2015, and 2019, as well as Arquivo.pt’s (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified those sessions as human or robot based on their browsing behavior. To better understand how users navigate through the web archives, we evaluated these sessions to discover user access patterns. Across the two archives and the three years of IA access logs (2012 vs. 2015 vs. 2019), we present a comparison of detected robots vs. humans, their access patterns, and their temporal preferences. The proportion of requests from robots detected in IA 2012 (91%) and IA 2015 (88%) is greater than in IA 2019 (70%). Robots account for 98% of requests in Arquivo.pt (2019). We found that the robots are almost entirely limited to “Dip” and “Skim” access patterns in IA 2012 and 2015, but exhibit all the patterns and their combinations in IA 2019. Both humans and robots show a preference for web pages archived in the near past.
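
As a rough illustration of the session analysis described above, the sketch below groups access-log requests into per-client sessions and labels each session with simple browsing-behaviour heuristics. The log fields, the 30-minute inactivity cut-off, and the heuristics are illustrative assumptions, not the authors' actual classifier.

```python
from collections import defaultdict
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cut-off between sessions

def sessionize(requests):
    """Group request dicts (ip, timestamp, path) into per-client sessions."""
    by_client = defaultdict(list)
    for r in sorted(requests, key=lambda r: r["timestamp"]):
        by_client[r["ip"]].append(r)
    sessions = []
    for reqs in by_client.values():
        current = [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur["timestamp"] - prev["timestamp"] > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions

def looks_like_robot(session):
    """Toy heuristics: robots.txt access, high request rate, no embedded resources."""
    paths = [r["path"] for r in session]
    if any(p.endswith("/robots.txt") for p in paths):
        return True
    span = (session[-1]["timestamp"] - session[0]["timestamp"]).total_seconds()
    rate = len(session) / max(span, 1.0)  # requests per second within the session
    fetches_embeds = any(p.endswith((".png", ".jpg", ".gif", ".css")) for p in paths)
    return rate > 1.0 or not fetches_embeds
```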

Citations: 0
Stance prediction with a relevance attribute to political issues in comparing the opinions of citizens and city councilors
IF 1.5 Q1 Social Sciences Pub Date : 2024-02-26 DOI: 10.1007/s00799-024-00396-3
Ko Senoo, Yohei Seki, Wakako Kashino, Atsushi Keyaki, Noriko Kando

This study focuses on a method for differentiating between the stances of citizens and city councilors on political issues (i.e., in favor or against) and attempts to compare the arguments of both sides. We created a dataset by annotating citizen tweets and city council minutes with labels for four attributes: stance, usefulness, regional dependence, and relevance. We then fine-tuned a pretrained large language model using this dataset to assign the attribute labels to a large quantity of unlabeled data automatically. We introduced multitask learning to train each attribute jointly with relevance, identifying clues by focusing on the sentences that were relevant to the political issues. Our prediction models are based on T5, a large language model suitable for multitask learning. We compared the results from our system with those that used BERT or RoBERTa. Our experimental results showed that with multitask learning, the macro-F1 scores for stance improved by 1.8% for citizen tweets and 1.7% for city council minutes. Using the fine-tuned model to analyze real opinion gaps, we found that although the vaccination regime was positively evaluated by city councilors in Fukuoka city, it was not rated very highly by citizens.
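
A minimal sketch of how such multitask fine-tuning can be framed with T5's text-to-text interface: each attribute becomes a task prefix, so stance and relevance are learned jointly by one model. The prefixes, label strings, and example data are illustrative assumptions, not the authors' exact configuration.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    {"task": "stance",    "text": "The vaccination rollout was far too slow.", "label": "against"},
    {"task": "relevance", "text": "The vaccination rollout was far too slow.", "label": "relevant"},
]

# One prefix per attribute lets a single model learn all attributes jointly.
inputs  = [f'{ex["task"]}: {ex["text"]}' for ex in examples]
targets = [ex["label"] for ex in examples]

batch = tokenizer(inputs, padding=True, return_tensors="pt")
labels = tokenizer(targets, padding=True, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss

loss = model(input_ids=batch.input_ids,
             attention_mask=batch.attention_mask,
             labels=labels).loss
loss.backward()  # in practice, wrap this in an optimizer and training loop
```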

Citations: 0
Towards privacy-aware exploration of archived personal emails
IF 1.5 Q1 Social Sciences Pub Date : 2024-02-21 DOI: 10.1007/s00799-024-00394-5
Zoe Bartliff, Yunhyong Kim, Frank Hopfgartner

This paper examines how privacy measures, such as anonymisation and aggregation processes for email collections, can affect the perceived usefulness of email visualisations for research, especially in the humanities and social sciences. The work is intended to inform archivists and data managers who are faced with the challenge of accessioning and reviewing increasingly sizeable and complex personal digital collections. The research in this paper provides a focused user study to investigate the usefulness of data visualisation as a mediator between privacy-aware data management and maximisation of the data's research value. The research is carried out with researchers and archivists with a vested interest in using, making sense of, and/or archiving the data to derive meaningful results. Participants tend to perceive email visualisations as useful, with an average rating of 4.281 (out of 7) for all the visualisations in the study, and above-average ratings for mountain graphs and word trees. The study shows that while participants voice a strong desire for information identifying individuals in email data, they perceive visualisations as almost equally useful for their research and/or work when aggregation is employed in addition to anonymisation.
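
For concreteness, here is a small sketch of the two privacy measures the study contrasts: pseudonymising senders (anonymisation) and rolling messages up into monthly counts (aggregation) of the kind that could feed a mountain graph. The field names and salting scheme are illustrative assumptions.

```python
import hashlib
from collections import Counter

SALT = "archive-specific-secret"  # assumed per-collection salt, kept out of the data

def pseudonymise(address: str) -> str:
    """Replace an email address with a stable, non-reversible pseudonym."""
    digest = hashlib.sha256((SALT + address.lower()).encode()).hexdigest()
    return f"person-{digest[:8]}"

def monthly_volume(messages):
    """Aggregate to (sender pseudonym, month) counts; each msg: {"from": str, "date": datetime}."""
    counts = Counter()
    for msg in messages:
        counts[(pseudonymise(msg["from"]), msg["date"].strftime("%Y-%m"))] += 1
    return counts
```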

Citations: 0
Exploiting the untapped functional potential of Memento aggregators beyond aggregation
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-27 DOI: 10.1007/s00799-023-00391-0
Mat Kelly

Web archives capture, retain, and present historical versions of web pages. Viewing web archives often amounts to a user visiting the Wayback Machine homepage, typing in a URL, then choosing the date and time of a relevant capture. Other web archives also capture the web and use Memento as an interoperable point for querying their captures. Memento aggregators are web-accessible software packages that allow clients to send requests for past web pages to a single endpoint, which then relays each request to a set of web archives. Though few deployed aggregator instances exhibit this aggregation trait, they all, for the most part, align to a model of serving a client's request for the URI of an original resource (URI-R) by first querying and then aggregating the responses from a collection of web archives. This single-tier querying need not be the logical flow of an aggregator, so long as a user can still utilize the aggregator from a single URL. In this paper, we discuss theoretical aggregation models of web archives. We first describe the status quo as the conventional behavior exhibited by an aggregator. We then build on prior work to describe a multi-tiered, structured querying model that may be exhibited by an aggregator. We highlight some potential issues and high-level optimizations to ensure efficient aggregation while also extending the state of the art of Memento aggregation. Part of our contribution is the extension of an open-source, user-deployable Memento aggregator to exhibit the capability described in this paper. We also extend a browser extension that typically consults an aggregator to be able to aggregate itself rather than needing to consult an external service. A purely client-side, browser-based Memento aggregator is novel to this work.
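
The "status quo" single-tier model the paper describes can be sketched as below: for a requested URI-R, query each archive's TimeMap, then merge the responses into one time-ordered list of mementos. The endpoint URL patterns are assumptions for illustration; a real aggregator would also handle redirects, pagination, and archive-specific quirks.

```python
import re
import requests
from email.utils import parsedate_to_datetime

TIMEMAP_ENDPOINTS = [                                # assumed URL patterns
    "https://web.archive.org/web/timemap/link/{}",
    "https://arquivo.pt/wayback/timemap/link/{}",
]

# Matches link-format TimeMap entries; assumes rel appears before datetime.
MEMENTO_RE = re.compile(r'<([^>]+)>;\s*rel="memento";\s*datetime="([^"]+)"')

def aggregate(uri_r: str):
    """Conventional single-tier aggregation: fan out, parse, merge, sort."""
    mementos = []
    for endpoint in TIMEMAP_ENDPOINTS:
        try:
            body = requests.get(endpoint.format(uri_r), timeout=10).text
        except requests.RequestException:
            continue  # one slow or failing archive must not break the merge
        mementos += [{"uri_m": uri, "datetime": parsedate_to_datetime(dt)}
                     for uri, dt in MEMENTO_RE.findall(body)]
    return sorted(mementos, key=lambda m: m["datetime"])
```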

Citations: 0
Image searching in an open photograph archive: search tactics and faced barriers in historical research
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-24 DOI: 10.1007/s00799-023-00390-1
E. Late, Hille Ruotsalainen, Sanna Kumpulainen
{"title":"Image searching in an open photograph archive: search tactics and faced barriers in historical research","authors":"E. Late, Hille Ruotsalainen, Sanna Kumpulainen","doi":"10.1007/s00799-023-00390-1","DOIUrl":"https://doi.org/10.1007/s00799-023-00390-1","url":null,"abstract":"","PeriodicalId":44974,"journal":{"name":"International Journal on Digital Libraries","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139601840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A BERT-based sequential deep neural architecture to identify contribution statements and extract phrases for triplets from scientific publications
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-23 DOI: 10.1007/s00799-023-00393-y
Komal Gupta, Ammaar Ahmad, Tirthankar Ghosal, Asif Ekbal

Research in Natural Language Processing (NLP) is increasing rapidly; as a result, a large number of research papers are being published. It is challenging to find the contributions of a research paper in any specific domain from the huge amount of unstructured data. There is a need for structuring the relevant contributions in a Knowledge Graph (KG). In this paper, we describe our work to accomplish four tasks toward building the Scientific Knowledge Graph (SKG). We propose a pipelined system that performs contribution sentence identification, phrase extraction from contribution sentences, and Information Units (IUs) classification, and organizes phrases into triplets (subject, predicate, object) from NLP scholarly publications. We develop a multitasking system (ContriSci) for contribution sentence identification with two supporting tasks, viz. Section Identification and Citance Classification. We use the Bidirectional Encoder Representations from Transformers (BERT)–Conditional Random Field (CRF) model for the phrase extraction and train with two additional datasets: SciERC and SciClaim. To classify the contribution sentences into IUs, we use a BERT-based model. For the triplet extraction, we categorize the triplets into five categories and classify them with a BERT-based classifier. Our proposed approach yields F1 scores of 64.21%, 77.47%, 84.52%, and 62.71% for contribution sentence identification, phrase extraction, IUs classification, and triplet extraction, respectively, in the non-end-to-end setting. The relative improvement for contribution sentence identification, IUs classification, and triplet extraction is 8.08, 2.46, and 2.31 in terms of F1 score on the NLPContributionGraph (NCG) dataset. Our system achieves the best performance (57.54% F1 score) in the end-to-end pipeline with all four sub-tasks combined. We make our code available at: https://github.com/92Komal/pipeline_triplet_extraction.
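
A schematic of the four-stage pipeline, with an off-the-shelf BERT head standing in for the trained first stage; the checkpoint, label convention, and stage interfaces are illustrative assumptions rather than the released ContriSci components.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
sent_clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # contribution vs. non-contribution

def is_contribution(sentence: str) -> bool:
    """Stage 1 stand-in; the head is untrained here and needs fine-tuning in practice."""
    logits = sent_clf(**tok(sentence, return_tensors="pt")).logits
    return bool(logits.argmax(-1).item())

def run_pipeline(sentences, phrase_extractor, iu_classifier, triplet_builder):
    """Chain the four stages; each later stage sees only earlier-stage output."""
    contrib = [s for s in sentences if is_contribution(s)]
    phrases = [phrase_extractor(s) for s in contrib]  # a BERT-CRF tagger in the paper
    ius = [iu_classifier(s) for s in contrib]         # a BERT classifier in the paper
    return [triplet_builder(p, iu) for p, iu in zip(phrases, ius)]
```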

Citations: 0
Sequential sentence classification in research papers using cross-domain multi-task learning
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-22 DOI: 10.1007/s00799-023-00392-z
Arthur Brack, Elias Entrup, Markos Stamatakis, Pascal Buschermöhle, Anett Hoppe, Ralph Ewerth

The automatic semantic structuring of scientific text allows for more efficient reading of research articles and is an important indexing step for academic search engines. Sequential sentence classification is an essential structuring task and targets the categorisation of sentences based on their content and context. However, the potential of transfer learning for sentence classification across different scientific domains and text types, such as full papers and abstracts, has not yet been explored in prior work. In this paper, we present a systematic analysis of transfer learning for scientific sequential sentence classification. For this purpose, we derive seven research questions and present several contributions to address them: (1) We suggest a novel uniform deep learning architecture and multi-task learning for cross-domain sequential sentence classification in scientific text. (2) We tailor two transfer learning methods to deal with the given task, namely sequential transfer learning and multi-task learning. (3) We compare the results of the two best models using qualitative examples in a case study. (4) We provide an approach for the semi-automatic identification of semantically related classes across annotation schemes and analyse the results for four annotation schemes. The clusters and underlying semantic vectors are validated using k-means clustering. (5) Our comprehensive experimental results indicate that when using the proposed multi-task learning architecture, models trained on datasets from different scientific domains benefit from one another. Our approach significantly outperforms state of the art on full paper datasets while being on par for datasets consisting of abstracts.
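
One way to realize the shared-encoder multi-task design described above is a single transformer encoder with one classification head per dataset or annotation scheme, as in the sketch below; the hidden size, task names, and label counts are assumptions, and using the [CLS] vector per sentence is a simplification of a full sequential model.

```python
import torch.nn as nn

class MultiTaskSentenceClassifier(nn.Module):
    """Shared encoder, one linear head per dataset/annotation scheme."""
    def __init__(self, encoder, hidden_size, task_labels):
        super().__init__()
        self.encoder = encoder  # e.g. a BERT-style model
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, n) for task, n in task_labels.items()})

    def forward(self, input_ids, attention_mask, task):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the sentence vector
        return self.heads[task](cls)       # route through this dataset's head

# Training alternates batches across corpora so domains inform one another:
# model = MultiTaskSentenceClassifier(bert, 768, {"pubmed": 5, "csabstruct": 5})
# logits = model(batch["input_ids"], batch["attention_mask"], task="pubmed")
```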

Citations: 0
Academics’ experience of online reading lists and the use of reading list notes
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-12 DOI: 10.1007/s00799-023-00387-w
P. P. N. V. Kumara, Annika Hinze, Nicholas Vanderschantz, Claire Timpany

Reading list systems are widely used in tertiary education as a pedagogical tool and for tracking copyrighted material. This paper explores academics' experiences with reading lists and in particular the use of the reading list notes feature. A mixed-methods approach was employed in which we first conducted interviews with academics about their experience with reading lists. We identified the need for streamlining the workflow of the reading lists set-up, improved usability of the interfaces, and better synchronization with other teaching support systems. Next, we performed a log analysis of the use of the notes feature throughout one academic year. Our log analysis showed that the notes feature is under-utilized by academics. We recommend improving the systems’ usability by re-engineering the user workflows and better integrating the notes feature into academic teaching.

Citations: 0
SciND: a new triplet-based dataset for scientific novelty detection via knowledge graphs
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-08 DOI: 10.1007/s00799-023-00386-x
Komal Gupta, Ammaar Ahmad, Tirthankar Ghosal, Asif Ekbal

Detecting texts that contain semantic-level new information is not straightforward. The problem becomes more challenging for research articles. Over the years, many datasets and techniques have been developed to attempt automatic novelty detection. However, the majority of existing textual novelty detection investigations target general domains like newswire. A comprehensive dataset for scientific novelty detection is not available in the literature. In this paper, we present a new triplet-based corpus (SciND) for scientific novelty detection from research articles via knowledge graphs. The proposed dataset consists of three types of triplets: (i) triplets for the knowledge graph, (ii) novel triplets, and (iii) non-novel triplets. We build a scientific knowledge graph for research articles using triplets across several natural language processing (NLP) domains and extract novel triplets from papers published in the year 2021. For the non-novel articles, we use blog post summaries of the research articles. Our knowledge graph is domain-specific. We build the knowledge graph for seven NLP domains. We further use a feature-based novelty detection scheme from the research articles as a baseline. Moreover, we show the applicability of our proposed dataset using our baseline novelty detection algorithm. Our algorithm yields a baseline F1 score of 72%. We present an analysis and discuss the future scope of our proposed dataset. To the best of our knowledge, this is the very first dataset for scientific novelty detection via a knowledge graph. We make our code and dataset publicly available at https://github.com/92Komal/Scientific_Novelty_Detection.
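
At the triplet level, the novelty decision can be framed as simply as membership against the domain knowledge graph, as in this toy baseline; the paper's actual baseline is feature-based, so treat this purely as an illustration of the task framing, with made-up example triplets.

```python
def novel_triplets(candidate_triplets, knowledge_graph):
    """A (subject, predicate, object) triplet is novel if the KG lacks it."""
    kg = set(knowledge_graph)
    return [t for t in candidate_triplets if t not in kg]

kg = {("BERT", "used-for", "classification"), ("CRF", "used-for", "tagging")}
candidates = [("BERT", "used-for", "classification"),
              ("prompting", "used-for", "novelty-detection")]
print(novel_triplets(candidates, kg))  # -> [("prompting", "used-for", "novelty-detection")]
```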

Citations: 0
Human-in-the-loop latent space learning for biblio-record-based literature management
IF 1.5 Q1 Social Sciences Pub Date : 2024-01-04 DOI: 10.1007/s00799-023-00389-8

Every researcher must conduct a literature review, and the document management needs of researchers working on various research topics vary. However, there are two major challenges. First, traditional methods such as tree hierarchies of document folders and tag-based management are no longer effective given the enormous volume of publications. Second, although the bibliographic information of papers is available to everyone, many of the papers themselves can only be accessed through paid services. This study attempts to develop an interactive tool for personal literature management based solely on bibliographic records. To make such a tool possible, we developed a principled “human-in-the-loop latent space learning” method that estimates the management criteria of each researcher based on his or her feedback to calculate the positions of documents in a two-dimensional space on the screen. As a set of bibliographic records forms a graph, our model is naturally designed as a graph-based encoder–decoder model that connects the graph and the space. In addition, we devised an active learning framework for it using uncertainty sampling. The challenge here is to define the uncertainty in this problem setting. Experiments with ten researchers from the humanities, science, and engineering domains show that the proposed framework provides superior results to a typical graph convolutional encoder–decoder model. In addition, we found that our active learning framework was effective in selecting good samples.
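
The active-learning component can be sketched as classic uncertainty sampling: ask the researcher to place the document whose predicted position the model is least certain about. The entropy criterion and the data below are illustrative assumptions, not the authors' exact uncertainty definition (which the abstract notes is itself the challenge).

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy per row; higher means the model is less certain."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -(probs * np.log(probs)).sum(axis=-1)

def next_query(unlabeled_ids, prob_matrix):
    """Pick the unlabeled document with maximum predictive entropy."""
    h = predictive_entropy(prob_matrix)  # one row of class probabilities per document
    return unlabeled_ids[int(np.argmax(h))]

ids = ["doc-3", "doc-7", "doc-9"]
probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]])
print(next_query(ids, probs))  # -> "doc-7", the most uncertain placement
```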

Citations: 0