Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections

Journal of Documentation | Pub Date: 2023-02-27 | DOI: 10.1108/jd-01-2022-0029 | IF 1.7 | JCR Q2 (Information Science & Library Science) | Zone 3 (Management)
Dilawar Ali, Kenzo Milleville, S. Verstockt, N. van de Weghe, Sally Chambers, Julie M. Birkholz
Citation count: 0

Abstract

Purpose: Historical newspaper collections provide a wealth of information about the past. Although the digitization of these collections significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, are not yet searchable at article-level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential for further enriching the metadata to improve the searchability of these historical newspaper collections. This paper aims to discuss the aforementioned issue.

Design/methodology/approach: In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow, using computer vision and machine learning approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons – literary supplements from Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.

Findings: The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.

Originality/value: The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).
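
As a rough illustration of step (3) of the workflow, image similarity analysis, the sketch below groups images whose feature vectors point in nearly the same direction. It is not the authors' method: the abstract does not say which features or clustering algorithm were used. The feature vectors, the image names, the 0.9 threshold and the greedy grouping rule are all assumptions chosen to keep the example small.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(features, threshold=0.9):
    """Greedy grouping: an image joins the first group whose first member
    (the group's representative) is at least `threshold` similar to it;
    otherwise it starts a new group."""
    groups = []
    for name, vec in features.items():
        for group in groups:
            representative = features[group[0]]
            if cosine_similarity(vec, representative) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough.
            groups.append([name])
    return groups


# Hypothetical, hand-made feature vectors; a real pipeline would compute
# them from the image crops found by document layout analysis.
features = {
    "page1_img1": [1.0, 0.0, 0.1],
    "page2_img3": [0.9, 0.1, 0.1],
    "page5_img2": [0.0, 1.0, 0.0],
}
groups = group_similar(features)
print(groups)
```

With these toy vectors the first two images land in one group and the third in its own. A greedy pass like this depends on input order; an unsupervised clustering method would be the more robust choice on a real collection.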