Shacl4Bib: custom validation of library data
Péter Király. arXiv:2405.09177 (2024-05-15).

The Shapes Constraint Language (SHACL) is a formal language for validating RDF graphs against a set of conditions. Following this idea and implementing a subset of the language, the Metadata Quality Assessment Framework provides Shacl4Bib: a mechanism to define SHACL-like rules for data sources in non-RDF formats such as XML, CSV and JSON. QA catalogue extends this concept further to MARC21, UNIMARC and PICA data. The criteria can be defined either with YAML or JSON configuration files or with Java code. Libraries can validate their data against criteria expressed in a unified language, which improves the clarity and reusability of custom validation processes.
{"title":"Shacl4Bib: custom validation of library data","authors":"Péter Király","doi":"arxiv-2405.09177","DOIUrl":"https://doi.org/arxiv-2405.09177","url":null,"abstract":"The Shapes Constraint Language (SHACL) is a formal language for validating\u0000RDF graphs against a set of conditions. Following this idea and implementing a\u0000subset of the language, the Metadata Quality Assessment Framework provides\u0000Shacl4Bib: a mechanism to define SHACL-like rules for data sources in non-RDF\u0000based formats, such as XML, CSV and JSON. QA catalogue extends this concept\u0000further to MARC21, UNIMARC and PICA data. The criteria can be defined either\u0000with YAML or JSON configuration files or with Java code. Libraries can validate\u0000their data against criteria expressed in a unified language, that improves the\u0000clarity and the reusability of custom validation processes.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141060436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Distinguishing articles in questionable and non-questionable journals using quantitative indicators associated with quality
Dimity Stephen. arXiv:2405.06308 (2024-05-10).

This study investigates the viability of distinguishing articles in questionable journals (QJs) from those in non-QJs on the basis of quantitative indicators typically associated with quality. Subsequently, I examine what can be deduced about the quality of articles in QJs based on the differences observed. I contrast the length of abstracts and full texts, the prevalence of spelling errors, text readability, the number of references and citations, the size and internationality of the author team, the documentation of ethics and informed consent statements, and the presence of erroneous decisions based on statistical errors in 1,714 articles from 31 QJs, 1,691 articles from 16 journals indexed in Web of Science (WoS), and 1,900 articles from 45 mid-tier journals, all in the field of psychology. The results suggest that QJ articles do diverge from the disciplinary standards set by peer-reviewed journals in psychology on quantitative indicators of quality that tend to reflect the effect of peer review and editorial processes. However, mid-tier and WoS journals are also affected by potential quality concerns, such as under-reporting of ethics and informed consent processes and the presence of errors in interpreting statistics. Further research is required to develop a comprehensive understanding of the quality of articles in QJs.
{"title":"Distinguishing articles in questionable and non-questionable journals using quantitative indicators associated with quality","authors":"Dimity Stephen","doi":"arxiv-2405.06308","DOIUrl":"https://doi.org/arxiv-2405.06308","url":null,"abstract":"This study investigates the viability of distinguishing articles in\u0000questionable journals (QJs) from those in non-QJs on the basis of quantitative\u0000indicators typically associated with quality. Subsequently, I examine what can\u0000be deduced about the quality of articles in QJs based on the differences\u0000observed. I contrast the length of abstracts and full-texts, prevalence of\u0000spelling errors, text readability, number of references and citations, the size\u0000and internationality of the author team, the documentation of ethics and\u0000informed consent statements, and the presence erroneous decisions based on\u0000statistical errors in 1,714 articles from 31 QJs, 1,691 articles from 16\u0000journals indexed in Web of Science (WoS), and 1,900 articles from 45 mid-tier\u0000journals, all in the field of psychology. The results suggest that QJ articles\u0000do diverge from the disciplinary standards set by peer-reviewed journals in\u0000psychology on quantitative indicators of quality that tend to reflect the\u0000effect of peer review and editorial processes. However, mid-tier and WoS\u0000journals are also affected by potential quality concerns, such as\u0000under-reporting of ethics and informed consent processes and the presence of\u0000errors in interpreting statistics. Further research is required to develop a\u0000comprehensive understanding of the quality of articles in QJs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"131 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Can citations tell us about a paper's reproducibility? A case study of machine learning papers
Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu. arXiv:2405.03977 (2024-05-07).

The iterative character of work in machine learning (ML) and artificial intelligence (AI), and its reliance on comparisons against benchmark datasets, emphasize the importance of reproducibility in that literature. Yet resource constraints and inadequate documentation can make running replications particularly challenging. Our work explores the potential of using downstream citation contexts as a signal of reproducibility. We introduce a sentiment analysis framework applied to citation contexts from papers involved in Machine Learning Reproducibility Challenges in order to interpret the positive or negative outcomes of reproduction attempts. Our contributions include training classifiers for reproducibility-related contexts and sentiment analysis, and exploring correlations between citation context sentiment and reproducibility scores. Study data, software, and an artifact appendix are publicly available at https://github.com/lamps-lab/ccair-ai-reproducibility.
{"title":"Can citations tell us about a paper's reproducibility? A case study of machine learning papers","authors":"Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu","doi":"arxiv-2405.03977","DOIUrl":"https://doi.org/arxiv-2405.03977","url":null,"abstract":"The iterative character of work in machine learning (ML) and artificial\u0000intelligence (AI) and reliance on comparisons against benchmark datasets\u0000emphasize the importance of reproducibility in that literature. Yet, resource\u0000constraints and inadequate documentation can make running replications\u0000particularly challenging. Our work explores the potential of using downstream\u0000citation contexts as a signal of reproducibility. We introduce a sentiment\u0000analysis framework applied to citation contexts from papers involved in Machine\u0000Learning Reproducibility Challenges in order to interpret the positive or\u0000negative outcomes of reproduction attempts. Our contributions include training\u0000classifiers for reproducibility-related contexts and sentiment analysis, and\u0000exploring correlations between citation context sentiment and reproducibility\u0000scores. Study data, software, and an artifact appendix are publicly available\u0000at https://github.com/lamps-lab/ccair-ai-reproducibility .","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

NACSOS-nexus: NLP Assisted Classification, Synthesis and Online Screening with New and EXtended Usage Scenarios
Tim Repke, Max Callaghan. arXiv:2405.04621 (2024-05-07).

NACSOS is a web-based platform for curating data used in systematic maps. It contains several (experimental) features that aid the evidence synthesis process, from finding and ingesting primary data (mainly scientific publications) and basic search and exploration thereof, to its main focus: managing manual and automated annotations. The platform supports prioritised screening algorithms and is the first to fully implement statistical stopping criteria. Annotations by multiple coders can be resolved, and customisable quality metrics are computed on the fly. In its current state, annotations are performed at the document level. The ecosystem around NACSOS offers packages for accessing the underlying database and practical utility functions that have proven useful in a multitude of projects. Further, it provides the backbone of living maps, review ecosystems, and our public literature hub for sharing high-quality curated corpora.
{"title":"NACSOS-nexus: NLP Assisted Classification, Synthesis and Online Screening with New and EXtended Usage Scenarios","authors":"Tim Repke, Max Callaghan","doi":"arxiv-2405.04621","DOIUrl":"https://doi.org/arxiv-2405.04621","url":null,"abstract":"NACSOS is a web-based platform for curating data used in systematic maps. It\u0000contains several (experimental) features that aid the evidence synthesis\u0000process from finding and ingesting primary data (mainly scientific\u0000publications), basic search and exploration thereof, but mainly the handling of\u0000managing the manual and automated annotations. The platform supports\u0000prioritised screening algorithms and is the first to fully implement\u0000statistical stopping criteria. Annotations by multiple coders can be resolved\u0000and customisable quality metrics are computed on-the-fly. In its current state,\u0000the annotations are performed on document level. The ecosystem around NACSOS\u0000offers packages for accessing the underlying database and practical utility\u0000functions that have proven useful in a multitude of projects. Further, it\u0000provides the backbone of living maps, review ecosystems, and our public\u0000literature hub for sharing high-quality curated corpora.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Research information in the light of artificial intelligence: quality and data ecologies
Otmane Azeroual, Tibor Koltay. arXiv:2405.12997 (2024-05-06).

This paper presents multi- and interdisciplinary approaches for finding appropriate AI technologies for research information. Professional research information management (RIM) is becoming increasingly important as an expressly data-driven tool for researchers. It is not only the basis of scientific knowledge processes, but is also related to other data. A concept and a process model of the elementary phases, from the start of the project to the ongoing operation of the AI methods in the RIM, are presented, portraying the implementation of an AI project meant to enable universities and research institutions to support their researchers in dealing with incorrect and incomplete research information as it is stored in their RIM systems. Our aim is to show how research information harmonizes with the challenges of data literacy and data quality related to AI, and to underline that any such project can be successful if the research institutions and the various university departments involved work together and appropriate support is offered to improve research information and data management.
{"title":"Research information in the light of artificial intelligence: quality and data ecologies","authors":"Otmane Azeroual, Tibor Koltay","doi":"arxiv-2405.12997","DOIUrl":"https://doi.org/arxiv-2405.12997","url":null,"abstract":"This paper presents multi- and interdisciplinary approaches for finding the\u0000appropriate AI technologies for research information. Professional research\u0000information management (RIM) is becoming increasingly important as an expressly\u0000data-driven tool for researchers. It is not only the basis of scientific\u0000knowledge processes, but also related to other data. A concept and a process\u0000model of the elementary phases from the start of the project to the ongoing\u0000operation of the AI methods in the RIM is presented, portraying the\u0000implementation of an AI project, meant to enable universities and research\u0000institutions to support their researchers in dealing with incorrect and\u0000incomplete research information, while it is being stored in their RIMs. Our\u0000aim is to show how research information harmonizes with the challenges of data\u0000literacy and data quality issues, related to AI, also wanting to underline that\u0000any project can be successful if the research institutions and various\u0000departments of universities, involved work together and appropriate support is\u0000offered to improve research information and data management.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

On the performativity of SDG classifications in large bibliometric databases
Matteo Ottaviani, Stephan Stahlschmidt. arXiv:2405.03007 (2024-05-05).

Large bibliometric databases, such as Web of Science, Scopus, and OpenAlex, facilitate bibliometric analyses, but are performative: they affect the visibility of scientific outputs and the impact measurement of participating entities. Recently, these databases have taken up the UN's Sustainable Development Goals (SDGs) in their respective classifications, which have been criticised for their diverging nature. This work proposes using large language models (LLMs) to learn about the "data bias" injected into bibliometric data by these diverse SDG classifications, exploring five SDGs. We build an LLM that is fine-tuned in parallel on the diverging SDG classifications inscribed into the databases. Our results show high sensitivity to model architecture, the classified publications, the fine-tuning process, and natural language generation. The wide arbitrariness at different levels raises concerns about using LLMs in research practice.
{"title":"On the performativity of SDG classifications in large bibliometric databases","authors":"Matteo Ottaviani, Stephan Stahlschmidt","doi":"arxiv-2405.03007","DOIUrl":"https://doi.org/arxiv-2405.03007","url":null,"abstract":"Large bibliometric databases, such as Web of Science, Scopus, and OpenAlex,\u0000facilitate bibliometric analyses, but are performative, affecting the\u0000visibility of scientific outputs and the impact measurement of participating\u0000entities. Recently, these databases have taken up the UN's Sustainable\u0000Development Goals (SDGs) in their respective classifications, which have been\u0000criticised for their diverging nature. This work proposes using the feature of\u0000large language models (LLMs) to learn about the \"data bias\" injected by diverse\u0000SDG classifications into bibliometric data by exploring five SDGs. We build a\u0000LLM that is fine-tuned in parallel by the diverse SDG classifications inscribed\u0000into the databases' SDG classifications. Our results show high sensitivity in\u0000model architecture, classified publications, fine-tuning process, and natural\u0000language generation. The wide arbitrariness at different levels raises concerns\u0000about using LLM in research practice.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Assembling ensembling: An adventure in approaches across disciplines
Amanda Bleichrodt, Lydia Bourouiba, Gerardo Chowell, Eric T. Lofgren, J. Michael Reed, Sadie J. Ryan, Nina H. Fefferman. arXiv:2405.02599 (2024-05-04).

When we think of model ensembling or ensemble modeling, many possibilities come to mind in different disciplines. For example, one might think of a set of descriptions of a phenomenon in the world, perhaps a time series or a snapshot of multivariate space; that set may comprise data-independent descriptions, or be quite intentionally fit *to* data, or even be a suite of data sets with a common theme or intention. The very meaning of 'ensemble' (a collection together) conjures different ideas across, and even within, disciplines approaching phenomena. In this paper, we present a typology of the scope of these potential perspectives. It is not our goal to present a review of terms and concepts, nor to convince all disciplines to adopt a common suite of terms, which we view as futile. Rather, our goal is to disambiguate terms, concepts, and processes associated with 'ensembles' and 'ensembling' in order to facilitate communication, awareness, and possible adoption of tools across disciplines.
{"title":"Assembling ensembling: An adventure in approaches across disciplines","authors":"Amanda Bleichrodt, Lydia Bourouiba, Gerardo Chowell, Eric T. Lofgren, J. Michael Reed, Sadie J. Ryan, Nina H. Fefferman","doi":"arxiv-2405.02599","DOIUrl":"https://doi.org/arxiv-2405.02599","url":null,"abstract":"When we think of model ensembling or ensemble modeling, there are many\u0000possibilities that come to mind in different disciplines. For example, one\u0000might think of a set of descriptions of a phenomenon in the world, perhaps a\u0000time series or a snapshot of multivariate space, and perhaps that set is\u0000comprised of data-independent descriptions, or perhaps it is quite\u0000intentionally fit *to* data, or even a suite of data sets with a common theme\u0000or intention. The very meaning of 'ensemble' - a collection together - conjures\u0000different ideas across and even within disciplines approaching phenomena. In\u0000this paper, we present a typology of the scope of these potential perspectives.\u0000It is not our goal to present a review of terms and concepts, nor is it to\u0000convince all disciplines to adopt a common suite of terms, which we view as\u0000futile. Rather, our goal is to disambiguate terms, concepts, and processes\u0000associated with 'ensembles' and 'ensembling' in order to facilitate\u0000communication, awareness, and possible adoption of tools across disciplines.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A Workflow for GLAM Metadata Crosswalk
Arianna Moretti, Ivan Heibi, Silvio Peroni. arXiv:2405.02113 (2024-05-03).

The acquisition of physical artifacts not only involves transferring existing information into the digital ecosystem but also generates information as a process in itself, underscoring the importance of meticulous management of FAIR data and metadata. In addition, the diversity of objects within the cultural heritage domain is reflected in a multitude of descriptive models. The digitization process expands the opportunities for exchange and joint utilization, provided that the descriptive schemas are made interoperable in advance. To achieve this goal, we propose a replicable workflow for metadata schema crosswalks that facilitates the preservation and accessibility of cultural heritage in the digital ecosystem. This work presents a methodology for metadata generation and management in the case study of the digital twin of the temporary exhibition "The Other Renaissance - Ulisse Aldrovandi and the Wonders of the World". The workflow delineates a systematic, step-by-step transformation of tabular data into RDF, to enhance Linked Open Data. The methodology adopts the RDF Mapping Language (RML) for converting data to RDF with human involvement. This last aspect entails an interaction between digital humanists and domain experts, through surveys that lead to the abstraction and reformulation of domain-specific knowledge to be exploited in the process of formalizing and converting information.
{"title":"A Workflow for GLAM Metadata Crosswalk","authors":"Arianna Moretti, Ivan Heibi, Silvio Peroni","doi":"arxiv-2405.02113","DOIUrl":"https://doi.org/arxiv-2405.02113","url":null,"abstract":"The acquisition of physical artifacts not only involves transferring existing\u0000information into the digital ecosystem but also generates information as a\u0000process itself, underscoring the importance of meticulous management of FAIR\u0000data and metadata. In addition, the diversity of objects within the cultural\u0000heritage domain is reflected in a multitude of descriptive models. The\u0000digitization process expands the opportunities for exchange and joint\u0000utilization, granted that the descriptive schemas are made interoperable in\u0000advance. To achieve this goal, we propose a replicable workflow for metadata\u0000schema crosswalks that facilitates the preservation and accessibility of\u0000cultural heritage in the digital ecosystem. This work presents a methodology\u0000for metadata generation and management in the case study of the digital twin of\u0000the temporary exhibition \"The Other Renaissance - Ulisse Aldrovandi and the\u0000Wonders of the World\". The workflow delineates a systematic, step-by-step\u0000transformation of tabular data into RDF format, to enhance Linked Open Data.\u0000The methodology adopts the RDF Mapping Language (RML) technology for converting\u0000data to RDF with a human contribution involvement. This last aspect entails an\u0000interaction between digital humanists and domain experts through surveys\u0000leading to the abstraction and reformulation of domain-specific knowledge, to\u0000be exploited in the process of formalizing and converting information.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140883899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Callico: a Versatile Open-Source Document Image Annotation Platform
Christopher Kermorvant, Eva Bardou, Manon Blanco, Bastien Abadie. arXiv:2405.01071 (2024-05-02).

This paper presents Callico, a web-based open-source platform designed to simplify the annotation process in document recognition projects. The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data and the need for specialised tools that increase the efficiency and effectiveness of generating such data. For document image annotation, Callico offers dual-display annotation for digitised documents, enabling simultaneous visualisation and annotation of scanned images and text. This capability is critical for OCR and HTR model training, document layout analysis, named entity recognition, form-based key-value annotation, and hierarchical structure annotation with element grouping. The platform supports collaborative annotation with versatile features, backed by a commitment to open-source development, high-quality code standards, and easy deployment via Docker. Illustrative use cases, including the transcription of the Belfort municipal registers, the indexing of French World War II prisoners for the ICRC, and the extraction of personal information from the Socface project's census lists, demonstrate Callico's applicability and utility.
{"title":"Callico: a Versatile Open-Source Document Image Annotation Platform","authors":"Christopher Kermorvant, Eva Bardou, Manon Blanco, Bastien Abadie","doi":"arxiv-2405.01071","DOIUrl":"https://doi.org/arxiv-2405.01071","url":null,"abstract":"This paper presents Callico, a web-based open source platform designed to\u0000simplify the annotation process in document recognition projects. The move\u0000towards data-centric AI in machine learning and deep learning underscores the\u0000importance of high-quality data, and the need for specialised tools that\u0000increase the efficiency and effectiveness of generating such data. For document\u0000image annotation, Callico offers dual-display annotation for digitised\u0000documents, enabling simultaneous visualisation and annotation of scanned images\u0000and text. This capability is critical for OCR and HTR model training, document\u0000layout analysis, named entity recognition, form-based key value annotation or\u0000hierarchical structure annotation with element grouping. The platform supports\u0000collaborative annotation with versatile features backed by a commitment to open\u0000source development, high-quality code standards and easy deployment via Docker.\u0000Illustrative use cases - including the transcription of the Belfort municipal\u0000registers, the indexing of French World War II prisoners for the ICRC, and the\u0000extraction of personal information from the Socface project's census lists -\u0000demonstrate Callico's applicability and utility.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Clustering Running Titles to Understand the Printing of Early Modern Books
Nikolai Vogler, Kartik Goyal, Samuel V. Lemley, D. J. Schuldt, Christopher N. Warren, Max G'Sell, Taylor Berg-Kirkpatrick. arXiv:2405.00752 (2024-05-01).

We propose a novel computational approach to automatically analyze the physical process behind the printing of early modern letterpress books by clustering the running titles found at the top of their pages. Specifically, we design and compare custom neural and feature-based kernels for computing the pairwise visual similarity of a scanned document's running titles, and we cluster the titles in order to track any deviations from the expected pattern of a book's printing. Unlike body text, which must be reset for every page, the running titles are one of the static type elements in a skeleton forme, i.e. the frame used to print each side of a sheet of paper, and were often re-used during a book's printing. To evaluate the effectiveness of our approach, we manually annotate the running title clusters on about 1,600 pages across 8 early modern books of varying sizes and formats. Our method can detect potential deviations from the expected patterns of such skeleton formes, which helps bibliographers understand the phenomena associated with a text's transmission, such as censorship. We also validate our results against a manual bibliographic analysis of a counterfeit early edition of Thomas Hobbes' Leviathan (1651).
{"title":"Clustering Running Titles to Understand the Printing of Early Modern Books","authors":"Nikolai Vogler, Kartik Goyal, Samuel V. Lemley, D. J. Schuldt, Christopher N. Warren, Max G'Sell, Taylor Berg-Kirkpatrick","doi":"arxiv-2405.00752","DOIUrl":"https://doi.org/arxiv-2405.00752","url":null,"abstract":"We propose a novel computational approach to automatically analyze the\u0000physical process behind printing of early modern letterpress books via\u0000clustering the running titles found at the top of their pages. Specifically, we\u0000design and compare custom neural and feature-based kernels for computing\u0000pairwise visual similarity of a scanned document's running titles and cluster\u0000the titles in order to track any deviations from the expected pattern of a\u0000book's printing. Unlike body text which must be reset for every page, the\u0000running titles are one of the static type elements in a skeleton forme i.e. the\u0000frame used to print each side of a sheet of paper, and were often re-used\u0000during a book's printing. To evaluate the effectiveness of our approach, we\u0000manually annotate the running title clusters on about 1600 pages across 8 early\u0000modern books of varying size and formats. Our method can detect potential\u0000deviation from the expected patterns of such skeleton formes, which helps\u0000bibliographers understand the phenomena associated with a text's transmission,\u0000such as censorship. We also validate our results against a manual bibliographic\u0000analysis of a counterfeit early edition of Thomas Hobbes' Leviathan (1651).","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}