
Latest publications: arXiv - CS - Digital Libraries

Shacl4Bib: custom validation of library data
Pub Date : 2024-05-15 DOI: arxiv-2405.09177
Péter Király
The Shapes Constraint Language (SHACL) is a formal language for validating RDF graphs against a set of conditions. Following this idea and implementing a subset of the language, the Metadata Quality Assessment Framework provides Shacl4Bib: a mechanism to define SHACL-like rules for data sources in non-RDF formats, such as XML, CSV and JSON. QA catalogue extends this concept further to MARC21, UNIMARC and PICA data. The criteria can be defined either with YAML or JSON configuration files or with Java code. Libraries can validate their data against criteria expressed in a unified language, which improves the clarity and reusability of custom validation processes.
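The abstract describes SHACL-like rules defined in YAML or JSON configuration files and applied to non-RDF sources such as CSV. As a hedged sketch of that general idea (the field names, constraint keys, and rule syntax below are invented for illustration and are not Shacl4Bib's actual schema or API), a minimal rule-checker over CSV records might look like:

```python
import csv
import io
import re

# Hypothetical SHACL-like rule set in the spirit of Shacl4Bib's YAML/JSON
# configurations; the field names and constraint keys are illustrative,
# not the framework's actual schema.
RULES = {
    "title": {"minCount": 1, "minLength": 3},
    "isbn": {"pattern": r"^(97[89])?\d{9}[\dX]$"},
}

def validate_record(record, rules):
    """Return a list of (field, violation) pairs for one record."""
    problems = []
    for field, constraints in rules.items():
        value = (record.get(field) or "").strip()
        if constraints.get("minCount", 0) > 0 and not value:
            problems.append((field, "missing required value"))
            continue  # skip further checks on an absent value
        if value and len(value) < constraints.get("minLength", 0):
            problems.append((field, "value shorter than minLength"))
        pattern = constraints.get("pattern")
        if value and pattern and not re.fullmatch(pattern, value):
            problems.append((field, "value does not match pattern"))
    return problems

data = "title,isbn\nMoby Dick,9780142437247\n,12345\n"
for row in csv.DictReader(io.StringIO(data)):
    print(validate_record(row, RULES))
```

Expressing the constraints as plain data, as here, is what lets the same rule file drive validation of XML, CSV, or JSON sources alike.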
{"title":"Shacl4Bib: custom validation of library data","authors":"Péter Király","doi":"arxiv-2405.09177","DOIUrl":"https://doi.org/arxiv-2405.09177","url":null,"abstract":"The Shapes Constraint Language (SHACL) is a formal language for validating\u0000RDF graphs against a set of conditions. Following this idea and implementing a\u0000subset of the language, the Metadata Quality Assessment Framework provides\u0000Shacl4Bib: a mechanism to define SHACL-like rules for data sources in non-RDF\u0000based formats, such as XML, CSV and JSON. QA catalogue extends this concept\u0000further to MARC21, UNIMARC and PICA data. The criteria can be defined either\u0000with YAML or JSON configuration files or with Java code. Libraries can validate\u0000their data against criteria expressed in a unified language, that improves the\u0000clarity and the reusability of custom validation processes.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141060436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Distinguishing articles in questionable and non-questionable journals using quantitative indicators associated with quality
Pub Date : 2024-05-10 DOI: arxiv-2405.06308
Dimity Stephen
This study investigates the viability of distinguishing articles in questionable journals (QJs) from those in non-QJs on the basis of quantitative indicators typically associated with quality. Subsequently, I examine what can be deduced about the quality of articles in QJs based on the differences observed. I contrast the length of abstracts and full texts, the prevalence of spelling errors, text readability, the number of references and citations, the size and internationality of the author team, the documentation of ethics and informed-consent statements, and the presence of erroneous decisions based on statistical errors in 1,714 articles from 31 QJs, 1,691 articles from 16 journals indexed in Web of Science (WoS), and 1,900 articles from 45 mid-tier journals, all in the field of psychology. The results suggest that QJ articles do diverge from the disciplinary standards set by peer-reviewed journals in psychology on quantitative indicators of quality that tend to reflect the effect of peer review and editorial processes. However, mid-tier and WoS journals are also affected by potential quality concerns, such as under-reporting of ethics and informed-consent processes and the presence of errors in interpreting statistics. Further research is required to develop a comprehensive understanding of the quality of articles in QJs.
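Two of the indicators the study compares, text length and spelling-error prevalence, are straightforward to operationalise. A crude sketch (the word list and tokenisation here are toy stand-ins; the study's actual tooling is not described in the abstract):

```python
# Toy versions of two quantitative indicators from the study:
# abstract length in words, and a naive spelling-error rate measured
# against a small known-word list (illustrative only).
KNOWN = {"this", "study", "compares", "articles", "from", "journals", "the"}

def indicators(abstract: str):
    words = abstract.lower().replace(".", "").split()
    errors = sum(1 for w in words if w not in KNOWN)
    return {"length": len(words), "error_rate": errors / len(words)}

print(indicators("This studdy compares articles from the journals."))
```

A real pipeline would swap the toy lexicon for a full dictionary or spell-checking library, but the shape of the computation is the same.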
{"title":"Distinguishing articles in questionable and non-questionable journals using quantitative indicators associated with quality","authors":"Dimity Stephen","doi":"arxiv-2405.06308","DOIUrl":"https://doi.org/arxiv-2405.06308","url":null,"abstract":"This study investigates the viability of distinguishing articles in\u0000questionable journals (QJs) from those in non-QJs on the basis of quantitative\u0000indicators typically associated with quality. Subsequently, I examine what can\u0000be deduced about the quality of articles in QJs based on the differences\u0000observed. I contrast the length of abstracts and full-texts, prevalence of\u0000spelling errors, text readability, number of references and citations, the size\u0000and internationality of the author team, the documentation of ethics and\u0000informed consent statements, and the presence erroneous decisions based on\u0000statistical errors in 1,714 articles from 31 QJs, 1,691 articles from 16\u0000journals indexed in Web of Science (WoS), and 1,900 articles from 45 mid-tier\u0000journals, all in the field of psychology. The results suggest that QJ articles\u0000do diverge from the disciplinary standards set by peer-reviewed journals in\u0000psychology on quantitative indicators of quality that tend to reflect the\u0000effect of peer review and editorial processes. However, mid-tier and WoS\u0000journals are also affected by potential quality concerns, such as\u0000under-reporting of ethics and informed consent processes and the presence of\u0000errors in interpreting statistics. 
Further research is required to develop a\u0000comprehensive understanding of the quality of articles in QJs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"131 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Can citations tell us about a paper's reproducibility? A case study of machine learning papers
Pub Date : 2024-05-07 DOI: arxiv-2405.03977
Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu
The iterative character of work in machine learning (ML) and artificial intelligence (AI), and its reliance on comparisons against benchmark datasets, emphasize the importance of reproducibility in that literature. Yet resource constraints and inadequate documentation can make running replications particularly challenging. Our work explores the potential of using downstream citation contexts as a signal of reproducibility. We introduce a sentiment analysis framework applied to citation contexts from papers involved in Machine Learning Reproducibility Challenges in order to interpret the positive or negative outcomes of reproduction attempts. Our contributions include training classifiers for reproducibility-related contexts and sentiment analysis, and exploring correlations between citation context sentiment and reproducibility scores. Study data, software, and an artifact appendix are publicly available at https://github.com/lamps-lab/ccair-ai-reproducibility .
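The core idea, reading citation contexts for cues about reproduction outcomes, can be illustrated with a deliberately naive lexicon-based scorer. The paper trains actual classifiers; the cue words and example sentences below are invented for illustration:

```python
# Naive lexicon-based sketch: score citation contexts for positive or
# negative reproduction-outcome cues. The real study trains classifiers;
# these word lists are illustrative only.
POSITIVE = {"reproduced", "replicated", "confirmed", "consistent"}
NEGATIVE = {"failed", "unable", "discrepancy", "could not"}

def outcome_score(context: str) -> int:
    """+1 per positive cue, -1 per negative cue found in the text."""
    text = context.lower()
    pos = sum(1 for cue in POSITIVE if cue in text)
    neg = sum(1 for cue in NEGATIVE if cue in text)
    return pos - neg

contexts = [
    "We successfully reproduced the reported accuracy of [12].",
    "We were unable to replicate the results of [12] and found a discrepancy.",
]
print([outcome_score(c) for c in contexts])
```

Aggregating such scores over all downstream citations of a paper is what would then be correlated against its reproducibility score.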
{"title":"Can citations tell us about a paper's reproducibility? A case study of machine learning papers","authors":"Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu","doi":"arxiv-2405.03977","DOIUrl":"https://doi.org/arxiv-2405.03977","url":null,"abstract":"The iterative character of work in machine learning (ML) and artificial\u0000intelligence (AI) and reliance on comparisons against benchmark datasets\u0000emphasize the importance of reproducibility in that literature. Yet, resource\u0000constraints and inadequate documentation can make running replications\u0000particularly challenging. Our work explores the potential of using downstream\u0000citation contexts as a signal of reproducibility. We introduce a sentiment\u0000analysis framework applied to citation contexts from papers involved in Machine\u0000Learning Reproducibility Challenges in order to interpret the positive or\u0000negative outcomes of reproduction attempts. Our contributions include training\u0000classifiers for reproducibility-related contexts and sentiment analysis, and\u0000exploring correlations between citation context sentiment and reproducibility\u0000scores. Study data, software, and an artifact appendix are publicly available\u0000at https://github.com/lamps-lab/ccair-ai-reproducibility .","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
NACSOS-nexus: NLP Assisted Classification, Synthesis and Online Screening with New and EXtended Usage Scenarios
Pub Date : 2024-05-07 DOI: arxiv-2405.04621
Tim Repke, Max Callaghan
NACSOS is a web-based platform for curating data used in systematic maps. It contains several (experimental) features that aid the evidence synthesis process, from finding and ingesting primary data (mainly scientific publications) through basic search and exploration, but chiefly the management of manual and automated annotations. The platform supports prioritised screening algorithms and is the first to fully implement statistical stopping criteria. Annotations by multiple coders can be resolved, and customisable quality metrics are computed on the fly. In its current state, annotations are performed at the document level. The ecosystem around NACSOS offers packages for accessing the underlying database and practical utility functions that have proven useful in a multitude of projects. Further, it provides the backbone of living maps, review ecosystems, and our public literature hub for sharing high-quality curated corpora.
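A statistical stopping criterion for screening can be sketched with a simple hypergeometric argument: if a random sample of the unscreened documents contains no relevant items, how plausible is it that many relevant items remain? This is a simplified illustration of the general idea; NACSOS's exact criterion may differ:

```python
from math import comb

def p_zero_relevant(n_remaining: int, n_sampled: int, r: int) -> float:
    """Hypergeometric P(X = 0): probability of finding 0 relevant docs
    in a random sample of n_sampled out of n_remaining, assuming r
    relevant docs actually remain. A small value supports stopping.
    (Simplified sketch of a statistical stopping criterion.)
    """
    if r > n_remaining - n_sampled:
        return 0.0  # the sample could not have missed all r
    return comb(n_remaining - r, n_sampled) / comb(n_remaining, n_sampled)

# 1000 unscreened docs, 200 sampled at random, none relevant:
# how likely is that outcome if 10 relevant docs remained?
print(p_zero_relevant(1000, 200, 10))
```

If this probability falls below a chosen risk level, the screener can stop with quantified confidence that few relevant documents were missed.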
{"title":"NACSOS-nexus: NLP Assisted Classification, Synthesis and Online Screening with New and EXtended Usage Scenarios","authors":"Tim Repke, Max Callaghan","doi":"arxiv-2405.04621","DOIUrl":"https://doi.org/arxiv-2405.04621","url":null,"abstract":"NACSOS is a web-based platform for curating data used in systematic maps. It\u0000contains several (experimental) features that aid the evidence synthesis\u0000process from finding and ingesting primary data (mainly scientific\u0000publications), basic search and exploration thereof, but mainly the handling of\u0000managing the manual and automated annotations. The platform supports\u0000prioritised screening algorithms and is the first to fully implement\u0000statistical stopping criteria. Annotations by multiple coders can be resolved\u0000and customisable quality metrics are computed on-the-fly. In its current state,\u0000the annotations are performed on document level. The ecosystem around NACSOS\u0000offers packages for accessing the underlying database and practical utility\u0000functions that have proven useful in a multitude of projects. Further, it\u0000provides the backbone of living maps, review ecosystems, and our public\u0000literature hub for sharing high-quality curated corpora.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140932311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Research information in the light of artificial intelligence: quality and data ecologies
Pub Date : 2024-05-06 DOI: arxiv-2405.12997
Otmane Azeroual, Tibor Koltay
This paper presents multi- and interdisciplinary approaches for finding appropriate AI technologies for research information. Professional research information management (RIM) is becoming increasingly important as an expressly data-driven tool for researchers. It is not only the basis of scientific knowledge processes but is also related to other data. A concept and a process model of the elementary phases, from the start of the project to the ongoing operation of the AI methods in the RIM, are presented, portraying the implementation of an AI project meant to enable universities and research institutions to support their researchers in dealing with incorrect and incomplete research information while it is being stored in their RIMs. Our aim is to show how research information harmonizes with the challenges of data literacy and data quality issues related to AI, and to underline that any project can be successful if the research institutions and the various university departments involved work together and appropriate support is offered to improve research information and data management.
{"title":"Research information in the light of artificial intelligence: quality and data ecologies","authors":"Otmane Azeroual, Tibor Koltay","doi":"arxiv-2405.12997","DOIUrl":"https://doi.org/arxiv-2405.12997","url":null,"abstract":"This paper presents multi- and interdisciplinary approaches for finding the\u0000appropriate AI technologies for research information. Professional research\u0000information management (RIM) is becoming increasingly important as an expressly\u0000data-driven tool for researchers. It is not only the basis of scientific\u0000knowledge processes, but also related to other data. A concept and a process\u0000model of the elementary phases from the start of the project to the ongoing\u0000operation of the AI methods in the RIM is presented, portraying the\u0000implementation of an AI project, meant to enable universities and research\u0000institutions to support their researchers in dealing with incorrect and\u0000incomplete research information, while it is being stored in their RIMs. 
Our\u0000aim is to show how research information harmonizes with the challenges of data\u0000literacy and data quality issues, related to AI, also wanting to underline that\u0000any project can be successful if the research institutions and various\u0000departments of universities, involved work together and appropriate support is\u0000offered to improve research information and data management.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
On the performativity of SDG classifications in large bibliometric databases
Pub Date : 2024-05-05 DOI: arxiv-2405.03007
Matteo Ottaviani, Stephan Stahlschmidt
Large bibliometric databases, such as Web of Science, Scopus, and OpenAlex, facilitate bibliometric analyses, but are performative: they affect the visibility of scientific outputs and the impact measurement of participating entities. Recently, these databases have taken up the UN's Sustainable Development Goals (SDGs) in their respective classifications, which have been criticised for their diverging nature. This work proposes using large language models (LLMs) to learn about the "data bias" injected into bibliometric data by the diverse SDG classifications, exploring five SDGs. We build an LLM that is fine-tuned in parallel on the diverse SDG classifications inscribed into the databases. Our results show high sensitivity to model architecture, the classified publications, the fine-tuning process, and natural language generation. The wide arbitrariness at different levels raises concerns about using LLMs in research practice.
{"title":"On the performativity of SDG classifications in large bibliometric databases","authors":"Matteo Ottaviani, Stephan Stahlschmidt","doi":"arxiv-2405.03007","DOIUrl":"https://doi.org/arxiv-2405.03007","url":null,"abstract":"Large bibliometric databases, such as Web of Science, Scopus, and OpenAlex,\u0000facilitate bibliometric analyses, but are performative, affecting the\u0000visibility of scientific outputs and the impact measurement of participating\u0000entities. Recently, these databases have taken up the UN's Sustainable\u0000Development Goals (SDGs) in their respective classifications, which have been\u0000criticised for their diverging nature. This work proposes using the feature of\u0000large language models (LLMs) to learn about the \"data bias\" injected by diverse\u0000SDG classifications into bibliometric data by exploring five SDGs. We build a\u0000LLM that is fine-tuned in parallel by the diverse SDG classifications inscribed\u0000into the databases' SDG classifications. Our results show high sensitivity in\u0000model architecture, classified publications, fine-tuning process, and natural\u0000language generation. The wide arbitrariness at different levels raises concerns\u0000about using LLM in research practice.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Assembling ensembling: An adventure in approaches across disciplines
Pub Date : 2024-05-04 DOI: arxiv-2405.02599
Amanda Bleichrodt, Lydia Bourouiba, Gerardo Chowell, Eric T. Lofgren, J. Michael Reed, Sadie J. Ryan, Nina H. Fefferman
When we think of model ensembling or ensemble modeling, many possibilities come to mind across different disciplines. For example, one might think of a set of descriptions of a phenomenon in the world, perhaps a time series or a snapshot of multivariate space; perhaps that set is composed of data-independent descriptions, or perhaps it is quite intentionally fit *to* data, or is even a suite of data sets with a common theme or intention. The very meaning of 'ensemble' - a collection together - conjures different ideas across, and even within, disciplines approaching phenomena. In this paper, we present a typology of the scope of these potential perspectives. It is not our goal to present a review of terms and concepts, nor to convince all disciplines to adopt a common suite of terms, which we view as futile. Rather, our goal is to disambiguate terms, concepts, and processes associated with 'ensembles' and 'ensembling' in order to facilitate communication, awareness, and possible adoption of tools across disciplines.
{"title":"Assembling ensembling: An adventure in approaches across disciplines","authors":"Amanda Bleichrodt, Lydia Bourouiba, Gerardo Chowell, Eric T. Lofgren, J. Michael Reed, Sadie J. Ryan, Nina H. Fefferman","doi":"arxiv-2405.02599","DOIUrl":"https://doi.org/arxiv-2405.02599","url":null,"abstract":"When we think of model ensembling or ensemble modeling, there are many\u0000possibilities that come to mind in different disciplines. For example, one\u0000might think of a set of descriptions of a phenomenon in the world, perhaps a\u0000time series or a snapshot of multivariate space, and perhaps that set is\u0000comprised of data-independent descriptions, or perhaps it is quite\u0000intentionally fit *to* data, or even a suite of data sets with a common theme\u0000or intention. The very meaning of 'ensemble' - a collection together - conjures\u0000different ideas across and even within disciplines approaching phenomena. In\u0000this paper, we present a typology of the scope of these potential perspectives.\u0000It is not our goal to present a review of terms and concepts, nor is it to\u0000convince all disciplines to adopt a common suite of terms, which we view as\u0000futile. 
Rather, our goal is to disambiguate terms, concepts, and processes\u0000associated with 'ensembles' and 'ensembling' in order to facilitate\u0000communication, awareness, and possible adoption of tools across disciplines.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Workflow for GLAM Metadata Crosswalk
Pub Date : 2024-05-03 DOI: arxiv-2405.02113
Arianna Moretti, Ivan Heibi, Silvio Peroni
The acquisition of physical artifacts not only involves transferring existing information into the digital ecosystem but also generates information as a process in itself, underscoring the importance of meticulous management of FAIR data and metadata. In addition, the diversity of objects within the cultural heritage domain is reflected in a multitude of descriptive models. The digitization process expands the opportunities for exchange and joint utilization, provided that the descriptive schemas are made interoperable in advance. To achieve this goal, we propose a replicable workflow for metadata schema crosswalks that facilitates the preservation and accessibility of cultural heritage in the digital ecosystem. This work presents a methodology for metadata generation and management in the case study of the digital twin of the temporary exhibition "The Other Renaissance - Ulisse Aldrovandi and the Wonders of the World". The workflow delineates a systematic, step-by-step transformation of tabular data into RDF format to enhance Linked Open Data. The methodology adopts RDF Mapping Language (RML) technology for converting data to RDF with human involvement. This last aspect entails an interaction between digital humanists and domain experts through surveys, leading to the abstraction and reformulation of domain-specific knowledge to be exploited in the process of formalizing and converting information.
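The tabular-to-RDF step can be sketched in plain Python as a mapping from CSV rows to N-Triples-style statements. The abstract's workflow uses RML for this; the base URI, identifiers, and choice of Dublin Core properties below are illustrative placeholders, not the project's actual vocabulary:

```python
import csv
import io

# Minimal sketch of a tabular-to-RDF mapping. The paper's workflow uses
# RML; the base URI and example data here are invented for illustration
# (Dublin Core terms are a real vocabulary, but its use here is assumed).
BASE = "https://example.org/resource/"

def row_to_triples(row):
    """Yield (subject, predicate, object) triples for one CSV row."""
    subject = f"<{BASE}{row['id']}>"
    yield (subject, "<http://purl.org/dc/terms/title>", f'"{row["title"]}"')
    yield (subject, "<http://purl.org/dc/terms/creator>", f'"{row["creator"]}"')

data = "id,title,creator\nexh-01,De animalibus insectis,Ulisse Aldrovandi\n"
for row in csv.DictReader(io.StringIO(data)):
    for s, p, o in row_to_triples(row):
        print(f"{s} {p} {o} .")
```

An RML mapping declares the same subject/predicate/object rules declaratively, so that domain experts can review and refine them without touching conversion code.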
{"title":"A Workflow for GLAM Metadata Crosswalk","authors":"Arianna Moretti, Ivan Heibi, Silvio Peroni","doi":"arxiv-2405.02113","DOIUrl":"https://doi.org/arxiv-2405.02113","url":null,"abstract":"The acquisition of physical artifacts not only involves transferring existing\u0000information into the digital ecosystem but also generates information as a\u0000process itself, underscoring the importance of meticulous management of FAIR\u0000data and metadata. In addition, the diversity of objects within the cultural\u0000heritage domain is reflected in a multitude of descriptive models. The\u0000digitization process expands the opportunities for exchange and joint\u0000utilization, granted that the descriptive schemas are made interoperable in\u0000advance. To achieve this goal, we propose a replicable workflow for metadata\u0000schema crosswalks that facilitates the preservation and accessibility of\u0000cultural heritage in the digital ecosystem. This work presents a methodology\u0000for metadata generation and management in the case study of the digital twin of\u0000the temporary exhibition \"The Other Renaissance - Ulisse Aldrovandi and the\u0000Wonders of the World\". The workflow delineates a systematic, step-by-step\u0000transformation of tabular data into RDF format, to enhance Linked Open Data.\u0000The methodology adopts the RDF Mapping Language (RML) technology for converting\u0000data to RDF with a human contribution involvement. 
This last aspect entails an\u0000interaction between digital humanists and domain experts through surveys\u0000leading to the abstraction and reformulation of domain-specific knowledge, to\u0000be exploited in the process of formalizing and converting information.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140883899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Callico: a Versatile Open-Source Document Image Annotation Platform
Pub Date : 2024-05-02 DOI: arxiv-2405.01071
Christopher Kermorvant, Eva Bardou, Manon Blanco, Bastien Abadie
This paper presents Callico, a web-based open-source platform designed to simplify the annotation process in document recognition projects. The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data and the need for specialised tools that increase the efficiency and effectiveness of generating such data. For document image annotation, Callico offers dual-display annotation for digitised documents, enabling simultaneous visualisation and annotation of scanned images and text. This capability is critical for OCR and HTR model training, document layout analysis, named entity recognition, form-based key-value annotation, and hierarchical structure annotation with element grouping. The platform supports collaborative annotation with versatile features, backed by a commitment to open-source development, high-quality code standards, and easy deployment via Docker. Illustrative use cases - including the transcription of the Belfort municipal registers, the indexing of French World War II prisoners for the ICRC, and the extraction of personal information from the Socface project's census lists - demonstrate Callico's applicability and utility.
{"title":"Callico: a Versatile Open-Source Document Image Annotation Platform","authors":"Christopher Kermorvant, Eva Bardou, Manon Blanco, Bastien Abadie","doi":"arxiv-2405.01071","DOIUrl":"https://doi.org/arxiv-2405.01071","url":null,"abstract":"This paper presents Callico, a web-based open source platform designed to\u0000simplify the annotation process in document recognition projects. The move\u0000towards data-centric AI in machine learning and deep learning underscores the\u0000importance of high-quality data, and the need for specialised tools that\u0000increase the efficiency and effectiveness of generating such data. For document\u0000image annotation, Callico offers dual-display annotation for digitised\u0000documents, enabling simultaneous visualisation and annotation of scanned images\u0000and text. This capability is critical for OCR and HTR model training, document\u0000layout analysis, named entity recognition, form-based key value annotation or\u0000hierarchical structure annotation with element grouping. 
The platform supports\u0000collaborative annotation with versatile features backed by a commitment to open\u0000source development, high-quality code standards and easy deployment via Docker.\u0000Illustrative use cases - including the transcription of the Belfort municipal\u0000registers, the indexing of French World War II prisoners for the ICRC, and the\u0000extraction of personal information from the Socface project's census lists -\u0000demonstrate Callico's applicability and utility.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
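Callico's dual-display annotation pairs a region of a scanned page image with the text an annotator enters for it. A minimal record shape for such a pairing might look like the sketch below. This is purely illustrative and is not Callico's actual schema: the class and field names (`Region`, `Annotation`, `page_image`, `tags`) are invented for this example.

```python
# Hypothetical annotation record (not Callico's real data model): one
# annotation links a bounding box on a digitised page to its transcription,
# which is the pairing a dual-display tool must keep in sync.
from dataclasses import dataclass, field

@dataclass
class Region:
    x: int
    y: int
    width: int
    height: int

@dataclass
class Annotation:
    page_image: str          # path or URL of the digitised page
    region: Region           # bounding box on the image
    transcription: str       # text entered by the annotator
    tags: list = field(default_factory=list)  # e.g. named-entity labels

# One annotation on a (fictional) register page
ann = Annotation("page_001.jpg", Region(120, 80, 640, 42), "Registre municipal")
```

A structure like this also covers the key-value and named-entity use cases mentioned in the abstract: the `tags` list can carry entity labels, and a form field name could be stored alongside the transcription.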
Clustering Running Titles to Understand the Printing of Early Modern Books
Pub Date : 2024-05-01 DOI: arxiv-2405.00752
Nikolai Vogler, Kartik Goyal, Samuel V. Lemley, D. J. Schuldt, Christopher N. Warren, Max G'Sell, Taylor Berg-Kirkpatrick
We propose a novel computational approach to automatically analyze the physical process behind printing of early modern letterpress books via clustering the running titles found at the top of their pages. Specifically, we design and compare custom neural and feature-based kernels for computing pairwise visual similarity of a scanned document's running titles and cluster the titles in order to track any deviations from the expected pattern of a book's printing. Unlike body text, which must be reset for every page, the running titles are one of the static type elements in a skeleton forme, i.e. the frame used to print each side of a sheet of paper, and were often re-used during a book's printing. To evaluate the effectiveness of our approach, we manually annotate the running title clusters on about 1600 pages across 8 early modern books of varying size and formats. Our method can detect potential deviation from the expected patterns of such skeleton formes, which helps bibliographers understand the phenomena associated with a text's transmission, such as censorship. We also validate our results against a manual bibliographic analysis of a counterfeit early edition of Thomas Hobbes' Leviathan (1651).
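The kernel-plus-clustering pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: it substitutes a plain cosine-similarity kernel on flattened, pre-aligned title crops for the paper's learned neural and feature-based kernels, and uses off-the-shelf average-linkage agglomerative clustering. The function names and the `threshold` parameter are assumptions made for this sketch.

```python
# Sketch: cluster running-title image crops by pairwise visual similarity.
# Assumes each title is a pre-extracted, same-size grayscale array in [0, 1];
# a simple cosine kernel stands in for the paper's learned kernels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pairwise_similarity(crops):
    # crops: (n, h, w) array of aligned title images
    n = len(crops)
    flat = crops.reshape(n, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)          # zero-mean per crop
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    flat = flat / np.clip(norms, 1e-8, None)                # unit length
    return flat @ flat.T                                    # cosine similarity

def cluster_titles(crops, threshold=0.6):
    sim = pairwise_similarity(crops)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # condensed upper-triangle distances, as SciPy's linkage expects
    condensed = dist[np.triu_indices(len(crops), k=1)]
    z = linkage(condensed, method="average")
    # cut the dendrogram so crops more similar than `threshold` share a label
    return fcluster(z, t=1.0 - threshold, criterion="distance")
```

Crops assigned the same label would be candidates for having been printed from the same standing type in a skeleton forme; label changes partway through a gathering are the kind of deviation the paper uses to flag irregular printing.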
Citations: 0
Journal: arXiv - CS - Digital Libraries