Nils Englert , Constantin Schwab , Maximilian Legnar , Cleo-Aron Weis
{"title":"提出整个幻灯片图像文件巴别鱼的框架:基于 OCR 的文件标签工具","authors":"Nils Englert , Constantin Schwab , Maximilian Legnar , Cleo-Aron Weis","doi":"10.1016/j.jpi.2024.100402","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining from the macro-images of the scanned slide.</div><div>We named the tool Babel fish as it helps translate relevant information printed on the slide. It is written to contain certain basic assumptions regarding, for example, the location of certain information. This can be adapted to the respective location. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems.</div></div><div><h3>Material and methods</h3><div>The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and cases with insufficient results in the first OCR round, a second OCR with pytesseract is applied.</div><div>Two datasets are used: one for tool development has 342 slides; and another for one for testing has 110 slides.</div></div><div><h3>Results</h3><div>For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most information parts is 1.000, whereas the accuracy for the block number detection is 0.982.</div></div><div><h3>Conclusion</h3><div>The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.</div></div>","PeriodicalId":37769,"journal":{"name":"Journal of Pathology Informatics","volume":"15 ","pages":"Article 100402"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool\",\"authors\":\"Nils Englert , Constantin Schwab , Maximilian Legnar , Cleo-Aron Weis\",\"doi\":\"10.1016/j.jpi.2024.100402\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Introduction</h3><div>Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining from the macro-images of the scanned slide.</div><div>We named the tool Babel fish as it helps translate relevant information printed on the slide. It is written to contain certain basic assumptions regarding, for example, the location of certain information. This can be adapted to the respective location. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems.</div></div><div><h3>Material and methods</h3><div>The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and cases with insufficient results in the first OCR round, a second OCR with pytesseract is applied.</div><div>Two datasets are used: one for tool development has 342 slides; and another for one for testing has 110 slides.</div></div><div><h3>Results</h3><div>For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most information parts is 1.000, whereas the accuracy for the block number detection is 0.982.</div></div><div><h3>Conclusion</h3><div>The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.</div></div>\",\"PeriodicalId\":37769,\"journal\":{\"name\":\"Journal of Pathology Informatics\",\"volume\":\"15 \",\"pages\":\"Article 100402\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-10-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Pathology Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2153353924000415\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Pathology Informatics","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2153353924000415","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0
摘要
导言:从数字化幻灯片或整个幻灯片图像文件中提取元数据是一项频繁、费力且繁琐的工作。我们将该工具命名为巴别鱼,因为它有助于翻译印在玻片上的相关信息。我们将该工具命名为 "巴别鱼",因为它可以帮助翻译印在幻灯片上的相关信息。它的编写包含了一些基本假设,例如,某些信息的位置。这可以根据相应的位置进行调整。提取的元数据可用于将数字幻灯片分类到数据库中,或将其与实验室信息系统中的相关病例 ID 相链接。对于大多数信息,使用 easyOCR 工具。结果对于测试集,每张幻灯片检索所有相关信息的总体准确率为 0.982。值得注意的是,大多数信息部分的准确率为 1.000,而块号检测的准确率为 0.982。此外,它还可以提取病例号、年份、区块 ID 和染色等相关元数据,是 DICOM 转换管道的重要组成部分。
Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool
Introduction
Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining from the macro-images of the scanned slide.
We named the tool Babel fish as it helps translate relevant information printed on the slide. It is written to contain certain basic assumptions regarding, for example, the location of certain information. This can be adapted to the respective location. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems.
Material and methods
The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and cases with insufficient results in the first OCR round, a second OCR with pytesseract is applied.
Two datasets are used: one for tool development has 342 slides; and another for one for testing has 110 slides.
Results
For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most information parts is 1.000, whereas the accuracy for the block number detection is 0.982.
Conclusion
The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.
期刊介绍:
The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, symposia, meeting abstracts, book reviews, and correspondence to the editors. All submissions are subject to rigorous peer review by the well-regarded editorial board and by expert referees in appropriate specialties.